dbt for Analytics Engineering: Transform Your Data Warehouse

    Learn dbt from scratch — models, materializations, testing, documentation, macros, incremental models, and project structure best practices.

    By Adriano Sanges · 16 min read
    dbt
    analytics engineering
    data transformation
    data warehouse
    SQL
    data modeling

    TL;DR: dbt (data build tool) brings software engineering practices — version control, testing, documentation, and modularity — to SQL transformations in the data warehouse. It is the standard tool for analytics engineering, enabling teams to build reliable, testable transformation layers organized in staging, intermediate, and marts models with built-in data quality checks.

    dbt (data build tool) has fundamentally changed how data teams build and maintain transformation logic in the warehouse. Before dbt, transformation code lived in stored procedures, custom Python scripts, or proprietary ETL tools with limited version control and testing. dbt brought software engineering best practices — version control, testing, documentation, modularity — to SQL transformations.

    If you are a data engineer building or maintaining a data warehouse, dbt is a tool you need to know. This guide covers everything from the basics to advanced patterns.

    What Is Analytics Engineering?

    Analytics engineering sits at the intersection of data engineering and data analytics. While data engineers build the infrastructure and pipelines that move data, and analysts answer business questions, analytics engineers own the transformation layer — the logic that turns raw ingested data into clean, modeled, trustworthy datasets.

    The analytics engineer's primary responsibilities:

    • Data modeling: Designing dimensional models, OBT (One Big Table), or other schemas
    • Transformation logic: Writing SQL that cleans, joins, aggregates, and reshapes data
    • Data quality: Testing data for correctness, completeness, and freshness
    • Documentation: Making datasets discoverable and understandable
    • Maintenance: Managing schema changes, handling upstream breakages

    dbt is the tool that makes all of this manageable.

    dbt Core vs dbt Cloud

    dbt Core is the open-source command-line tool. You install it locally, write your models, and run them against your warehouse. It is free, extensible, and integrates with any CI/CD system.

    dbt Cloud is the managed platform built on top of Core. It provides a web IDE, job scheduling, environment management, documentation hosting, and a semantic layer. Pricing is tiered, with a free developer plan available.

    For learning and small teams, dbt Core is sufficient. For larger organizations that want managed scheduling, a shared IDE, and enterprise features, dbt Cloud adds significant value. The SQL you write is identical in both.

    Core Concepts

    Models

    A dbt model is simply a SQL SELECT statement saved as a .sql file. dbt handles the DDL (CREATE TABLE, CREATE VIEW) for you. You write the transformation logic; dbt manages the materialization.

    -- models/staging/stg_orders.sql
    -- Staging model: clean and rename raw order data
    
    SELECT
        id          AS order_id,
        user_id     AS customer_id,
        status      AS order_status,
        amount      AS order_amount_cents,
        created_at  AS ordered_at,
        updated_at
    FROM {{ source('raw', 'orders') }}
    WHERE id IS NOT NULL
    

    The {{ source() }} function references a declared source table (defined in a YAML file) and enables lineage tracking.
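    A minimal source declaration for the raw schema referenced above might look like this (schema and table names follow the example; database and loader details are omitted):

```yaml
# models/staging/_stg_sources.yml
version: 2

sources:
  - name: raw
    schema: raw        # schema where ingestion tools land data
    tables:
      - name: orders
      - name: customers
```

    Once declared, every model that uses {{ source('raw', 'orders') }} appears downstream of this source in the lineage graph.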

    Materializations

    Materializations control how dbt builds a model in the warehouse:

    | Materialization | What It Creates               | When to Use                                                 |
    |-----------------|-------------------------------|-------------------------------------------------------------|
    | view            | SQL view                      | Lightweight models, staging layers, rarely queried directly |
    | table           | Physical table (full refresh) | Small-to-medium models, frequently queried                  |
    | incremental     | Table with append/merge logic | Large fact tables, event data, high-volume streams          |
    | ephemeral       | CTE (not materialized)        | Reusable logic that does not need its own table             |

    You set the materialization in the model config:

    -- models/marts/fct_orders.sql
    {{ config(materialized='table') }}
    
    SELECT
        o.order_id,
        o.customer_id,
        c.customer_name,
        o.order_amount_cents / 100.0 AS order_amount,
        o.ordered_at,
        DATE_TRUNC('month', o.ordered_at) AS order_month
    FROM {{ ref('stg_orders') }} o
    LEFT JOIN {{ ref('stg_customers') }} c
        ON o.customer_id = c.customer_id
    

    The {{ ref() }} function is key — it creates a dependency between models. dbt uses these references to build a DAG (directed acyclic graph) and run models in the correct order.

    Project Structure Best Practices

    A well-organized dbt project follows a layered architecture:

    models/
    ├── staging/              # 1:1 with source tables, light cleaning
    │   ├── _stg_sources.yml  # Source definitions
    │   ├── _stg_models.yml   # Model configs and tests
    │   ├── stg_orders.sql
    │   ├── stg_customers.sql
    │   └── stg_products.sql
    ├── intermediate/         # Business logic joins, complex transformations
    │   ├── int_orders_with_products.sql
    │   └── int_customer_lifetime.sql
    └── marts/                # Final business-facing models
        ├── core/
        │   ├── _core_models.yml
        │   ├── fct_orders.sql
        │   ├── dim_customers.sql
        │   └── dim_products.sql
        └── marketing/
            ├── _marketing_models.yml
            └── fct_campaign_performance.sql
    

    Layer Guidelines

    Staging models (stg_):

    • One model per source table
    • Rename columns to consistent conventions
    • Cast data types
    • Filter out obviously invalid records
    • Materialized as views (cheap, always fresh)

    Intermediate models (int_):

    • Join multiple staging models
    • Apply business logic
    • Not exposed to end users
    • Materialized as ephemeral or views

    Marts models (fct_ and dim_):

    • Business-facing, consumption-ready
    • Follow dimensional modeling conventions (star schema or snowflake schema)
    • Well-documented with descriptions
    • Materialized as tables or incremental
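    These layer defaults can be set once in dbt_project.yml instead of repeating a config block in every model (the project name below is illustrative):

```yaml
# dbt_project.yml (excerpt)
models:
  my_warehouse_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

    A per-model {{ config() }} block still wins over these folder-level defaults, so individual marts models can opt into incremental materialization where needed.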

    Testing

    dbt's testing framework is one of its most valuable features. Tests run after models build and catch data quality issues before they reach dashboards.

    Generic Tests

    Built-in tests that you declare in YAML:

    # models/staging/_stg_models.yml
    version: 2
    
    models:
      - name: stg_orders
        description: "Cleaned orders from the raw schema"
        columns:
          - name: order_id
            description: "Primary key"
            tests:
              - unique
              - not_null
    
          - name: order_status
            description: "Current status of the order"
            tests:
              - accepted_values:
                  values: ['pending', 'shipped', 'delivered', 'cancelled', 'returned']
    
          - name: customer_id
            description: "Foreign key to stg_customers"
            tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
    

    The four built-in generic tests are: unique, not_null, accepted_values, and relationships. Packages like dbt-utils and dbt-expectations add dozens more.
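    For example, dbt-utils adds a multi-column uniqueness test, declared the same way in YAML (the fct_order_items model here is hypothetical):

```yaml
models:
  - name: fct_order_items
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - product_id
```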

    Singular Tests

    Custom SQL queries that return failing rows. If the query returns any rows, the test fails.

    -- tests/assert_orders_amount_positive.sql
    -- All orders should have a positive amount
    
    SELECT order_id, order_amount_cents
    FROM {{ ref('stg_orders') }}
    WHERE order_amount_cents <= 0
    

    Custom Generic Tests

    You can write reusable test macros:

    -- macros/test_is_positive.sql
    {% test is_positive(model, column_name) %}
    
    SELECT {{ column_name }}
    FROM {{ model }}
    WHERE {{ column_name }} < 0
    
    {% endtest %}
    

    Then use it in YAML like any built-in test:

    columns:
      - name: order_amount
        tests:
          - is_positive
    

    Documentation

    dbt generates a documentation website from your YAML descriptions, model SQL, and DAG. Running dbt docs generate and dbt docs serve gives you a browsable, searchable documentation site with lineage graphs.

    Good documentation practices:

    • Write descriptions for every model and every column in your marts layer
    • Use doc blocks for longer descriptions that need markdown formatting
    • Document sources thoroughly — they are the entry point for understanding data flow

    # models/marts/core/_core_models.yml
    version: 2
    
    models:
      - name: fct_orders
        description: >
          Fact table containing one row per order. Includes order amounts in dollars,
          customer information, and temporal dimensions. Used by the finance and
          marketing teams for revenue reporting.
        columns:
          - name: order_id
            description: "Unique identifier for each order. Primary key."
          - name: order_amount
            description: "Order total in USD (converted from cents in staging)."
          - name: order_month
            description: "Month of the order, truncated. Used for monthly aggregation."
    

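    The doc blocks mentioned above live in Markdown files inside your models directory and are referenced from YAML with the doc() function (the block name here is illustrative):

```markdown
{% docs order_status %}
Current status of the order. One of: pending, shipped,
delivered, cancelled, returned. Set by the fulfillment system.
{% enddocs %}
```

    In YAML, point a description at the block: description: '{{ doc("order_status") }}'. This lets you write one authoritative definition and reuse it across every model that carries the column.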
    Macros and Packages

    Macros

    Macros are Jinja templates that generate SQL. They reduce duplication and standardize logic across your project.

    -- macros/cents_to_dollars.sql
    {% macro cents_to_dollars(column_name, precision=2) %}
        ROUND({{ column_name }} / 100.0, {{ precision }})
    {% endmacro %}
    

    Use it in any model:

    SELECT
        order_id,
        {{ cents_to_dollars('order_amount_cents') }} AS order_amount
    FROM {{ ref('stg_orders') }}
    

    Packages

    dbt packages are reusable collections of macros and models. Install them via packages.yml:

    # packages.yml
    packages:
      - package: dbt-labs/dbt_utils
        version: [">=1.0.0", "<2.0.0"]
      - package: calogica/dbt_expectations
        version: [">=0.10.0", "<1.0.0"]
      - package: dbt-labs/codegen
        version: [">=0.12.0", "<1.0.0"]
    

    Run dbt deps to install. Essential packages:

    • dbt_utils: surrogate_key, pivot, unpivot, date_spine, and many more utilities
    • dbt_expectations: Great Expectations-style data quality tests
    • codegen: Generate YAML and staging model boilerplate from sources

    Incremental Models

    Incremental models are critical for performance. Instead of rebuilding an entire table on every run, they process only new or changed records.

    -- models/marts/fct_page_views.sql
    {{ config(
        materialized='incremental',
        unique_key='page_view_id',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns'
    ) }}
    
    SELECT
        page_view_id,
        user_id,
        page_url,
        referrer_url,
        session_id,
        viewed_at
    FROM {{ ref('stg_page_views') }}
    
    {% if is_incremental() %}
        -- Only process records newer than the last run
        WHERE viewed_at > (SELECT MAX(viewed_at) FROM {{ this }})
    {% endif %}
    

    The {% if is_incremental() %} block only applies during incremental runs. On the first run (or when you run dbt run --full-refresh), the entire table is built.

    Incremental Strategies

    | Strategy         | How It Works                     | Best For                                 |
    |------------------|----------------------------------|------------------------------------------|
    | append           | INSERT new rows only             | Immutable event data                     |
    | merge            | MERGE/UPSERT on unique_key       | Mutable data that updates                |
    | delete+insert    | DELETE matching keys, INSERT new | Warehouses without MERGE support         |
    | insert_overwrite | Replace entire partitions        | Partition-based tables (BigQuery, Spark) |

    Late-Arriving Data

    The simple WHERE viewed_at > MAX(viewed_at) pattern misses late-arriving records. A safer approach uses a lookback window:

    {% if is_incremental() %}
        WHERE viewed_at > (
            SELECT DATEADD('day', -3, MAX(viewed_at))
            FROM {{ this }}
        )
    {% endif %}
    

    Combined with unique_key and a merge strategy, this catches records that arrived up to 3 days late without creating duplicates.
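    The merge semantics can be sketched outside the warehouse. Here is a minimal Python model of "upsert on unique_key" showing why the lookback window does not create duplicates (the rows are invented for illustration):

```python
def merge(target: dict, batch: list, unique_key: str = "page_view_id") -> dict:
    """MERGE-like upsert: insert new rows, overwrite existing ones by key."""
    for row in batch:
        target[row[unique_key]] = row  # same key -> replace, new key -> insert
    return target

table = {}
# First run builds the table from scratch
merge(table, [{"page_view_id": 1, "viewed_at": "2025-01-01"},
              {"page_view_id": 2, "viewed_at": "2025-01-02"}])
# Incremental run: the lookback re-selects row 2, plus a genuinely new row 3;
# the merge overwrites row 2 in place instead of appending a duplicate
merge(table, [{"page_view_id": 2, "viewed_at": "2025-01-02"},
              {"page_view_id": 3, "viewed_at": "2025-01-03"}])
print(len(table))  # 3
```

    With the append strategy the same re-selected row would be inserted twice, which is why the lookback pattern pairs with merge (or delete+insert) and a unique_key.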

    Snapshots (SCD Type 2)

    Snapshots track how data changes over time by implementing Slowly Changing Dimension Type 2. dbt automatically adds dbt_valid_from and dbt_valid_to columns.

    -- snapshots/snap_customers.sql
    {% snapshot snap_customers %}
    
    {{ config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    ) }}
    
    SELECT
        customer_id,
        customer_name,
        email,
        plan_type,
        updated_at
    FROM {{ source('raw', 'customers') }}
    
    {% endsnapshot %}
    

    When a customer's plan_type changes from "free" to "pro", the snapshot table will have two rows for that customer:

    | customer_id | plan_type | dbt_valid_from | dbt_valid_to |
    |-------------|-----------|----------------|--------------|
    | 123         | free      | 2025-01-01     | 2025-06-15   |
    | 123         | pro       | 2025-06-15     | NULL         |

    The row with dbt_valid_to = NULL is the current version. This is invaluable for historical reporting and auditing.
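    Selecting only the current version of each customer is then a simple filter downstream:

```sql
-- current state of every customer, one row each
SELECT customer_id, plan_type
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_to IS NULL
```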

    Common dbt Commands

    # Run all models
    dbt run
    
    # Run a specific model and its upstream dependencies
    dbt run --select +fct_orders
    
    # Run all models in the marts directory
    dbt run --select marts.*
    
    # Run tests for specific models
    dbt test --select stg_orders
    
    # Run everything: build models, test, snapshot
    dbt build
    
    # Full refresh an incremental model
    dbt run --select fct_page_views --full-refresh
    
    # Generate and serve documentation
    dbt docs generate && dbt docs serve
    
    # Compile SQL without executing (useful for debugging)
    dbt compile --select fct_orders
    

    Advanced Patterns

    Surrogate Keys

    Use dbt_utils.generate_surrogate_key to create deterministic surrogate keys:

    SELECT
        {{ dbt_utils.generate_surrogate_key(['order_id', 'product_id']) }} AS order_item_key,
        order_id,
        product_id,
        quantity,
        unit_price
    FROM {{ ref('stg_order_items') }}
    
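    Under the hood, generate_surrogate_key compiles to an MD5 hash of the columns cast to text and joined with a separator. A rough Python equivalent makes the determinism visible (this simplifies the real macro, which also substitutes a sentinel string for NULLs):

```python
import hashlib

def surrogate_key(*values, sep: str = "-") -> str:
    """Rough sketch of dbt_utils.generate_surrogate_key:
    cast values to text, join with a separator, MD5-hash the result."""
    joined = sep.join(str(v) for v in values)
    return hashlib.md5(joined.encode()).hexdigest()

# Deterministic: the same inputs always yield the same key,
# so re-running a model never changes existing keys
k1 = surrogate_key(1001, 42)
k2 = surrogate_key(1001, 42)
print(k1 == k2)  # True
```

    Determinism is the point: unlike an auto-incrementing integer, the key can be regenerated identically on every run and in every environment.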

    Date Spines

    Generate a continuous date dimension using dbt_utils.date_spine:

    -- models/intermediate/int_date_spine.sql
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2020-01-01' as date)",
        end_date="cast('2030-12-31' as date)"
    ) }}
    

    Pre/Post Hooks

    Execute SQL before or after a model runs:

    {{ config(
        materialized='table',
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE reporting",
            "ALTER TABLE {{ this }} SET TAG pii = 'false'"
        ]
    ) }}
    

    dbt in Practice

    A typical dbt development workflow:

    1. Branch: Create a feature branch in Git
    2. Develop: Write or modify models in your IDE
    3. Test locally: Run dbt build --select +your_model to build and test
    4. Review: Open a pull request; CI runs dbt build in a staging environment
    5. Merge: After approval, merge to main
    6. Deploy: Production dbt run executes on a schedule (hourly, daily)

    This workflow means your transformation logic has the same rigor as application code: version controlled, reviewed, tested, and deployed through CI/CD.
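    As one possible shape for the CI step, a GitHub Actions job could run the same commands you run locally (the workflow file, adapter choice, and target name below are all hypothetical):

```yaml
# .github/workflows/dbt-ci.yml (hypothetical)
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake   # adapter is an assumption
      - run: dbt deps
      - run: dbt build --target ci                # build models and run tests
```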

    Getting Started

    For hands-on practice with dbt, check out our Analytics Engineering with dbt project where you build a complete transformation layer from raw data to business-ready models. If you want to strengthen your data modeling foundations first, start with our data modeling fundamentals.

    dbt has a vibrant community, excellent documentation, and an annual conference (Coalesce) worth attending. Whether you are building your first staging model or designing a multi-team dbt mesh deployment, the fundamentals covered here will serve as your foundation.

    Frequently Asked Questions

    What is dbt?

    dbt (data build tool) is an open-source transformation framework that enables data teams to write, test, and document SQL-based transformations inside the data warehouse. A dbt model is a SQL SELECT statement saved as a file — dbt handles all DDL (CREATE TABLE, CREATE VIEW) automatically. It uses Jinja templating for dynamic SQL, builds a dependency DAG from model references, and brings software engineering practices like version control and CI/CD to analytics workflows.

    What is analytics engineering?

    Analytics engineering is the discipline that sits between data engineering and data analytics, focused on the transformation layer that turns raw ingested data into clean, modeled, trustworthy datasets. Analytics engineers design data models, write transformation logic in SQL, implement data quality tests, maintain documentation, and manage schema changes — all typically using dbt as their primary tool.

    What is the difference between dbt Core and dbt Cloud?

    dbt Core is the free, open-source command-line tool that you install locally and run against your warehouse. dbt Cloud is the managed platform built on top of Core that adds a web IDE, job scheduling, environment management, hosted documentation, and a semantic layer. The SQL you write is identical in both — the difference is in the operational tooling around it. dbt Core is sufficient for learning and small teams; dbt Cloud adds value for larger organizations.

    What are dbt materializations?

    Materializations control how dbt builds a model in the warehouse. The four types are: view (creates a SQL view, lightweight and always fresh), table (creates a physical table with full refresh), incremental (appends or merges only new/changed records for large tables), and ephemeral (compiles to a CTE, not materialized in the warehouse). Choose based on model size, query frequency, and data freshness needs.

    What are dbt tests?

    dbt tests are assertions that validate data quality after models are built. Generic tests are declared in YAML and include built-in checks like unique, not_null, accepted_values, and relationships (foreign key integrity). Singular tests are custom SQL queries that return failing rows — if any rows are returned, the test fails. Packages like dbt-utils and dbt-expectations add dozens of additional reusable test types.

    About the Author

    Adriano Sanges is a data engineering professional and the creator of dataskew.io. With years of experience building data platforms at scale, he shares practical insights and hands-on guides to help aspiring data engineers advance their careers.

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.