dbt for Analytics Engineering: Transform Your Data Warehouse

    Learn dbt from scratch — models, materializations, testing, documentation, macros, incremental models, and project structure best practices.

    By Adriano Sanges · 16 min read
    dbt
    analytics engineering
    data transformation
    data warehouse
    SQL
    data modeling

    TL;DR: dbt (data build tool) brings software engineering practices — version control, testing, documentation, and modularity — to SQL transformations in the data warehouse. It is the standard tool for analytics engineering, enabling teams to build reliable, testable transformation layers organized in staging, intermediate, and marts models with built-in data quality checks.

    dbt (data build tool) has fundamentally changed how data teams build and maintain transformation logic in the warehouse. Before dbt, transformation code lived in stored procedures, custom Python scripts, or proprietary ETL tools with limited version control and testing. dbt brought software engineering best practices — version control, testing, documentation, modularity — to SQL transformations.

    If you are a data engineer building or maintaining a data warehouse, dbt is a tool you need to know. This guide covers everything from the basics to advanced patterns.

    What Is Analytics Engineering?

    Analytics engineering sits at the intersection of data engineering and data analytics. While data engineers build the infrastructure and pipelines that move data, and analysts answer business questions, analytics engineers own the transformation layer — the logic that turns raw ingested data into clean, modeled, trustworthy datasets.

    The analytics engineer's primary responsibilities:

    • Data modeling: Designing dimensional models, OBT (One Big Table), or other schemas
    • Transformation logic: Writing SQL that cleans, joins, aggregates, and reshapes data
    • Data quality: Testing data for correctness, completeness, and freshness
    • Documentation: Making datasets discoverable and understandable
    • Maintenance: Managing schema changes, handling upstream breakages

    dbt is the tool that makes all of this manageable.

    dbt Core vs dbt Cloud

    dbt Core is the open-source command-line tool. You install it locally, write your models, and run them against your warehouse. It is free, extensible, and integrates with any CI/CD system.

    dbt Cloud is the managed platform built on top of Core. It provides a web IDE, job scheduling, environment management, documentation hosting, and a semantic layer. Pricing is tiered, with a free developer plan available.

    For learning and small teams, dbt Core is sufficient. For larger organizations that want managed scheduling, a shared IDE, and enterprise features, dbt Cloud adds significant value. The SQL you write is identical in both.

    Core Concepts

    Models

    A dbt model is simply a SQL SELECT statement saved as a .sql file. dbt handles the DDL (CREATE TABLE, CREATE VIEW) for you. You write the transformation logic; dbt manages the materialization.

    -- models/staging/stg_orders.sql
    -- Staging model: clean and rename raw order data
    
    SELECT
        id          AS order_id,
        user_id     AS customer_id,
        status      AS order_status,
        amount      AS order_amount_cents,
        created_at  AS ordered_at,
        updated_at
    FROM {{ source('raw', 'orders') }}
    WHERE id IS NOT NULL
    

    The {{ source() }} function references a declared source table (defined in a YAML file) and enables lineage tracking.
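    A minimal source declaration for the raw schema referenced above might look like this (schema and table names follow the example; database and loader details are omitted):

```yaml
# models/staging/_stg_sources.yml
version: 2

sources:
  - name: raw
    schema: raw        # schema where ingestion tools land data
    tables:
      - name: orders
      - name: customers
```

    Once declared, every model that uses {{ source('raw', 'orders') }} appears downstream of this source in the lineage graph.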

    Materializations

    Materializations control how dbt builds a model in the warehouse:

    | Materialization | What It Creates               | When to Use                                                 |
    |-----------------|-------------------------------|-------------------------------------------------------------|
    | view            | SQL view                      | Lightweight models, staging layers, rarely queried directly |
    | table           | Physical table (full refresh) | Small-to-medium models, frequently queried                  |
    | incremental     | Table with append/merge logic | Large fact tables, event data, high-volume streams          |
    | ephemeral       | CTE (not materialized)        | Reusable logic that does not need its own table             |

    You set the materialization in the model config:

    -- models/marts/fct_orders.sql
    {{ config(materialized='table') }}
    
    SELECT
        o.order_id,
        o.customer_id,
        c.customer_name,
        o.order_amount_cents / 100.0 AS order_amount,
        o.ordered_at,
        DATE_TRUNC('month', o.ordered_at) AS order_month
    FROM {{ ref('stg_orders') }} o
    LEFT JOIN {{ ref('stg_customers') }} c
        ON o.customer_id = c.customer_id
    

    The {{ ref() }} function is key — it creates a dependency between models. dbt uses these references to build a DAG (directed acyclic graph) and run models in the correct order.

    Project Structure Best Practices

    A well-organized dbt project follows a layered architecture:

    models/
    ├── staging/              # 1:1 with source tables, light cleaning
    │   ├── _stg_sources.yml  # Source definitions
    │   ├── _stg_models.yml   # Model configs and tests
    │   ├── stg_orders.sql
    │   ├── stg_customers.sql
    │   └── stg_products.sql
    ├── intermediate/         # Business logic joins, complex transformations
    │   ├── int_orders_with_products.sql
    │   └── int_customer_lifetime.sql
    └── marts/                # Final business-facing models
        ├── core/
        │   ├── _core_models.yml
        │   ├── fct_orders.sql
        │   ├── dim_customers.sql
        │   └── dim_products.sql
        └── marketing/
            ├── _marketing_models.yml
            └── fct_campaign_performance.sql
    

    Layer Guidelines

    Staging models (stg_):

    • One model per source table
    • Rename columns to consistent conventions
    • Cast data types
    • Filter out obviously invalid records
    • Materialized as views (cheap, always fresh)

    Intermediate models (int_):

    • Join multiple staging models
    • Apply business logic
    • Not exposed to end users
    • Materialized as ephemeral or views

    Marts models (fct_ and dim_):

    • Business-facing, consumption-ready
    • Follow dimensional modeling conventions (star schema or snowflake schema)
    • Well-documented with descriptions
    • Materialized as tables or incremental
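    These layer defaults can be set once in dbt_project.yml instead of repeating a config block in every model (the project name below is illustrative):

```yaml
# dbt_project.yml (excerpt)
models:
  my_warehouse_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```

    A per-model {{ config() }} block still wins over these folder-level defaults, so individual marts models can opt into incremental materialization where needed.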

    Testing

    dbt's testing framework is one of its most valuable features. Tests run after models build and catch data quality issues before they reach dashboards.

    Generic Tests

    Built-in tests that you declare in YAML:

    # models/staging/_stg_models.yml
    version: 2
    
    models:
      - name: stg_orders
        description: "Cleaned orders from the raw schema"
        columns:
          - name: order_id
            description: "Primary key"
            tests:
              - unique
              - not_null
    
          - name: order_status
            description: "Current status of the order"
            tests:
              - accepted_values:
                  values: ['pending', 'shipped', 'delivered', 'cancelled', 'returned']
    
          - name: customer_id
            description: "Foreign key to stg_customers"
            tests:
              - not_null
              - relationships:
                  to: ref('stg_customers')
                  field: customer_id
    

    The four built-in generic tests are: unique, not_null, accepted_values, and relationships. Packages like dbt-utils and dbt-expectations add dozens more.
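    For example, dbt-utils adds a multi-column uniqueness test, declared the same way in YAML (the fct_order_items model here is hypothetical):

```yaml
models:
  - name: fct_order_items
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - product_id
```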

    Singular Tests

    Custom SQL queries that return failing rows. If the query returns any rows, the test fails.

    -- tests/assert_orders_amount_positive.sql
    -- All orders should have a positive amount
    
    SELECT order_id, order_amount_cents
    FROM {{ ref('stg_orders') }}
    WHERE order_amount_cents <= 0
    

    Custom Generic Tests

    You can write reusable test macros:

    -- macros/test_is_positive.sql
    {% test is_positive(model, column_name) %}
    
    SELECT {{ column_name }}
    FROM {{ model }}
    WHERE {{ column_name }} < 0
    
    {% endtest %}
    

    Then use it in YAML like any built-in test:

    columns:
      - name: order_amount
        tests:
          - is_positive
    

    Documentation

    dbt generates a documentation website from your YAML descriptions, model SQL, and DAG. Running dbt docs generate and dbt docs serve gives you a browsable, searchable documentation site with lineage graphs.

    Good documentation practices:

    • Write descriptions for every model and every column in your marts layer
    • Use doc blocks for longer descriptions that need markdown formatting
    • Document sources thoroughly — they are the entry point for understanding data flow

    # models/marts/core/_core_models.yml
    version: 2
    
    models:
      - name: fct_orders
        description: >
          Fact table containing one row per order. Includes order amounts in dollars,
          customer information, and temporal dimensions. Used by the finance and
          marketing teams for revenue reporting.
        columns:
          - name: order_id
            description: "Unique identifier for each order. Primary key."
          - name: order_amount
            description: "Order total in USD (converted from cents in staging)."
          - name: order_month
            description: "Month of the order, truncated. Used for monthly aggregation."
    

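    The doc blocks mentioned above live in Markdown files inside your models directory and are referenced from YAML with the doc() function (the block name here is illustrative):

```markdown
{% docs order_status %}
Current status of the order. One of: pending, shipped,
delivered, cancelled, returned. Set by the fulfillment system.
{% enddocs %}
```

    In YAML, point a description at the block: description: '{{ doc("order_status") }}'. This lets you write one authoritative definition and reuse it across every model that carries the column.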
    Macros and Packages

    Macros

    Macros are Jinja templates that generate SQL. They reduce duplication and standardize logic across your project.

    -- macros/cents_to_dollars.sql
    {% macro cents_to_dollars(column_name, precision=2) %}
        ROUND({{ column_name }} / 100.0, {{ precision }})
    {% endmacro %}
    

    Use it in any model:

    SELECT
        order_id,
        {{ cents_to_dollars('order_amount_cents') }} AS order_amount
    FROM {{ ref('stg_orders') }}
    

    Packages

    dbt packages are reusable collections of macros and models. Install them via packages.yml:

    # packages.yml
    packages:
      - package: dbt-labs/dbt_utils
        version: [">=1.0.0", "<2.0.0"]
      - package: calogica/dbt_expectations
        version: [">=0.10.0", "<1.0.0"]
      - package: dbt-labs/codegen
        version: [">=0.12.0", "<1.0.0"]
    

    Run dbt deps to install. Essential packages:

    • dbt_utils: surrogate_key, pivot, unpivot, date_spine, and many more utilities
    • dbt_expectations: Great Expectations-style data quality tests
    • codegen: Generate YAML and staging model boilerplate from sources

    Incremental Models

    Incremental models are critical for performance. Instead of rebuilding an entire table on every run, they process only new or changed records.

    -- models/marts/fct_page_views.sql
    {{ config(
        materialized='incremental',
        unique_key='page_view_id',
        incremental_strategy='merge',
        on_schema_change='sync_all_columns'
    ) }}
    
    SELECT
        page_view_id,
        user_id,
        page_url,
        referrer_url,
        session_id,
        viewed_at
    FROM {{ ref('stg_page_views') }}
    
    {% if is_incremental() %}
        -- Only process records newer than the last run
        WHERE viewed_at > (SELECT MAX(viewed_at) FROM {{ this }})
    {% endif %}
    

    The {% if is_incremental() %} block only applies during incremental runs. On the first run (or when you run dbt run --full-refresh), the entire table is built.

    Incremental Strategies

    | Strategy         | How It Works                     | Best For                                 |
    |------------------|----------------------------------|------------------------------------------|
    | append           | INSERT new rows only             | Immutable event data                     |
    | merge            | MERGE/UPSERT on unique_key       | Mutable data that updates                |
    | delete+insert    | DELETE matching keys, INSERT new | Warehouses without MERGE support         |
    | insert_overwrite | Replace entire partitions        | Partition-based tables (BigQuery, Spark) |

    Late-Arriving Data

    The simple WHERE viewed_at > MAX(viewed_at) pattern misses late-arriving records. A safer approach uses a lookback window:

    {% if is_incremental() %}
        WHERE viewed_at > (
            SELECT DATEADD('day', -3, MAX(viewed_at))
            FROM {{ this }}
        )
    {% endif %}
    

    Combined with unique_key and a merge strategy, this catches records that arrived up to 3 days late without creating duplicates.
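    The merge semantics can be sketched outside the warehouse. Here is a minimal Python model of "upsert on unique_key" showing why the lookback window does not create duplicates (the rows are invented for illustration):

```python
def merge(target: dict, batch: list, unique_key: str = "page_view_id") -> dict:
    """MERGE-like upsert: insert new rows, overwrite existing ones by key."""
    for row in batch:
        target[row[unique_key]] = row  # same key -> replace, new key -> insert
    return target

table = {}
# First run builds the table from scratch
merge(table, [{"page_view_id": 1, "viewed_at": "2025-01-01"},
              {"page_view_id": 2, "viewed_at": "2025-01-02"}])
# Incremental run: the lookback re-selects row 2, plus a genuinely new row 3;
# the merge overwrites row 2 in place instead of appending a duplicate
merge(table, [{"page_view_id": 2, "viewed_at": "2025-01-02"},
              {"page_view_id": 3, "viewed_at": "2025-01-03"}])
print(len(table))  # 3
```

    With the append strategy the same re-selected row would be inserted twice, which is why the lookback pattern pairs with merge (or delete+insert) and a unique_key.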

    Snapshots (SCD Type 2)

    Snapshots track how data changes over time by implementing Slowly Changing Dimension Type 2. dbt automatically adds dbt_valid_from and dbt_valid_to columns.

    -- snapshots/snap_customers.sql
    {% snapshot snap_customers %}
    
    {{ config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    ) }}
    
    SELECT
        customer_id,
        customer_name,
        email,
        plan_type,
        updated_at
    FROM {{ source('raw', 'customers') }}
    
    {% endsnapshot %}
    

    When a customer's plan_type changes from "free" to "pro", the snapshot table will have two rows for that customer:

    | customer_id | plan_type | dbt_valid_from | dbt_valid_to |
    |-------------|-----------|----------------|--------------|
    | 123         | free      | 2025-01-01     | 2025-06-15   |
    | 123         | pro       | 2025-06-15     | NULL         |

    The row with dbt_valid_to = NULL is the current version. This is invaluable for historical reporting and auditing.
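    Selecting only the current version of each customer is then a simple filter downstream:

```sql
-- current state of every customer, one row each
SELECT customer_id, plan_type
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_to IS NULL
```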

    Common dbt Commands

    # Run all models
    dbt run
    
    # Run a specific model and its upstream dependencies
    dbt run --select +fct_orders
    
    # Run all models in the marts directory
    dbt run --select marts.*
    
    # Run tests for specific models
    dbt test --select stg_orders
    
    # Run everything: build models, test, snapshot
    dbt build
    
    # Full refresh an incremental model
    dbt run --select fct_page_views --full-refresh
    
    # Generate and serve documentation
    dbt docs generate && dbt docs serve
    
    # Compile SQL without executing (useful for debugging)
    dbt compile --select fct_orders
    

    Advanced Patterns

    Surrogate Keys

    Use dbt_utils.generate_surrogate_key to create deterministic surrogate keys:

    SELECT
        {{ dbt_utils.generate_surrogate_key(['order_id', 'product_id']) }} AS order_item_key,
        order_id,
        product_id,
        quantity,
        unit_price
    FROM {{ ref('stg_order_items') }}
    
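    Under the hood, generate_surrogate_key compiles to an MD5 hash of the columns cast to text and joined with a separator. A rough Python equivalent makes the determinism visible (this simplifies the real macro, which also substitutes a sentinel string for NULLs):

```python
import hashlib

def surrogate_key(*values, sep: str = "-") -> str:
    """Rough sketch of dbt_utils.generate_surrogate_key:
    cast values to text, join with a separator, MD5-hash the result."""
    joined = sep.join(str(v) for v in values)
    return hashlib.md5(joined.encode()).hexdigest()

# Deterministic: the same inputs always yield the same key,
# so re-running a model never changes existing keys
k1 = surrogate_key(1001, 42)
k2 = surrogate_key(1001, 42)
print(k1 == k2)  # True
```

    Determinism is the point: unlike an auto-incrementing integer, the key can be regenerated identically on every run and in every environment.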

    Date Spines

    Generate a continuous date dimension using dbt_utils.date_spine:

    -- models/intermediate/int_date_spine.sql
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2020-01-01' as date)",
        end_date="cast('2030-12-31' as date)"
    ) }}
    

    Pre/Post Hooks

    Execute SQL before or after a model runs:

    {{ config(
        materialized='table',
        post_hook=[
            "GRANT SELECT ON {{ this }} TO ROLE reporting",
            "ALTER TABLE {{ this }} SET TAG pii = 'false'"
        ]
    ) }}
    

    dbt in Practice

    A typical dbt development workflow:

    1. Branch: Create a feature branch in Git
    2. Develop: Write or modify models in your IDE
    3. Test locally: Run dbt build --select +your_model to build and test
    4. Review: Open a pull request; CI runs dbt build in a staging environment
    5. Merge: After approval, merge to main
    6. Deploy: Production dbt run executes on a schedule (hourly, daily)

    This workflow means your transformation logic has the same rigor as application code: version controlled, reviewed, tested, and deployed through CI/CD.
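    As one possible shape for the CI step, a GitHub Actions job could run the same commands you run locally (the workflow file, adapter choice, and target name below are all hypothetical):

```yaml
# .github/workflows/dbt-ci.yml (hypothetical)
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake   # adapter is an assumption
      - run: dbt deps
      - run: dbt build --target ci                # build models and run tests
```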

    Getting Started

    For hands-on practice with dbt, check out our Analytics Engineering with dbt project where you build a complete transformation layer from raw data to business-ready models. If you want to strengthen your data modeling foundations first, start with our data modeling fundamentals.

    dbt has a vibrant community, excellent documentation, and an annual conference (Coalesce) worth attending. Whether you are building your first staging model or designing a multi-team dbt mesh deployment, the fundamentals covered here will serve as your foundation.

    Frequently Asked Questions

    What is dbt?

    dbt (data build tool) is an open-source transformation framework that enables data teams to write, test, and document SQL-based transformations inside the data warehouse. A dbt model is a SQL SELECT statement saved as a file — dbt handles all DDL (CREATE TABLE, CREATE VIEW) automatically. It uses Jinja templating for dynamic SQL, builds a dependency DAG from model references, and brings software engineering practices like version control and CI/CD to analytics workflows.

    What is analytics engineering?

    Analytics engineering is the discipline that sits between data engineering and data analytics, focused on the transformation layer that turns raw ingested data into clean, modeled, trustworthy datasets. Analytics engineers design data models, write transformation logic in SQL, implement data quality tests, maintain documentation, and manage schema changes — all typically using dbt as their primary tool.

    What is the difference between dbt Core and dbt Cloud?

    dbt Core is the free, open-source command-line tool that you install locally and run against your warehouse. dbt Cloud is the managed platform built on top of Core that adds a web IDE, job scheduling, environment management, hosted documentation, and a semantic layer. The SQL you write is identical in both — the difference is in the operational tooling around it. dbt Core is sufficient for learning and small teams; dbt Cloud adds value for larger organizations.

    What are dbt materializations?

    Materializations control how dbt builds a model in the warehouse. The four types are: view (creates a SQL view, lightweight and always fresh), table (creates a physical table with full refresh), incremental (appends or merges only new/changed records for large tables), and ephemeral (compiles to a CTE, not materialized in the warehouse). Choose based on model size, query frequency, and data freshness needs.

    What are dbt tests?

    dbt tests are assertions that validate data quality after models are built. Generic tests are declared in YAML and include built-in checks like unique, not_null, accepted_values, and relationships (foreign key integrity). Singular tests are custom SQL queries that return failing rows — if any rows are returned, the test fails. Packages like dbt-utils and dbt-expectations add dozens of additional reusable test types.

    About the Author

    Adriano Sanges is a data engineering professional and the creator of dataskew.io. With years of experience building data platforms at scale, he shares practical insights and hands-on guides to help aspiring data engineers advance their careers.

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.