TL;DR: dbt (data build tool) brings software engineering practices — version control, testing, documentation, and modularity — to SQL transformations in the data warehouse. It is the standard tool for analytics engineering, enabling teams to build reliable, testable transformation layers organized in staging, intermediate, and marts models with built-in data quality checks.
dbt for Analytics Engineering: Transform Your Data Warehouse
dbt (data build tool) has fundamentally changed how data teams build and maintain transformation logic in the warehouse. Before dbt, transformation code lived in stored procedures, custom Python scripts, or proprietary ETL tools with limited version control and testing. dbt brought software engineering best practices — version control, testing, documentation, modularity — to SQL transformations.
If you are a data engineer building or maintaining a data warehouse, dbt is a tool you need to know. This guide covers everything from the basics to advanced patterns.
What Is Analytics Engineering?
Analytics engineering sits at the intersection of data engineering and data analytics. While data engineers build the infrastructure and pipelines that move data, and analysts answer business questions, analytics engineers own the transformation layer — the logic that turns raw ingested data into clean, modeled, trustworthy datasets.
The analytics engineer's primary responsibilities:
- Data modeling: Designing dimensional models, OBT (One Big Table), or other schemas
- Transformation logic: Writing SQL that cleans, joins, aggregates, and reshapes data
- Data quality: Testing data for correctness, completeness, and freshness
- Documentation: Making datasets discoverable and understandable
- Maintenance: Managing schema changes, handling upstream breakages
dbt is the tool that makes all of this manageable.
dbt Core vs dbt Cloud
dbt Core is the open-source command-line tool. You install it locally, write your models, and run them against your warehouse. It is free, extensible, and integrates with any CI/CD system.
dbt Cloud is the managed platform built on top of Core. It provides a web IDE, job scheduling, environment management, documentation hosting, and a semantic layer. Pricing is tiered, with a free developer plan available.
For learning and small teams, dbt Core is sufficient. For larger organizations that want managed scheduling, a shared IDE, and enterprise features, dbt Cloud adds significant value. The SQL you write is identical in both.
Core Concepts
Models
A dbt model is simply a SQL SELECT statement saved as a .sql file. dbt handles the DDL (CREATE TABLE, CREATE VIEW) for you. You write the transformation logic; dbt manages the materialization.
```sql
-- models/staging/stg_orders.sql
-- Staging model: clean and rename raw order data
SELECT
    id AS order_id,
    user_id AS customer_id,
    status AS order_status,
    amount AS order_amount_cents,
    created_at AS ordered_at,
    updated_at
FROM {{ source('raw', 'orders') }}
WHERE id IS NOT NULL
```
The `{{ source() }}` function references a declared source table (defined in a YAML file) and enables lineage tracking.
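The source itself is declared once in a YAML file and then referenced from any model. A minimal sketch, assuming raw data lands in a schema called `raw` (the schema and table names here are illustrative):

```yaml
# models/staging/_stg_sources.yml
version: 2

sources:
  - name: raw            # referenced as source('raw', ...)
    schema: raw          # warehouse schema the data lands in (illustrative)
    tables:
      - name: orders
      - name: customers
```

Declaring sources this way also unlocks source freshness checks and puts raw tables on the lineage graph.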
Materializations
Materializations control how dbt builds a model in the warehouse:
| Materialization | What It Creates | When to Use |
|---|---|---|
| view | SQL view | Lightweight models, staging layers, rarely queried directly |
| table | Physical table (full refresh) | Small-to-medium models, frequently queried |
| incremental | Table with append/merge logic | Large fact tables, event data, high-volume streams |
| ephemeral | CTE (not materialized) | Reusable logic that does not need its own table |
You set the materialization in the model config:
```sql
-- models/marts/fct_orders.sql
{{ config(materialized='table') }}

SELECT
    o.order_id,
    o.customer_id,
    c.customer_name,
    o.order_amount_cents / 100.0 AS order_amount,
    o.ordered_at,
    DATE_TRUNC('month', o.ordered_at) AS order_month
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_customers') }} c
    ON o.customer_id = c.customer_id
```
The `{{ ref() }}` function is key — it creates a dependency between models. dbt uses these references to build a DAG (directed acyclic graph) and run models in the correct order.
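At compile time, dbt swaps each `ref()` for the fully qualified relation name for your target environment. Assuming the project builds into a schema called `analytics` (an illustrative name), the join above compiles to roughly:

```sql
-- Compiled output (schema name is illustrative)
FROM analytics.stg_orders o
LEFT JOIN analytics.stg_customers c
    ON o.customer_id = c.customer_id
```

Because the schema is resolved at compile time, the same model code can build into a dev schema locally and a prod schema in deployment without changes.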
Project Structure Best Practices
A well-organized dbt project follows a layered architecture:
```
models/
├── staging/                      # 1:1 with source tables, light cleaning
│   ├── _stg_sources.yml          # Source definitions
│   ├── _stg_models.yml           # Model configs and tests
│   ├── stg_orders.sql
│   ├── stg_customers.sql
│   └── stg_products.sql
├── intermediate/                 # Business logic joins, complex transformations
│   ├── int_orders_with_products.sql
│   └── int_customer_lifetime.sql
└── marts/                        # Final business-facing models
    ├── core/
    │   ├── _core_models.yml
    │   ├── fct_orders.sql
    │   ├── dim_customers.sql
    │   └── dim_products.sql
    └── marketing/
        ├── _marketing_models.yml
        └── fct_campaign_performance.sql
```
Layer Guidelines
Staging models (stg_):
- One model per source table
- Rename columns to consistent conventions
- Cast data types
- Filter out obviously invalid records
- Materialized as views (cheap, always fresh)
Intermediate models (int_):
- Join multiple staging models
- Apply business logic
- Not exposed to end users
- Materialized as ephemeral or views
Marts models (fct_ and dim_):
- Business-facing, consumption-ready
- Follow dimensional modeling conventions (star schema or snowflake schema)
- Well-documented with descriptions
- Materialized as tables or incremental
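These per-layer materialization defaults do not have to be repeated in every model; they can be set once per folder in `dbt_project.yml`, with individual models overriding via `config()` where needed. A sketch, assuming the project is named `my_project`:

```yaml
# dbt_project.yml (excerpt; project name is an assumption)
models:
  my_project:
    staging:
      +materialized: view
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
```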
Testing
dbt's testing framework is one of its most valuable features. Tests run after models build and catch data quality issues before they reach dashboards.
Generic Tests
Built-in tests that you declare in YAML:
```yaml
# models/staging/_stg_models.yml
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders from the raw schema"
    columns:
      - name: order_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: order_status
        description: "Current status of the order"
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled', 'returned']
      - name: customer_id
        description: "Foreign key to stg_customers"
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```
The four built-in generic tests are `unique`, `not_null`, `accepted_values`, and `relationships`. Packages like `dbt-utils` and `dbt-expectations` add dozens more.
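For example, `dbt_utils.unique_combination_of_columns` covers composite keys that no single-column `unique` test can (the model and column names below are illustrative):

```yaml
models:
  - name: fct_order_items
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_id
            - product_id
```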
Singular Tests
Custom SQL queries that return failing rows. If the query returns any rows, the test fails.
```sql
-- tests/assert_orders_amount_positive.sql
-- All orders should have a positive amount
SELECT order_id, order_amount_cents
FROM {{ ref('stg_orders') }}
WHERE order_amount_cents <= 0
```
Custom Generic Tests
You can write reusable test macros:
```sql
-- macros/test_is_positive.sql
{% test is_positive(model, column_name) %}

SELECT {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
```
Then use it in YAML like any built-in test:
```yaml
columns:
  - name: order_amount
    tests:
      - is_positive
```
Documentation
dbt generates a documentation website from your YAML descriptions, model SQL, and DAG. Running `dbt docs generate` and `dbt docs serve` gives you a browsable, searchable documentation site with lineage graphs.
Good documentation practices:
- Write descriptions for every model and every column in your marts layer
- Use docblocks for longer descriptions that need markdown formatting
- Document sources thoroughly — they are the entry point for understanding data flow
```yaml
# models/marts/core/_core_models.yml
version: 2

models:
  - name: fct_orders
    description: >
      Fact table containing one row per order. Includes order amounts in dollars,
      customer information, and temporal dimensions. Used by the finance and
      marketing teams for revenue reporting.
    columns:
      - name: order_id
        description: "Unique identifier for each order. Primary key."
      - name: order_amount
        description: "Order total in USD (converted from cents in staging)."
      - name: order_month
        description: "Month of the order, truncated. Used for monthly aggregation."
```
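A docblock is a `{% docs %}` block in any `.md` file in the project; it holds markdown and is pulled into YAML with the `doc()` function. A sketch reusing the `order_amount` description (the file and block names are illustrative):

```
{# models/docs.md #}
{% docs order_amount %}
Order total in USD (converted from cents in staging).

Amounts are stored as **cents** upstream and divided by 100 in the marts layer.
{% enddocs %}
```

The column's YAML entry then becomes `description: '{{ doc("order_amount") }}'`, so the same text can be reused across models.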
Macros and Packages
Macros
Macros are Jinja templates that generate SQL. They reduce duplication and standardize logic across your project.
```sql
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
    ROUND({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```
Use it in any model:
```sql
SELECT
    order_id,
    {{ cents_to_dollars('order_amount_cents') }} AS order_amount
FROM {{ ref('stg_orders') }}
```
Packages
dbt packages are reusable collections of macros and models. Install them via packages.yml:
```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<1.0.0"]
  - package: dbt-labs/codegen
    version: [">=0.12.0", "<1.0.0"]
```
Run `dbt deps` to install. Essential packages:
- `dbt_utils`: `surrogate_key`, `pivot`, `unpivot`, `date_spine`, and many more utilities
- `dbt_expectations`: Great Expectations-style data quality tests
- `codegen`: generate YAML and staging model boilerplate from sources
Incremental Models
Incremental models are critical for performance. Instead of rebuilding an entire table on every run, they process only new or changed records.
```sql
-- models/marts/fct_page_views.sql
{{ config(
    materialized='incremental',
    unique_key='page_view_id',
    incremental_strategy='merge',
    on_schema_change='sync_all_columns'
) }}

SELECT
    page_view_id,
    user_id,
    page_url,
    referrer_url,
    session_id,
    viewed_at
FROM {{ ref('stg_page_views') }}

{% if is_incremental() %}
  -- Only process records newer than the last run
  WHERE viewed_at > (SELECT MAX(viewed_at) FROM {{ this }})
{% endif %}
```
The `{% if is_incremental() %}` block only applies during incremental runs. On the first run (or when you run `dbt run --full-refresh`), the entire table is built.
Incremental Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| append | INSERT new rows only | Immutable event data |
| merge | MERGE/UPSERT on unique_key | Mutable data that updates |
| delete+insert | DELETE matching keys, INSERT new | Warehouses without MERGE support |
| insert_overwrite | Replace entire partitions | Partition-based tables (BigQuery, Spark) |
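On BigQuery, for instance, `insert_overwrite` pairs with a `partition_by` config so each run replaces only the affected date partitions rather than merging row by row. A sketch, assuming the page-views model above (column names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'viewed_at',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}
```

This tends to be cheaper than `merge` for large event tables, since overwriting a partition avoids scanning the whole table for matching keys.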
Late-Arriving Data
The simple `WHERE viewed_at > MAX(viewed_at)` pattern misses late-arriving records. A safer approach uses a lookback window:
```sql
{% if is_incremental() %}
  WHERE viewed_at > (
      SELECT DATEADD('day', -3, MAX(viewed_at))
      FROM {{ this }}
  )
{% endif %}
```
Combined with a `unique_key` and the `merge` strategy, this catches records that arrived up to 3 days late without creating duplicates.
Snapshots (SCD Type 2)
Snapshots track how data changes over time by implementing Slowly Changing Dimension Type 2. dbt automatically adds `dbt_valid_from` and `dbt_valid_to` columns.
```sql
-- snapshots/snap_customers.sql
{% snapshot snap_customers %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

SELECT
    customer_id,
    customer_name,
    email,
    plan_type,
    updated_at
FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
```
When a customer's `plan_type` changes from "free" to "pro", the snapshot table will have two rows for that customer:
| customer_id | plan_type | dbt_valid_from | dbt_valid_to |
|---|---|---|---|
| 123 | free | 2025-01-01 | 2025-06-15 |
| 123 | pro | 2025-06-15 | NULL |
The row with `dbt_valid_to = NULL` is the current version. This is invaluable for historical reporting and auditing.
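Downstream models typically filter the snapshot to either the current version or a point in time. A sketch of both patterns (the as-of date is illustrative):

```sql
-- Current version of each customer
SELECT *
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_to IS NULL;

-- State of each customer as of a given date
SELECT *
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_from <= '2025-03-01'
  AND (dbt_valid_to > '2025-03-01' OR dbt_valid_to IS NULL);
```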
Common dbt Commands
```bash
# Run all models
dbt run

# Run a specific model and its upstream dependencies
dbt run --select +fct_orders

# Run all models in the marts directory
dbt run --select marts.*

# Run tests for specific models
dbt test --select stg_orders

# Run everything: build models, test, snapshot
dbt build

# Full refresh an incremental model
dbt run --select fct_page_views --full-refresh

# Generate and serve documentation
dbt docs generate && dbt docs serve

# Compile SQL without executing (useful for debugging)
dbt compile --select fct_orders
```
Advanced Patterns
Surrogate Keys
Use `dbt_utils.generate_surrogate_key` to create deterministic surrogate keys:
```sql
SELECT
    {{ dbt_utils.generate_surrogate_key(['order_id', 'product_id']) }} AS order_item_key,
    order_id,
    product_id,
    quantity,
    unit_price
FROM {{ ref('stg_order_items') }}
```
Date Spines
Generate a continuous date dimension using `dbt_utils.date_spine`:
```sql
-- models/intermediate/int_date_spine.sql
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2030-12-31' as date)"
) }}
```
Pre/Post Hooks
Execute SQL before or after a model runs:
```sql
{{ config(
    materialized='table',
    post_hook=[
        "GRANT SELECT ON {{ this }} TO ROLE reporting",
        "ALTER TABLE {{ this }} SET TAG pii = 'false'"
    ]
) }}
```
dbt in Practice
A typical dbt development workflow:
- Branch: Create a feature branch in Git
- Develop: Write or modify models in your IDE
- Test locally: Run `dbt build --select +your_model` to build and test
- Review: Open a pull request; CI runs `dbt build` in a staging environment
- Merge: After approval, merge to main
- Deploy: Production `dbt run` executes on a schedule (hourly, daily)
This workflow means your transformation logic has the same rigor as application code: version controlled, reviewed, tested, and deployed through CI/CD.
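The CI step in that workflow can be as small as a few commands. A minimal GitHub Actions sketch, purely illustrative: the adapter, target name, and secret handling are assumptions you would swap for your own warehouse and profile setup:

```yaml
# .github/workflows/dbt-ci.yml (sketch; adapter and target are assumptions)
name: dbt CI
on: pull_request

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-core dbt-snowflake   # pick your warehouse's adapter
      - run: dbt deps
      - run: dbt build --target ci                # build and test in a staging target
        env:
          DBT_PROFILES_DIR: .
```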
Getting Started
For hands-on practice with dbt, check out our Analytics Engineering with dbt project where you build a complete transformation layer from raw data to business-ready models. If you want to strengthen your data modeling foundations first, start with our data modeling fundamentals.
dbt has a vibrant community, excellent documentation, and an annual conference (Coalesce) worth attending. Whether you are building your first staging model or designing a multi-team dbt mesh deployment, the fundamentals covered here will serve as your foundation.
Frequently Asked Questions
What is dbt?
dbt (data build tool) is an open-source transformation framework that enables data teams to write, test, and document SQL-based transformations inside the data warehouse. A dbt model is a SQL SELECT statement saved as a file — dbt handles all DDL (CREATE TABLE, CREATE VIEW) automatically. It uses Jinja templating for dynamic SQL, builds a dependency DAG from model references, and brings software engineering practices like version control and CI/CD to analytics workflows.
What is analytics engineering?
Analytics engineering is the discipline that sits between data engineering and data analytics, focused on the transformation layer that turns raw ingested data into clean, modeled, trustworthy datasets. Analytics engineers design data models, write transformation logic in SQL, implement data quality tests, maintain documentation, and manage schema changes — all typically using dbt as their primary tool.
What is the difference between dbt Core and dbt Cloud?
dbt Core is the free, open-source command-line tool that you install locally and run against your warehouse. dbt Cloud is the managed platform built on top of Core that adds a web IDE, job scheduling, environment management, hosted documentation, and a semantic layer. The SQL you write is identical in both — the difference is in the operational tooling around it. dbt Core is sufficient for learning and small teams; dbt Cloud adds value for larger organizations.
What are dbt materializations?
Materializations control how dbt builds a model in the warehouse. The four types are: view (creates a SQL view, lightweight and always fresh), table (creates a physical table with full refresh), incremental (appends or merges only new/changed records for large tables), and ephemeral (compiles to a CTE, not materialized in the warehouse). Choose based on model size, query frequency, and data freshness needs.
What are dbt tests?
dbt tests are assertions that validate data quality after models are built. Generic tests are declared in YAML and include built-in checks like unique, not_null, accepted_values, and relationships (foreign key integrity). Singular tests are custom SQL queries that return failing rows — if any rows are returned, the test fails. Packages like dbt-utils and dbt-expectations add dozens of additional reusable test types.