CI/CD for Data Pipelines
Build a complete CI/CD pipeline for a data engineering project using GitHub Actions, dbt, Airflow DAG testing, and Terraform infrastructure deployment.
Project Overview
In this project, you'll build a production-ready CI/CD pipeline for a data engineering project. You will create automated workflows that lint SQL code, run data tests, validate Airflow DAGs, and deploy infrastructure with Terraform. The goal is to ensure code quality, catch errors early, and automate deployments for data pipelines.
Most data teams treat CI/CD as an afterthought, but in production environments, automated testing and deployment are what separate fragile pipelines from reliable ones. By the end of this project, you'll have a reusable CI/CD framework that mirrors what top data teams at companies like Netflix, Airbnb, and Spotify use to ship data products with confidence.
Learning Objectives
Data Engineering CI/CD
- Understand why CI/CD matters for data projects and how it differs from traditional software CI/CD
- Design GitHub Actions workflows tailored for data engineering tasks
- Automate SQL linting and formatting with SQLFluff
- Run data tests in CI environments using dbt
- Manage Infrastructure as Code deployment patterns with Terraform
Practical Skills
- Configure pre-commit hooks for data projects (SQLFluff, Black, yamllint)
- Build dbt test automation pipelines in GitHub Actions
- Write Airflow DAG tests (unit tests, integration tests, import validation)
- Implement Terraform plan/apply workflows in CI with remote state management
- Handle secrets securely in CI/CD pipelines using GitHub Secrets
Best Practices
- Set up branch protection rules and code review workflows
- Design environment-based deployment strategies (dev / staging / prod)
- Implement rollback and failure handling strategies
- Configure pipeline monitoring, notifications, and audit trails
Prerequisites
- A GitHub account with access to GitHub Actions (free tier is sufficient)
- Python 3.9+ installed locally
- Basic familiarity with Git, SQL, and YAML
- Docker installed (for running Airflow locally)
- (Optional) A GCP or AWS account for Terraform deployment targets
Estimated Duration
Duration: 6-10 hours
Difficulty: Intermediate
Project Structure
cicd-data-pipelines/
├── .github/
│ └── workflows/
│ ├── dbt-ci.yml
│ ├── airflow-tests.yml
│ ├── terraform-deploy.yml
│ └── pre-commit-check.yml
├── .pre-commit-config.yaml
├── dbt_project/
│ ├── dbt_project.yml
│ ├── profiles.yml
│ ├── models/
│ │ ├── staging/
│ │ │ ├── stg_orders.sql
│ │ │ └── stg_customers.sql
│ │ └── marts/
│ │ └── fct_daily_revenue.sql
│ └── tests/
│ └── assert_positive_revenue.sql
├── airflow/
│ ├── dags/
│ │ └── etl_pipeline_dag.py
│ └── tests/
│ ├── test_dag_integrity.py
│ ├── test_dag_structure.py
│ └── test_task_callables.py
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── backend.tf
│ └── environments/
│ ├── dev.tfvars
│ ├── staging.tfvars
│ └── prod.tfvars
├── scripts/
│ └── post_pr_comment.py
├── pyproject.toml
├── requirements.txt
├── README.md
└── .gitignore
Step-by-Step Guide
1. Pre-commit Hooks Setup
Pre-commit hooks run checks on your code before it is committed to version control. This is your first line of defense against bad code reaching your repository.
Install pre-commit
pip install pre-commit
Create .pre-commit-config.yaml
repos:
# General file hygiene
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace
name: Trim trailing whitespace
- id: end-of-file-fixer
name: Fix end of file
- id: check-yaml
name: Validate YAML syntax
- id: check-added-large-files
name: Check for large files
args: ['--maxkb=1000']
# Python formatting with Black
- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black
name: Format Python with Black
language_version: python3
types: [python]
# YAML linting
- repo: https://github.com/adrienverge/yamllint
rev: v1.35.1
hooks:
- id: yamllint
name: Lint YAML files
args: ['-d', '{extends: relaxed, rules: {line-length: {max: 120}}}']
# SQL linting with SQLFluff
- repo: https://github.com/sqlfluff/sqlfluff
rev: 3.0.7
hooks:
- id: sqlfluff-lint
name: Lint SQL with SQLFluff
args: ['--dialect', 'bigquery']
additional_dependencies: ['sqlfluff-templater-dbt']
- id: sqlfluff-fix
name: Fix SQL with SQLFluff
args: ['--dialect', 'bigquery', '--force']
additional_dependencies: ['sqlfluff-templater-dbt']
Hook-by-hook explanation
| Hook | Purpose |
|---|---|
| trailing-whitespace | Removes unnecessary whitespace at the end of lines |
| end-of-file-fixer | Ensures files end with a newline character |
| check-yaml | Validates YAML syntax to catch typos early |
| check-added-large-files | Prevents accidental commits of large data files |
| black | Formats Python code consistently across the team |
| yamllint | Lints YAML files (Airflow DAGs, dbt configs, CI workflows) |
| sqlfluff-lint | Checks SQL style and formatting against configurable rules |
| sqlfluff-fix | Auto-fixes SQL issues (aliasing, indentation, keywords) |
Configure SQLFluff rules
Create a .sqlfluff configuration file:
[sqlfluff]
dialect = bigquery
templater = dbt
max_line_length = 120
exclude_rules = LT05
[sqlfluff:indentation]
indent_unit = space
tab_space_size = 4
[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper
[sqlfluff:rules:capitalisation.identifiers]
capitalisation_policy = lower
[sqlfluff:rules:aliasing.table]
aliasing = explicit
[sqlfluff:rules:aliasing.column]
aliasing = explicit
Activate the hooks
pre-commit install
pre-commit run --all-files # Run on all existing files
From now on, every git commit will automatically run these checks. If any hook fails, the commit is blocked until you fix the issue.
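The project structure also includes .github/workflows/pre-commit-check.yml, which runs the same hooks in CI so that the checks cannot be bypassed with git commit --no-verify. A minimal sketch of that workflow (step layout and tool versions are assumptions, not prescribed by this guide):
name: Pre-commit Checks
on:
  pull_request:
    branches: [main]
jobs:
  pre-commit:
    name: Run pre-commit hooks
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install and run pre-commit
        run: |
          pip install pre-commit
          pre-commit run --all-files --show-diff-on-failure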
2. GitHub Actions: SQL & dbt Pipeline
This workflow runs on every pull request targeting main. It sets up a Python environment, lints your SQL code, and runs dbt build against a CI database to validate your models and tests.
Create .github/workflows/dbt-ci.yml
name: dbt CI Pipeline
on:
pull_request:
branches: [main]
paths:
- 'dbt_project/**'
- '.github/workflows/dbt-ci.yml'
env:
DBT_PROFILES_DIR: ./dbt_project
PYTHON_VERSION: '3.11'
# Expose the service account key so the dbt "ci" profile can read it via env_var()
GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
jobs:
sqlfluff-lint:
name: SQL Lint Check
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install dependencies
run: |
pip install sqlfluff sqlfluff-templater-dbt dbt-bigquery
- name: Run SQLFluff lint
run: |
sqlfluff lint dbt_project/models/ \
--dialect bigquery \
--format github-annotation \
--annotation-level warning
dbt-build:
name: dbt Build & Test
runs-on: ubuntu-latest
needs: sqlfluff-lint
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
cache: 'pip'
- name: Install dependencies
run: |
pip install dbt-bigquery
- name: Authenticate to GCP
uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Run dbt deps
working-directory: dbt_project
run: dbt deps
- name: Run dbt build (compile + run + test)
working-directory: dbt_project
run: |
dbt build \
--target ci \
--select state:modified+ \
--defer \
--state ./prod-manifest/ \
--full-refresh
env:
DBT_CI_SCHEMA: ci_pr_${{ github.event.pull_request.number }}
- name: Post test results as PR comment
if: always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const runResults = JSON.parse(
fs.readFileSync('dbt_project/target/run_results.json', 'utf8')
);
const total = runResults.results.length;
const passed = runResults.results.filter(r => r.status === 'pass' || r.status === 'success').length;
const failed = runResults.results.filter(r => r.status === 'fail').length;
const errored = runResults.results.filter(r => r.status === 'error').length;
const icon = failed > 0 || errored > 0 ? ':x:' : ':white_check_mark:';
const body = `## ${icon} dbt Build Results
| Metric | Count |
|--------|-------|
| Total | ${total} |
| Passed | ${passed} |
| Failed | ${failed} |
| Errors | ${errored} |
**Target schema**: \`ci_pr_${{ github.event.pull_request.number }}\`
**Commit**: \`${{ github.sha }}\`
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
- name: Clean up CI schema
if: always()
working-directory: dbt_project
run: |
dbt run-operation drop_ci_schema --target ci \
--args '{"schema": "ci_pr_${{ github.event.pull_request.number }}"}'
Key concepts explained
- paths filter: The workflow only triggers when dbt project files change, saving CI minutes.
- state:modified+: Only builds models that changed in the PR, plus their downstream dependencies. This dramatically reduces CI run time for large projects.
- --defer: Uses the production manifest as a reference, so unchanged upstream models are read from production instead of being rebuilt.
- CI schema isolation: Each PR gets its own schema (for example, ci_pr_42), preventing test interference between concurrent PRs.
- PR comment: Posts a summary of test results directly on the pull request for easy review.
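The cleanup step calls dbt run-operation drop_ci_schema, so the project needs a macro by that name. A minimal sketch, assuming BigQuery and a file at dbt_project/macros/drop_ci_schema.sql (the exact body is up to you):
{% macro drop_ci_schema(schema) %}
    {# Drop the per-PR schema created during the CI run #}
    {% set sql %}
        drop schema if exists `{{ target.project }}`.`{{ schema }}` cascade
    {% endset %}
    {% do run_query(sql) %}
    {% do log("Dropped CI schema: " ~ schema, info=True) %}
{% endmacro %}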
Create the CI profile
Add a ci target to your profiles.yml:
my_project:
target: dev
outputs:
dev:
type: bigquery
method: oauth
project: my-gcp-project
dataset: dev_analytics
location: US
ci:
type: bigquery
method: service-account-json
project: my-gcp-project
dataset: "{{ env_var('DBT_CI_SCHEMA', 'ci_default') }}"
location: US
keyfile_json: "{{ env_var('GCP_SA_KEY') }}"
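The --defer and --state flags in the build step assume a copy of the production manifest.json is present at ./prod-manifest/ inside the dbt project. How you publish that artifact is up to you; one common pattern (bucket path assumed for illustration) is to upload target/manifest.json after each production run and add a download step to the CI job before dbt build:
      - name: Download production manifest for state comparison
        run: |
          # Assumes the earlier GCP auth step has made credentials available to the gcloud CLI
          mkdir -p dbt_project/prod-manifest
          gcloud storage cp gs://my-project-dbt-artifacts/prod/manifest.json \
            dbt_project/prod-manifest/manifest.json
If no production manifest exists yet (for example, on the very first run), drop the --select state:modified+, --defer, and --state flags and build everything.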
3. Airflow DAG Testing
Testing Airflow DAGs is critical. A DAG file with an import error silently disappears from the scheduler, and an error in a shared helper module can break every DAG that imports it. This section covers three levels of testing: import validation, structural tests, and unit tests.
Create .github/workflows/airflow-tests.yml
name: Airflow DAG Tests
on:
pull_request:
branches: [main]
paths:
- 'airflow/**'
- '.github/workflows/airflow-tests.yml'
jobs:
dag-tests:
name: Airflow DAG Validation
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'
- name: Install dependencies
run: |
pip install apache-airflow==2.9.0 pytest pytest-cov
pip install -r requirements.txt
- name: Set Airflow home
run: |
echo "AIRFLOW_HOME=${{ github.workspace }}/airflow" >> $GITHUB_ENV
echo "AIRFLOW__CORE__DAGS_FOLDER=${{ github.workspace }}/airflow/dags" >> $GITHUB_ENV
echo "AIRFLOW__CORE__LOAD_EXAMPLES=False" >> $GITHUB_ENV
- name: Initialize Airflow DB
run: airflow db migrate
- name: Run DAG tests
run: |
pytest airflow/tests/ \
-v \
--tb=short \
--junitxml=test-results/airflow-results.xml
- name: Upload test results
if: always()
uses: actions/upload-artifact@v4
with:
name: airflow-test-results
path: test-results/
Test 1: DAG Import Validation (test_dag_integrity.py)
This is the most important test. If a DAG file has an import error, Airflow's scheduler cannot parse it: the DAG silently disappears from the UI, and an error in a shared module can break every DAG that imports it.
"""Test that all DAG files can be imported without errors."""
import os
import pytest
from airflow.models import DagBag
DAGS_FOLDER = os.path.join(os.path.dirname(__file__), "..", "dags")
@pytest.fixture(scope="session")
def dagbag():
"""Load all DAGs from the dags folder."""
return DagBag(dag_folder=DAGS_FOLDER, include_examples=False)
def test_no_import_errors(dagbag):
"""Verify that no DAG files have import errors."""
assert len(dagbag.import_errors) == 0, (
f"DAG import errors found: {dagbag.import_errors}"
)
def test_dagbag_not_empty(dagbag):
"""Verify that at least one DAG was loaded."""
assert len(dagbag.dags) > 0, "No DAGs found in the dags folder"
def test_dag_has_tags(dag_id, dag):
"""Verify every DAG has at least one tag (parametrized dynamically by pytest_generate_tests below)."""
assert len(dag.tags) > 0, f"DAG '{dag_id}' has no tags"
def pytest_generate_tests(metafunc):
"""Dynamically parametrize DAG-level tests."""
if "dag_id" in metafunc.fixturenames:
dagbag = DagBag(dag_folder=DAGS_FOLDER, include_examples=False)
metafunc.parametrize(
"dag_id,dag",
[(dag_id, dag) for dag_id, dag in dagbag.dags.items()],
)
Test 2: DAG Structure Tests (test_dag_structure.py)
Validate scheduling, timeouts, retries, and other structural properties of your DAGs.
"""Test DAG structural properties and best practices."""
import os
import pytest
from datetime import timedelta
from airflow.models import DagBag
DAGS_FOLDER = os.path.join(os.path.dirname(__file__), "..", "dags")
MAX_DAG_TIMEOUT_MINUTES = 360 # 6 hours
@pytest.fixture(scope="session")
def dagbag():
return DagBag(dag_folder=DAGS_FOLDER, include_examples=False)
@pytest.fixture(scope="session")
def all_dags(dagbag):
return list(dagbag.dags.values())
class TestDagDefaults:
"""Test default arguments and configuration for all DAGs."""
def test_default_retries(self, all_dags):
"""Every DAG should have at least 1 retry configured."""
for dag in all_dags:
retries = dag.default_args.get("retries", 0)
assert retries >= 1, (
f"DAG '{dag.dag_id}' has {retries} retries. "
f"Production DAGs should have at least 1 retry."
)
def test_default_retry_delay(self, all_dags):
"""Retry delay should be at least 1 minute."""
for dag in all_dags:
retry_delay = dag.default_args.get(
"retry_delay", timedelta(seconds=0)
)
assert retry_delay >= timedelta(minutes=1), (
f"DAG '{dag.dag_id}' retry_delay is too short: {retry_delay}"
)
def test_owner_is_set(self, all_dags):
"""Every DAG should have an owner set (not 'airflow' default)."""
for dag in all_dags:
owner = dag.default_args.get("owner", "airflow")
assert owner != "airflow", (
f"DAG '{dag.dag_id}' uses default owner 'airflow'. "
f"Set a team or person as owner."
)
def test_dag_has_description(self, all_dags):
"""Every DAG should have a description for documentation."""
for dag in all_dags:
assert dag.description is not None and len(dag.description) > 0, (
f"DAG '{dag.dag_id}' is missing a description."
)
class TestDagSchedule:
"""Test scheduling configuration."""
def test_schedule_is_set(self, all_dags):
"""Every DAG should have an explicit schedule."""
for dag in all_dags:
assert dag.schedule_interval is not None, (
f"DAG '{dag.dag_id}' has no schedule_interval set."
)
def test_catchup_is_disabled(self, all_dags):
"""Catchup should be disabled unless explicitly needed."""
for dag in all_dags:
assert dag.catchup is False, (
f"DAG '{dag.dag_id}' has catchup=True. "
f"Disable unless backfilling is intentional."
)
Test 3: Unit Tests for Task Callables (test_task_callables.py)
Test the actual Python functions that your tasks execute, independently from Airflow.
"""Unit tests for individual task callables."""
import json
import pytest
from unittest.mock import patch, MagicMock
# Import the callable functions from your DAG module
# Adjust the import path based on your project structure
from dags.etl_pipeline_dag import (
extract_data,
transform_data,
validate_data,
)
class TestExtractData:
"""Test the extract_data task callable."""
@patch("dags.etl_pipeline_dag.requests.get")
def test_extract_returns_data(self, mock_get):
"""Extract should return parsed JSON data."""
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
"results": [{"id": 1, "value": 100}]
}
mock_get.return_value = mock_response
result = extract_data(endpoint="/api/data", date="2025-01-01")
assert "results" in result
assert len(result["results"]) == 1
@patch("dags.etl_pipeline_dag.requests.get")
def test_extract_handles_api_error(self, mock_get):
"""Extract should raise on non-200 responses."""
mock_response = MagicMock()
mock_response.status_code = 500
mock_response.raise_for_status.side_effect = Exception("Server Error")
mock_get.return_value = mock_response
with pytest.raises(Exception, match="Server Error"):
extract_data(endpoint="/api/data", date="2025-01-01")
class TestTransformData:
"""Test the transform_data task callable."""
def test_transform_adds_computed_fields(self):
"""Transform should add derived columns."""
raw_data = [
{"id": 1, "amount_cents": 1500, "currency": "USD"},
{"id": 2, "amount_cents": 2300, "currency": "USD"},
]
result = transform_data(raw_data)
assert all("amount_dollars" in row for row in result)
assert result[0]["amount_dollars"] == 15.00
assert result[1]["amount_dollars"] == 23.00
def test_transform_handles_empty_input(self):
"""Transform should return empty list for empty input."""
result = transform_data([])
assert result == []
class TestValidateData:
"""Test the validate_data task callable."""
def test_validate_passes_for_valid_data(self):
"""Validation should pass when data meets quality rules."""
data = [
{"id": 1, "amount_dollars": 15.00, "date": "2025-01-01"},
{"id": 2, "amount_dollars": 23.00, "date": "2025-01-01"},
]
# Should not raise
validate_data(data, min_rows=1, max_null_pct=0.0)
def test_validate_fails_for_too_few_rows(self):
"""Validation should fail when row count is below minimum."""
with pytest.raises(ValueError, match="Expected at least"):
validate_data([], min_rows=1, max_null_pct=0.0)
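These tests import extract_data, transform_data, and validate_data from dags/etl_pipeline_dag.py, which is not shown in this guide. A simplified sketch of what that module might look like, consistent with the tests above (the API endpoint and business logic are placeholders, not the project's actual code):
"""Example ETL DAG with testable task callables. Placeholder logic for illustration only."""
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

API_BASE_URL = "https://api.example.com"  # placeholder, not a real endpoint


def extract_data(endpoint: str, date: str) -> dict:
    """Fetch raw records from the source API for a given date."""
    response = requests.get(f"{API_BASE_URL}{endpoint}", params={"date": date}, timeout=30)
    response.raise_for_status()  # surfaces non-2xx responses as exceptions
    return response.json()


def transform_data(raw_data: list) -> list:
    """Add derived columns, e.g. convert cents to dollars."""
    return [{**row, "amount_dollars": row["amount_cents"] / 100} for row in raw_data]


def validate_data(data: list, min_rows: int, max_null_pct: float) -> None:
    """Basic data quality gate before loading downstream."""
    if len(data) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(data)}")
    # A null-percentage check against max_null_pct would go here.


default_args = {
    "owner": "data-platform-team",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="etl_pipeline",
    description="Daily ETL: extract from source API, transform, validate.",
    default_args=default_args,
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    dagrun_timeout=timedelta(hours=2),
    tags=["etl", "example"],
) as dag:
    extract = PythonOperator(
        task_id="extract_data",
        python_callable=extract_data,
        op_kwargs={"endpoint": "/api/data", "date": "{{ ds }}"},
    )
    # transform, validate, and load tasks would follow the same pattern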
4. Terraform Infrastructure Deployment
This workflow runs terraform plan on pull requests (and posts the plan output as a PR comment for review) and runs terraform apply when changes are merged to main.
Create .github/workflows/terraform-deploy.yml
name: Terraform Infrastructure
on:
pull_request:
branches: [main]
paths:
- 'terraform/**'
- '.github/workflows/terraform-deploy.yml'
push:
branches: [main]
paths:
- 'terraform/**'
env:
TF_VERSION: '1.7.0'
TF_WORKING_DIR: './terraform'
permissions:
contents: read
pull-requests: write
jobs:
terraform-plan:
name: Terraform Plan
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Authenticate to GCP
uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Terraform Init
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform init -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}"
- name: Terraform Format Check
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform fmt -check -recursive
- name: Terraform Validate
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform validate
- name: Terraform Plan
id: plan
working-directory: ${{ env.TF_WORKING_DIR }}
run: |
set -o pipefail  # ensure a failed plan fails the step even though output is piped to tee
terraform plan \
-var-file=environments/staging.tfvars \
-no-color \
-out=tfplan \
2>&1 | tee plan_output.txt
- name: Post plan to PR
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const plan = fs.readFileSync(
'${{ env.TF_WORKING_DIR }}/plan_output.txt', 'utf8'
);
// Truncate if too long for GitHub comment
const maxLength = 60000;
const truncated = plan.length > maxLength
? plan.substring(0, maxLength) + '\n\n... (truncated)'
: plan;
const body = `## Terraform Plan Output
<details>
<summary>Click to expand plan</summary>
\`\`\`hcl
${truncated}
\`\`\`
</details>
**Terraform Version**: \`${{ env.TF_VERSION }}\`
**Environment**: staging
**Commit**: \`${{ github.sha }}\`
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
terraform-apply:
name: Terraform Apply
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
environment: production
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Authenticate to GCP
uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Terraform Init
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform init -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}"
- name: Terraform Apply
working-directory: ${{ env.TF_WORKING_DIR }}
run: |
terraform apply \
-var-file=environments/prod.tfvars \
-auto-approve
Remote Backend Configuration (backend.tf)
Store Terraform state in a cloud bucket so your team can collaborate safely:
terraform {
backend "gcs" {
bucket = "my-project-tf-state"
prefix = "data-platform"
}
}
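The state bucket has to exist before the first terraform init, and versioning makes it possible to recover earlier state files. A one-time setup sketch with the gcloud CLI (bucket name and location are placeholders):
# One-time setup: create a GCS bucket for Terraform state
gcloud storage buckets create gs://my-project-tf-state \
  --project=my-gcp-project \
  --location=us-central1 \
  --uniform-bucket-level-access

# Enable object versioning so previous state revisions can be restored
gcloud storage buckets update gs://my-project-tf-state --versioning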
Example Terraform Configuration (main.tf)
provider "google" {
project = var.project_id
region = var.region
}
# BigQuery dataset for the data warehouse
resource "google_bigquery_dataset" "analytics" {
dataset_id = "${var.environment}_analytics"
friendly_name = "Analytics Dataset (${var.environment})"
description = "Main analytics dataset for the data platform"
location = var.region
default_table_expiration_ms = var.environment == "dev" ? 86400000 : null
labels = {
environment = var.environment
managed_by = "terraform"
}
}
# Cloud Storage bucket for raw data landing
resource "google_storage_bucket" "data_lake" {
name = "${var.project_id}-${var.environment}-data-lake"
location = var.region
force_destroy = var.environment == "dev" ? true : false
lifecycle_rule {
condition {
age = var.environment == "dev" ? 7 : 90
}
action {
type = "Delete"
}
}
versioning {
enabled = var.environment == "prod" ? true : false
}
labels = {
environment = var.environment
managed_by = "terraform"
}
}
# Cloud Composer (managed Airflow) environment
resource "google_composer_environment" "airflow" {
count = var.deploy_composer ? 1 : 0
name = "${var.environment}-airflow"
region = var.region
config {
software_config {
image_version = "composer-2.7.0-airflow-2.7.3"
pypi_packages = {
dbt-bigquery = ">=1.7.0"
}
}
environment_size = var.environment == "prod" ? "ENVIRONMENT_SIZE_MEDIUM" : "ENVIRONMENT_SIZE_SMALL"
}
}
Environment Variables (variables.tf)
variable "project_id" {
description = "GCP project ID"
type = string
}
variable "region" {
description = "GCP region for resources"
type = string
default = "us-central1"
}
variable "environment" {
description = "Deployment environment (dev, staging, prod)"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be one of: dev, staging, prod."
}
}
variable "deploy_composer" {
description = "Whether to deploy Cloud Composer (Airflow)"
type = bool
default = false
}
Environment-Specific Variables (environments/dev.tfvars)
project_id = "my-gcp-project"
region = "us-central1"
environment = "dev"
deploy_composer = false
Secret Management
Store sensitive values in GitHub Secrets (Settings > Secrets and variables > Actions):
| Secret Name | Purpose |
|---|---|
| GCP_SA_KEY | Service account JSON key for GCP authentication |
| TF_STATE_BUCKET | Name of the GCS bucket storing Terraform state |
| DBT_CI_SERVICE_ACCOUNT | Service account for dbt CI runs |
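You can add these through the GitHub UI or from the command line with the GitHub CLI (the key file name below is an assumption):
# Set repository secrets with the GitHub CLI
gh secret set GCP_SA_KEY < service-account-key.json
gh secret set TF_STATE_BUCKET --body "my-project-tf-state"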
Never commit secrets to your repository. Use .gitignore to exclude:
# Secrets and credentials
*.json
!package.json
.env
.env.*
# Terraform
*.tfstate
*.tfstate.backup
.terraform/
5. Environment Strategy
Design a deployment strategy that safely promotes changes from development through staging to production.
Environment Overview
| Environment | Purpose | Trigger | Approval |
|---|---|---|---|
| dev | Rapid iteration and testing | Push to develop branch | None (automatic) |
| staging | Pre-production validation | PR to main | PR review required |
| prod | Live production systems | Merge to main | Manual approval gate |
Branch Protection Rules
Configure these rules on the main branch in GitHub (Settings > Branches > Branch protection rules):
- Require pull request reviews (at least 1 reviewer)
- Require status checks to pass (dbt CI, Airflow tests, Terraform plan)
- Require branches to be up to date before merging
- Do not allow force pushes
- Do not allow deletions
GitHub Environment Configuration
Create environments in GitHub (Settings > Environments) with protection rules:
# In terraform-deploy.yml, the 'production' environment
# requires manual approval before terraform apply runs
terraform-apply:
environment: production # <-- This triggers the approval gate
For the production environment, add:
- Required reviewers (1-2 senior engineers)
- Wait timer (optional, e.g., 5 minutes to allow cancellation)
- Deployment branch restrictions (only main)
dbt Environment Profiles
Each environment should have its own dbt target with isolated schemas:
# profiles.yml
my_project:
target: dev
outputs:
dev:
type: bigquery
project: my-gcp-project
dataset: dev_analytics
location: US
method: oauth
staging:
type: bigquery
project: my-gcp-project
dataset: staging_analytics
location: US
method: service-account-json
keyfile_json: "{{ env_var('GCP_SA_KEY') }}"
prod:
type: bigquery
project: my-gcp-project
dataset: prod_analytics
location: US
method: service-account-json
keyfile_json: "{{ env_var('GCP_SA_KEY') }}"
Promotion Workflow
The full promotion flow looks like this:
Developer pushes to feature branch
|
v
Open PR to main
|
+--> SQLFluff lint runs
+--> dbt build runs against CI schema
+--> Airflow DAG tests run
+--> Terraform plan runs (output posted to PR)
|
v
Code review + all checks pass
|
v
Merge to main
|
+--> Terraform apply to staging (automatic)
+--> dbt run against staging (automatic)
+--> Validation tests in staging
|
v
Manual approval for production
|
+--> Terraform apply to production
+--> dbt run against production
+--> Post-deployment validation
Rollback Strategy
When a deployment fails in production, you need a clear rollback plan:
- dbt rollback: Re-run the previous version's models using Git tags:
  git checkout v1.2.3      # Previous release tag
  dbt run --target prod
- Terraform rollback: Revert the merge commit and re-apply:
  git revert <merge-commit-sha>   # This triggers a new PR -> plan -> apply cycle
- Airflow rollback: If a DAG causes issues, pause it immediately via the Airflow UI, then deploy the fixed version through the normal CI/CD pipeline.
Stretch Goals
Once you have the core CI/CD pipeline working, consider these advanced extensions:
Slack Notifications
Add a notification step to your workflows:
- name: Notify Slack on failure
if: failure()
uses: slackapi/slack-github-action@v1.26.0
with:
payload: |
{
"text": ":red_circle: *${{ github.workflow }}* failed on `${{ github.ref_name }}`",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Workflow*: ${{ github.workflow }}\n*Branch*: `${{ github.ref_name }}`\n*Commit*: `${{ github.sha }}`\n*Author*: ${{ github.actor }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Data Diff in PR Workflow
Use a tool like datafold/data-diff to compare query results before and after model changes:
- name: Run data diff
run: |
pip install data-diff
data-diff \
bigquery://project/prod_analytics.fct_daily_revenue \
bigquery://project/ci_pr_${{ github.event.pull_request.number }}.fct_daily_revenue \
--stats \
--json > diff_results.json
Schema Migration Validation
Add a step to verify database schema changes are backward-compatible:
- name: Validate schema changes
run: |
dbt run --target ci --select state:modified
dbt test --target ci --select state:modified
# Compare column types and counts against production
python scripts/validate_schema_compat.py \
--prod-manifest prod-manifest/manifest.json \
--ci-manifest target/manifest.json
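The scripts/validate_schema_compat.py helper referenced above is left for you to write. A rough sketch of one possible approach (the rules here are assumptions, and it only sees columns documented in your schema.yml files): flag documented columns that disappear or change type between the production and CI manifests.
"""Rough sketch: flag models whose documented columns were removed or retyped."""
import argparse
import json
import sys


def load_model_columns(manifest_path: str) -> dict:
    """Map model unique_id -> {column_name: data_type} from a dbt manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {
        unique_id: {
            name: (col.get("data_type") or "").lower()
            for name, col in node.get("columns", {}).items()
        }
        for unique_id, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") == "model"
    }


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--prod-manifest", required=True)
    parser.add_argument("--ci-manifest", required=True)
    args = parser.parse_args()

    prod = load_model_columns(args.prod_manifest)
    ci = load_model_columns(args.ci_manifest)

    problems = []
    for unique_id, prod_cols in prod.items():
        if unique_id not in ci:
            continue  # model removed or renamed; treat as a separate review item
        ci_cols = ci[unique_id]
        for col, prod_type in prod_cols.items():
            if col not in ci_cols:
                problems.append(f"{unique_id}: column '{col}' was removed")
            elif prod_type and ci_cols[col] and ci_cols[col] != prod_type:
                problems.append(
                    f"{unique_id}: column '{col}' changed type {prod_type} -> {ci_cols[col]}"
                )

    if problems:
        print("Backward-incompatible schema changes detected:")
        print("\n".join(problems))
        return 1
    print("No backward-incompatible schema changes found.")
    return 0


if __name__ == "__main__":
    sys.exit(main())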
Blue-Green Deployment for Airflow
Implement zero-downtime DAG deployments by maintaining two Airflow environments and switching traffic between them using a load balancer or DNS swap.
Deliverables
At the end of this project, you should have:
- A .pre-commit-config.yaml with SQLFluff, Black, yamllint, and standard hooks
- A .sqlfluff configuration file with project-specific rules
- A dbt-ci.yml GitHub Actions workflow that lints SQL and runs dbt build on PRs
- An airflow-tests.yml workflow with DAG import validation, structural tests, and unit tests
- A terraform-deploy.yml workflow with plan-on-PR and apply-on-merge patterns
- Terraform configuration with environment-specific variable files (dev/staging/prod)
- Remote state backend configuration for Terraform
- Branch protection rules documented and configured
- A README.md explaining the CI/CD architecture, setup instructions, and how to onboard new team members
Tips and Resources
Documentation
- GitHub Actions Documentation - Official reference for workflow syntax and features
- SQLFluff Documentation - Rules reference and configuration guide
- dbt CI/CD Guide - Official dbt recommendations for CI
- Terraform GitHub Actions - HashiCorp's guide to Terraform in CI
- Airflow Testing Guide - Apache Airflow best practices for testing
- Pre-commit Documentation - Hook framework setup and configuration
Tips
- Start small: Begin with pre-commit hooks and SQLFluff linting before adding dbt and Terraform workflows. Each layer builds on the previous one.
- Use act for local testing: The nektos/act tool lets you run GitHub Actions workflows locally, saving time during development.
- Cache dependencies: Use actions/cache or the built-in cache option in actions/setup-python to speed up your CI runs significantly.
- Monitor CI costs: GitHub Actions offers 2,000 free minutes/month for private repos. Use paths filters and concurrency groups to avoid unnecessary runs (see the snippet after this list).
- Version pin everything: Pin your Action versions (@v4, not @main), Terraform version, Python version, and pip packages to avoid unexpected breakages.
- Test your CI locally first: Run pre-commit run --all-files, dbt build, and pytest locally before pushing. CI should catch what you missed, not be your primary testing environment.
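For example, a concurrency block like the following at the top level of a workflow file cancels superseded runs for the same branch or pull request:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true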
Common Pitfalls
- Forgetting to set AIRFLOW__CORE__LOAD_EXAMPLES=False in CI, which loads example DAGs and causes unexpected test failures
- Not isolating CI database schemas, leading to conflicts when multiple PRs run tests simultaneously
- Hardcoding secrets in workflow files instead of using GitHub Secrets
- Missing paths filters on workflows, causing every push to trigger all CI jobs
- Not cleaning up CI resources (temporary schemas, plan files) after workflow completion