CI/CD for Data Pipelines
Build a complete CI/CD pipeline for a data engineering project using GitHub Actions, dbt, Airflow DAG testing, and Terraform infrastructure deployment.
Project Overview
In this project, you'll build a production-ready CI/CD pipeline for a data engineering project. You will create automated workflows that lint SQL code, run data tests, validate Airflow DAGs, and deploy infrastructure with Terraform. The goal is to ensure code quality, catch errors early, and automate deployments for data pipelines.
Most data teams treat CI/CD as an afterthought, but in production environments, automated testing and deployment are what separate fragile pipelines from reliable ones. By the end of this project, you'll have a reusable CI/CD framework that mirrors what top data teams at companies like Netflix, Airbnb, and Spotify use to ship data products with confidence.
Learning Objectives
Data Engineering CI/CD
- Understand why CI/CD matters for data projects and how it differs from traditional software CI/CD
- Design GitHub Actions workflows tailored for data engineering tasks
- Automate SQL linting and formatting with SQLFluff
- Run data tests in CI environments using dbt
- Manage Infrastructure as Code deployment patterns with Terraform
Practical Skills
- Configure pre-commit hooks for data projects (SQLFluff, Black, yamllint)
- Build dbt test automation pipelines in GitHub Actions
- Write Airflow DAG tests (unit tests, integration tests, import validation)
- Implement Terraform plan/apply workflows in CI with remote state management
- Handle secrets securely in CI/CD pipelines using GitHub Secrets
Best Practices
- Set up branch protection rules and code review workflows
- Design environment-based deployment strategies (dev / staging / prod)
- Implement rollback and failure handling strategies
- Configure pipeline monitoring, notifications, and audit trails
Prerequisites
- A GitHub account with access to GitHub Actions (free tier is sufficient)
- Python 3.9+ installed locally
- Basic familiarity with Git, SQL, and YAML
- Docker installed (for running Airflow locally)
- (Optional) A GCP or AWS account for Terraform deployment targets
Estimated Duration
Duration: 6-10 hours
Difficulty: Intermediate
Project Structure
cicd-data-pipelines/
├── .github/
│ └── workflows/
│ ├── dbt-ci.yml
│ ├── airflow-tests.yml
│ ├── terraform-deploy.yml
│ └── pre-commit-check.yml
├── .pre-commit-config.yaml
├── dbt_project/
│ ├── dbt_project.yml
│ ├── profiles.yml
│ ├── models/
│ │ ├── staging/
│ │ │ ├── stg_orders.sql
│ │ │ └── stg_customers.sql
│ │ └── marts/
│ │ └── fct_daily_revenue.sql
│ └── tests/
│ └── assert_positive_revenue.sql
├── airflow/
│ ├── dags/
│ │ └── etl_pipeline_dag.py
│ └── tests/
│ ├── test_dag_integrity.py
│ ├── test_dag_structure.py
│ └── test_task_callables.py
├── terraform/
│ ├── main.tf
│ ├── variables.tf
│ ├── outputs.tf
│ ├── backend.tf
│ └── environments/
│ ├── dev.tfvars
│ ├── staging.tfvars
│ └── prod.tfvars
├── scripts/
│ └── post_pr_comment.py
├── pyproject.toml
├── requirements.txt
├── README.md
└── .gitignore
Step-by-Step Guide
1. Pre-commit Hooks Setup
Pre-commit hooks run checks on your code before it is committed to version control. This is your first line of defense against bad code reaching your repository.
Install pre-commit
pip install pre-commit
Create .pre-commit-config.yaml
repos:
# General file hygiene
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace
name: Trim trailing whitespace
- id: end-of-file-fixer
name: Fix end of file
- id: check-yaml
name: Validate YAML syntax
- id: check-added-large-files
name: Check for large files
args: ['--maxkb=1000']
# Python formatting with Black
- repo: https://github.com/psf/black
rev: 24.3.0
hooks:
- id: black
name: Format Python with Black
language_version: python3
types: [python]
# YAML linting
- repo: https://github.com/adrienverge/yamllint
rev: v1.35.1
hooks:
- id: yamllint
name: Lint YAML files
args: ['-d', '{extends: relaxed, rules: {line-length: {max: 120}}}']
# SQL linting with SQLFluff
- repo: https://github.com/sqlfluff/sqlfluff
rev: 3.0.7
hooks:
- id: sqlfluff-lint
name: Lint SQL with SQLFluff
args: ['--dialect', 'bigquery']
additional_dependencies: ['sqlfluff-templater-dbt']
- id: sqlfluff-fix
name: Fix SQL with SQLFluff
args: ['--dialect', 'bigquery', '--force']
additional_dependencies: ['sqlfluff-templater-dbt']
Hook-by-hook explanation
| Hook | Purpose |
|---|---|
| trailing-whitespace | Removes unnecessary whitespace at the end of lines |
| end-of-file-fixer | Ensures files end with a newline character |
| check-yaml | Validates YAML syntax to catch typos early |
| check-added-large-files | Prevents accidental commits of large data files |
| black | Formats Python code consistently across the team |
| yamllint | Lints YAML files (Airflow DAGs, dbt configs, CI workflows) |
| sqlfluff-lint | Checks SQL style and formatting against configurable rules |
| sqlfluff-fix | Auto-fixes SQL issues (aliasing, indentation, keywords) |
Configure SQLFluff rules
Create a .sqlfluff configuration file:
[sqlfluff]
dialect = bigquery
templater = dbt
max_line_length = 120
exclude_rules = LT05
[sqlfluff:indentation]
indent_unit = space
tab_space_size = 4
[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper
[sqlfluff:rules:capitalisation.identifiers]
capitalisation_policy = lower
[sqlfluff:rules:aliasing.table]
aliasing = explicit
[sqlfluff:rules:aliasing.column]
aliasing = explicit
Activate the hooks
pre-commit install
pre-commit run --all-files # Run on all existing files
From now on, every git commit will automatically run these checks. If any hook fails, the commit is blocked until you fix the issue.
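The project structure also includes .github/workflows/pre-commit-check.yml, which runs the same hooks in CI so that the checks cannot be bypassed with git commit --no-verify. A minimal sketch of that workflow (step layout and tool versions are assumptions, not prescribed by this guide):
name: Pre-commit Checks
on:
  pull_request:
    branches: [main]
jobs:
  pre-commit:
    name: Run pre-commit hooks
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install and run pre-commit
        run: |
          pip install pre-commit
          pre-commit run --all-files --show-diff-on-failure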
2. GitHub Actions: SQL & dbt Pipeline
This workflow runs on every pull request targeting main. It sets up a Python environment, lints your SQL code, and runs dbt build against a CI database to validate your models and tests.
Create .github/workflows/dbt-ci.yml
name: dbt CI Pipeline
on:
pull_request:
branches: [main]
paths:
- 'dbt_project/**'
- '.github/workflows/dbt-ci.yml'
env:
DBT_PROFILES_DIR: ./dbt_project
PYTHON_VERSION: '3.11'
# Expose the service account key so the dbt "ci" profile can read it via env_var()
GCP_SA_KEY: ${{ secrets.GCP_SA_KEY }}
jobs:
sqlfluff-lint:
name: SQL Lint Check
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
- name: Install dependencies
run: |
pip install sqlfluff sqlfluff-templater-dbt dbt-bigquery
- name: Run SQLFluff lint
run: |
sqlfluff lint dbt_project/models/ \
--dialect bigquery \
--format github-annotation \
--annotation-level warning
dbt-build:
name: dbt Build & Test
runs-on: ubuntu-latest
needs: sqlfluff-lint
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: ${{ env.PYTHON_VERSION }}
cache: 'pip'
- name: Install dependencies
run: |
pip install dbt-bigquery
- name: Authenticate to GCP
uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Run dbt deps
working-directory: dbt_project
run: dbt deps
- name: Run dbt build (compile + run + test)
working-directory: dbt_project
run: |
dbt build \
--target ci \
--select state:modified+ \
--defer \
--state ./prod-manifest/ \
--full-refresh
env:
DBT_CI_SCHEMA: ci_pr_${{ github.event.pull_request.number }}
- name: Post test results as PR comment
if: always()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const runResults = JSON.parse(
fs.readFileSync('dbt_project/target/run_results.json', 'utf8')
);
const total = runResults.results.length;
const passed = runResults.results.filter(r => r.status === 'pass' || r.status === 'success').length;
const failed = runResults.results.filter(r => r.status === 'fail').length;
const errored = runResults.results.filter(r => r.status === 'error').length;
const icon = failed > 0 || errored > 0 ? ':x:' : ':white_check_mark:';
const body = `## ${icon} dbt Build Results
| Metric | Count |
|--------|-------|
| Total | ${total} |
| Passed | ${passed} |
| Failed | ${failed} |
| Errors | ${errored} |
**Target schema**: \`ci_pr_${{ github.event.pull_request.number }}\`
**Commit**: \`${{ github.sha }}\`
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
- name: Clean up CI schema
if: always()
working-directory: dbt_project
run: |
dbt run-operation drop_ci_schema --target ci \
--args '{"schema": "ci_pr_${{ github.event.pull_request.number }}"}'
Key concepts explained
- paths filter: The workflow only triggers when dbt project files change, saving CI minutes.
- state:modified+: Only builds models that changed in the PR, plus their downstream dependencies. This dramatically reduces CI run time for large projects.
- --defer: Uses the production manifest as a reference, so unchanged upstream models are read from production instead of being rebuilt.
- CI schema isolation: Each PR gets its own schema (for example, ci_pr_42), preventing test interference between concurrent PRs.
- PR comment: Posts a summary of test results directly on the pull request for easy review.
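The cleanup step calls dbt run-operation drop_ci_schema, so the project needs a macro by that name. A minimal sketch, assuming BigQuery and a file at dbt_project/macros/drop_ci_schema.sql (the exact body is up to you):
{% macro drop_ci_schema(schema) %}
    {# Drop the per-PR schema created during the CI run #}
    {% set sql %}
        drop schema if exists `{{ target.project }}`.`{{ schema }}` cascade
    {% endset %}
    {% do run_query(sql) %}
    {% do log("Dropped CI schema: " ~ schema, info=True) %}
{% endmacro %}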
Create the CI profile
Add a ci target to your profiles.yml:
my_project:
target: dev
outputs:
dev:
type: bigquery
method: oauth
project: my-gcp-project
dataset: dev_analytics
location: US
ci:
type: bigquery
method: service-account-json
project: my-gcp-project
dataset: "{{ env_var('DBT_CI_SCHEMA', 'ci_default') }}"
location: US
keyfile_json: "{{ env_var('GCP_SA_KEY') }}"
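The --defer and --state flags in the build step assume a copy of the production manifest.json is present at ./prod-manifest/ inside the dbt project. How you publish that artifact is up to you; one common pattern (bucket path assumed for illustration) is to upload target/manifest.json after each production run and add a download step to the CI job before dbt build:
      - name: Download production manifest for state comparison
        run: |
          # Assumes the earlier GCP auth step has made credentials available to the gcloud CLI
          mkdir -p dbt_project/prod-manifest
          gcloud storage cp gs://my-project-dbt-artifacts/prod/manifest.json \
            dbt_project/prod-manifest/manifest.json
If no production manifest exists yet (for example, on the very first run), drop the --select state:modified+, --defer, and --state flags and build everything.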
3. Airflow DAG Testing
Testing Airflow DAGs is critical. A DAG file with an import error silently disappears from the scheduler, and an error in a shared helper module can break every DAG that imports it. This section covers three levels of testing: import validation, structural tests, and unit tests.
Create .github/workflows/airflow-tests.yml
name: Airflow DAG Tests
on:
pull_request:
branches: [main]
paths:
- 'airflow/**'
- '.github/workflows/airflow-tests.yml'
jobs:
dag-tests:
name: Airflow DAG Validation
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
cache: 'pip'
- name: Install dependencies
run: |
pip install apache-airflow==2.9.0 pytest pytest-cov
pip install -r requirements.txt
- name: Set Airflow home
run: |
echo "AIRFLOW_HOME=${{ github.workspace }}/airflow" >> $GITHUB_ENV
echo "AIRFLOW__CORE__DAGS_FOLDER=${{ github.workspace }}/airflow/dags" >> $GITHUB_ENV
echo "AIRFLOW__CORE__LOAD_EXAMPLES=False" >> $GITHUB_ENV
- name: Initialize Airflow DB
run: airflow db migrate
- name: Run DAG tests
run: |
pytest airflow/tests/ \
-v \
--tb=short \
--junitxml=test-results/airflow-results.xml
- name: Upload test results
if: always()
uses: actions/upload-artifact@v4
with:
name: airflow-test-results
path: test-results/
Test 1: DAG Import Validation (test_dag_integrity.py)
This is the most important test. If a DAG file has an import error, Airflow's scheduler cannot parse it: the DAG silently disappears from the UI, and an error in a shared module can break every DAG that imports it.
"""Test that all DAG files can be imported without errors."""
import os
import pytest
from airflow.models import DagBag
DAGS_FOLDER = os.path.join(os.path.dirname(__file__), "..", "dags")
@pytest.fixture(scope="session")
def dagbag():
"""Load all DAGs from the dags folder."""
return DagBag(dag_folder=DAGS_FOLDER, include_examples=False)
def test_no_import_errors(dagbag):
"""Verify that no DAG files have import errors."""
assert len(dagbag.import_errors) == 0, (
f"DAG import errors found: {dagbag.import_errors}"
)
def test_dagbag_not_empty(dagbag):
"""Verify that at least one DAG was loaded."""
assert len(dagbag.dags) > 0, "No DAGs found in the dags folder"
def test_dag_has_tags(dag_id, dag):
"""Verify every DAG has at least one tag (parametrized dynamically by pytest_generate_tests below)."""
assert len(dag.tags) > 0, f"DAG '{dag_id}' has no tags"
def pytest_generate_tests(metafunc):
"""Dynamically parametrize DAG-level tests."""
if "dag_id" in metafunc.fixturenames:
dagbag = DagBag(dag_folder=DAGS_FOLDER, include_examples=False)
metafunc.parametrize(
"dag_id,dag",
[(dag_id, dag) for dag_id, dag in dagbag.dags.items()],
)
Test 2: DAG Structure Tests (test_dag_structure.py)
Validate scheduling, timeouts, retries, and other structural properties of your DAGs.
"""Test DAG structural properties and best practices."""
import os
import pytest
from datetime import timedelta
from airflow.models import DagBag
DAGS_FOLDER = os.path.join(os.path.dirname(__file__), "..", "dags")
MAX_DAG_TIMEOUT_MINUTES = 360 # 6 hours
@pytest.fixture(scope="session")
def dagbag():
return DagBag(dag_folder=DAGS_FOLDER, include_examples=False)
@pytest.fixture(scope="session")
def all_dags(dagbag):
return list(dagbag.dags.values())
class TestDagDefaults:
"""Test default arguments and configuration for all DAGs."""
def test_default_retries(self, all_dags):
"""Every DAG should have at least 1 retry configured."""
for dag in all_dags:
retries = dag.default_args.get("retries", 0)
assert retries >= 1, (
f"DAG '{dag.dag_id}' has {retries} retries. "
f"Production DAGs should have at least 1 retry."
)
def test_default_retry_delay(self, all_dags):
"""Retry delay should be at least 1 minute."""
for dag in all_dags:
retry_delay = dag.default_args.get(
"retry_delay", timedelta(seconds=0)
)
assert retry_delay >= timedelta(minutes=1), (
f"DAG '{dag.dag_id}' retry_delay is too short: {retry_delay}"
)
def test_owner_is_set(self, all_dags):
"""Every DAG should have an owner set (not 'airflow' default)."""
for dag in all_dags:
owner = dag.default_args.get("owner", "airflow")
assert owner != "airflow", (
f"DAG '{dag.dag_id}' uses default owner 'airflow'. "
f"Set a team or person as owner."
)
def test_dag_has_description(self, all_dags):
"""Every DAG should have a description for documentation."""
for dag in all_dags:
assert dag.description is not None and len(dag.description) > 0, (
f"DAG '{dag.dag_id}' is missing a description."
)
class TestDagSchedule:
"""Test scheduling configuration."""
def test_schedule_is_set(self, all_dags):
"""Every DAG should have an explicit schedule."""
for dag in all_dags:
assert dag.schedule_interval is not None, (
f"DAG '{dag.dag_id}' has no schedule_interval set."
)
def test_catchup_is_disabled(self, all_dags):
"""Catchup should be disabled unless explicitly needed."""
for dag in all_dags:
assert dag.catchup is False, (
f"DAG '{dag.dag_id}' has catchup=True. "
f"Disable unless backfilling is intentional."
)
Test 3: Unit Tests for Task Callables (test_task_callables.py)
Test the actual Python functions that your tasks execute, independently from Airflow.
"""Unit tests for individual task callables."""
import json
import pytest
from unittest.mock import patch, MagicMock
# Import the callable functions from your DAG module
# Adjust the import path based on your project structure
from dags.etl_pipeline_dag import (
extract_data,
transform_data,
validate_data,
)
class TestExtractData:
"""Test the extract_data task callable."""
@patch("dags.etl_pipeline_dag.requests.get")
def test_extract_returns_data(self, mock_get):
"""Extract should return parsed JSON data."""
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.json.return_value = {
"results": [{"id": 1, "value": 100}]
}
mock_get.return_value = mock_response
result = extract_data(endpoint="/api/data", date="2025-01-01")
assert "results" in result
assert len(result["results"]) == 1
@patch("dags.etl_pipeline_dag.requests.get")
def test_extract_handles_api_error(self, mock_get):
"""Extract should raise on non-200 responses."""
mock_response = MagicMock()
mock_response.status_code = 500
mock_response.raise_for_status.side_effect = Exception("Server Error")
mock_get.return_value = mock_response
with pytest.raises(Exception, match="Server Error"):
extract_data(endpoint="/api/data", date="2025-01-01")
class TestTransformData:
"""Test the transform_data task callable."""
def test_transform_adds_computed_fields(self):
"""Transform should add derived columns."""
raw_data = [
{"id": 1, "amount_cents": 1500, "currency": "USD"},
{"id": 2, "amount_cents": 2300, "currency": "USD"},
]
result = transform_data(raw_data)
assert all("amount_dollars" in row for row in result)
assert result[0]["amount_dollars"] == 15.00
assert result[1]["amount_dollars"] == 23.00
def test_transform_handles_empty_input(self):
"""Transform should return empty list for empty input."""
result = transform_data([])
assert result == []
class TestValidateData:
"""Test the validate_data task callable."""
def test_validate_passes_for_valid_data(self):
"""Validation should pass when data meets quality rules."""
data = [
{"id": 1, "amount_dollars": 15.00, "date": "2025-01-01"},
{"id": 2, "amount_dollars": 23.00, "date": "2025-01-01"},
]
# Should not raise
validate_data(data, min_rows=1, max_null_pct=0.0)
def test_validate_fails_for_too_few_rows(self):
"""Validation should fail when row count is below minimum."""
with pytest.raises(ValueError, match="Expected at least"):
validate_data([], min_rows=1, max_null_pct=0.0)
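These tests import extract_data, transform_data, and validate_data from dags/etl_pipeline_dag.py, which is not shown in this guide. A simplified sketch of what that module might look like, consistent with the tests above (the API endpoint and business logic are placeholders, not the project's actual code):
"""Example ETL DAG with testable task callables. Placeholder logic for illustration only."""
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

API_BASE_URL = "https://api.example.com"  # placeholder, not a real endpoint


def extract_data(endpoint: str, date: str) -> dict:
    """Fetch raw records from the source API for a given date."""
    response = requests.get(f"{API_BASE_URL}{endpoint}", params={"date": date}, timeout=30)
    response.raise_for_status()  # surfaces non-2xx responses as exceptions
    return response.json()


def transform_data(raw_data: list) -> list:
    """Add derived columns, e.g. convert cents to dollars."""
    return [{**row, "amount_dollars": row["amount_cents"] / 100} for row in raw_data]


def validate_data(data: list, min_rows: int, max_null_pct: float) -> None:
    """Basic data quality gate before loading downstream."""
    if len(data) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(data)}")
    # A null-percentage check against max_null_pct would go here.


default_args = {
    "owner": "data-platform-team",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="etl_pipeline",
    description="Daily ETL: extract from source API, transform, validate.",
    default_args=default_args,
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    dagrun_timeout=timedelta(hours=2),
    tags=["etl", "example"],
) as dag:
    extract = PythonOperator(
        task_id="extract_data",
        python_callable=extract_data,
        op_kwargs={"endpoint": "/api/data", "date": "{{ ds }}"},
    )
    # transform, validate, and load tasks would follow the same pattern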
4. Terraform Infrastructure Deployment
This workflow runs terraform plan on pull requests (and posts the plan output as a PR comment for review) and runs terraform apply when changes are merged to main.
Create .github/workflows/terraform-deploy.yml
name: Terraform Infrastructure
on:
pull_request:
branches: [main]
paths:
- 'terraform/**'
- '.github/workflows/terraform-deploy.yml'
push:
branches: [main]
paths:
- 'terraform/**'
env:
TF_VERSION: '1.7.0'
TF_WORKING_DIR: './terraform'
permissions:
contents: read
pull-requests: write
jobs:
terraform-plan:
name: Terraform Plan
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Authenticate to GCP
uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Terraform Init
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform init -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}"
- name: Terraform Format Check
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform fmt -check -recursive
- name: Terraform Validate
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform validate
- name: Terraform Plan
id: plan
working-directory: ${{ env.TF_WORKING_DIR }}
run: |
set -o pipefail  # ensure a failed plan fails the step even though output is piped to tee
terraform plan \
-var-file=environments/staging.tfvars \
-no-color \
-out=tfplan \
2>&1 | tee plan_output.txt
- name: Post plan to PR
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const plan = fs.readFileSync(
'${{ env.TF_WORKING_DIR }}/plan_output.txt', 'utf8'
);
// Truncate if too long for GitHub comment
const maxLength = 60000;
const truncated = plan.length > maxLength
? plan.substring(0, maxLength) + '\n\n... (truncated)'
: plan;
const body = `## Terraform Plan Output
<details>
<summary>Click to expand plan</summary>
\`\`\`hcl
${truncated}
\`\`\`
</details>
**Terraform Version**: \`${{ env.TF_VERSION }}\`
**Environment**: staging
**Commit**: \`${{ github.sha }}\`
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body
});
terraform-apply:
name: Terraform Apply
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
environment: production
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Authenticate to GCP
uses: google-github-actions/auth@v2
with:
credentials_json: ${{ secrets.GCP_SA_KEY }}
- name: Terraform Init
working-directory: ${{ env.TF_WORKING_DIR }}
run: terraform init -backend-config="bucket=${{ secrets.TF_STATE_BUCKET }}"
- name: Terraform Apply
working-directory: ${{ env.TF_WORKING_DIR }}
run: |
terraform apply \
-var-file=environments/prod.tfvars \
-auto-approve
Remote Backend Configuration (backend.tf)
Store Terraform state in a cloud bucket so your team can collaborate safely:
terraform {
backend "gcs" {
bucket = "my-project-tf-state"
prefix = "data-platform"
}
}
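The state bucket has to exist before the first terraform init, and versioning makes it possible to recover earlier state files. A one-time setup sketch with the gcloud CLI (bucket name and location are placeholders):
# One-time setup: create a GCS bucket for Terraform state
gcloud storage buckets create gs://my-project-tf-state \
  --project=my-gcp-project \
  --location=us-central1 \
  --uniform-bucket-level-access

# Enable object versioning so previous state revisions can be restored
gcloud storage buckets update gs://my-project-tf-state --versioning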
Example Terraform Configuration (main.tf)
provider "google" {
project = var.project_id
region = var.region
}
# BigQuery dataset for the data warehouse
resource "google_bigquery_dataset" "analytics" {
dataset_id = "${var.environment}_analytics"
friendly_name = "Analytics Dataset (${var.environment})"
description = "Main analytics dataset for the data platform"
location = var.region
default_table_expiration_ms = var.environment == "dev" ? 86400000 : null
labels = {
environment = var.environment
managed_by = "terraform"
}
}
# Cloud Storage bucket for raw data landing
resource "google_storage_bucket" "data_lake" {
name = "${var.project_id}-${var.environment}-data-lake"
location = var.region
force_destroy = var.environment == "dev" ? true : false
lifecycle_rule {
condition {
age = var.environment == "dev" ? 7 : 90
}
action {
type = "Delete"
}
}
versioning {
enabled = var.environment == "prod" ? true : false
}
labels = {
environment = var.environment
managed_by = "terraform"
}
}
# Cloud Composer (managed Airflow) environment
resource "google_composer_environment" "airflow" {
count = var.deploy_composer ? 1 : 0
name = "${var.environment}-airflow"
region = var.region
config {
software_config {
image_version = "composer-2.7.0-airflow-2.7.3"
pypi_packages = {
dbt-bigquery = ">=1.7.0"
}
}
environment_size = var.environment == "prod" ? "ENVIRONMENT_SIZE_MEDIUM" : "ENVIRONMENT_SIZE_SMALL"
}
}
Environment Variables (variables.tf)
variable "project_id" {
description = "GCP project ID"
type = string
}
variable "region" {
description = "GCP region for resources"
type = string
default = "us-central1"
}
variable "environment" {
description = "Deployment environment (dev, staging, prod)"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be one of: dev, staging, prod."
}
}
variable "deploy_composer" {
description = "Whether to deploy Cloud Composer (Airflow)"
type = bool
default = false
}
Environment-Specific Variables (environments/dev.tfvars)
project_id = "my-gcp-project"
region = "us-central1"
environment = "dev"
deploy_composer = false
Secret Management
Store sensitive values in GitHub Secrets (Settings > Secrets and variables > Actions):
| Secret Name | Purpose |
|---|---|
| GCP_SA_KEY | Service account JSON key for GCP authentication |
| TF_STATE_BUCKET | Name of the GCS bucket storing Terraform state |
| DBT_CI_SERVICE_ACCOUNT | Service account for dbt CI runs |
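You can add these through the GitHub UI or from the command line with the GitHub CLI (the key file name below is an assumption):
# Set repository secrets with the GitHub CLI
gh secret set GCP_SA_KEY < service-account-key.json
gh secret set TF_STATE_BUCKET --body "my-project-tf-state"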
Never commit secrets to your repository. Use .gitignore to exclude:
# Secrets and credentials
*.json
!package.json
.env
.env.*
# Terraform
*.tfstate
*.tfstate.backup
.terraform/
5. Environment Strategy
Design a deployment strategy that safely promotes changes from development through staging to production.
Environment Overview
| Environment | Purpose | Trigger | Approval |
|---|---|---|---|
| dev | Rapid iteration and testing | Push to develop branch | None (automatic) |
| staging | Pre-production validation | PR to main | PR review required |
| prod | Live production systems | Merge to main | Manual approval gate |
Branch Protection Rules
Configure these rules on the main branch in GitHub (Settings > Branches > Branch protection rules):
- Require pull request reviews (at least 1 reviewer)
- Require status checks to pass (dbt CI, Airflow tests, Terraform plan)
- Require branches to be up to date before merging
- Do not allow force pushes
- Do not allow deletions
GitHub Environment Configuration
Create environments in GitHub (Settings > Environments) with protection rules:
# In terraform-deploy.yml, the 'production' environment
# requires manual approval before terraform apply runs
terraform-apply:
environment: production # <-- This triggers the approval gate
For the production environment, add:
- Required reviewers (1-2 senior engineers)
- Wait timer (optional, e.g., 5 minutes to allow cancellation)
- Deployment branch restrictions (only main)
dbt Environment Profiles
Each environment should have its own dbt target with isolated schemas:
# profiles.yml
my_project:
target: dev
outputs:
dev:
type: bigquery
project: my-gcp-project
dataset: dev_analytics
location: US
method: oauth
staging:
type: bigquery
project: my-gcp-project
dataset: staging_analytics
location: US
method: service-account-json
keyfile_json: "{{ env_var('GCP_SA_KEY') }}"
prod:
type: bigquery
project: my-gcp-project
dataset: prod_analytics
location: US
method: service-account-json
keyfile_json: "{{ env_var('GCP_SA_KEY') }}"
Promotion Workflow
The full promotion flow looks like this:
Developer pushes to feature branch
|
v
Open PR to main
|
+--> SQLFluff lint runs
+--> dbt build runs against CI schema
+--> Airflow DAG tests run
+--> Terraform plan runs (output posted to PR)
|
v
Code review + all checks pass
|
v
Merge to main
|
+--> Terraform apply to staging (automatic)
+--> dbt run against staging (automatic)
+--> Validation tests in staging
|
v
Manual approval for production
|
+--> Terraform apply to production
+--> dbt run against production
+--> Post-deployment validation
Rollback Strategy
When a deployment fails in production, you need a clear rollback plan:
- dbt rollback: Re-run the previous version's models using Git tags:
  git checkout v1.2.3      # Previous release tag
  dbt run --target prod
- Terraform rollback: Revert the merge commit and re-apply:
  git revert <merge-commit-sha>   # This triggers a new PR -> plan -> apply cycle
- Airflow rollback: If a DAG causes issues, pause it immediately via the Airflow UI, then deploy the fixed version through the normal CI/CD pipeline.
Stretch Goals
Once you have the core CI/CD pipeline working, consider these advanced extensions:
Slack Notifications
Add a notification step to your workflows:
- name: Notify Slack on failure
if: failure()
uses: slackapi/slack-github-action@v1.26.0
with:
payload: |
{
"text": ":red_circle: *${{ github.workflow }}* failed on `${{ github.ref_name }}`",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Workflow*: ${{ github.workflow }}\n*Branch*: `${{ github.ref_name }}`\n*Commit*: `${{ github.sha }}`\n*Author*: ${{ github.actor }}\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Run>"
}
}
]
}
env:
SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Data Diff in PR Workflow
Use a tool like datafold/data-diff to compare query results before and after model changes:
- name: Run data diff
run: |
pip install data-diff
data-diff \
bigquery://project/prod_analytics.fct_daily_revenue \
bigquery://project/ci_pr_${{ github.event.pull_request.number }}.fct_daily_revenue \
--stats \
--json > diff_results.json
Schema Migration Validation
Add a step to verify database schema changes are backward-compatible:
- name: Validate schema changes
run: |
dbt run --target ci --select state:modified
dbt test --target ci --select state:modified
# Compare column types and counts against production
python scripts/validate_schema_compat.py \
--prod-manifest prod-manifest/manifest.json \
--ci-manifest target/manifest.json
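The scripts/validate_schema_compat.py helper referenced above is left for you to write. A rough sketch of one possible approach (the rules here are assumptions, and it only sees columns documented in your schema.yml files): flag documented columns that disappear or change type between the production and CI manifests.
"""Rough sketch: flag models whose documented columns were removed or retyped."""
import argparse
import json
import sys


def load_model_columns(manifest_path: str) -> dict:
    """Map model unique_id -> {column_name: data_type} from a dbt manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return {
        unique_id: {
            name: (col.get("data_type") or "").lower()
            for name, col in node.get("columns", {}).items()
        }
        for unique_id, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") == "model"
    }


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--prod-manifest", required=True)
    parser.add_argument("--ci-manifest", required=True)
    args = parser.parse_args()

    prod = load_model_columns(args.prod_manifest)
    ci = load_model_columns(args.ci_manifest)

    problems = []
    for unique_id, prod_cols in prod.items():
        if unique_id not in ci:
            continue  # model removed or renamed; treat as a separate review item
        ci_cols = ci[unique_id]
        for col, prod_type in prod_cols.items():
            if col not in ci_cols:
                problems.append(f"{unique_id}: column '{col}' was removed")
            elif prod_type and ci_cols[col] and ci_cols[col] != prod_type:
                problems.append(
                    f"{unique_id}: column '{col}' changed type {prod_type} -> {ci_cols[col]}"
                )

    if problems:
        print("Backward-incompatible schema changes detected:")
        print("\n".join(problems))
        return 1
    print("No backward-incompatible schema changes found.")
    return 0


if __name__ == "__main__":
    sys.exit(main())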
Blue-Green Deployment for Airflow
Implement zero-downtime DAG deployments by maintaining two Airflow environments and switching traffic between them using a load balancer or DNS swap.
Deliverables
At the end of this project, you should have:
- A .pre-commit-config.yaml with SQLFluff, Black, yamllint, and standard hooks
- A .sqlfluff configuration file with project-specific rules
- A dbt-ci.yml GitHub Actions workflow that lints SQL and runs dbt build on PRs
- An airflow-tests.yml workflow with DAG import validation, structural tests, and unit tests
- A terraform-deploy.yml workflow with plan-on-PR and apply-on-merge patterns
- Terraform configuration with environment-specific variable files (dev/staging/prod)
- Remote state backend configuration for Terraform
- Branch protection rules documented and configured
- A README.md explaining the CI/CD architecture, setup instructions, and how to onboard new team members
Tips and Resources
Documentation
- GitHub Actions Documentation - Official reference for workflow syntax and features
- SQLFluff Documentation - Rules reference and configuration guide
- dbt CI/CD Guide - Official dbt recommendations for CI
- Terraform GitHub Actions - HashiCorp's guide to Terraform in CI
- Airflow Testing Guide - Apache Airflow best practices for testing
- Pre-commit Documentation - Hook framework setup and configuration
Tips
- Start small: Begin with pre-commit hooks and SQLFluff linting before adding dbt and Terraform workflows. Each layer builds on the previous one.
- Use act for local testing: The nektos/act tool lets you run GitHub Actions workflows locally, saving time during development.
- Cache dependencies: Use actions/cache or the built-in cache option in actions/setup-python to speed up your CI runs significantly.
- Monitor CI costs: GitHub Actions offers 2,000 free minutes/month for private repos. Use paths filters and concurrency groups to avoid unnecessary runs (see the snippet after this list).
- Version pin everything: Pin your Action versions (@v4, not @main), Terraform version, Python version, and pip packages to avoid unexpected breakages.
- Test your CI locally first: Run pre-commit run --all-files, dbt build, and pytest locally before pushing. CI should catch what you missed, not be your primary testing environment.
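For example, a concurrency block like the following at the top level of a workflow file cancels superseded runs for the same branch or pull request:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true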
Common Pitfalls
- Forgetting to set AIRFLOW__CORE__LOAD_EXAMPLES=False in CI, which loads example DAGs and causes unexpected test failures
- Not isolating CI database schemas, leading to conflicts when multiple PRs run tests simultaneously
- Hardcoding secrets in workflow files instead of using GitHub Secrets
- Missing paths filters on workflows, causing every push to trigger all CI jobs
- Not cleaning up CI resources (temporary schemas, plan files) after workflow completion