🧱 Modern Data Stack Roadmap

    Master the core tools used in modern data teams — from containerization to dbt, BigQuery, and Kafka. Build real projects and get job-ready.

    ✓ Expert-Designed Learning Path • Industry-Validated Curriculum • Real-World Application Focus

    Created by data engineering professionals, this roadmap has 34 hands-on tasks covering production-ready skills used by companies like Netflix, Airbnb, and Spotify. Master Docker, Terraform, Airflow, and 7 more technologies.

    How long does it take? Most learners with SQL and Python basics complete this roadmap in 4-7 months part-time (10-15 hours/week), or about 2-3 months full-time. The 9 sections contain 34 hands-on tasks, ending with a full pipeline project.

    The 9 steps: (0) Pre-requisites and fundamentals · (1) Containerization & Infrastructure · (2) Workflow Orchestration with Airflow · (3) Data Ingestion & Loading (Airflow and dlt) · (4) Data Warehousing in BigQuery · (5) Analytics Engineering with dbt · (6) Batch Processing with Spark · (7) Streaming with Kafka · (8) Final Project: Build a Real Data Pipeline.

    Intermediate
    9 sections • 34 tasks

    Skills You'll Learn

    • Cloud infrastructure
    • SQL & analytics engineering
    • ETL & orchestration
    • Batch & stream processing
    • Data modeling

    Tools You'll Use

    • Docker
    • Terraform
    • Airflow
    • dlt
    • BigQuery
    • dbt
    • Metabase
    • Spark
    • Kafka
    • GitHub

    Projects to Build

    Step 0: Pre-requisites and fundamentals

    -Learn the fundamentals
    -Know basic SQL and Python

    Step 1: Containerization & Infrastructure

    -Install Docker & Docker Compose
    -Run PostgreSQL using Docker locally
    -Install Terraform CLI
    -Provision GCP infra (BQ dataset + GCS bucket)

    Step 2: Workflow Orchestration with Airflow

    -Set up Airflow locally with Docker (you can also use Astronomer's free tier: https://www.astronomer.io/)
    -Build a basic flow (CSV file to BigQuery)
    -Schedule a flow to run daily
    -Add logging and notification features

    Step 3: Data Ingestion & Loading (Airflow and dlt)

    -Create API ingestion task (e.g., GitHub or OpenWeather) with dlt
    -Normalize JSON into flat tables
    -Run on a schedule and incrementally with Airflow

    Step 4: Data Warehousing in BigQuery

    -Load sample data into BigQuery
    -Apply partitioning and clustering
    -Run SQL queries and optimize costs

    Step 5: Analytics Engineering with dbt

    -Install and initialize dbt with BigQuery
    -Build staging models
    -Add documentation and tests
    -Deploy with GitHub Actions or dbt Cloud
    -Visualize output in Metabase

    Step 6: Batch Processing with Spark

    -Install Spark locally or via Colab
    -Load and transform a CSV with PySpark
    -Run groupBy and joins on large datasets
    -Explore partitioning and performance tuning

    Step 7: Streaming with Kafka

    -Install Kafka via Docker or use Confluent Cloud
    -Create a simple producer/consumer
    -Process events with Kafka Streams or KSQL
    -Use Schema Registry with Avro or Protobuf

    Final Project: Build a Real Data Pipeline

    -Choose a dataset and domain (e.g., finance, sports, ecommerce)
    -Ingest the data using Airflow and dlt for batch or Kafka for streaming
    -Model and test with dbt
    -Load into BigQuery and visualize KPIs
    -Publish project on GitHub and write a short case study

    Curriculum Reference

    A free preview of the learning material in this roadmap. The full reference for every section is available when you sign in.

    Step 1: Containerization & Infrastructure

    Install Docker & Docker Compose

    This section is heavily inspired by the Zoomcamp Docker and Terraform project. It's a great way to learn about Docker and Terraform.
    Their course is amazing and free!

    Run PostgreSQL using Docker locally

    Use the following command to run a PostgreSQL container:

    docker run --name my-postgres \
      -e POSTGRES_USER=admin \
      -e POSTGRES_PASSWORD=admin123 \
      -e POSTGRES_DB=mydatabase \
      -p 5432:5432 \
      -d postgres
    

    Explanation:

    --name my-postgres: sets the container name.

    -e POSTGRES_USER=admin: sets the DB username.

    -e POSTGRES_PASSWORD=admin123: sets the password.

    -e POSTGRES_DB=mydatabase: creates a default DB.

    -p 5432:5432: publishes the container's port 5432 on port 5432 of your local machine (the format is host:container).

    -d: runs in detached mode.

    postgres: the image to run. Docker pulls the official PostgreSQL image (latest tag) if it is not already available locally.
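
    Because -d returns as soon as the container starts, the database may still be initializing when the command finishes. The sketch below polls with pg_isready, a utility that ships with PostgreSQL, until the server accepts connections. The function name wait_for_postgres and its retry argument are hypothetical helpers for illustration, not part of Docker or PostgreSQL.

```shell
# Poll until PostgreSQL inside the my-postgres container accepts connections.
# wait_for_postgres is a hypothetical helper; its first argument is the
# number of one-second retries (default 30).
wait_for_postgres() {
  wfp_retries="${1:-30}"
  wfp_attempt=0
  while [ "$wfp_attempt" -lt "$wfp_retries" ]; do
    # pg_isready exits 0 once the server accepts connections.
    # -h localhost makes it check over TCP rather than the local socket.
    if docker exec my-postgres pg_isready -h localhost -U admin -d mydatabase >/dev/null 2>&1; then
      echo "PostgreSQL is ready"
      return 0
    fi
    wfp_attempt=$((wfp_attempt + 1))
    sleep 1
  done
  echo "PostgreSQL did not become ready in time" >&2
  return 1
}

# Usage, after the docker run command above:
#   wait_for_postgres && echo "safe to connect"
```

    If the function returns a non-zero status, docker logs my-postgres usually shows why the server did not start.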

    To persist data even after the container is removed:

    docker run --name my-postgres \
      -e POSTGRES_USER=admin \
      -e POSTGRES_PASSWORD=admin123 \
      -e POSTGRES_DB=mydatabase \
      -v pgdata:/var/lib/postgresql/data \
      -p 5432:5432 \
      -d postgres
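
    Since this step also installs Docker Compose, the same setup can be kept in a file instead of a long command. The following is a minimal sketch that writes a docker-compose.yml mirroring the flags above. Pinning the image to postgres:16 is an assumption made here so the data directory path stays predictable; the original command uses the untagged image.

```shell
# Write a docker-compose.yml that mirrors the docker run command above.
# Assumption: the postgres:16 tag (the original command uses the untagged image).
cat > docker-compose.yml <<'EOF'
services:
  postgres:
    image: postgres:16
    container_name: my-postgres
    environment:
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: admin123
      POSTGRES_DB: mydatabase
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
EOF

# Start it with:  docker compose up -d
# Stop it with:   docker compose down   (add -v to also delete the volume)
```

    Compose prefixes volume names with the project name, so this volume is separate from the pgdata volume created by docker run. Remove any existing my-postgres container first, because the container name would otherwise conflict.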
    

    To connect to the database, use a client like psql, DBeaver, or any other Postgres GUI tool. For psql:

    psql -h localhost -U admin -d mydatabase
    

    It will prompt for the password (admin123).
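
    GUI clients and application code often take the same settings as a single connection URI. As a small sketch, the snippet below assembles one from the values used in the docker run command. The variable names (DB_USER, DATABASE_URL, and so on) are illustrative choices, not names that Docker or PostgreSQL require.

```shell
# Same values as the docker run command above; variable names are illustrative.
DB_USER=admin
DB_PASSWORD=admin123
DB_HOST=localhost
DB_PORT=5432
DB_NAME=mydatabase

# PostgreSQL connection URI format: postgresql://user:password@host:port/dbname
DATABASE_URL="postgresql://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"
echo "$DATABASE_URL"

# With the container running, psql accepts the URI in place of a database name:
#   psql -c "SELECT version();" "$DATABASE_URL"
```

    Because the URI contains the password, psql does not prompt for it. That is convenient for a local sandbox, but avoid committing such a URI to a repository.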

    Stop the container:

    docker stop my-postgres
    

    Start the container again:

    docker start my-postgres
    

    Remove the container (a named volume such as pgdata is kept, so its data survives):

    docker rm -f my-postgres
    

    Install Terraform CLI

    Provision GCP infra (BQ dataset + GCS bucket)

    Frequently Asked Questions

    What is the modern data stack?

    The modern data stack is the set of cloud-native, modular tools that data teams use to ingest, store, transform, and serve data. This roadmap covers Docker, Terraform, Airflow, dlt, BigQuery, dbt, Metabase, Spark, and Kafka across nine sections.

    What tools make up the modern data stack?

    This roadmap teaches Docker and Terraform for infrastructure, Airflow and dlt for orchestration and ingestion, BigQuery for warehousing, dbt for analytics engineering, Metabase for visualization, and Spark and Kafka for batch and stream processing.

    Is the modern data stack still relevant in 2026?

    Yes. The core tools in this roadmap, including Airflow, dbt, BigQuery, Spark, and Kafka, remain industry standards for data teams. The roadmap builds five real projects from infrastructure-as-code through batch processing and real-time streaming.

    Do I need to know SQL and Python before starting?

    Yes. This is an intermediate roadmap and Step 0 expects basic SQL and Python before you begin. From there you move into containerization, orchestration, warehousing, analytics engineering, and batch and stream processing.

    What is the difference between batch and stream processing?

    Batch processing runs joins and aggregations on large stored datasets, covered in this roadmap with Apache Spark. Stream processing handles continuous real-time events, covered with Kafka, Kafka Streams, KSQL, and Schema Registry. In the final project you pick one ingestion path: Airflow and dlt for batch, or Kafka for streaming.
