Data Engineer Roadmap 2026: From Zero to Job-Ready (Step-by-Step)

A free, step-by-step data engineering roadmap for 2026. Learn SQL, Python, ETL, cloud fundamentals, dbt, Airflow and Docker through 51 hands-on tasks and build the projects you need to land your first data engineer job.

✓ Expert-Designed Learning Path • Industry-Validated Curriculum • Real-World Application Focus

This roadmap was created by data engineering professionals and includes 51 hands-on tasks covering production-ready skills used at companies like Netflix, Airbnb, and Spotify. Master Python, SQL, PostgreSQL, and 5 more technologies.

How long does it take? Most career-changers complete this roadmap in 6-9 months studying part-time (10-15 hours/week), or about 3-4 months full-time. The 11 sections contain 51 hands-on tasks.

The 11 steps: (0) Prerequisites · (1) SQL Fundamentals · (2) Python for Data · (3) Version Control and CLI · (4) Databases and Data Modeling · (5) Docker and Development Environment · (6) Your First ETL Pipeline · (7) Cloud Fundamentals · (8) Orchestration Basics · (9) Analytics Engineering · (10) Portfolio and Job Search.

Beginner

11 sections • 51 tasks

Skills You'll Learn

SQL
Python
ETL fundamentals
Cloud basics
Data modeling
Version control

Tools You'll Use

Python
SQL
PostgreSQL
Docker
Git
DuckDB
Airflow
dbt

Projects to Build

Local Data Engineering Environment with dlt, DuckDB & Jupyter
Set up a local development environment for data processing and analytics using Jupyter notebooks, dlt, and DuckDB. All tools are open-source and run locally.
Scheduled GitHub ETL with Polars, dlt & DuckDB
Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores the results in DuckDB.
End-to-End Analytics Platform with DuckDB + Metabase
Build a modern, low-cost analytics stack using DuckDB, Metabase, and GitHub Actions for automated data updates and business-ready dashboards.

Learning Resources

Python Official Tutorial (documentation)
SQLBolt Interactive SQL Tutorial (course)
Docker Getting Started (documentation)

Step 0: Prerequisites

-Understand basic computer science concepts: how the internet works, client-server model, and file systems

-Get comfortable with the command line: navigate directories, create files, and run scripts

-Learn how to use a code editor (VS Code recommended) and install useful extensions

-Understand what data engineering is and how it fits in the data ecosystem alongside analytics and data science

Step 1: SQL Fundamentals

-Learn SELECT, WHERE, ORDER BY, and LIMIT to query data from tables

-Master JOIN types: INNER, LEFT, RIGHT, and FULL OUTER joins across multiple tables

-Use GROUP BY and aggregate functions (COUNT, SUM, AVG, MIN, MAX) for data summarization

-Write subqueries and Common Table Expressions (CTEs) for complex queries

-Learn window functions (ROW_NUMBER, RANK, LAG, LEAD, running totals) for advanced analytics
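To make the SQL tasks above concrete, here is a minimal sketch that combines GROUP BY, a CTE, and a window function in one query. It assumes Python's built-in sqlite3 module as a stand-in database (the roadmap itself uses PostgreSQL later), and the `orders` table and its rows are hypothetical sample data.

```python
import sqlite3

# Hypothetical sample data: (order_id, customer, amount)
ORDERS = [
    (1, "alice", 30.0),
    (2, "bob", 20.0),
    (3, "alice", 50.0),
    (4, "bob", 10.0),
    (5, "carol", 70.0),
]

QUERY = """
WITH customer_totals AS (              -- CTE: one summary row per customer
    SELECT customer,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
)
SELECT customer,
       order_count,
       total_spent,
       RANK() OVER (ORDER BY total_spent DESC) AS spend_rank  -- window function
FROM customer_totals
ORDER BY spend_rank
"""


def run_query():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ORDERS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


rows = run_query()
for row in rows:
    print(row)
```

The CTE does the aggregation first, so the window function only has to rank three summary rows. The same query shape should carry over to PostgreSQL with little or no change.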

Step 2: Python for Data

-Install Python, set up a virtual environment, and learn core syntax (variables, loops, functions)

-Work with Python data structures: lists, dictionaries, sets, and comprehensions

-Read and write files: CSV, JSON, and Parquet using pandas or polars

-Make HTTP requests to REST APIs and parse JSON responses

-Handle errors gracefully with try/except and implement basic logging
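The tasks above fit together naturally in a small script. The sketch below parses a JSON payload, skips bad records with try/except while logging a warning, and writes the clean rows as CSV. It assumes only the standard library (json, csv, logging) rather than pandas or polars, and the payload, field names, and logger name are hypothetical.

```python
import csv
import io
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("roadmap_example")  # hypothetical logger name

# Hypothetical raw input; in practice this would come from a file or an API response.
RAW_JSON = (
    '[{"city": "Oslo", "temp_c": "4.5"},'
    ' {"city": "Lima", "temp_c": "n/a"},'
    ' {"city": "Pune", "temp_c": "31"}]'
)


def parse_records(raw):
    """Parse JSON text and keep only rows whose temperature is a valid number."""
    clean = []
    for record in json.loads(raw):
        try:
            clean.append({"city": record["city"], "temp_c": float(record["temp_c"])})
        except (KeyError, ValueError) as exc:
            # Log and move on instead of crashing the whole run.
            logger.warning("Skipping bad record %r: %s", record, exc)
    return clean


def to_csv(records):
    """Serialize the cleaned records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["city", "temp_c"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = parse_records(RAW_JSON)
csv_text = to_csv(records)
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained. Swapping `io.StringIO()` for `open("output.csv", "w", newline="")` would write a real file instead.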

Step 3: Version Control and CLI

-Install Git and learn the basics: init, add, commit, push, pull, and branching

-Create a GitHub account and push your first repository

-Practice essential bash commands: pipes, redirects, grep, awk, and cron

-Set up SSH keys for secure access to remote servers and GitHub

Step 4: Databases and Data Modeling

-Install PostgreSQL locally and practice creating databases, tables, and inserting data

-Understand relational database design: primary keys, foreign keys, and constraints

-Learn normalization (1NF, 2NF, 3NF) and when to denormalize for performance

-Draw Entity-Relationship (ER) diagrams to model a real-world business domain

-Explore [Data Modeling fundamentals](/fundamentals/data-modeling) for deeper schema design skills
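As a rough illustration of primary keys, foreign keys, and constraints, the sketch below defines two related tables and shows the database rejecting invalid rows. It assumes SQLite (via Python's built-in sqlite3) in place of PostgreSQL, and the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# Hypothetical two-table schema: each order must point at an existing customer.
SCHEMA = """
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL CHECK (amount >= 0)
);
"""


def build_db():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on;
    # PostgreSQL enforces them without any extra setting.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_db()
customer_ok = try_insert(conn, "INSERT INTO customers VALUES (?, ?)", (1, "a@example.com"))
order_ok = try_insert(conn, "INSERT INTO orders VALUES (?, ?, ?)", (10, 1, 25.0))
orphan_ok = try_insert(conn, "INSERT INTO orders VALUES (?, ?, ?)", (11, 999, 5.0))
duplicate_ok = try_insert(conn, "INSERT INTO customers VALUES (?, ?)", (2, "a@example.com"))
print(customer_ok, order_ok, orphan_ok, duplicate_ok)
```

The orphan order fails because customer 999 does not exist, and the second customer fails the UNIQUE constraint on `email`. Letting the database enforce these rules means bad data is stopped even when application code forgets to check.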

Step 5: Docker and Development Environment

-Install Docker and understand containers vs virtual machines

-Write your first Dockerfile to containerize a Python script

-Use Docker Compose to spin up PostgreSQL and pgAdmin as a local data stack

-Mount local volumes for code hot-reloading and data persistence

Step 6: Your First ETL Pipeline

-Extract data from a public REST API (e.g., weather, financial, or open government data)

-Transform the raw data using Python: clean, filter, enrich, and reshape

-Load the transformed data into a PostgreSQL database

-Add logging, error handling, and idempotency to make the pipeline production-ready

-Schedule the pipeline to run daily using cron or a simple Python scheduler
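Idempotency is the least obvious item in this step: running the pipeline twice should leave the database in the same state as running it once. The sketch below shows one common way to get there, an upsert keyed on a natural key. It assumes SQLite via sqlite3 instead of PostgreSQL (PostgreSQL accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` clause), and `extract()` returns hard-coded, hypothetical readings rather than calling a real API.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("etl_example")  # hypothetical logger name


def extract():
    """Stand-in for an API call; returns hypothetical raw readings."""
    return [
        {"station": "A1", "day": "2026-01-01", "temp_c": " 3.5 "},
        {"station": "B2", "day": "2026-01-01", "temp_c": None},  # invalid row
        {"station": "C3", "day": "2026-01-01", "temp_c": "7"},
    ]


def transform(raw_rows):
    """Convert temperatures to floats and drop rows that cannot be parsed."""
    clean = []
    for row in raw_rows:
        try:
            clean.append((row["station"], row["day"], float(row["temp_c"])))
        except (KeyError, TypeError, ValueError):
            logger.warning("Dropping invalid row: %r", row)
    return clean


def load(conn, rows):
    """Upsert rows so that re-running the load never creates duplicates."""
    with conn:  # one transaction: all rows are written, or none
        conn.executemany(
            """
            INSERT INTO readings (station, day, temp_c) VALUES (?, ?, ?)
            ON CONFLICT (station, day) DO UPDATE SET temp_c = excluded.temp_c
            """,
            rows,
        )


def run_pipeline(conn):
    """Run extract, transform, load and return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        "station TEXT, day TEXT, temp_c REAL, PRIMARY KEY (station, day))"
    )
    load(conn, transform(extract()))
    return conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]


conn = sqlite3.connect(":memory:")
count_after_first_run = run_pipeline(conn)
count_after_second_run = run_pipeline(conn)
print(count_after_first_run, count_after_second_run)
```

Both runs report the same row count because the second run updates the existing rows instead of inserting new ones. That property is what makes a daily schedule safe to retry after a failure.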

Step 7: Cloud Fundamentals

-Create a free-tier account on AWS, GCP, or Azure and explore the console

-Learn object storage (S3 / GCS): upload files, set permissions, and organize with prefixes

-Understand IAM basics: users, roles, policies, and the principle of least privilege

-Provision a managed database (RDS / Cloud SQL) and connect from your local machine

Step 8: Orchestration Basics

-Understand what orchestration is and why it matters for data pipelines

-Install Apache Airflow locally using Docker Compose

-Write your first DAG with tasks, dependencies, and a schedule

-Use Airflow operators to run Python functions, execute SQL, and transfer data

-Monitor DAG runs, handle failures, and set up retries and alerts
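Before installing Airflow, it can help to see the core idea in plain Python. The sketch below is not Airflow code: it swaps in the standard library's graphlib module to run tasks in dependency order and retries a failing task, which is roughly what an orchestrator does for you. The task names, the `run_dag` helper, and the simulated failure are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies, max_retries=2):
    """Run callables in dependency order, retrying each failed task.

    tasks:        mapping of task name -> callable
    dependencies: mapping of task name -> set of task names it depends on
    """
    order = list(TopologicalSorter(dependencies).static_order())
    log = []
    for name in order:
        for attempt in range(1, max_retries + 2):  # first try plus retries
            try:
                tasks[name]()
                log.append((name, attempt, "success"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # Every attempt failed, so stop before running downstream tasks.
            raise RuntimeError(f"Task {name!r} failed after {max_retries + 1} attempts")
    return log


call_counts = {"transform": 0}


def extract_task():
    pass  # placeholder for real extraction work


def transform_task():
    call_counts["transform"] += 1
    if call_counts["transform"] == 1:
        raise ValueError("simulated transient failure")


def load_task():
    pass  # placeholder for real load work


run_log = run_dag(
    tasks={"extract": extract_task, "transform": transform_task, "load": load_task},
    dependencies={"transform": {"extract"}, "load": {"transform"}},
)
for entry in run_log:
    print(entry)
```

Airflow adds scheduling, a UI, persistence of run history, and alerting on top of this basic loop, but the dependency graph and retry policy are the same concepts you will configure in a DAG.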

Step 9: Analytics Engineering

-Understand what analytics engineering is and where dbt fits in the modern data stack

-Install dbt Core and initialize a project connected to PostgreSQL or DuckDB

-Create staging and mart models using SQL and Jinja templating

-Add data tests (not_null, unique, accepted_values, relationships) to validate transformations

-Generate and serve dbt documentation to share your data lineage with stakeholders

Step 10: Portfolio and Job Search

-Build a portfolio with 2-3 end-to-end data pipeline projects on GitHub

-Write clear README files with architecture diagrams for each project

-Tailor your resume to highlight data engineering skills, tools, and measurable outcomes

-Practice common data engineering interview topics: SQL, system design, and pipeline architecture

-Explore the [Interview Prep](/interview-prep) section for real questions from top companies

Curriculum Reference

A free preview of the learning material in this roadmap.

Step 0: Prerequisites

Understand basic computer science concepts: how the internet works, client-server model, and file systems

Before diving into data engineering, you need a solid grasp of a few core CS concepts.

How the Internet Works

IP Address: A unique numerical label assigned to each device on a network
DNS: Translates human-readable domain names (google.com) to IP addresses
HTTP/HTTPS: Protocols for transferring data between clients and servers
TCP/IP: The foundational communication protocols of the internet

Client-Server Architecture

Client: Makes requests (browser, Python script, mobile app)
Server: Processes requests and sends responses (web server, database server, API server)
Request/Response Cycle: Client sends a request → Server processes it → Server returns a response
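The request/response cycle is easier to picture with both sides running on your own machine. This is a minimal sketch using only Python's standard library: a tiny HTTP server in a background thread and a client that calls it. The `/health` path, the handler name, and the JSON payload are hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET request with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "status": "ok"}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch_once(path="/health"):
    """Start a local server, send one request as a client, return the response."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0 = any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore any system proxy settings so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{port}{path}", timeout=5) as response:
            return response.status, json.loads(response.read())
    finally:
        server.shutdown()
        server.server_close()


status, payload = fetch_once()
print(status, payload)
```

The client sends a request, the handler processes it, and the server returns a status code plus a body. Calling a public REST API later in the roadmap follows the same pattern, only with a remote server.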

File Systems

Directories/Folders: Hierarchical organization of files
Paths: Absolute (/home/user/data) vs Relative (./data)
File Extensions: .csv, .json, .parquet, .sql — you'll use all of these
Permissions: Read, Write, Execute (important for scripts and data files)
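These ideas map directly onto a few standard library calls. The sketch below inspects an absolute and a relative path, then adds the execute permission to a script. It assumes a Linux or macOS machine for the permission part, and the file names are hypothetical.

```python
import os
import stat
import tempfile
from pathlib import PurePosixPath


def describe(path_text):
    """Report whether a path is absolute, plus its extension and parent folder."""
    path = PurePosixPath(path_text)
    return {
        "is_absolute": path.is_absolute(),
        "extension": path.suffix,
        "parent": str(path.parent),
    }


def add_execute_permission(file_path):
    """Give the file's owner execute permission and return the new mode string."""
    current_mode = os.stat(file_path).st_mode
    os.chmod(file_path, current_mode | stat.S_IXUSR)
    return stat.filemode(os.stat(file_path).st_mode)


absolute_info = describe("/home/user/data/sales.csv")
relative_info = describe("./data/events.parquet")

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "run_pipeline.sh")  # hypothetical script
    with open(script_path, "w") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    os.chmod(script_path, 0o644)  # start as read/write for owner, read for others
    mode_after = add_execute_permission(script_path)

print(absolute_info)
print(relative_info)
print(mode_after)
```

The mode string printed at the end uses the same notation as `ls -la`, so you can cross-check it in a terminal.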

Why This Matters: Data pipelines pull data from APIs (HTTP), move files across systems (file I/O), and connect to databases (client-server). These fundamentals appear everywhere.

How the Internet Works in 5 Minutes (video)
Client-Server Model (MDN Web Docs) (documentation)

Get comfortable with the command line: navigate directories, create files, and run scripts

The command line is your primary tool as a data engineer. Get comfortable with these essentials.

Navigation

pwd          # Print working directory
ls           # List files and directories
ls -la       # List all files with details
cd /path     # Change directory
cd ..        # Go up one level
cd ~         # Go to home directory

File Operations

cat file.txt       # Display file contents
head -n 10 file.csv  # First 10 lines
tail -n 10 file.csv  # Last 10 lines
wc -l file.csv     # Count lines
cp source dest     # Copy file
mv source dest     # Move/rename file
mkdir dirname      # Create directory
rm file.txt        # Delete file

Searching & Filtering

grep 'pattern' file.txt    # Search for text
find . -name '*.csv'       # Find files by name
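grep 'pattern' file.txt | wc -l        # Pipe (|): feed one command's output into the next
grep 'pattern' file.txt > output.txt   # Redirect (>): write output to a file (overwrites it)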

Process Management

ps aux         # List running processes
top            # Monitor system resources
kill PID       # Stop a process
Ctrl+C         # Cancel running command

Tip: Practice by navigating your file system, creating directories, and manipulating text files. You'll use these commands daily.

Linux Command Line Crash Course (freeCodeCamp) (video)
The Linux Command Line for Beginners (Ubuntu) (documentation)

Learn how to use a code editor (VS Code recommended) and install useful extensions

VS Code is one of the most widely used editors among data engineers. Install these extensions to boost your productivity.

Must-Have Extensions

Python (Microsoft): Linting, debugging, IntelliSense for Python
Pylance: Fast Python language server with type checking
SQLTools: Run SQL queries directly from VS Code
Docker: Manage containers and images
GitLens: Enhanced Git integration
YAML: Syntax highlighting for config files
Rainbow CSV: Color-coded CSV viewing

Key Shortcuts

Action	Mac	Windows/Linux
Open Terminal	Ctrl+`	Ctrl+`
Command Palette	Cmd+Shift+P	Ctrl+Shift+P
Quick Open File	Cmd+P	Ctrl+P
Toggle Sidebar	Cmd+B	Ctrl+B
Find in Files	Cmd+Shift+F	Ctrl+Shift+F

Settings Tips

Enable Auto Save (File > Auto Save)
Set the Python interpreter (Cmd+Shift+P on Mac or Ctrl+Shift+P on Windows → "Python: Select Interpreter")
Use the integrated terminal for running scripts

Tip: Learn keyboard shortcuts early. The seconds they save on every action add up over time.

Getting Started with VS Code (documentation)
VS Code Setup for Python and Data Engineering (video)

Understand what data engineering is and how it fits in the data ecosystem alongside analytics and data science

Data engineering is the foundation of every data-driven organization. Here's what the role involves.

The Role

Data engineers build and maintain the infrastructure that allows data to flow from sources to consumers (analysts, data scientists, ML models, dashboards).

Core Responsibilities

Build Data Pipelines: Automate the movement of data from source systems to storage
Design Data Models: Structure data for efficient querying and analysis
Ensure Data Quality: Validate, clean, and monitor data reliability
Manage Infrastructure: Set up databases, cloud services, orchestration tools
Optimize Performance: Make queries and pipelines fast and cost-effective

A Typical Day

Monitor overnight pipeline runs for failures
Debug a broken data pipeline
Write SQL transformations for a new dashboard
Review a teammate's pull request
Set up a new data source integration
Optimize a slow query

Career Path

Junior DE → Mid-level DE → Senior DE → Staff/Principal DE or Data Architect

Average salaries range from $85K (junior) to $180K+ (senior/staff) in the US.

The bottom line: If you enjoy building systems, automating workflows, and solving puzzles with data, data engineering is for you.

What is Data Engineering? (AWS) (documentation)
Data Engineering in 100 Seconds (Fireship) (video)
Data Engineering Zoomcamp (DataTalks.Club) (documentation)

Frequently Asked Questions

How long does it take to become a data engineer?

Most people complete this roadmap in 6-9 months part-time (10-15 hours/week) or 3-4 months full-time, covering 51 hands-on tasks across 11 sections.

Do I need a degree to become a data engineer?

No. A portfolio of 2-3 end-to-end data pipeline projects on GitHub matters more to hiring managers than a formal degree. The final step of this roadmap covers exactly what to build.

What should I learn first for data engineering?

Start with SQL and Python — they appear in nearly every data engineering job description. SQL is the single most-used skill; Python is the primary programming language for pipelines.

Which cloud should I learn — AWS, GCP, or Azure?

AWS has the largest ecosystem and the most job listings, GCP's BigQuery is excellent for analytics, and Azure is common in enterprise environments. Learn one deeply; the concepts transfer between providers.

Is data engineering hard to learn without a CS background?

No. This roadmap starts at step zero with prerequisites and assumes no prior experience. The main requirement is consistency over 6-9 months of part-time study.

