🚀 Startup Stack Roadmap

Build a scalable, cost-effective data stack using modern open-source tools and serverless architecture.

✓ Expert-Designed Learning Path • Industry-Validated Curriculum • Real-World Application Focus

This roadmap was created by data engineering professionals. Its 31 hands-on tasks cover production-ready skills used by companies like Netflix, Airbnb, and Spotify. Master DuckDB, Polars, Metabase, and 3 more technologies.

How long does it take? Most learners complete this roadmap in 4-6 months studying part-time (10-15 hours/week), or about 2-3 months full-time. The 8 sections contain 31 hands-on tasks built around a lightweight, serverless stack.

The 8 steps: (0) Pre-requisites and fundamentals · (1) Local Development Environment · (2) Data Processing with Polars · (3) Analytics with DuckDB · (4) Version control and CI/CD · (5) Serverless data processing · (6) Data visualization with Metabase · (7) Production orchestration.

Beginner to Intermediate

8 sections • 31 tasks

Skills You'll Learn

SQL
Data modeling
Python
ETL/ELT
Serverless
Cloud

Tools You'll Use

DuckDB
Polars
Metabase
AWS Lambda/GCP Cloud Functions
GitHub Actions/AWS EventBridge
GitHub

Projects to Build

Local Data Engineering Environment with dlt, DuckDB & Jupyter
Set up a local development environment for data processing and analytics using Jupyter notebooks, dlt, and DuckDB. All tools are open-source and run locally.
Scheduled GitHub ETL with Polars, dlt & DuckDB
Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores the results in DuckDB.
End-to-End Analytics Platform with DuckDB + Metabase
Build a modern, low-cost analytics stack using DuckDB, Metabase, and GitHub Actions for automated data updates and business-ready dashboards.

Learning Resources

Jupyter Notebooks Guide (documentation)

Step 0: Pre-requisites and fundamentals

-Learn the fundamentals of data engineering

-Master Python basics and SQL

-Understand cloud computing concepts

Step 1: Local Development Environment

-Set up Python virtual environment

-Install Jupyter Notebooks

-Configure DuckDB and Polars

-Create your first data processing notebook

Step 2: Data Processing with Polars

-Learn Polars DataFrame operations

-Practice data transformations in notebooks

-Implement data quality checks

-Optimize performance with Polars

Step 3: Analytics with DuckDB

-Learn DuckDB SQL syntax

-Query public datasets

-Create analytical views

-Optimize query performance

Step 4: Version control and CI/CD

-Learn Git basics

-Create a GitHub repository for your project

-Set up GitHub Actions for data pipeline orchestration

-Implement CI/CD for data quality checks

Step 5: Serverless data processing

-Set up AWS Lambda or GCP Cloud Functions

-Create serverless data processing functions

-Implement error handling and retries

-Set up monitoring and logging

Step 6: Data visualization with Metabase

-Install and configure Metabase

-Connect Metabase to DuckDB

-Create dashboards and visualizations

-Set up automated reporting

Step 7: Production orchestration

-Set up AWS EventBridge or GCP Cloud Scheduler

-Create orchestration workflows

-Implement monitoring and alerting

-Set up data pipeline observability

Curriculum Reference

Step 0: Pre-requisites and fundamentals

Learn the fundamentals of data engineering

Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.

Core Concepts

Data Pipelines: Automated workflows that move and transform data from source to destination.

ETL vs ELT:

ETL (Extract, Transform, Load): Transform data before loading into the warehouse
ELT (Extract, Load, Transform): Load raw data first, then transform in the warehouse
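The difference between the two orderings can be sketched in a few lines. This is a minimal illustration, not part of the roadmap's own material: it uses Python's built-in sqlite3 module as a stand-in for a warehouse (the roadmap itself uses DuckDB), and the RAW_ORDERS records, table names, and helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    ("1", " 19.99 ", "us"),
    ("2", "5.00", "DE"),
    ("3", " 42.50", "us"),
]


def run_etl(conn):
    """ETL: clean rows in Python first, then load only the cleaned result."""
    cleaned = [
        (int(order_id), float(amount.strip()), country.upper())
        for order_id, amount, country in RAW_ORDERS
    ]
    conn.execute("CREATE TABLE orders_etl (id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw strings untouched, then transform with SQL in the database."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(id AS INTEGER)        AS id,
               CAST(TRIM(amount) AS REAL) AS amount,
               UPPER(country)             AS country
        FROM raw_orders
        """
    )


def totals_by_country(conn, table):
    """Sum order amounts per country for one of the tables created above."""
    query = (
        f"SELECT country, ROUND(SUM(amount), 2) FROM {table} "
        "GROUP BY country ORDER BY country"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals_by_country(conn, "orders_etl")
elt_totals = totals_by_country(conn, "orders_elt")
```

Both paths end with the same cleaned table. The practical difference is that the ELT path also keeps raw_orders in the database, so the transformation can be re-run or changed later without extracting the data again.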

Data Warehouses: Centralized repositories optimized for analytical queries (e.g., Snowflake, BigQuery, Redshift)

Data Lakes: Storage systems that can hold raw data in its native format (e.g., S3, Azure Data Lake)

Data Modeling: Structuring data for efficient storage and retrieval

Data Quality: Ensuring data is accurate, complete, and reliable
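As a rough illustration of what a data quality check can look like in practice, the sketch below tests two common properties: completeness (required fields are present) and uniqueness (no duplicate keys). The check_quality function, the field names, and the sample rows are hypothetical and written in plain Python; later steps of the roadmap implement checks like these with Polars.

```python
def check_quality(rows, required, unique_key):
    """Return a list of human-readable problems found in rows (a list of dicts)."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
        # Uniqueness: the key must not repeat across rows.
        key = row.get(unique_key)
        if key is not None:
            if key in seen:
                problems.append(f"row {index}: duplicate {unique_key}={key}")
            seen.add(key)
    return problems


# Hypothetical sample with one empty email and one repeated id.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "c@example.com"},
]
issues = check_quality(sample, required=["id", "email"], unique_key="id")
```

Returning a list of problems rather than raising on the first one lets a pipeline log every issue in a batch before deciding whether to stop.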

Key Skills for Data Engineers

Programming: Python, SQL, Scala, Java
Databases: SQL and NoSQL systems
Cloud Platforms: AWS, GCP, Azure
Orchestration: Airflow, Prefect, Dagster
Data Processing: Spark, Kafka, dbt
Infrastructure: Docker, Kubernetes, Terraform

Next Steps: Master SQL and Python fundamentals before diving into specific tools and frameworks.

What is Data Engineering? (AWS) (documentation)
Data Engineering Zoomcamp (DataTalks.Club) (documentation)
Data Engineering vs Data Science Explained (video)

Understand cloud computing concepts

Cloud Computing Explained (video)

Frequently Asked Questions

What data stack should a startup use?

A startup can run a lean, cost-effective stack with DuckDB and Polars for local processing, Metabase for dashboards, and serverless functions on AWS Lambda or GCP Cloud Functions. This roadmap builds exactly that across eight sections.

Why use DuckDB and Polars for a startup data stack?

DuckDB and Polars let a small team process and query data locally without provisioning a warehouse or cluster, keeping costs low. This roadmap teaches Polars DataFrame operations and DuckDB SQL before adding serverless processing and orchestration.

What is a serverless data stack?

A serverless data stack runs processing in managed functions like AWS Lambda or GCP Cloud Functions that scale to zero when idle. This roadmap covers building those functions, error handling and retries, and orchestrating them with EventBridge or Cloud Scheduler.
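A serverless function with retries might look roughly like the sketch below. It assumes the (event, context) handler shape that AWS Lambda uses for Python; the "records" event layout, the TransientError class, and the with_retries and process helpers are hypothetical names chosen for illustration, not part of any AWS or GCP API.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. timeouts)."""


def with_retries(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func(); on TransientError, wait with exponential backoff and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: let the platform record the failure
            sleep(base_delay * 2 ** (attempt - 1))


def process(records):
    """Placeholder transformation: total the hypothetical 'amount' field."""
    return sum(record["amount"] for record in records)


def handler(event, context):
    """Entry point in the (event, context) shape AWS Lambda uses for Python."""
    records = event.get("records", [])
    total = with_retries(lambda: process(records))
    return {"status": "ok", "count": len(records), "total": total}
```

Passing sleep as a parameter keeps the retry helper easy to test, since a test can record the delays instead of actually waiting. Only errors marked as transient are retried, so a genuine bug fails fast instead of being retried.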

Do I need to know Python and SQL for this roadmap?

Yes. Step 0 expects Python basics, SQL, and an understanding of cloud computing concepts before you start. From there you set up a local environment with Jupyter, DuckDB, and Polars, then move into serverless processing and production orchestration.

How do you orchestrate a startup data pipeline cheaply?

This roadmap uses GitHub Actions for CI/CD and pipeline runs, then AWS EventBridge or GCP Cloud Scheduler for production scheduling, with monitoring, alerting, and observability. It avoids heavyweight orchestrators in favor of low-cost managed services.
