Startup Stack Roadmap
Build a scalable, cost-effective data stack using modern open-source tools and serverless architecture.
This roadmap was created by data engineering professionals and contains 31 hands-on tasks covering production-ready skills used by companies like Netflix, Airbnb, and Spotify. Master DuckDB, Polars, Metabase, and 3 more technologies.
How long does it take? Most learners complete this roadmap in 4-6 months studying part-time (10-15 hours/week), or about 2-3 months full-time. The 8 sections contain 31 hands-on tasks built around a lightweight, serverless stack.
The 8 steps: (0) Pre-requisites and fundamentals · (1) Local Development Environment · (2) Data Processing with Polars · (3) Analytics with DuckDB · (4) Version control and CI/CD · (5) Serverless data processing · (6) Data visualization with Metabase · (7) Production orchestration.
Skills You'll Learn
- SQL
- Data modeling
- Python
- ETL/ELT
- Serverless
- Cloud
Tools You'll Use
- DuckDB
- Polars
- Metabase
- AWS Lambda/GCP Cloud Functions
- GitHub Actions/AWS EventBridge
- GitHub
Projects to Build
- Local Data Engineering Environment with dlt, DuckDB & Jupyter
Set up a local development environment for data processing and analytics using Jupyter notebooks, dlt, and DuckDB. All tools are open-source and run locally.
- Scheduled GitHub ETL with Polars, dlt & DuckDB
Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores the results in DuckDB.
- End-to-End Analytics Platform with DuckDB + Metabase
Build a modern, low-cost analytics stack using DuckDB, Metabase, and GitHub Actions for automated data updates and business-ready dashboards.
Learning Resources
Step 0: Pre-requisites and fundamentals
Step 1: Local Development Environment
Step 2: Data Processing with Polars
Step 3: Analytics with DuckDB
Step 4: Version control and CI/CD
Step 5: Serverless data processing
Step 6: Data visualization with Metabase
Step 7: Production orchestration
Curriculum Reference
A free preview of the learning material in this roadmap. The full reference for every section is available when you sign in.
Step 0: Pre-requisites and fundamentals
Learn the fundamentals of data engineering
Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.
Core Concepts
Data Pipelines: Automated workflows that move and transform data from source to destination.
ETL vs ELT:
- ETL (Extract, Transform, Load): Transform data before loading into the warehouse
- ELT (Extract, Load, Transform): Load raw data first, then transform in the warehouse
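The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the standard-library `sqlite3` module stands in for a real warehouse, and the sample records, table names, and the `clean`, `run_etl`, and `run_elt` helpers are hypothetical.

```python
import sqlite3

# Hypothetical raw records, as strings, the way a source system might emit them.
RAW_ORDERS = [
    ("1", " alice ", "19.99"),
    ("2", "BOB", "5.00"),
    ("3", "Carol", "12.50"),
]


def clean(order_id, customer, amount):
    """Transform step: cast types and normalize the customer name."""
    return int(order_id), customer.strip().lower(), float(amount)


def run_etl(conn):
    """ETL: transform in Python first, then load only the cleaned rows."""
    conn.execute("CREATE TABLE orders_etl (id INTEGER, customer TEXT, amount REAL)")
    cleaned = [clean(*row) for row in RAW_ORDERS]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the raw strings untouched, then transform with SQL in the database."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(id AS INTEGER)   AS id,
               LOWER(TRIM(customer)) AS customer,
               CAST(amount AS REAL)  AS amount
        FROM raw_orders
        """
    )


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
run_etl(conn)
run_elt(conn)

etl_rows = conn.execute("SELECT id, customer, amount FROM orders_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT id, customer, amount FROM orders_elt ORDER BY id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths end with the same cleaned table. The practical difference is that the ELT path also keeps `raw_orders`, so the transformation can be rerun or changed later without extracting the data again.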
Data Warehouses: Centralized repositories optimized for analytical queries (e.g., Snowflake, BigQuery, Redshift)
Data Lakes: Storage systems that can hold raw data in its native format (e.g., S3, Azure Data Lake)
Data Modeling: Structuring data for efficient storage and retrieval
Data Quality: Ensuring data is accurate, complete, and reliable
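As a rough illustration of what "accurate, complete, and reliable" can mean in practice, the sketch below runs three simple checks over a batch of records: required fields are present, a key is unique, and numeric fields are non-negative. The `check_quality` function, its parameters, and the sample rows are hypothetical; real pipelines often use a dedicated validation library instead.

```python
def check_quality(rows, required, unique_key, non_negative=()):
    """Return a list of human-readable issues found in rows (a list of dicts)."""
    issues = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                issues.append(f"row {i}: missing {field}")

        # Uniqueness: the key must not repeat within the batch.
        key = row.get(unique_key)
        if key is not None:
            if key in seen_keys:
                issues.append(f"row {i}: duplicate {unique_key}={key}")
            seen_keys.add(key)

        # Accuracy (simple range rule): listed numeric fields must not be negative.
        for field in non_negative:
            value = row.get(field)
            if isinstance(value, (int, float)) and value < 0:
                issues.append(f"row {i}: negative {field}={value}")
    return issues


# Hypothetical sample batch with one empty field, one repeated id, one negative amount.
sample_rows = [
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 2, "email": "", "amount": 5.0},
    {"id": 2, "email": "c@example.com", "amount": -3.0},
]

issues = check_quality(
    sample_rows,
    required=["id", "email", "amount"],
    unique_key="id",
    non_negative=["amount"],
)
for issue in issues:
    print(issue)
```

Returning a list of issues instead of raising on the first failure lets a pipeline decide whether to reject the batch, quarantine the bad rows, or only log a warning.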
Key Skills for Data Engineers
- Programming: Python, SQL, Scala, Java
- Databases: SQL and NoSQL systems
- Cloud Platforms: AWS, GCP, Azure
- Orchestration: Airflow, Prefect, Dagster
- Data Processing: Spark, Kafka, dbt
- Infrastructure: Docker, Kubernetes, Terraform
Next Steps: Master SQL and Python fundamentals before diving into specific tools and frameworks.
- What is Data Engineering? (AWS) (documentation)
- Data Engineering Zoomcamp (DataTalks.Club) (documentation)
- Data Engineering vs Data Science Explained (video)
Understand cloud computing concepts
- Cloud Computing Explained (video)
Frequently Asked Questions
What data stack should a startup use?
A startup can run a lean, cost-effective stack with DuckDB and Polars for local processing, Metabase for dashboards, and serverless functions on AWS Lambda or GCP Cloud Functions. This roadmap builds exactly that across eight sections.
Why use DuckDB and Polars for a startup data stack?
DuckDB and Polars let a small team process and query data locally without provisioning a warehouse or cluster, keeping costs low. This roadmap teaches Polars DataFrame operations and DuckDB SQL before adding serverless processing and orchestration.
What is a serverless data stack?
A serverless data stack runs processing in managed functions like AWS Lambda or GCP Cloud Functions that scale to zero when idle. This roadmap covers building those functions, error handling and retries, and orchestrating them with EventBridge or Cloud Scheduler.
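A minimal sketch of the error-handling-and-retries idea, assuming a Python function in the `handler(event, context)` shape that AWS Lambda uses (GCP Cloud Functions uses different entry-point signatures). The `TransientError` class, the `with_retries` and `process` helpers, and the `records` event key are hypothetical names for illustration, not part of any cloud SDK.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, throttling)."""


def with_retries(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on TransientError, back off exponentially and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the platform record the failure
            # Delay doubles on each attempt; a little jitter spreads out retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


def process(records):
    """Hypothetical processing step; a real one might call an API or write a file."""
    return {"processed": len(records)}


def handler(event, context):
    """Entry point: read records from the event and process them with retries."""
    records = event.get("records", [])
    return with_retries(lambda: process(records))


if __name__ == "__main__":
    # Local smoke test; in a deployed function the platform supplies event and context.
    print(handler({"records": [1, 2, 3]}, None))
```

Only errors marked as transient are retried here, so a genuine bug fails fast instead of being repeated. Managed platforms usually offer their own retry settings as well, so in-function retries like these are best kept short.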
Do I need to know Python and SQL for this roadmap?
Yes. Step 0 expects Python basics, SQL, and an understanding of cloud computing concepts before you start. From there you set up a local environment with Jupyter, DuckDB, and Polars, then move into serverless processing and production orchestration.
How do you orchestrate a startup data pipeline cheaply?
This roadmap uses GitHub Actions for CI/CD and pipeline runs, then AWS EventBridge or GCP Cloud Scheduler for production scheduling, with monitoring, alerting, and observability. It avoids heavyweight orchestrators in favor of low-cost managed services.