🚀 Startup Stack Roadmap

    Build a scalable, cost-effective data stack using modern open-source tools and serverless architecture.

    ✓ Expert-Designed Learning Path • Industry-Validated Curriculum • Real-World Application Focus

    This roadmap was created by data engineering professionals with 31 hands-on tasks covering production-ready skills used by companies like Netflix, Airbnb, and Spotify. Master DuckDB, Polars, Metabase and 3 more technologies.

    How long does it take? Most learners complete this roadmap in 4-6 months studying part-time (10-15 hours/week), or about 2-3 months full-time. The 8 sections contain 31 hands-on tasks built around a lightweight, serverless stack.

    The 8 steps: (0) Pre-requisites and fundamentals · (1) Local Development Environment · (2) Data Processing with Polars · (3) Analytics with DuckDB · (4) Version control and CI/CD · (5) Serverless data processing · (6) Data visualization with Metabase · (7) Production orchestration.

    Beginner to Intermediate
    8 sections • 31 tasks

    Skills You'll Learn

    • SQL
    • Data modeling
    • Python
    • ETL/ELT
    • Serverless
    • Cloud

    Tools You'll Use

    • DuckDB
    • Polars
    • Metabase
    • AWS Lambda/GCP Cloud Functions
    • GitHub Actions/AWS EventBridge
    • GitHub

    Projects to Build

    Step 0: Pre-requisites and fundamentals

    -Learn the fundamentals of data engineering
    -Master Python basics and SQL
    -Understand cloud computing concepts

    Step 1: Local Development Environment

    -Set up Python virtual environment
    -Install Jupyter Notebooks
    -Configure DuckDB and Polars
    -Create your first data processing notebook

    Step 2: Data Processing with Polars

    -Learn Polars DataFrame operations
    -Practice data transformations in notebooks
    -Implement data quality checks
    -Optimize performance with Polars

    Step 3: Analytics with DuckDB

    -Learn DuckDB SQL syntax
    -Query public datasets
    -Create analytical views
    -Optimize query performance

    Step 4: Version control and CI/CD

    -Learn Git basics
    -Create a GitHub repository for your project
    -Set up GitHub Actions for data pipeline orchestration
    -Implement CI/CD for data quality checks

    Step 5: Serverless data processing

    -Set up AWS Lambda or GCP Cloud Functions
    -Create serverless data processing functions
    -Implement error handling and retries
    -Set up monitoring and logging

    Step 6: Data visualization with Metabase

    -Install and configure Metabase
    -Connect Metabase to DuckDB
    -Create dashboards and visualizations
    -Set up automated reporting

    Step 7: Production orchestration

    -Set up AWS EventBridge or GCP Cloud Scheduler
    -Create orchestration workflows
    -Implement monitoring and alerting
    -Set up data pipeline observability

    Curriculum Reference

    A free preview of the learning material in this roadmap. The full reference for every section is available when you sign in.

    Step 0: Pre-requisites and fundamentals

    Learn the fundamentals of data engineering

    Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.


    Core Concepts

    Data Pipelines: Automated workflows that move and transform data from source to destination.

    ETL vs ELT:

    • ETL (Extract, Transform, Load): Transform data before loading into the warehouse
    • ELT (Extract, Load, Transform): Load raw data first, then transform in the warehouse
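
    The difference between the two orderings can be sketched in a few lines. This is a minimal illustration, not part of the roadmap's own material: it assumes SQLite (from the Python standard library) as a stand-in for a warehouse, and the RAW_ORDERS records, table names, and function names are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, as they might arrive from a source system.
RAW_ORDERS = [
    ("1001", " Alice ", "19.99"),
    ("1002", "bob", "5.00"),
    ("1003", "Carol", "42.50"),
]


def run_etl(conn):
    """ETL: clean the rows in Python, then load only the finished result."""
    cleaned = [
        (int(order_id), name.strip().title(), float(amount))
        for order_id, name, amount in RAW_ORDERS
    ]
    conn.execute(
        "CREATE TABLE orders_etl (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the raw strings untouched, then transform with SQL in the database."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               TRIM(customer) AS customer,
               CAST(amount AS REAL) AS amount
        FROM raw_orders
        """
    )


# SQLite in memory stands in for the warehouse (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_total = conn.execute("SELECT SUM(amount) FROM orders_etl").fetchone()[0]
elt_total = conn.execute("SELECT SUM(amount) FROM orders_elt").fetchone()[0]
print(etl_total, elt_total)
```

    In the ELT variant the raw_orders table is kept, so a transformation can be rewritten and rerun later without going back to the source system.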

    Data Warehouses: Centralized repositories optimized for analytical queries (e.g., Snowflake, BigQuery, Redshift)

    Data Lakes: Storage systems that can hold raw data in its native format (e.g., S3, Azure Data Lake)

    Data Modeling: Structuring data for efficient storage and retrieval

    Data Quality: Ensuring data is accurate, complete, and reliable
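
    As a rough illustration of what a data quality check can look like, the sketch below tests rows for completeness, uniqueness, and validity in plain Python. The check_quality function, the sample rows, and the "amount" field are hypothetical and chosen only for this example. A real pipeline would likely run equivalent checks in its processing library.

```python
def check_quality(rows, required_fields, unique_key):
    """Return a list of human-readable problems found in rows (a list of dicts)."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and non-empty.
        for field in required_fields:
            if row.get(field) in (None, ""):
                problems.append(f"row {index}: missing {field}")
        # Uniqueness: the key column must not repeat.
        key = row.get(unique_key)
        if key is not None:
            if key in seen_keys:
                problems.append(f"row {index}: duplicate {unique_key}={key}")
            seen_keys.add(key)
        # Validity: the hypothetical 'amount' field must not be negative.
        amount = row.get("amount")
        if isinstance(amount, (int, float)) and amount < 0:
            problems.append(f"row {index}: negative amount {amount}")
    return problems


# Hypothetical sample data with one empty name, one repeated id, one bad amount.
sample_rows = [
    {"order_id": 1, "customer": "Alice", "amount": 19.99},
    {"order_id": 2, "customer": "", "amount": 5.0},
    {"order_id": 2, "customer": "Carol", "amount": -1.0},
]
problems = check_quality(sample_rows, ["order_id", "customer", "amount"], "order_id")
for problem in problems:
    print(problem)
```

    Returning a list of problems, rather than raising on the first one, lets a pipeline decide whether to fail the run or only log a warning.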


    Key Skills for Data Engineers

    • Programming: Python, SQL, Scala, Java
    • Databases: SQL and NoSQL systems
    • Cloud Platforms: AWS, GCP, Azure
    • Orchestration: Airflow, Prefect, Dagster
    • Data Processing: Spark, Kafka, dbt
    • Infrastructure: Docker, Kubernetes, Terraform

    Next Steps: Master SQL and Python fundamentals before diving into specific tools and frameworks.

    Understand cloud computing concepts


    Frequently Asked Questions

    What data stack should a startup use?

    A startup can run a lean, cost-effective stack with DuckDB and Polars for local processing, Metabase for dashboards, and serverless functions on AWS Lambda or GCP Cloud Functions. This roadmap builds exactly that across eight sections.

    Why use DuckDB and Polars for a startup data stack?

    DuckDB and Polars let a small team process and query data locally without provisioning a warehouse or cluster, keeping costs low. This roadmap teaches Polars DataFrame operations and DuckDB SQL before adding serverless processing and orchestration.

    What is a serverless data stack?

    A serverless data stack runs processing in managed functions like AWS Lambda or GCP Cloud Functions that scale to zero when idle. This roadmap covers building those functions, error handling and retries, and orchestrating them with EventBridge or Cloud Scheduler.
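
    A function with error handling and retries might look roughly like the sketch below. It assumes the (event, context) handler shape that AWS Lambda uses for Python; GCP Cloud Functions use a different signature. TransientError, with_retries, process, and the "records" and "amount" event fields are hypothetical names introduced only for this example.

```python
import json
import logging
import time

logger = logging.getLogger(__name__)


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (for example, a timeout)."""


def with_retries(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on TransientError, back off exponentially and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller handle the failure.
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)


def process(records):
    """Placeholder transformation: total the hypothetical 'amount' field."""
    return sum(record["amount"] for record in records)


def handler(event, context):
    """Entry point in the (event, context) shape assumed by this sketch."""
    records = event.get("records", [])
    try:
        total = with_retries(lambda: process(records))
    except Exception:
        # Log the traceback, then report the failure to the caller.
        logger.exception("processing failed")
        return {"statusCode": 500, "body": json.dumps({"error": "processing failed"})}
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(records), "total": total}),
    }
```

    Calling handler({"records": [{"amount": 2}, {"amount": 3}]}, None) locally exercises the success path without deploying anything. Only errors marked as transient are retried, so a malformed record fails fast instead of being retried pointlessly.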

    Do I need to know Python and SQL for this roadmap?

    Yes. Step 0 expects Python basics, SQL, and an understanding of cloud computing concepts before you start. From there you set up a local environment with Jupyter, DuckDB, and Polars, then move into serverless processing and production orchestration.

    How do you orchestrate a startup data pipeline cheaply?

    This roadmap uses GitHub Actions for CI/CD and pipeline runs, then AWS EventBridge or GCP Cloud Scheduler for production scheduling, with monitoring, alerting, and observability. It avoids heavyweight orchestrators in favor of low-cost managed services.
