Data Engineering Projects

    Build real-world data engineering experience with hands-on projects. From simple ETL pipelines to complex streaming architectures, master the skills employers are looking for.

    Build Your Data Engineering Portfolio

    Our project-based learning approach gives you practical experience with real-world data engineering challenges. Each project includes detailed instructions, starter code, and comprehensive solutions to help you learn effectively.

    Project Categories:

    • ETL/ELT Data Pipelines
    • Real-time Stream Processing
    • Data Warehouse & Lake Architecture
    • Cloud-Native Data Solutions
    • Microservices Data Architecture
    • Analytics & Monitoring Dashboards

    Technologies You'll Master:

    Jupyter, dlt, DuckDB, Python, Polars, GitHub Actions, Terraform/OpenTofu/Pulumi, Metabase, Docker, GCP, +14 more

    8 projects available across 3 difficulty levels. Perfect for building a portfolio that demonstrates your data engineering expertise to employers.


    Local Data Engineering Environment with dlt, DuckDB & Jupyter

    Set up a local development environment for data processing and analytics using Jupyter notebooks, dlt, and DuckDB. All tools are open-source and run locally.

    Beginner
    2-4 hours

    Tools & Technologies:

    Jupyter
    dlt
    DuckDB
    Python
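
    To give a feel for the workflow, here is a minimal sketch of a dlt pipeline loading a toy table into a local DuckDB file and querying it back; the pipeline, dataset, and table names are illustrative:

```python
import dlt
import duckdb


# A toy resource; in the project the data comes from real sources.
@dlt.resource(name="events", write_disposition="append")
def events():
    yield from [
        {"id": 1, "type": "signup"},
        {"id": 2, "type": "login"},
    ]


# dlt creates <pipeline_name>.duckdb in the working directory by default.
pipeline = dlt.pipeline(
    pipeline_name="local_demo",
    destination="duckdb",
    dataset_name="raw",
)
print(pipeline.run(events()))

# Query the loaded table directly with DuckDB.
con = duckdb.connect("local_demo.duckdb")
print(con.sql("SELECT type, count(*) AS n FROM raw.events GROUP BY type"))
```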

    Scheduled GitHub ETL with Polars, dlt & DuckDB

    Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores the results in DuckDB.

    Intermediate
    4-6 hours

    Tools & Technologies:

    Polars
    dlt
    DuckDB
    GitHub Actions
    +2 more
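
    A minimal sketch of the extract-transform-load loop, assuming the public GitHub API and a hypothetical target org; in the full project a GitHub Actions cron job runs a script like this on a schedule:

```python
import duckdb
import polars as pl
import requests

# Extract: list public repos (the org name here is just an example).
resp = requests.get(
    "https://api.github.com/orgs/duckdb/repos",
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

# Transform: keep a few fields and rank repos by stars with Polars.
rows = [
    {"name": r["name"], "stars": r["stargazers_count"], "forks": r["forks_count"]}
    for r in resp.json()
]
top = pl.DataFrame(rows).sort("stars", descending=True)

# Load: DuckDB can scan the Polars frame in-place and persist it.
con = duckdb.connect("github.duckdb")
con.execute("CREATE OR REPLACE TABLE repo_stats AS SELECT * FROM top")
con.close()
```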

    End-to-End Analytics Platform with DuckDB + Metabase

    Build a modern, low-cost analytics stack using DuckDB, Metabase, and GitHub Actions for automated data updates and business-ready dashboards.

    Intermediate
    6-10 hours

    Tools & Technologies:

    DuckDB
    Metabase
    Python
    GitHub Actions
    +1 more
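
    The heart of the stack is a refresh script that rebuilds a DuckDB file from source data; a sketch, assuming a hypothetical orders.csv, with GitHub Actions running it on a schedule and Metabase pointing at the resulting database (typically via the community DuckDB driver):

```python
import duckdb

# Rebuild the analytics tables from raw files; paths and columns are placeholders.
con = duckdb.connect("analytics.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE daily_sales AS
    SELECT order_date, SUM(amount) AS revenue, COUNT(*) AS orders
    FROM read_csv_auto('data/orders.csv')
    GROUP BY order_date
    ORDER BY order_date
""")
con.close()
```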

    Infrastructure-as-Code Setup on GCP

    Provision a GCP environment using Terraform with BigQuery & Cloud Storage, staying within free-tier limits.

    Intermediate
    4-6 hours

    Tools & Technologies:

    Terraform
    GCP
    BigQuery
    Cloud Storage
    +1 more
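
    The project itself uses Terraform; to keep the samples on this page in one language, here is an equivalent provisioning sketch in Pulumi's Python SDK (Pulumi also appears in the technology list above). Resource names and locations are placeholders:

```python
import pulumi
import pulumi_gcp as gcp

# Cloud Storage bucket for raw files; the "US" multi-region is a placeholder.
bucket = gcp.storage.Bucket(
    "raw-data",
    location="US",
    uniform_bucket_level_access=True,
)

# BigQuery dataset for analytics tables.
dataset = gcp.bigquery.Dataset(
    "analytics",
    dataset_id="analytics",
    location="US",
)

pulumi.export("bucket_name", bucket.name)
pulumi.export("dataset_id", dataset.dataset_id)
```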

    ETL Pipeline Orchestration with Apache Airflow

    Design and implement an orchestrated ETL pipeline using Apache Airflow to extract, transform, and load weather data from a public API into a data warehouse.

    Intermediate
    8-12 hours

    Tools & Technologies:

    Airflow
    Docker
    Python
    APIs
    +2 more
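
    A compact sketch of such a DAG using Airflow's TaskFlow API (Airflow 2.4+), with the keyless Open-Meteo API standing in as the weather source; the warehouse load is reduced to a print here:

```python
from datetime import datetime

import requests
from airflow.decorators import dag, task


@dag(schedule="@hourly", start_date=datetime(2024, 1, 1), catchup=False)
def weather_etl():
    @task
    def extract() -> dict:
        # Free, keyless weather API; the coordinates are placeholders.
        resp = requests.get(
            "https://api.open-meteo.com/v1/forecast",
            params={"latitude": 52.52, "longitude": 13.41, "current_weather": True},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    @task
    def transform(payload: dict) -> dict:
        current = payload["current_weather"]
        return {"ts": current["time"], "temp_c": current["temperature"]}

    @task
    def load(row: dict) -> None:
        # The project loads into a warehouse; printing stands in for that here.
        print(f"would insert: {row}")

    load(transform(extract()))


weather_etl()
```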

    Analytics Engineering Workflow with dbt + Metabase

    Build a production-grade analytics workflow: model, test, and document data with dbt, then visualize insights in Metabase.

    Intermediate
    6-10 hours

    Tools & Technologies:

    dbt
    BigQuery
    SQL
    Metabase

    GitHub Events Analytics with PySpark

    Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.

    Advanced
    10-12 hours

    Tools & Technologies:

    Apache Spark
    Python
    PySpark
    Docker
    +2 more
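
    A minimal sketch of the batch job, assuming GH Archive-style newline-delimited JSON event files downloaded locally (the path is a placeholder):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gh-events").getOrCreate()

# GH Archive publishes hourly gzipped JSON; Spark reads the glob in one pass.
events = spark.read.json("data/2024-01-01-*.json.gz")

# Batch aggregation: event counts per repository and event type.
top_repos = (
    events.groupBy("repo.name", "type")
    .agg(F.count("*").alias("n_events"))
    .orderBy(F.desc("n_events"))
)
top_repos.show(10, truncate=False)
spark.stop()
```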

    ⚡ Real-Time Data Streaming with Apache Kafka

    Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.

    Advanced
    8-12 hours

    Tools & Technologies:

    Kafka
    Confluent Cloud
    Python
    Polars
    +4 more
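
    A minimal producer sketch with the confluent-kafka client, simulating taxi-trip events as JSON; the broker address is a placeholder, and Confluent Cloud additionally requires SASL credentials:

```python
import json
import random
import time

from confluent_kafka import Producer

# Placeholder broker; Confluent Cloud also needs security.protocol,
# sasl.mechanisms, sasl.username, and sasl.password in this config.
producer = Producer({"bootstrap.servers": "localhost:9092"})

# Simulate NYC-taxi-style trip events as JSON messages.
for trip_id in range(100):
    event = {
        "trip_id": trip_id,
        "passenger_count": random.randint(1, 4),
        "fare": round(random.uniform(5, 60), 2),
        "ts": time.time(),
    }
    producer.produce("taxi-trips", value=json.dumps(event).encode())
    producer.poll(0)  # serve delivery callbacks

producer.flush()
```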

    Why Choose Project-Based Learning?

    Real-World Application

    Work with actual datasets and scenarios that mirror production environments. Build solutions that demonstrate your ability to handle complex data challenges.

    Portfolio Development

    Create a compelling portfolio that showcases your technical skills to potential employers. Each project includes documentation and deployment instructions.

    Industry-Relevant Skills

    Focus on the tools and technologies that are in high demand in the data engineering job market. Stay current with modern data stack practices.