Scheduled GitHub ETL with Polars, DLT & DuckDB

    Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores results in DuckDB


    This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master Polars, DLT, DuckDB, and three more technologies through hands-on implementation. The project is rated intermediate and comes with comprehensive documentation and starter code.

    Intermediate
    4-6 hours

    ⚙️ Project: Scheduled GitHub ETL with Polars, DLT & DuckDB

    📌 Project Overview

    In this project, you'll build a scheduled ETL pipeline that extracts data from the GitHub Repositories API, transforms it using Polars, and stores the results in a local DuckDB database. The pipeline runs daily via GitHub Actions, offering a fully automated and cloud-free environment that simulates a production-grade workflow.

    When you're ready, you can extend the project by deploying it to a serverless cloud environment, using either AWS Lambda or Google Cloud Functions, managed via Terraform, OpenTofu, or Pulumi.


    🎯 Learning Objectives

    • Extract structured data from the GitHub API
    • Transform data efficiently using Polars
    • Build an ETL pipeline using DLT
    • Store data locally using DuckDB for offline analytics
    • Automate the pipeline with GitHub Actions on a daily schedule
    • Optionally deploy and manage the pipeline in a serverless cloud environment

    📂 Project Structure

    github-etl/
    ├── src/
    │   ├── extract.py         # GitHub API client
    │   ├── transform.py       # Polars transformation logic
    │   ├── load.py            # Write to DuckDB
    │   └── main.py            # Entrypoint for the ETL flow
    ├── dlt/
    │   └── pipeline.py        # DLT pipeline definition
    ├── config/
    │   └── config.toml        # Repo/org selection, GitHub token, file paths
    ├── data/
    │   └── output.duckdb      # DuckDB output database
    ├── .github/
    │   └── workflows/
    │       └── schedule.yml   # GitHub Actions job definition
    ├── tests/
    │   └── test_transform.py  # Unit tests for transformation logic
    ├── requirements.txt
    └── README.md
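
    The dlt/pipeline.py module shown above is where the pipeline is wired to DuckDB. A minimal sketch, assuming a fetch_repos helper in src/extract.py (the resource, pipeline, and dataset names are illustrative):

        import dlt

        from src.extract import fetch_repos  # hypothetical helper; adjust to your package layout

        @dlt.resource(name="repos", write_disposition="replace")
        def github_repos():
            # Yield one dict per repository; dlt infers the schema automatically.
            yield from fetch_repos()

        def run():
            pipeline = dlt.pipeline(
                pipeline_name="github_etl",
                destination="duckdb",   # writes a local .duckdb file
                dataset_name="github",
            )
            print(pipeline.run(github_repos()))

        if __name__ == "__main__":
            run()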
    

    🔄 Step-by-Step Guide

    1. Extract GitHub Data

    • Use the GitHub REST API (v3) to retrieve:
      • Repository metadata: stars, forks, license, language, timestamps
    • Implement support for pagination and basic rate limiting
    • Configure target repos via config.toml or environment variables (see the extraction sketch below)
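
    A minimal extraction sketch, assuming the requests library and a GITHUB_TOKEN environment variable (function and variable names are illustrative):

        import os
        import time

        import requests

        API = "https://api.github.com"

        def fetch_repos(org: str) -> list[dict]:
            """Fetch all repositories for an organization, following pagination."""
            headers = {
                "Accept": "application/vnd.github+json",
                "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            }
            repos, page = [], 1
            while True:
                resp = requests.get(
                    f"{API}/orgs/{org}/repos",
                    headers=headers,
                    params={"per_page": 100, "page": page},
                    timeout=30,
                )
                # Basic rate limiting: wait for the quota to reset, then retry.
                if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
                    reset_at = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
                    time.sleep(max(reset_at - time.time(), 1))
                    continue
                resp.raise_for_status()
                batch = resp.json()
                if not batch:
                    break
                repos.extend(batch)
                page += 1
            return repos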

    2. Transform with Polars

    • Normalize and clean JSON responses
    • Filter for active repositories
    • Add derived fields (e.g. "days since last push", "star/fork ratio")
    • Prepare the data as structured DataFrames (see the sketch below)
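
    A minimal transformation sketch (column names follow the GitHub repository payload; exact datetime-parsing details may vary with the Polars version):

        from datetime import datetime, timezone

        import polars as pl

        def transform(repos: list[dict]) -> pl.DataFrame:
            """Normalize raw API dicts into a tidy DataFrame with derived fields."""
            now = datetime.now(timezone.utc)
            return (
                pl.DataFrame(repos)
                .select(
                    "name", "stargazers_count", "forks_count", "language", "archived",
                    pl.col("pushed_at").str.to_datetime(time_zone="UTC"),
                )
                # Keep only active (non-archived) repositories.
                .filter(~pl.col("archived"))
                # Derived fields: days since last push and star/fork ratio.
                .with_columns(
                    (pl.lit(now) - pl.col("pushed_at")).dt.total_days().alias("days_since_push"),
                    (pl.col("stargazers_count") / pl.col("forks_count").clip(lower_bound=1))
                    .alias("star_fork_ratio"),
                )
            )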

    3. Load into DuckDB

    • Store output in a local .duckdb database file
    • Support append or overwrite modes for repeated daily runs
    • Organize tables with a partitioning or snapshot strategy (sketched below)
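
    A minimal load sketch using the duckdb Python API directly (if you load through the DLT pipeline instead, this step is handled for you); the table name, path, and snapshot column are assumptions:

        from datetime import date

        import duckdb
        import polars as pl

        def load(df: pl.DataFrame, db_path: str = "data/output.duckdb", mode: str = "append") -> None:
            # Tag each run with a snapshot date so repeated daily runs can coexist.
            df = df.with_columns(pl.lit(date.today()).alias("snapshot_date"))
            con = duckdb.connect(db_path)
            try:
                # DuckDB can query the in-scope Polars DataFrame `df` directly.
                if mode == "overwrite":
                    con.execute("CREATE OR REPLACE TABLE repos AS SELECT * FROM df")
                elif "repos" in {row[0] for row in con.execute("SHOW TABLES").fetchall()}:
                    con.execute("INSERT INTO repos SELECT * FROM df")
                else:
                    con.execute("CREATE TABLE repos AS SELECT * FROM df")
            finally:
                con.close()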

    4. Automate with GitHub Actions

    • Set up a workflow (sketched below) that:
      • Installs dependencies
      • Runs the ETL pipeline daily
      • Persists the updated DuckDB file, either by committing it back to the repository or by uploading it as a workflow artifact
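
    A minimal .github/workflows/schedule.yml sketch (cron time, Python version, and artifact name are illustrative):

        name: daily-etl

        on:
          schedule:
            - cron: "0 6 * * *"    # run daily at 06:00 UTC
          workflow_dispatch: {}    # allow manual runs for debugging

        jobs:
          etl:
            runs-on: ubuntu-latest
            steps:
              - uses: actions/checkout@v4
              - uses: actions/setup-python@v5
                with:
                  python-version: "3.12"
              - run: pip install -r requirements.txt
              - run: python src/main.py
                env:
                  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
              - uses: actions/upload-artifact@v4
                with:
                  name: duckdb-output
                  path: data/output.duckdb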

    🚀 Optional: Deploy as Serverless Function

    You can deploy the ETL pipeline to run automatically in the cloud once it's stable.

    Option A: AWS Lambda

    • Containerize or zip your ETL function (a thin handler sketch follows this list)
    • Use Terraform, OpenTofu, or Pulumi for infrastructure management
    • Schedule the function using EventBridge rules (cron expressions)
    • Store config and secrets securely using AWS Parameter Store or Secrets Manager
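
    The Lambda function itself only needs a thin handler around the existing entrypoint. A minimal sketch, assuming a run_etl entrypoint in src/main.py and an S3 bucket for the output (both names are hypothetical):

        import boto3

        from src.main import run_etl  # hypothetical entrypoint

        def handler(event, context):
            # Lambda only allows writes under /tmp, so build the database there.
            db_path = "/tmp/output.duckdb"
            run_etl(db_path=db_path)
            # Persist the result outside the ephemeral container.
            boto3.client("s3").upload_file(db_path, "my-etl-bucket", "github/output.duckdb")
            return {"status": "ok"}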

    Option B: GCP Cloud Functions

    • Package your ETL as an HTTP-triggered or scheduled function (see the sketch below)
    • Use Cloud Scheduler to trigger execution
    • Store secrets securely using Google Secret Manager
    • Manage deployment and permissions using Terraform, OpenTofu, or Pulumi
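
    The Cloud Functions variant is similar. A minimal HTTP-triggered sketch using the functions-framework package (the run_etl entrypoint is hypothetical):

        import functions_framework

        from src.main import run_etl  # hypothetical entrypoint

        @functions_framework.http
        def run(request):
            # Cloud Functions also restricts writes to /tmp; upload the file to
            # Cloud Storage afterwards if you need to keep it.
            run_etl(db_path="/tmp/output.duckdb")
            return ("ok", 200)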

    📦 Deliverables

    • Fully automated GitHub-to-DuckDB ETL pipeline
    • Daily GitHub Actions schedule
    • Reusable DLT + Polars codebase
    • Local analytics database (DuckDB)
    • Optional: Cloud deployment instructions and infrastructure-as-code templates

    🧪 Optional Extensions

    • Add incremental extract logic using updated_at or pushed_at (see the sketch after this list)
    • Extend with contributors, issues, or PRs endpoints
    • Track metrics over time (snapshot history)
    • Visualize data using Observable, Streamlit, or Superset
    • Add linting, type checking, or Slack alerts to the CI/CD pipeline
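
    For the incremental-extract extension, dlt's built-in incremental helper can track the updated_at cursor between runs. A minimal sketch (fetch_repos and the org name are assumptions):

        import dlt

        from src.extract import fetch_repos  # hypothetical helper

        @dlt.resource(name="repos", write_disposition="merge", primary_key="id")
        def github_repos(
            updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z"),
        ):
            # Only yield repositories updated since the last successful run;
            # ISO 8601 timestamps compare correctly as strings.
            for repo in fetch_repos("my-org"):
                if repo["updated_at"] >= updated_at.last_value:
                    yield repo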

    Project Details

    Tools & Technologies

    Polars
    DLT
    DuckDB
    GitHub Actions
    Python
    Terraform/OpenTofu/Pulumi

    Difficulty Level

    Intermediate

    Estimated Duration

    4-6 hours
