⚙️ Project: Scheduled GitHub ETL with Polars, DLT & DuckDB
📌 Project Overview
In this project, you'll build a scheduled ETL pipeline that extracts data from the GitHub Repositories API, transforms it using Polars, and stores the results in a local DuckDB database. The pipeline runs daily via GitHub Actions, offering a fully automated and cloud-free environment that simulates a production-grade workflow.
When you're ready, you can extend the project by deploying it to a serverless cloud environment, using either AWS Lambda or Google Cloud Functions, managed via Terraform, OpenTofu, or Pulumi.
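At its core, the pipeline is a three-stage flow wired together in `src/main.py`. A minimal sketch, assuming hypothetical `fetch_repos`, `transform`, and `load` helpers in the modules described below:

```python
# src/main.py -- minimal sketch of the end-to-end flow
# (fetch_repos, transform, and load are illustrative names, not prescribed by the project)
from src.extract import fetch_repos
from src.load import load
from src.transform import transform


def run() -> None:
    raw = list(fetch_repos())       # Extract: pull repository metadata from the GitHub API
    df = transform(raw)             # Transform: clean and enrich with Polars
    load(df, "data/output.duckdb")  # Load: persist to the local DuckDB file


if __name__ == "__main__":
    run()
```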
🎯 Learning Objectives
- Extract structured data from the GitHub API
- Transform data efficiently using Polars
- Build an ETL pipeline using DLT
- Store data locally using DuckDB for offline analytics
- Automate the pipeline with GitHub Actions on a daily schedule
- Optionally deploy and manage the pipeline in a serverless cloud environment
📂 Project Structure
```
github-etl/
├── src/
│   ├── extract.py            # GitHub API client
│   ├── transform.py          # Polars transformation logic
│   ├── load.py               # Write to DuckDB
│   └── main.py               # Entrypoint for the ETL flow
├── dlt/
│   └── pipeline.py           # DLT pipeline definition
├── config/
│   └── config.toml           # Repo/org selection, GitHub token, file paths
├── data/
│   └── output.duckdb         # DuckDB output database
├── .github/
│   └── workflows/
│       └── schedule.yml      # GitHub Actions job definition
├── tests/
│   └── test_transform.py     # Unit tests for transformation logic
├── requirements.txt
└── README.md
```
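The `dlt/pipeline.py` module is where DLT ties the extraction resource to the DuckDB destination. A minimal sketch, assuming a `fetch_repos()` generator in `src/extract.py` (names are illustrative):

```python
# dlt/pipeline.py -- minimal sketch of a DLT pipeline targeting DuckDB
# (fetch_repos is an assumed generator in src/extract.py)
import dlt

from src.extract import fetch_repos


@dlt.resource(name="repositories", write_disposition="append")
def repositories():
    # Yield one dict per repository; DLT infers and evolves the schema automatically.
    yield from fetch_repos()


def run() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="github_etl",
        destination="duckdb",
        dataset_name="github",
    )
    print(pipeline.run(repositories()))


if __name__ == "__main__":
    run()
```

When run this way, DLT handles the load into DuckDB itself; `src/load.py` covers the plain Polars-to-DuckDB path shown in step 3.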
🔄 Step-by-Step Guide
1. Extract GitHub Data
- Use the GitHub REST API (v3) to retrieve:
  - Repository metadata: stars, forks, license, language, timestamps
- Implement support for pagination and basic rate limiting
- Configure target repos via `config.toml` or environment variables (see the extraction sketch below)
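A sketch of what the extraction client in `src/extract.py` could look like, using `requests` (the project may equally use another HTTP client or DLT's REST helpers); the org name is a placeholder that would normally come from `config.toml`:

```python
# src/extract.py -- minimal sketch of the GitHub extraction step
# (org name and function name are placeholders; the real values come from config.toml)
import os
import time

import requests

API = "https://api.github.com"


def fetch_repos(org: str = "duckdb", per_page: int = 100):
    """Yield repository metadata dicts for an organization, following pagination."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.getenv("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"

    url = f"{API}/orgs/{org}/repos"
    params = {"per_page": per_page}
    while url:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        # Very basic rate-limit handling: back off until the quota resets.
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            time.sleep(max(reset - time.time(), 1))
            continue
        resp.raise_for_status()
        yield from resp.json()
        # The "Link" header carries the next page URL, if any.
        url = resp.links.get("next", {}).get("url")
        params = {}  # the "next" URL already contains the query string
```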
2. Transform with Polars
- Normalize and clean JSON responses
- Filter for active repositories
- Add derived fields (e.g. "days since last push", "star/fork ratio")
- Prepare data as structured DataFrames
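One way the transformation in `src/transform.py` might look; the input column names follow GitHub's repository payload, and the derived field names are only suggestions:

```python
# src/transform.py -- minimal sketch of the Polars transformation step
# (derived column names are illustrative, not prescribed by the project)
from datetime import datetime, timezone

import polars as pl


def transform(raw: list[dict]) -> pl.DataFrame:
    now = datetime.now(timezone.utc).replace(tzinfo=None)  # naive UTC timestamp
    return (
        pl.DataFrame(raw)
        .filter(~pl.col("archived") & ~pl.col("disabled"))  # keep active repos only
        .with_columns(
            pl.col("pushed_at").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%SZ")
        )
        .with_columns(
            days_since_last_push=(pl.lit(now) - pl.col("pushed_at")).dt.total_days(),
            star_fork_ratio=pl.col("stargazers_count")
            / pl.col("forks_count").clip(lower_bound=1),
        )
        .select(
            "name", "stargazers_count", "forks_count", "language",
            "days_since_last_push", "star_fork_ratio",
        )
    )
```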
3. Load into DuckDB
- Store output in a local `.duckdb` database file
- Support append or overwrite modes for repeated daily runs
- Organize tables with a partitioning or snapshot strategy
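A sketch of `src/load.py` that writes the transformed DataFrame into the local DuckDB file, with a simple append/overwrite switch (table and path names are illustrative):

```python
# src/load.py -- minimal sketch: write a Polars DataFrame into a local DuckDB file
# (table name "repositories" and the default path are illustrative)
import duckdb
import polars as pl


def load(df: pl.DataFrame, db_path: str = "data/output.duckdb", mode: str = "append") -> None:
    con = duckdb.connect(db_path)
    # Register the DataFrame (via Arrow) so SQL can reference it as "repos_df".
    con.register("repos_df", df.to_arrow())
    if mode == "overwrite":
        con.execute("CREATE OR REPLACE TABLE repositories AS SELECT * FROM repos_df")
    else:
        # Create an empty table with the right schema on first run, then append.
        con.execute("CREATE TABLE IF NOT EXISTS repositories AS SELECT * FROM repos_df LIMIT 0")
        con.execute("INSERT INTO repositories SELECT * FROM repos_df")
    con.close()
```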
4. Automate with GitHub Actions
- Set up a workflow that:
  - Installs dependencies
  - Runs the ETL pipeline on a daily cron schedule
  - Stores the updated DuckDB file in the repository or uploads it as a GitHub Actions artifact
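One possible shape for `.github/workflows/schedule.yml`; the cron time, Python version, and artifact name are arbitrary choices, not requirements:

```yaml
# .github/workflows/schedule.yml -- sketch of a daily schedule (times and names are illustrative)
name: daily-github-etl

on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
  workflow_dispatch:       # allow manual runs

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python src/main.py
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - uses: actions/upload-artifact@v4
        with:
          name: duckdb-output
          path: data/output.duckdb
```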
🚀 Optional: Deploy as Serverless Function
You can deploy the ETL pipeline to run automatically in the cloud once it's stable.
Option A: AWS Lambda
- Containerize or zip your ETL function
- Use Terraform, OpenTofu, or Pulumi for infrastructure management
- Schedule the function using EventBridge rules (cron expressions)
- Store config and secrets securely using AWS Parameter Store or Secrets Manager
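The Lambda entrypoint can be a thin wrapper around the existing pipeline. A sketch, assuming the ETL is exposed as `src.main.run()` (an illustrative name):

```python
# handler.py -- minimal sketch of a Lambda entrypoint wrapping the ETL
# (assumes the ETL is exposed as src.main.run(); the name is illustrative)
from src.main import run


def lambda_handler(event, context):
    run()
    return {"statusCode": 200, "body": "ETL completed"}
```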
Option B: GCP Cloud Functions
- Package your ETL as an HTTP-triggered or scheduled function
- Use Cloud Scheduler to trigger execution
- Store secrets securely using Google Secret Manager
- Manage deployment and permissions using Terraform, OpenTofu, or Pulumi
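Similarly for Cloud Functions, a sketch of an HTTP-triggered entrypoint using the Functions Framework, again assuming an `src.main.run()` helper:

```python
# main.py -- minimal sketch of an HTTP-triggered Cloud Function entrypoint
# (assumes the ETL is exposed as src.main.run(); the name is illustrative)
import functions_framework

from src.main import run


@functions_framework.http
def run_etl(request):
    run()
    return ("ETL completed", 200)
```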
📦 Deliverables
- Fully automated GitHub-to-DuckDB ETL pipeline
- Daily GitHub Actions schedule
- Reusable DLT + Polars codebase
- Local analytics database (DuckDB)
- Optional: Cloud deployment instructions and infrastructure-as-code templates
🧪 Optional Extensions
- Add incremental extract logic using `updated_at` or `pushed_at` (see the sketch after this list)
- Extend with the contributors, issues, or pull requests endpoints
- Track metrics over time (snapshot history)
- Visualize data using Observable, Streamlit, or Superset
- Add linting, type checking, or Slack alerts to the CI/CD pipeline
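For the incremental-extraction extension mentioned above, DLT's incremental cursor is one way to track `updated_at` between runs. A sketch with illustrative names:

```python
# Sketch of incremental extraction using DLT's built-in cursor support
# (fetch_repos is an assumed generator; resource and field names are illustrative)
import dlt

from src.extract import fetch_repos


@dlt.resource(name="repositories", write_disposition="merge", primary_key="id")
def repositories(
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z"),
):
    # Only yield repositories changed since the cursor value stored by the previous run.
    for repo in fetch_repos():
        if repo["updated_at"] > updated_at.start_value:
            yield repo
```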