⚙️ Project: Scheduled GitHub ETL with Polars, DLT & DuckDB
📌 Project Overview
In this project, you'll build a scheduled ETL pipeline that extracts data from the GitHub Repositories API, transforms it using Polars, and stores the results in a local DuckDB database. The pipeline runs daily via GitHub Actions, offering a fully automated and cloud-free environment that simulates a production-grade workflow.
When you're ready, you can extend the project by deploying it to a serverless cloud environment, using either AWS Lambda or Google Cloud Functions, managed via Terraform, OpenTofu, or Pulumi.
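At its core, the pipeline is a three-stage flow wired together in `src/main.py`. A minimal sketch, assuming hypothetical `fetch_repos`, `transform`, and `load` helpers in the modules described below:

```python
# src/main.py -- minimal sketch of the end-to-end flow
# (fetch_repos, transform, and load are illustrative names, not prescribed by the project)
from src.extract import fetch_repos
from src.load import load
from src.transform import transform


def run() -> None:
    raw = list(fetch_repos())       # Extract: pull repository metadata from the GitHub API
    df = transform(raw)             # Transform: clean and enrich with Polars
    load(df, "data/output.duckdb")  # Load: persist to the local DuckDB file


if __name__ == "__main__":
    run()
```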
🎯 Learning Objectives
- Extract structured data from the GitHub API
- Transform data efficiently using Polars
- Build an ETL pipeline using DLT
- Store data locally using DuckDB for offline analytics
- Automate the pipeline with GitHub Actions on a daily schedule
- Optionally deploy and manage the pipeline in a serverless cloud environment
📂 Project Structure
```
github-etl/
├── src/
│   ├── extract.py            # GitHub API client
│   ├── transform.py          # Polars transformation logic
│   ├── load.py               # Write to DuckDB
│   └── main.py               # Entrypoint for the ETL flow
├── dlt/
│   └── pipeline.py           # DLT pipeline definition
├── config/
│   └── config.toml           # Repo/org selection, GitHub token, file paths
├── data/
│   └── output.duckdb         # DuckDB output database
├── .github/
│   └── workflows/
│       └── schedule.yml      # GitHub Actions job definition
├── tests/
│   └── test_transform.py     # Unit tests for transformation logic
├── requirements.txt
└── README.md
```
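The `dlt/pipeline.py` module is where DLT ties the extraction resource to the DuckDB destination. A minimal sketch, assuming a `fetch_repos()` generator in `src/extract.py` (names are illustrative):

```python
# dlt/pipeline.py -- minimal sketch of a DLT pipeline targeting DuckDB
# (fetch_repos is an assumed generator in src/extract.py)
import dlt

from src.extract import fetch_repos


@dlt.resource(name="repositories", write_disposition="append")
def repositories():
    # Yield one dict per repository; DLT infers and evolves the schema automatically.
    yield from fetch_repos()


def run() -> None:
    pipeline = dlt.pipeline(
        pipeline_name="github_etl",
        destination="duckdb",
        dataset_name="github",
    )
    print(pipeline.run(repositories()))


if __name__ == "__main__":
    run()
```

When run this way, DLT handles the load into DuckDB itself; `src/load.py` covers the plain Polars-to-DuckDB path shown in step 3.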
🔄 Step-by-Step Guide
1. Extract GitHub Data
- Use the GitHub REST API (v3) to retrieve:
  - Repository metadata: stars, forks, license, language, timestamps
- Implement support for pagination and basic rate limiting
- Configure target repos via `config.toml` or environment variables (see the extraction sketch below)
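A sketch of what the extraction client in `src/extract.py` could look like, using `requests` (the project may equally use another HTTP client or DLT's REST helpers); the org name is a placeholder that would normally come from `config.toml`:

```python
# src/extract.py -- minimal sketch of the GitHub extraction step
# (org name and function name are placeholders; the real values come from config.toml)
import os
import time

import requests

API = "https://api.github.com"


def fetch_repos(org: str = "duckdb", per_page: int = 100):
    """Yield repository metadata dicts for an organization, following pagination."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.getenv("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"

    url = f"{API}/orgs/{org}/repos"
    params = {"per_page": per_page}
    while url:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        # Very basic rate-limit handling: back off until the quota resets.
        if resp.status_code == 403 and resp.headers.get("X-RateLimit-Remaining") == "0":
            reset = int(resp.headers.get("X-RateLimit-Reset", time.time() + 60))
            time.sleep(max(reset - time.time(), 1))
            continue
        resp.raise_for_status()
        yield from resp.json()
        # The "Link" header carries the next page URL, if any.
        url = resp.links.get("next", {}).get("url")
        params = {}  # the "next" URL already contains the query string
```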
2. Transform with Polars
- Normalize and clean JSON responses
- Filter for active repositories
- Add derived fields (e.g. "days since last push", "star/fork ratio")
- Prepare data as structured DataFrames
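One way the transformation in `src/transform.py` might look; the input column names follow GitHub's repository payload, and the derived field names are only suggestions:

```python
# src/transform.py -- minimal sketch of the Polars transformation step
# (derived column names are illustrative, not prescribed by the project)
from datetime import datetime, timezone

import polars as pl


def transform(raw: list[dict]) -> pl.DataFrame:
    now = datetime.now(timezone.utc).replace(tzinfo=None)  # naive UTC timestamp
    return (
        pl.DataFrame(raw)
        .filter(~pl.col("archived") & ~pl.col("disabled"))  # keep active repos only
        .with_columns(
            pl.col("pushed_at").str.strptime(pl.Datetime, "%Y-%m-%dT%H:%M:%SZ")
        )
        .with_columns(
            days_since_last_push=(pl.lit(now) - pl.col("pushed_at")).dt.total_days(),
            star_fork_ratio=pl.col("stargazers_count")
            / pl.col("forks_count").clip(lower_bound=1),
        )
        .select(
            "name", "stargazers_count", "forks_count", "language",
            "days_since_last_push", "star_fork_ratio",
        )
    )
```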
3. Load into DuckDB
- Store output in a local `.duckdb` database file
- Support append or overwrite modes for repeated daily runs
- Organize tables with a partitioning or snapshot strategy
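A sketch of `src/load.py` that writes the transformed DataFrame into the local DuckDB file, with a simple append/overwrite switch (table and path names are illustrative):

```python
# src/load.py -- minimal sketch: write a Polars DataFrame into a local DuckDB file
# (table name "repositories" and the default path are illustrative)
import duckdb
import polars as pl


def load(df: pl.DataFrame, db_path: str = "data/output.duckdb", mode: str = "append") -> None:
    con = duckdb.connect(db_path)
    # Register the DataFrame (via Arrow) so SQL can reference it as "repos_df".
    con.register("repos_df", df.to_arrow())
    if mode == "overwrite":
        con.execute("CREATE OR REPLACE TABLE repositories AS SELECT * FROM repos_df")
    else:
        # Create an empty table with the right schema on first run, then append.
        con.execute("CREATE TABLE IF NOT EXISTS repositories AS SELECT * FROM repos_df LIMIT 0")
        con.execute("INSERT INTO repositories SELECT * FROM repos_df")
    con.close()
```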
4. Automate with GitHub Actions
- Set up a workflow that:
  - Installs dependencies
  - Runs the ETL pipeline on a daily cron schedule
  - Stores the updated DuckDB file in the repository or uploads it as a GitHub Actions artifact
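One possible shape for `.github/workflows/schedule.yml`; the cron time, Python version, and artifact name are arbitrary choices, not requirements:

```yaml
# .github/workflows/schedule.yml -- sketch of a daily schedule (times and names are illustrative)
name: daily-github-etl

on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
  workflow_dispatch:       # allow manual runs

jobs:
  etl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python src/main.py
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      - uses: actions/upload-artifact@v4
        with:
          name: duckdb-output
          path: data/output.duckdb
```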
🚀 Optional: Deploy as Serverless Function
You can deploy the ETL pipeline to run automatically in the cloud once it's stable.
Option A: AWS Lambda
- Containerize or zip your ETL function
- Use Terraform, OpenTofu, or Pulumi for infrastructure management
- Schedule the function using EventBridge rules (cron expressions)
- Store config and secrets securely using AWS Parameter Store or Secrets Manager
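The Lambda entrypoint can be a thin wrapper around the existing pipeline. A sketch, assuming the ETL is exposed as `src.main.run()` (an illustrative name):

```python
# handler.py -- minimal sketch of a Lambda entrypoint wrapping the ETL
# (assumes the ETL is exposed as src.main.run(); the name is illustrative)
from src.main import run


def lambda_handler(event, context):
    run()
    return {"statusCode": 200, "body": "ETL completed"}
```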
Option B: GCP Cloud Functions
- Package your ETL as an HTTP-triggered or scheduled function
- Use Cloud Scheduler to trigger execution
- Store secrets securely using Google Secret Manager
- Manage deployment and permissions using Terraform, OpenTofu, or Pulumi
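Similarly for Cloud Functions, a sketch of an HTTP-triggered entrypoint using the Functions Framework, again assuming an `src.main.run()` helper:

```python
# main.py -- minimal sketch of an HTTP-triggered Cloud Function entrypoint
# (assumes the ETL is exposed as src.main.run(); the name is illustrative)
import functions_framework

from src.main import run


@functions_framework.http
def run_etl(request):
    run()
    return ("ETL completed", 200)
```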
📦 Deliverables
- Fully automated GitHub-to-DuckDB ETL pipeline
- Daily GitHub Actions schedule
- Reusable DLT + Polars codebase
- Local analytics database (DuckDB)
- Optional: Cloud deployment instructions and infrastructure-as-code templates
🧪 Optional Extensions
- Add incremental extract logic using `updated_at` or `pushed_at` (see the sketch after this list)
- Extend with the contributors, issues, or pull requests endpoints
- Track metrics over time (snapshot history)
- Visualize data using Observable, Streamlit, or Superset
- Add linting, type checking, or Slack alerts to the CI/CD pipeline
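For the incremental-extraction extension mentioned above, DLT's incremental cursor is one way to track `updated_at` between runs. A sketch with illustrative names:

```python
# Sketch of incremental extraction using DLT's built-in cursor support
# (fetch_repos is an assumed generator; resource and field names are illustrative)
import dlt

from src.extract import fetch_repos


@dlt.resource(name="repositories", write_disposition="merge", primary_key="id")
def repositories(
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00Z"),
):
    # Only yield repositories changed since the cursor value stored by the previous run.
    for repo in fetch_repos():
        if repo["updated_at"] > updated_at.start_value:
            yield repo
```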