# GitHub Events Analytics with PySpark

Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.
## Project Overview
In this project, you'll build a realistic, production-style batch data pipeline using Apache Spark (running in Docker) to process public GitHub event logs. You will parse raw JSON logs, transform and filter event data, compute usage metrics over time, and write out partitioned Parquet datasets ready for analytics.
This project simulates the kind of job a data engineer might build for product analytics, open-source contribution insights, or developer engagement reporting, without any external APIs or joins.
## Learning Objectives
- Build a modular Spark pipeline using real code (no notebooks)
- Parse and normalize complex nested JSON data
- Filter and categorize high-volume event logs
- Compute time-based and entity-based aggregations
- Partition and store structured output efficiently for downstream use
- Run everything inside Docker for a reproducible, production-like setup
## Project Structure
```
github-analytics/
├── docker/
│   └── spark/              # Dockerfile for Spark runtime
├── data/
│   ├── raw/                # GitHub Archive .json.gz files
│   └── output/             # Partitioned Parquet output
├── jobs/
│   ├── ingest.py           # Read and normalize raw JSON
│   ├── transform.py        # Filter, clean, enrich with derived columns
│   └── aggregate.py        # Compute KPIs and metrics
├── config/
│   └── job_config.yaml     # Input/output paths, date ranges, etc.
├── scripts/
│   └── run_job.sh          # CLI wrapper for spark-submit
├── tests/                  # Unit tests for logic and edge cases
├── requirements.txt
└── README.md
```
## Pipeline Workflow
### 1. Ingest Job

- Input: One or more hourly `.json.gz` files from https://www.gharchive.org/
- Output: Raw DataFrame with structured fields
- Tasks:
  - Read the newline-delimited JSON event records
  - Extract top-level fields: `type`, `created_at`, `repo.name`, `actor.login`, `org.login`
  - Normalize nested structures from `payload` (e.g. extract PR number or issue labels)
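A minimal PySpark sketch of this ingest step, assuming the GH Archive field names above; the input path, app name, and the choice of `payload` field to extract are illustrative:

```python
# jobs/ingest.py (sketch) -- reads one day's hourly GH Archive files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("github-ingest").getOrCreate()

# GH Archive files are gzipped, newline-delimited JSON; Spark handles the
# .gz decompression transparently.
raw = spark.read.json("data/raw/2025-06-01-*.json.gz")

events = raw.select(
    F.col("id"),          # event id, useful for the optional dedup extension
    F.col("type"),
    F.col("created_at"),
    F.col("repo.name").alias("repo_name"),
    F.col("actor.login").alias("actor_login"),
    F.col("org.login").alias("org_login"),
    # Only present when the batch contains PullRequestEvents; with schema
    # inference a missing field fails the select, so guard accordingly
    # (or pass an explicit schema -- see the optional extension below).
    F.col("payload.pull_request.number").alias("pr_number"),
)
```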
### 2. Transform Job

- Input: Raw DataFrame from `ingest.py`
- Output: Cleaned and enriched DataFrame
- Tasks:
  - Filter for selected event types: `PushEvent`, `PullRequestEvent`, `IssuesEvent`, `WatchEvent`
  - Derive:
    - `event_date` and `event_hour` from `created_at`
    - `is_bot` flag (based on `actor.login`)
    - `event_category` (e.g., content, collaboration, passive)
  - Drop malformed or low-quality records
  - Optional: Repartition by `event_date`
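A sketch of the transform step, continuing from the columns selected in the ingest sketch; the bot heuristic (`[bot]` suffix) and the category mapping are assumptions rather than part of the brief:

```python
# jobs/transform.py (sketch) -- filter, clean, and enrich the raw events.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

KEEP_TYPES = ["PushEvent", "PullRequestEvent", "IssuesEvent", "WatchEvent"]

def transform(events: DataFrame) -> DataFrame:
    # created_at is ISO 8601 (e.g. 2025-06-01T12:34:56Z); pass an explicit
    # format to to_timestamp if your Spark version does not parse it directly.
    return (
        events
        .filter(F.col("type").isin(KEEP_TYPES))
        .withColumn("event_ts", F.to_timestamp("created_at"))
        .withColumn("event_date", F.to_date("event_ts"))
        .withColumn("event_hour", F.hour("event_ts"))
        # Assumed heuristic: GitHub app accounts usually end in "[bot]".
        .withColumn("is_bot", F.col("actor_login").endswith("[bot]"))
        .withColumn(
            "event_category",
            F.when(F.col("type") == "PushEvent", "content")
             .when(F.col("type").isin("PullRequestEvent", "IssuesEvent"), "collaboration")
             .otherwise("passive"),  # WatchEvent
        )
        .dropna(subset=["type", "created_at", "repo_name"])  # drop malformed rows
        .repartition("event_date")  # optional: lines up with the partitioned write
    )
```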
### 3. Aggregate Job

- Input: Cleaned DataFrame from `transform.py`
- Output: Daily metrics in partitioned Parquet format
- Tasks:
  - Count events by `event_type`, `event_hour`, and `is_bot`
  - Identify top repositories and actors by activity
  - Compute total daily volume
  - Write partitioned output to `data/output/event_date=YYYY-MM-DD/`
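A sketch of the aggregate step under the same assumptions; the metric names and per-metric sub-folders are illustrative, and each dataset is partitioned by `event_date` so files land under `event_date=YYYY-MM-DD/` as in the brief:

```python
# jobs/aggregate.py (sketch) -- daily metrics written as partitioned Parquet.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def aggregate(clean: DataFrame, output_path: str = "data/output") -> None:
    hourly = (  # counts by type, hour, and bot flag (`type` as named above)
        clean.groupBy("event_date", "type", "event_hour", "is_bot")
             .agg(F.count("*").alias("event_count"))
    )
    top_repos = (  # most active repositories (actors: group on actor_login)
        clean.groupBy("event_date", "repo_name")
             .agg(F.count("*").alias("event_count"))
             .orderBy(F.desc("event_count"))
             .limit(20)
    )
    daily_totals = clean.groupBy("event_date").agg(F.count("*").alias("total_events"))

    for name, df in [("hourly_counts", hourly),
                     ("top_repos", top_repos),
                     ("daily_totals", daily_totals)]:
        (df.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet(f"{output_path}/{name}"))
```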
## Execution
Each job is runnable via Docker and `spark-submit`, e.g.:

```bash
./scripts/run_job.sh ingest --date=2025-06-01
./scripts/run_job.sh transform --date=2025-06-01
./scripts/run_job.sh aggregate --date=2025-06-01
```
All jobs are:
- Date-parametrized
- Idempotent (safe to rerun)
- Designed to run independently or as a chain
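One way the `--date` parametrization and idempotency could fit together is sketched below; the intermediate input path, app name, and output location are assumptions:

```python
# Sketch of a date-parametrized, idempotent job entry point.
import argparse

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", required=True, help="Run date, YYYY-MM-DD")
    args = parser.parse_args()

    spark = SparkSession.builder.appName(f"github-analytics-{args.date}").getOrCreate()

    # Dynamic partition overwrite: a rerun for the same --date replaces only
    # that date's event_date=... partition instead of truncating everything,
    # which is what makes the jobs safe to rerun.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    clean = (
        spark.read.parquet("data/staging/clean")   # assumed intermediate path
             .where(F.col("event_date") == args.date)
    )
    daily = clean.groupBy("event_date", "type").agg(F.count("*").alias("event_count"))
    (daily.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("data/output/daily_counts"))    # assumed output path

if __name__ == "__main__":
    main()
```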
## Deliverables
- Production-style Spark project running in Docker
- Ingestion, transformation, and aggregation jobs
- Output: Partitioned Parquet dataset of GitHub metrics
- `README.md` with project setup, usage, and examples
- (Optional) Unit tests for logic and schema integrity
## Optional Extensions
- Add schema validation (e.g., a PySpark `StructType`)
- Write metrics to a local DuckDB database for quick analysis
- Implement lightweight event deduplication using event IDs
- Add summary CSV exports for top-N stats (e.g., top repos)
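For the schema-validation extension, a minimal `StructType` sketch might look like this; only a small subset of the GH Archive fields is spelled out, so extend it as needed:

```python
# Sketch: explicit schema for the raw GH Archive read (subset of fields).
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("github-schema-check").getOrCreate()

EVENT_SCHEMA = StructType([
    StructField("id", StringType(), True),
    StructField("type", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("repo", StructType([StructField("name", StringType(), True)]), True),
    StructField("actor", StructType([StructField("login", StringType(), True)]), True),
    StructField("org", StructType([StructField("login", StringType(), True)]), True),
])

# With an explicit schema Spark skips inference (one less pass over the data),
# and unexpected or missing fields surface as nulls instead of silently
# changing the inferred schema between runs.
raw = spark.read.schema(EVENT_SCHEMA).json("data/raw/2025-06-01-*.json.gz")
```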