# GitHub Events Analytics with PySpark

Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.
## Project Overview
In this project, you'll build a realistic, production-style batch data pipeline using Apache Spark (running in Docker) to process public GitHub event logs. You will parse raw JSON logs, transform and filter event data, compute usage metrics over time, and write out partitioned Parquet datasets ready for analytics.
This project simulates the kind of job a data engineer might build for product analytics, open-source contribution insights, or developer engagement reporting, without any external APIs or joins.
## Learning Objectives
- Build a modular Spark pipeline using real code (no notebooks)
- Parse and normalize complex nested JSON data
- Filter and categorize high-volume event logs
- Compute time-based and entity-based aggregations
- Partition and store structured output efficiently for downstream use
- Run everything inside Docker for a reproducible, production-like setup
## Project Structure
```
github-analytics/
├── docker/
│   └── spark/              # Dockerfile for Spark runtime
├── data/
│   ├── raw/                # GitHub Archive .json.gz files
│   └── output/             # Partitioned Parquet output
├── jobs/
│   ├── ingest.py           # Read and normalize raw JSON
│   ├── transform.py        # Filter, clean, enrich with derived columns
│   └── aggregate.py        # Compute KPIs and metrics
├── config/
│   └── job_config.yaml     # Input/output paths, date ranges, etc.
├── scripts/
│   └── run_job.sh          # CLI wrapper for spark-submit
├── tests/                  # Unit tests for logic and edge cases
├── requirements.txt
└── README.md
```
## Pipeline Workflow
### 1. Ingest Job

- Input: One or more hourly `.json.gz` files from https://www.gharchive.org/
- Output: Raw DataFrame with structured fields
- Tasks:
  - Read the newline-delimited JSON event records
  - Extract top-level fields: `type`, `created_at`, `repo.name`, `actor.login`, `org.login`
  - Normalize nested structures from `payload` (e.g. extract PR number or issue labels)
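A minimal PySpark sketch of this ingest step, assuming the GH Archive field names above; the input path, app name, and the choice of `payload` field to extract are illustrative:

```python
# jobs/ingest.py (sketch) -- reads one day's hourly GH Archive files.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("github-ingest").getOrCreate()

# GH Archive files are gzipped, newline-delimited JSON; Spark handles the
# .gz decompression transparently.
raw = spark.read.json("data/raw/2025-06-01-*.json.gz")

events = raw.select(
    F.col("id"),          # event id, useful for the optional dedup extension
    F.col("type"),
    F.col("created_at"),
    F.col("repo.name").alias("repo_name"),
    F.col("actor.login").alias("actor_login"),
    F.col("org.login").alias("org_login"),
    # Only present when the batch contains PullRequestEvents; with schema
    # inference a missing field fails the select, so guard accordingly
    # (or pass an explicit schema -- see the optional extension below).
    F.col("payload.pull_request.number").alias("pr_number"),
)
```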
### 2. Transform Job

- Input: Raw DataFrame from `ingest.py`
- Output: Cleaned and enriched DataFrame
- Tasks:
  - Filter for selected event types: `PushEvent`, `PullRequestEvent`, `IssuesEvent`, `WatchEvent`
  - Derive:
    - `event_date` and `event_hour` from `created_at`
    - `is_bot` flag (based on `actor.login`)
    - `event_category` (e.g., content, collaboration, passive)
  - Drop malformed or low-quality records
  - Optional: Repartition by `event_date`
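A sketch of the transform step, continuing from the columns selected in the ingest sketch; the bot heuristic (`[bot]` suffix) and the category mapping are assumptions rather than part of the brief:

```python
# jobs/transform.py (sketch) -- filter, clean, and enrich the raw events.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

KEEP_TYPES = ["PushEvent", "PullRequestEvent", "IssuesEvent", "WatchEvent"]

def transform(events: DataFrame) -> DataFrame:
    # created_at is ISO 8601 (e.g. 2025-06-01T12:34:56Z); pass an explicit
    # format to to_timestamp if your Spark version does not parse it directly.
    return (
        events
        .filter(F.col("type").isin(KEEP_TYPES))
        .withColumn("event_ts", F.to_timestamp("created_at"))
        .withColumn("event_date", F.to_date("event_ts"))
        .withColumn("event_hour", F.hour("event_ts"))
        # Assumed heuristic: GitHub app accounts usually end in "[bot]".
        .withColumn("is_bot", F.col("actor_login").endswith("[bot]"))
        .withColumn(
            "event_category",
            F.when(F.col("type") == "PushEvent", "content")
             .when(F.col("type").isin("PullRequestEvent", "IssuesEvent"), "collaboration")
             .otherwise("passive"),  # WatchEvent
        )
        .dropna(subset=["type", "created_at", "repo_name"])  # drop malformed rows
        .repartition("event_date")  # optional: lines up with the partitioned write
    )
```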
### 3. Aggregate Job

- Input: Cleaned DataFrame from `transform.py`
- Output: Daily metrics in partitioned Parquet format
- Tasks:
  - Count events by `event_type`, `event_hour`, and `is_bot`
  - Identify top repositories and actors by activity
  - Compute total daily volume
  - Write partitioned output to `data/output/event_date=YYYY-MM-DD/`
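A sketch of the aggregate step under the same assumptions; the metric names and per-metric sub-folders are illustrative, and each dataset is partitioned by `event_date` so files land under `event_date=YYYY-MM-DD/` as in the brief:

```python
# jobs/aggregate.py (sketch) -- daily metrics written as partitioned Parquet.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def aggregate(clean: DataFrame, output_path: str = "data/output") -> None:
    hourly = (  # counts by type, hour, and bot flag (`type` as named above)
        clean.groupBy("event_date", "type", "event_hour", "is_bot")
             .agg(F.count("*").alias("event_count"))
    )
    top_repos = (  # most active repositories (actors: group on actor_login)
        clean.groupBy("event_date", "repo_name")
             .agg(F.count("*").alias("event_count"))
             .orderBy(F.desc("event_count"))
             .limit(20)
    )
    daily_totals = clean.groupBy("event_date").agg(F.count("*").alias("total_events"))

    for name, df in [("hourly_counts", hourly),
                     ("top_repos", top_repos),
                     ("daily_totals", daily_totals)]:
        (df.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet(f"{output_path}/{name}"))
```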
## Execution
Each job is runnable via Docker and `spark-submit`, e.g.:

```bash
./scripts/run_job.sh ingest --date=2025-06-01
./scripts/run_job.sh transform --date=2025-06-01
./scripts/run_job.sh aggregate --date=2025-06-01
```
All jobs are:
- Date-parametrized
- Idempotent (safe to rerun)
- Designed to run independently or as a chain
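One way the `--date` parametrization and idempotency could fit together is sketched below; the intermediate input path, app name, and output location are assumptions:

```python
# Sketch of a date-parametrized, idempotent job entry point.
import argparse

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", required=True, help="Run date, YYYY-MM-DD")
    args = parser.parse_args()

    spark = SparkSession.builder.appName(f"github-analytics-{args.date}").getOrCreate()

    # Dynamic partition overwrite: a rerun for the same --date replaces only
    # that date's event_date=... partition instead of truncating everything,
    # which is what makes the jobs safe to rerun.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    clean = (
        spark.read.parquet("data/staging/clean")   # assumed intermediate path
             .where(F.col("event_date") == args.date)
    )
    daily = clean.groupBy("event_date", "type").agg(F.count("*").alias("event_count"))
    (daily.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("data/output/daily_counts"))    # assumed output path

if __name__ == "__main__":
    main()
```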
## Deliverables
- Production-style Spark project running in Docker
- Ingestion, transformation, and aggregation jobs
- Output: Partitioned Parquet dataset of GitHub metrics
- `README.md` with project setup, usage, and examples
- (Optional) Unit tests for logic and schema integrity
## Optional Extensions
- Add schema validation (e.g., a PySpark `StructType`)
- Write metrics to a local DuckDB database for quick analysis
- Implement lightweight event deduplication using event IDs
- Add summary CSV exports for top-N stats (e.g., top repos)
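For the schema-validation extension, a minimal `StructType` sketch might look like this; only a small subset of the GH Archive fields is spelled out, so extend it as needed:

```python
# Sketch: explicit schema for the raw GH Archive read (subset of fields).
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("github-schema-check").getOrCreate()

EVENT_SCHEMA = StructType([
    StructField("id", StringType(), True),
    StructField("type", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("repo", StructType([StructField("name", StringType(), True)]), True),
    StructField("actor", StructType([StructField("login", StringType(), True)]), True),
    StructField("org", StructType([StructField("login", StringType(), True)]), True),
])

# With an explicit schema Spark skips inference (one less pass over the data),
# and unexpected or missing fields surface as nulls instead of silently
# changing the inferred schema between runs.
raw = spark.read.schema(EVENT_SCHEMA).json("data/raw/2025-06-01-*.json.gz")
```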