GitHub Events Analytics with PySpark

    Build a production-style batch data pipeline using Apache Spark to process GitHub event logs

    ✓ Expert-Designed Project • Industry-Validated Implementation • Production-Ready Architecture

    This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master Apache Spark, Python, PySpark, Docker, Parquet, and JSON through hands-on implementation. Rated advanced, with comprehensive documentation and starter code.

    Advanced
    10-12 hours

    🚀 Project: GitHub Events Analytics with PySpark

    📌 Project Overview

    In this project, you'll build a realistic, production-style batch data pipeline using Apache Spark (running in Docker) to process public GitHub event logs. You will parse raw JSON logs, transform and filter event data, compute usage metrics over time, and write out partitioned Parquet datasets ready for analytics.

    This project simulates the kind of job a data engineer might build for product analytics, open-source contribution insights, or developer engagement reporting, all without any external APIs or joins.

    🎯 Learning Objectives

    • Build a modular Spark pipeline using real code (no notebooks)
    • Parse and normalize complex nested JSON data
    • Filter and categorize high-volume event logs
    • Compute time-based and entity-based aggregations
    • Partition and store structured output efficiently for downstream use
    • Run everything inside Docker for a reproducible, production-like setup

    📂 Project Structure

    github-analytics/
    ├── docker/
    │   └── spark/               # Dockerfile for Spark runtime
    ├── data/
    │   ├── raw/                 # GitHub Archive .json.gz files
    │   └── output/              # Partitioned Parquet output
    ├── jobs/
    │   ├── ingest.py            # Read and normalize raw JSON
    │   ├── transform.py         # Filter, clean, enrich with derived columns
    │   └── aggregate.py         # Compute KPIs and metrics
    ├── config/
    │   └── job_config.yaml      # Input/output paths, date ranges, etc.
    ├── scripts/
    │   └── run_job.sh           # CLI wrapper for spark-submit
    ├── tests/                   # Unit tests for logic and edge cases
    ├── requirements.txt
    └── README.md
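
    The jobs can share a small helper that loads config/job_config.yaml so paths and date ranges live in one place. A minimal sketch, assuming PyYAML is listed in requirements.txt; the key names below are illustrative rather than prescribed by the brief:

    # config loader (hypothetical helper; key names are illustrative)
    import yaml

    def load_config(path: str = "config/job_config.yaml") -> dict:
        """Return pipeline settings (paths, date ranges, event types) as a dict."""
        with open(path) as fh:
            return yaml.safe_load(fh)

    # Usage inside a job:
    # cfg = load_config()
    # raw_path = cfg["raw_path"]          # e.g. data/raw/
    # output_path = cfg["output_path"]    # e.g. data/output/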
    

    🔄 Pipeline Workflow

    1. Ingest Job

    • Input: One or more hourly .json.gz files from https://www.gharchive.org/
    • Output: Raw DataFrame with structured fields
    • Tasks:
      • Read the compressed JSON event logs (GitHub Archive files hold one JSON object per line)
      • Extract top-level fields: type, created_at, repo.name, actor.login, org.login
      • Normalize nested structures from payload (e.g. extract the PR number or issue labels), as in the sketch below
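
    A minimal PySpark sketch of this ingest step is shown below. The aliases and the two payload fields are illustrative, and because the archive files are newline-delimited, Spark's default JSON reader handles them one record per line:

    # jobs/ingest.py -- minimal sketch (schema is inferred; aliases are illustrative)
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("github-ingest").getOrCreate()

    # Gzipped, newline-delimited JSON from GH Archive; Spark decompresses .gz automatically.
    raw = spark.read.json("data/raw/2025-06-01-*.json.gz")

    events = raw.select(
        F.col("type"),
        F.col("created_at"),
        F.col("repo.name").alias("repo_name"),
        F.col("actor.login").alias("actor_login"),
        F.col("org.login").alias("org_login"),
        # Nested payload fields; null for event types that do not carry them.
        F.col("payload.pull_request.number").alias("pr_number"),
        F.col("payload.issue.labels.name").alias("issue_labels"),
    )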

    2. Transform Job

    • Input: Raw DataFrame from ingest.py
    • Output: Cleaned and enriched DataFrame
    • Tasks:
      • Filter for selected event types: PushEvent, PullRequestEvent, IssuesEvent, WatchEvent
      • Derive:
        • event_date, event_hour from created_at
        • is_bot flag (based on actor.login)
        • event_category (e.g., content, collaboration, passive)
      • Drop malformed or low-quality records
      • Optional: Repartition by event_date (see the sketch below)
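
    One way these transform tasks could look in PySpark; the bot heuristic and the category mapping are illustrative choices the brief leaves open, and column names follow the ingest sketch:

    # jobs/transform.py -- minimal sketch of the filtering and derived columns
    import pyspark.sql.functions as F

    KEEP_TYPES = ["PushEvent", "PullRequestEvent", "IssuesEvent", "WatchEvent"]

    def transform(events):
        return (
            events
            .filter(F.col("type").isin(KEEP_TYPES))
            .withColumn("created_ts", F.to_timestamp("created_at"))
            .withColumn("event_date", F.to_date("created_ts"))
            .withColumn("event_hour", F.hour("created_ts"))
            # Heuristic: GitHub bot accounts typically have logins ending in "[bot]".
            .withColumn("is_bot", F.col("actor_login").endswith("[bot]"))
            # Illustrative mapping from event type to a coarse category.
            .withColumn(
                "event_category",
                F.when(F.col("type") == "PushEvent", "content")
                 .when(F.col("type").isin("PullRequestEvent", "IssuesEvent"), "collaboration")
                 .otherwise("passive"),
            )
            # Drop records missing fields the aggregations depend on.
            .dropna(subset=["event_date", "repo_name", "actor_login"])
            .repartition("event_date")  # optional
        )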

    3. Aggregate Job

    • Input: Cleaned DataFrame from transform.py
    • Output: Daily metrics in partitioned Parquet format
    • Tasks:
      • Count events by event_type, event_hour, is_bot
      • Identify top repositories and actors by activity
      • Compute total daily volume
      • Write partitioned output to: data/output/event_date=YYYY-MM-DD/ (see the sketch below)
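
    A sketch of the aggregations and the partitioned Parquet write. The per-metric sub-folders under data/output/ are an assumed layout, and the column names follow the earlier sketches:

    # jobs/aggregate.py -- minimal sketch of the daily metrics and the Parquet write
    import pyspark.sql.functions as F

    def aggregate(cleaned, output_path="data/output"):
        # Event counts broken down by type, hour of day, and bot flag.
        hourly_counts = (
            cleaned.groupBy("event_date", "type", "event_hour", "is_bot")
                   .agg(F.count("*").alias("event_count"))
        )

        # Top repositories by activity; top actors work the same way on actor_login.
        top_repos = (
            cleaned.groupBy("event_date", "repo_name")
                   .agg(F.count("*").alias("event_count"))
                   .orderBy(F.desc("event_count"))
                   .limit(50)
        )

        # Total daily volume.
        daily_totals = cleaned.groupBy("event_date").agg(F.count("*").alias("event_count"))

        # Each metric becomes a Parquet dataset partitioned by event_date, so one day's
        # rows land under e.g. data/output/hourly_counts/event_date=2025-06-01/.
        for name, df in [("hourly_counts", hourly_counts),
                         ("top_repos", top_repos),
                         ("daily_totals", daily_totals)]:
            df.write.mode("overwrite").partitionBy("event_date").parquet(f"{output_path}/{name}")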

    🚀 Execution

    Each job is runnable via Docker and spark-submit, e.g.:

    ./scripts/run_job.sh ingest --date=2025-06-01
    ./scripts/run_job.sh transform --date=2025-06-01
    ./scripts/run_job.sh aggregate --date=2025-06-01
    

    All jobs are:

    • Date-parametrized
    • Idempotent (safe to rerun; see the note below)
    • Designed to run independently or as a chain
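
    One way to keep reruns idempotent is Spark's dynamic partition overwrite (spark.sql.sources.partitionOverwriteMode is a standard Spark SQL setting), so rerunning a date rewrites only that date's partition rather than the whole dataset. A sketch, using the daily_totals output from the aggregate sketch as the example:

    # spark is the job's SparkSession; only the affected event_date=... partition is replaced.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    (daily_totals.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("data/output/daily_totals"))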

    📦 Deliverables

    • Production-style Spark project running in Docker
    • Ingestion, transformation, and aggregation jobs
    • Output: Partitioned Parquet dataset of GitHub metrics
    • README.md with project setup, usage, and examples
    • (Optional) Unit tests for logic and schema integrity

    🧪 Optional Extensions

    • Add schema validation (e.g., a PySpark StructType; see the sketch below)
    • Write metrics to a local DuckDB database for quick analysis
    • Implement lightweight event deduplication using event IDs
    • Add summary CSV exports for top-N stats (e.g., top repos)
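
    For the schema-validation extension, a partial StructType for the top-level fields could look like the following; extend it with the payload sub-fields you actually use (created_at is kept as a string here and parsed later in transform.py):

    # Partial explicit schema for raw GitHub events (payload fields omitted).
    from pyspark.sql.types import StructType, StructField, StringType

    event_schema = StructType([
        StructField("id", StringType()),
        StructField("type", StringType()),
        StructField("created_at", StringType()),
        StructField("repo", StructType([StructField("name", StringType())])),
        StructField("actor", StructType([StructField("login", StringType())])),
        StructField("org", StructType([StructField("login", StringType())])),
    ])

    # spark is the job's SparkSession; an explicit schema skips inference and keeps columns stable.
    raw = spark.read.schema(event_schema).json("data/raw/2025-06-01-*.json.gz")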

    Project Details

    Tools & Technologies

    Apache Spark
    Python
    PySpark
    Docker
    Parquet
    JSON

    Difficulty Level

    Advanced

    Estimated Duration

    10-12 hours
