⚡ Real-Time Data Streaming with Apache Kafka

    Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.

    ✓ Expert-Designed Project • Industry-Validated Implementation • Production-Ready Architecture

    This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master Kafka, Confluent Cloud, Python, Polars, DuckDB, Parquet, Metabase, and Spark Structured Streaming through hands-on implementation. Rated advanced level, with comprehensive documentation and starter code.


    ⚡ Real-Time Data Streaming with Apache Kafka (JSON + Python + Polars)

    📌 Project Overview

    In this hands-on project, you'll build a real-time data pipeline using Apache Kafka on Confluent Cloud, with JSON message format and a Python-based consumer using Polars for transformation and analysis. You'll simulate realistic event data using NYC Taxi trip records, process it in real time, and visualize the results in a dashboard tool like Metabase.

    You'll also explore how this approach scales to large volumes using Spark Structured Streaming.


    🎯 Learning Objectives

    • Understand event-driven architecture and Kafka fundamentals
    • Use Kafka producers and consumers with JSON format
    • Perform streaming transformations using Polars in Python
    • Sink transformed data into DuckDB or Parquet for visualization
    • Explore scalability using Spark Structured Streaming

    ⌛ Estimated Duration

    🕒 8–12 hours
    🧠 Difficulty: Advanced


    📦 Recommended Dataset

    Dataset: NYC Taxi Trips

    • Simulate event streams by emitting rows from a historical CSV
    • Each record includes timestamps, distance, fare, and location IDs (sample record below)
    • Good for aggregations, rolling averages, and windowing
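
    For reference, one emitted record might look like the dict below before JSON encoding. The field names follow the public TLC yellow-taxi schema (exact columns vary by file vintage), and the values here are made up:

    # A single taxi_trips message before json.dumps (illustrative values)
    row_dict = {
        "tpep_pickup_datetime": "2023-01-15 08:12:41",
        "tpep_dropoff_datetime": "2023-01-15 08:25:03",
        "PULocationID": 142,
        "DOLocationID": 236,
        "trip_distance": 2.4,
        "fare_amount": 12.5,
        "VendorID": 1,
    }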

    ☁️ Using Confluent Cloud Free Tier

    1. Sign up at https://www.confluent.cloud
    2. Create a Basic Kafka Cluster (select any cloud provider)
    3. Create an API Key and download client.properties (a small parser for this file is sketched after this list)
    4. Create topics: taxi_trips, trip_summary
    5. Schema Registry is not needed (messages are JSON)
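
    The downloaded client.properties is a plain key=value file (bootstrap servers plus SASL credentials). A helper like the sketch below, which assumes that format, lets the producer and consumer scripts share it; read_config is our own naming convention, not part of the confluent_kafka library:

    def read_config(path="config/client.properties"):
        """Parse the key=value pairs Confluent Cloud writes to client.properties."""
        conf = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    conf[key.strip()] = value.strip()
        return conf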

    🗂 Suggested Project Structure

    kafka-json-python-polars/
    ├── config/
    │   └── client.properties
    ├── producers/
    │   └── nyc_taxi_producer.py
    ├── consumers/
    │   └── polars_consumer.py
    ├── dashboards/
    │   └── screenshots/
    ├── data/
    │   └── summary.parquet
    └── README.md
    

    🔄 Step-by-Step Guide

    1. 📤 Produce Taxi Data

    • Implement nyc_taxi_producer.py:
      • Reads from a local NYC taxi CSV file
      • Sends one row at a time to Kafka (taxi_trips) as JSON
      • Uses the confluent_kafka Python client (minimal call below; a fuller sketch follows it)
    from confluent_kafka import Producer
    import json

    # Connection details (bootstrap servers, SASL credentials) come from client.properties
    p = Producer({'bootstrap.servers': '...'})
    p.produce('taxi_trips', value=json.dumps(row_dict))  # row_dict: one CSV row as a dict
    p.flush()  # block until outstanding messages are delivered
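
    A fuller sketch of the produce loop, assuming the read_config helper from the setup section; the CSV path, throttle interval, and callback name are placeholders:

    import csv
    import json
    import time

    from confluent_kafka import Producer

    def on_delivery(err, msg):
        # Invoked from poll()/flush(); surfaces broker-side delivery failures
        if err is not None:
            print(f"Delivery failed: {err}")

    p = Producer(read_config("config/client.properties"))

    with open("data/yellow_tripdata_sample.csv") as f:  # path is a placeholder
        for row in csv.DictReader(f):
            p.produce("taxi_trips", value=json.dumps(row), callback=on_delivery)
            p.poll(0)         # serve pending delivery callbacks without blocking
            time.sleep(0.01)  # throttle to mimic a live event stream

    p.flush()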
    

    2. 🔁 Consume and Transform with Polars

    • Implement polars_consumer.py:
      • Consumes JSON messages from taxi_trips
      • Loads messages into a Polars DataFrame
      • Applies windowed aggregations and filters
      • Writes outputs to Parquet or DuckDB
    import polars as pl

    # batch_messages: a list of dicts decoded from consumed JSON messages
    df = pl.DataFrame(batch_messages).group_by("pickup_hour").agg([
        pl.col("fare_amount").mean().alias("avg_fare"),
        pl.len().alias("trip_count")
    ])
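
    One way to wire the consume loop end to end, again assuming read_config; the group.id, micro-batch size, timestamp format, and the pickup_hour derivation (from the TLC tpep_pickup_datetime column) are all assumptions:

    import json

    import polars as pl
    from confluent_kafka import Consumer

    conf = read_config("config/client.properties")
    conf.update({"group.id": "polars-consumer", "auto.offset.reset": "earliest"})

    c = Consumer(conf)
    c.subscribe(["taxi_trips"])

    batch = []
    while True:
        msg = c.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= 500:  # micro-batch size is arbitrary
            df = pl.DataFrame(batch).with_columns(
                # Derive the grouping key from the raw pickup timestamp
                pl.col("tpep_pickup_datetime")
                  .str.to_datetime("%Y-%m-%d %H:%M:%S")
                  .dt.hour()
                  .alias("pickup_hour"),
                pl.col("fare_amount").cast(pl.Float64),
            )
            summary = df.group_by("pickup_hour").agg(
                pl.col("fare_amount").mean().alias("avg_fare"),
                pl.len().alias("trip_count"),
            )
            summary.write_parquet("data/summary.parquet")  # overwritten per batch
            batch.clear()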
    

    3. 📊 Visualize with Metabase

    • Load the Parquet/DuckDB output into Metabase (a DuckDB loading sketch follows this list)
    • Build:
      • Trip count over time (line chart)
      • Average fare by vendor (bar chart)
      • Distribution of trip distances (histogram)
    • Enable auto-refresh (e.g., every 1 minute)
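
    If you choose the DuckDB sink, registering the consumer's Parquet output as a table is one SQL statement; Metabase can then connect to the .duckdb file (via a community DuckDB driver). The database file name below is an assumption:

    import duckdb

    # Materialize the streaming output as a table Metabase can query
    con = duckdb.connect("data/taxi.duckdb")
    con.execute("""
        CREATE OR REPLACE TABLE trip_summary AS
        SELECT * FROM read_parquet('data/summary.parquet')
    """)
    con.close()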

    ✅ Deliverables

    • Kafka producer and consumer Python scripts
    • JSON message format only (no Schema Registry)
    • Output tables in Parquet or DuckDB
    • Dashboard screenshots
    • README.md with setup instructions and diagrams

    🚀 Optional Extension: Large-Scale Upgrade

    🧠 Switch to Spark Structured Streaming

    • Replace the Python consumer with PySpark
    • Use Kafka source in Spark:
    df = spark.readStream.format("kafka")\
        .option("kafka.bootstrap.servers", "...")\
        .option("subscribe", "taxi_trips")\
        .load()  # bootstrap servers (plus SASL options for Confluent Cloud) are required
    
    • Apply Spark SQL or DataFrame transformations (sketched after this list)
    • Write to BigQuery, Delta Lake, or cloud storage
    • Useful for processing millions of messages per day or for scaling horizontally
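
    A sketch of the downstream transformation, assuming the same JSON fields as the Polars path; the schema subset, watermark, window width, and output paths are illustrative:

    from pyspark.sql import functions as F, types as T

    # Only the fields the aggregation needs; JSON values arrive as strings
    schema = T.StructType([
        T.StructField("tpep_pickup_datetime", T.StringType()),
        T.StructField("fare_amount", T.StringType()),
    ])

    trips = (
        df.select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
          .select("t.*")
          .withColumn("pickup_ts", F.to_timestamp("tpep_pickup_datetime"))
    )

    summary = (
        trips.withWatermark("pickup_ts", "10 minutes")
             .groupBy(F.window("pickup_ts", "1 hour"))
             .agg(
                 F.avg(F.col("fare_amount").cast("double")).alias("avg_fare"),
                 F.count("*").alias("trip_count"),
             )
    )

    query = (
        summary.writeStream
               .outputMode("append")  # rows emit once the watermark closes a window
               .format("parquet")
               .option("path", "data/spark_summary")
               .option("checkpointLocation", "data/_checkpoints")
               .start()
    )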

    Project Details

    Tools & Technologies

    Kafka
    Confluent Cloud
    Python
    Polars
    DuckDB
    Parquet
    Metabase
    Spark Structured Streaming

    Difficulty Level

    Advanced

    Estimated Duration

    8–12 hours
