⚡ Real-Time Data Streaming with Apache Kafka
Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.
This project simulates the kind of real-world streaming scenarios used at companies like Netflix, Airbnb, and Spotify. You'll get hands-on with Kafka, Confluent Cloud, Python, Polars, JSON messaging, DuckDB/Parquet, Metabase, and optionally Spark. Rated intermediate level, with comprehensive documentation and starter code.
⚡ Real-Time Data Streaming with Apache Kafka (JSON + Python + Polars)
📌 Project Overview
In this hands-on project, you'll build a real-time data pipeline using Apache Kafka on Confluent Cloud, with JSON message format and a Python-based consumer using Polars for transformation and analysis. You'll simulate realistic event data using NYC Taxi trip records, process it in real time, and visualize the results in a dashboard tool like Metabase.
You'll also explore how this approach scales to large volumes using Spark Structured Streaming.
🎯 Learning Objectives
- Understand event-driven architecture and Kafka fundamentals
- Use Kafka producers and consumers with JSON format
- Perform streaming transformations using Polars in Python
- Sink transformed data into DuckDB or Parquet for visualization
- Explore scalability using Spark Structured Streaming
⌛ Estimated Duration
🕒 8–12 hours
🧠 Difficulty: Intermediate
📦 Recommended Dataset
Dataset: NYC Taxi Trips
- Simulate event streams by emitting rows from a historical CSV
- Each record includes timestamps, distance, fare, location IDs
- Good for aggregations, rolling averages, and windowing
☁️ Using Confluent Cloud Free Tier
- Sign up at https://www.confluent.cloud
- Create a Basic Kafka Cluster (select any cloud provider)
- Create an API Key and download client.properties (a loader sketch follows this list)
- Create topics: taxi_trips, trip_summary
- Schema Registry is not needed (messages are JSON)
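The confluent_kafka clients take their settings as a plain Python dict, so a small helper can turn the downloaded client.properties into one. A minimal sketch, assuming the standard key/value format Confluent Cloud generates (bootstrap.servers, security.protocol, sasl.mechanisms, sasl.username, sasl.password); the helper name and module are illustrative, not part of any starter code:
# load_config.py (sketch): parse client.properties into a dict for confluent_kafka
def read_client_properties(path: str = "config/client.properties") -> dict:
    """Return key/value pairs from a Java-style .properties file."""
    config = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                config[key.strip()] = value.strip()
    return config
The producer and consumer sketches below assume this helper exists.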
🗂 Suggested Project Structure
kafka-json-python-polars/
├── config/
│ └── client.properties
├── producers/
│ └── nyc_taxi_producer.py
├── consumers/
│ └── polars_consumer.py
├── dashboards/
│ └── screenshots/
├── data/
│ └── summary.parquet
└── README.md
🔄 Step-by-Step Guide
1. 📤 Produce Taxi Data
- Implement nyc_taxi_producer.py:
  - Reads from a local NYC taxi CSV file
  - Sends one row at a time to Kafka (taxi_trips) as JSON
- Use the confluent_kafka Python client
from confluent_kafka import Producer
import json

p = Producer({'bootstrap.servers': '...'})  # plus the SASL settings from client.properties
p.produce('taxi_trips', value=json.dumps(row_dict))  # row_dict: one CSV row as a dict
p.flush()
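Building on that snippet, a fuller sketch of nyc_taxi_producer.py might replay the CSV row by row with a small delay to mimic live arrival. The CSV file name, the key column, and the read_client_properties helper are assumptions for illustration:
# nyc_taxi_producer.py (sketch): replay a historical CSV as a live JSON stream
import csv
import json
import time

from confluent_kafka import Producer

from load_config import read_client_properties  # helper sketched earlier (assumption)

producer = Producer(read_client_properties())

with open("data/yellow_tripdata.csv") as fh:  # hypothetical file name
    for row in csv.DictReader(fh):
        # Key by pickup zone so events for one zone stay on one partition
        producer.produce(
            "taxi_trips",
            key=row.get("PULocationID", ""),
            value=json.dumps(row),
        )
        producer.poll(0)   # serve delivery callbacks
        time.sleep(0.01)   # throttle to simulate real-time arrival

producer.flush()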
2. 🔁 Consume and Transform with Polars
- Implement polars_consumer.py:
  - Consumes JSON messages from taxi_trips
  - Loads messages into a Polars DataFrame
  - Applies windowed aggregations and filters
  - Writes outputs to Parquet or DuckDB
import polars as pl

# batch_messages: a list of dicts decoded from the consumed JSON messages
df = pl.DataFrame(batch_messages).group_by("pickup_hour").agg(
    pl.col("fare_amount").mean().alias("avg_fare"),
    pl.len().alias("trip_count"),
)
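One possible end-to-end shape for polars_consumer.py is a micro-batching loop: poll messages, decode the JSON, transform the batch with Polars, and write the summary to Parquet. The batch size, file paths, timestamp column name, and the read_client_properties helper are assumptions:
# polars_consumer.py (sketch): micro-batch consumer writing hourly summaries
import json

import polars as pl
from confluent_kafka import Consumer

from load_config import read_client_properties  # helper sketched earlier (assumption)

conf = read_client_properties() | {"group.id": "polars-demo", "auto.offset.reset": "earliest"}
consumer = Consumer(conf)
consumer.subscribe(["taxi_trips"])

BATCH_SIZE = 500
batch = []

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= BATCH_SIZE:
            summary = (
                pl.DataFrame(batch)
                .with_columns(
                    pl.col("tpep_pickup_datetime")      # column name is an assumption
                    .str.to_datetime()
                    .dt.hour()
                    .alias("pickup_hour"),
                    pl.col("fare_amount").cast(pl.Float64),
                )
                .group_by("pickup_hour")
                .agg(
                    pl.col("fare_amount").mean().alias("avg_fare"),
                    pl.len().alias("trip_count"),
                )
            )
            summary.write_parquet("data/summary.parquet")  # overwrite with the latest batch
            batch.clear()
finally:
    consumer.close()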
3. 📊 Visualize with Metabase
- Load the Parquet/DuckDB output into Metabase (see the loading sketch after this list)
- Build:
- Trip count over time (line chart)
- Average fare by vendor (bar chart)
- Distribution of trip distances (histogram)
- Enable auto-refresh (e.g., every 1 minute)
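Metabase queries a database rather than raw files, so one convenient route is to load the Parquet summary into a DuckDB file and connect Metabase to it via the community DuckDB driver. A minimal sketch, with the database path assumed for illustration:
# load_summary.py (sketch): materialize the Parquet output as a DuckDB table
import duckdb

con = duckdb.connect("data/taxi.duckdb")  # the file Metabase connects to (assumption)
con.execute(
    "CREATE OR REPLACE TABLE trip_summary "
    "AS SELECT * FROM read_parquet('data/summary.parquet')"
)
con.close()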
✅ Deliverables
- Kafka producer and consumer Python scripts
- JSON message format only (no Schema Registry)
- Output tables in Parquet or DuckDB
- Dashboard screenshots
- README.md with setup instructions and diagrams
🚀 Optional Extension: Large-Scale Upgrade
🧠 Switch to Spark Structured Streaming
- Replace the Python consumer with PySpark
- Use Kafka source in Spark:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "...")  # plus SASL options for Confluent Cloud
      .option("subscribe", "taxi_trips")
      .load())
- Apply Spark SQL or DataFrame transformations (see the sketch after this list)
- Write to BigQuery, Delta Lake, or cloud storage
- Useful for processing millions of messages/day or horizontal scaling
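Continuing from the Kafka source above, a sketch of the parse-aggregate-sink side might look like the following; the schema fields, window length, and console sink are assumptions to keep the example self-contained (a real deployment would write to Delta Lake, BigQuery, or cloud storage):
# spark_consumer.py (sketch): parse JSON values and aggregate per hourly window
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, count, from_json, window
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("taxi-stream").getOrCreate()

# All fields arrive as strings because the producer replays CSV rows
schema = StructType([
    StructField("tpep_pickup_datetime", StringType()),
    StructField("fare_amount", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "...")  # plus SASL options for Confluent Cloud
       .option("subscribe", "taxi_trips")
       .load())

trips = (raw.select(from_json(col("value").cast("string"), schema).alias("t"))
         .select("t.*")
         .withColumn("pickup_ts", col("tpep_pickup_datetime").cast("timestamp"))
         .withColumn("fare", col("fare_amount").cast("double")))

summary = (trips.groupBy(window(col("pickup_ts"), "1 hour"))
           .agg(avg("fare").alias("avg_fare"), count("*").alias("trip_count")))

query = (summary.writeStream
         .outputMode("complete")
         .format("console")   # swap for a Delta Lake / BigQuery / cloud storage sink
         .start())
query.awaitTermination()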