⚡ Real-Time Data Streaming with Apache Kafka
Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.
This project simulates the kind of real-world streaming scenarios used at companies like Netflix, Airbnb, and Spotify. You'll get hands-on with Kafka, Confluent Cloud, Python, Polars, JSON messaging, DuckDB/Parquet, Metabase, and optionally Spark. Rated intermediate level, with comprehensive documentation and starter code.
⚡ Real-Time Data Streaming with Apache Kafka (JSON + Python + Polars)
📌 Project Overview
In this hands-on project, you'll build a real-time data pipeline using Apache Kafka on Confluent Cloud, with JSON message format and a Python-based consumer using Polars for transformation and analysis. You'll simulate realistic event data using NYC Taxi trip records, process it in real time, and visualize the results in a dashboard tool like Metabase.
You'll also explore how this approach scales to large volumes using Spark Structured Streaming.
🎯 Learning Objectives
- Understand event-driven architecture and Kafka fundamentals
- Use Kafka producers and consumers with JSON format
- Perform streaming transformations using Polars in Python
- Sink transformed data into DuckDB or Parquet for visualization
- Explore scalability using Spark Structured Streaming
⌛ Estimated Duration
🕒 8–12 hours
🧠 Difficulty: Intermediate
📦 Recommended Dataset
Dataset: NYC Taxi Trips
- Simulate event streams by emitting rows from a historical CSV
- Each record includes timestamps, distance, fare, location IDs
- Good for aggregations, rolling averages, and windowing
☁️ Using Confluent Cloud Free Tier
- Sign up at https://www.confluent.cloud
- Create a Basic Kafka Cluster (select any cloud provider)
- Create an API Key and download client.properties (a loader sketch follows this list)
- Create topics: taxi_trips, trip_summary
- Schema Registry is not needed (messages are JSON)
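The confluent_kafka clients take their settings as a plain Python dict, so a small helper can turn the downloaded client.properties into one. A minimal sketch, assuming the standard key/value format Confluent Cloud generates (bootstrap.servers, security.protocol, sasl.mechanisms, sasl.username, sasl.password); the helper name and module are illustrative, not part of any starter code:
# load_config.py (sketch): parse client.properties into a dict for confluent_kafka
def read_client_properties(path: str = "config/client.properties") -> dict:
    """Return key/value pairs from a Java-style .properties file."""
    config = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, _, value = line.partition("=")
                config[key.strip()] = value.strip()
    return config
The producer and consumer sketches below assume this helper exists.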
🗂 Suggested Project Structure
kafka-json-python-polars/
├── config/
│ └── client.properties
├── producers/
│ └── nyc_taxi_producer.py
├── consumers/
│ └── polars_consumer.py
├── dashboards/
│ └── screenshots/
├── data/
│ └── summary.parquet
└── README.md
🔄 Step-by-Step Guide
1. 📤 Produce Taxi Data
- Implement nyc_taxi_producer.py:
  - Reads from a local NYC taxi CSV file
  - Sends one row at a time to Kafka (taxi_trips) as JSON
- Use the confluent_kafka Python client
from confluent_kafka import Producer
import json

p = Producer({'bootstrap.servers': '...'})  # plus the SASL settings from client.properties
p.produce('taxi_trips', value=json.dumps(row_dict))  # row_dict: one CSV row as a dict
p.flush()
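Building on that snippet, a fuller sketch of nyc_taxi_producer.py might replay the CSV row by row with a small delay to mimic live arrival. The CSV file name, the key column, and the read_client_properties helper are assumptions for illustration:
# nyc_taxi_producer.py (sketch): replay a historical CSV as a live JSON stream
import csv
import json
import time

from confluent_kafka import Producer

from load_config import read_client_properties  # helper sketched earlier (assumption)

producer = Producer(read_client_properties())

with open("data/yellow_tripdata.csv") as fh:  # hypothetical file name
    for row in csv.DictReader(fh):
        # Key by pickup zone so events for one zone stay on one partition
        producer.produce(
            "taxi_trips",
            key=row.get("PULocationID", ""),
            value=json.dumps(row),
        )
        producer.poll(0)   # serve delivery callbacks
        time.sleep(0.01)   # throttle to simulate real-time arrival

producer.flush()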
2. 🔁 Consume and Transform with Polars
- Implement polars_consumer.py:
  - Consumes JSON messages from taxi_trips
  - Loads messages into a Polars DataFrame
  - Applies windowed aggregations and filters
  - Writes outputs to Parquet or DuckDB
import polars as pl

# batch_messages: a list of dicts decoded from the consumed JSON messages
df = pl.DataFrame(batch_messages).group_by("pickup_hour").agg(
    pl.col("fare_amount").mean().alias("avg_fare"),
    pl.len().alias("trip_count"),
)
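One possible end-to-end shape for polars_consumer.py is a micro-batching loop: poll messages, decode the JSON, transform the batch with Polars, and write the summary to Parquet. The batch size, file paths, timestamp column name, and the read_client_properties helper are assumptions:
# polars_consumer.py (sketch): micro-batch consumer writing hourly summaries
import json

import polars as pl
from confluent_kafka import Consumer

from load_config import read_client_properties  # helper sketched earlier (assumption)

conf = read_client_properties() | {"group.id": "polars-demo", "auto.offset.reset": "earliest"}
consumer = Consumer(conf)
consumer.subscribe(["taxi_trips"])

BATCH_SIZE = 500
batch = []

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        batch.append(json.loads(msg.value()))
        if len(batch) >= BATCH_SIZE:
            summary = (
                pl.DataFrame(batch)
                .with_columns(
                    pl.col("tpep_pickup_datetime")      # column name is an assumption
                    .str.to_datetime()
                    .dt.hour()
                    .alias("pickup_hour"),
                    pl.col("fare_amount").cast(pl.Float64),
                )
                .group_by("pickup_hour")
                .agg(
                    pl.col("fare_amount").mean().alias("avg_fare"),
                    pl.len().alias("trip_count"),
                )
            )
            summary.write_parquet("data/summary.parquet")  # overwrite with the latest batch
            batch.clear()
finally:
    consumer.close()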
3. 📊 Visualize with Metabase
- Load the Parquet/DuckDB output into Metabase (see the loading sketch after this list)
- Build:
- Trip count over time (line chart)
- Average fare by vendor (bar chart)
- Distribution of trip distances (histogram)
- Enable auto-refresh (e.g., every 1 minute)
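Metabase queries a database rather than raw files, so one convenient route is to load the Parquet summary into a DuckDB file and connect Metabase to it via the community DuckDB driver. A minimal sketch, with the database path assumed for illustration:
# load_summary.py (sketch): materialize the Parquet output as a DuckDB table
import duckdb

con = duckdb.connect("data/taxi.duckdb")  # the file Metabase connects to (assumption)
con.execute(
    "CREATE OR REPLACE TABLE trip_summary "
    "AS SELECT * FROM read_parquet('data/summary.parquet')"
)
con.close()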
✅ Deliverables
- Kafka producer and consumer Python scripts
- JSON message format only (no Schema Registry)
- Output tables in Parquet or DuckDB
- Dashboard screenshots
- README.md with setup instructions and diagrams
🚀 Optional Extension: Large-Scale Upgrade
🧠 Switch to Spark Structured Streaming
- Replace the Python consumer with PySpark
- Use Kafka source in Spark:
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "...")  # plus SASL options for Confluent Cloud
      .option("subscribe", "taxi_trips")
      .load())
- Apply Spark SQL or DataFrame transformations (see the sketch after this list)
- Write to BigQuery, Delta Lake, or cloud storage
- Useful for processing millions of messages/day or horizontal scaling
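Continuing from the Kafka source above, a sketch of the parse-aggregate-sink side might look like the following; the schema fields, window length, and console sink are assumptions to keep the example self-contained (a real deployment would write to Delta Lake, BigQuery, or cloud storage):
# spark_consumer.py (sketch): parse JSON values and aggregate per hourly window
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, count, from_json, window
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("taxi-stream").getOrCreate()

# All fields arrive as strings because the producer replays CSV rows
schema = StructType([
    StructField("tpep_pickup_datetime", StringType()),
    StructField("fare_amount", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "...")  # plus SASL options for Confluent Cloud
       .option("subscribe", "taxi_trips")
       .load())

trips = (raw.select(from_json(col("value").cast("string"), schema).alias("t"))
         .select("t.*")
         .withColumn("pickup_ts", col("tpep_pickup_datetime").cast("timestamp"))
         .withColumn("fare", col("fare_amount").cast("double")))

summary = (trips.groupBy(window(col("pickup_ts"), "1 hour"))
           .agg(avg("fare").alias("avg_fare"), count("*").alias("trip_count")))

query = (summary.writeStream
         .outputMode("complete")
         .format("console")   # swap for a Delta Lake / BigQuery / cloud storage sink
         .start())
query.awaitTermination()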