TL;DR: Apache Kafka is a distributed event streaming platform that serves as the backbone for real-time data pipelines. It retains messages for configurable periods (unlike traditional queues), supports multiple independent consumer groups, and enables use cases like Change Data Capture, event streaming, and log aggregation. For data engineers, Kafka is a core skill for building real-time data infrastructure.
Apache Kafka for Data Engineers: Architecture, Use Cases & Getting Started
Apache Kafka has become the backbone of real-time data infrastructure at companies like LinkedIn, Netflix, Uber, and Airbnb. Originally built at LinkedIn in 2011 to handle their activity stream data, Kafka has evolved into a distributed event streaming platform capable of processing trillions of messages per day. For data engineers, understanding Kafka is no longer optional — it is a core skill.
This guide covers Kafka's architecture, its key components, real-world use cases, how it compares to alternatives, and how to get started with a local setup and Python code.
What Is Apache Kafka?
At its core, Kafka is a distributed, fault-tolerant, high-throughput event streaming platform. Think of it as a commit log that multiple systems can write to and read from independently. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable period, allowing multiple consumers to read the same data at different speeds and at different times.
This design makes Kafka ideal for:
- Decoupling data producers from consumers
- Building real-time data pipelines
- Event-driven architectures
- Replayable data streams
Kafka Architecture Deep Dive
Brokers and Clusters
A Kafka broker is a single server that stores data and serves client requests. Multiple brokers form a cluster. Kafka distributes data and load across brokers for scalability and fault tolerance. In production, a typical cluster has 3 to 50+ brokers depending on throughput requirements.
Each broker is identified by an integer ID and can handle hundreds of thousands of reads and writes per second. Brokers are stateless regarding consumers — they do not track who has read what. That responsibility falls on the consumer.
Topics and Partitions
A topic is a named stream of records — think of it as a table in a database or a folder in a file system. Producers write to topics and consumers read from topics.
Each topic is divided into partitions, which are ordered, immutable sequences of records. Partitions are the unit of parallelism in Kafka:
```
Topic: user-events
├── Partition 0: [msg0, msg1, msg2, msg3, ...]
├── Partition 1: [msg0, msg1, msg2, ...]
└── Partition 2: [msg0, msg1, msg2, msg3, msg4, ...]
```
Key points about partitions:
- Ordering is guaranteed only within a partition, not across partitions.
- Each record within a partition has a unique sequential offset.
- Partitions are distributed across brokers for load balancing.
- The number of partitions determines the maximum parallelism for consumers.
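The offset mechanics above can be sketched with a toy in-memory log. This is illustrative only, not Kafka's actual storage format: the point is that each partition is an append-only sequence with its own offset counter.

```python
# Toy sketch of per-partition offsets -- NOT Kafka's real storage format.

class ToyTopic:
    """A topic as a list of partitions; each partition is an append-only list."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition: int, record: str) -> int:
        """Append a record; its offset is its position within the partition."""
        self.partitions[partition].append(record)
        return len(self.partitions[partition]) - 1  # the record's offset

topic = ToyTopic(num_partitions=3)
offset0 = topic.append(0, "msg0")  # first record in partition 0 -> offset 0
offset1 = topic.append(0, "msg1")  # next record in partition 0  -> offset 1
offset2 = topic.append(1, "msg0")  # partition 1 keeps its own offset sequence

print(offset0, offset1, offset2)   # offsets are sequential per partition, not global
```

Note that partition 1 restarts at offset 0: offsets only mean something within a single partition, which is exactly why ordering is only guaranteed there.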
Producers
Producers publish records to topics. When sending a record, the producer can specify:
- Key: Used to determine which partition receives the record. Records with the same key always go to the same partition (assuming partition count does not change), which guarantees ordering for that key.
- Value: The actual payload (often JSON, Avro, or Protobuf).
- Headers: Optional metadata key-value pairs.
If no key is specified, records are distributed across partitions using a round-robin or sticky partition strategy.
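The key-to-partition mapping can be sketched as a hash modulo the partition count. The real Java and librdkafka clients use a murmur2 hash of the key bytes; the stand-in hash below is only to keep the example portable.

```python
# Sketch of key-based partitioning. Real Kafka clients hash key bytes with
# murmur2; hashlib.md5 here is a portable stand-in for illustration.
import hashlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    """Map a record key to a partition deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# The same key always lands on the same partition...
assert partition_for("u123") == partition_for("u123")

# ...so one user's events stay ordered, while different keys spread the load.
print({k: partition_for(k) for k in ["u123", "u456", "u789"]})
```

This is also why changing the partition count breaks key-based ordering: the modulus changes, so existing keys start mapping to different partitions.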
Consumer Groups
A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer within the group, ensuring that each record is processed by only one consumer in the group.
```
Consumer Group: analytics-pipeline
├── Consumer A → Partition 0, Partition 1
├── Consumer B → Partition 2, Partition 3
└── Consumer C → Partition 4, Partition 5
```
If you add more consumers than partitions, some consumers sit idle. If a consumer fails, its partitions are rebalanced to the remaining consumers. This is how Kafka achieves horizontal scalability and fault tolerance on the consumption side.
Multiple consumer groups can read the same topic independently. For example, a user-events topic could be consumed by an analytics pipeline, a recommendation engine, and a fraud detection system — each as a separate consumer group, each maintaining its own offset.
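The assignment in the diagram above can be sketched as a range-style assignment. Kafka's real assignors (range, round-robin, cooperative-sticky) are pluggable and more involved; this sketch only shows the invariant that each partition belongs to exactly one consumer in the group.

```python
# Sketch of range-style partition assignment within one consumer group.
# The invariant: every partition is owned by exactly one consumer.

def range_assign(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Give each consumer a contiguous range; early consumers absorb the remainder."""
    base, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

print(range_assign(6, ["A", "B", "C"]))
# -> {'A': [0, 1], 'B': [2, 3], 'C': [4, 5]}

# With more consumers than partitions, the extras sit idle:
print(range_assign(2, ["A", "B", "C"]))
# -> {'A': [0], 'B': [1], 'C': []}
```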
Replication
Kafka replicates each partition across multiple brokers for fault tolerance. Each partition has one leader and zero or more followers. All reads and writes go through the leader. Followers replicate data from the leader and take over if the leader fails.
The replication factor (commonly set to 3) determines how many copies of each partition exist. The in-sync replicas (ISR) set tracks which followers are caught up with the leader.
ZooKeeper and KRaft
Historically, Kafka used Apache ZooKeeper for cluster metadata management, broker registration, and leader election. Starting with Kafka 3.3, the KRaft (Kafka Raft) mode replaces ZooKeeper, embedding metadata management directly into Kafka brokers. New deployments should use KRaft mode.
Delivery Guarantees
Kafka supports three delivery semantics:
| Guarantee | Description | How |
|---|---|---|
| At most once | Messages may be lost, never duplicated | Consumer commits offset before processing |
| At least once | Messages are never lost, may be duplicated | Consumer commits offset after processing |
| Exactly once | Messages are neither lost nor duplicated | Idempotent producer + transactional writes + read_committed consumers |
Exactly-once semantics (EOS) was introduced in Kafka 0.11 and is achieved through:
- Idempotent producers: The broker deduplicates retried writes using a producer ID and sequence number.
- Transactions: Producers can atomically write to multiple partitions and commit consumer offsets in a single transaction.
- Read committed isolation: Consumers only read committed (non-transactional or committed transactional) messages.
For most data engineering pipelines, at-least-once delivery with idempotent downstream processing is the pragmatic choice. Exactly-once adds overhead and complexity.
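Why at-least-once plus an idempotent sink is usually enough can be shown with a toy in-memory sink: if writes are keyed upserts, a redelivered message overwrites rather than duplicates. The dict below stands in for a table with a primary key.

```python
# Toy idempotent sink: keyed upserts make at-least-once redelivery harmless.

sink: dict[str, dict] = {}  # stand-in for a table keyed by event_id

def upsert(event: dict) -> None:
    """Write keyed by a stable event id, so reprocessing is a no-op overwrite."""
    sink[event["event_id"]] = event

batch = [
    {"event_id": "e1", "user": "u123", "action": "purchase"},
    {"event_id": "e2", "user": "u456", "action": "page_view"},
]

for e in batch:
    upsert(e)
for e in batch:  # simulate redelivery after a consumer crash-and-restart
    upsert(e)

print(len(sink))  # still 2 -- duplicates collapsed by the key
```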
Kafka Connect
Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom code. It uses connectors — pre-built plugins for common systems.
- Source connectors pull data into Kafka (e.g., from PostgreSQL, MySQL, MongoDB, S3).
- Sink connectors push data from Kafka to external systems (e.g., to Elasticsearch, Snowflake, BigQuery, S3).
Popular connectors include:
- Debezium: CDC (Change Data Capture) source connectors for PostgreSQL, MySQL, SQL Server, MongoDB
- S3 Sink/Source: Read and write to Amazon S3
- JDBC Source/Sink: Generic relational database connectors
- BigQuery Sink: Stream data directly into Google BigQuery
Connect runs in distributed mode for production (multiple workers sharing the load) or standalone mode for development.
Schema Registry
In production, you need to ensure that producers and consumers agree on the data format. The Schema Registry (provided by Confluent) stores and enforces schemas for Kafka topics.
It supports Avro, Protobuf, and JSON Schema, with schema evolution rules:
- Backward compatible: consumers using the new schema can read data written with the old schema
- Forward compatible: consumers using the old schema can read data written with the new schema
- Full: compatible in both directions
Schema Registry prevents producers from publishing data that would break consumers — a critical safety net for data pipelines at scale.
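The backward-compatibility rule can be sketched with a toy check: adding a field is safe for existing data only if the new field carries a default. This is a heavy simplification of Avro's real schema-resolution rules, but it captures the intuition.

```python
# Toy backward-compatibility check, loosely modeled on Avro's rules:
# a new reader schema can decode old data if every added field has a default.

def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    added = set(new_fields) - set(old_fields)
    return all(new_fields[f].get("default") is not None for f in added)

old = {"user_id": {"type": "string"}, "event": {"type": "string"}}

# Adding an optional field with a default: old records can still be read.
with_default = {**old, "country": {"type": "string", "default": "unknown"}}
# Adding a required field without a default: old records cannot be decoded.
without_default = {**old, "country": {"type": "string"}}

print(backward_compatible(old, with_default))     # True
print(backward_compatible(old, without_default))  # False
```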
Real-World Use Cases
Change Data Capture (CDC)
CDC captures row-level changes from a database and streams them into Kafka. This is arguably the most impactful use case for data engineers. Instead of batch-extracting data from source databases, you get a continuous stream of inserts, updates, and deletes.
Debezium is the standard tool for this. It reads the database's transaction log (WAL in PostgreSQL, binlog in MySQL) and produces change events to Kafka topics.
```
PostgreSQL WAL → Debezium → Kafka Topic → Sink Connector → Data Warehouse
```
This pattern eliminates the need for periodic full table scans, reduces load on source databases, and delivers near real-time data freshness.
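Concretely, a Debezium PostgreSQL source connector is registered with a small JSON config. The sketch below uses Debezium 2.x property names; the hostname, credentials, and table list are placeholders, not real values.

```json
{
  "name": "inventory-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres.example.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "inventory",
    "topic.prefix": "inventory",
    "table.include.list": "public.orders,public.customers"
  }
}
```

POSTing this to the Kafka Connect REST API starts streaming each listed table's changes into its own topic (e.g., `inventory.public.orders`).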
Event Streaming
Modern applications emit events: user clicks, page views, transactions, IoT sensor readings. Kafka serves as the central nervous system for these events, enabling multiple downstream systems to react independently.
A single user-activity topic can feed:
- A real-time dashboard (via Kafka Streams or Flink)
- A data lake (via S3 sink connector)
- A recommendation engine (via a custom consumer)
- A fraud detection system (via a stream processing application)
Log Aggregation
Kafka can replace traditional log aggregation systems. Applications write logs to Kafka topics instead of local files, and downstream consumers route them to Elasticsearch, S3, or a monitoring platform. This centralizes log management and makes logs available for both operational monitoring and analytical processing.
For more patterns on building robust data pipelines, see our data pipeline design patterns guide.
Kafka vs Alternatives
Kafka vs Google Cloud Pub/Sub
| Feature | Kafka | Cloud Pub/Sub |
|---|---|---|
| Hosting | Self-managed or Confluent Cloud | Fully managed |
| Ordering | Per-partition guaranteed | Per-key with ordering keys |
| Retention | Configurable (days to forever) | 7 days default, 31 days max |
| Replay | Full replay from any offset | Seek to timestamp |
| Throughput | Millions of msgs/sec | Auto-scales, no provisioning |
| Cost model | Infrastructure-based | Per-message pricing |
Choose Pub/Sub when you want zero operational overhead and are on GCP. Choose Kafka when you need fine-grained control, long retention, or multi-cloud portability.
Kafka vs Amazon Kinesis
| Feature | Kafka | Kinesis Data Streams |
|---|---|---|
| Shards/Partitions | Configurable, thousands+ | 1-10,000 shards |
| Retention | Configurable | 1-365 days |
| Throughput per shard | Higher (config-dependent) | 1 MB/s in, 2 MB/s out |
| Consumer model | Pull-based | Pull or enhanced fan-out |
| Ecosystem | Kafka Connect, Streams, ksqlDB | Lambda, Firehose, Analytics |
Choose Kinesis for tight AWS integration and simpler operations. Choose Kafka for higher throughput, richer ecosystem, and cloud portability.
Getting Started with Docker
The fastest way to run Kafka locally is with Docker Compose. Here is a minimal setup using KRaft mode (no ZooKeeper):
```yaml
# docker-compose.yml
version: '3.8'
services:
  kafka:
    image: apache/kafka:3.7.0
    hostname: kafka
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_LOG_RETENTION_HOURS: 168
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
```
Start it with:
```shell
docker compose up -d
```
Python Producer and Consumer Examples
Install the Kafka Python client:
```shell
pip install confluent-kafka
```
Producer
```python
from confluent_kafka import Producer
import json
import time

# Configuration
config = {
    'bootstrap.servers': 'localhost:9092',
    'client.id': 'python-producer',
    'acks': 'all',               # Wait for all in-sync replicas to acknowledge
    'retries': 3,                # Retry on transient failures
    'enable.idempotence': True,  # Prevent duplicate messages on retry
}

producer = Producer(config)

def delivery_callback(err, msg):
    """Called once for each message produced to indicate delivery result."""
    if err is not None:
        print(f'Message delivery failed: {err}')
    else:
        print(f'Message delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')

# Produce sample user events
events = [
    {'user_id': 'u123', 'event': 'page_view', 'page': '/products', 'timestamp': time.time()},
    {'user_id': 'u456', 'event': 'add_to_cart', 'product_id': 'p789', 'timestamp': time.time()},
    {'user_id': 'u123', 'event': 'purchase', 'order_id': 'o001', 'timestamp': time.time()},
]

topic = 'user-events'
for event in events:
    # Use user_id as key to ensure ordering per user
    producer.produce(
        topic=topic,
        key=event['user_id'],
        value=json.dumps(event),
        callback=delivery_callback,
    )

# Wait for all messages to be delivered
producer.flush()
print('All messages delivered.')
```
Consumer
```python
from confluent_kafka import Consumer, KafkaError
import json

# Configuration
config = {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'analytics-pipeline',
    'auto.offset.reset': 'earliest',  # Start from beginning if no committed offset
    'enable.auto.commit': False,      # Manual offset commits for at-least-once
    'enable.partition.eof': True,     # Emit an event when a partition's end is reached
}

consumer = Consumer(config)
consumer.subscribe(['user-events'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # Wait up to 1 second for a message
        if msg is None:
            continue
        if msg.error():
            if msg.error().code() == KafkaError._PARTITION_EOF:
                print(f'Reached end of partition {msg.partition()}')
            else:
                print(f'Error: {msg.error()}')
            continue

        # Process the message
        event = json.loads(msg.value().decode('utf-8'))
        print(f'Received: user={event["user_id"]}, event={event["event"]}')

        # Commit offset after successful processing (at-least-once)
        consumer.commit(asynchronous=False)
except KeyboardInterrupt:
    pass
finally:
    consumer.close()
```
Production Considerations
Topic Design
- One topic per event type (e.g., `user-signups`, `order-completed`) is cleaner than a single `all-events` topic.
- Partition count: Start with the number of consumers you expect. Increasing partitions later is possible but reshuffles key-based routing.
- Retention: Set based on your replay requirements. Common values are 7 days for operational data, 30+ days for analytical pipelines, and infinite for event sourcing.
Monitoring
Key metrics to monitor:
- Consumer lag: The difference between the latest offset and the consumer's committed offset. Growing lag means your consumers cannot keep up.
- Under-replicated partitions: Indicates broker health issues.
- Request latency: Producer and consumer request times.
- Disk usage: Kafka stores everything on disk; plan capacity accordingly.
Tools like Kafka UI, Confluent Control Center, and Prometheus + Grafana with JMX exporters are standard for monitoring.
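Consumer lag is just arithmetic over offsets. In practice the numbers come from the admin API or JMX; the sketch below mocks them to show the calculation.

```python
# Consumer lag per partition: latest (end) offset minus committed offset.
# In production these numbers come from the admin API or JMX; mocked here.

end_offsets = {0: 1_250, 1: 980, 2: 1_100}  # latest offset per partition
committed   = {0: 1_250, 1: 700, 2: 1_090}  # the consumer group's commits

lag = {p: end_offsets[p] - committed[p] for p in end_offsets}
total_lag = sum(lag.values())

print(lag)        # {0: 0, 1: 280, 2: 10}
print(total_lag)  # 290 -- alert if this keeps growing
```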
Common Pitfalls
- Too few partitions: Limits consumer parallelism. You cannot have more active consumers than partitions.
- Large messages: Kafka defaults to 1MB max message size. For larger payloads, store the data externally (S3) and send a reference through Kafka.
- Consumer rebalancing storms: Too many consumers joining and leaving causes frequent rebalancing. Use static group membership and incremental cooperative rebalancing.
- Not handling poison pills: A malformed message can crash your consumer in a loop. Implement dead letter queues for messages that fail processing.
Where to Go Next
Kafka is a vast ecosystem. Once you are comfortable with producers and consumers, explore:
- Kafka Streams for stateful stream processing within your JVM application
- ksqlDB for SQL-based stream processing
- Apache Flink for complex event processing at scale
- Schema Registry for schema management and evolution
If you want hands-on practice with Kafka, check out our streaming data pipeline project where you build a complete pipeline from scratch. For a broader view of where Kafka fits in the modern data ecosystem, explore the Modern Data Stack roadmap.
Frequently Asked Questions
What is Apache Kafka?
Apache Kafka is a distributed, fault-tolerant, high-throughput event streaming platform originally built at LinkedIn. It functions as a durable commit log that multiple systems can write to and read from independently. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable period, enabling replay, multiple consumer groups, and decoupled architectures.
What is a Kafka topic?
A Kafka topic is a named stream of records that producers write to and consumers read from — analogous to a table in a database. Each topic is divided into partitions, which are ordered, immutable sequences of records. Partitions enable parallel processing and are distributed across brokers for scalability. Ordering is guaranteed only within a single partition, not across partitions.
What is a consumer group in Kafka?
A consumer group is a set of consumers that cooperate to consume a topic, where each partition is assigned to exactly one consumer within the group. This ensures each record is processed by only one consumer in the group while enabling horizontal scalability. Multiple consumer groups can read the same topic independently, allowing different systems (analytics, fraud detection, recommendations) to process the same data stream at their own pace.
When should I use Kafka vs a message queue?
Use Kafka when you need durable message retention with replay capability, multiple independent consumers reading the same data, high-throughput event streaming, or decoupled producer-consumer architectures. Use a traditional message queue (like RabbitMQ or SQS) when you need simple point-to-point messaging, messages should be deleted after processing, or you want simpler operations with lower throughput requirements.
What is Kafka Connect?
Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom code. Source connectors pull data into Kafka (e.g., Debezium for CDC from PostgreSQL/MySQL), and sink connectors push data from Kafka to external systems (e.g., S3, Snowflake, BigQuery, Elasticsearch). It runs in distributed mode for production scalability and handles serialization, offset management, and fault tolerance automatically.