Apache Kafka for Data Engineers: Architecture, Use Cases & Getting Started

    Learn Apache Kafka architecture, key concepts, and practical use cases. Includes Python examples, Docker setup, and comparisons with Pub/Sub and Kinesis.

    By Adriano Sanges · 14 min read
    Apache Kafka
    event streaming
    data engineering
    real-time data
    CDC
    Python

    TL;DR: Apache Kafka is a distributed event streaming platform that serves as the backbone for real-time data pipelines. It retains messages for configurable periods (unlike traditional queues), supports multiple independent consumer groups, and enables use cases like Change Data Capture, event streaming, and log aggregation. For data engineers, Kafka is a core skill for building real-time data infrastructure.


    Apache Kafka has become the backbone of real-time data infrastructure at companies like LinkedIn, Netflix, Uber, and Airbnb. Originally developed at LinkedIn and open-sourced in 2011 to handle their activity stream data, Kafka has evolved into a distributed event streaming platform capable of processing trillions of messages per day. For data engineers, understanding Kafka is no longer optional — it is a core skill.

    This guide covers Kafka's architecture, its key components, real-world use cases, how it compares to alternatives, and how to get started with a local setup and Python code.

    What Is Apache Kafka?

    At its core, Kafka is a distributed, fault-tolerant, high-throughput event streaming platform. Think of it as a commit log that multiple systems can write to and read from independently. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable period, allowing multiple consumers to read the same data at different speeds and at different times.
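    The commit-log analogy can be sketched in a few lines of Python: an append-only list of records plus independent read positions, so two readers consume the same data at different speeds. This is a toy model for intuition, not Kafka's actual implementation:

```python
# Toy model of Kafka's core abstraction: an append-only log with offsets,
# where each reader tracks its own position independently.
class CommitLog:
    def __init__(self):
        self.records = []  # retained regardless of who has read them

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the record's offset

    def read(self, offset):
        """Return records from `offset` onward; reading never mutates the log."""
        return self.records[offset:]

log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Two independent readers at different positions see the same data.
slow_reader_pos = 1
fast_reader_pos = 3
print(log.read(slow_reader_pos))  # ['login', 'purchase']
print(log.read(fast_reader_pos))  # [] — already caught up
```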

    This design makes Kafka ideal for:

    • Decoupling data producers from consumers
    • Building real-time data pipelines
    • Event-driven architectures
    • Replayable data streams

    Kafka Architecture Deep Dive

    Brokers and Clusters

    A Kafka broker is a single server that stores data and serves client requests. Multiple brokers form a cluster. Kafka distributes data and load across brokers for scalability and fault tolerance. In production, a typical cluster has 3 to 50+ brokers depending on throughput requirements.

    Each broker is identified by an integer ID and can handle hundreds of thousands of reads and writes per second. Brokers are stateless regarding consumers — they do not track who has read what. That responsibility falls on the consumer.

    Topics and Partitions

    A topic is a named stream of records — think of it as a table in a database or a folder in a file system. Producers write to topics and consumers read from topics.

    Each topic is divided into partitions, which are ordered, immutable sequences of records. Partitions are the unit of parallelism in Kafka:

    Topic: user-events
    ├── Partition 0: [msg0, msg1, msg2, msg3, ...]
    ├── Partition 1: [msg0, msg1, msg2, ...]
    └── Partition 2: [msg0, msg1, msg2, msg3, msg4, ...]
    

    Key points about partitions:

    • Ordering is guaranteed only within a partition, not across partitions.
    • Each record within a partition has a unique sequential offset.
    • Partitions are distributed across brokers for load balancing.
    • The number of partitions determines the maximum parallelism for consumers.
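    Key-to-partition routing can be illustrated with a simplified hash. Kafka's default partitioner actually applies murmur2 to the key bytes, but the modulo idea is the same:

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner, which
    # applies murmur2 to the key bytes before taking the modulo.
    return sum(key.encode()) % num_partitions

# The same key always lands on the same partition...
assert partition_for("u123", 3) == partition_for("u123", 3)

# ...but changing the partition count changes the mapping, which is why
# increasing partitions later breaks key-based ordering guarantees.
print(partition_for("u123", 3), partition_for("u123", 6))
```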

    Producers

    Producers publish records to topics. When sending a record, the producer can specify:

    • Key: Used to determine which partition receives the record. Records with the same key always go to the same partition (assuming partition count does not change), which guarantees ordering for that key.
    • Value: The actual payload (often JSON, Avro, or Protobuf).
    • Headers: Optional metadata key-value pairs.

    If no key is specified, records are distributed across partitions using a round-robin or sticky partition strategy.

    Consumer Groups

    A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer within the group, ensuring that each record is processed by only one consumer in the group.

    Consumer Group: analytics-pipeline
    ├── Consumer A → Partition 0, Partition 1
    ├── Consumer B → Partition 2, Partition 3
    └── Consumer C → Partition 4, Partition 5
    

    If you add more consumers than partitions, some consumers sit idle. If a consumer fails, its partitions are rebalanced to the remaining consumers. This is how Kafka achieves horizontal scalability and fault tolerance on the consumption side.

    Multiple consumer groups can read the same topic independently. For example, a user-events topic could be consumed by an analytics pipeline, a recommendation engine, and a fraud detection system — each as a separate consumer group, each maintaining its own offset.
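    The assignment logic can be sketched as follows: each group independently spreads all partitions across its own members. This uses a simplified round-robin; real Kafka assignors are range, round-robin, or cooperative-sticky:

```python
def assign(partitions, consumers):
    """Round-robin partition assignment within one consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))  # a topic with 6 partitions

# Two groups read the same topic; each covers ALL partitions independently.
analytics = assign(partitions, ["A", "B", "C"])
fraud = assign(partitions, ["X"])

print(analytics)  # {'A': [0, 3], 'B': [1, 4], 'C': [2, 5]}
print(fraud)      # {'X': [0, 1, 2, 3, 4, 5]}
```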

    Replication

    Kafka replicates each partition across multiple brokers for fault tolerance. Each partition has one leader and zero or more followers. All reads and writes go through the leader. Followers replicate data from the leader and take over if the leader fails.

    The replication factor (commonly set to 3) determines how many copies of each partition exist. The in-sync replicas (ISR) set tracks which followers are caught up with the leader.
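    The ISR idea can be sketched as a follower-lag check. Note that real Kafka uses a time-based criterion (replica.lag.time.max.ms) rather than an offset gap; this offset-based version is only for intuition:

```python
def in_sync_replicas(leader_end_offset, follower_offsets, max_lag=4):
    """Followers whose fetch position is within `max_lag` records of the leader."""
    return {
        broker_id
        for broker_id, offset in follower_offsets.items()
        if leader_end_offset - offset <= max_lag
    }

# Partition with replication factor 3: leader on broker 1, followers on 2 and 3.
followers = {2: 1000, 3: 950}
print(in_sync_replicas(leader_end_offset=1002, follower_offsets=followers))
# Broker 3 is 52 records behind, so it drops out of the ISR; broker 2 stays.
```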

    ZooKeeper and KRaft

    Historically, Kafka used Apache ZooKeeper for cluster metadata management, broker registration, and leader election. KRaft (Kafka Raft) mode, declared production-ready in Kafka 3.3, replaces ZooKeeper by embedding metadata management directly into Kafka brokers, and ZooKeeper support is removed entirely in Kafka 4.0. New deployments should use KRaft mode.

    Delivery Guarantees

    Kafka supports three delivery semantics:

    | Guarantee     | Description                                | How                                                                   |
    |---------------|--------------------------------------------|-----------------------------------------------------------------------|
    | At most once  | Messages may be lost, never duplicated     | Consumer commits offset before processing                             |
    | At least once | Messages are never lost, may be duplicated | Consumer commits offset after processing                              |
    | Exactly once  | Messages are neither lost nor duplicated   | Idempotent producer + transactional writes + read_committed consumers |

    Exactly-once semantics (EOS) was introduced in Kafka 0.11 and is achieved through:

    1. Idempotent producers: The broker deduplicates retried writes using a producer ID and sequence number.
    2. Transactions: Producers can atomically write to multiple partitions and commit consumer offsets in a single transaction.
    3. Read committed isolation: Consumers only read committed (non-transactional or committed transactional) messages.

    For most data engineering pipelines, at-least-once delivery with idempotent downstream processing is the pragmatic choice. Exactly-once adds overhead and complexity.
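    The difference between the first two semantics comes down to when the offset commit happens relative to processing. A small crash simulation makes it concrete (pure Python, no broker involved):

```python
def consume(messages, crash_at, commit_first):
    """Simulate a crash between the two steps (commit, process) of one message,
    followed by a restart from the last committed offset."""
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_first:                    # at-most-once ordering
            committed = i + 1
            if i == crash_at:
                break                       # crash: offset committed, work lost
            processed.append(msg)
        else:                               # at-least-once ordering
            processed.append(msg)
            if i == crash_at:
                break                       # crash: work done, offset not committed
            committed = i + 1
    processed.extend(messages[committed:])  # restart from committed offset
    return processed

msgs = ["m0", "m1", "m2", "m3"]
print(consume(msgs, crash_at=2, commit_first=True))   # ['m0', 'm1', 'm3'] — m2 lost
print(consume(msgs, crash_at=2, commit_first=False))  # ['m0', 'm1', 'm2', 'm2', 'm3'] — m2 duplicated
```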

    Kafka Connect

    Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom code. It uses connectors — pre-built plugins for common systems.

    • Source connectors pull data into Kafka (e.g., from PostgreSQL, MySQL, MongoDB, S3).
    • Sink connectors push data from Kafka to external systems (e.g., to Elasticsearch, Snowflake, BigQuery, S3).

    Popular connectors include:

    • Debezium: CDC (Change Data Capture) source connectors for PostgreSQL, MySQL, SQL Server, MongoDB
    • S3 Sink/Source: Read and write to Amazon S3
    • JDBC Source/Sink: Generic relational database connectors
    • BigQuery Sink: Stream data directly into Google BigQuery

    Connect runs in distributed mode for production (multiple workers sharing the load) or standalone mode for development.
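    Connectors are configured with plain JSON posted to the Connect REST API. As an illustration, a minimal Debezium PostgreSQL source config might look like the sketch below; the keys shown are standard Debezium 2.x options, but all connection values are placeholders:

```python
import json

# Hypothetical Debezium PostgreSQL source connector config.
# Keys are standard Debezium 2.x options; values are placeholders.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "shop",
        "topic.prefix": "shop",               # topics become shop.<schema>.<table>
        "table.include.list": "public.orders",
        "plugin.name": "pgoutput",            # PostgreSQL's built-in logical decoding plugin
    },
}

# This payload would be POSTed to the Connect REST API, e.g.:
#   curl -X POST http://connect:8083/connectors \
#     -H 'Content-Type: application/json' -d @connector.json
print(json.dumps(connector, indent=2))
```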

    Schema Registry

    In production, you need to ensure that producers and consumers agree on the data format. The Schema Registry (provided by Confluent) stores and enforces schemas for Kafka topics.

    It supports Avro, Protobuf, and JSON Schema, with schema evolution rules:

    • Backward compatible: New schema can read old data
    • Forward compatible: Old schema can read new data
    • Full compatible: Both directions

    Schema Registry prevents producers from publishing data that would break consumers — a critical safety net for data pipelines at scale.
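    What "backward compatible" means in practice: a new reader schema can still decode records written with the old schema, typically because every added field carries a default. A pure-Python sketch of the idea — real deployments rely on Avro/Protobuf schema resolution rules, not hand-written code like this:

```python
# Old producer schema had only {"user_id", "event"}.
old_record = {"user_id": "u123", "event": "page_view"}

# New reader schema adds a field WITH a default — a backward-compatible change.
NEW_SCHEMA_DEFAULTS = {"user_id": None, "event": None, "country": "unknown"}

def read_with_new_schema(record):
    """Decode any record, old or new, filling added fields from their defaults."""
    return {field: record.get(field, default)
            for field, default in NEW_SCHEMA_DEFAULTS.items()}

decoded = read_with_new_schema(old_record)
print(decoded)  # {'user_id': 'u123', 'event': 'page_view', 'country': 'unknown'}
```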

    Real-World Use Cases

    Change Data Capture (CDC)

    CDC captures row-level changes from a database and streams them into Kafka. This is arguably the most impactful use case for data engineers. Instead of batch-extracting data from source databases, you get a continuous stream of inserts, updates, and deletes.

    Debezium is the standard tool for this. It reads the database's transaction log (WAL in PostgreSQL, binlog in MySQL) and produces change events to Kafka topics.

    PostgreSQL WAL → Debezium → Kafka Topic → Sink Connector → Data Warehouse
    

    This pattern eliminates the need for periodic full table scans, reduces load on source databases, and delivers near real-time data freshness.
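    Debezium change events carry an envelope with before and after row images and an op code (c = create, u = update, d = delete, r = snapshot read). A consumer routing on that envelope might look like this sketch — the envelope fields match Debezium's documented format, while the routing logic itself is illustrative:

```python
import json

def handle_change_event(raw: bytes):
    """Route a Debezium change event by operation type.
    Returns (op, row) where row is the state to apply downstream."""
    payload = json.loads(raw)["payload"]
    op = payload["op"]
    if op in ("c", "r", "u"):        # create, snapshot read, update
        return op, payload["after"]   # upsert the new row image
    if op == "d":                     # delete
        return op, payload["before"]  # emit a tombstone keyed on the old row
    raise ValueError(f"unknown op: {op}")

event = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 1, "status": "pending"},
        "after":  {"id": 1, "status": "shipped"},
        "ts_ms": 1700000000000,
    }
}).encode()

print(handle_change_event(event))  # ('u', {'id': 1, 'status': 'shipped'})
```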

    Event Streaming

    Modern applications emit events: user clicks, page views, transactions, IoT sensor readings. Kafka serves as the central nervous system for these events, enabling multiple downstream systems to react independently.

    A single user-activity topic can feed:

    • A real-time dashboard (via Kafka Streams or Flink)
    • A data lake (via S3 sink connector)
    • A recommendation engine (via a custom consumer)
    • A fraud detection system (via a stream processing application)

    Log Aggregation

    Kafka can replace traditional log aggregation systems. Applications write logs to Kafka topics instead of local files, and downstream consumers route them to Elasticsearch, S3, or a monitoring platform. This centralizes log management and makes logs available for both operational monitoring and analytical processing.
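    A minimal sketch of the pattern: a logging.Handler that forwards records to a produce function. A stand-in callable is used here for demonstration; in practice you would pass producer.produce from confluent-kafka and flush on shutdown:

```python
import json
import logging

class KafkaLogHandler(logging.Handler):
    """Forward log records to a Kafka-style produce(topic, value) callable."""
    def __init__(self, produce, topic="app-logs"):
        super().__init__()
        self.produce = produce
        self.topic = topic

    def emit(self, record):
        self.produce(self.topic, json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }))

# Stand-in for confluent-kafka's producer.produce, for demonstration only.
sent = []
logger = logging.getLogger("checkout")
logger.addHandler(KafkaLogHandler(lambda topic, value: sent.append((topic, value))))
logger.warning("payment retry for order %s", "o001")
print(sent)  # one ('app-logs', '{"level": "WARNING", ...}') record
```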

    For more patterns on building robust data pipelines, see our data pipeline design patterns guide.

    Kafka vs Alternatives

    Kafka vs Google Cloud Pub/Sub

    | Feature    | Kafka                           | Cloud Pub/Sub                |
    |------------|---------------------------------|------------------------------|
    | Hosting    | Self-managed or Confluent Cloud | Fully managed                |
    | Ordering   | Per-partition guaranteed        | Per-key with ordering keys   |
    | Retention  | Configurable (days to forever)  | 7 days default, 31 days max  |
    | Replay     | Full replay from any offset     | Seek to timestamp            |
    | Throughput | Millions of msgs/sec            | Auto-scales, no provisioning |
    | Cost model | Infrastructure-based            | Per-message pricing          |

    Choose Pub/Sub when you want zero operational overhead and are on GCP. Choose Kafka when you need fine-grained control, long retention, or multi-cloud portability.

    Kafka vs Amazon Kinesis

    | Feature              | Kafka                          | Kinesis Data Streams        |
    |----------------------|--------------------------------|-----------------------------|
    | Shards/Partitions    | Configurable, thousands+       | 1-10,000 shards             |
    | Retention            | Configurable                   | 1-365 days                  |
    | Throughput per shard | Higher (config-dependent)      | 1 MB/s in, 2 MB/s out       |
    | Consumer model       | Pull-based                     | Pull or enhanced fan-out    |
    | Ecosystem            | Kafka Connect, Streams, ksqlDB | Lambda, Firehose, Analytics |

    Choose Kinesis for tight AWS integration and simpler operations. Choose Kafka for higher throughput, richer ecosystem, and cloud portability.

    Getting Started with Docker

    The fastest way to run Kafka locally is with Docker Compose. Here is a minimal setup using KRaft mode (no ZooKeeper):

    # docker-compose.yml
    version: '3.8'
    services:
      kafka:
        image: apache/kafka:3.7.0
        hostname: kafka
        container_name: kafka
        ports:
          - "9092:9092"
        environment:
          KAFKA_NODE_ID: 1
          KAFKA_PROCESS_ROLES: broker,controller
          KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
          KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
          KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
          KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
          KAFKA_LOG_RETENTION_HOURS: 168
          CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
    

    Start it with:

    docker compose up -d
    

    Python Producer and Consumer Examples

    Install the Kafka Python client:

    pip install confluent-kafka
    

    Producer

    from confluent_kafka import Producer
    import json
    import time
    
    # Configuration
    config = {
        'bootstrap.servers': 'localhost:9092',
        'client.id': 'python-producer',
        'acks': 'all',               # Wait for all replicas to acknowledge
        'retries': 3,                # Retry on transient failures
        'enable.idempotence': True,  # Prevent duplicate messages on retry
    }
    
    producer = Producer(config)
    
    def delivery_callback(err, msg):
        """Called once for each message produced to indicate delivery result."""
        if err is not None:
            print(f'Message delivery failed: {err}')
        else:
            print(f'Message delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')
    
    # Produce sample user events
    events = [
        {'user_id': 'u123', 'event': 'page_view', 'page': '/products', 'timestamp': time.time()},
        {'user_id': 'u456', 'event': 'add_to_cart', 'product_id': 'p789', 'timestamp': time.time()},
        {'user_id': 'u123', 'event': 'purchase', 'order_id': 'o001', 'timestamp': time.time()},
    ]
    
    topic = 'user-events'
    
    for event in events:
        # Use user_id as key to ensure ordering per user
        producer.produce(
            topic=topic,
            key=event['user_id'],
            value=json.dumps(event),
            callback=delivery_callback,
        )
    
    # Wait for all messages to be delivered
    producer.flush()
    print('All messages delivered.')
    

    Consumer

    from confluent_kafka import Consumer, KafkaError
    import json
    
    # Configuration
    config = {
        'bootstrap.servers': 'localhost:9092',
        'group.id': 'analytics-pipeline',
        'auto.offset.reset': 'earliest',  # Start from beginning if no committed offset
        'enable.auto.commit': False,       # Manual offset commits for at-least-once
    }
    
    consumer = Consumer(config)
    consumer.subscribe(['user-events'])
    
    try:
        while True:
            msg = consumer.poll(timeout=1.0)  # Wait up to 1 second for a message
    
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    print(f'Reached end of partition {msg.partition()}')
                else:
                    print(f'Error: {msg.error()}')
                continue
    
            # Process the message
            event = json.loads(msg.value().decode('utf-8'))
            print(f'Received: user={event["user_id"]}, event={event["event"]}')
    
            # Commit offset after successful processing (at-least-once)
            consumer.commit(asynchronous=False)
    
    except KeyboardInterrupt:
        pass
    finally:
        consumer.close()
    

    Production Considerations

    Topic Design

    • One topic per event type (e.g., user-signups, order-completed) is cleaner than a single all-events topic.
    • Partition count: Start with the number of consumers you expect. Increasing partitions later is possible but reshuffles key-based routing.
    • Retention: Set based on your replay requirements. Common values are 7 days for operational data, 30+ days for analytical pipelines, and infinite for event sourcing.

    Monitoring

    Key metrics to monitor:

    • Consumer lag: The difference between the latest offset and the consumer's committed offset. Growing lag means your consumers cannot keep up.
    • Under-replicated partitions: Indicates broker health issues.
    • Request latency: Producer and consumer request times.
    • Disk usage: Kafka stores everything on disk; plan capacity accordingly.
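    Lag itself is simple arithmetic: the partition's latest offset (the high watermark) minus the group's committed offset. With confluent-kafka the two inputs come from consumer.get_watermark_offsets() and consumer.committed(); the calculation reduces to:

```python
def consumer_lag(high_watermark: int, committed_offset: int) -> int:
    """Records the consumer group still has to process on one partition.
    Sum this across all partitions of a topic for total lag."""
    return max(0, high_watermark - committed_offset)

# Example: producers are at offset 10_500, the group has committed 10_100.
print(consumer_lag(10_500, 10_100))  # 400
```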

    Tools like Kafka UI, Confluent Control Center, and Prometheus + Grafana with JMX exporters are standard for monitoring.

    Common Pitfalls

    1. Too few partitions: Limits consumer parallelism. You cannot have more active consumers than partitions.
    2. Large messages: Kafka defaults to 1MB max message size. For larger payloads, store the data externally (S3) and send a reference through Kafka.
    3. Consumer rebalancing storms: Too many consumers joining and leaving causes frequent rebalancing. Use static group membership and incremental cooperative rebalancing.
    4. Not handling poison pills: A malformed message can crash your consumer in a loop. Implement dead letter queues for messages that fail processing.
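    A dead letter queue is straightforward to bolt onto the consumer loop: catch the processing error after a few retries and produce the raw message, plus error metadata, to a side topic instead of crashing. A sketch with stand-in process/produce callables (the names are illustrative):

```python
import json

def process_with_dlq(messages, process, produce_to_dlq, max_attempts=3):
    """Process each message; after repeated failures, route it to a DLQ
    instead of blocking the partition on a poison pill."""
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                process(msg)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    produce_to_dlq(json.dumps({"original": msg, "error": str(exc)}))

# Stand-ins for demonstration: valid JSON processes fine, anything else poisons.
dlq = []
process_with_dlq(
    ['{"ok": 1}', "not-json", '{"ok": 2}'],
    process=json.loads,
    produce_to_dlq=dlq.append,
)
print(dlq)  # the malformed message lands in the DLQ with its error message
```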

    Where to Go Next

    Kafka is a vast ecosystem. Once you are comfortable with producers and consumers, explore:

    • Kafka Streams for stateful stream processing within your JVM application
    • ksqlDB for SQL-based stream processing
    • Apache Flink for complex event processing at scale
    • Schema Registry for schema management and evolution

    If you want hands-on practice with Kafka, check out our streaming data pipeline project where you build a complete pipeline from scratch. For a broader view of where Kafka fits in the modern data ecosystem, explore the Modern Data Stack roadmap.

    Frequently Asked Questions

    What is Apache Kafka?

    Apache Kafka is a distributed, fault-tolerant, high-throughput event streaming platform originally built at LinkedIn. It functions as a durable commit log that multiple systems can write to and read from independently. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable period, enabling replay, multiple consumer groups, and decoupled architectures.

    What is a Kafka topic?

    A Kafka topic is a named stream of records that producers write to and consumers read from — analogous to a table in a database. Each topic is divided into partitions, which are ordered, immutable sequences of records. Partitions enable parallel processing and are distributed across brokers for scalability. Ordering is guaranteed only within a single partition, not across partitions.

    What is a consumer group in Kafka?

    A consumer group is a set of consumers that cooperate to consume a topic, where each partition is assigned to exactly one consumer within the group. This ensures each record is processed by only one consumer in the group while enabling horizontal scalability. Multiple consumer groups can read the same topic independently, allowing different systems (analytics, fraud detection, recommendations) to process the same data stream at their own pace.

    When should I use Kafka vs a message queue?

    Use Kafka when you need durable message retention with replay capability, multiple independent consumers reading the same data, high-throughput event streaming, or decoupled producer-consumer architectures. Use a traditional message queue (like RabbitMQ or SQS) when you need simple point-to-point messaging, messages should be deleted after processing, or you want simpler operations with lower throughput requirements.

    What is Kafka Connect?

    Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom code. Source connectors pull data into Kafka (e.g., Debezium for CDC from PostgreSQL/MySQL), and sink connectors push data from Kafka to external systems (e.g., S3, Snowflake, BigQuery, Elasticsearch). It runs in distributed mode for production scalability and handles serialization, offset management, and fault tolerance automatically.

    About the Author

    Adriano Sanges is a data engineering professional and the creator of dataskew.io. With years of experience building data platforms at scale, he shares practical insights and hands-on guides to help aspiring data engineers advance their careers.

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.