Apache Kafka for Data Engineers: Architecture, Use Cases & Getting Started

    Learn Apache Kafka architecture, key concepts, and practical use cases. Includes Python examples, Docker setup, and comparisons with Pub/Sub and Kinesis.

    By Adriano Sanges · 14 min read
    Apache Kafka
    event streaming
    data engineering
    real-time data
    CDC
    Python

    TL;DR: Apache Kafka is a distributed event streaming platform that serves as the backbone for real-time data pipelines. It retains messages for configurable periods (unlike traditional queues), supports multiple independent consumer groups, and enables use cases like Change Data Capture, event streaming, and log aggregation. For data engineers, Kafka is a core skill for building real-time data infrastructure.


    Apache Kafka has become the backbone of real-time data infrastructure at companies like LinkedIn, Netflix, Uber, and Airbnb. Originally developed at LinkedIn and open-sourced in 2011 to handle their activity stream data, Kafka has evolved into a distributed event streaming platform capable of processing trillions of messages per day. For data engineers, understanding Kafka is no longer optional — it is a core skill.

    This guide covers Kafka's architecture, its key components, real-world use cases, how it compares to alternatives, and how to get started with a local setup and Python code.

    What Is Apache Kafka?

    At its core, Kafka is a distributed, fault-tolerant, high-throughput event streaming platform. Think of it as a commit log that multiple systems can write to and read from independently. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable period, allowing multiple consumers to read the same data at different speeds and at different times.
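    The commit-log analogy can be sketched in a few lines of Python: an append-only list of records plus independent read positions, so two readers consume the same data at different speeds. This is a toy model for intuition, not Kafka's actual implementation:

```python
# Toy model of Kafka's core abstraction: an append-only log with offsets,
# where each reader tracks its own position independently.
class CommitLog:
    def __init__(self):
        self.records = []  # retained regardless of who has read them

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the record's offset

    def read(self, offset):
        """Return records from `offset` onward; reading never mutates the log."""
        return self.records[offset:]

log = CommitLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

# Two independent readers at different positions see the same data.
slow_reader_pos = 1
fast_reader_pos = 3
print(log.read(slow_reader_pos))  # ['login', 'purchase']
print(log.read(fast_reader_pos))  # [] — already caught up
```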

    This design makes Kafka ideal for:

    • Decoupling data producers from consumers
    • Building real-time data pipelines
    • Event-driven architectures
    • Replayable data streams

    Kafka Architecture Deep Dive

    Brokers and Clusters

    A Kafka broker is a single server that stores data and serves client requests. Multiple brokers form a cluster. Kafka distributes data and load across brokers for scalability and fault tolerance. In production, a typical cluster has 3 to 50+ brokers depending on throughput requirements.

    Each broker is identified by an integer ID and can handle hundreds of thousands of reads and writes per second. Brokers are stateless regarding consumers — they do not track who has read what. That responsibility falls on the consumer.

    Topics and Partitions

    A topic is a named stream of records — think of it as a table in a database or a folder in a file system. Producers write to topics and consumers read from topics.

    Each topic is divided into partitions, which are ordered, immutable sequences of records. Partitions are the unit of parallelism in Kafka:

    Topic: user-events
    ├── Partition 0: [msg0, msg1, msg2, msg3, ...]
    ├── Partition 1: [msg0, msg1, msg2, ...]
    └── Partition 2: [msg0, msg1, msg2, msg3, msg4, ...]
    

    Key points about partitions:

    • Ordering is guaranteed only within a partition, not across partitions.
    • Each record within a partition has a unique sequential offset.
    • Partitions are distributed across brokers for load balancing.
    • The number of partitions determines the maximum parallelism for consumers.
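    Key-to-partition routing can be illustrated with a simplified hash. Kafka's default partitioner actually applies murmur2 to the key bytes, but the modulo idea is the same:

```python
def partition_for(key: str, num_partitions: int) -> int:
    # Simplified stand-in for Kafka's default partitioner, which
    # applies murmur2 to the key bytes before taking the modulo.
    return sum(key.encode()) % num_partitions

# The same key always lands on the same partition...
assert partition_for("u123", 3) == partition_for("u123", 3)

# ...but changing the partition count changes the mapping, which is why
# increasing partitions later breaks key-based ordering guarantees.
print(partition_for("u123", 3), partition_for("u123", 6))
```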

    Producers

    Producers publish records to topics. When sending a record, the producer can specify:

    • Key: Used to determine which partition receives the record. Records with the same key always go to the same partition (assuming partition count does not change), which guarantees ordering for that key.
    • Value: The actual payload (often JSON, Avro, or Protobuf).
    • Headers: Optional metadata key-value pairs.

    If no key is specified, records are distributed across partitions using a round-robin or sticky partition strategy.

    Consumer Groups

    A consumer group is a set of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer within the group, ensuring that each record is processed by only one consumer in the group.

    Consumer Group: analytics-pipeline
    ├── Consumer A → Partition 0, Partition 1
    ├── Consumer B → Partition 2, Partition 3
    └── Consumer C → Partition 4, Partition 5
    

    If you add more consumers than partitions, some consumers sit idle. If a consumer fails, its partitions are rebalanced to the remaining consumers. This is how Kafka achieves horizontal scalability and fault tolerance on the consumption side.

    Multiple consumer groups can read the same topic independently. For example, a user-events topic could be consumed by an analytics pipeline, a recommendation engine, and a fraud detection system — each as a separate consumer group, each maintaining its own offset.
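    The assignment logic can be sketched as follows: each group independently spreads all partitions across its own members. This uses a simplified round-robin; real Kafka assignors are range, round-robin, or cooperative-sticky:

```python
def assign(partitions, consumers):
    """Round-robin partition assignment within one consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(6))  # a topic with 6 partitions

# Two groups read the same topic; each covers ALL partitions independently.
analytics = assign(partitions, ["A", "B", "C"])
fraud = assign(partitions, ["X"])

print(analytics)  # {'A': [0, 3], 'B': [1, 4], 'C': [2, 5]}
print(fraud)      # {'X': [0, 1, 2, 3, 4, 5]}
```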

    Replication

    Kafka replicates each partition across multiple brokers for fault tolerance. Each partition has one leader and zero or more followers. All reads and writes go through the leader. Followers replicate data from the leader and take over if the leader fails.

    The replication factor (commonly set to 3) determines how many copies of each partition exist. The in-sync replicas (ISR) set tracks which followers are caught up with the leader.
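    The ISR idea can be sketched as a follower-lag check. Note that real Kafka uses a time-based criterion (replica.lag.time.max.ms) rather than an offset gap; this offset-based version is only for intuition:

```python
def in_sync_replicas(leader_end_offset, follower_offsets, max_lag=4):
    """Followers whose fetch position is within `max_lag` records of the leader."""
    return {
        broker_id
        for broker_id, offset in follower_offsets.items()
        if leader_end_offset - offset <= max_lag
    }

# Partition with replication factor 3: leader on broker 1, followers on 2 and 3.
followers = {2: 1000, 3: 950}
print(in_sync_replicas(leader_end_offset=1002, follower_offsets=followers))
# Broker 3 is 52 records behind, so it drops out of the ISR; broker 2 stays.
```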

    ZooKeeper and KRaft

    Historically, Kafka used Apache ZooKeeper for cluster metadata management, broker registration, and leader election. KRaft (Kafka Raft) mode, declared production-ready in Kafka 3.3, replaces ZooKeeper by embedding metadata management directly into Kafka brokers, and ZooKeeper support is removed entirely in Kafka 4.0. New deployments should use KRaft mode.

    Delivery Guarantees

    Kafka supports three delivery semantics:

    | Guarantee     | Description                                | How                                                                   |
    |---------------|--------------------------------------------|-----------------------------------------------------------------------|
    | At most once  | Messages may be lost, never duplicated     | Consumer commits offset before processing                             |
    | At least once | Messages are never lost, may be duplicated | Consumer commits offset after processing                              |
    | Exactly once  | Messages are neither lost nor duplicated   | Idempotent producer + transactional writes + read_committed consumers |

    Exactly-once semantics (EOS) was introduced in Kafka 0.11 and is achieved through:

    1. Idempotent producers: The broker deduplicates retried writes using a producer ID and sequence number.
    2. Transactions: Producers can atomically write to multiple partitions and commit consumer offsets in a single transaction.
    3. Read committed isolation: Consumers only read committed (non-transactional or committed transactional) messages.

    For most data engineering pipelines, at-least-once delivery with idempotent downstream processing is the pragmatic choice. Exactly-once adds overhead and complexity.
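    The difference between the first two semantics comes down to when the offset commit happens relative to processing. A small crash simulation makes it concrete (pure Python, no broker involved):

```python
def consume(messages, crash_at, commit_first):
    """Simulate a crash between the two steps (commit, process) of one message,
    followed by a restart from the last committed offset."""
    processed, committed = [], 0
    for i, msg in enumerate(messages):
        if commit_first:                    # at-most-once ordering
            committed = i + 1
            if i == crash_at:
                break                       # crash: offset committed, work lost
            processed.append(msg)
        else:                               # at-least-once ordering
            processed.append(msg)
            if i == crash_at:
                break                       # crash: work done, offset not committed
            committed = i + 1
    processed.extend(messages[committed:])  # restart from committed offset
    return processed

msgs = ["m0", "m1", "m2", "m3"]
print(consume(msgs, crash_at=2, commit_first=True))   # ['m0', 'm1', 'm3'] — m2 lost
print(consume(msgs, crash_at=2, commit_first=False))  # ['m0', 'm1', 'm2', 'm2', 'm3'] — m2 duplicated
```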

    Kafka Connect

    Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom code. It uses connectors — pre-built plugins for common systems.

    • Source connectors pull data into Kafka (e.g., from PostgreSQL, MySQL, MongoDB, S3).
    • Sink connectors push data from Kafka to external systems (e.g., to Elasticsearch, Snowflake, BigQuery, S3).

    Popular connectors include:

    • Debezium: CDC (Change Data Capture) source connectors for PostgreSQL, MySQL, SQL Server, MongoDB
    • S3 Sink/Source: Read and write to Amazon S3
    • JDBC Source/Sink: Generic relational database connectors
    • BigQuery Sink: Stream data directly into Google BigQuery

    Connect runs in distributed mode for production (multiple workers sharing the load) or standalone mode for development.
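    Connectors are configured with plain JSON posted to the Connect REST API. As an illustration, a minimal Debezium PostgreSQL source config might look like the sketch below; the keys shown are standard Debezium 2.x options, but all connection values are placeholders:

```python
import json

# Hypothetical Debezium PostgreSQL source connector config.
# Keys are standard Debezium 2.x options; values are placeholders.
connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "shop",
        "topic.prefix": "shop",               # topics become shop.<schema>.<table>
        "table.include.list": "public.orders",
        "plugin.name": "pgoutput",            # PostgreSQL's built-in logical decoding plugin
    },
}

# This payload would be POSTed to the Connect REST API, e.g.:
#   curl -X POST http://connect:8083/connectors \
#     -H 'Content-Type: application/json' -d @connector.json
print(json.dumps(connector, indent=2))
```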

    Schema Registry

    In production, you need to ensure that producers and consumers agree on the data format. The Schema Registry (provided by Confluent) stores and enforces schemas for Kafka topics.

    It supports Avro, Protobuf, and JSON Schema, with schema evolution rules:

    • Backward compatible: New schema can read old data
    • Forward compatible: Old schema can read new data
    • Full compatible: Both directions

    Schema Registry prevents producers from publishing data that would break consumers — a critical safety net for data pipelines at scale.
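    What "backward compatible" means in practice: a new reader schema can still decode records written with the old schema, typically because every added field carries a default. A pure-Python sketch of the idea — real deployments rely on Avro/Protobuf schema resolution rules, not hand-written code like this:

```python
# Old producer schema had only {"user_id", "event"}.
old_record = {"user_id": "u123", "event": "page_view"}

# New reader schema adds a field WITH a default — a backward-compatible change.
NEW_SCHEMA_DEFAULTS = {"user_id": None, "event": None, "country": "unknown"}

def read_with_new_schema(record):
    """Decode any record, old or new, filling added fields from their defaults."""
    return {field: record.get(field, default)
            for field, default in NEW_SCHEMA_DEFAULTS.items()}

decoded = read_with_new_schema(old_record)
print(decoded)  # {'user_id': 'u123', 'event': 'page_view', 'country': 'unknown'}
```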

    Real-World Use Cases

    Change Data Capture (CDC)

    CDC captures row-level changes from a database and streams them into Kafka. This is arguably the most impactful use case for data engineers. Instead of batch-extracting data from source databases, you get a continuous stream of inserts, updates, and deletes.

    Debezium is the standard tool for this. It reads the database's transaction log (WAL in PostgreSQL, binlog in MySQL) and produces change events to Kafka topics.

    PostgreSQL WAL → Debezium → Kafka Topic → Sink Connector → Data Warehouse
    

    This pattern eliminates the need for periodic full table scans, reduces load on source databases, and delivers near real-time data freshness.
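    Debezium change events carry an envelope with before and after row images and an op code (c = create, u = update, d = delete, r = snapshot read). A consumer routing on that envelope might look like this sketch — the envelope fields match Debezium's documented format, while the routing logic itself is illustrative:

```python
import json

def handle_change_event(raw: bytes):
    """Route a Debezium change event by operation type.
    Returns (op, row) where row is the state to apply downstream."""
    payload = json.loads(raw)["payload"]
    op = payload["op"]
    if op in ("c", "r", "u"):        # create, snapshot read, update
        return op, payload["after"]   # upsert the new row image
    if op == "d":                     # delete
        return op, payload["before"]  # emit a tombstone keyed on the old row
    raise ValueError(f"unknown op: {op}")

event = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 1, "status": "pending"},
        "after":  {"id": 1, "status": "shipped"},
        "ts_ms": 1700000000000,
    }
}).encode()

print(handle_change_event(event))  # ('u', {'id': 1, 'status': 'shipped'})
```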

    Event Streaming

    Modern applications emit events: user clicks, page views, transactions, IoT sensor readings. Kafka serves as the central nervous system for these events, enabling multiple downstream systems to react independently.

    A single user-activity topic can feed:

    • A real-time dashboard (via Kafka Streams or Flink)
    • A data lake (via S3 sink connector)
    • A recommendation engine (via a custom consumer)
    • A fraud detection system (via a stream processing application)

    Log Aggregation

    Kafka can replace traditional log aggregation systems. Applications write logs to Kafka topics instead of local files, and downstream consumers route them to Elasticsearch, S3, or a monitoring platform. This centralizes log management and makes logs available for both operational monitoring and analytical processing.
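    A minimal sketch of the pattern: a logging.Handler that forwards records to a produce function. A stand-in callable is used here for demonstration; in practice you would pass producer.produce from confluent-kafka and flush on shutdown:

```python
import json
import logging

class KafkaLogHandler(logging.Handler):
    """Forward log records to a Kafka-style produce(topic, value) callable."""
    def __init__(self, produce, topic="app-logs"):
        super().__init__()
        self.produce = produce
        self.topic = topic

    def emit(self, record):
        self.produce(self.topic, json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }))

# Stand-in for confluent-kafka's producer.produce, for demonstration only.
sent = []
logger = logging.getLogger("checkout")
logger.addHandler(KafkaLogHandler(lambda topic, value: sent.append((topic, value))))
logger.warning("payment retry for order %s", "o001")
print(sent)  # one ('app-logs', '{"level": "WARNING", ...}') record
```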

    For more patterns on building robust data pipelines, see our data pipeline design patterns guide.

    Kafka vs Alternatives

    Kafka vs Google Cloud Pub/Sub

    | Feature    | Kafka                           | Cloud Pub/Sub                |
    |------------|---------------------------------|------------------------------|
    | Hosting    | Self-managed or Confluent Cloud | Fully managed                |
    | Ordering   | Per-partition guaranteed        | Per-key with ordering keys   |
    | Retention  | Configurable (days to forever)  | 7 days default, 31 days max  |
    | Replay     | Full replay from any offset     | Seek to timestamp            |
    | Throughput | Millions of msgs/sec            | Auto-scales, no provisioning |
    | Cost model | Infrastructure-based            | Per-message pricing          |

    Choose Pub/Sub when you want zero operational overhead and are on GCP. Choose Kafka when you need fine-grained control, long retention, or multi-cloud portability.

    Kafka vs Amazon Kinesis

    | Feature              | Kafka                          | Kinesis Data Streams        |
    |----------------------|--------------------------------|-----------------------------|
    | Shards/Partitions    | Configurable, thousands+       | 1-10,000 shards             |
    | Retention            | Configurable                   | 1-365 days                  |
    | Throughput per shard | Higher (config-dependent)      | 1 MB/s in, 2 MB/s out       |
    | Consumer model       | Pull-based                     | Pull or enhanced fan-out    |
    | Ecosystem            | Kafka Connect, Streams, ksqlDB | Lambda, Firehose, Analytics |

    Choose Kinesis for tight AWS integration and simpler operations. Choose Kafka for higher throughput, richer ecosystem, and cloud portability.

    Getting Started with Docker

    The fastest way to run Kafka locally is with Docker Compose. Here is a minimal setup using KRaft mode (no ZooKeeper):

    # docker-compose.yml
    version: '3.8'
    services:
      kafka:
        image: apache/kafka:3.7.0
        hostname: kafka
        container_name: kafka
        ports:
          - "9092:9092"
        environment:
          KAFKA_NODE_ID: 1
          KAFKA_PROCESS_ROLES: broker,controller
          KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
          KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
          KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
          KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
          KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
          KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
          KAFKA_LOG_RETENTION_HOURS: 168
          CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
    

    Start it with:

    docker compose up -d
    

    Python Producer and Consumer Examples

    Install the Kafka Python client:

    pip install confluent-kafka
    

    Producer

    from confluent_kafka import Producer
    import json
    import time
    
    # Configuration
    config = {
        'bootstrap.servers': 'localhost:9092',
        'client.id': 'python-producer',
        'acks': 'all',               # Wait for all replicas to acknowledge
        'retries': 3,                # Retry on transient failures
        'enable.idempotence': True,  # Prevent duplicate messages on retry
    }
    
    producer = Producer(config)
    
    def delivery_callback(err, msg):
        """Called once for each message produced to indicate delivery result."""
        if err is not None:
            print(f'Message delivery failed: {err}')
        else:
            print(f'Message delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}')
    
    # Produce sample user events
    events = [
        {'user_id': 'u123', 'event': 'page_view', 'page': '/products', 'timestamp': time.time()},
        {'user_id': 'u456', 'event': 'add_to_cart', 'product_id': 'p789', 'timestamp': time.time()},
        {'user_id': 'u123', 'event': 'purchase', 'order_id': 'o001', 'timestamp': time.time()},
    ]
    
    topic = 'user-events'
    
    for event in events:
        # Use user_id as key to ensure ordering per user
        producer.produce(
            topic=topic,
            key=event['user_id'],
            value=json.dumps(event),
            callback=delivery_callback,
        )
    
    # Wait for all messages to be delivered
    producer.flush()
    print('All messages delivered.')
    

    Consumer

    from confluent_kafka import Consumer, KafkaError
    import json
    
    # Configuration
    config = {
        'bootstrap.servers': 'localhost:9092',
        'group.id': 'analytics-pipeline',
        'auto.offset.reset': 'earliest',  # Start from beginning if no committed offset
        'enable.auto.commit': False,       # Manual offset commits for at-least-once
    }
    
    consumer = Consumer(config)
    consumer.subscribe(['user-events'])
    
    try:
        while True:
            msg = consumer.poll(timeout=1.0)  # Wait up to 1 second for a message
    
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    print(f'Reached end of partition {msg.partition()}')
                else:
                    print(f'Error: {msg.error()}')
                continue
    
            # Process the message
            event = json.loads(msg.value().decode('utf-8'))
            print(f'Received: user={event["user_id"]}, event={event["event"]}')
    
            # Commit offset after successful processing (at-least-once)
            consumer.commit(asynchronous=False)
    
    except KeyboardInterrupt:
        pass
    finally:
        consumer.close()
    

    Production Considerations

    Topic Design

    • One topic per event type (e.g., user-signups, order-completed) is cleaner than a single all-events topic.
    • Partition count: Start with the number of consumers you expect. Increasing partitions later is possible but reshuffles key-based routing.
    • Retention: Set based on your replay requirements. Common values are 7 days for operational data, 30+ days for analytical pipelines, and infinite for event sourcing.

    Monitoring

    Key metrics to monitor:

    • Consumer lag: The difference between the latest offset and the consumer's committed offset. Growing lag means your consumers cannot keep up.
    • Under-replicated partitions: Indicates broker health issues.
    • Request latency: Producer and consumer request times.
    • Disk usage: Kafka stores everything on disk; plan capacity accordingly.
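    Lag itself is simple arithmetic: the partition's latest offset (the high watermark) minus the group's committed offset. With confluent-kafka the two inputs come from consumer.get_watermark_offsets() and consumer.committed(); the calculation reduces to:

```python
def consumer_lag(high_watermark: int, committed_offset: int) -> int:
    """Records the consumer group still has to process on one partition.
    Sum this across all partitions of a topic for total lag."""
    return max(0, high_watermark - committed_offset)

# Example: producers are at offset 10_500, the group has committed 10_100.
print(consumer_lag(10_500, 10_100))  # 400
```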

    Tools like Kafka UI, Confluent Control Center, and Prometheus + Grafana with JMX exporters are standard for monitoring.

    Common Pitfalls

    1. Too few partitions: Limits consumer parallelism. You cannot have more active consumers than partitions.
    2. Large messages: Kafka defaults to 1MB max message size. For larger payloads, store the data externally (S3) and send a reference through Kafka.
    3. Consumer rebalancing storms: Too many consumers joining and leaving causes frequent rebalancing. Use static group membership and incremental cooperative rebalancing.
    4. Not handling poison pills: A malformed message can crash your consumer in a loop. Implement dead letter queues for messages that fail processing.
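    A dead letter queue is straightforward to bolt onto the consumer loop: catch the processing error after a few retries and produce the raw message, plus error metadata, to a side topic instead of crashing. A sketch with stand-in process/produce callables (the names are illustrative):

```python
import json

def process_with_dlq(messages, process, produce_to_dlq, max_attempts=3):
    """Process each message; after repeated failures, route it to a DLQ
    instead of blocking the partition on a poison pill."""
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                process(msg)
                break
            except Exception as exc:
                if attempt == max_attempts:
                    produce_to_dlq(json.dumps({"original": msg, "error": str(exc)}))

# Stand-ins for demonstration: valid JSON processes fine, anything else poisons.
dlq = []
process_with_dlq(
    ['{"ok": 1}', "not-json", '{"ok": 2}'],
    process=json.loads,
    produce_to_dlq=dlq.append,
)
print(dlq)  # the malformed message lands in the DLQ with its error message
```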

    Where to Go Next

    Kafka is a vast ecosystem. Once you are comfortable with producers and consumers, explore:

    • Kafka Streams for stateful stream processing within your JVM application
    • ksqlDB for SQL-based stream processing
    • Apache Flink for complex event processing at scale
    • Schema Registry for schema management and evolution

    If you want hands-on practice with Kafka, check out our streaming data pipeline project where you build a complete pipeline from scratch. For a broader view of where Kafka fits in the modern data ecosystem, explore the Modern Data Stack roadmap.

    Frequently Asked Questions

    What is Apache Kafka?

    Apache Kafka is a distributed, fault-tolerant, high-throughput event streaming platform originally built at LinkedIn. It functions as a durable commit log that multiple systems can write to and read from independently. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable period, enabling replay, multiple consumer groups, and decoupled architectures.

    What is a Kafka topic?

    A Kafka topic is a named stream of records that producers write to and consumers read from — analogous to a table in a database. Each topic is divided into partitions, which are ordered, immutable sequences of records. Partitions enable parallel processing and are distributed across brokers for scalability. Ordering is guaranteed only within a single partition, not across partitions.

    What is a consumer group in Kafka?

    A consumer group is a set of consumers that cooperate to consume a topic, where each partition is assigned to exactly one consumer within the group. This ensures each record is processed by only one consumer in the group while enabling horizontal scalability. Multiple consumer groups can read the same topic independently, allowing different systems (analytics, fraud detection, recommendations) to process the same data stream at their own pace.

    When should I use Kafka vs a message queue?

    Use Kafka when you need durable message retention with replay capability, multiple independent consumers reading the same data, high-throughput event streaming, or decoupled producer-consumer architectures. Use a traditional message queue (like RabbitMQ or SQS) when you need simple point-to-point messaging, messages should be deleted after processing, or you want simpler operations with lower throughput requirements.

    What is Kafka Connect?

    Kafka Connect is a framework for streaming data between Kafka and external systems without writing custom code. Source connectors pull data into Kafka (e.g., Debezium for CDC from PostgreSQL/MySQL), and sink connectors push data from Kafka to external systems (e.g., S3, Snowflake, BigQuery, Elasticsearch). It runs in distributed mode for production scalability and handles serialization, offset management, and fault tolerance automatically.

    About the Author

    Adriano Sanges is a data engineering professional and the creator of dataskew.io. With years of experience building data platforms at scale, he shares practical insights and hands-on guides to help aspiring data engineers advance their careers.

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.