Data Pipeline Design Patterns Every Engineer Should Know

    Master essential data pipeline design patterns including idempotency, backfilling, error handling, and schema evolution for production systems.

    By Adriano Sanges · 14 min read
    data pipelines
    design patterns
    data engineering
    idempotency
    streaming
    batch processing

    TL;DR: Production-grade data pipelines rely on key design patterns: idempotency (safe to re-run), backfilling (reprocess any date range), dead letter queues (bad records don't kill the pipeline), and schema evolution (handle change gracefully). Combine these with retry logic and monitoring to build pipelines that run reliably for years.

    Building a data pipeline that works once is easy. Building one that works reliably at scale, handles failures gracefully, and can be maintained by a team over years — that's where design patterns matter. This guide covers the patterns that separate production-grade pipelines from prototypes.

    Why Patterns Matter in Data Engineering

    Data pipelines are deceptively complex. They deal with distributed systems, unreliable networks, schema changes, late-arriving data, and partial failures. Without established patterns, you end up with fragile spaghetti code that breaks at 3 AM and nobody wants to debug.

    These patterns aren't theoretical — they come from real-world production systems at companies processing billions of records daily. Learning them early will save you countless on-call incidents.

    Pattern 1: Idempotency

    Idempotency is arguably the most important pattern in data engineering. An operation is idempotent if running it multiple times produces the same result as running it once.

    Why It Matters

    Pipelines fail and need to be retried. If your pipeline inserts data without idempotency, retrying it creates duplicate records. If it updates data without idempotency, you might apply a transformation twice and get incorrect values.
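The difference shows up even in a toy sketch (the names below are illustrative, not from any library): an append-style load duplicates rows on every retry, while a keyed write converges to the same state no matter how many times it runs.

```python
# Toy demonstration of idempotency: append vs. keyed upsert.
def append_load(table: list, rows: list) -> None:
    table.extend(rows)  # re-running duplicates every row

def upsert_load(table: dict, rows: list) -> None:
    for row in rows:
        table[row["id"]] = row  # keyed write: re-running is a no-op

rows = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

t1: list = []
append_load(t1, rows)
append_load(t1, rows)  # simulated retry
# t1 now holds 4 rows: duplicates

t2: dict = {}
upsert_load(t2, rows)
upsert_load(t2, rows)  # simulated retry
# t2 still holds 2 rows: same result as running once
```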

    Implementation Strategies

    Strategy A: DELETE + INSERT (Partition Replacement)

    The simplest idempotent pattern. Delete the existing data for the time period, then insert the fresh data.

    -- Idempotent daily load: safe to re-run for any date
    BEGIN TRANSACTION;
    
    DELETE FROM analytics.daily_revenue
    WHERE date = '{{ ds }}';
    
    INSERT INTO analytics.daily_revenue (date, category, revenue, order_count)
    SELECT
        order_date AS date,
        product_category AS category,
        SUM(amount) AS revenue,
        COUNT(*) AS order_count
    FROM raw.orders
    WHERE order_date = '{{ ds }}'
    GROUP BY order_date, product_category;
    
    COMMIT;
    

    This pattern is safe because no matter how many times you run it for a given date, the result is the same.

    Strategy B: MERGE / UPSERT

    When you need to update existing records or insert new ones based on a key:

    -- Idempotent upsert using MERGE
    MERGE INTO analytics.customer_metrics AS target
    USING (
        SELECT
            customer_id,
            COUNT(*) AS total_orders,
            SUM(amount) AS lifetime_value,
            MAX(order_date) AS last_order_date
        FROM raw.orders
        WHERE order_date = '{{ ds }}'
        GROUP BY customer_id
    ) AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN
        UPDATE SET
            total_orders = source.total_orders,
            lifetime_value = source.lifetime_value,
            last_order_date = source.last_order_date
    WHEN NOT MATCHED THEN
        INSERT (customer_id, total_orders, lifetime_value, last_order_date)
        VALUES (source.customer_id, source.total_orders, source.lifetime_value, source.last_order_date);
    

    Strategy C: Immutable Append with Deduplication

    Write all records with a processing timestamp, then deduplicate at read time:

    -- Write layer: always append, never update
    INSERT INTO raw.events_log (event_id, event_data, processed_at)
    SELECT event_id, event_data, CURRENT_TIMESTAMP
    FROM staging.new_events;
    
    -- Read layer: deduplicate using window function
    CREATE VIEW analytics.events AS
    SELECT * FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY event_id
                ORDER BY processed_at DESC
            ) AS rn
        FROM raw.events_log
    ) AS deduplicated  -- subquery alias required by most SQL dialects
    WHERE rn = 1;
    

    Pattern 2: Backfilling

    Backfilling is the ability to reprocess historical data. It's not an edge case — it's a regular operation. Schema changes, bug fixes, new business logic, and late-arriving data all require backfills.

    Design for Backfilling from Day One

    The key principle: every pipeline should accept a date parameter and process exactly that date's data. Never hardcode CURRENT_DATE or NOW().

    from datetime import date, datetime
    
    # Bad: Not backfillable
    def process_orders():
        today = datetime.now().date()
        orders = fetch_orders(date=today)
        transform_and_load(orders)
    
    # Good: Backfillable by design
    def process_orders(execution_date: date):
        orders = fetch_orders(date=execution_date)
        transform_and_load(orders, partition_date=execution_date)
    

    Backfilling with Apache Airflow

    Airflow's catchup mechanism is purpose-built for backfilling. When combined with idempotent tasks, you can safely reprocess any date range:

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from datetime import datetime
    
    dag = DAG(
        "daily_revenue_pipeline",
        schedule="@daily",  # "schedule_interval" on Airflow versions before 2.4
        start_date=datetime(2025, 1, 1),
        catchup=True,  # Enable backfilling
    )
    
    def process_daily_revenue(**context):
        execution_date = context["ds"]  # YYYY-MM-DD string
        # Idempotent: safe to run multiple times for same date
        delete_partition(table="daily_revenue", date=execution_date)
        data = extract_orders(date=execution_date)
        transformed = transform(data)
        load_partition(table="daily_revenue", data=transformed, date=execution_date)
    
    task = PythonOperator(
        task_id="process_daily_revenue",
        python_callable=process_daily_revenue,
        dag=dag,
    )
    

    To backfill a specific date range:

    airflow dags backfill daily_revenue_pipeline \
        --start-date 2025-06-01 \
        --end-date 2025-06-30
    

    Practice building these orchestration patterns in our Orchestration with Airflow project.

    Pattern 3: Schema Evolution

    Schemas change. Columns get added, renamed, or removed. Data types shift. A robust pipeline handles schema evolution without breaking.

    Strategies for Schema Evolution

    Additive-Only Changes

    The safest approach: only add new columns, never remove or rename existing ones. Consumers of the data continue to work because their expected columns still exist.

    -- Safe: adding a column
    ALTER TABLE analytics.customers ADD COLUMN loyalty_tier VARCHAR(20);
    
    -- Dangerous: renaming breaks downstream queries
    -- ALTER TABLE analytics.customers RENAME COLUMN name TO full_name;
    

    Schema Registry (Streaming)

    For event-driven architectures, a schema registry (like Confluent Schema Registry for Kafka) enforces compatibility rules:

    • Backward compatible: New schema can read data written with the old schema.
    • Forward compatible: Old schema can read data written with the new schema.
    • Full compatible: Both directions work.
    
    For example, this Avro schema can evolve safely because each added field declares a default:
    
    {
      "type": "record",
      "name": "OrderEvent",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
        {"name": "loyalty_points", "type": ["null", "int"], "default": null}
      ]
    }
    

    New fields with defaults are backward compatible: consumers on the new schema fill in the default when reading older records, and old consumers simply ignore fields they don't recognize.
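The same defaulting can be sketched by hand in plain Python, assuming JSON-encoded events (a real Avro deserializer applies writer/reader schema resolution automatically; the field names below mirror the schema above):

```python
import json

# Defaults mirroring the Avro schema above (illustrative, hand-rolled).
SCHEMA_DEFAULTS = {"currency": "USD", "loyalty_points": None}

def parse_order_event(raw: bytes) -> dict:
    """Parse an OrderEvent, tolerating fields missing from older payloads."""
    event = json.loads(raw)
    for field, default in SCHEMA_DEFAULTS.items():
        event.setdefault(field, default)  # fill gaps left by old-schema producers
    return event

# A payload produced before "currency" and "loyalty_points" existed:
old_event = b'{"order_id": "A1", "amount": 9.99}'
parsed = parse_order_event(old_event)
# parsed["currency"] == "USD"; parsed["loyalty_points"] is None
```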

    Versioned Tables

    For significant schema changes, create a new version of the table and migrate consumers incrementally:

    analytics.customers_v1  -- Old schema, deprecated
    analytics.customers_v2  -- New schema, active
    analytics.customers     -- View pointing to v2
    

    Pattern 4: Error Handling and Dead Letter Queues

    Not every record will process successfully. The question is: does one bad record kill your entire pipeline?

    The Dead Letter Queue Pattern

    Instead of failing the pipeline when a record can't be processed, route it to a dead letter queue (DLQ) for investigation, and continue processing the rest.

    from dataclasses import dataclass
    from typing import List
    from datetime import datetime, timezone
    import logging
    
    logger = logging.getLogger(__name__)
    
    @dataclass
    class ProcessingResult:
        successful: List[dict]
        failed: List[dict]
    
    def process_batch(records: List[dict]) -> ProcessingResult:
        successful = []
        failed = []
    
        for record in records:
            try:
                validated = validate_and_transform(record)
                successful.append(validated)
            except Exception as e:
                logger.warning(f"Record failed processing: {e}")
                failed.append({
                    "original_record": record,
                    "error": str(e),
                    "error_type": type(e).__name__,
                    # utcnow() is deprecated; use an aware UTC timestamp
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                })
    
        return ProcessingResult(successful=successful, failed=failed)
    
    def run_pipeline(records: List[dict]):
        result = process_batch(records)
    
        # Load successful records to target
        load_to_warehouse(result.successful)
    
        # Route failures to dead letter queue
        if result.failed:
            load_to_dlq(result.failed)
            logger.warning(
                f"{len(result.failed)}/{len(records)} records "
                f"routed to dead letter queue"
            )
    
            # Alert if failure rate exceeds threshold
            failure_rate = len(result.failed) / len(records)
            if failure_rate > 0.05:  # 5% threshold
                send_alert(f"High failure rate: {failure_rate:.1%}")
    

    Circuit Breaker Pattern

    If a downstream system is failing consistently, don't keep hammering it. Implement a circuit breaker that stops calls after a threshold of failures:

    import time
    
    class CircuitBreakerOpenError(Exception):
        """Raised when the circuit is open and a call is rejected."""
    
    class CircuitBreaker:
        def __init__(self, failure_threshold=5, reset_timeout=60):
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.state = "CLOSED"  # CLOSED = normal, OPEN = failing
            self.last_failure_time = None
    
        def call(self, func, *args, **kwargs):
            if self.state == "OPEN":
                if time.time() - self.last_failure_time > self.reset_timeout:
                    self.state = "HALF_OPEN"
                else:
                    raise CircuitBreakerOpenError(
                        "Circuit is open, call rejected"
                    )
    
            try:
                result = func(*args, **kwargs)
                self._on_success()
                return result
            except Exception as e:
                self._on_failure()
                raise
    
        def _on_success(self):
            self.failure_count = 0
            self.state = "CLOSED"
    
        def _on_failure(self):
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
    

    Pattern 5: Retry with Exponential Backoff

    Transient failures — network timeouts, rate limits, temporary unavailability — are normal. Retrying with exponential backoff prevents overwhelming a struggling service.

    import time
    import random
    import logging
    
    def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
        """Retry a function with exponential backoff and jitter."""
        for attempt in range(max_retries):
            try:
                return func()
            except (ConnectionError, TimeoutError) as e:
                if attempt == max_retries - 1:
                    raise  # Final attempt, propagate the error
    
                # Exponential backoff with jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                jitter = random.uniform(0, delay * 0.1)
                wait_time = delay + jitter
    
                logging.warning(
                    f"Attempt {attempt + 1} failed: {e}. "
                    f"Retrying in {wait_time:.1f}s..."
                )
                time.sleep(wait_time)
    

    The jitter is critical. Without it, if multiple pipeline instances fail simultaneously, they all retry at the exact same intervals, creating a "thundering herd" that overwhelms the downstream system again.
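To see the schedule concretely, here is the delay sequence the function above produces for its defaults (base_delay=1.0, max_delay=60.0); the RNG is seeded only so this illustration is reproducible:

```python
import random

base_delay, max_delay = 1.0, 60.0
rng = random.Random(42)  # seeded for a reproducible illustration

delays = []
for attempt in range(7):
    delay = min(base_delay * (2 ** attempt), max_delay)  # 1, 2, 4, ... capped at 60
    jitter = rng.uniform(0, delay * 0.1)                 # up to 10% extra
    delays.append(round(delay + jitter, 2))

print(delays)  # grows 1 -> 2 -> 4 -> ... and caps near 60 seconds
```

Because the jitter is at most 10% of the base delay, the sequence stays strictly increasing until it hits the cap, so retries spread out without ever shrinking.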

    Pattern 6: Exactly-Once vs At-Least-Once Delivery

    This is a fundamental concept in distributed data systems, especially for streaming pipelines.

    At-Least-Once

    The message is guaranteed to be delivered, but may arrive more than once. This is the default for most systems because it's simpler and more performant.

    Solution: Combine at-least-once delivery with idempotent processing. Your consumer must handle duplicates gracefully.

    import json
    
    # At-least-once consumer with idempotent processing
    def consume_events(consumer):
        for message in consumer:
            event = json.loads(message.value)
    
            # Idempotent check: skip if already processed
            if is_already_processed(event["event_id"]):
                consumer.commit()
                continue
    
            process_event(event)
            mark_as_processed(event["event_id"])
            consumer.commit()
    

    Exactly-Once

    The message is delivered precisely once. This requires transactional coordination between the message system and the processing system. Kafka supports exactly-once semantics through idempotent producers and transactional consumers, but it comes with a performance cost.

    In practice, at-least-once with idempotent consumers is the most common pattern because it's simpler and nearly as effective.

    Pattern 7: Batch vs Streaming — and When to Use Each

    This isn't an either-or choice. Most mature data platforms use both.

    Batch Processing

    • Processes data in discrete chunks (hourly, daily)
    • Higher throughput, lower cost per record
    • Simpler to reason about, debug, and backfill
    • Tools: Spark, Airflow, dbt

    Best for: Analytics, reporting, ML training, data warehouse loads

    Stream Processing

    • Processes data continuously as it arrives
    • Lower latency (seconds to minutes)
    • More complex state management
    • Tools: Kafka Streams, Apache Flink, Spark Structured Streaming

    Best for: Real-time dashboards, fraud detection, operational alerts, event-driven architectures

    The Lambda and Kappa Architectures

    Lambda Architecture: Run both batch and streaming layers in parallel. Batch provides the "truth" layer; streaming provides the "speed" layer. Results are merged at query time.

    [Source] --> [Batch Layer] --> [Batch Views]      \
                                                       --> [Serving Layer] --> [Query]
    [Source] --> [Speed Layer] --> [Real-time Views]  /
    

    Kappa Architecture: Use a single streaming layer for everything. Historical reprocessing is done by replaying the event log.

    [Source] --> [Event Log] --> [Stream Processing] --> [Serving Layer] --> [Query]
    

    Kappa is simpler to maintain but requires a replayable event log (like Kafka with long retention). In practice, most teams end up with a blend of both.
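The Kappa reprocessing story reduces to one idea: replaying the log from offset zero rebuilds state deterministically. A minimal sketch with hypothetical event names:

```python
from collections import defaultdict

# An append-only event log (in production this would be a Kafka topic).
event_log = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 7.5},
]

def replay(log):
    """Rebuild per-customer totals by reprocessing the log from the start."""
    totals = defaultdict(float)
    for event in log:  # reprocessing == replaying from offset 0
        totals[event["customer_id"]] += event["amount"]
    return dict(totals)

state = replay(event_log)
# Replaying again yields the same state, so "backfill" is just a fresh replay.
```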

    Build your streaming skills with our Streaming with Kafka project, and explore the full Modern Data Stack roadmap to see how these patterns fit together.

    Pattern 8: Slowly Changing Dimensions (SCD) in Pipelines

    When dimension data changes over time (a customer moves to a new city, a product changes categories), your pipeline must decide how to handle the history.

    • Type 1: Overwrite the old value. Simple but loses history.
    • Type 2: Add a new row with effective dates. Preserves full history.
    • Type 3: Add a column for the previous value. Limited history.
    
    -- SCD Type 2 implementation
    -- When a customer's city changes, close the old record and open a new one
    
    -- Step 1: Close existing records for changed customers
    UPDATE dim_customer
    SET
        effective_end_date = CURRENT_DATE - INTERVAL '1 day',
        is_current = FALSE
    WHERE customer_id IN (
        SELECT s.customer_id
        FROM staging.customer_updates s
        JOIN dim_customer d ON s.customer_id = d.customer_id
        WHERE s.city != d.city AND d.is_current = TRUE
    )
    AND is_current = TRUE;
    
    -- Step 2: Insert new records for changed customers
    INSERT INTO dim_customer
        (customer_id, name, city, effective_start_date, effective_end_date, is_current)
    SELECT
        s.customer_id,
        s.name,
        s.city,
        CURRENT_DATE,
        '9999-12-31',
        TRUE
    FROM staging.customer_updates s
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_customer d
        WHERE d.customer_id = s.customer_id
          AND d.city = s.city
          AND d.is_current = TRUE
    );
    

    Putting It All Together

    A production-grade pipeline combines multiple patterns:

    1. Parameterized execution dates — enables backfilling
    2. Idempotent operations — safe to retry and backfill
    3. Dead letter queues — bad records don't kill the pipeline
    4. Retry with backoff — handles transient failures
    5. Schema evolution support — adapts to changing sources
    6. Monitoring and alerting — know when things break before users do
    
    Combined in a single task, the result looks like this:
    
    def daily_pipeline_task(execution_date: str):
        """
        Idempotent, backfillable, error-tolerant pipeline task.
        """
        # Extract with retry
        raw_data = retry_with_backoff(
            lambda: extract_from_api(date=execution_date)
        )
    
        # Transform with error routing
        result = process_batch(raw_data)
    
        # Idempotent load (delete + insert)
        with database.transaction():
            delete_partition("analytics.metrics", date=execution_date)
            insert_records("analytics.metrics", result.successful)
    
        # Route failures to DLQ
        if result.failed:
            insert_records("dlq.failed_records", result.failed)
    
        # Emit metrics for monitoring
        emit_metric("pipeline.records_processed", len(result.successful))
        emit_metric("pipeline.records_failed", len(result.failed))
    

    Key Takeaways

    1. Idempotency is non-negotiable — every pipeline should be safe to re-run.
    2. Design for backfilling from day one — you will need it sooner than you think.
    3. Handle failures gracefully — dead letter queues and circuit breakers keep your system resilient.
    4. Prefer at-least-once with idempotent consumers over exactly-once — simpler and nearly as effective.
    5. Batch and streaming serve different needs — most platforms use both.
    6. Schema evolution is a feature, not an afterthought — plan for it in your architecture.

    These patterns aren't just academic concepts. They're the difference between a pipeline that runs smoothly for years and one that wakes you up every other night. Invest the time to learn and apply them — your future self (and your on-call rotation) will thank you.

    Frequently Asked Questions

    What is an idempotent pipeline?

    An idempotent pipeline produces the same result regardless of how many times it is executed with the same input. This is achieved through strategies like DELETE + INSERT (replacing a partition), MERGE/UPSERT operations, or immutable append with deduplication at read time. Idempotency is critical because pipelines fail and need to be retried safely without creating duplicate data.

    What is backfilling in data engineering?

    Backfilling is the process of reprocessing historical data through a pipeline. It is needed when you fix bugs, change business logic, or handle late-arriving data. A backfillable pipeline accepts a date parameter and processes exactly that date's data, rather than hardcoding the current date. Tools like Apache Airflow have built-in backfill support via the catchup mechanism.

    What is exactly-once processing?

    Exactly-once processing guarantees that each message or record is processed precisely one time, with no duplicates and no data loss. True exactly-once requires transactional coordination between the messaging system and the processing system, which adds performance overhead. In practice, most data engineers use at-least-once delivery combined with idempotent consumers, which is simpler and nearly as effective.

    How do I handle schema changes in data pipelines?

    The safest strategy is additive-only changes: only add new columns, never remove or rename existing ones. For streaming systems, use a schema registry (like Confluent Schema Registry) to enforce backward and forward compatibility. For major changes, use versioned tables and migrate consumers incrementally. Always plan for schema evolution in your pipeline architecture from day one.

    About the Author

    Adriano Sanges is a data engineering professional and the creator of dataskew.io. With years of experience building data platforms at scale, he shares practical insights and hands-on guides to help aspiring data engineers advance their careers.

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.