TL;DR: Production-grade data pipelines rely on key design patterns: idempotency (safe to re-run), backfilling (reprocess any date range), dead letter queues (bad records don't kill the pipeline), and schema evolution (handle change gracefully). Combine these with retry logic and monitoring to build pipelines that run reliably for years.
Data Pipeline Design Patterns Every Engineer Should Know
Building a data pipeline that works once is easy. Building one that works reliably at scale, handles failures gracefully, and can be maintained by a team over years — that's where design patterns matter. This guide covers the patterns that separate production-grade pipelines from prototypes.
Why Patterns Matter in Data Engineering
Data pipelines are deceptively complex. They deal with distributed systems, unreliable networks, schema changes, late-arriving data, and partial failures. Without established patterns, you end up with fragile spaghetti code that breaks at 3 AM and nobody wants to debug.
These patterns aren't theoretical — they come from real-world production systems at companies processing billions of records daily. Learning them early will save you countless on-call incidents.
Pattern 1: Idempotency
Idempotency is arguably the most important pattern in data engineering. An operation is idempotent if running it multiple times produces the same result as running it once.
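A toy illustration of the definition, using two hypothetical record-processing functions: setting a field to a fixed value is idempotent, while incrementing a counter is not.

```python
def set_status(record: dict) -> dict:
    # Idempotent: applying it twice leaves the record in the same state
    record["status"] = "processed"
    return record

def bump_counter(record: dict) -> dict:
    # Not idempotent: every run changes the result
    record["count"] = record.get("count", 0) + 1
    return record

once = set_status({"id": 1})
twice = set_status(set_status({"id": 1}))
print(once == twice)  # → True

once_c = bump_counter({"id": 1})
twice_c = bump_counter(bump_counter({"id": 1}))
print(once_c == twice_c)  # → False
```

A retried pipeline behaves like the second function unless you design it to behave like the first.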
Why It Matters
Pipelines fail and need to be retried. If your pipeline inserts data without idempotency, retrying it creates duplicate records. If it updates data without idempotency, you might apply a transformation twice and get incorrect values.
Implementation Strategies
Strategy A: DELETE + INSERT (Partition Replacement)
The simplest idempotent pattern. Delete the existing data for the time period, then insert the fresh data.
-- Idempotent daily load: safe to re-run for any date
BEGIN TRANSACTION;

DELETE FROM analytics.daily_revenue
WHERE date = '{{ ds }}';

INSERT INTO analytics.daily_revenue (date, category, revenue, order_count)
SELECT
    order_date AS date,
    product_category AS category,
    SUM(amount) AS revenue,
    COUNT(*) AS order_count
FROM raw.orders
WHERE order_date = '{{ ds }}'
GROUP BY order_date, product_category;

COMMIT;
This pattern is safe because no matter how many times you run it for a given date, the result is the same.
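The same guarantee can be demonstrated end to end with an in-memory SQLite database (the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (date TEXT, revenue REAL)")

def load_partition(ds: str, revenue: float) -> None:
    # Idempotent load: replace the whole partition inside one transaction
    with conn:
        conn.execute("DELETE FROM daily_revenue WHERE date = ?", (ds,))
        conn.execute("INSERT INTO daily_revenue VALUES (?, ?)", (ds, revenue))

load_partition("2025-06-01", 1200.0)
load_partition("2025-06-01", 1200.0)  # Re-run for the same date: no duplicates

print(conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0])  # → 1
```

Running the load a second time deletes and rewrites the same partition, so the row count never drifts.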
Strategy B: MERGE / UPSERT
When you need to update existing records or insert new ones based on a key:
-- Idempotent upsert using MERGE
-- Recompute full-history metrics, but only for customers active on the run
-- date, so re-running for the same date always converges to the same values.
MERGE INTO analytics.customer_metrics AS target
USING (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(amount) AS lifetime_value,
        MAX(order_date) AS last_order_date
    FROM raw.orders
    WHERE customer_id IN (
        SELECT customer_id FROM raw.orders WHERE order_date = '{{ ds }}'
    )
    GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        total_orders = source.total_orders,
        lifetime_value = source.lifetime_value,
        last_order_date = source.last_order_date
WHEN NOT MATCHED THEN
    INSERT (customer_id, total_orders, lifetime_value, last_order_date)
    VALUES (source.customer_id, source.total_orders, source.lifetime_value, source.last_order_date);
Strategy C: Immutable Append with Deduplication
Write all records with a processing timestamp, then deduplicate at read time:
-- Write layer: always append, never update
INSERT INTO raw.events_log (event_id, event_data, processed_at)
SELECT event_id, event_data, CURRENT_TIMESTAMP
FROM staging.new_events;

-- Read layer: deduplicate using a window function
CREATE VIEW analytics.events AS
SELECT * FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY processed_at DESC
        ) AS rn
    FROM raw.events_log
) AS deduped
WHERE rn = 1;
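Run against SQLite (3.25+ for window functions), the append-then-deduplicate pattern looks like this; the sample events are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_log (event_id TEXT, event_data TEXT, processed_at TEXT);

    -- The same event written twice by a retried run
    INSERT INTO events_log VALUES ('e1', 'first write',  '2025-06-01T00:00:00');
    INSERT INTO events_log VALUES ('e1', 'second write', '2025-06-01T01:00:00');
    INSERT INTO events_log VALUES ('e2', 'only write',   '2025-06-01T00:30:00');

    -- Read layer keeps only the latest write per event_id
    CREATE VIEW events AS
    SELECT event_id, event_data FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY event_id ORDER BY processed_at DESC
        ) AS rn
        FROM events_log
    ) AS deduped WHERE rn = 1;
""")

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

The write path never has to worry about duplicates; the view absorbs them.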
Pattern 2: Backfilling
Backfilling is the ability to reprocess historical data. It's not an edge case — it's a regular operation. Schema changes, bug fixes, new business logic, and late-arriving data all require backfills.
Design for Backfilling from Day One
The key principle: every pipeline should accept a date parameter and process exactly that date's data. Never hardcode CURRENT_DATE or NOW().
from datetime import date, datetime

# Bad: not backfillable (the date is resolved at runtime)
def process_orders():
    today = datetime.now().date()
    orders = fetch_orders(date=today)
    transform_and_load(orders)

# Good: backfillable by design (the date is an explicit parameter)
def process_orders(execution_date: date):
    orders = fetch_orders(date=execution_date)
    transform_and_load(orders, partition_date=execution_date)
Backfilling with Apache Airflow
Airflow's catchup mechanism is purpose-built for backfilling. When combined with idempotent tasks, you can safely reprocess any date range:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

dag = DAG(
    "daily_revenue_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=True,  # Enable backfilling
)

def process_daily_revenue(**context):
    execution_date = context["ds"]  # YYYY-MM-DD string
    # Idempotent: safe to run multiple times for the same date
    delete_partition(table="daily_revenue", date=execution_date)
    data = extract_orders(date=execution_date)
    transformed = transform(data)
    load_partition(table="daily_revenue", data=transformed, date=execution_date)

task = PythonOperator(
    task_id="process_daily_revenue",
    python_callable=process_daily_revenue,
    dag=dag,
)
To backfill a specific date range:
airflow dags backfill daily_revenue_pipeline \
--start-date 2025-06-01 \
--end-date 2025-06-30
Practice building these orchestration patterns in our Orchestration with Airflow project.
Pattern 3: Schema Evolution
Schemas change. Columns get added, renamed, or removed. Data types shift. A robust pipeline handles schema evolution without breaking.
Strategies for Schema Evolution
Additive-Only Changes
The safest approach: only add new columns, never remove or rename existing ones. Consumers of the data continue to work because their expected columns still exist.
-- Safe: adding a column
ALTER TABLE analytics.customers ADD COLUMN loyalty_tier VARCHAR(20);
-- Dangerous: renaming breaks downstream queries
-- ALTER TABLE analytics.customers RENAME COLUMN name TO full_name;
Schema Registry (Streaming)
For event-driven architectures, a schema registry (like Confluent Schema Registry for Kafka) enforces compatibility rules:
- Backward compatible: New schema can read data written with the old schema.
- Forward compatible: Old schema can read data written with the new schema.
- Fully compatible: both directions work.
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "loyalty_points", "type": ["null", "int"], "default": null}
  ]
}
New fields with defaults are backward compatible — old consumers simply ignore them.
Versioned Tables
For significant schema changes, create a new version of the table and migrate consumers incrementally:
analytics.customers_v1 -- Old schema, deprecated
analytics.customers_v2 -- New schema, active
analytics.customers -- View pointing to v2
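A minimal SQLite sketch of the view-swap approach (schema and names are illustrative): consumers always query the stable `customers` name, and cutting over to a new version is a single view redefinition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_v1 (id INTEGER, name TEXT);
    CREATE TABLE customers_v2 (id INTEGER, full_name TEXT, loyalty_tier TEXT);
    INSERT INTO customers_v2 VALUES (1, 'Ada Lovelace', 'gold');

    -- Consumers query the view; only the view definition moves between versions
    CREATE VIEW customers AS SELECT * FROM customers_v2;
""")

row = conn.execute("SELECT full_name, loyalty_tier FROM customers").fetchone()
print(row)  # → ('Ada Lovelace', 'gold')
```

Migrating consumers then means updating their queries for the new columns at their own pace, while `customers_v1` stays available as a fallback.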
Pattern 4: Error Handling and Dead Letter Queues
Not every record will process successfully. The question is: does one bad record kill your entire pipeline?
The Dead Letter Queue Pattern
Instead of failing the pipeline when a record can't be processed, route it to a dead letter queue (DLQ) for investigation, and continue processing the rest.
from dataclasses import dataclass
from typing import List
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

@dataclass
class ProcessingResult:
    successful: List[dict]
    failed: List[dict]

def process_batch(records: List[dict]) -> ProcessingResult:
    successful = []
    failed = []
    for record in records:
        try:
            validated = validate_and_transform(record)
            successful.append(validated)
        except Exception as e:
            logger.warning(f"Record failed processing: {e}")
            failed.append({
                "original_record": record,
                "error": str(e),
                "error_type": type(e).__name__,
                "timestamp": datetime.utcnow().isoformat(),
            })
    return ProcessingResult(successful=successful, failed=failed)

def run_pipeline(records: List[dict]):
    result = process_batch(records)

    # Load successful records to the target
    load_to_warehouse(result.successful)

    # Route failures to the dead letter queue
    if result.failed:
        load_to_dlq(result.failed)
        logger.warning(
            f"{len(result.failed)}/{len(records)} records "
            f"routed to dead letter queue"
        )

    # Alert if the failure rate exceeds a threshold (guard against empty batches)
    if records:
        failure_rate = len(result.failed) / len(records)
        if failure_rate > 0.05:  # 5% threshold
            send_alert(f"High failure rate: {failure_rate:.1%}")
Circuit Breaker Pattern
If a downstream system is failing consistently, don't keep hammering it. Implement a circuit breaker that stops calls after a threshold of failures:
import time

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"  # CLOSED = normal, OPEN = failing, HALF_OPEN = probing
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"  # Allow one trial call through
            else:
                raise CircuitBreakerOpenError("Circuit is open, call rejected")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
Pattern 5: Retry with Exponential Backoff
Transient failures — network timeouts, rate limits, temporary unavailability — are normal. Retrying with exponential backoff prevents overwhelming a struggling service.
import time
import random
import logging

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a function with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError) as e:
            if attempt == max_retries - 1:
                raise  # Final attempt, propagate the error

            # Exponential backoff with jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            wait_time = delay + jitter

            logging.warning(
                f"Attempt {attempt + 1} failed: {e}. "
                f"Retrying in {wait_time:.1f}s..."
            )
            time.sleep(wait_time)
The jitter is critical. Without it, if multiple pipeline instances fail simultaneously, they all retry at the exact same intervals, creating a "thundering herd" that overwhelms the downstream system again.
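To make the schedule concrete, here is the same delay formula in isolation (base delay 1 s, capped at 60 s); `backoff_delay` is a helper extracted purely for illustration:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    # Exponential growth, capped, plus up to 10% random jitter
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay + random.uniform(0, delay * 0.1)

# Deterministic part of the schedule: doubles each attempt, then hits the cap
schedule = [min(1.0 * (2 ** a), 60.0) for a in range(8)]
print(schedule)  # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The jitter term is what spreads simultaneous retries apart: two instances that fail at the same moment wait slightly different amounts of time instead of colliding again.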
Pattern 6: Exactly-Once vs At-Least-Once Delivery
This is a fundamental concept in distributed data systems, especially for streaming pipelines.
At-Least-Once
The message is guaranteed to be delivered, but may arrive more than once. This is the default for most systems because it's simpler and more performant.
Solution: Combine at-least-once delivery with idempotent processing. Your consumer must handle duplicates gracefully.
import json

# At-least-once consumer with idempotent processing
def consume_events(consumer):
    for message in consumer:
        event = json.loads(message.value)

        # Idempotent check: skip if already processed
        if is_already_processed(event["event_id"]):
            consumer.commit()
            continue

        process_event(event)
        mark_as_processed(event["event_id"])
        consumer.commit()
Exactly-Once
The message is delivered precisely once. This requires transactional coordination between the message system and the processing system. Kafka supports exactly-once semantics through idempotent producers and transactional consumers, but it comes with a performance cost.
In practice, at-least-once with idempotent consumers is the most common pattern because it's simpler and nearly as effective.
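The `is_already_processed` / `mark_as_processed` helpers in the consumer above can be backed by any durable key store; this in-memory sketch shows the idea (production systems typically use a database table or Redis, and all names here are illustrative):

```python
_processed_ids: set[str] = set()

def is_already_processed(event_id: str) -> bool:
    return event_id in _processed_ids

def mark_as_processed(event_id: str) -> None:
    _processed_ids.add(event_id)

def handle(event: dict, sink: list) -> None:
    # Duplicate deliveries become no-ops, so at-least-once
    # delivery behaves like exactly-once processing
    if is_already_processed(event["event_id"]):
        return
    sink.append(event)
    mark_as_processed(event["event_id"])

out: list = []
for e in [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]:  # "a" delivered twice
    handle(e, out)
print(len(out))  # → 2
```

The key design point is that the dedup store must survive consumer restarts, otherwise a crash between delivery and commit reintroduces duplicates.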
Pattern 7: Batch vs Streaming — and When to Use Each
This isn't an either-or choice. Most mature data platforms use both.
Batch Processing
- Processes data in discrete chunks (hourly, daily)
- Higher throughput, lower cost per record
- Simpler to reason about, debug, and backfill
- Tools: Spark, Airflow, dbt
Best for: Analytics, reporting, ML training, data warehouse loads
Stream Processing
- Processes data continuously as it arrives
- Lower latency (seconds to minutes)
- More complex state management
- Tools: Kafka Streams, Apache Flink, Spark Structured Streaming
Best for: Real-time dashboards, fraud detection, operational alerts, event-driven architectures
The Lambda and Kappa Architectures
Lambda Architecture: Run both batch and streaming layers in parallel. Batch provides the "truth" layer; streaming provides the "speed" layer. Results are merged at query time.
[Source] --> [Batch Layer] --> [Batch Views] ------\
                                                    +--> [Serving Layer] --> [Query]
[Source] --> [Speed Layer] --> [Real-time Views] --/
Kappa Architecture: Use a single streaming layer for everything. Historical reprocessing is done by replaying the event log.
[Source] --> [Event Log] --> [Stream Processing] --> [Serving Layer] --> [Query]
Kappa is simpler to maintain but requires a replayable event log (like Kafka with long retention). In practice, most teams end up with a blend of both.
Build your streaming skills with our Streaming with Kafka project, and explore the full Modern Data Stack roadmap to see how these patterns fit together.
Pattern 8: Slowly Changing Dimensions (SCD) in Pipelines
When dimension data changes over time (a customer moves to a new city, a product changes categories), your pipeline must decide how to handle the history.
- Type 1: Overwrite the old value. Simple but loses history.
- Type 2: Add a new row with effective dates. Preserves full history.
- Type 3: Add a column for the previous value. Limited history.
-- SCD Type 2 implementation
-- When a customer's city changes, close the old record and open a new one

-- Step 1: Close existing records for changed customers
UPDATE dim_customer
SET
    effective_end_date = CURRENT_DATE - INTERVAL '1 day',
    is_current = FALSE
WHERE customer_id IN (
    SELECT s.customer_id
    FROM staging.customer_updates s
    JOIN dim_customer d ON s.customer_id = d.customer_id
    WHERE s.city != d.city AND d.is_current = TRUE
)
AND is_current = TRUE;

-- Step 2: Insert new records for changed customers
INSERT INTO dim_customer
    (customer_id, name, city, effective_start_date, effective_end_date, is_current)
SELECT
    s.customer_id,
    s.name,
    s.city,
    CURRENT_DATE,
    '9999-12-31',
    TRUE
FROM staging.customer_updates s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer d
    WHERE d.customer_id = s.customer_id
      AND d.city = s.city
      AND d.is_current = TRUE
);
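The two-step logic above (written in Postgres syntax) can be exercised end to end in SQLite; this sketch substitutes `date('now', '-1 day')` for `CURRENT_DATE - INTERVAL '1 day'`, and the sample customer data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id INTEGER, name TEXT, city TEXT,
        effective_start_date TEXT, effective_end_date TEXT, is_current INTEGER
    );
    CREATE TABLE customer_updates (customer_id INTEGER, name TEXT, city TEXT);

    INSERT INTO dim_customer VALUES (1, 'Ada', 'London', '2024-01-01', '9999-12-31', 1);
    INSERT INTO customer_updates VALUES (1, 'Ada', 'Paris');  -- Ada moved

    -- Step 1: close the current record for customers whose city changed
    UPDATE dim_customer
    SET effective_end_date = date('now', '-1 day'), is_current = 0
    WHERE is_current = 1
      AND customer_id IN (
          SELECT s.customer_id FROM customer_updates s
          JOIN dim_customer d ON s.customer_id = d.customer_id
          WHERE s.city != d.city AND d.is_current = 1
      );

    -- Step 2: open a new current record for the changed customers
    INSERT INTO dim_customer
    SELECT s.customer_id, s.name, s.city, date('now'), '9999-12-31', 1
    FROM customer_updates s
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_customer d
        WHERE d.customer_id = s.customer_id AND d.city = s.city AND d.is_current = 1
    );
""")

history = conn.execute(
    "SELECT city, is_current FROM dim_customer ORDER BY effective_start_date"
).fetchall()
print(history)  # → [('London', 0), ('Paris', 1)]
```

The dimension now holds both rows: the closed London record for historical queries and the open Paris record flagged as current.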
Putting It All Together
A production-grade pipeline combines multiple patterns:
- Parameterized execution dates — enables backfilling
- Idempotent operations — safe to retry and backfill
- Dead letter queues — bad records don't kill the pipeline
- Retry with backoff — handles transient failures
- Schema evolution support — adapts to changing sources
- Monitoring and alerting — know when things break before users do
def daily_pipeline_task(execution_date: str):
    """
    Idempotent, backfillable, error-tolerant pipeline task.
    """
    # Extract with retry
    raw_data = retry_with_backoff(
        lambda: extract_from_api(date=execution_date)
    )

    # Transform with error routing
    result = process_batch(raw_data)

    # Idempotent load (delete + insert)
    with database.transaction():
        delete_partition("analytics.metrics", date=execution_date)
        insert_records("analytics.metrics", result.successful)

    # Route failures to the DLQ
    if result.failed:
        insert_records("dlq.failed_records", result.failed)

    # Emit metrics for monitoring
    emit_metric("pipeline.records_processed", len(result.successful))
    emit_metric("pipeline.records_failed", len(result.failed))
Key Takeaways
- Idempotency is non-negotiable — every pipeline should be safe to re-run.
- Design for backfilling from day one — you will need it sooner than you think.
- Handle failures gracefully — dead letter queues and circuit breakers keep your system resilient.
- Prefer at-least-once with idempotent consumers over exactly-once — simpler and nearly as effective.
- Batch and streaming serve different needs — most platforms use both.
- Schema evolution is a feature, not an afterthought — plan for it in your architecture.
These patterns aren't just academic concepts. They're the difference between a pipeline that runs smoothly for years and one that wakes you up every other night. Invest the time to learn and apply them — your future self (and your on-call rotation) will thank you.
Frequently Asked Questions
What is an idempotent pipeline?
An idempotent pipeline produces the same result regardless of how many times it is executed with the same input. This is achieved through strategies like DELETE + INSERT (replacing a partition), MERGE/UPSERT operations, or immutable append with deduplication at read time. Idempotency is critical because pipelines fail and need to be retried safely without creating duplicate data.
What is backfilling in data engineering?
Backfilling is the process of reprocessing historical data through a pipeline. It is needed when you fix bugs, change business logic, or handle late-arriving data. A backfillable pipeline accepts a date parameter and processes exactly that date's data, rather than hardcoding the current date. Tools like Apache Airflow have built-in backfill support via the catchup mechanism.
What is exactly-once processing?
Exactly-once processing guarantees that each message or record is processed precisely one time, with no duplicates and no data loss. True exactly-once requires transactional coordination between the messaging system and the processing system, which adds performance overhead. In practice, most data engineers use at-least-once delivery combined with idempotent consumers, which is simpler and nearly as effective.
How do I handle schema changes in data pipelines?
The safest strategy is additive-only changes: only add new columns, never remove or rename existing ones. For streaming systems, use a schema registry (like Confluent Schema Registry) to enforce backward and forward compatibility. For major changes, use versioned tables and migrate consumers incrementally. Always plan for schema evolution in your pipeline architecture from day one.