TL;DR: Production-grade data pipelines rely on key design patterns: idempotency (safe to re-run), backfilling (reprocess any date range), dead letter queues (bad records don't kill the pipeline), and schema evolution (handle change gracefully). Combine these with retry logic and monitoring to build pipelines that run reliably for years.
Data Pipeline Design Patterns Every Engineer Should Know
Building a data pipeline that works once is easy. Building one that works reliably at scale, handles failures gracefully, and can be maintained by a team over years — that's where design patterns matter. This guide covers the patterns that separate production-grade pipelines from prototypes.
Why Patterns Matter in Data Engineering
Data pipelines are deceptively complex. They deal with distributed systems, unreliable networks, schema changes, late-arriving data, and partial failures. Without established patterns, you end up with fragile spaghetti code that breaks at 3 AM and nobody wants to debug.
These patterns aren't theoretical — they come from real-world production systems at companies processing billions of records daily. Learning them early will save you countless on-call incidents.
Pattern 1: Idempotency
Idempotency is arguably the most important pattern in data engineering. An operation is idempotent if running it multiple times produces the same result as running it once.
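A toy illustration of the definition, using two hypothetical record-processing functions: setting a field to a fixed value is idempotent, while incrementing a counter is not.

```python
def set_status(record: dict) -> dict:
    # Idempotent: applying it twice leaves the record in the same state
    record["status"] = "processed"
    return record

def bump_counter(record: dict) -> dict:
    # Not idempotent: every run changes the result
    record["count"] = record.get("count", 0) + 1
    return record

once = set_status({"id": 1})
twice = set_status(set_status({"id": 1}))
print(once == twice)  # → True

once_c = bump_counter({"id": 1})
twice_c = bump_counter(bump_counter({"id": 1}))
print(once_c == twice_c)  # → False
```

A retried pipeline behaves like the second function unless you design it to behave like the first.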
Why It Matters
Pipelines fail and need to be retried. If your pipeline inserts data without idempotency, retrying it creates duplicate records. If it updates data without idempotency, you might apply a transformation twice and get incorrect values.
Implementation Strategies
Strategy A: DELETE + INSERT (Partition Replacement)
The simplest idempotent pattern. Delete the existing data for the time period, then insert the fresh data.
-- Idempotent daily load: safe to re-run for any date
BEGIN TRANSACTION;

DELETE FROM analytics.daily_revenue
WHERE date = '{{ ds }}';

INSERT INTO analytics.daily_revenue (date, category, revenue, order_count)
SELECT
    order_date AS date,
    product_category AS category,
    SUM(amount) AS revenue,
    COUNT(*) AS order_count
FROM raw.orders
WHERE order_date = '{{ ds }}'
GROUP BY order_date, product_category;

COMMIT;
This pattern is safe because no matter how many times you run it for a given date, the result is the same.
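The same guarantee can be demonstrated end to end with an in-memory SQLite database (the table and column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (date TEXT, revenue REAL)")

def load_partition(ds: str, revenue: float) -> None:
    # Idempotent load: replace the whole partition inside one transaction
    with conn:
        conn.execute("DELETE FROM daily_revenue WHERE date = ?", (ds,))
        conn.execute("INSERT INTO daily_revenue VALUES (?, ?)", (ds, revenue))

load_partition("2025-06-01", 1200.0)
load_partition("2025-06-01", 1200.0)  # Re-run for the same date: no duplicates

print(conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0])  # → 1
```

Running the load a second time deletes and rewrites the same partition, so the row count never drifts.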
Strategy B: MERGE / UPSERT
When you need to update existing records or insert new ones based on a key:
-- Idempotent upsert using MERGE
-- Recompute full-history metrics, but only for customers active on the run
-- date, so re-running for the same date always converges to the same values.
MERGE INTO analytics.customer_metrics AS target
USING (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(amount) AS lifetime_value,
        MAX(order_date) AS last_order_date
    FROM raw.orders
    WHERE customer_id IN (
        SELECT customer_id FROM raw.orders WHERE order_date = '{{ ds }}'
    )
    GROUP BY customer_id
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
    UPDATE SET
        total_orders = source.total_orders,
        lifetime_value = source.lifetime_value,
        last_order_date = source.last_order_date
WHEN NOT MATCHED THEN
    INSERT (customer_id, total_orders, lifetime_value, last_order_date)
    VALUES (source.customer_id, source.total_orders, source.lifetime_value, source.last_order_date);
Strategy C: Immutable Append with Deduplication
Write all records with a processing timestamp, then deduplicate at read time:
-- Write layer: always append, never update
INSERT INTO raw.events_log (event_id, event_data, processed_at)
SELECT event_id, event_data, CURRENT_TIMESTAMP
FROM staging.new_events;

-- Read layer: deduplicate using a window function
CREATE VIEW analytics.events AS
SELECT * FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY processed_at DESC
        ) AS rn
    FROM raw.events_log
) AS deduped
WHERE rn = 1;
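Run against SQLite (3.25+ for window functions), the append-then-deduplicate pattern looks like this; the sample events are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_log (event_id TEXT, event_data TEXT, processed_at TEXT);

    -- The same event written twice by a retried run
    INSERT INTO events_log VALUES ('e1', 'first write',  '2025-06-01T00:00:00');
    INSERT INTO events_log VALUES ('e1', 'second write', '2025-06-01T01:00:00');
    INSERT INTO events_log VALUES ('e2', 'only write',   '2025-06-01T00:30:00');

    -- Read layer keeps only the latest write per event_id
    CREATE VIEW events AS
    SELECT event_id, event_data FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY event_id ORDER BY processed_at DESC
        ) AS rn
        FROM events_log
    ) AS deduped WHERE rn = 1;
""")

print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # → 2
```

The write path never has to worry about duplicates; the view absorbs them.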
Pattern 2: Backfilling
Backfilling is the ability to reprocess historical data. It's not an edge case — it's a regular operation. Schema changes, bug fixes, new business logic, and late-arriving data all require backfills.
Design for Backfilling from Day One
The key principle: every pipeline should accept a date parameter and process exactly that date's data. Never hardcode CURRENT_DATE or NOW().
from datetime import date, datetime

# Bad: not backfillable (the date is resolved at runtime)
def process_orders():
    today = datetime.now().date()
    orders = fetch_orders(date=today)
    transform_and_load(orders)

# Good: backfillable by design (the date is an explicit parameter)
def process_orders(execution_date: date):
    orders = fetch_orders(date=execution_date)
    transform_and_load(orders, partition_date=execution_date)
Backfilling with Apache Airflow
Airflow's catchup mechanism is purpose-built for backfilling. When combined with idempotent tasks, you can safely reprocess any date range:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

dag = DAG(
    "daily_revenue_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=True,  # Enable backfilling
)

def process_daily_revenue(**context):
    execution_date = context["ds"]  # YYYY-MM-DD string
    # Idempotent: safe to run multiple times for the same date
    delete_partition(table="daily_revenue", date=execution_date)
    data = extract_orders(date=execution_date)
    transformed = transform(data)
    load_partition(table="daily_revenue", data=transformed, date=execution_date)

task = PythonOperator(
    task_id="process_daily_revenue",
    python_callable=process_daily_revenue,
    dag=dag,
)
To backfill a specific date range:
airflow dags backfill daily_revenue_pipeline \
--start-date 2025-06-01 \
--end-date 2025-06-30
Practice building these orchestration patterns in our Orchestration with Airflow project.
Pattern 3: Schema Evolution
Schemas change. Columns get added, renamed, or removed. Data types shift. A robust pipeline handles schema evolution without breaking.
Strategies for Schema Evolution
Additive-Only Changes
The safest approach: only add new columns, never remove or rename existing ones. Consumers of the data continue to work because their expected columns still exist.
-- Safe: adding a column
ALTER TABLE analytics.customers ADD COLUMN loyalty_tier VARCHAR(20);
-- Dangerous: renaming breaks downstream queries
-- ALTER TABLE analytics.customers RENAME COLUMN name TO full_name;
Schema Registry (Streaming)
For event-driven architectures, a schema registry (like Confluent Schema Registry for Kafka) enforces compatibility rules:
- Backward compatible: New schema can read data written with the old schema.
- Forward compatible: Old schema can read data written with the new schema.
- Fully compatible: both directions work.
{
  "type": "record",
  "name": "OrderEvent",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "loyalty_points", "type": ["null", "int"], "default": null}
  ]
}
New fields with defaults are backward compatible — old consumers simply ignore them.
Versioned Tables
For significant schema changes, create a new version of the table and migrate consumers incrementally:
analytics.customers_v1 -- Old schema, deprecated
analytics.customers_v2 -- New schema, active
analytics.customers -- View pointing to v2
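A minimal SQLite sketch of the view-swap approach (schema and names are illustrative): consumers always query the stable `customers` name, and cutting over to a new version is a single view redefinition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers_v1 (id INTEGER, name TEXT);
    CREATE TABLE customers_v2 (id INTEGER, full_name TEXT, loyalty_tier TEXT);
    INSERT INTO customers_v2 VALUES (1, 'Ada Lovelace', 'gold');

    -- Consumers query the view; only the view definition moves between versions
    CREATE VIEW customers AS SELECT * FROM customers_v2;
""")

row = conn.execute("SELECT full_name, loyalty_tier FROM customers").fetchone()
print(row)  # → ('Ada Lovelace', 'gold')
```

Migrating consumers then means updating their queries for the new columns at their own pace, while `customers_v1` stays available as a fallback.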
Pattern 4: Error Handling and Dead Letter Queues
Not every record will process successfully. The question is: does one bad record kill your entire pipeline?
The Dead Letter Queue Pattern
Instead of failing the pipeline when a record can't be processed, route it to a dead letter queue (DLQ) for investigation, and continue processing the rest.
from dataclasses import dataclass
from typing import List
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

@dataclass
class ProcessingResult:
    successful: List[dict]
    failed: List[dict]

def process_batch(records: List[dict]) -> ProcessingResult:
    successful = []
    failed = []
    for record in records:
        try:
            validated = validate_and_transform(record)
            successful.append(validated)
        except Exception as e:
            logger.warning(f"Record failed processing: {e}")
            failed.append({
                "original_record": record,
                "error": str(e),
                "error_type": type(e).__name__,
                "timestamp": datetime.utcnow().isoformat(),
            })
    return ProcessingResult(successful=successful, failed=failed)

def run_pipeline(records: List[dict]):
    result = process_batch(records)

    # Load successful records to the target
    load_to_warehouse(result.successful)

    # Route failures to the dead letter queue
    if result.failed:
        load_to_dlq(result.failed)
        logger.warning(
            f"{len(result.failed)}/{len(records)} records "
            f"routed to dead letter queue"
        )

    # Alert if the failure rate exceeds a threshold (guard against empty batches)
    if records:
        failure_rate = len(result.failed) / len(records)
        if failure_rate > 0.05:  # 5% threshold
            send_alert(f"High failure rate: {failure_rate:.1%}")
Circuit Breaker Pattern
If a downstream system is failing consistently, don't keep hammering it. Implement a circuit breaker that stops calls after a threshold of failures:
import time

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"  # CLOSED = normal, OPEN = failing, HALF_OPEN = probing
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"  # Allow one trial call through
            else:
                raise CircuitBreakerOpenError("Circuit is open, call rejected")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
Pattern 5: Retry with Exponential Backoff
Transient failures — network timeouts, rate limits, temporary unavailability — are normal. Retrying with exponential backoff prevents overwhelming a struggling service.
import time
import random
import logging

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a function with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except (ConnectionError, TimeoutError) as e:
            if attempt == max_retries - 1:
                raise  # Final attempt, propagate the error

            # Exponential backoff with jitter
            delay = min(base_delay * (2 ** attempt), max_delay)
            jitter = random.uniform(0, delay * 0.1)
            wait_time = delay + jitter

            logging.warning(
                f"Attempt {attempt + 1} failed: {e}. "
                f"Retrying in {wait_time:.1f}s..."
            )
            time.sleep(wait_time)
The jitter is critical. Without it, if multiple pipeline instances fail simultaneously, they all retry at the exact same intervals, creating a "thundering herd" that overwhelms the downstream system again.
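To make the schedule concrete, here is the same delay formula in isolation (base delay 1 s, capped at 60 s); `backoff_delay` is a helper extracted purely for illustration:

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    # Exponential growth, capped, plus up to 10% random jitter
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay + random.uniform(0, delay * 0.1)

# Deterministic part of the schedule: doubles each attempt, then hits the cap
schedule = [min(1.0 * (2 ** a), 60.0) for a in range(8)]
print(schedule)  # → [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

The jitter term is what spreads simultaneous retries apart: two instances that fail at the same moment wait slightly different amounts of time instead of colliding again.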
Pattern 6: Exactly-Once vs At-Least-Once Delivery
This is a fundamental concept in distributed data systems, especially for streaming pipelines.
At-Least-Once
The message is guaranteed to be delivered, but may arrive more than once. This is the default for most systems because it's simpler and more performant.
Solution: Combine at-least-once delivery with idempotent processing. Your consumer must handle duplicates gracefully.
import json

# At-least-once consumer with idempotent processing
def consume_events(consumer):
    for message in consumer:
        event = json.loads(message.value)

        # Idempotent check: skip if already processed
        if is_already_processed(event["event_id"]):
            consumer.commit()
            continue

        process_event(event)
        mark_as_processed(event["event_id"])
        consumer.commit()
Exactly-Once
The message is delivered precisely once. This requires transactional coordination between the message system and the processing system. Kafka supports exactly-once semantics through idempotent producers and transactional consumers, but it comes with a performance cost.
In practice, at-least-once with idempotent consumers is the most common pattern because it's simpler and nearly as effective.
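The `is_already_processed` / `mark_as_processed` helpers in the consumer above can be backed by any durable key store; this in-memory sketch shows the idea (production systems typically use a database table or Redis, and all names here are illustrative):

```python
_processed_ids: set[str] = set()

def is_already_processed(event_id: str) -> bool:
    return event_id in _processed_ids

def mark_as_processed(event_id: str) -> None:
    _processed_ids.add(event_id)

def handle(event: dict, sink: list) -> None:
    # Duplicate deliveries become no-ops, so at-least-once
    # delivery behaves like exactly-once processing
    if is_already_processed(event["event_id"]):
        return
    sink.append(event)
    mark_as_processed(event["event_id"])

out: list = []
for e in [{"event_id": "a"}, {"event_id": "a"}, {"event_id": "b"}]:  # "a" delivered twice
    handle(e, out)
print(len(out))  # → 2
```

The key design point is that the dedup store must survive consumer restarts, otherwise a crash between delivery and commit reintroduces duplicates.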
Pattern 7: Batch vs Streaming — and When to Use Each
This isn't an either-or choice. Most mature data platforms use both.
Batch Processing
- Processes data in discrete chunks (hourly, daily)
- Higher throughput, lower cost per record
- Simpler to reason about, debug, and backfill
- Tools: Spark, Airflow, dbt
Best for: Analytics, reporting, ML training, data warehouse loads
Stream Processing
- Processes data continuously as it arrives
- Lower latency (seconds to minutes)
- More complex state management
- Tools: Kafka Streams, Apache Flink, Spark Structured Streaming
Best for: Real-time dashboards, fraud detection, operational alerts, event-driven architectures
The Lambda and Kappa Architectures
Lambda Architecture: Run both batch and streaming layers in parallel. Batch provides the "truth" layer; streaming provides the "speed" layer. Results are merged at query time.
[Source] --> [Batch Layer] --> [Batch Views] ------\
                                                    +--> [Serving Layer] --> [Query]
[Source] --> [Speed Layer] --> [Real-time Views] --/
Kappa Architecture: Use a single streaming layer for everything. Historical reprocessing is done by replaying the event log.
[Source] --> [Event Log] --> [Stream Processing] --> [Serving Layer] --> [Query]
Kappa is simpler to maintain but requires a replayable event log (like Kafka with long retention). In practice, most teams end up with a blend of both.
Build your streaming skills with our Streaming with Kafka project, and explore the full Modern Data Stack roadmap to see how these patterns fit together.
Pattern 8: Slowly Changing Dimensions (SCD) in Pipelines
When dimension data changes over time (a customer moves to a new city, a product changes categories), your pipeline must decide how to handle the history.
- Type 1: Overwrite the old value. Simple but loses history.
- Type 2: Add a new row with effective dates. Preserves full history.
- Type 3: Add a column for the previous value. Limited history.
-- SCD Type 2 implementation
-- When a customer's city changes, close the old record and open a new one

-- Step 1: Close existing records for changed customers
UPDATE dim_customer
SET
    effective_end_date = CURRENT_DATE - INTERVAL '1 day',
    is_current = FALSE
WHERE customer_id IN (
    SELECT s.customer_id
    FROM staging.customer_updates s
    JOIN dim_customer d ON s.customer_id = d.customer_id
    WHERE s.city != d.city AND d.is_current = TRUE
)
AND is_current = TRUE;

-- Step 2: Insert new records for changed customers
INSERT INTO dim_customer
    (customer_id, name, city, effective_start_date, effective_end_date, is_current)
SELECT
    s.customer_id,
    s.name,
    s.city,
    CURRENT_DATE,
    '9999-12-31',
    TRUE
FROM staging.customer_updates s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer d
    WHERE d.customer_id = s.customer_id
      AND d.city = s.city
      AND d.is_current = TRUE
);
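The two-step logic above (written in Postgres syntax) can be exercised end to end in SQLite; this sketch substitutes `date('now', '-1 day')` for `CURRENT_DATE - INTERVAL '1 day'`, and the sample customer data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_id INTEGER, name TEXT, city TEXT,
        effective_start_date TEXT, effective_end_date TEXT, is_current INTEGER
    );
    CREATE TABLE customer_updates (customer_id INTEGER, name TEXT, city TEXT);

    INSERT INTO dim_customer VALUES (1, 'Ada', 'London', '2024-01-01', '9999-12-31', 1);
    INSERT INTO customer_updates VALUES (1, 'Ada', 'Paris');  -- Ada moved

    -- Step 1: close the current record for customers whose city changed
    UPDATE dim_customer
    SET effective_end_date = date('now', '-1 day'), is_current = 0
    WHERE is_current = 1
      AND customer_id IN (
          SELECT s.customer_id FROM customer_updates s
          JOIN dim_customer d ON s.customer_id = d.customer_id
          WHERE s.city != d.city AND d.is_current = 1
      );

    -- Step 2: open a new current record for the changed customers
    INSERT INTO dim_customer
    SELECT s.customer_id, s.name, s.city, date('now'), '9999-12-31', 1
    FROM customer_updates s
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_customer d
        WHERE d.customer_id = s.customer_id AND d.city = s.city AND d.is_current = 1
    );
""")

history = conn.execute(
    "SELECT city, is_current FROM dim_customer ORDER BY effective_start_date"
).fetchall()
print(history)  # → [('London', 0), ('Paris', 1)]
```

The dimension now holds both rows: the closed London record for historical queries and the open Paris record flagged as current.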
Putting It All Together
A production-grade pipeline combines multiple patterns:
- Parameterized execution dates — enables backfilling
- Idempotent operations — safe to retry and backfill
- Dead letter queues — bad records don't kill the pipeline
- Retry with backoff — handles transient failures
- Schema evolution support — adapts to changing sources
- Monitoring and alerting — know when things break before users do
def daily_pipeline_task(execution_date: str):
    """
    Idempotent, backfillable, error-tolerant pipeline task.
    """
    # Extract with retry
    raw_data = retry_with_backoff(
        lambda: extract_from_api(date=execution_date)
    )

    # Transform with error routing
    result = process_batch(raw_data)

    # Idempotent load (delete + insert)
    with database.transaction():
        delete_partition("analytics.metrics", date=execution_date)
        insert_records("analytics.metrics", result.successful)

    # Route failures to the DLQ
    if result.failed:
        insert_records("dlq.failed_records", result.failed)

    # Emit metrics for monitoring
    emit_metric("pipeline.records_processed", len(result.successful))
    emit_metric("pipeline.records_failed", len(result.failed))
Key Takeaways
- Idempotency is non-negotiable — every pipeline should be safe to re-run.
- Design for backfilling from day one — you will need it sooner than you think.
- Handle failures gracefully — dead letter queues and circuit breakers keep your system resilient.
- Prefer at-least-once with idempotent consumers over exactly-once — simpler and nearly as effective.
- Batch and streaming serve different needs — most platforms use both.
- Schema evolution is a feature, not an afterthought — plan for it in your architecture.
These patterns aren't just academic concepts. They're the difference between a pipeline that runs smoothly for years and one that wakes you up every other night. Invest the time to learn and apply them — your future self (and your on-call rotation) will thank you.
Frequently Asked Questions
What is an idempotent pipeline?
An idempotent pipeline produces the same result regardless of how many times it is executed with the same input. This is achieved through strategies like DELETE + INSERT (replacing a partition), MERGE/UPSERT operations, or immutable append with deduplication at read time. Idempotency is critical because pipelines fail and need to be retried safely without creating duplicate data.
What is backfilling in data engineering?
Backfilling is the process of reprocessing historical data through a pipeline. It is needed when you fix bugs, change business logic, or handle late-arriving data. A backfillable pipeline accepts a date parameter and processes exactly that date's data, rather than hardcoding the current date. Tools like Apache Airflow have built-in backfill support via the catchup mechanism.
What is exactly-once processing?
Exactly-once processing guarantees that each message or record is processed precisely one time, with no duplicates and no data loss. True exactly-once requires transactional coordination between the messaging system and the processing system, which adds performance overhead. In practice, most data engineers use at-least-once delivery combined with idempotent consumers, which is simpler and nearly as effective.
How do I handle schema changes in data pipelines?
The safest strategy is additive-only changes: only add new columns, never remove or rename existing ones. For streaming systems, use a schema registry (like Confluent Schema Registry) to enforce backward and forward compatibility. For major changes, use versioned tables and migrate consumers incrementally. Always plan for schema evolution in your pipeline architecture from day one.