Data Engineering Blog

In-depth tutorials, guides, and best practices for data engineers. From foundational concepts to advanced design patterns, learn what it takes to build robust and scalable data platforms.

By Adriano Sanges·June 9, 2026·11 min read

Apache Airflow for Data Engineers: DAGs, Operators, and Production Patterns (2026)

How Apache Airflow works for data engineers: DAGs, operators, the scheduler, and the production patterns (idempotency, backfills, sensors) that keep pipelines reliable.

Data Pipelines

Infrastructure

By Adriano Sanges·June 9, 2026·11 min read

Apache Spark for Data Engineers: How It Works and When to Use It (2026)

Apache Spark for data engineers: the execution model (driver, executors, shuffles), DataFrames vs RDDs, production best practices, and when Spark beats SQL or a warehouse.

Data Pipelines

Python & PySpark

By Adriano Sanges·June 9, 2026·10 min read

Change Data Capture (CDC): How It Works and When to Use It (2026)

Change Data Capture (CDC) explained for data engineers: log-based vs query-based, Debezium and friends, the pipeline patterns that matter, and when CDC beats batch reloads.

Data Pipelines

Streaming

By Adriano Sanges·June 9, 2026·10 min read

Delta Lake vs Apache Iceberg: Which Lakehouse Table Format in 2026?

Delta Lake vs Apache Iceberg compared: ACID, time travel, schema evolution, engine support and vendor lock-in. A practical way to pick an open table format for your lakehouse.

Data Storage

Infrastructure

By Adriano Sanges·June 9, 2026·12 min read

Dimensional Modeling: A Practical Guide for Data Engineers (2026)

Dimensional modeling explained: fact and dimension tables, grain, conformed dimensions, and slowly changing dimensions (SCD). The Kimball method, applied with dbt.

Data Modeling

By Adriano Sanges·June 9, 2026·9 min read

Reverse ETL: Syncing Warehouse Data Back to Your Tools (2026)

Reverse ETL explained: how to sync modeled warehouse data back into SaaS tools (CRM, ads, support), when it beats a direct integration, which tools to use, and the main pitfalls to avoid.

Data Pipelines

By Adriano Sanges·April 27, 2026·11 min read

Metabase + DuckDB: Local-First Analytics Setup Guide [2026]

Connect Metabase to DuckDB to run a fast local-first BI stack on Parquet, CSV and SQLite files. Setup steps, Docker config, gotchas and when to scale beyond it.

By Adriano Sanges·March 23, 2026·15 min read

SQL Joins and GROUP BY: 7 Pitfalls That Break Your Warehouse

Fan-out joins, grain drift, NULL keys silently dropped by INNER JOIN: the 7 SQL join and GROUP BY bugs that inflate warehouse metrics — each with a detector query and a fix.

Data Modeling

SQL

dbt

By Adriano Sanges·March 23, 2026·14 min read

SQL vs Python for Data Transformations: A Practical Decision Framework

A concrete, opinionated decision framework to choose between SQL and Python for your data pipeline transformation layer — with flowchart, scoring table, and side-by-side code comparisons.

By Adriano Sanges·March 8, 2026·14 min read

Apache Kafka for Data Engineers: Architecture, Use Cases & Getting Started

Learn Apache Kafka architecture, key concepts, and practical use cases. Includes Python examples, Docker setup, and comparisons with Pub/Sub and Kinesis.

Python & PySpark

Streaming

By Adriano Sanges·March 8, 2026·16 min read

Data Engineering System Design Interview: Framework + 3 Examples

Pass the data engineering system design interview: a 5-step framework, 3 worked pipeline examples (batch, streaming, CDC) and the patterns interviewers expect.

Streaming

Career

By Adriano Sanges·March 8, 2026·14 min read

Data Pipeline Design Patterns: Idempotency, DLQ, CDC and 5 More (2026)

8 production-grade pipeline patterns explained with Python and SQL: idempotency, backfilling, dead letter queues, CDC, schema evolution. The patterns that keep ETL running at 3 AM without paging you.

Data Pipelines

By Adriano Sanges·March 8, 2026·13 min read

Data Warehouse vs Data Lake vs Lakehouse [2026 Comparison]

Side-by-side comparison of data warehouse, data lake and lakehouse architectures: OLTP vs OLAP, medallion layers, Snowflake vs Databricks, and how to choose.

Data Storage

Infrastructure

By Adriano Sanges·March 8, 2026·16 min read

dbt for Analytics Engineering: Transform Your Data Warehouse

Learn dbt from scratch — models, materializations, testing, documentation, macros, incremental models, and project structure best practices.

By Adriano Sanges·March 8, 2026·14 min read

Docker for Data Engineers: Containerize Your Data Pipelines

Learn Docker essentials for data engineering — Dockerfiles, multi-stage builds, Docker Compose for local data stacks, and production best practices.

Infrastructure

By Adriano Sanges·March 8, 2026·12 min read

ETL vs ELT: Which Wins in 2026 and When You Actually Need Both

ETL still wins for compliance and on-prem; ELT dominates cloud warehouses. The hybrid setup most teams actually run (Fivetran + dbt + Snowflake), the exceptions, and how to decide for your stack.

Data Pipelines

dbt

By Adriano Sanges·March 8, 2026·15 min read

How to Become a Data Engineer in 2026: Complete Career Guide

A practical roadmap to becoming a data engineer in 2026 covering skills, tools, projects, interview prep, certifications, and salary expectations.

Career

By Adriano Sanges·March 8, 2026·15 min read

SQL Window Functions: ROW_NUMBER, RANK, LAG & LEAD Explained

Complete guide to SQL window functions: ROW_NUMBER vs RANK vs DENSE_RANK, LAG/LEAD, running totals and frame clauses — with dialect notes for PostgreSQL, BigQuery and SQLite.

SQL

By Adriano Sanges·March 8, 2026·13 min read

Star Schema vs Snowflake Schema: Data Modeling for Analytics

Master dimensional modeling with star and snowflake schemas. Learn fact tables, dimension tables, SCD types, and when to use each approach.

Data Modeling

By Nicola De Lillo·March 7, 2026·14 min read

Keeping Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles) Modular with Jinja2

Learn how to use Jinja2 templating to keep Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles / DABs) DRY, composable, and environment-aware with reusable fragments and conditional logic.

Python & PySpark

Infrastructure

By Nicola De Lillo·March 7, 2026·12 min read

Databricks PySpark Best Practices: Modular Pipeline Patterns

Production-grade Databricks projects: modular PySpark transformations, thin notebook entrypoints, unit testing, and deployment with Databricks Asset Bundles.

Data Storage

Python & PySpark

Infrastructure

By Nicola De Lillo·February 10, 2026·12 min read

Data Contracts for Data Engineers: Stop Breaking Downstream Pipelines

Learn how data contracts prevent breaking changes, reduce pipeline incidents, and improve trust across producers and consumers with practical implementation patterns.

Data Pipelines

Streaming
