Data Engineering Blog

    In-depth tutorials, guides, and best practices for data engineers. From foundational concepts to advanced design patterns, learn what it takes to build robust and scalable data platforms.

    By Adriano Sanges · 11 min read

    Metabase + DuckDB: Local-First Analytics Setup Guide [2026]

    Connect Metabase to DuckDB to run a fast local-first BI stack on Parquet, CSV and SQLite files. Setup steps, Docker config, gotchas and when to scale beyond it.

    By Adriano Sanges · 15 min read

    SQL Joins and GROUP BY in Data Warehousing: 7 Pitfalls That Silently Break Your Analytics

    A diagnostic guide to the most common join and aggregation errors in warehouse SQL — fan-outs, grain mismatches, NULL key drops, and non-additive metric traps — with detector queries and fix patterns.

    Data Modeling
    SQL
    dbt
    By Adriano Sanges · 14 min read

    SQL vs Python for Data Transformations: A Practical Decision Framework

    A concrete, opinionated decision framework to choose between SQL and Python for your data pipeline transformation layer — with flowchart, scoring table, and side-by-side code comparisons.

    Data Pipelines
    SQL
    Python & PySpark
    dbt
    By Adriano Sanges · 14 min read

    Apache Kafka for Data Engineers: Architecture, Use Cases & Getting Started

    Learn Apache Kafka architecture, key concepts, and practical use cases. Includes Python examples, Docker setup, and comparisons with Pub/Sub and Kinesis.

    Python & PySpark
    Streaming
    By Adriano Sanges · 16 min read

    Data Engineering System Design Interview: Framework + 3 Examples

    Pass the data engineering system design interview: a 5-step framework, 3 worked pipeline examples (batch, streaming, CDC) and the patterns interviewers expect.

    Streaming
    Career
    By Adriano Sanges · 14 min read

    Data Pipeline Design Patterns: Idempotency, DLQ, CDC and 5 More (2026)

    8 production-grade pipeline patterns explained with Python and SQL: idempotency, backfilling, dead letter queues, CDC, schema evolution. The patterns that keep ETL running at 3 AM without paging you.

    Data Pipelines
    By Adriano Sanges · 13 min read

    Data Warehouse vs Data Lake vs Lakehouse [2026 Comparison]

    Side-by-side comparison of data warehouse, data lake and lakehouse architectures: OLTP vs OLAP, medallion layers, Snowflake vs Databricks, and how to choose.

    Data Storage
    Infrastructure
    By Adriano Sanges · 16 min read

    dbt for Analytics Engineering: Transform Your Data Warehouse

    Learn dbt from scratch — models, materializations, testing, documentation, macros, incremental models, and project structure best practices.

    Data Modeling
    Data Storage
    SQL
    dbt
    By Adriano Sanges · 14 min read

    Docker for Data Engineers: Containerize Your Data Pipelines

    Learn Docker essentials for data engineering — Dockerfiles, multi-stage builds, Docker Compose for local data stacks, and production best practices.

    Infrastructure
    By Adriano Sanges · 12 min read

    ETL vs ELT: Which Wins in 2026 and When You Actually Need Both

    ETL still wins for compliance and on-prem; ELT dominates cloud warehouses. The hybrid setup most teams actually run (Fivetran + dbt + Snowflake), the exceptions, and how to decide for your stack.

    Data Pipelines
    dbt
    By Adriano Sanges · 15 min read

    How to Become a Data Engineer in 2026: Complete Career Guide

    A practical roadmap to becoming a data engineer in 2026 covering skills, tools, projects, interview prep, certifications, and salary expectations.

    Career
    By Adriano Sanges · 15 min read

    SQL Window Functions: The Complete Guide for Data Engineers

    Master SQL window functions with practical examples. Learn ROW_NUMBER, RANK, DENSE_RANK, LEAD/LAG, running totals, and advanced frame clauses.

    SQL
    By Adriano Sanges · 13 min read

    Star Schema vs Snowflake Schema: Data Modeling for Analytics

    Master dimensional modeling with star and snowflake schemas. Learn fact tables, dimension tables, SCD types, and when to use each approach.

    Data Modeling
    By Nicola De Lillo · 14 min read

    Keeping Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles) Modular with Jinja2

    Learn how to use Jinja2 templating to keep Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles / DABs) DRY, composable, and environment-aware with reusable fragments and conditional logic.

    Python & PySpark
    Infrastructure
    By Nicola De Lillo · 12 min read

    Databricks PySpark Best Practices: Modular Pipeline Patterns

    Production-grade Databricks projects: modular PySpark transformations, thin notebook entrypoints, unit testing, and deployment with Databricks Asset Bundles.

    Data Storage
    Python & PySpark
    Infrastructure
    By Nicola De Lillo · 12 min read

    Data Contracts for Data Engineers: Stop Breaking Downstream Pipelines

    Learn how data contracts prevent breaking changes, reduce pipeline incidents, and improve trust across producers and consumers with practical implementation patterns.

    Data Pipelines
    Streaming
