TL;DR: Data warehouses store structured, curated data optimized for SQL analytics and BI. Data lakes store all data types cheaply on object storage but require governance to avoid becoming data swamps. Data lakehouses combine both using open table formats (Delta Lake, Apache Iceberg) to deliver warehouse-level reliability on lake-level storage. Most mature platforms use a combination of all three.
Data Warehouse vs Data Lake vs Data Lakehouse: Choosing the Right Architecture
One of the most consequential architectural decisions in data engineering is how you store and organize your data. Data warehouses, data lakes, and data lakehouses each represent a fundamentally different philosophy about how data should be managed. This guide breaks down all three, compares them rigorously, and helps you choose the right architecture for your use case.
Understanding the Foundations: OLTP vs OLAP
Before diving into warehouses and lakes, you need to understand the two fundamental database workload types, because this distinction drives everything else.
OLTP (Online Transaction Processing)
OLTP systems handle day-to-day operations — the application databases powering your web app, e-commerce platform, or CRM.
- Optimized for: Writes, updates, and point lookups
- Query pattern: `SELECT * FROM orders WHERE order_id = 12345`
- Row-oriented storage: Efficient for reading/writing complete records
- Normalized schemas: 3NF to minimize data redundancy
- Examples: PostgreSQL, MySQL, MongoDB, DynamoDB
OLAP (Online Analytical Processing)
OLAP systems handle analytical queries — aggregations, joins across large datasets, trend analysis, and business intelligence.
- Optimized for: Reads across many rows, aggregations, scans
- Query pattern: `SELECT region, SUM(revenue) FROM sales GROUP BY region`
- Columnar storage: Efficient for reading specific columns across millions of rows
- Denormalized schemas: Star/snowflake schemas for query performance
- Examples: Snowflake, BigQuery, Redshift, ClickHouse
| Characteristic | OLTP | OLAP |
|---|---|---|
| Purpose | Run the business | Analyze the business |
| Query type | Simple, point lookups | Complex, aggregations |
| Data freshness | Real-time | Minutes to hours |
| Users | Applications, APIs | Analysts, data scientists |
| Schema design | Normalized (3NF) | Denormalized (star schema) |
| Storage format | Row-oriented | Column-oriented |
| Concurrent users | Thousands | Dozens to hundreds |
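The storage-format row in the table above is the heart of the distinction, and it can be made concrete with a toy sketch. This is purely illustrative (no real engine works on Python lists); it shows why a point lookup favors row layout while an aggregation favors columnar layout:

```python
# Row-oriented: each record kept together -- good for point lookups (OLTP)
rows = [
    {"order_id": 1, "region": "EU", "revenue": 120.0},
    {"order_id": 2, "region": "US", "revenue": 80.0},
    {"order_id": 3, "region": "EU", "revenue": 50.0},
]

# Column-oriented: each column kept together -- good for scans (OLAP)
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 80.0, 50.0],
}

# OLTP-style point lookup: fetch one complete record by key
order = next(r for r in rows if r["order_id"] == 2)

# OLAP-style aggregation: SUM(revenue) GROUP BY region touches only
# two of the three columns, never the order_id data at all
totals = {}
for region, revenue in zip(columns["region"], columns["revenue"]):
    totals[region] = totals.get(region, 0.0) + revenue

print(order["revenue"])  # 80.0
print(totals)            # {'EU': 170.0, 'US': 80.0}
```

On disk the difference is the same: a columnar engine reads only the bytes for `region` and `revenue`, while a row store must scan every complete record.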
Data engineering is largely about moving and transforming data from OLTP to OLAP systems — from operational databases to analytical platforms.
What Is a Data Warehouse?
A data warehouse is a centralized repository of structured, curated data optimized for analytical queries and business intelligence.
Key Characteristics
- Schema-on-write: Data must conform to a predefined schema before loading
- Structured data only: Tables with well-defined columns, types, and relationships
- Columnar storage: Data stored by column for fast analytical queries
- SQL interface: Queried primarily with SQL
- Governed and curated: Data quality is enforced on ingestion
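Schema-on-write is worth seeing in miniature. The sketch below is a stand-in for what every warehouse does on load, not any vendor's API; the schema and table names are made up:

```python
# Schema-on-write: every row is validated against a declared schema
# BEFORE it is accepted. Bad data is rejected at load time, so
# downstream queries can trust the table.
SCHEMA = {"order_id": int, "amount": float, "currency": str}

def load_row(table: list, row: dict) -> None:
    """Reject the row unless its columns and types match the schema."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {sorted(row)} do not match schema")
    for col, expected in SCHEMA.items():
        if not isinstance(row[col], expected):
            raise TypeError(f"{col}: expected {expected.__name__}")
    table.append(row)

warehouse_table: list = []
load_row(warehouse_table, {"order_id": 1, "amount": 19.99, "currency": "EUR"})

try:
    # order_id is a string -- this row never reaches the table
    load_row(warehouse_table, {"order_id": "oops", "amount": 5.0, "currency": "EUR"})
except TypeError as e:
    print("rejected:", e)
```

This is the trade-off the rest of the article keeps returning to: quality is enforced up front, at the cost of refusing anything that doesn't fit the declared structure.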
How a Data Warehouse Works
[Source Systems] --> [ETL/ELT Pipeline] --> [Data Warehouse]
                                                   |
                                            [Staging Layer]
                                                   |
                                             [Data Marts]
                                                   |
                                           [BI / Dashboards]
Data flows from operational systems through transformation pipelines into the warehouse, where it's organized into staging areas and data marts that serve specific business domains.
Major Cloud Data Warehouses
Snowflake separates storage and compute completely. You can scale compute independently, run multiple workloads without contention, and pay only for what you use. Its multi-cluster architecture makes it excellent for concurrent BI workloads.
Google BigQuery is serverless — there's no infrastructure to manage. You write SQL, it executes. Pricing is based on bytes scanned, which rewards well-structured queries and partitioned tables. BigQuery ML lets you train models directly in SQL.
Amazon Redshift is tightly integrated with the AWS ecosystem. Redshift Serverless reduces operational overhead, while Redshift Spectrum extends queries to data in S3 without loading it first.
Azure Synapse Analytics combines data warehousing with big data analytics in Microsoft's cloud. It integrates with Power BI and the broader Azure ecosystem.
What Is a Data Lake?
A data lake is a large-scale storage system that holds raw data in its native format — structured, semi-structured, or unstructured — at a low cost.
Key Characteristics
- Schema-on-read: Data is stored as-is; structure is applied at query time
- All data types: CSV, JSON, Parquet, Avro, images, logs, video
- Object storage foundation: Built on S3, GCS, or ADLS
- Low cost: Object storage is 10-100x cheaper than warehouse storage
- Flexible processing: SQL, Python, Spark, ML frameworks
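Schema-on-read is the mirror image of the warehouse's schema-on-write: the lake stores payloads exactly as they arrive, and structure is applied only when a query runs. A hedged sketch (the records and paths are invented for illustration):

```python
import json

# What might sit as raw objects in the lake's landing zone: the lake
# accepts all of it, including the malformed record.
raw_objects = [
    '{"order_id": 1, "amount": "19.99"}',
    '{"order_id": 2, "amount": "5.00", "coupon": "WELCOME"}',  # extra field: fine
    'not even json',                                           # bad record: also stored
]

def query_amounts(objects):
    """Apply a schema at read time, skipping records that don't fit it."""
    for obj in objects:
        try:
            record = json.loads(obj)
            yield float(record["amount"])
        except (ValueError, KeyError):
            # Schema-on-read: bad data surfaces at QUERY time, not ingest time
            continue

print(sum(query_amounts(raw_objects)))  # 24.99
```

The flexibility is real, but so is the cost: every consumer must decide how to interpret the raw data, and data quality problems are discovered late.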
The Data Lake Architecture
[Source Systems] --> [Ingestion Layer] --> [Object Storage (S3/GCS/ADLS)]
                                                      |
                                            [Raw Zone] (landing)
                                                      |
                                         [Processed Zone] (cleaned)
                                                      |
                                       [Curated Zone] (business-ready)
                                                      |
                                      [Query Engines / ML / BI Tools]
The Data Swamp Problem
A data lake without governance becomes a data swamp — a disorganized dumping ground where nobody knows what data exists, what it means, or whether it's reliable. The key difference between a lake and a swamp is governance:
- Metadata catalogs (AWS Glue Catalog, Apache Hive Metastore) to track what's in the lake
- Data quality checks to validate incoming data
- Access controls to manage who can read and write
- Naming conventions and folder structures to keep things organized
- Data lineage to track where data comes from and how it's transformed
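A metadata catalog is conceptually simple: a registry mapping table names to where the data lives, what shape it has, and where it came from. The toy entry below shows the kind of information tools like the AWS Glue Catalog or Hive Metastore track; the fields and names are illustrative, not any tool's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    path: str      # where the data lives in object storage
    fmt: str       # file format (parquet, json, ...)
    owner: str     # who to ask about this dataset
    schema: dict   # column name -> type
    lineage: list = field(default_factory=list)  # upstream sources

catalog: dict = {}

catalog["silver.orders"] = CatalogEntry(
    path="s3://lake/silver/orders/",
    fmt="parquet",
    owner="data-eng@example.com",
    schema={"order_id": "BIGINT", "amount": "DECIMAL(10,2)"},
    lineage=["bronze.raw_events"],
)

# With a catalog, "what data do we have, and where did it come from?"
# becomes a lookup instead of tribal knowledge.
entry = catalog["silver.orders"]
print(entry.path, entry.lineage)
```

Without this layer, the answers to those questions live in people's heads, and the lake drifts toward swamp.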
What Is a Data Lakehouse?
The data lakehouse is a modern architecture that combines the low-cost, flexible storage of a data lake with the performance, reliability, and governance features of a data warehouse.
Key Characteristics
- Open file formats: Data stored in Parquet, ORC, or other open formats on object storage
- ACID transactions: Reliable reads and writes, even with concurrent users
- Schema enforcement: Enforced schemas on write, like a warehouse
- Time travel: Query historical versions of your data
- Unified platform: Supports both BI/SQL and ML/data science workloads
How Lakehouse Technology Works
The secret sauce of the lakehouse is the table format layer — an open-source metadata layer that sits on top of object storage and provides warehouse-like capabilities:
[Query Engines: Spark, Trino, Presto, Flink]
|
[Table Format: Delta Lake / Apache Iceberg / Apache Hudi]
|
[Object Storage: S3, GCS, ADLS]
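The core mechanism in that middle layer is a transaction log: each commit records which data files make up the table at that version, and readers pick a version to get a consistent snapshot. That is what makes ACID semantics and time travel possible on plain object storage. A deliberately simplified sketch of the idea (real formats like Delta Lake keep this log as JSON files alongside the Parquet data; this is the concept, not the implementation):

```python
log = []  # ordered list of commits; each commit is the full file list

def commit(add=(), remove=()):
    """Append a new table version: previous file set plus/minus changes."""
    files = set(log[-1]) if log else set()
    files |= set(add)
    files -= set(remove)
    log.append(sorted(files))
    return len(log) - 1  # version number

commit(add=["part-000.parquet"])                      # version 0
commit(add=["part-001.parquet"])                      # version 1
commit(add=["part-002.parquet"],
       remove=["part-000.parquet"])                   # version 2 (a compaction)

def snapshot(version):
    """Time travel: the exact file set of the table as of a past version."""
    return log[version]

print(snapshot(2))  # ['part-001.parquet', 'part-002.parquet']
print(snapshot(0))  # ['part-000.parquet']
```

Because commits only append to the log and data files are immutable, concurrent readers always see a complete, consistent version while writers create the next one.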
Delta Lake
Created by Databricks, Delta Lake adds ACID transactions, schema enforcement, and time travel to Parquet files on object storage. It's the foundation of the Databricks Lakehouse Platform.
-- Delta Lake: Time travel to query yesterday's version
SELECT * FROM my_table TIMESTAMP AS OF '2026-03-07';
-- Delta Lake: Restore a table to a previous version
RESTORE TABLE my_table TO VERSION AS OF 42;
Apache Iceberg
An open table format backed by Netflix, Apple, and others. Iceberg excels at handling large tables (petabyte-scale) and supports hidden partitioning, which separates the physical layout from the logical query interface.
-- Iceberg: Partition evolution without rewriting data
ALTER TABLE orders ADD PARTITION FIELD month(order_date);
-- Iceberg: Snapshot expiration for storage management
CALL system.expire_snapshots('db.orders', TIMESTAMP '2026-02-01');
Apache Hudi
Focused on incremental data processing, Hudi is ideal for CDC (Change Data Capture) workloads where you need efficient upserts on massive datasets.
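The upsert semantics Hudi optimizes are easy to state, even though doing this efficiently on object storage is the hard part. A toy sketch of the merge logic only (invented records; Hudi's value is making this fast at scale, not the semantics themselves):

```python
# Current state of the table, keyed by record key
table = {
    1: {"order_id": 1, "status": "pending"},
    2: {"order_id": 2, "status": "shipped"},
}

# Incoming CDC batch from the operational database
changes = [
    {"order_id": 1, "status": "shipped"},   # update to an existing key
    {"order_id": 3, "status": "pending"},   # brand-new key
]

for record in changes:
    # Upsert: overwrite if the key exists, insert otherwise
    table[record["order_id"]] = record

print(sorted((k, v["status"]) for k, v in table.items()))
# [(1, 'shipped'), (2, 'shipped'), (3, 'pending')]
```

Plain Parquet on object storage has no cheap way to do this in place; table formats with upsert support avoid rewriting entire partitions for every small change batch.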
The Medallion Architecture
The medallion architecture (also called multi-hop architecture) is a data organization pattern commonly used in lakehouses. It organizes data into three layers of increasing quality.
Bronze Layer (Raw)
- Exact copy of source data as ingested
- No transformations, deduplication, or cleaning
- Immutable — serves as the system of record
- Retained for reprocessing and auditing
-- Bronze: Raw events as ingested from Kafka
CREATE TABLE bronze.raw_events (
event_id STRING,
payload STRING, -- Raw JSON string
source_topic STRING,
ingested_at TIMESTAMP,
kafka_offset BIGINT
);
Silver Layer (Cleaned)
- Deduplicated, validated, and typed data
- Joins across sources to create unified entities
- Handles null values, data type corrections, and standardization
- Conforms to an enterprise data model
-- Silver: Cleaned and typed order data
CREATE TABLE silver.orders AS
SELECT
CAST(json_extract(payload, '$.order_id') AS BIGINT) AS order_id,
CAST(json_extract(payload, '$.customer_id') AS BIGINT) AS customer_id,
CAST(json_extract(payload, '$.amount') AS DECIMAL(10,2)) AS amount,
CAST(json_extract(payload, '$.currency') AS VARCHAR(3)) AS currency,
CAST(json_extract(payload, '$.created_at') AS TIMESTAMP) AS order_date,
ingested_at
FROM bronze.raw_events
WHERE source_topic = 'orders'
AND json_extract(payload, '$.order_id') IS NOT NULL;
Gold Layer (Business-Ready)
- Business-level aggregations, metrics, and KPIs
- Optimized for BI tool consumption and dashboards
- Follows dimensional modeling (fact and dimension tables)
- Documented with business definitions
-- Gold: Daily revenue metrics by product category
CREATE TABLE gold.daily_revenue AS
SELECT
o.order_date::DATE AS date,
p.category,
p.subcategory,
COUNT(DISTINCT o.order_id) AS total_orders,
COUNT(DISTINCT o.customer_id) AS unique_customers,
SUM(o.amount) AS gross_revenue,
AVG(o.amount) AS avg_order_value
FROM silver.orders o
JOIN silver.products p ON o.product_id = p.product_id
GROUP BY 1, 2, 3;
Learn more about organizing data models effectively in our Data Modeling fundamentals guide.
Comprehensive Comparison
| Feature | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Data types | Structured only | All types | All types |
| Schema | Schema-on-write | Schema-on-read | Both (enforced on write) |
| Storage cost | $$$ | $ | $ |
| Query performance | Excellent for SQL | Variable | Good to excellent |
| ACID transactions | Yes | No (without table formats) | Yes |
| Time travel | Limited | No | Yes |
| BI/SQL workloads | Excellent | Poor to moderate | Good |
| ML/data science | Limited | Excellent | Excellent |
| Governance | Built-in | Manual / external | Built-in |
| Open formats | Proprietary | Yes | Yes |
| Vendor lock-in | Higher | Lower | Lower |
| Operational complexity | Low (managed) | High | Medium |
| Best for | BI and reporting | Raw data, ML | Unified analytics |
Decision Framework: How to Choose
Choose a Data Warehouse When
- Your primary consumers are business analysts using BI tools
- Your data is structured and comes from well-defined sources
- You need predictable, fast query performance for dashboards
- Your organization values simplicity over flexibility
- Budget allows for premium storage costs
Example: A mid-size company with 10 data sources, a BI team using Looker, and no ML workloads. Snowflake or BigQuery is the right choice.
Choose a Data Lake When
- You work with diverse data types — logs, JSON, images, sensor data
- Data science and ML are primary workloads
- You need petabyte-scale storage at minimal cost
- Your team has strong engineering skills to build and maintain governance
- You want to avoid vendor lock-in with open formats
Example: A large enterprise with IoT data, clickstream logs, and an ML team that needs to train models on raw data. S3 + Spark + a data catalog is the right choice.
Choose a Data Lakehouse When
- You need both BI and ML on the same data platform
- You want warehouse-level governance without warehouse-level storage costs
- You're building a greenfield data platform and want a modern architecture
- You value open formats and don't want to be locked into a single vendor
- Your team can adopt tools like Delta Lake, Iceberg, or Hudi
Example: A growing company that needs dashboards for the business team AND feature engineering for the ML team. Databricks or an Iceberg-based platform avoids maintaining two separate systems.
Real-World Architecture: Combining Approaches
Many organizations don't pick just one. A common production architecture looks like:
[Source Systems]
       |
       v
[Data Lake (S3 + Iceberg)] -- Bronze & Silver layers
       |
       |---> [Data Warehouse (Snowflake)] -- Gold layer for BI
       |
       |---> [ML Platform (Databricks)] -- Feature store, training
       |
       |---> [Streaming (Kafka + Flink)] -- Real-time use cases
The lake stores everything cheaply. The warehouse serves BI with fast queries. The ML platform accesses the lake directly. This is the modern data platform pattern used by companies like Netflix, Airbnb, and Spotify.
You can explore how all these pieces fit together in our Analytics Dashboard project and the Data Modeling fundamentals guide.
Key Takeaways
- OLTP is for running the business; OLAP is for analyzing it — data engineering bridges the two.
- Data warehouses excel at structured analytics and BI but are expensive and inflexible with data types.
- Data lakes offer cheap, flexible storage but require significant governance to avoid becoming data swamps.
- Data lakehouses combine the best of both using open table formats like Delta Lake and Apache Iceberg.
- The medallion architecture (bronze/silver/gold) is the standard pattern for organizing lakehouse data.
- Most mature platforms use multiple approaches — a lake for raw storage, a warehouse for BI, and lakehouse technology to bridge them.
- Your choice depends on workloads, team skills, and budget — there's no universal "best" architecture.
The data storage landscape continues to evolve rapidly. Table formats like Apache Iceberg are converging warehouse and lake capabilities, and the distinction between these architectures is blurring. As a data engineer, understanding all three gives you the flexibility to design the right solution for any situation.
Frequently Asked Questions
What is a data warehouse?
A data warehouse is a centralized repository of structured, curated data optimized for analytical queries and business intelligence. It uses schema-on-write (data must conform to a predefined schema before loading), columnar storage for fast aggregations, and is queried primarily with SQL. Leading cloud data warehouses include Snowflake, Google BigQuery, Amazon Redshift, and Azure Synapse Analytics.
What is a data lake?
A data lake is a large-scale storage system that holds raw data in its native format — structured, semi-structured, or unstructured — at low cost on object storage like Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. It uses schema-on-read, meaning structure is applied at query time rather than on ingestion. Data lakes require strong governance to avoid becoming disorganized "data swamps."
What is a data lakehouse?
A data lakehouse is a modern architecture that combines the low-cost, flexible storage of a data lake with the performance, ACID transactions, and governance features of a data warehouse. It is enabled by open table format layers like Delta Lake, Apache Iceberg, and Apache Hudi, which sit on top of object storage and provide schema enforcement, time travel, and reliable concurrent access.
What is the medallion architecture?
The medallion architecture (also called multi-hop architecture) organizes lakehouse data into three layers of increasing quality: Bronze (raw, immutable copy of source data), Silver (cleaned, deduplicated, and validated data), and Gold (business-ready aggregations and metrics optimized for BI consumption). It is the standard organizational pattern for data lakehouses.
Should I use a data warehouse or data lake?
Choose a data warehouse when your primary consumers are business analysts using BI tools and your data is structured. Choose a data lake when you work with diverse data types (logs, JSON, images), need petabyte-scale storage at minimal cost, or have significant ML workloads. Choose a data lakehouse when you need both BI and ML on the same platform with warehouse-level governance. Most organizations end up using a combination based on their specific workloads.