SQL Joins and GROUP BY in Data Warehousing: 7 Pitfalls That Silently Break Your Analytics
Your revenue report doubled overnight, but nobody closed a single extra deal. Your dashboard shows 40% more orders this quarter, yet the operations team sees nothing unusual. These phantom numbers are almost never a data loading bug — they are join and aggregation errors baked into your warehouse SQL, silently inflating every downstream metric.
This article is not a textbook walkthrough of join syntax. If you need that, start with our SQL fundamentals resource. Instead, this is a diagnostic guide: seven pitfalls that break analytics in data warehouses, each with the symptoms you will see, the queries to detect the root cause, and the fix patterns you can ship into your pipeline on Monday.
Why Joins and GROUP BY Behave Differently in a Warehouse
If you have written SQL against a production application database, you already know joins and GROUP BY. But warehouse SQL is not just "bigger SQL." Three structural differences change the failure modes entirely.
First, the schemas are different. Warehouses use star and snowflake schemas that separate facts (events, transactions) from dimensions (descriptive context). In OLTP, you join normalized tables that were designed for write safety. In a warehouse, you join denormalized or semi-denormalized tables designed for read speed — and the cardinality relationships between them are far less predictable.
Second, scale amplifies mistakes. A fan-out bug on a 500-row OLTP table might add a few hundred duplicates. The same bug on a 200-million-row fact table inflates your revenue figure by millions of dollars — and nobody notices until a finance review weeks later.
Third, grain is the contract. Every fact table has a grain: one row equals one line_item, one click, one daily_balance. Every pitfall in this article traces back to the same root cause — something violated that grain contract. If you are new to data warehousing concepts, internalizing grain is the single highest-leverage thing you can do.
The mental model to carry forward: before you write a JOIN, you should be able to state the grain of both sides. If you cannot, stop and figure it out. The query can wait.
How Fact and Dimension Joins Actually Work (And Where They Break)
A clean star-schema join is straightforward. Your fact table has a foreign key, your dimension table has a primary key, and the cardinality is many-to-one:
```sql
SELECT
    f.order_id,
    f.revenue,
    d.product_name,
    d.category
FROM fact_order_lines f
INNER JOIN dim_product d
    ON f.product_key = d.product_key;
```

Each order line matches exactly one product. Row count in equals row count out. Life is good.
Where it breaks: fan-out joins
Now suppose you need promotion data, and a single order can have multiple promotions applied through a bridge table:
```sql
SELECT
    f.order_id,
    f.revenue,
    p.promotion_name
FROM fact_order_lines f
INNER JOIN bridge_order_promotions bp
    ON f.order_id = bp.order_id
INNER JOIN dim_promotion p
    ON bp.promotion_key = p.promotion_key;
```

If an order has three promotions, that order line now appears three times. Your `SUM(revenue)` just tripled for that order — with zero warnings.
Symptom: `SELECT COUNT(*) FROM fact_order_lines` returns 1,000,000. After the join, `COUNT(*)` returns 2,400,000. Revenue jumped 140%.
Detector query:
```sql
SELECT
    f.order_id,
    COUNT(*) AS join_multiplier
FROM fact_order_lines f
INNER JOIN bridge_order_promotions bp
    ON f.order_id = bp.order_id
GROUP BY f.order_id
HAVING COUNT(*) > 1
ORDER BY join_multiplier DESC
LIMIT 20;
```

If this returns rows, you have a fan-out. The fix depends on what you actually need — often you should aggregate the bridge table first, or restructure the query to avoid joining the measure-bearing fact table directly to a many-side table.
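To see the mechanics concretely, here is a minimal sketch using Python's built-in `sqlite3`. The tables mirror the article's names, but the data is made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_order_lines (order_id INT, revenue REAL);
    CREATE TABLE bridge_order_promotions (order_id INT, promotion_key INT);
    INSERT INTO fact_order_lines VALUES (1, 100.0), (2, 50.0);
    -- Order 1 has three promotions: the classic fan-out trigger
    INSERT INTO bridge_order_promotions VALUES (1, 10), (1, 11), (1, 12), (2, 20);
""")

# Revenue at the fact table's own grain
(before,) = con.execute("SELECT SUM(revenue) FROM fact_order_lines").fetchone()

# Revenue after joining through the bridge: order 1 is counted three times
(after,) = con.execute("""
    SELECT SUM(f.revenue)
    FROM fact_order_lines f
    JOIN bridge_order_promotions bp ON f.order_id = bp.order_id
""").fetchone()

print(before, after)  # 150.0 350.0 -- the join more than doubled "revenue"
```

The inflation is silent: the query runs without errors and the output looks like plausible revenue.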
Denormalization trade-offs
Sometimes teams pre-join dimensions into the fact table to avoid repeated joins at query time. This is fact table enrichment, and it works well when managed carefully. But it creates risks: stale dimension attributes if the dimension changes and the fact is not refreshed, update anomalies when you need to correct a dimension value in thousands of fact rows, and bloated table sizes. The decision of where to join — at query time or in the transform layer of your ELT pipeline — is a data modeling decision, not a performance optimization.
The Grain Mismatch Problem — The #1 Silent Killer
Symptom: your monthly revenue report shows exactly 2x what finance expects. Not 2.1x, not 1.8x — exactly double.
This almost always means you joined two tables at different grains without realizing it. Here is the classic scenario:
Your fact_line_items table has one row per order line. Your fact_shipments table has one row per shipment. An order with two line items ships in one box, but each line item is recorded against that shipment:
**fact_line_items**

| order_id | line_item_id | revenue |
|---|---|---|
| 1001 | A | 50 |
| 1001 | B | 30 |

**fact_shipments**

| order_id | shipment_id | shipped_date |
|---|---|---|
| 1001 | S1 | 2025-03-15 |
Joining on order_id:
```sql
SELECT
    li.order_id,
    li.line_item_id,
    li.revenue,
    s.shipment_id,
    s.shipped_date
FROM fact_line_items li
INNER JOIN fact_shipments s
    ON li.order_id = s.order_id;
```

This looks fine — two rows in, two rows out. But now add a second shipment (a split shipment):
| order_id | shipment_id | shipped_date |
|---|---|---|
| 1001 | S1 | 2025-03-15 |
| 1001 | S2 | 2025-03-18 |
Suddenly the join produces four rows. Revenue for order 1001 doubles from 80 to 160.
Diagnostic query — run this before any join:
```sql
-- Check: is order_id unique in the shipments table?
SELECT
    order_id,
    COUNT(*) AS row_count
FROM fact_shipments
GROUP BY order_id
HAVING COUNT(*) > 1;
```

If this returns results, `order_id` is not the grain of `fact_shipments`, and joining on it will fan out.
Fix patterns
The correct approach is to aggregate one side to the grain of the other before joining:
```sql
WITH shipment_summary AS (
    SELECT
        order_id,
        MIN(shipped_date) AS first_shipped_date,
        MAX(shipped_date) AS last_shipped_date,
        COUNT(*) AS shipment_count
    FROM fact_shipments
    GROUP BY order_id
)
SELECT
    li.order_id,
    li.line_item_id,
    li.revenue,
    ss.first_shipped_date,
    ss.shipment_count
FROM fact_line_items li
LEFT JOIN shipment_summary ss
    ON li.order_id = ss.order_id;
```

Now the join is many-to-one again, and revenue stays correct.
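A quick `sqlite3` sketch verifies the fix pattern end to end. The data is the split-shipment example from the tables above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_line_items (order_id INT, line_item_id TEXT, revenue REAL);
    CREATE TABLE fact_shipments (order_id INT, shipment_id TEXT, shipped_date TEXT);
    INSERT INTO fact_line_items VALUES (1001, 'A', 50.0), (1001, 'B', 30.0);
    -- Split shipment: two shipment rows for one order
    INSERT INTO fact_shipments VALUES (1001, 'S1', '2025-03-15'),
                                      (1001, 'S2', '2025-03-18');
""")

# Naive join fans out: 2 line items x 2 shipments = 4 rows, revenue doubles
(naive,) = con.execute("""
    SELECT SUM(li.revenue)
    FROM fact_line_items li
    JOIN fact_shipments s ON li.order_id = s.order_id
""").fetchone()

# Aggregate shipments to order grain first, then join: revenue stays correct
(fixed,) = con.execute("""
    WITH shipment_summary AS (
        SELECT order_id, MIN(shipped_date) AS first_shipped_date,
               COUNT(*) AS shipment_count
        FROM fact_shipments
        GROUP BY order_id
    )
    SELECT SUM(li.revenue)
    FROM fact_line_items li
    LEFT JOIN shipment_summary ss ON li.order_id = ss.order_id
""").fetchone()

print(naive, fixed)  # 160.0 80.0
```

The naive join reports 160 for an order worth 80; the pre-aggregated version restores the many-to-one relationship.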
You may see people reach for DISTINCT or ROW_NUMBER() as a dedup technique. These work as band-aids, but they obscure the real problem — the join grain is wrong — and they silently discard data that might matter.
NULL Join Columns — The Data You Silently Drop
Symptom: your customer count dropped 12% after you refactored a query, but no customers were actually deleted.
In SQL, NULL = NULL evaluates to NULL, not TRUE. This means any row with a NULL join key vanishes from an INNER JOIN — and even a LEFT JOIN will not match it to anything on the right side.
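A two-row `sqlite3` sketch (toy tables, illustrative names) makes the difference visible:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE fact_orders (order_id INT, customer_key INT);
    CREATE TABLE dim_customer (customer_key INT, customer_name TEXT);
    -- Order 2's customer record has not been loaded yet
    INSERT INTO fact_orders VALUES (1, 100), (2, NULL);
    INSERT INTO dim_customer VALUES (100, 'Acme');
""")

inner = con.execute("""
    SELECT COUNT(*) FROM fact_orders f
    JOIN dim_customer d ON f.customer_key = d.customer_key
""").fetchone()[0]

left = con.execute("""
    SELECT COUNT(*) FROM fact_orders f
    LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
""").fetchone()[0]

print(inner, left)  # 1 2 -- the NULL-key order survives only the LEFT JOIN
```

The inner join drops the NULL-key order entirely; the left join keeps it, with NULL dimension attributes you can then inspect.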
Real-world causes of NULL join keys are everywhere: late-arriving dimensions where a customer record has not been loaded yet, CDC pipelines that produce NULLs during slowly changing dimension transitions, or upstream systems that allow optional foreign keys.
Detector query:
```sql
-- Audit NULL rates on join keys
SELECT
    COUNT(*) AS total_rows,
    COUNT(customer_key) AS non_null_keys,
    COUNT(*) - COUNT(customer_key) AS null_keys,
    ROUND(100.0 * (COUNT(*) - COUNT(customer_key)) / COUNT(*), 2) AS null_pct
FROM fact_orders;
```

| total_rows | non_null_keys | null_keys | null_pct |
|---|---|---|---|
| 1,000,000 | 882,350 | 117,650 | 11.77 |
Nearly 12% of your orders have no customer key. An inner join to dim_customer silently drops them.
Fix patterns
Option 1: COALESCE with a sentinel value
```sql
SELECT
    f.order_id,
    f.revenue,
    COALESCE(d.customer_name, 'Unknown') AS customer_name
FROM fact_orders f
LEFT JOIN dim_customer d
    ON f.customer_key = d.customer_key;
```

Option 2: The Kimball "Unknown" dimension row
Insert a dedicated row into dim_customer with a surrogate key of -1 and descriptive attributes set to "Unknown" or "Not Available." Then default NULL foreign keys in the fact table to -1 during the load process. This is the cleanest approach because every fact row always joins successfully, and your data quality audits can monitor the percentage of facts pointing at the Unknown member instead of silently losing rows.
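A hedged sketch of that load-time pattern in `sqlite3` (the staging table and data are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stg_orders (order_id INT, customer_key INT, revenue REAL);
    CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name TEXT);
    CREATE TABLE fact_orders (order_id INT, customer_key INT, revenue REAL);
    INSERT INTO stg_orders VALUES (1, 100, 50.0), (2, NULL, 30.0);
    -- Dedicated Unknown member with surrogate key -1
    INSERT INTO dim_customer VALUES (-1, 'Unknown'), (100, 'Acme');
    -- Load step: default NULL foreign keys to the Unknown member
    INSERT INTO fact_orders
    SELECT order_id, COALESCE(customer_key, -1), revenue FROM stg_orders;
""")

# Every fact row now joins successfully, and the Unknown share is auditable
rows = con.execute("""
    SELECT d.customer_name, SUM(f.revenue)
    FROM fact_orders f
    JOIN dim_customer d ON f.customer_key = d.customer_key
    GROUP BY d.customer_name
    ORDER BY d.customer_name
""").fetchall()
print(rows)  # [('Acme', 50.0), ('Unknown', 30.0)]
```

Nothing is dropped by the inner join, and monitoring the 'Unknown' bucket becomes a simple GROUP BY.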
Health-check query to run in production:
```sql
SELECT
    'fact_orders.customer_key' AS audit_column,
    COUNT(*) FILTER (WHERE customer_key IS NULL) AS null_count,
    COUNT(*) AS total_count
FROM fact_orders;
```

GROUP BY as a Band-Aid — When Aggregation Masks Modeling Problems
Symptom: a teammate adds GROUP BY and SUM() to a query "to remove duplicates," and the numbers suddenly look correct. But they are not.
Here is the anti-pattern in action. After a fan-out join, you have accidental duplicates:
```sql
-- Broken: fan-out produces 2.4M rows from a 1M-row fact table
SELECT
    d.category,
    SUM(f.revenue) AS total_revenue  -- Inflated!
FROM fact_order_lines f
INNER JOIN bridge_order_promotions bp ON f.order_id = bp.order_id
INNER JOIN dim_product d ON f.product_key = d.product_key
GROUP BY d.category;
```

| category | total_revenue |
|---|---|
| Electronics | 14,200,000 |
| Apparel | 8,750,000 |
The "fix" someone applies:
```sql
-- "Fixed" with DISTINCT inside the aggregation — but is it?
SELECT
    d.category,
    SUM(DISTINCT f.revenue) AS total_revenue
FROM fact_order_lines f
INNER JOIN bridge_order_promotions bp ON f.order_id = bp.order_id
INNER JOIN dim_product d ON f.product_key = d.product_key
GROUP BY d.category;
```

`SUM(DISTINCT revenue)` deduplicates by value, not by row. If two different order lines both happen to have a revenue of 49.99, one of them vanishes. The numbers look plausible but are still wrong — just wrong in a less obvious direction.
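The value-versus-row distinction is easy to demonstrate with `sqlite3` (toy data; two distinct line items share the same revenue):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_order_lines (order_line_id INT, revenue REAL)")
# Lines 1 and 2 are different sales that happen to have the same amount
con.executemany("INSERT INTO fact_order_lines VALUES (?, ?)",
                [(1, 50.0), (2, 50.0), (3, 30.0)])

(plain,) = con.execute(
    "SELECT SUM(revenue) FROM fact_order_lines").fetchone()
(distinct,) = con.execute(
    "SELECT SUM(DISTINCT revenue) FROM fact_order_lines").fetchone()

print(plain, distinct)  # 130.0 80.0 -- a real line item silently vanished
```

`SUM(DISTINCT ...)` collapsed two legitimate 50.0 line items into one, undercounting by a full sale.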
The principle: if you need GROUP BY to make your numbers right, your join is wrong. GROUP BY should aggregate intended rows — line items into category totals, events into daily counts — not compress accidental duplicates created by a bad join. Fix the join first.
The correct approach: isolate the measures before the fan-out join, or restructure the query so the measure-bearing table is never multiplied.
```sql
-- Correct: aggregate revenue at the right grain first
WITH order_revenue AS (
    SELECT
        product_key,
        SUM(revenue) AS total_revenue
    FROM fact_order_lines
    GROUP BY product_key
)
SELECT
    d.category,
    SUM(r.total_revenue) AS total_revenue
FROM order_revenue r
INNER JOIN dim_product d ON r.product_key = d.product_key
GROUP BY d.category;
```

Non-Additive Metrics and Aggregation Traps
Not every number can be summed. Understanding which metrics can be aggregated — and how — prevents an entire class of silent errors.
Three metric types:
- Additive: safe to sum across all dimensions. Revenue, quantity, cost.
- Semi-additive: safe to sum across some dimensions but not others. Account balances can be summed across accounts but not across time (you would use the latest snapshot, not a sum of daily balances).
- Non-additive: cannot be summed at all. Ratios, percentages, distinct counts, averages.
The classic trap: averaging an average
```sql
-- Wrong: averaging pre-computed averages
SELECT
    region,
    AVG(avg_order_value) AS regional_aov  -- Nonsense!
FROM monthly_store_metrics
GROUP BY region;
```

If Store A has 10,000 orders at 50 AOV and Store B has 100 orders at 200 AOV, the simple average gives you 125. The correct weighted answer is (10,000 * 50 + 100 * 200) / 10,100 = 51.49. The error is 142%.
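The arithmetic above can be checked with a few lines of plain Python:

```python
# The two stores from the example: orders and pre-computed AOV
stores = [
    {"orders": 10_000, "aov": 50.0},   # Store A
    {"orders": 100,    "aov": 200.0},  # Store B
]

# Wrong: simple average of pre-computed averages
naive = sum(s["aov"] for s in stores) / len(stores)

# Right: rebuild the base measures, then divide once at the end
total_revenue = sum(s["orders"] * s["aov"] for s in stores)
total_orders = sum(s["orders"] for s in stores)
weighted = total_revenue / total_orders

print(naive, round(weighted, 2))  # 125.0 51.49
```

Store B's 100 orders get the same weight as Store A's 10,000 in the naive version, which is why the result lands nowhere near the true AOV.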
Fix pattern: always aggregate base measures first, then compute the ratio.
```sql
SELECT
    region,
    SUM(total_revenue) / SUM(total_orders) AS regional_aov
FROM monthly_store_metrics
GROUP BY region;
```

Distinct counts do not add up
COUNT(DISTINCT customer_id) across stores does not equal the company-wide distinct count if customers shop at multiple stores. There is no algebraic shortcut here — you must re-run the distinct count at the desired grain, which can be expensive at scale. Approximate alternatives like APPROX_COUNT_DISTINCT (available in BigQuery, Snowflake, and others) trade a small error margin for significant performance gains.
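A small `sqlite3` sketch shows the overlap problem directly (invented visit data; customer 100 shops at both stores):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_visits (store_id INT, customer_id INT)")
con.executemany("INSERT INTO fact_visits VALUES (?, ?)",
                [(1, 100), (1, 101), (2, 100), (2, 102)])

# Per-store distinct counts: store 1 has 2 customers, store 2 has 2
per_store = con.execute("""
    SELECT store_id, COUNT(DISTINCT customer_id)
    FROM fact_visits GROUP BY store_id
""").fetchall()
summed = sum(n for _, n in per_store)

# Company-wide distinct count must be recomputed at the new grain
(company_wide,) = con.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM fact_visits").fetchone()

print(summed, company_wide)  # 4 3 -- summing double-counts customer 100
```

The only correct "rollup" is re-running the distinct count at the wider grain, which is exactly why it is expensive.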
For semi-additive metrics like account balances, the standard approach is a snapshot fact table — one row per account per day — and you filter to the desired date rather than summing across dates.
A Defensive Query Checklist for Warehouse SQL
Before you push any join-heavy query to production, run through this checklist:
- **State the grain.** Write it down: "One row = one ___." Do it for both sides of every join.
- **Check join key cardinality.** Run the `GROUP BY key HAVING COUNT(*) > 1` diagnostic on each join key column.
- **Audit NULLs in join columns.** Use the NULL rate query from the section above on every join key.
- **Compare row counts before and after each join.** If `COUNT(*)` increases, you have a fan-out. If it decreases more than expected, you are dropping rows.
- **Separate base-measure aggregation from ratio computation.** Never `SUM()` a ratio or `AVG()` an average. Compute from raw components.
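The cardinality and NULL audits lend themselves to a small reusable helper. A minimal sketch against any DB-API connection (shown with `sqlite3`; the helper name and the table/column arguments are illustrative, and identifiers are interpolated directly, so only pass trusted names):

```python
import sqlite3

def check_join_key(con, table: str, key: str) -> dict:
    """Return duplicate-key and NULL-rate stats for a prospective join key."""
    # COUNT(col) skips NULLs, so the difference is the NULL-key count
    total, non_null = con.execute(
        f"SELECT COUNT(*), COUNT({key}) FROM {table}").fetchone()
    # Number of key values that appear on more than one row
    dupes = con.execute(
        f"SELECT COUNT(*) FROM (SELECT {key} FROM {table} "
        f"WHERE {key} IS NOT NULL GROUP BY {key} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return {"rows": total, "null_keys": total - non_null,
            "duplicate_keys": dupes}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_shipments (order_id INT)")
con.executemany("INSERT INTO fact_shipments VALUES (?)",
                [(1,), (1,), (2,), (None,)])

stats = check_join_key(con, "fact_shipments", "order_id")
print(stats)  # {'rows': 4, 'null_keys': 1, 'duplicate_keys': 1}
```

A nonzero `duplicate_keys` means joining on that key will fan out; a nonzero `null_keys` means an inner join on it will drop rows.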
Encoding checks as automated dbt tests
If you use dbt for analytics engineering, these checks become automated pipeline gates:
```yaml
# models/marts/fct_order_lines.yml
version: 2

models:
  - name: fct_order_lines
    columns:
      - name: order_line_id
        tests:
          - unique
          - not_null
      - name: product_key
        tests:
          - not_null
          - relationships:
              to: ref('dim_product')
              field: product_key
      - name: customer_key
        tests:
          - not_null  # Fails if NULLs sneak in
```

For row-count checks after joins, you can add a custom dbt test or use the dbt_expectations package:
```yaml
      - name: revenue
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 100000
              # Catches inflated values from fan-outs
```

The goal is a "trust but verify" pipeline: every model builds, then every model proves its grain and key integrity before downstream models consume it.
Putting It All Together — A Realistic Worked Example
You need to build a "revenue by product category by month" report. Your source tables:
- `fact_order_lines` — grain: one row per order line
- `dim_product` — grain: one row per product
- `dim_date` — grain: one row per calendar date
- `fact_shipments` — grain: one row per shipment
Attempt 1: the naive join (broken)
```sql
SELECT
    dd.month_name,
    dp.category,
    SUM(fol.revenue) AS total_revenue,
    COUNT(DISTINCT fol.customer_key) AS unique_customers
FROM fact_order_lines fol
INNER JOIN dim_date dd ON fol.order_date_key = dd.date_key
INNER JOIN dim_product dp ON fol.product_key = dp.product_key
INNER JOIN fact_shipments fs ON fol.order_id = fs.order_id
GROUP BY dd.month_name, dp.category;
```

Problem 1: Grain mismatch. `fact_shipments` has multiple rows per `order_id` (split shipments). Revenue fans out.
Problem 2: NULL keys. 8% of `customer_key` values are NULL — `COUNT(DISTINCT customer_key)` silently undercounts.
Run the diagnostic:
```sql
-- Detect fan-out
SELECT COUNT(*) AS before_join FROM fact_order_lines;
-- Returns: 1,000,000

SELECT COUNT(*) AS after_join
FROM fact_order_lines fol
INNER JOIN fact_shipments fs ON fol.order_id = fs.order_id;
-- Returns: 1,340,000 ← 34% inflation
```

Attempt 2: fix the grain, handle NULLs (correct)
```sql
WITH shipment_summary AS (
    SELECT
        order_id,
        MIN(shipped_date) AS first_ship_date,
        COUNT(*) AS shipment_count
    FROM fact_shipments
    GROUP BY order_id
),

order_lines_enriched AS (
    SELECT
        fol.order_line_id,
        fol.revenue,
        COALESCE(fol.customer_key, -1) AS customer_key,
        fol.order_date_key,
        fol.product_key,
        ss.first_ship_date,
        ss.shipment_count
    FROM fact_order_lines fol
    LEFT JOIN shipment_summary ss
        ON fol.order_id = ss.order_id
)

SELECT
    dd.month_name,
    dp.category,
    SUM(ole.revenue) AS total_revenue,
    COUNT(DISTINCT ole.customer_key) AS unique_customers,
    SUM(ole.shipment_count) AS total_shipments
FROM order_lines_enriched ole
INNER JOIN dim_date dd ON ole.order_date_key = dd.date_key
INNER JOIN dim_product dp ON ole.product_key = dp.product_key
GROUP BY dd.month_name, dp.category
ORDER BY dd.month_name, dp.category;
```

What changed:
- `fact_shipments` is pre-aggregated to one row per `order_id` in the CTE, eliminating the fan-out.
- NULL `customer_key` values are replaced with `-1` (pointing to an "Unknown" dimension row), so no customers are silently dropped from the distinct count.
- The `LEFT JOIN` to the shipment summary ensures orders without shipments still appear in the report.

One caveat: `SUM(ole.shipment_count)` repeats an order's shipment count once per line item, so `total_shipments` is a line-item-weighted figure. For a true shipment count, aggregate `fact_shipments` at order grain in a separate query.
Validation:
```sql
SELECT COUNT(*) FROM order_lines_enriched;
-- Returns: 1,000,000 ← matches source fact table. No fan-out.
```

As a dbt model chain
In a dbt project, this query naturally decomposes into the staging-intermediate-mart pattern:
- `stg_order_lines` — clean and cast the raw order lines.
- `stg_shipments` — clean and cast raw shipments.
- `int_shipment_summary` — aggregate shipments to the `order_id` grain.
- `int_order_lines_enriched` — join order lines to shipment summary, handle NULL keys.
- `mart_revenue_by_category_month` — final aggregation with dimension joins.
Each intermediate model carries its own `unique` and `not_null` tests on the expected grain key. If a source data change introduces a fan-out, the pipeline fails loudly at the `int_shipment_summary` stage instead of silently inflating the mart.
Frequently Asked Questions
How do I know if my join is causing a fan-out?
Compare `COUNT(*)` before and after the join. If the count increases, rows are being duplicated. Then run `SELECT join_key, COUNT(*) FROM right_table GROUP BY join_key HAVING COUNT(*) > 1` to confirm the right side of the join has duplicate keys at the join grain.
Should I always use LEFT JOIN instead of INNER JOIN in a warehouse?
Not always, but default to LEFT JOIN when joining facts to dimensions. An INNER JOIN silently drops fact rows that lack a matching dimension record (due to NULLs, late-arriving data, or load failures). A LEFT JOIN preserves those rows and makes missing dimensions visible rather than invisible.
What is the best way to handle NULL join keys?
The Kimball approach is the most robust: create a dedicated "Unknown" or "Not Applicable" row in each dimension table with a known surrogate key (commonly -1), and default NULL foreign keys in the fact table to that key during the ETL/ELT load. This ensures every fact row always joins successfully.
Can GROUP BY ever legitimately fix duplicate rows?
GROUP BY is the correct tool when you intentionally want to aggregate detail rows to a higher grain — for example, summing line-item revenue to order-level totals. It is the wrong tool when duplicates exist because of a bad join. The test: if your source fact table already has the duplicates (before any joins), GROUP BY is appropriate. If the duplicates only appear after a join, fix the join.
How do I aggregate non-additive metrics like conversion rate across dimensions?
Never sum or average a pre-computed ratio. Instead, carry the base measures (e.g., total_conversions and total_sessions) through your pipeline and compute the ratio at the final aggregation step: `SUM(total_conversions) / SUM(total_sessions)`. This produces the correctly weighted result at any grain.
Where do these checks fit in a data pipeline?
Ideally, grain and cardinality checks run as automated tests after each transformation step. In dbt, use built-in `unique` and `not_null` schema tests on grain keys, `relationships` tests on foreign keys, and the dbt_expectations package for row-count and value-range assertions. These catch fan-outs and NULL key issues before bad data reaches your dashboards.
If this article clarified how joins and aggregation actually work in a warehouse context, you might find it useful to explore the Beginner Data Engineer Roadmap for a structured learning path that covers these skills — and many others — from the ground up.