TL;DR: SQL window functions perform calculations across related rows without collapsing them, enabling rankings (ROW_NUMBER, RANK, DENSE_RANK), row comparisons (LEAD, LAG), running totals, and moving averages with clean, efficient syntax. They replace complex self-joins and subqueries, and are essential for data engineering tasks like deduplication, session analysis, and time-series calculations.
SQL Window Functions: The Complete Guide for Data Engineers
Window functions are one of the most powerful features in SQL, yet many data engineers underuse them. Unlike regular aggregate functions that collapse rows into a single result, window functions let you perform calculations across a set of rows related to the current row without losing the underlying detail. If you can master window functions, you will write cleaner, faster, and more expressive SQL.
This guide covers every major window function with real-world examples you will encounter in production data engineering work. Whether you are building dashboards, analyzing user behavior, or preparing data for machine learning, window functions are indispensable.
Why Window Functions Matter
Before window functions existed, tasks like ranking, running totals, and comparing a row to its neighbors required self-joins, correlated subqueries, or application-level logic. These approaches were verbose, error-prone, and slow. Window functions solve all of that with a clean, declarative syntax that the query optimizer can execute efficiently.
Consider a simple scenario: you have an orders table and want to show each order alongside the customer's total spend. Without window functions, you would need a subquery or join. With a window function, it is a single expression:
-- Without window functions: requires a subquery
SELECT o.order_id, o.customer_id, o.amount,
(SELECT SUM(amount) FROM orders o2 WHERE o2.customer_id = o.customer_id) AS total_spend
FROM orders o;
-- With window functions: clean and efficient
SELECT order_id, customer_id, amount,
SUM(amount) OVER (PARTITION BY customer_id) AS total_spend
FROM orders;
Both queries return the same result, but the window function version is easier to read, easier to extend, and typically runs faster.
Anatomy of a Window Function
Every window function follows this structure:
function_name(arguments) OVER (
[PARTITION BY column1, column2, ...]
[ORDER BY column3, column4, ...]
[frame_clause]
)
- PARTITION BY divides rows into groups (like GROUP BY but without collapsing them).
- ORDER BY defines the order of rows within each partition.
- Frame clause narrows the window to a subset of the partition (e.g., preceding 3 rows).
Not every clause is required for every function. Ranking functions need ORDER BY. Aggregate window functions work with or without ORDER BY, and the frame clause only applies to certain functions.
Ranking Functions
Ranking functions assign a position to each row within a partition. They are the most commonly used window functions in data engineering.
ROW_NUMBER()
Assigns a unique sequential integer to each row within a partition. No ties — if two rows have the same ordering value, they get different numbers (arbitrarily).
-- Assign a unique row number to each order per customer, newest first
SELECT
customer_id,
order_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS row_num
FROM orders;
Common use case: deduplication. When you have duplicate records and need to keep only one per group, ROW_NUMBER is your go-to:
-- Keep only the most recent order per customer
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS rn
FROM orders
)
SELECT * FROM ranked WHERE rn = 1;
This pattern appears constantly in production pipelines, especially when dealing with CDC (change data capture) data or slowly changing dimensions.
RANK() and DENSE_RANK()
Both handle ties differently from ROW_NUMBER:
- RANK() — Tied rows get the same rank, but the next rank skips. (1, 2, 2, 4)
- DENSE_RANK() — Tied rows get the same rank, and the next rank does not skip. (1, 2, 2, 3)
-- Compare all three ranking functions on exam scores
SELECT
student_name,
score,
ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
RANK() OVER (ORDER BY score DESC) AS rank,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
FROM exam_results;
| student_name | score | row_num | rank | dense_rank |
|---|---|---|---|---|
| Alice | 95 | 1 | 1 | 1 |
| Bob | 92 | 2 | 2 | 2 |
| Carol | 92 | 3 | 2 | 2 |
| Dave | 88 | 4 | 4 | 3 |
Use RANK when gaps after ties are acceptable (e.g., competition results). Use DENSE_RANK when you want consecutive ranks regardless of ties (e.g., "top 3 categories" where ties should not push others out). These distinctions come up frequently in data engineering interviews.
NTILE(n)
Divides the partition into n roughly equal buckets and assigns a bucket number to each row. Useful for percentile analysis and balanced distribution.
-- Split customers into 4 quartiles by total spend
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_summary;
Quartile 1 contains the highest spenders, quartile 4 the lowest. This is invaluable for segmentation and reporting.
Offset Functions: LEAD and LAG
LEAD and LAG let you access values from other rows relative to the current row without a self-join.
- LAG(column, offset, default) — Looks backward (previous rows).
- LEAD(column, offset, default) — Looks forward (next rows).
-- Calculate day-over-day revenue change
SELECT
report_date,
revenue,
LAG(revenue, 1) OVER (ORDER BY report_date) AS prev_day_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY report_date) AS daily_change,
ROUND(
(revenue - LAG(revenue, 1) OVER (ORDER BY report_date))
/ LAG(revenue, 1) OVER (ORDER BY report_date) * 100, 2
) AS pct_change
FROM daily_revenue;
Year-over-Year Comparison
A classic analytics query: compare each month's revenue to the same month in the prior year.
-- Year-over-year revenue comparison by month
SELECT
year,
month,
revenue,
LAG(revenue, 12) OVER (ORDER BY year, month) AS revenue_prev_year,
ROUND(
(revenue - LAG(revenue, 12) OVER (ORDER BY year, month))
/ NULLIF(LAG(revenue, 12) OVER (ORDER BY year, month), 0) * 100, 2
) AS yoy_growth_pct
FROM monthly_revenue
ORDER BY year, month;
The NULLIF prevents division by zero when there is no prior year data.
FIRST_VALUE and LAST_VALUE
These functions return the first or last value in the window frame. They are useful for comparisons against a baseline.
-- Compare each employee's salary to the highest and lowest in their department
SELECT
department,
employee_name,
salary,
FIRST_VALUE(salary) OVER (
PARTITION BY department ORDER BY salary DESC
) AS highest_salary,
salary - FIRST_VALUE(salary) OVER (
PARTITION BY department ORDER BY salary DESC
) AS diff_from_top
FROM employees;
Warning about LAST_VALUE: By default, the window frame extends from the start of the partition to the current row, which means LAST_VALUE returns the current row's value — probably not what you want. Always specify the frame explicitly:
-- Correct usage of LAST_VALUE with explicit frame
SELECT
department,
employee_name,
salary,
LAST_VALUE(salary) OVER (
PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS lowest_salary
FROM employees;
Aggregate Window Functions
Any standard aggregate function (SUM, AVG, COUNT, MIN, MAX) can be used as a window function by adding OVER().
Running Totals
Running totals are one of the most common analytical patterns. They are trivial with window functions.
-- Running total of daily sales
SELECT
sale_date,
daily_amount,
SUM(daily_amount) OVER (
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM daily_sales;
The frame clause ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is actually the default when ORDER BY is specified, so you can omit it. But being explicit improves readability.
Moving Averages
Moving averages smooth out noise in time series data. Here is a 7-day moving average:
-- 7-day moving average of daily active users
SELECT
report_date,
daily_active_users,
ROUND(
AVG(daily_active_users) OVER (
ORDER BY report_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 0
) AS moving_avg_7d
FROM user_activity;
Note: 6 PRECEDING AND CURRENT ROW gives you 7 rows total (6 before + current). For the first 6 rows, the average will be computed over fewer than 7 data points — keep this in mind when interpreting early values.
Percentage of Total
Calculate what percentage each row contributes to the group total:
-- Each product's share of category revenue
SELECT
category,
product_name,
revenue,
ROUND(
revenue * 100.0 / SUM(revenue) OVER (PARTITION BY category), 2
) AS pct_of_category
FROM product_revenue
ORDER BY category, revenue DESC;
Understanding Frame Clauses
The frame clause controls exactly which rows are included in the window calculation. There are two types:
- ROWS — Counts physical rows.
- RANGE — Groups rows with the same ORDER BY value together.
-- ROWS frame: exactly 2 rows before and 2 rows after
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
)
-- RANGE frame: all rows within a value range
SUM(amount) OVER (
ORDER BY order_date
RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW
)
The RANGE-based frame with intervals is supported in PostgreSQL and some other databases. It is particularly useful when your data has gaps (e.g., no sales on weekends) and you want a true calendar-based window rather than a row-count-based one.
Frame Clause Defaults
Understanding the defaults prevents subtle bugs:
| Scenario | Default Frame |
|---|---|
| No ORDER BY | Entire partition |
| ORDER BY present, no frame | RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW |
The second default is why SUM() OVER (ORDER BY date) gives you a running total — it implicitly includes all rows from the start of the partition up to the current row.
Real-World Example: Session Analysis
Let's put it all together with a practical user session analysis. Given a table of page views, identify sessions (gap > 30 minutes = new session) and calculate session metrics.
-- Step 1: Detect session boundaries using LAG
WITH page_views_with_gap AS (
SELECT
user_id,
page_url,
view_timestamp,
LAG(view_timestamp) OVER (
PARTITION BY user_id ORDER BY view_timestamp
) AS prev_view_timestamp,
CASE
WHEN view_timestamp - LAG(view_timestamp) OVER (
PARTITION BY user_id ORDER BY view_timestamp
) > INTERVAL '30 minutes'
OR LAG(view_timestamp) OVER (
PARTITION BY user_id ORDER BY view_timestamp
) IS NULL
THEN 1
ELSE 0
END AS is_new_session
FROM page_views
),
-- Step 2: Assign session IDs using a running sum of session boundaries
sessions AS (
SELECT
*,
SUM(is_new_session) OVER (
PARTITION BY user_id ORDER BY view_timestamp
) AS session_id
FROM page_views_with_gap
)
-- Step 3: Calculate session-level metrics
SELECT
user_id,
session_id,
COUNT(*) AS pages_viewed,
MIN(view_timestamp) AS session_start,
MAX(view_timestamp) AS session_end,
MAX(view_timestamp) - MIN(view_timestamp) AS session_duration
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_start;
This query chains three window functions (LAG, CASE with LAG, running SUM) to convert raw page views into meaningful sessions — a pattern used daily in analytics engineering.
Performance Tips
- Indexes matter. If you PARTITION BY customer_id and ORDER BY created_at, an index on
(customer_id, created_at)can dramatically speed up the query. - Reduce partition size. The larger the partition, the more work the database does. Filter early with WHERE clauses.
- Reuse windows. If multiple expressions share the same window definition, use a named window (supported in PostgreSQL and others):
SELECT
order_id,
amount,
SUM(amount) OVER w AS running_total,
AVG(amount) OVER w AS running_avg,
ROW_NUMBER() OVER w AS row_num
FROM orders
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date);
- Avoid unnecessary sorting. Each distinct OVER clause may require a separate sort operation. Consolidate window definitions when possible.
Common Mistakes to Avoid
- Forgetting that ROW_NUMBER is nondeterministic on ties. If two rows can have the same ordering value, add a tiebreaker column to get stable results.
- Using LAST_VALUE without an explicit frame. The default frame ends at the current row, not the partition end.
- Confusing ROWS and RANGE. ROWS counts physical rows; RANGE groups rows with equal ORDER BY values. In most cases, ROWS is what you want.
- Over-partitioning. PARTITION BY with high-cardinality columns can be expensive. Profile your queries.
Window Functions in Data Modeling
Window functions play a key role in dimensional modeling, particularly when building star schema or snowflake schema designs. They help with:
- Slowly Changing Dimensions (SCD Type 2): Using LEAD to set end dates when a new version of a dimension row arrives.
- Fact table enrichment: Adding running totals, moving averages, and rankings directly in your transformation layer.
- Data quality checks: Detecting duplicates, gaps in sequences, and outliers using ROW_NUMBER and LAG.
If you want to deepen your SQL fundamentals beyond window functions, check out our SQL fundamentals guide for a comprehensive foundation.
Conclusion
Window functions transform how you think about SQL. Instead of writing complex self-joins and subqueries, you express analytical logic directly and clearly. The key concepts to remember:
- ROW_NUMBER, RANK, DENSE_RANK for positioning and deduplication
- LEAD and LAG for row-to-row comparisons
- SUM, AVG, COUNT OVER for running and cumulative calculations
- PARTITION BY to scope your calculations
- Frame clauses to control exactly which rows participate
Practice these patterns on real data, and you will find yourself reaching for window functions as naturally as you reach for WHERE and GROUP BY. They are a core skill that will serve you well in every data engineering role — and they are a favorite topic in technical interviews.
Frequently Asked Questions
What are SQL window functions?
SQL window functions perform calculations across a set of rows related to the current row, defined by the OVER() clause, without collapsing the result into a single row like GROUP BY does. They enable operations like ranking, running totals, moving averages, and row-to-row comparisons while preserving the detail of every individual row in the output.
What is the difference between RANK and DENSE_RANK?
RANK and DENSE_RANK both assign the same rank to tied rows, but they differ in how they handle the next rank. RANK skips ranks after ties (e.g., 1, 2, 2, 4), while DENSE_RANK assigns consecutive ranks without gaps (e.g., 1, 2, 2, 3). Use RANK for competition-style results where gaps are acceptable, and DENSE_RANK when you want consecutive rankings like "top 3 categories" where ties should not push others out.
What is PARTITION BY in SQL window functions?
PARTITION BY divides the rows into groups (partitions) for the window function calculation, similar to how GROUP BY divides rows for aggregate functions, but without collapsing rows. For example, SUM(amount) OVER (PARTITION BY customer_id) calculates the total amount per customer while keeping every individual row visible. Each partition is processed independently by the window function.
When should I use window functions vs GROUP BY?
Use GROUP BY when you need a single summary row per group (e.g., total revenue per month). Use window functions when you need to perform calculations across related rows while keeping every individual row in the result (e.g., each order alongside the customer's total spend, or a running total). Window functions are also essential for ranking, deduplication, and comparing a row to its neighbors using LEAD and LAG.