SQL Window Functions: The Complete Guide for Data Engineers

    Master SQL window functions with practical examples. Learn ROW_NUMBER, RANK, DENSE_RANK, LEAD/LAG, running totals, and advanced frame clauses.

    By Adriano Sanges · 15 min read

    TL;DR: SQL window functions perform calculations across related rows without collapsing them, enabling rankings (ROW_NUMBER, RANK, DENSE_RANK), row comparisons (LEAD, LAG), running totals, and moving averages with clean, efficient syntax. They replace complex self-joins and subqueries, and are essential for data engineering tasks like deduplication, session analysis, and time-series calculations.


    Window functions are one of the most powerful features in SQL, yet many data engineers underuse them. Unlike regular aggregate functions that collapse rows into a single result, window functions let you perform calculations across a set of rows related to the current row without losing the underlying detail. If you can master window functions, you will write cleaner, faster, and more expressive SQL.

    This guide covers every major window function with real-world examples you will encounter in production data engineering work. Whether you are building dashboards, analyzing user behavior, or preparing data for machine learning, window functions are indispensable.

    Why Window Functions Matter

    Before window functions existed, tasks like ranking, running totals, and comparing a row to its neighbors required self-joins, correlated subqueries, or application-level logic. These approaches were verbose, error-prone, and slow. Window functions solve all of that with a clean, declarative syntax that the query optimizer can execute efficiently.

    Consider a simple scenario: you have an orders table and want to show each order alongside the customer's total spend. Without window functions, you would need a subquery or join. With a window function, it is a single expression:

    -- Without window functions: requires a subquery
    SELECT o.order_id, o.customer_id, o.amount,
           (SELECT SUM(amount) FROM orders o2 WHERE o2.customer_id = o.customer_id) AS total_spend
    FROM orders o;
    
    -- With window functions: clean and efficient
    SELECT order_id, customer_id, amount,
           SUM(amount) OVER (PARTITION BY customer_id) AS total_spend
    FROM orders;
    

    Both queries return the same result, but the window function version is easier to read, easier to extend, and typically runs faster.

    Anatomy of a Window Function

    Every window function follows this structure:

    function_name(arguments) OVER (
        [PARTITION BY column1, column2, ...]
        [ORDER BY column3, column4, ...]
        [frame_clause]
    )
    
    • PARTITION BY divides rows into groups (like GROUP BY but without collapsing them).
    • ORDER BY defines the order of rows within each partition.
    • Frame clause narrows the window to a subset of the partition (e.g., preceding 3 rows).

    Not every clause is required for every function. Ranking functions require ORDER BY. Aggregate window functions work with or without ORDER BY, and the frame clause applies only to aggregate and value functions such as FIRST_VALUE and LAST_VALUE; ranking functions ignore it.

    Ranking Functions

    Ranking functions assign a position to each row within a partition. They are the most commonly used window functions in data engineering.

    ROW_NUMBER()

    Assigns a unique sequential integer to each row within a partition. No ties — if two rows have the same ordering value, they get different numbers (arbitrarily).

    -- Assign a unique row number to each order per customer, newest first
    SELECT
        customer_id,
        order_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_date DESC
        ) AS row_num
    FROM orders;
    

    Common use case: deduplication. When you have duplicate records and need to keep only one per group, ROW_NUMBER is your go-to:

    -- Keep only the most recent order per customer
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC
               ) AS rn
        FROM orders
    )
    SELECT * FROM ranked WHERE rn = 1;
    

    This pattern appears constantly in production pipelines, especially when dealing with CDC (change data capture) data or slowly changing dimensions.

    RANK() and DENSE_RANK()

    Both handle ties differently from ROW_NUMBER:

    • RANK() — Tied rows get the same rank, but the next rank skips. (1, 2, 2, 4)
    • DENSE_RANK() — Tied rows get the same rank, and the next rank does not skip. (1, 2, 2, 3)

    -- Compare all three ranking functions on exam scores
    SELECT
        student_name,
        score,
        ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
        RANK()       OVER (ORDER BY score DESC) AS rank,
        DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rank
    FROM exam_results;
    
    student_name | score | row_num | rank | dense_rank
    Alice        |    95 |       1 |    1 |          1
    Bob          |    92 |       2 |    2 |          2
    Carol        |    92 |       3 |    2 |          2
    Dave         |    88 |       4 |    4 |          3

    Use RANK when gaps after ties are acceptable (e.g., competition results). Use DENSE_RANK when you want consecutive ranks regardless of ties (e.g., "top 3 categories" where ties should not push others out). These distinctions come up frequently in data engineering interviews.

    NTILE(n)

    Divides the partition into n roughly equal buckets and assigns a bucket number to each row. Useful for percentile analysis and balanced distribution.

    -- Split customers into 4 quartiles by total spend
    SELECT
        customer_id,
        total_spend,
        NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
    FROM customer_summary;
    

    Quartile 1 contains the highest spenders, quartile 4 the lowest. This is invaluable for segmentation and reporting.

    Offset Functions: LEAD and LAG

    LEAD and LAG let you access values from other rows relative to the current row without a self-join.

    • LAG(column, offset, default) — Looks backward (previous rows).
    • LEAD(column, offset, default) — Looks forward (next rows).

    -- Calculate day-over-day revenue change
    SELECT
        report_date,
        revenue,
        LAG(revenue, 1) OVER (ORDER BY report_date) AS prev_day_revenue,
        revenue - LAG(revenue, 1) OVER (ORDER BY report_date) AS daily_change,
        ROUND(
            (revenue - LAG(revenue, 1) OVER (ORDER BY report_date))
            / LAG(revenue, 1) OVER (ORDER BY report_date) * 100, 2
        ) AS pct_change
    FROM daily_revenue;
    

    Year-over-Year Comparison

    A classic analytics query: compare each month's revenue to the same month in the prior year.

    -- Year-over-year revenue comparison by month
    SELECT
        year,
        month,
        revenue,
        LAG(revenue, 12) OVER (ORDER BY year, month) AS revenue_prev_year,
        ROUND(
            (revenue - LAG(revenue, 12) OVER (ORDER BY year, month))
            / NULLIF(LAG(revenue, 12) OVER (ORDER BY year, month), 0) * 100, 2
        ) AS yoy_growth_pct
    FROM monthly_revenue
    ORDER BY year, month;
    

    The NULLIF prevents division by zero when there is no prior year data.

    FIRST_VALUE and LAST_VALUE

    These functions return the first or last value in the window frame. They are useful for comparisons against a baseline.

    -- Compare each employee's salary to the highest and lowest in their department
    SELECT
        department,
        employee_name,
        salary,
        FIRST_VALUE(salary) OVER (
            PARTITION BY department ORDER BY salary DESC
        ) AS highest_salary,
        salary - FIRST_VALUE(salary) OVER (
            PARTITION BY department ORDER BY salary DESC
        ) AS diff_from_top
    FROM employees;
    

    Warning about LAST_VALUE: By default, the window frame extends from the start of the partition to the current row, which means LAST_VALUE returns the current row's value — probably not what you want. Always specify the frame explicitly:

    -- Correct usage of LAST_VALUE with explicit frame
    SELECT
        department,
        employee_name,
        salary,
        LAST_VALUE(salary) OVER (
            PARTITION BY department
            ORDER BY salary DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS lowest_salary
    FROM employees;
    

    Aggregate Window Functions

    Any standard aggregate function (SUM, AVG, COUNT, MIN, MAX) can be used as a window function by adding OVER().

    Running Totals

    Running totals are one of the most common analytical patterns. They are trivial with window functions.

    -- Running total of daily sales
    SELECT
        sale_date,
        daily_amount,
        SUM(daily_amount) OVER (
            ORDER BY sale_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM daily_sales;
    

    When ORDER BY is present, the default frame is actually RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which produces the same running total as long as sale_date values are unique. Writing the ROWS frame explicitly improves readability and avoids surprises when dates repeat.

    Moving Averages

    Moving averages smooth out noise in time series data. Here is a 7-day moving average:

    -- 7-day moving average of daily active users
    SELECT
        report_date,
        daily_active_users,
        ROUND(
            AVG(daily_active_users) OVER (
                ORDER BY report_date
                ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
            ), 0
        ) AS moving_avg_7d
    FROM user_activity;
    

    Note: 6 PRECEDING AND CURRENT ROW gives you 7 rows total (6 before + current). For the first 6 rows, the average will be computed over fewer than 7 data points — keep this in mind when interpreting early values.
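    If those partial early values are unwanted, one option is to count the rows the frame actually covers and emit the average only for complete windows. A sketch, reusing the columns from the example above:

```sql
    -- Emit the 7-day average only once the frame covers a full 7 rows
    SELECT
        report_date,
        daily_active_users,
        CASE
            WHEN COUNT(*) OVER (
                     ORDER BY report_date
                     ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
                 ) = 7
            THEN ROUND(
                     AVG(daily_active_users) OVER (
                         ORDER BY report_date
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
                     ), 0
                 )
        END AS moving_avg_7d
    FROM user_activity;
```

    The CASE falls through to NULL for the first six rows, which most charting tools will simply skip.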

    Percentage of Total

    Calculate what percentage each row contributes to the group total:

    -- Each product's share of category revenue
    SELECT
        category,
        product_name,
        revenue,
        ROUND(
            revenue * 100.0 / SUM(revenue) OVER (PARTITION BY category), 2
        ) AS pct_of_category
    FROM product_revenue
    ORDER BY category, revenue DESC;
    

    Understanding Frame Clauses

    The frame clause controls exactly which rows are included in the window calculation. There are two types:

    • ROWS — Counts physical rows.
    • RANGE — Groups rows with the same ORDER BY value together.

    -- ROWS frame: exactly 2 rows before and 2 rows after
    SUM(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
    )
    
    -- RANGE frame: all rows within a value range
    SUM(amount) OVER (
        ORDER BY order_date
        RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW
    )
    

    The RANGE-based frame with intervals is supported in PostgreSQL and some other databases. It is particularly useful when your data has gaps (e.g., no sales on weekends) and you want a true calendar-based window rather than a row-count-based one.

    Frame Clause Defaults

    Understanding the defaults prevents subtle bugs:

    Scenario                   | Default Frame
    No ORDER BY                | Entire partition
    ORDER BY present, no frame | RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW

    The second default is why SUM() OVER (ORDER BY date) gives you a running total — it implicitly includes all rows from the start of the partition up to the current row.
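    With duplicate ORDER BY values the distinction becomes visible: under the default RANGE frame, tied rows are peers and share one running total, while an explicit ROWS frame advances one physical row at a time. A small sketch against a hypothetical payments table:

```sql
    -- Two payments share the date 2024-03-02
    SELECT
        paid_on,
        amount,
        SUM(amount) OVER (ORDER BY paid_on) AS range_total,  -- default RANGE frame
        SUM(amount) OVER (
            ORDER BY paid_on
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_total
    FROM payments;

    -- paid_on      amount  range_total  rows_total
    -- 2024-03-01   10      10           10
    -- 2024-03-02   20      50           30   <- peers share the RANGE total
    -- 2024-03-02   20      50           50
```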

    Real-World Example: Session Analysis

    Let's put it all together with a practical user session analysis. Given a table of page views, identify sessions (gap > 30 minutes = new session) and calculate session metrics.

    -- Step 1: Detect session boundaries using LAG
    WITH page_views_with_gap AS (
        SELECT
            user_id,
            page_url,
            view_timestamp,
            LAG(view_timestamp) OVER (
                PARTITION BY user_id ORDER BY view_timestamp
            ) AS prev_view_timestamp,
            CASE
                WHEN view_timestamp - LAG(view_timestamp) OVER (
                    PARTITION BY user_id ORDER BY view_timestamp
                ) > INTERVAL '30 minutes'
                OR LAG(view_timestamp) OVER (
                    PARTITION BY user_id ORDER BY view_timestamp
                ) IS NULL
                THEN 1
                ELSE 0
            END AS is_new_session
        FROM page_views
    ),
    
    -- Step 2: Assign session IDs using a running sum of session boundaries
    sessions AS (
        SELECT
            *,
            SUM(is_new_session) OVER (
                PARTITION BY user_id ORDER BY view_timestamp
            ) AS session_id
        FROM page_views_with_gap
    )
    
    -- Step 3: Calculate session-level metrics
    SELECT
        user_id,
        session_id,
        COUNT(*) AS pages_viewed,
        MIN(view_timestamp) AS session_start,
        MAX(view_timestamp) AS session_end,
        MAX(view_timestamp) - MIN(view_timestamp) AS session_duration
    FROM sessions
    GROUP BY user_id, session_id
    ORDER BY user_id, session_start;
    

    This query chains three window functions (LAG, CASE with LAG, running SUM) to convert raw page views into meaningful sessions — a pattern used daily in analytics engineering.

    Performance Tips

    1. Indexes matter. If you PARTITION BY customer_id and ORDER BY created_at, an index on (customer_id, created_at) can dramatically speed up the query.
    2. Reduce partition size. The larger the partition, the more work the database does. Filter early with WHERE clauses.
    3. Reuse windows. If multiple expressions share the same window definition, use a named window (supported in PostgreSQL and others):
    SELECT
        order_id,
        amount,
        SUM(amount) OVER w AS running_total,
        AVG(amount) OVER w AS running_avg,
        ROW_NUMBER() OVER w AS row_num
    FROM orders
    WINDOW w AS (PARTITION BY customer_id ORDER BY order_date);
    
    4. Avoid unnecessary sorting. Each distinct OVER clause may require a separate sort operation. Consolidate window definitions when possible.

    Common Mistakes to Avoid

    • Forgetting that ROW_NUMBER is nondeterministic on ties. If two rows can have the same ordering value, add a tiebreaker column to get stable results.
    • Using LAST_VALUE without an explicit frame. The default frame ends at the current row, not the partition end.
    • Confusing ROWS and RANGE. ROWS counts physical rows; RANGE groups rows with equal ORDER BY values. In most cases, ROWS is what you want.
    • Over-partitioning. PARTITION BY with high-cardinality columns can be expensive. Profile your queries.
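
    The first mistake has a one-line fix: append a unique column as a tiebreaker so the ranking is stable across reruns. A sketch, assuming order_id is unique:

```sql
    -- order_id breaks ties when two orders share the same order_date,
    -- so repeated pipeline runs always keep the same row
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY order_date DESC, order_id DESC
           ) AS rn
    FROM orders;
```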

    Window Functions in Data Modeling

    Window functions play a key role in dimensional modeling, particularly when building star schema or snowflake schema designs. They help with:

    • Slowly Changing Dimensions (SCD Type 2): Using LEAD to set end dates when a new version of a dimension row arrives.
    • Fact table enrichment: Adding running totals, moving averages, and rankings directly in your transformation layer.
    • Data quality checks: Detecting duplicates, gaps in sequences, and outliers using ROW_NUMBER and LAG.
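
    The SCD Type 2 point above can be sketched in one pass: each version's end date is the next version's start date, and the latest version stays open-ended. Table and column names here (dim_customer_history, valid_from) are hypothetical:

```sql
    -- Close out each dimension version with the start date of the next one
    SELECT
        customer_id,
        attributes,
        valid_from,
        LEAD(valid_from) OVER (
            PARTITION BY customer_id
            ORDER BY valid_from
        ) AS valid_to  -- NULL marks the current, open-ended version
    FROM dim_customer_history;
```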

    If you want to deepen your SQL fundamentals beyond window functions, check out our SQL fundamentals guide for a comprehensive foundation.

    Conclusion

    Window functions transform how you think about SQL. Instead of writing complex self-joins and subqueries, you express analytical logic directly and clearly. The key concepts to remember:

    • ROW_NUMBER, RANK, DENSE_RANK for positioning and deduplication
    • LEAD and LAG for row-to-row comparisons
    • SUM, AVG, COUNT OVER for running and cumulative calculations
    • PARTITION BY to scope your calculations
    • Frame clauses to control exactly which rows participate

    Practice these patterns on real data, and you will find yourself reaching for window functions as naturally as you reach for WHERE and GROUP BY. They are a core skill that will serve you well in every data engineering role — and they are a favorite topic in technical interviews.

    Frequently Asked Questions

    What are SQL window functions?

    SQL window functions perform calculations across a set of rows related to the current row, defined by the OVER() clause, without collapsing the result into a single row like GROUP BY does. They enable operations like ranking, running totals, moving averages, and row-to-row comparisons while preserving the detail of every individual row in the output.

    What is the difference between RANK and DENSE_RANK?

    RANK and DENSE_RANK both assign the same rank to tied rows, but they differ in how they handle the next rank. RANK skips ranks after ties (e.g., 1, 2, 2, 4), while DENSE_RANK assigns consecutive ranks without gaps (e.g., 1, 2, 2, 3). Use RANK for competition-style results where gaps are acceptable, and DENSE_RANK when you want consecutive rankings like "top 3 categories" where ties should not push others out.

    What is PARTITION BY in SQL window functions?

    PARTITION BY divides the rows into groups (partitions) for the window function calculation, similar to how GROUP BY divides rows for aggregate functions, but without collapsing rows. For example, SUM(amount) OVER (PARTITION BY customer_id) calculates the total amount per customer while keeping every individual row visible. Each partition is processed independently by the window function.

    When should I use window functions vs GROUP BY?

    Use GROUP BY when you need a single summary row per group (e.g., total revenue per month). Use window functions when you need to perform calculations across related rows while keeping every individual row in the result (e.g., each order alongside the customer's total spend, or a running total). Window functions are also essential for ranking, deduplication, and comparing a row to its neighbors using LEAD and LAG.

    About the Author

    Adriano Sanges is a data engineering professional and the creator of dataskew.io. With years of experience building data platforms at scale, he shares practical insights and hands-on guides to help aspiring data engineers advance their careers.

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.