Cloud Cost Optimization for Data Engineering

    Master cloud cost management strategies for BigQuery, Snowflake, and AWS data services. Learn pricing models, optimization techniques, and monitoring best practices.

    Level: Intermediate
    Tools:
    BigQuery, Snowflake, AWS S3, AWS EMR, AWS Glue, Terraform, Cloud Console

    Skills You'll Learn:

    Cloud cost modeling, Query optimization for cost, Storage tier management, Budget monitoring and alerting, Resource right-sizing, FinOps fundamentals

    Module 1: FinOps Fundamentals for Data Engineers

    This module introduces the financial dimension of cloud data engineering. As organizations migrate workloads to the cloud and data volumes grow exponentially, cost management has become a core competency for data engineers -- not just a concern for finance teams. Understanding how your architectural decisions translate into cloud bills is what separates a senior data engineer from a junior one.

    1.1 Why Cost Matters in Data Engineering

    Data engineering is consistently one of the largest contributors to cloud spending in modern organizations. Unlike a web application that serves relatively predictable traffic, data pipelines routinely process terabytes or petabytes of data, and small inefficiencies can compound into enormous costs at scale. A single poorly written query in BigQuery scanning a multi-petabyte table can cost hundreds or even thousands of dollars in a matter of minutes. A Snowflake warehouse left running over a weekend can burn through thousands of credits with no one watching.

    The exponential growth of data makes this problem worse every quarter. IDC estimates that global data creation will exceed 180 zettabytes by 2025, and most organizations see their data volumes double every two to three years. As the data grows, so do the compute resources required to process it, the storage costs to retain it, and the network costs to move it between services and regions.

    Data engineers sit at the intersection of all three major cloud cost categories:

    1. Compute costs: Running Spark jobs, executing warehouse queries, powering ETL pipelines. Compute is typically the largest line item in a data engineering budget, and it is directly influenced by the efficiency of your code and architecture.

    2. Storage costs: Maintaining raw data lakes, curated data warehouses, staging areas, backups, and historical snapshots. While storage is cheaper per unit than compute, it accumulates relentlessly and is often overlooked because it grows silently.

    3. Network egress costs: Moving data between regions, across clouds, or out to external consumers. Egress charges are the hidden cost that surprises many teams, particularly when architectures span multiple cloud providers or regions.

    The data engineer's role in cost management is unique because technical decisions made during pipeline design have a direct and measurable impact on the cloud bill. Choosing the right partitioning strategy, selecting appropriate warehouse sizes, deciding between materialized and live views, and designing incremental processing patterns are all engineering decisions with immediate financial consequences.

    Organizations increasingly expect data engineers to think about cost as a first-class requirement alongside correctness, performance, and reliability. This is not about being cheap -- it is about being efficient, ensuring that every dollar spent on cloud infrastructure delivers maximum value to the business.

    1.2 Understanding Cloud Pricing Models

    Before you can optimize cloud costs, you need to understand how cloud providers charge for their services. While the details vary between providers, there are several universal pricing concepts that apply across all major platforms.

    On-Demand Pricing

    On-demand pricing is the default and most flexible model. You pay for exactly what you use, with no upfront commitment. Rates are typically billed per second, per minute, or per hour for compute, and per GB-month for storage.

    Advantages:

    • No upfront commitment or risk
    • Scale up and down freely
    • Simple to understand and budget for small workloads
    • Ideal for variable or unpredictable workloads

    Disadvantages:

    • Highest per-unit cost
    • Costs can spike unpredictably with usage growth
    • No incentive to plan capacity

    Reserved / Committed Use Pricing

    All major cloud providers offer significant discounts (typically 30-70%) in exchange for committing to a certain level of usage over one to three years. AWS calls these Reserved Instances and Savings Plans, GCP offers Committed Use Discounts, and Snowflake offers pre-purchased capacity.

    Advantages:

    • Substantial cost savings for predictable workloads
    • Predictable monthly costs for budgeting
    • Can be combined with on-demand for burst capacity

    Disadvantages:

    • Requires upfront commitment (financial risk if needs change)
    • Requires accurate capacity planning
    • Unused reservations are wasted money

    Spot / Preemptible Pricing

    Cloud providers offer deeply discounted compute instances (60-90% off on-demand pricing) that can be reclaimed with short notice. AWS calls these Spot Instances, GCP calls them Preemptible VMs (or Spot VMs), and Azure calls them Spot VMs.

    Advantages:

    • Dramatic cost savings for fault-tolerant workloads
    • Ideal for batch processing, ETL jobs, and data pipelines
    • Can be combined with on-demand instances for resilience

    Disadvantages:

    • Instances can be terminated with as little as two minutes' notice
    • Not suitable for long-running, stateful, or latency-sensitive workloads
    • Requires application-level fault tolerance

    The Three Pillars of Cloud Cost

    When analyzing any cloud bill, costs break down into three fundamental categories:

    Category Description Typical % of Data Eng. Bill
    Compute Processing power (CPUs, GPUs, memory) 50-70%
    Storage Data at rest (object storage, block storage, databases) 15-30%
    Network Data transfer (egress, cross-region, cross-AZ) 5-15%

    Understanding this breakdown for your specific workloads is the first step toward optimization. Many teams discover that one category dominates their bill, and focusing optimization efforts there yields the greatest returns.

    Total Cost of Ownership (TCO)

    TCO extends beyond the direct cloud bill to include the full cost of running data infrastructure:

    • Direct cloud costs: Compute, storage, network as described above
    • Engineering time: Hours spent building, maintaining, and debugging pipelines
    • Operational overhead: Monitoring, incident response, on-call rotations
    • Opportunity cost: What could your team build if they were not managing infrastructure?
    • Data quality costs: The business impact of late, incorrect, or missing data

    A more expensive managed service that reduces engineering time can have a lower TCO than a cheaper self-managed solution. For example, using AWS Glue at a higher per-DPU cost might be cheaper in TCO than running a self-managed Spark cluster on EMR, once you account for the operational overhead of cluster management.
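
    To make this concrete, here is a minimal TCO sketch in Python. All of the inputs -- compute rates, maintenance hours, and the loaded cost of engineering time -- are illustrative assumptions, not benchmarks; substitute your own figures.

    # Rough TCO comparison: managed Glue job vs. self-managed Spark on EMR.
    # Every input below is an illustrative assumption.
    GLUE_DPU_HOUR = 0.44       # $ per DPU-hour
    EMR_NODE_HOUR = 0.24       # $ per node-hour (EC2 + EMR surcharge, assumed)
    ENGINEER_HOURLY = 75.0     # fully loaded $ per hour of engineering time (assumed)

    def monthly_tco(compute_cost: float, ops_hours_per_month: float) -> float:
        """Direct cloud spend plus the cost of operational engineering time."""
        return compute_cost + ops_hours_per_month * ENGINEER_HOURLY

    # Glue: 10 DPUs x 1 hour/day x 30 days, ~2 hours/month of maintenance
    glue = monthly_tco(10 * 1 * 30 * GLUE_DPU_HOUR, ops_hours_per_month=2)

    # EMR: 3 nodes x 2 hours/day x 30 days, ~20 hours/month of cluster upkeep
    emr = monthly_tco(3 * 2 * 30 * EMR_NODE_HOUR, ops_hours_per_month=20)

    print(f"Glue TCO: ${glue:,.2f}/month vs. EMR TCO: ${emr:,.2f}/month")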

    1.3 The FinOps Framework

    FinOps (Cloud Financial Operations) is an evolving discipline and cultural practice that brings financial accountability to the variable-spend model of cloud computing. It provides a structured approach for organizations to manage cloud costs without sacrificing speed or innovation.

    The Three Phases of FinOps

    Phase 1: Inform

    The Inform phase is about creating visibility into cloud spending. You cannot optimize what you cannot see. Key activities include:

    • Establishing cost allocation through tagging and account structure
    • Building dashboards that show cost trends and anomalies
    • Attributing costs to teams, projects, and business units
    • Understanding the relationship between technical metrics and financial metrics
    • Creating a shared vocabulary for discussing cloud costs

    Phase 2: Optimize

    The Optimize phase focuses on reducing waste and improving efficiency. This is where data engineers have the most direct impact:

    • Right-sizing compute resources to match actual workload requirements
    • Eliminating idle or unused resources
    • Implementing reserved pricing for predictable baseline workloads
    • Optimizing queries and pipelines for cost efficiency
    • Choosing the most cost-effective services and architectures

    Phase 3: Operate

    The Operate phase embeds cost awareness into ongoing operations:

    • Setting budgets and alerts for anomaly detection
    • Establishing governance policies for resource provisioning
    • Integrating cost considerations into design reviews and architecture decisions
    • Automating cost optimization actions (e.g., auto-scaling, scheduled shutdowns)
    • Conducting regular cost reviews and retrospectives

    How Data Teams Fit Into FinOps

    Data engineering teams are uniquely positioned in the FinOps model because they are both major consumers of cloud resources and possessors of the technical knowledge needed to optimize those costs. Effective FinOps for data teams requires:

    1. Shared accountability: Every data engineer should understand the cost implications of their design decisions. Cost is not solely the domain of the finance team or a centralized FinOps function.

    2. Cost as a metric: Just as data teams track pipeline latency, data quality, and uptime, they should track cost per pipeline run, cost per TB processed, and cost trends over time.

    3. Architectural decision records: When choosing between approaches (e.g., full refresh vs. incremental, materialized view vs. live query), document the cost implications alongside performance and maintainability.

    4. Collaboration with finance: Data engineers should participate in cloud billing reviews and help translate technical metrics into business context. A line item showing "$15,000 for BigQuery compute" is more actionable when it can be attributed to specific pipelines and business use cases.


    Module 2: BigQuery Cost Optimization

    Google BigQuery is a fully managed, serverless data warehouse that excels at analyzing large datasets quickly. Its serverless nature means you do not manage infrastructure, but it also means you must understand its pricing model deeply to avoid costly surprises. BigQuery's pricing is fundamentally different from traditional databases, and optimizing for cost requires a different mindset than optimizing for performance alone.

    2.1 BigQuery Pricing Models

    BigQuery offers two primary pricing models for compute, and understanding when to use each is one of the most impactful cost decisions you will make.

    On-Demand Pricing (Per TB Scanned)

    With on-demand pricing, you pay based on the amount of data your queries scan. As of the current pricing, this costs $6.25 per TB scanned in most regions. The first 1 TB of data scanned per month is free.

    Key characteristics:

    • You pay only when you run queries
    • Cost is proportional to the data scanned, not the complexity of the query
    • No cost when queries are not running
    • Ideal for ad-hoc analysis and variable workloads

    Example cost calculation:

    Suppose you have a 10 TB table and run a query that scans the entire table:

    Query cost = 10 TB x $6.25/TB = $62.50 per query execution
    

    If this query runs daily as part of a pipeline:

    Monthly cost = $62.50 x 30 days = $1,875/month
    

    Now suppose you add a partition filter that reduces the scan to 500 GB:

    Query cost = 0.5 TB x $6.25/TB = $3.13 per query execution
    Monthly cost = $3.13 x 30 days = $93.75/month
    

    That single optimization saves $1,781.25 per month -- a 95% reduction.

    BigQuery Editions (Slot-Based Pricing)

    BigQuery Editions use a capacity-based model where you purchase "slots" -- units of computational capacity. You pay for slot-hours regardless of how much data your queries scan.

    BigQuery offers three editions:

    Edition Use Case Baseline Pricing (per slot-hour)
    Standard Development, ad-hoc ~$0.04
    Enterprise Production workloads ~$0.06
    Enterprise Plus Mission-critical, multi-region ~$0.10

    Slots can be purchased as autoscaling (pay-as-you-go, billed per second) or baseline + autoscaling (committed baseline with burst capacity).

    Example cost calculation:

    Suppose your team consistently uses 500 slots for 8 hours per day:

    Daily cost (autoscaling, Enterprise) = 500 slots x 8 hours x $0.06 = $240/day
    Monthly cost = $240 x 30 = $7,200/month
    

    With a committed baseline of 500 slots (annual commitment, ~40% discount):

    Monthly cost = 500 slots x 730 hours x $0.06 x 0.6 = ~$13,140/month
    

    Wait -- that looks more expensive. The key insight is that baseline commitments cover 24/7, so they only make sense if you are using those slots around the clock. For 8 hours/day, autoscaling is cheaper.

    Choosing the Right Model

    Use on-demand when:

    • Monthly data scanned is under 5-10 TB
    • Workloads are unpredictable or infrequent
    • You are in early stages and still learning your usage patterns
    • Ad-hoc analysis is the primary use case

    Use Editions (slot-based) when:

    • Monthly data scanned exceeds 10-20 TB consistently
    • You want predictable, budgetable costs
    • You have many concurrent users or queries
    • You need performance guarantees (slots provide dedicated capacity)

    Break-even analysis:

    To find your break-even point, compare your monthly on-demand spend with what you would pay for the equivalent slot capacity:

    Break-even TB scanned = (Slot cost per month) / $6.25
    

    For example, if 100 Enterprise autoscaling slots cost approximately $4,380/month:

    Break-even = $4,380 / $6.25 = ~700 TB scanned per month
    

    If your queries scan more than 700 TB per month, slot-based pricing is likely cheaper.
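
    The break-even arithmetic is easy to script so you can re-run it as your usage grows. A minimal sketch, assuming the on-demand and Enterprise slot rates quoted above (adjust for your region and edition):

    # Break-even between BigQuery on-demand and autoscaling slots (illustrative rates).
    ON_DEMAND_PER_TB = 6.25        # $ per TB scanned
    SLOT_HOUR_ENTERPRISE = 0.06    # $ per slot-hour
    HOURS_PER_MONTH = 730

    def breakeven_tb(slots: int, slot_hour_price: float = SLOT_HOUR_ENTERPRISE) -> float:
        """TB scanned per month at which slot pricing matches on-demand spend."""
        monthly_slot_cost = slots * HOURS_PER_MONTH * slot_hour_price
        return monthly_slot_cost / ON_DEMAND_PER_TB

    print(f"100 slots break even at ~{breakeven_tb(100):,.0f} TB scanned per month")
    # -> roughly 700 TB, matching the calculation above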

    2.2 Query Optimization for Cost

    In BigQuery's on-demand model, cost is directly tied to the amount of data scanned. Every optimization that reduces the bytes scanned reduces your bill. Even in slot-based models, efficient queries complete faster, freeing slots for other workloads.

    Avoid SELECT *

    The single most impactful habit for BigQuery cost optimization is to never use SELECT * in production queries. Because BigQuery uses columnar storage, it only reads the columns you reference. Selecting all columns forces BigQuery to read the entire table.

    Before (expensive):

    -- Scans ALL columns in the table
    -- If the table has 50 columns but you only need 3, you're paying for 47 unnecessary columns
    SELECT *
    FROM `project.dataset.events`
    WHERE event_date = '2024-01-15'
    

    After (optimized):

    -- Scans only the 3 columns you need
    SELECT user_id, event_type, event_timestamp
    FROM `project.dataset.events`
    WHERE event_date = '2024-01-15'
    

    On a table with 50 columns averaging 100 bytes each and 1 billion rows, SELECT * scans approximately 5 TB, while selecting 3 specific columns scans approximately 300 GB -- a 94% reduction in cost.
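
    Before running a query at all, you can check what it would scan. One way is a dry run through the BigQuery Python client, which returns the estimated bytes processed without executing anything -- a minimal sketch, with placeholder project and table names:

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT user_id, event_type, event_timestamp
        FROM `project.dataset.events`
        WHERE event_date = '2024-01-15'
    """

    # Dry run: BigQuery estimates bytes scanned without running the query
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)

    tb_scanned = job.total_bytes_processed / 1024 ** 4
    print(f"Estimated scan: {tb_scanned:.3f} TB (~${tb_scanned * 6.25:.2f} on-demand)")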

    Partition Pruning

    Partitioned tables in BigQuery divide data into segments based on a column value (typically a date or timestamp). When your query includes a filter on the partition column, BigQuery only scans the relevant partitions.

    Creating a partitioned table:

    CREATE TABLE `project.dataset.events`
    (
      event_id STRING,
      user_id STRING,
      event_type STRING,
      event_timestamp TIMESTAMP,
      event_date DATE,
      payload JSON
    )
    PARTITION BY event_date
    OPTIONS (
      partition_expiration_days = 365,
      require_partition_filter = TRUE
    );
    

    The require_partition_filter = TRUE option is a powerful safety net -- it prevents queries from running without a partition filter, ensuring that no one accidentally scans the entire table.

    Query with partition pruning:

    -- Only scans partitions for the specified date range
    SELECT user_id, event_type, event_timestamp
    FROM `project.dataset.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    

    If the table contains 3 years of data (approximately 1,095 daily partitions), filtering to one month means BigQuery scans roughly 31/1095 = 2.8% of the data.

    Clustering for Filtered Queries

    Clustering organizes data within each partition based on the values of specified columns. When you filter on clustered columns, BigQuery can skip irrelevant blocks of data within a partition, further reducing the bytes scanned.

    CREATE TABLE `project.dataset.events`
    (
      event_id STRING,
      user_id STRING,
      event_type STRING,
      event_timestamp TIMESTAMP,
      event_date DATE,
      country STRING
    )
    PARTITION BY event_date
    CLUSTER BY country, event_type;
    

    Query benefiting from clustering:

    -- Partition pruning on event_date AND block pruning on country and event_type
    SELECT user_id, event_timestamp
    FROM `project.dataset.events`
    WHERE event_date = '2024-01-15'
      AND country = 'US'
      AND event_type = 'purchase'
    

    Clustering is most effective when the clustered columns have high cardinality and are frequently used in WHERE clauses or JOIN conditions. You can cluster on up to four columns, and the order matters -- put the most frequently filtered column first.

    Materialized Views

    Materialized views precompute and store query results, allowing BigQuery to read from the materialized data instead of scanning the base tables. BigQuery automatically maintains materialized views, refreshing them when the underlying data changes.

    CREATE MATERIALIZED VIEW `project.dataset.daily_event_summary`
    AS
    SELECT
      event_date,
      event_type,
      country,
      COUNT(*) as event_count,
      COUNT(DISTINCT user_id) as unique_users
    FROM `project.dataset.events`
    GROUP BY event_date, event_type, country;
    

    When you query the base table with a compatible aggregation, BigQuery automatically routes the query to the materialized view:

    -- BigQuery can transparently rewrite this aggregation to read from the materialized view
    SELECT event_date, COUNT(*) AS total_events
    FROM `project.dataset.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY event_date
    

    Materialized views incur storage costs and maintenance overhead, so they are most valuable for frequently run aggregation queries on large tables.

    Auditing Query Costs with INFORMATION_SCHEMA

    BigQuery provides detailed query metadata through the INFORMATION_SCHEMA views, allowing you to identify expensive queries and track cost trends.

    Find the most expensive queries in the last 30 days:

    SELECT
      user_email,
      job_id,
      query,
      total_bytes_processed,
      ROUND(total_bytes_processed / POW(1024, 4), 2) AS tb_processed,
      ROUND(total_bytes_processed / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd,
      creation_time,
      total_slot_ms
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
      AND state = 'DONE'
    ORDER BY total_bytes_processed DESC
    LIMIT 50;
    

    Find cost by user over the last 7 days:

    SELECT
      user_email,
      COUNT(*) AS query_count,
      ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS total_tb_scanned,
      ROUND(SUM(total_bytes_processed) / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
      AND job_type = 'QUERY'
      AND state = 'DONE'
    GROUP BY user_email
    ORDER BY estimated_cost_usd DESC;
    

    Running these audits regularly (weekly or even daily) helps you catch cost anomalies early, identify users or pipelines that need optimization, and track the impact of your optimization efforts over time.

    2.3 Storage Optimization

    BigQuery charges for data stored in its tables. While storage is cheaper than compute, it accumulates over time and can become a significant cost driver if left unmanaged.

    Active vs. Long-Term Storage

    BigQuery automatically classifies table data into two pricing tiers:

    Tier Condition Price (per GB/month)
    Active Modified in the last 90 days ~$0.02
    Long-term Not modified for 90+ days ~$0.01

    The transition to long-term storage is automatic -- you do not need to do anything. However, be aware that modifying a table resets the 90-day clock: for a non-partitioned table, even appending a single row resets the clock for the entire table, while for a partitioned table the clock is tracked per partition. This means a non-partitioned, append-only table with daily loads will always be billed at the active rate, whereas a date-partitioned table lets older partitions age into long-term pricing.

    Strategy: For large historical tables, consider separating current data (which is frequently appended to) from historical data (which is rarely modified). The historical table will naturally transition to long-term storage pricing.
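
    As a rough illustration of the payoff, the sketch below compares a table billed entirely at the active rate with one where older data has aged into long-term pricing. The table size and hot fraction are assumptions; the per-GB rates are the approximate figures from the table above.

    # Illustrative active vs. long-term storage comparison (approximate BigQuery rates).
    ACTIVE_PER_GB = 0.02
    LONG_TERM_PER_GB = 0.01

    total_gb = 50_000        # 50 TB of table data (assumed)
    hot_fraction = 0.10      # only the newest ~10% is still being modified (assumed)

    all_active = total_gb * ACTIVE_PER_GB
    tiered = (total_gb * hot_fraction * ACTIVE_PER_GB
              + total_gb * (1 - hot_fraction) * LONG_TERM_PER_GB)

    print(f"All active: ${all_active:,.0f}/month, tiered: ${tiered:,.0f}/month")
    # -> $1,000 vs. $550 per month in this example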

    Table Expiration Policies

    Setting expiration policies ensures that temporary or staging tables do not accumulate indefinitely:

    -- Set table expiration to 7 days from creation
    CREATE TABLE `project.dataset.staging_events`
    (
      event_id STRING,
      event_data JSON
    )
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    );
    
    -- Set default expiration for all new tables in a dataset
    ALTER SCHEMA `project.dataset`
    SET OPTIONS (
      default_table_expiration_days = 30
    );
    

    For partitioned tables, you can set partition-level expiration, which is even more powerful:

    -- Automatically delete partitions older than 365 days
    CREATE TABLE `project.dataset.events`
    (
      event_id STRING,
      event_date DATE
    )
    PARTITION BY event_date
    OPTIONS (
      partition_expiration_days = 365
    );
    

    This approach ensures that your table always contains at most one year of data, with old partitions automatically cleaned up.

    Removing Duplicate Data

    Duplicate data wastes storage and increases query costs. BigQuery does not enforce unique constraints, so duplicates can easily creep in through pipeline retries or overlapping data loads.

    Identify duplicates:

    SELECT event_id, COUNT(*) as duplicate_count
    FROM `project.dataset.events`
    GROUP BY event_id
    HAVING COUNT(*) > 1
    ORDER BY duplicate_count DESC
    LIMIT 100;
    

    Deduplicate by rewriting the table with ROW_NUMBER():

    -- Create a deduplicated version of the table
    CREATE OR REPLACE TABLE `project.dataset.events_deduped` AS
    SELECT * EXCEPT(row_num)
    FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp DESC) as row_num
      FROM `project.dataset.events`
    )
    WHERE row_num = 1;
    

    Choosing the Right Partitioning Strategy

    The choice of partition column and granularity has a significant impact on both query cost and storage efficiency:

    Strategy Best For Partition Count
    Daily partitioning on date Time-series data with daily queries ~365/year
    Monthly partitioning Historical analysis with monthly grain ~12/year
    Integer range partitioning Non-time-series data (e.g., customer segments) Varies
    Ingestion-time partitioning When no natural partition column exists ~365/year

    Guidelines:

    • Choose a column that is frequently used in WHERE clauses
    • Ensure partitions are not too small (under 1 GB) or too large (over 5 TB)
    • Daily partitioning is the most common and works well for most time-series data
    • Always require partition filters on large tables to prevent accidental full scans

    Module 3: Snowflake Cost Optimization

    Snowflake's architecture separates storage and compute, giving you independent control over each. This separation is powerful but introduces a pricing model that differs significantly from both traditional databases and other cloud warehouses. Understanding Snowflake's credit-based system is essential for cost management.

    3.1 Understanding Snowflake Credits

    Snowflake charges for compute using a unit called a "credit." The cost of a credit depends on your edition and region, but typically ranges from $2 to $4 per credit. Credits are consumed whenever a virtual warehouse is running, regardless of whether it is actively processing queries.

    Warehouse Sizes and Credit Consumption

    Each warehouse size consumes a fixed number of credits per hour:

    Warehouse Size Credits/Hour Approx. Nodes
    X-Small (XS) 1 1
    Small (S) 2 2
    Medium (M) 4 4
    Large (L) 8 8
    X-Large (XL) 16 16
    2X-Large 32 32
    3X-Large 64 64
    4X-Large 128 128

    Notice that each size step doubles the credit consumption. This means upgrading from Medium to Large doubles your cost, even if the query only runs marginally faster.

    Calculating Query Costs

    The cost of a query depends on the warehouse size, how long it runs, and the per-credit price. Snowflake bills in one-second increments with a minimum of 60 seconds per warehouse resume.

    Example:

    Warehouse size: Medium (4 credits/hour)
    Query runtime: 45 seconds
    Credit price: $3.00
    
    Cost = (4 credits/hour) x (45 seconds / 3600 seconds) x $3.00
    Cost = 4 x 0.0125 x $3.00
    Cost = $0.15 per query execution
    

    Now consider running the same query on an X-Large warehouse where it completes in 12 seconds:

    Cost = (16 credits/hour) x (12 seconds / 3600 seconds) x $3.00
    Cost = 16 x 0.00333 x $3.00
    Cost = $0.16 per query execution
    

    Even though the XL warehouse is 4x more powerful and finishes 3.75x faster, the cost is nearly identical for this single query. However, for a 10-minute query on Medium that takes 2.5 minutes on XL:

    Medium: 4 x (600/3600) x $3.00 = $2.00
    XL:     16 x (150/3600) x $3.00 = $2.00
    

    The costs remain similar because Snowflake's linear scaling means larger warehouses finish proportionally faster. The real savings come from auto-suspend behavior -- a larger warehouse that finishes quickly and suspends sooner can actually cost less than a smaller warehouse that stays active longer processing a queue of queries.
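
    These per-query calculations generalize into a small helper. The sketch below encodes the credits-per-hour table and the 60-second minimum billed on each resume; the $3.00 credit price is an assumption to replace with your contracted rate.

    # Estimate the warehouse cost of a single Snowflake query run (illustrative).
    CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8,
                        "XL": 16, "2XL": 32, "3XL": 64, "4XL": 128}

    def query_cost(size: str, runtime_seconds: float,
                   credit_price: float = 3.00, just_resumed: bool = False) -> float:
        """Cost of one query; applies the 60-second minimum if the warehouse just resumed."""
        billed_seconds = max(runtime_seconds, 60) if just_resumed else runtime_seconds
        return CREDITS_PER_HOUR[size] * (billed_seconds / 3600) * credit_price

    print(f"Medium, 45s:              ${query_cost('M', 45):.2f}")   # ~$0.15
    print(f"X-Large, 12s:             ${query_cost('XL', 12):.2f}")  # ~$0.16
    print(f"Medium, 45s after resume: ${query_cost('M', 45, just_resumed=True):.2f}")  # ~$0.20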

    3.2 Warehouse Management

    Effective warehouse management is the single most impactful lever for Snowflake cost optimization. A poorly configured warehouse can burn credits 24/7 with no one noticing.

    Auto-Suspend and Auto-Resume

    Auto-suspend shuts down a warehouse after a period of inactivity. Auto-resume starts it automatically when a query arrives. These two settings are your first line of defense against wasted credits.

    -- Create a warehouse with aggressive auto-suspend (60 seconds)
    CREATE WAREHOUSE analytics_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 60        -- Suspend after 60 seconds of inactivity
      AUTO_RESUME = TRUE       -- Automatically resume when queries arrive
      INITIALLY_SUSPENDED = TRUE;
    
    -- Modify an existing warehouse
    ALTER WAREHOUSE etl_wh SET
      AUTO_SUSPEND = 120
      AUTO_RESUME = TRUE;
    

    Recommended auto-suspend values by use case:

    Use Case Auto-Suspend (seconds) Rationale
    ETL / Batch pipelines 60 Jobs are scheduled; no waiting users
    BI / Dashboard queries 300 Users may run follow-up queries
    Ad-hoc analysis 300-600 Analysts explore iteratively
    Data science / ML 600-900 Long think-time between queries

    The 60-second minimum charge per resume means that warehouses resuming frequently for short queries accumulate cost far beyond the compute they actually use. If your ETL pipeline runs 100 short queries, each separated by 2 minutes of inactivity, the warehouse will suspend and resume 100 times, billing roughly 100 minutes (minimum charges plus the idle time before each suspend) for only a few minutes of real work. The most cost-effective fix is to batch the queries so that a single warehouse session covers all of them; lengthening auto-suspend only helps when the idle gaps are shorter than the 60 seconds you would otherwise pay on each resume.
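
    A back-of-the-envelope comparison makes the point. The sketch below assumes 100 five-second queries on an X-Small warehouse at $3.00 per credit with a 60-second auto-suspend; the structure matters more than the exact numbers.

    # Cost of 100 short queries: one resume per query vs. one batched session (illustrative).
    CREDIT_PRICE = 3.00
    XS_CREDITS_PER_HOUR = 1

    def billed_cost(billed_seconds: float) -> float:
        return XS_CREDITS_PER_HOUR * (billed_seconds / 3600) * CREDIT_PRICE

    n_queries, query_s, auto_suspend_s = 100, 5, 60

    # Each query triggers its own resume: billing covers the query plus the idle time
    # before suspend, subject to the 60-second minimum per resume.
    per_query_resumes = n_queries * billed_cost(max(60, query_s + auto_suspend_s))

    # All queries batched back-to-back in a single session, then one suspend.
    batched = billed_cost(n_queries * query_s + auto_suspend_s)

    print(f"One resume per query: ${per_query_resumes:.2f}")  # ~$5.42
    print(f"Batched session:      ${batched:.2f}")            # ~$0.47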

    Right-Sizing Warehouses

    Different workloads have different resource requirements. Using a single warehouse for all workloads means you are either over-provisioning (wasting money) or under-provisioning (degrading performance).

    -- Dedicated warehouse for ETL: needs power, runs intermittently
    CREATE WAREHOUSE etl_wh
      WITH WAREHOUSE_SIZE = 'LARGE'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;
    
    -- Dedicated warehouse for BI dashboards: moderate, concurrent users
    CREATE WAREHOUSE bi_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE;
    
    -- Dedicated warehouse for ad-hoc analysis: small, variable usage
    CREATE WAREHOUSE adhoc_wh
      WITH WAREHOUSE_SIZE = 'SMALL'
      AUTO_SUSPEND = 600
      AUTO_RESUME = TRUE;
    

    How to determine the right size:

    1. Start with a Small warehouse for new workloads
    2. Monitor query performance using the QUERY_HISTORY view
    3. If queries are spilling to remote storage or spending excessive time queued, try the next size up
    4. If the warehouse is idle most of the time, try a smaller size
    5. Check the query profile for "bytes spilled to local storage" and "bytes spilled to remote storage" -- spilling is a sign the warehouse is undersized

    -- Check for spilling in recent queries
    SELECT
      query_id,
      warehouse_name,
      warehouse_size,
      execution_time / 1000 as execution_seconds,
      bytes_spilled_to_local_storage,
      bytes_spilled_to_remote_storage,
      query_text
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND bytes_spilled_to_remote_storage > 0
    ORDER BY bytes_spilled_to_remote_storage DESC
    LIMIT 20;
    

    Multi-Cluster Warehouses

    For workloads with variable concurrency (e.g., many BI users querying simultaneously during business hours), multi-cluster warehouses automatically scale out by adding clusters as demand increases.

    CREATE WAREHOUSE bi_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE;
    

    Multi-cluster warehouses prevent query queuing without requiring a permanently large warehouse. The STANDARD scaling policy adds clusters when queries start queuing and removes them after the load decreases.

    Cost implication: A 4-cluster Medium warehouse at peak costs the same as a single X-Large warehouse (both consume 16 credits/hour), but it automatically scales back down to 1 cluster during off-peak hours.

    Resource Monitors

    Resource monitors set credit consumption limits and trigger actions (alerts or warehouse suspension) when thresholds are reached.

    -- Create a resource monitor with a monthly budget
    CREATE RESOURCE MONITOR monthly_etl_budget
      WITH CREDIT_QUOTA = 5000
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS
        ON 75 PERCENT DO NOTIFY
        ON 90 PERCENT DO NOTIFY
        ON 100 PERCENT DO SUSPEND;
    
    -- Apply the monitor to a specific warehouse
    ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_etl_budget;
    
    -- Create an account-level resource monitor
    CREATE RESOURCE MONITOR account_monthly_budget
      WITH CREDIT_QUOTA = 20000
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS
        ON 80 PERCENT DO NOTIFY
        ON 95 PERCENT DO NOTIFY
        ON 100 PERCENT DO SUSPEND_IMMEDIATE;
    

    Resource monitors are essential governance controls. Without them, a runaway query or misconfigured pipeline can burn through your entire budget. The SUSPEND_IMMEDIATE action will cancel running queries, while SUSPEND allows currently running queries to complete before suspending.

    3.3 Query and Storage Optimization

    Beyond warehouse management, Snowflake offers several features for optimizing query performance and storage costs.

    Clustering Keys

    Snowflake automatically organizes data into micro-partitions, but for very large tables (multi-TB), the default organization may not align with your query patterns. Clustering keys tell Snowflake to co-locate related data, improving partition pruning.

    -- Add clustering keys to an existing table
    ALTER TABLE events CLUSTER BY (event_date, country);
    
    -- Verify clustering efficiency
    SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, country)');
    

    Clustering is an ongoing, automatic process that consumes credits in the background. It is most cost-effective for:

    • Tables larger than 1 TB
    • Tables with well-defined query patterns (consistent WHERE clause columns)
    • Tables where the natural data loading order does not match query patterns

    Avoid over-clustering: Adding clustering keys to small tables or tables that are already well-ordered wastes credits. Monitor the automatic clustering costs in your account usage views.

    Micro-Partition Pruning

    Snowflake stores data in micro-partitions (each holding 50-500 MB of uncompressed data). When a query includes a filter, Snowflake can skip entire micro-partitions that do not contain matching data. Effective pruning depends on data co-location, which is influenced by clustering.

    -- Check how effectively recent queries pruned micro-partitions
    SELECT
      query_id,
      query_text,
      partitions_scanned,
      partitions_total,
      ROUND(100 * partitions_scanned / NULLIF(partitions_total, 0), 1) AS pct_partitions_scanned
    FROM snowflake.account_usage.query_history
    WHERE query_text ILIKE '%events%'
      AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY start_time DESC
    LIMIT 10;
    
    -- A low partitions_scanned relative to partitions_total indicates effective pruning
    

    Time Travel and Fail-Safe Cost Implications

    Snowflake's Time Travel feature allows you to access historical data that has been modified or deleted. Fail-safe provides an additional recovery period. Both features consume storage.

    Feature Duration Storage Cost
    Time Travel 0-90 days (configurable) Charged for changed data
    Fail-safe 7 days (non-configurable) Charged for changed data

    For tables with frequent updates or deletes, Time Travel storage can be substantial because Snowflake retains the old versions of changed micro-partitions.

    -- Reduce Time Travel retention for staging tables
    ALTER TABLE staging_events SET DATA_RETENTION_TIME_IN_DAYS = 1;
    
    -- Set to 0 for truly temporary tables (disables Time Travel)
    ALTER TABLE tmp_processing SET DATA_RETENTION_TIME_IN_DAYS = 0;
    
    -- Keep longer retention for critical production tables
    ALTER TABLE production_facts SET DATA_RETENTION_TIME_IN_DAYS = 30;
    

    Recommendation: Use the minimum Time Travel retention that meets your recovery needs. For staging and temporary tables, 0-1 days is usually sufficient. For production tables, 7-14 days provides a reasonable safety net without excessive storage costs.

    Zero-Copy Cloning

    Zero-copy cloning creates an instant copy of a table, schema, or database without physically duplicating the data. The clone shares the underlying micro-partitions with the original, so it consumes no additional storage until data in either the clone or the original is modified.

    -- Clone a production database for development
    CREATE DATABASE dev_analytics CLONE prod_analytics;
    
    -- Clone a specific table for testing
    CREATE TABLE events_test CLONE events;
    

    Zero-copy cloning is a powerful cost optimization tool for:

    • Development environments: Give each developer a full clone of production data without doubling storage costs
    • Testing: Create test datasets that mirror production
    • Backups: Take point-in-time snapshots before risky operations
    • Data sharing: Share data across teams without copying

    The cost only increases as the clone and original diverge (i.e., as modifications create new micro-partitions that are not shared).


    Module 4: AWS Data Service Costs

    AWS offers a wide range of data services, each with its own pricing model. For data engineers, the key services to understand are S3 for storage, and EMR, Glue, and Athena for compute. Making the right choice between these services can save tens of thousands of dollars per month.

    4.1 S3 Storage Economics

    Amazon S3 is the foundation of nearly every AWS data architecture. Understanding its tiered storage model and pricing nuances is essential for cost management.

    Storage Tiers

    S3 offers multiple storage classes optimized for different access patterns:

    Storage Class Monthly Cost (per GB) Min. Storage Duration Retrieval Cost Use Case
    Standard $0.023 None None Frequently accessed data
    Intelligent-Tiering $0.023 + monitoring fee None None Unknown access patterns
    Standard-IA $0.0125 30 days $0.01/GB Infrequent access
    One Zone-IA $0.01 30 days $0.01/GB Non-critical, infrequent
    Glacier Instant $0.004 90 days $0.03/GB Archive, instant access
    Glacier Flexible $0.0036 90 days $0.01/GB (standard) Archive, minutes-hours
    Glacier Deep Archive $0.00099 180 days $0.02/GB Long-term archive

    The cost difference between Standard and Deep Archive is over 23x. For a 100 TB data lake, the annual storage cost difference is:

    Standard:     100,000 GB x $0.023 x 12 = $27,600/year
    Deep Archive: 100,000 GB x $0.00099 x 12 = $1,188/year
    Savings: $26,412/year (96% reduction)
    

    Of course, Deep Archive data takes hours to retrieve, so it is only appropriate for data that is rarely or never accessed.

    Lifecycle Policies

    S3 Lifecycle policies automate the transition of data between storage classes based on age:

    {
      "Rules": [
        {
          "ID": "DataLakeLifecycle",
          "Status": "Enabled",
          "Filter": {
            "Prefix": "raw/"
          },
          "Transitions": [
            {
              "Days": 30,
              "StorageClass": "STANDARD_IA"
            },
            {
              "Days": 90,
              "StorageClass": "GLACIER_IR"
            },
            {
              "Days": 365,
              "StorageClass": "DEEP_ARCHIVE"
            }
          ],
          "Expiration": {
            "Days": 2555
          }
        }
      ]
    }
    

    This policy moves raw data to Standard-IA after 30 days, Glacier Instant Retrieval after 90 days, Deep Archive after 1 year, and deletes it after 7 years. For a data lake ingesting 1 TB/day, this policy can reduce storage costs by 60-80% compared to keeping everything in Standard.

    Terraform implementation:

    resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
      bucket = aws_s3_bucket.data_lake.id
    
      rule {
        id     = "raw-data-lifecycle"
        status = "Enabled"
    
        filter {
          prefix = "raw/"
        }
    
        transition {
          days          = 30
          storage_class = "STANDARD_IA"
        }
    
        transition {
          days          = 90
          storage_class = "GLACIER_IR"
        }
    
        transition {
          days          = 365
          storage_class = "DEEP_ARCHIVE"
        }
    
        expiration {
          days = 2555
        }
      }
    }
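
    To sanity-check the savings claim, the sketch below computes the blended per-GB-month rate implied by this lifecycle schedule, weighting each storage class by the time an object spends in it. Retrieval, request, and minimum-duration charges are ignored, which is why real-world savings usually land in the 60-80% range rather than at this idealized figure.

    # Blended per-GB-month storage rate under the lifecycle policy above (idealized).
    TIERS = [   # (days spent in tier, $ per GB-month)
        (30,   0.023),     # Standard (days 0-30)
        (60,   0.0125),    # Standard-IA (days 30-90)
        (275,  0.004),     # Glacier Instant Retrieval (days 90-365)
        (2190, 0.00099),   # Deep Archive (days 365-2555)
    ]

    total_days = sum(days for days, _ in TIERS)
    blended = sum(days * price for days, price in TIERS) / total_days

    print(f"Blended rate: ${blended:.4f}/GB-month vs ${0.023:.3f}/GB-month in Standard "
          f"({1 - blended / 0.023:.0%} lower)")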
    

    Request Pricing and Data Transfer

    S3 charges not just for storage but also for API requests and data transfer:

    Operation Cost
    PUT, COPY, POST, LIST $0.005 per 1,000 requests
    GET, SELECT $0.0004 per 1,000 requests
    Data Transfer OUT to Internet $0.09/GB (first 10 TB/month)
    Data Transfer to same region Free
    Data Transfer to different region $0.02/GB

    Request costs matter when you have many small files. A pipeline that processes 1 million small files makes 1 million GET requests, costing $0.40 just in request fees. Compacting small files into larger ones (e.g., using Apache Spark's coalesce() or repartition()) reduces both request costs and processing time.

    Data transfer is the hidden cost killer. Moving 10 TB out of AWS to the internet costs $900. Moving data between regions costs $200 per 10 TB. This is why data locality matters -- keep compute and storage in the same region.

    S3 Select

    S3 Select allows you to retrieve a subset of data from an object using SQL expressions, reducing the amount of data transferred and processed:

    import boto3
    
    s3_client = boto3.client('s3')
    
    response = s3_client.select_object_content(
        Bucket='my-data-lake',
        Key='events/2024/01/events.parquet',
        Expression="SELECT user_id, event_type FROM s3object WHERE country = 'US'",
        ExpressionType='SQL',
        InputSerialization={
            'Parquet': {}
        },
        OutputSerialization={
            'JSON': {}
        }
    )
    

    S3 Select is priced at $0.002 per GB scanned and $0.0007 per GB returned. For large Parquet or CSV files where you only need a subset of columns or rows, this can be significantly cheaper than downloading the entire file.

    4.2 Compute: EMR vs. Glue vs. Athena

    AWS offers three primary services for data processing, each with distinct pricing models, operational characteristics, and ideal use cases.

    Amazon EMR (Elastic MapReduce)

    EMR provides managed Hadoop/Spark clusters. You pay for EC2 instances plus an EMR surcharge (typically 15-25% above EC2 pricing).

    Pricing model:

    EMR cost = EC2 instance cost + EMR surcharge
    

    Example: Medium Spark cluster

    3x m5.xlarge instances (1 master + 2 core)
    EC2 cost: 3 x $0.192/hour = $0.576/hour
    EMR surcharge: 3 x $0.048/hour = $0.144/hour
    Total: $0.720/hour
    
    Running 4 hours/day for ETL:
    Monthly cost: $0.720 x 4 x 30 = $86.40/month
    

    With Spot Instances for core/task nodes:

    1x m5.xlarge on-demand (master): $0.192 + $0.048 = $0.240/hour
    2x m5.xlarge spot (core): 2 x ($0.058 + $0.048) = $0.212/hour
    Total: $0.452/hour (37% savings)
    
    Monthly cost: $0.452 x 4 x 30 = $54.24/month
    

    EMR is ideal for complex Spark jobs, custom frameworks, or workloads requiring fine-grained control over the execution environment.

    AWS Glue

    Glue is a fully serverless ETL service. You pay per DPU-hour (Data Processing Unit), where each DPU provides 4 vCPUs and 16 GB of memory.

    Pricing model:

    Glue cost = Number of DPUs x Runtime (hours) x $0.44/DPU-hour
    

    Example:

    10 DPUs running for 30 minutes:
    Cost = 10 x 0.5 x $0.44 = $2.20 per job run
    
    Running daily:
    Monthly cost: $2.20 x 30 = $66/month
    

    Glue 4.0 supports auto-scaling, which means it starts with a minimum number of workers and scales up based on workload. This avoids paying for idle DPUs during the less intensive phases of a job.

    # Glue job configuration with auto-scaling
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    # Auto-scaling is enabled via job parameters, not in the script:
    # --enable-auto-scaling true
    # --number-of-workers 2   (minimum worker count; Glue scales up as needed)
    # --worker-type G.1X
    

    Glue is ideal for teams that want serverless simplicity, built-in data catalog integration, and do not need fine-grained infrastructure control.

    Amazon Athena

    Athena is a serverless query engine that charges per TB scanned, similar to BigQuery's on-demand model.

    Pricing model:

    Athena cost = $5.00 per TB scanned
    

    Example:

    Query scanning 500 GB of Parquet data:
    Cost = 0.5 TB x $5.00 = $2.50
    
    100 such queries per month:
    Monthly cost: $2.50 x 100 = $250/month
    

    Athena benefits enormously from Parquet/ORC formats (columnar compression reduces data scanned), partitioned data (Hive-style partitions in S3), and bucketing (similar to BigQuery clustering).

    Cost Comparison

    Let us compare the three services for a common scenario: processing 500 GB of data daily.

    Service Configuration Daily Cost Monthly Cost
    EMR (On-Demand) 3x m5.xlarge, 2 hours $1.44 $43.20
    EMR (Spot) 1 on-demand + 2 spot, 2 hours $0.90 $27.12
    Glue 10 DPUs, 1 hour $4.40 $132.00
    Athena 500 GB scan per query $2.50 $75.00

    EMR with Spot instances is the cheapest for large batch workloads, but requires operational overhead for cluster management. Athena is the cheapest for occasional, ad-hoc queries on well-partitioned data. Glue sits in the middle, offering serverless simplicity at a moderate cost.
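
    Because the ranking shifts with runtimes and data volumes, it is worth parameterizing the comparison. A minimal sketch using the rates assumed above:

    # Daily cost of the same workload on EMR, Glue, and Athena (illustrative rates).
    def emr_daily(nodes: int, hours: float, node_hour: float = 0.240) -> float:
        """node_hour assumes m5.xlarge on-demand EC2 plus the EMR surcharge."""
        return nodes * hours * node_hour

    def glue_daily(dpus: int, hours: float, dpu_hour: float = 0.44) -> float:
        return dpus * hours * dpu_hour

    def athena_daily(tb_scanned_per_day: float, per_tb: float = 5.00) -> float:
        return tb_scanned_per_day * per_tb

    print(f"EMR (on-demand): ${emr_daily(3, 2):.2f}/day")    # ~$1.44
    print(f"Glue:            ${glue_daily(10, 1):.2f}/day")  # ~$4.40
    print(f"Athena:          ${athena_daily(0.5):.2f}/day")  # ~$2.50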

    When to use each:

    • EMR: Large-scale batch processing, complex Spark pipelines, ML training, workloads requiring custom libraries or specific Spark configurations
    • Glue: Serverless ETL, tight integration with AWS data catalog, teams without dedicated infrastructure expertise
    • Athena: Ad-hoc SQL queries, interactive analysis, data exploration, lightweight scheduled queries

    4.3 Reserved and Spot Strategies

    For predictable workloads, reserved pricing and spot instances can dramatically reduce compute costs.

    Reserved Instances and Savings Plans

    AWS offers two mechanisms for commitment-based discounts:

    Reserved Instances (RIs):

    • Commit to a specific instance type in a specific region
    • 1-year or 3-year terms
    • Up to 72% discount over on-demand
    • Best for always-on workloads (e.g., persistent EMR clusters)

    Savings Plans:

    • Commit to a dollar amount of compute per hour
    • More flexible than RIs (apply across instance families and regions)
    • Compute Savings Plans: Up to 66% discount, apply to EC2, Fargate, and Lambda
    • EC2 Instance Savings Plans: Up to 72% discount, specific to instance family and region

    Example:

    On-demand: 10x m5.xlarge instances, 24/7
    Cost: 10 x $0.192 x 730 hours = $1,401.60/month
    
    3-year All Upfront Reserved:
    Cost: 10 x $0.074 x 730 hours = $540.20/month
    Savings: $861.40/month (61%)
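
    A commitment only pays off if the capacity is actually used. A quick sanity check is the utilization at which the reservation breaks even against on-demand, sketched below with the rates from this example:

    # Break-even utilization for a reservation vs. on-demand (illustrative rates).
    ON_DEMAND_HOURLY = 0.192   # m5.xlarge on-demand
    RESERVED_HOURLY = 0.074    # effective hourly rate, 3-year all-upfront (from above)

    breakeven_utilization = RESERVED_HOURLY / ON_DEMAND_HOURLY
    print(f"Reservation breaks even at about {breakeven_utilization:.0%} utilization")
    # -> ~39%: if the instances would run less than that on-demand, skip the reservation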
    

    Spot Instances for Batch Processing

    Spot instances are ideal for fault-tolerant data engineering workloads. EMR natively supports Spot instances with automatic fallback to on-demand.

    EMR Spot Fleet configuration (Terraform):

    resource "aws_emr_cluster" "spark_etl" {
      name          = "spark-etl-cluster"
      release_label = "emr-7.0.0"
      applications  = ["Spark"]
    
      master_instance_group {
        instance_type  = "m5.xlarge"
        instance_count = 1
      }
    
      core_instance_group {
        instance_type  = "m5.2xlarge"
        instance_count = 2
        bid_price      = "0.10"  # Spot bid price
    
        ebs_config {
          size = 100
          type = "gp3"
        }
      }
    
      ec2_attributes {
        instance_profile = aws_iam_instance_profile.emr.arn
      }
    
      # Required IAM service role for the EMR control plane (defined elsewhere)
      service_role = aws_iam_role.emr_service.arn
    }
    

    Best practices for Spot in data engineering:

    • Always use on-demand for the master node
    • Use Spot for core and task nodes with instance fleet diversification
    • Configure multiple instance types to increase spot capacity availability
    • Use checkpointing in Spark to handle spot interruptions gracefully
    • Set up CloudWatch alarms for spot interruption notices

    Graviton Instances

    AWS Graviton processors (ARM-based) offer 20-40% better price-performance compared to x86 instances for many data workloads.

    Instance vCPUs Memory On-Demand Price Savings vs x86
    m5.xlarge (x86) 4 16 GB $0.192/hour Baseline
    m6g.xlarge (Graviton2) 4 16 GB $0.154/hour 20%
    m7g.xlarge (Graviton3) 4 16 GB $0.163/hour 15%

    EMR, Glue, and most AWS data services support Graviton instances. Spark, Hive, Presto, and Python workloads generally work without modification on Graviton.

    # Use Graviton instances in EMR for better price-performance
    core_instance_group {
      instance_type  = "m6g.2xlarge"  # Graviton2
      instance_count = 4
    }
    

    Module 5: Cost Monitoring and Governance

    Optimization without monitoring is guesswork. This module covers the tools and practices for maintaining ongoing visibility into cloud costs and ensuring that optimization gains are sustained over time.

    5.1 Tagging Strategy

    Resource tagging is the foundation of cost allocation. Without consistent tags, you cannot attribute costs to teams, projects, or environments, making it impossible to know where money is being spent.

    Designing a Tagging Schema

    A well-designed tagging schema should answer four questions:

    1. Who is responsible for this resource? (team, owner)
    2. What is this resource used for? (project, application)
    3. Where is this resource in the lifecycle? (environment)
    4. Why does this resource exist? (cost center, business unit)

    Recommended tags:

    Tag Key Example Values Purpose
    team data-engineering, analytics, ml Cost attribution to teams
    project customer-360, revenue-pipeline Cost attribution to projects
    environment production, staging, development Separate prod from non-prod costs
    cost-center CC-1234 Finance allocation
    owner jane.doe@company.com Accountability
    managed-by terraform, manual, cloudformation Infrastructure management
    data-classification public, internal, confidential Governance

    Enforcing Tags with Policy

    Tags are only useful if they are consistently applied. Use policy-as-code to enforce tagging compliance:

    AWS - Tag Policy (via AWS Organizations):

    {
      "tags": {
        "team": {
          "tag_key": {
            "@@assign": "team"
          },
          "enforced_for": {
            "@@assign": [
              "s3:bucket",
              "emr:cluster",
              "glue:job"
            ]
          }
        },
        "environment": {
          "tag_key": {
            "@@assign": "environment"
          },
          "tag_value": {
            "@@assign": ["production", "staging", "development"]
          }
        }
      }
    }
    

    Terraform - Required tags module:

    variable "required_tags" {
      type = object({
        team        = string
        project     = string
        environment = string
        cost_center = string
        owner       = string
      })
    }
    
    # Apply to all resources
    resource "aws_s3_bucket" "data_lake" {
      bucket = "my-data-lake"
    
      tags = merge(var.required_tags, {
        Name = "data-lake-bucket"
      })
    }
    
    resource "aws_emr_cluster" "etl" {
      name = "etl-cluster"
    
      tags = merge(var.required_tags, {
        Name = "etl-spark-cluster"
      })
    }
    

    GCP - Organization Policy:

    GCP uses labels (its equivalent of tags); label requirements are typically enforced through organization policy constraints or policy-as-code checks in CI. A schematic policy:

    # gcp-label-policy.yaml
    constraint: constraints/compute.requireLabels
    listPolicy:
      allValues: DENY
      allowedValues:
        - team
        - project
        - environment
    

    5.2 Budgets and Alerts

    Budgets and alerts are your early warning system for cost anomalies. They do not prevent overspending by themselves, but they ensure you know about it quickly enough to take action.

    AWS Budgets

    AWS Budgets allows you to set custom cost thresholds and receive notifications when spending approaches or exceeds those thresholds.

    resource "aws_budgets_budget" "data_engineering_monthly" {
      name         = "data-engineering-monthly"
      budget_type  = "COST"
      limit_amount = "10000"
      limit_unit   = "USD"
      time_unit    = "MONTHLY"
    
      cost_filter {
        name   = "TagKeyValue"
        values = ["user:team$data-engineering"]
      }
    
      notification {
        comparison_operator       = "GREATER_THAN"
        threshold                 = 75
        threshold_type            = "PERCENTAGE"
        notification_type         = "ACTUAL"
        subscriber_email_addresses = ["data-eng-leads@company.com"]
      }
    
      notification {
        comparison_operator       = "GREATER_THAN"
        threshold                 = 90
        threshold_type            = "PERCENTAGE"
        notification_type         = "ACTUAL"
        subscriber_email_addresses = ["data-eng-leads@company.com", "finance@company.com"]
      }
    
      notification {
        comparison_operator       = "GREATER_THAN"
        threshold                 = 100
        threshold_type            = "PERCENTAGE"
        notification_type         = "FORECASTED"
        subscriber_email_addresses = ["data-eng-leads@company.com"]
        subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
      }
    }
    

    GCP Budget Alerts

    resource "google_billing_budget" "data_engineering" {
      billing_account = var.billing_account_id
      display_name    = "Data Engineering Monthly Budget"
    
      budget_filter {
        projects = ["projects/${var.project_id}"]
        labels = {
          team = ["data-engineering"]
        }
      }
    
      amount {
        specified_amount {
          currency_code = "USD"
          units         = "10000"
        }
      }
    
      threshold_rules {
        threshold_percent = 0.5
        spend_basis       = "CURRENT_SPEND"
      }
    
      threshold_rules {
        threshold_percent = 0.75
        spend_basis       = "CURRENT_SPEND"
      }
    
      threshold_rules {
        threshold_percent = 0.9
        spend_basis       = "FORECASTED_SPEND"
      }
    
      threshold_rules {
        threshold_percent = 1.0
        spend_basis       = "CURRENT_SPEND"
      }
    
      all_updates_rule {
        monitoring_notification_channels = [
          google_monitoring_notification_channel.email.name
        ]
        pubsub_topic = google_pubsub_topic.budget_alerts.id
      }
    }
    

    Snowflake Budget Controls

    Snowflake provides resource monitors (covered in Module 3) and account-level spending views:

    -- Monitor daily credit consumption
    SELECT
      DATE_TRUNC('day', start_time) AS usage_date,
      warehouse_name,
      SUM(credits_used) AS credits_consumed,
      SUM(credits_used) * 3.00 AS estimated_cost_usd
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1 DESC, credits_consumed DESC;
    
    -- Detect anomalies: days where spending exceeded 2x the average
    WITH daily_spend AS (
      SELECT
        DATE_TRUNC('day', start_time) AS usage_date,
        SUM(credits_used) AS daily_credits
      FROM snowflake.account_usage.warehouse_metering_history
      WHERE start_time >= DATEADD('day', -90, CURRENT_TIMESTAMP())
      GROUP BY 1
    ),
    stats AS (
      SELECT
        AVG(daily_credits) AS avg_daily,
        STDDEV(daily_credits) AS stddev_daily
      FROM daily_spend
    )
    SELECT
      d.usage_date,
      d.daily_credits,
      s.avg_daily,
      ROUND((d.daily_credits - s.avg_daily) / NULLIF(s.stddev_daily, 0), 2) AS z_score
    FROM daily_spend d
    CROSS JOIN stats s
    WHERE d.daily_credits > s.avg_daily * 2
    ORDER BY d.usage_date DESC;
    

    Automated Responses

    Beyond alerts, you can automate responses to cost anomalies using cloud-native event systems:

    AWS: Lambda triggered by budget alert via SNS:

    import boto3
    import json
    
    def lambda_handler(event, context):
        """
        Triggered when data engineering budget exceeds threshold.
        Suspends non-critical EMR clusters and notifies the team.
        """
        emr_client = boto3.client('emr')
        sns_client = boto3.client('sns')
    
        # List running clusters tagged as non-critical
        clusters = emr_client.list_clusters(
            ClusterStates=['RUNNING', 'WAITING']
        )
    
        terminated = []
        for cluster in clusters['Clusters']:
            tags = emr_client.describe_cluster(
                ClusterId=cluster['Id']
            )['Cluster']['Tags']
    
            tag_dict = {t['Key']: t['Value'] for t in tags}
    
            if tag_dict.get('environment') != 'production':
                emr_client.terminate_job_flows(
                    JobFlowIds=[cluster['Id']]
                )
                terminated.append(cluster['Name'])
    
        # Notify the team
        sns_client.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789:data-eng-alerts',
            Subject='Budget Alert: Non-critical EMR clusters terminated',
            Message=json.dumps({
                'action': 'auto-terminate',
                'clusters': terminated,
                'reason': 'Monthly budget threshold exceeded'
            })
        )
    
        return {'terminated_clusters': terminated}
    

    5.3 Cost Dashboards and Reporting

    Cost dashboards transform raw billing data into actionable insights. A good dashboard answers three questions at a glance: How much are we spending? Where is the money going? Is spending trending up or down?

    AWS Cost Explorer

    AWS Cost Explorer provides built-in visualization of your AWS spending. For data engineering teams, the most useful views are:

    • Service breakdown: See which AWS services (S3, EMR, Glue, Athena) consume the most budget
    • Tag-based filtering: Filter by team, project, or environment tags
    • Daily granularity: Spot anomalies and correlate with pipeline schedules
    • Forecasting: AWS provides 12-month cost forecasts based on historical trends

    For custom dashboards, use the AWS Cost and Usage Report (CUR) exported to S3 and queried via Athena:

    -- Query the Cost and Usage Report in Athena
    SELECT
      line_item_product_code AS service,
      resource_tags_user_team AS team,
      resource_tags_user_project AS project,
      DATE_FORMAT(line_item_usage_start_date, '%Y-%m') AS month,
      SUM(line_item_unblended_cost) AS total_cost
    FROM cost_and_usage_report
    WHERE line_item_usage_start_date >= DATE '2024-01-01'
      AND resource_tags_user_team = 'data-engineering'
    GROUP BY 1, 2, 3, 4
    ORDER BY total_cost DESC;
    

    GCP Billing Reports

    GCP Billing provides similar capabilities with BigQuery export:

    -- Query GCP billing export in BigQuery
    -- Credits are a repeated field; summing them in a correlated subquery avoids the
    -- row fan-out (and inflated SUM(cost)) that a second UNNEST join would cause.
    -- Credit amounts are negative, so adding them to cost yields the net cost.
    SELECT
      service.description AS service_name,
      labels.value AS team,
      FORMAT_TIMESTAMP('%Y-%m', usage_start_time) AS month,
      SUM(cost) AS total_cost,
      SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0)) AS total_credits,
      SUM(cost) + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0)) AS net_cost
    FROM `billing_dataset.gcp_billing_export`
    LEFT JOIN UNNEST(labels) AS labels ON labels.key = 'team'
    WHERE usage_start_time >= TIMESTAMP('2024-01-01')
    GROUP BY 1, 2, 3
    ORDER BY net_cost DESC;
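
    The billing export shows spend by service and label, but it cannot tell you which users or queries drove it. A sketch of per-user attribution using the INFORMATION_SCHEMA jobs view, assuming the US multi-region and the $6.25/TB on-demand rate used throughout this course:

    -- Attribute on-demand BigQuery spend to users over the last 30 days
    SELECT
      user_email,
      COUNT(*) AS query_count,
      ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tb_billed,
      ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
      AND total_bytes_billed IS NOT NULL
    GROUP BY 1
    ORDER BY estimated_cost_usd DESC;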
    

    Snowflake ACCOUNT_USAGE Views

    Snowflake provides comprehensive usage tracking through the SNOWFLAKE.ACCOUNT_USAGE schema:

    -- Weekly cost report by warehouse
    SELECT
      DATE_TRUNC('week', start_time) AS week,
      warehouse_name,
      SUM(credits_used) AS total_credits,
      ROUND(SUM(credits_used) * 3.00, 2) AS estimated_cost_usd,
      COUNT(DISTINCT DATE_TRUNC('day', start_time)) AS active_days,
      ROUND(SUM(credits_used) / COUNT(DISTINCT DATE_TRUNC('day', start_time)), 2) AS avg_credits_per_day
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('week', -12, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1 DESC, total_credits DESC;
    
    -- Storage cost breakdown
    SELECT
      DATE_TRUNC('month', usage_date) AS month,
      ROUND(AVG(storage_bytes) / POW(1024, 4), 2) AS avg_storage_tb,
      ROUND(AVG(stage_bytes) / POW(1024, 4), 2) AS avg_stage_tb,
      ROUND(AVG(failsafe_bytes) / POW(1024, 4), 2) AS avg_failsafe_tb,
      ROUND((AVG(storage_bytes) + AVG(stage_bytes) + AVG(failsafe_bytes))
        / POW(1024, 4) * 23, 2) AS estimated_monthly_cost_usd  -- assumes ~$23 per TB/month storage pricing
    FROM snowflake.account_usage.storage_usage
    WHERE usage_date >= DATEADD('month', -6, CURRENT_DATE())
    GROUP BY 1
    ORDER BY 1 DESC;
    

    Weekly Cost Review Cadence

    Establishing a regular review cadence ensures cost awareness stays embedded in team culture:

    Weekly (15-minute standup):

    • Review week-over-week cost trend (see the query sketch after these lists)
    • Identify any anomalies or spikes
    • Check if any budget thresholds were triggered
    • Assign follow-up investigation for anomalies

    Monthly (30-minute review):

    • Deep dive into cost by service, team, and project
    • Compare actual vs. budgeted spend
    • Review optimization opportunities identified
    • Track savings from implemented optimizations
    • Plan next month's optimization initiatives

    Quarterly (1-hour planning):

    • Review reserved instance / savings plan coverage
    • Evaluate architecture decisions for cost efficiency
    • Update budgets and forecasts
    • Align with finance on cost allocation and chargeback
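
    For the weekly review, the week-over-week trend can come straight from the metering view used earlier. A Snowflake sketch (the eight-week window is arbitrary):

    -- Week-over-week credit trend for the weekly cost standup
    WITH weekly AS (
      SELECT
        DATE_TRUNC('week', start_time) AS week,
        SUM(credits_used) AS credits
      FROM snowflake.account_usage.warehouse_metering_history
      WHERE start_time >= DATEADD('week', -8, CURRENT_TIMESTAMP())
      GROUP BY 1
    )
    SELECT
      week,
      credits,
      LAG(credits) OVER (ORDER BY week) AS prev_week_credits,
      ROUND(100 * (credits - LAG(credits) OVER (ORDER BY week))
        / NULLIF(LAG(credits) OVER (ORDER BY week), 0), 1) AS pct_change
    FROM weekly
    ORDER BY week DESC;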

    Module 6: Real-World Cost Patterns

    This module applies the concepts from the previous modules to common data engineering patterns. These are the decisions you will face repeatedly in your career, and understanding their cost implications will help you make better choices.

    6.1 Incremental Processing vs. Full Refresh

    One of the most consequential architectural decisions in data engineering is whether to process data incrementally or do a full refresh. This decision has profound cost implications that scale with data volume.

    Full Refresh

    A full refresh drops and recreates a table by processing all source data from scratch.

    Cost characteristics:

    • Processes the entire dataset every run
    • Cost grows linearly with data volume
    • Simple to implement and reason about
    • No state management required
    • Guaranteed correctness (no edge cases with late-arriving data)

    Example (BigQuery on-demand):

    -- Full refresh: scan the entire source table every time
    CREATE OR REPLACE TABLE `project.dataset.daily_metrics` AS
    SELECT
      DATE(event_timestamp) AS metric_date,
      event_type,
      COUNT(*) AS event_count,
      COUNT(DISTINCT user_id) AS unique_users
    FROM `project.dataset.events`  -- 10 TB table
    GROUP BY 1, 2;
    
    -- Cost per run: 10 TB x $6.25/TB = $62.50
    -- Daily cost: $62.50
    -- Monthly cost: $1,875
    -- Annual cost: $22,500
    

    Incremental Processing

    Incremental processing only handles new or changed data since the last run.

    Cost characteristics:

    • Processes only new/changed data
    • Cost is proportional to the data change rate, not total volume
    • More complex to implement (needs watermarks, change detection)
    • Requires state management (tracking what has been processed)
    • Must handle late-arriving data and out-of-order events

    Example (BigQuery on-demand):

    -- Incremental: only scan today's partition
    MERGE INTO `project.dataset.daily_metrics` AS target
    USING (
      SELECT
        DATE(event_timestamp) AS metric_date,
        event_type,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
      FROM `project.dataset.events`
      WHERE event_date = CURRENT_DATE()  -- Only today's partition (~30 GB)
      GROUP BY 1, 2
    ) AS source
    ON target.metric_date = source.metric_date
      AND target.event_type = source.event_type
    WHEN MATCHED THEN
      UPDATE SET
        event_count = source.event_count,
        unique_users = source.unique_users
    WHEN NOT MATCHED THEN
      INSERT (metric_date, event_type, event_count, unique_users)
      VALUES (source.metric_date, source.event_type, source.event_count, source.unique_users);
    
    -- Cost per run: 30 GB x $6.25/TB = $0.19
    -- Daily cost: $0.19
    -- Monthly cost: $5.70
    -- Annual cost: $68.40
    

    The incremental approach costs $22,431.60 less per year for this single pipeline. Across an organization with hundreds of pipelines, the savings are transformational.
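
    The main operational risk with incremental processing is late-arriving data. A common mitigation is a small lookback window: reprocess the last few daily partitions on every run instead of only today's, trading a little extra scanning for correctness. A sketch of the same MERGE source with a hypothetical three-day lookback:

    -- Reprocess the last few partitions to absorb late-arriving events (~120 GB instead of ~30 GB)
    SELECT
      DATE(event_timestamp) AS metric_date,
      event_type,
      COUNT(*) AS event_count,
      COUNT(DISTINCT user_id) AS unique_users
    FROM `project.dataset.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
    GROUP BY 1, 2;
    
    -- Cost per run: ~120 GB x $6.25/TB = ~$0.73, still a small fraction of the $62.50 full refresh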

    dbt Incremental Models

    dbt (data build tool) provides a framework for incremental processing that handles much of the complexity:

    -- models/daily_metrics.sql
    {{
      config(
        materialized='incremental',
        unique_key=['metric_date', 'event_type'],
        partition_by={
          "field": "metric_date",
          "data_type": "date",
          "granularity": "day"
        },
        cluster_by=['event_type']
      )
    }}
    
    SELECT
      DATE(event_timestamp) AS metric_date,
      event_type,
      COUNT(*) AS event_count,
      COUNT(DISTINCT user_id) AS unique_users
    FROM {{ source('raw', 'events') }}
    
    {% if is_incremental() %}
      WHERE event_date > (SELECT MAX(metric_date) FROM {{ this }})
    {% endif %}
    
    GROUP BY 1, 2
    

    dbt handles the merge logic, partition management, and the incremental/full-refresh toggle automatically. Running dbt run processes incrementally; running dbt run --full-refresh rebuilds from scratch.

    When to Use Each Approach

    • Data volume: full refresh for small tables (< 100 GB); incremental for large ones (> 100 GB)
    • Update frequency: full refresh when most of the dataset changes each run; incremental when only a small % changes
    • Complexity tolerance: full refresh when you want simplicity; incremental when you can manage watermarks and merges
    • Late-arriving data: full refresh handles it naturally; incremental needs explicit handling
    • Cost sensitivity: full refresh when the budget is flexible; incremental when cost is a constraint
    • Source data: full refresh when there is no reliable timestamp or watermark; incremental when change tracking is reliable

    6.2 Materialized Views vs. Live Queries

    The decision to precompute results (materialize) vs. compute on-the-fly (live query) is a classic space-time trade-off with direct cost implications.

    Live Queries

    Live queries compute results from base tables every time they are executed.

    Cost characteristics:

    • No additional storage cost
    • Full compute cost on every execution
    • Always returns the latest data
    • Cost scales with query frequency

    Example:

    -- This query scans 5 TB every time it runs
    SELECT
      DATE(event_timestamp) AS day,
      country,
      COUNT(DISTINCT user_id) AS dau
    FROM `project.dataset.events`
    WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
    GROUP BY 1, 2;
    
    -- If run 50 times/day by different dashboard users:
    -- Daily cost: 50 x 5 TB x $6.25/TB = $1,562.50
    -- Monthly cost: $46,875
    

    Materialized Views / Pre-Aggregated Tables

    Materializing results stores precomputed aggregations that are queried instead of base tables.

    Cost characteristics:

    • Additional storage cost for the materialized data
    • Compute cost for refreshing the materialization
    • Much lower cost per query (smaller data to scan)
    • Data freshness depends on refresh frequency
    • Cost scales with refresh frequency, not query frequency

    Example:

    -- Materialized view refreshed automatically and incrementally (BigQuery)
    -- Incremental materialized views support only a limited set of aggregates;
    -- exact COUNT(DISTINCT) is generally not among them, so the approximate version is used here.
    CREATE MATERIALIZED VIEW `project.dataset.daily_country_dau`
    AS
    SELECT
      DATE(event_timestamp) AS day,
      country,
      APPROX_COUNT_DISTINCT(user_id) AS dau
    FROM `project.dataset.events`
    GROUP BY 1, 2;
    
    -- Storage cost: ~1 GB of aggregated data = $0.02/month
    -- Refresh cost: automatic, incremental
    -- Query cost per execution: 1 GB x $6.25/TB = $0.006
    
    -- If queried 50 times/day:
    -- Daily cost: 50 x $0.006 = $0.30
    -- Monthly cost: $9.00 + refresh costs
    

    The materialized approach reduces costs from $46,875/month to under $100/month for this pattern -- a 99.8% reduction.

    Decision Framework

    Materialize when:

    • The same expensive query is run frequently (more than 5-10 times per day)
    • The base data is much larger than the aggregated result
    • Slight data staleness is acceptable
    • Multiple users or dashboards consume the same aggregation

    Query live when:

    • The query is run rarely (less than once per day)
    • The base data is small
    • Real-time freshness is required
    • The query pattern is ad-hoc and unpredictable

    6.3 Development Environment Cost Control

    Development and testing environments are notorious for wasting money. They often mirror production in scale but receive a fraction of the traffic, resulting in massively over-provisioned resources. Reining them in is one of the quickest wins available: teams commonly trim 30-50% of their non-production spend this way.

    Using Smaller Datasets

    Production-scale data is rarely needed for development. Creating representative subsets reduces both compute and storage costs:

    -- BigQuery: Create a 1% sample of production data for development
    CREATE TABLE `dev_project.dataset.events_sample` AS
    SELECT *
    FROM `prod_project.dataset.events`
    WHERE MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 100) = 0  -- Deterministic 1% sample
      AND event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);  -- Last 30 days only
    
    -- Snowflake: Use SAMPLE for quick development datasets
    CREATE TABLE dev_db.schema.events_sample AS
    SELECT *
    FROM prod_db.schema.events
    SAMPLE (1);  -- 1% sample
    

    Snowflake Zero-Copy Clones for Development

    As covered in Module 3, zero-copy clones provide full-fidelity development environments at minimal cost:

    -- Create a dev database that shares production data through a zero-copy clone
    CREATE DATABASE dev_analytics CLONE prod_analytics;
    
    -- Each developer gets their own clone
    CREATE SCHEMA dev_analytics.developer_jane CLONE prod_analytics.public;
    

    The clone shares storage with production until data is modified, so the incremental cost is minimal. Combined with a small development warehouse (X-Small or Small), this provides a powerful and cost-effective development experience.
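
    A minimal sketch of such a development warehouse, sized small and suspending aggressively so it only bills while queries actually run (the name dev_wh and the one-minute timeout are illustrative):

    -- Small, aggressively auto-suspending warehouse for development use
    CREATE WAREHOUSE IF NOT EXISTS dev_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;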

    BigQuery Sandboxes and Development Projects

    Use separate GCP projects for development with restricted quotas:

    # Terraform: Create a dedicated dev project so spend can be isolated and capped
    resource "google_project" "data_eng_dev" {
      name       = "data-eng-dev"
      project_id = "data-eng-dev-12345"
      org_id     = var.org_id
    
      labels = {
        team        = "data-engineering"
        environment = "development"
      }
    }
    
    # Enable BigQuery in the dev project; custom query-usage quotas can then be applied to it
    resource "google_project_service" "bigquery" {
      project = google_project.data_eng_dev.project_id
      service = "bigquery.googleapis.com"
    }
    

    Configure BigQuery cost controls to keep developers from accidentally running expensive queries. Project-level custom quotas can cap total daily query usage, and the per-query maximum bytes billed setting makes a query fail before executing, at no cost, if it would scan more than the limit. The setting is available on query jobs via the bq CLI, the API, and the client libraries:

    # Cap a single query at 10 GB billed; the job fails before running if it would exceed the limit
    bq query --use_legacy_sql=false --maximum_bytes_billed=10737418240 \
      'SELECT * FROM `project.dataset.large_table`'
    

    Terraform Workspace Isolation

    Use Terraform workspaces (or an environment variable, as in the example below) to manage separate infrastructure for each environment, with environment-specific sizing:

    # variables.tf
    variable "environment" {
      type = string
    }
    
    locals {
      env_config = {
        production = {
          emr_instance_type  = "m5.2xlarge"
          emr_instance_count = 10
          s3_lifecycle_days  = 365
          glue_workers       = 20
        }
        staging = {
          emr_instance_type  = "m5.xlarge"
          emr_instance_count = 3
          s3_lifecycle_days  = 30
          glue_workers       = 5
        }
        development = {
          emr_instance_type  = "m5.large"
          emr_instance_count = 2
          s3_lifecycle_days  = 7
          glue_workers       = 2
        }
      }
    
      config = local.env_config[var.environment]
    }
    
    resource "aws_emr_cluster" "etl" {
      name = "${var.environment}-etl-cluster"
    
      master_instance_group {
        instance_type  = local.config.emr_instance_type
        instance_count = 1
      }
    
      core_instance_group {
        instance_type  = local.config.emr_instance_type
        instance_count = local.config.emr_instance_count
      }
    
      tags = {
        environment = var.environment
        team        = "data-engineering"
      }
    }
    

    Scheduling Dev Resources to Shut Down Overnight

    Development resources sitting idle overnight and on weekends waste significant money. Automate their shutdown:

    AWS Lambda for scheduled EMR termination:

    import boto3
    from datetime import datetime
    
    def lambda_handler(event, context):
        """
        Runs daily at 8 PM via CloudWatch Events.
        Terminates all non-production EMR clusters.
        """
        emr = boto3.client('emr')
        clusters = emr.list_clusters(ClusterStates=['RUNNING', 'WAITING'])
    
        for cluster in clusters['Clusters']:
            details = emr.describe_cluster(ClusterId=cluster['Id'])
            tags = {t['Key']: t['Value'] for t in details['Cluster']['Tags']}
    
            if tags.get('environment') in ('development', 'staging'):
                emr.terminate_job_flows(JobFlowIds=[cluster['Id']])
                print(f"Terminated {cluster['Name']} ({cluster['Id']})")
    

    Snowflake warehouse scheduling:

    -- Create a task to suspend dev warehouses at 7 PM
    CREATE TASK suspend_dev_warehouses
      WAREHOUSE = admin_wh
      SCHEDULE = 'USING CRON 0 19 * * MON-FRI America/New_York'
    AS
      ALTER WAREHOUSE dev_wh SUSPEND;
    
    -- Create a task to resume dev warehouses at 8 AM
    CREATE TASK resume_dev_warehouses
      WAREHOUSE = admin_wh
      SCHEDULE = 'USING CRON 0 8 * * MON-FRI America/New_York'
    AS
      ALTER WAREHOUSE dev_wh RESUME;
    
    ALTER TASK suspend_dev_warehouses RESUME;
    ALTER TASK resume_dev_warehouses RESUME;
    

    Assuming a Medium dev warehouse (4 credits/hour at roughly $4 per credit, about $16/hour):

    • Running 24/7: $16 x 730 hours = $11,680/month
    • Running business hours only (10 hours/day, weekdays): $16 x 10 x 22 = $3,520/month
    • Savings: $8,160/month (70% reduction)
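
    To see how much a dev warehouse actually burns outside business hours, and to confirm the schedule is working once it is in place, a sketch against the same metering view (assuming the warehouse is named DEV_WH and roughly $4 per credit as above; adjust the hour window for your time zone):

    -- Credits burned by DEV_WH inside vs. outside business hours over the last 30 days
    SELECT
      CASE
        WHEN DAYNAME(start_time) IN ('Sat', 'Sun')
          OR HOUR(start_time) NOT BETWEEN 8 AND 18 THEN 'off_hours'
        ELSE 'business_hours'
      END AS period,
      SUM(credits_used) AS credits,
      ROUND(SUM(credits_used) * 4.00, 2) AS estimated_cost_usd
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE warehouse_name = 'DEV_WH'
      AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1;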

    Next Steps

    After completing this course on cloud cost optimization, you will be equipped to:

    1. Audit your current cloud spend - Use the INFORMATION_SCHEMA queries, Cost Explorer, and ACCOUNT_USAGE views to understand where money is going
    2. Implement quick wins - Partition pruning, column selection, auto-suspend, and lifecycle policies can deliver savings in days
    3. Build a cost monitoring practice - Set up budgets, alerts, and dashboards for ongoing visibility
    4. Advocate for cost-aware architecture - Bring cost analysis into design reviews and architecture decisions
    5. Explore Data Modeling - Understand how data model design impacts query costs
    6. Review Fundamentals - Revisit batch vs. streaming trade-offs with a cost lens

    Recommended Resources:

    1. Cloud FinOps by J.R. Storment and Mike Fuller

      • The definitive guide to the FinOps framework
      • Covers organizational practices, not just technical optimization
      • Essential reading for anyone managing cloud budgets
    2. FinOps Foundation (finops.org)

      • Free resources, frameworks, and community
      • FinOps Certified Practitioner certification
      • Case studies from major organizations
    3. Provider Documentation

    4. dbt Documentation on Incremental Models

    Remember: cloud cost optimization is not a one-time project -- it is an ongoing practice. Data volumes grow, pricing models change, and new services emerge. The data engineers who treat cost as a continuous concern, not an afterthought, deliver the most value to their organizations.
