Cloud Cost Optimization for Data Engineering

    Master cloud cost management strategies for BigQuery, Snowflake, and AWS data services. Learn pricing models, optimization techniques, and monitoring best practices.

    Level: Intermediate
    Tools:
    BigQuery, Snowflake, AWS S3, AWS EMR, AWS Glue, Terraform, Cloud Console

    Skills You'll Learn:

    Cloud cost modeling, Query optimization for cost, Storage tier management, Budget monitoring and alerting, Resource right-sizing, FinOps fundamentals

    Module 1: FinOps Fundamentals for Data Engineers

    This module introduces the financial dimension of cloud data engineering. As organizations migrate workloads to the cloud and data volumes grow exponentially, cost management has become a core competency for data engineers -- not just a concern for finance teams. Understanding how your architectural decisions translate into cloud bills is what separates a senior data engineer from a junior one.

    1.1 Why Cost Matters in Data Engineering

    Data engineering is consistently one of the largest contributors to cloud spending in modern organizations. Unlike a web application that serves relatively predictable traffic, data pipelines routinely process terabytes or petabytes of data, and small inefficiencies can compound into enormous costs at scale. A single poorly written query in BigQuery scanning a multi-petabyte table can cost hundreds or even thousands of dollars in a matter of minutes. A Snowflake warehouse left running over a weekend can burn through thousands of credits with no one watching.

    The exponential growth of data makes this problem worse every quarter. IDC estimates that global data creation will exceed 180 zettabytes by 2025, and most organizations see their data volumes double every two to three years. As the data grows, so do the compute resources required to process it, the storage costs to retain it, and the network costs to move it between services and regions.

    Data engineers sit at the intersection of all three major cloud cost categories:

    1. Compute costs: Running Spark jobs, executing warehouse queries, powering ETL pipelines. Compute is typically the largest line item in a data engineering budget, and it is directly influenced by the efficiency of your code and architecture.

    2. Storage costs: Maintaining raw data lakes, curated data warehouses, staging areas, backups, and historical snapshots. While storage is cheaper per unit than compute, it accumulates relentlessly and is often overlooked because it grows silently.

    3. Network egress costs: Moving data between regions, across clouds, or out to external consumers. Egress charges are the hidden cost that surprises many teams, particularly when architectures span multiple cloud providers or regions.

    The data engineer's role in cost management is unique because technical decisions made during pipeline design have a direct and measurable impact on the cloud bill. Choosing the right partitioning strategy, selecting appropriate warehouse sizes, deciding between materialized and live views, and designing incremental processing patterns are all engineering decisions with immediate financial consequences.

    Organizations increasingly expect data engineers to think about cost as a first-class requirement alongside correctness, performance, and reliability. This is not about being cheap -- it is about being efficient, ensuring that every dollar spent on cloud infrastructure delivers maximum value to the business.

    1.2 Understanding Cloud Pricing Models

    Before you can optimize cloud costs, you need to understand how cloud providers charge for their services. While the details vary between providers, there are several universal pricing concepts that apply across all major platforms.

    On-Demand Pricing

    On-demand pricing is the default and most flexible model. You pay for exactly what you use, with no upfront commitment. Rates are typically billed per second, per minute, or per hour for compute, and per GB-month for storage.

    Advantages:

    • No upfront commitment or risk
    • Scale up and down freely
    • Simple to understand and budget for small workloads
    • Ideal for variable or unpredictable workloads

    Disadvantages:

    • Highest per-unit cost
    • Costs can spike unpredictably with usage growth
    • No incentive to plan capacity

    Reserved / Committed Use Pricing

    All major cloud providers offer significant discounts (typically 30-70%) in exchange for committing to a certain level of usage over one to three years. AWS calls these Reserved Instances and Savings Plans, GCP offers Committed Use Discounts, and Snowflake offers pre-purchased capacity.

    Advantages:

    • Substantial cost savings for predictable workloads
    • Predictable monthly costs for budgeting
    • Can be combined with on-demand for burst capacity

    Disadvantages:

    • Requires upfront commitment (financial risk if needs change)
    • Requires accurate capacity planning
    • Unused reservations are wasted money

    Spot / Preemptible Pricing

    Cloud providers offer deeply discounted compute instances (60-90% off on-demand pricing) that can be reclaimed with short notice. AWS calls these Spot Instances, GCP calls them Preemptible VMs (or Spot VMs), and Azure calls them Spot VMs.

    Advantages:

    • Dramatic cost savings for fault-tolerant workloads
    • Ideal for batch processing, ETL jobs, and data pipelines
    • Can be combined with on-demand instances for resilience

    Disadvantages:

    • Instances can be terminated with as little as two minutes' notice
    • Not suitable for long-running, stateful, or latency-sensitive workloads
    • Requires application-level fault tolerance

    The Three Pillars of Cloud Cost

    When analyzing any cloud bill, costs break down into three fundamental categories:

    Category Description Typical % of Data Eng. Bill
    Compute Processing power (CPUs, GPUs, memory) 50-70%
    Storage Data at rest (object storage, block storage, databases) 15-30%
    Network Data transfer (egress, cross-region, cross-AZ) 5-15%

    Understanding this breakdown for your specific workloads is the first step toward optimization. Many teams discover that one category dominates their bill, and focusing optimization efforts there yields the greatest returns.

    Total Cost of Ownership (TCO)

    TCO extends beyond the direct cloud bill to include the full cost of running data infrastructure:

    • Direct cloud costs: Compute, storage, network as described above
    • Engineering time: Hours spent building, maintaining, and debugging pipelines
    • Operational overhead: Monitoring, incident response, on-call rotations
    • Opportunity cost: What could your team build if they were not managing infrastructure?
    • Data quality costs: The business impact of late, incorrect, or missing data

    A more expensive managed service that reduces engineering time can have a lower TCO than a cheaper self-managed solution. For example, using AWS Glue at a higher per-DPU cost might be cheaper in TCO than running a self-managed Spark cluster on EMR, once you account for the operational overhead of cluster management.
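
    To make this concrete, here is a minimal TCO sketch in Python. All of the inputs -- compute rates, maintenance hours, and the loaded cost of engineering time -- are illustrative assumptions, not benchmarks; substitute your own figures.

    # Rough TCO comparison: managed Glue job vs. self-managed Spark on EMR.
    # Every input below is an illustrative assumption.
    GLUE_DPU_HOUR = 0.44       # $ per DPU-hour
    EMR_NODE_HOUR = 0.24       # $ per node-hour (EC2 + EMR surcharge, assumed)
    ENGINEER_HOURLY = 75.0     # fully loaded $ per hour of engineering time (assumed)

    def monthly_tco(compute_cost: float, ops_hours_per_month: float) -> float:
        """Direct cloud spend plus the cost of operational engineering time."""
        return compute_cost + ops_hours_per_month * ENGINEER_HOURLY

    # Glue: 10 DPUs x 1 hour/day x 30 days, ~2 hours/month of maintenance
    glue = monthly_tco(10 * 1 * 30 * GLUE_DPU_HOUR, ops_hours_per_month=2)

    # EMR: 3 nodes x 2 hours/day x 30 days, ~20 hours/month of cluster upkeep
    emr = monthly_tco(3 * 2 * 30 * EMR_NODE_HOUR, ops_hours_per_month=20)

    print(f"Glue TCO: ${glue:,.2f}/month vs. EMR TCO: ${emr:,.2f}/month")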

    1.3 The FinOps Framework

    FinOps (Cloud Financial Operations) is an evolving discipline and cultural practice that brings financial accountability to the variable-spend model of cloud computing. It provides a structured approach for organizations to manage cloud costs without sacrificing speed or innovation.

    The Three Phases of FinOps

    Phase 1: Inform

    The Inform phase is about creating visibility into cloud spending. You cannot optimize what you cannot see. Key activities include:

    • Establishing cost allocation through tagging and account structure
    • Building dashboards that show cost trends and anomalies
    • Attributing costs to teams, projects, and business units
    • Understanding the relationship between technical metrics and financial metrics
    • Creating a shared vocabulary for discussing cloud costs

    Phase 2: Optimize

    The Optimize phase focuses on reducing waste and improving efficiency. This is where data engineers have the most direct impact:

    • Right-sizing compute resources to match actual workload requirements
    • Eliminating idle or unused resources
    • Implementing reserved pricing for predictable baseline workloads
    • Optimizing queries and pipelines for cost efficiency
    • Choosing the most cost-effective services and architectures

    Phase 3: Operate

    The Operate phase embeds cost awareness into ongoing operations:

    • Setting budgets and alerts for anomaly detection
    • Establishing governance policies for resource provisioning
    • Integrating cost considerations into design reviews and architecture decisions
    • Automating cost optimization actions (e.g., auto-scaling, scheduled shutdowns)
    • Conducting regular cost reviews and retrospectives

    How Data Teams Fit Into FinOps

    Data engineering teams are uniquely positioned in the FinOps model because they are both major consumers of cloud resources and possessors of the technical knowledge needed to optimize those costs. Effective FinOps for data teams requires:

    1. Shared accountability: Every data engineer should understand the cost implications of their design decisions. Cost is not solely the domain of the finance team or a centralized FinOps function.

    2. Cost as a metric: Just as data teams track pipeline latency, data quality, and uptime, they should track cost per pipeline run, cost per TB processed, and cost trends over time.

    3. Architectural decision records: When choosing between approaches (e.g., full refresh vs. incremental, materialized view vs. live query), document the cost implications alongside performance and maintainability.

    4. Collaboration with finance: Data engineers should participate in cloud billing reviews and help translate technical metrics into business context. A line item showing "$15,000 for BigQuery compute" is more actionable when it can be attributed to specific pipelines and business use cases.


    Module 2: BigQuery Cost Optimization

    Google BigQuery is a fully managed, serverless data warehouse that excels at analyzing large datasets quickly. Its serverless nature means you do not manage infrastructure, but it also means you must understand its pricing model deeply to avoid costly surprises. BigQuery's pricing is fundamentally different from traditional databases, and optimizing for cost requires a different mindset than optimizing for performance alone.

    2.1 BigQuery Pricing Models

    BigQuery offers two primary pricing models for compute, and understanding when to use each is one of the most impactful cost decisions you will make.

    On-Demand Pricing (Per TB Scanned)

    With on-demand pricing, you pay based on the amount of data your queries scan. As of the current pricing, this costs $6.25 per TB scanned in most regions. The first 1 TB of data scanned per month is free.

    Key characteristics:

    • You pay only when you run queries
    • Cost is proportional to the data scanned, not the complexity of the query
    • No cost when queries are not running
    • Ideal for ad-hoc analysis and variable workloads

    Example cost calculation:

    Suppose you have a 10 TB table and run a query that scans the entire table:

    Query cost = 10 TB x $6.25/TB = $62.50 per query execution
    

    If this query runs daily as part of a pipeline:

    Monthly cost = $62.50 x 30 days = $1,875/month
    

    Now suppose you add a partition filter that reduces the scan to 500 GB:

    Query cost = 0.5 TB x $6.25/TB = $3.13 per query execution
    Monthly cost = $3.13 x 30 days = $93.75/month
    

    That single optimization saves $1,781.25 per month -- a 95% reduction.

    BigQuery Editions (Slot-Based Pricing)

    BigQuery Editions use a capacity-based model where you purchase "slots" -- units of computational capacity. You pay for slot-hours regardless of how much data your queries scan.

    BigQuery offers three editions:

    Edition Use Case Baseline Pricing (per slot-hour)
    Standard Development, ad-hoc ~$0.04
    Enterprise Production workloads ~$0.06
    Enterprise Plus Mission-critical, multi-region ~$0.10

    Slots can be purchased as autoscaling (pay-as-you-go, billed per second) or baseline + autoscaling (committed baseline with burst capacity).

    Example cost calculation:

    Suppose your team consistently uses 500 slots for 8 hours per day:

    Daily cost (autoscaling, Enterprise) = 500 slots x 8 hours x $0.06 = $240/day
    Monthly cost = $240 x 30 = $7,200/month
    

    With a committed baseline of 500 slots (annual commitment, ~40% discount):

    Monthly cost = 500 slots x 730 hours x $0.06 x 0.6 = ~$13,140/month
    

    Wait -- that looks more expensive. The key insight is that baseline commitments cover 24/7, so they only make sense if you are using those slots around the clock. For 8 hours/day, autoscaling is cheaper.

    Choosing the Right Model

    Use on-demand when:

    • Monthly data scanned is under 5-10 TB
    • Workloads are unpredictable or infrequent
    • You are in early stages and still learning your usage patterns
    • Ad-hoc analysis is the primary use case

    Use Editions (slot-based) when:

    • Monthly data scanned exceeds 10-20 TB consistently
    • You want predictable, budgetable costs
    • You have many concurrent users or queries
    • You need performance guarantees (slots provide dedicated capacity)

    Break-even analysis:

    To find your break-even point, compare your monthly on-demand spend with what you would pay for the equivalent slot capacity:

    Break-even TB scanned = (Slot cost per month) / $6.25
    

    For example, if 100 Enterprise autoscaling slots cost approximately $4,380/month:

    Break-even = $4,380 / $6.25 = ~700 TB scanned per month
    

    If your queries scan more than 700 TB per month, slot-based pricing is likely cheaper.
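
    The break-even arithmetic is easy to script so you can re-run it as your usage grows. A minimal sketch, assuming the on-demand and Enterprise slot rates quoted above (adjust for your region and edition):

    # Break-even between BigQuery on-demand and autoscaling slots (illustrative rates).
    ON_DEMAND_PER_TB = 6.25        # $ per TB scanned
    SLOT_HOUR_ENTERPRISE = 0.06    # $ per slot-hour
    HOURS_PER_MONTH = 730

    def breakeven_tb(slots: int, slot_hour_price: float = SLOT_HOUR_ENTERPRISE) -> float:
        """TB scanned per month at which slot pricing matches on-demand spend."""
        monthly_slot_cost = slots * HOURS_PER_MONTH * slot_hour_price
        return monthly_slot_cost / ON_DEMAND_PER_TB

    print(f"100 slots break even at ~{breakeven_tb(100):,.0f} TB scanned per month")
    # -> roughly 700 TB, matching the calculation above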

    2.2 Query Optimization for Cost

    In BigQuery's on-demand model, cost is directly tied to the amount of data scanned. Every optimization that reduces the bytes scanned reduces your bill. Even in slot-based models, efficient queries complete faster, freeing slots for other workloads.

    Avoid SELECT *

    The single most impactful habit for BigQuery cost optimization is to never use SELECT * in production queries. Because BigQuery uses columnar storage, it only reads the columns you reference. Selecting all columns forces BigQuery to read the entire table.

    Before (expensive):

    -- Scans ALL columns in the table
    -- If the table has 50 columns but you only need 3, you're paying for 47 unnecessary columns
    SELECT *
    FROM `project.dataset.events`
    WHERE event_date = '2024-01-15'
    

    After (optimized):

    -- Scans only the 3 columns you need
    SELECT user_id, event_type, event_timestamp
    FROM `project.dataset.events`
    WHERE event_date = '2024-01-15'
    

    On a table with 50 columns averaging 100 bytes each and 1 billion rows, SELECT * scans approximately 5 TB, while selecting 3 specific columns scans approximately 300 GB -- a 94% reduction in cost.
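
    Before running a query at all, you can check what it would scan. One way is a dry run through the BigQuery Python client, which returns the estimated bytes processed without executing anything -- a minimal sketch, with placeholder project and table names:

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT user_id, event_type, event_timestamp
        FROM `project.dataset.events`
        WHERE event_date = '2024-01-15'
    """

    # Dry run: BigQuery estimates bytes scanned without running the query
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)

    tb_scanned = job.total_bytes_processed / 1024 ** 4
    print(f"Estimated scan: {tb_scanned:.3f} TB (~${tb_scanned * 6.25:.2f} on-demand)")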

    Partition Pruning

    Partitioned tables in BigQuery divide data into segments based on a column value (typically a date or timestamp). When your query includes a filter on the partition column, BigQuery only scans the relevant partitions.

    Creating a partitioned table:

    CREATE TABLE `project.dataset.events`
    (
      event_id STRING,
      user_id STRING,
      event_type STRING,
      event_timestamp TIMESTAMP,
      event_date DATE,
      payload JSON
    )
    PARTITION BY event_date
    OPTIONS (
      partition_expiration_days = 365,
      require_partition_filter = TRUE
    );
    

    The require_partition_filter = TRUE option is a powerful safety net -- it prevents queries from running without a partition filter, ensuring that no one accidentally scans the entire table.

    Query with partition pruning:

    -- Only scans partitions for the specified date range
    SELECT user_id, event_type, event_timestamp
    FROM `project.dataset.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    

    If the table contains 3 years of data (approximately 1,095 daily partitions), filtering to one month means BigQuery scans roughly 31/1095 = 2.8% of the data.

    Clustering for Filtered Queries

    Clustering organizes data within each partition based on the values of specified columns. When you filter on clustered columns, BigQuery can skip irrelevant blocks of data within a partition, further reducing the bytes scanned.

    CREATE TABLE `project.dataset.events`
    (
      event_id STRING,
      user_id STRING,
      event_type STRING,
      event_timestamp TIMESTAMP,
      event_date DATE,
      country STRING
    )
    PARTITION BY event_date
    CLUSTER BY country, event_type;
    

    Query benefiting from clustering:

    -- Partition pruning on event_date AND block pruning on country and event_type
    SELECT user_id, event_timestamp
    FROM `project.dataset.events`
    WHERE event_date = '2024-01-15'
      AND country = 'US'
      AND event_type = 'purchase'
    

    Clustering is most effective when the clustered columns have high cardinality and are frequently used in WHERE clauses or JOIN conditions. You can cluster on up to four columns, and the order matters -- put the most frequently filtered column first.

    Materialized Views

    Materialized views precompute and store query results, allowing BigQuery to read from the materialized data instead of scanning the base tables. BigQuery automatically maintains materialized views, refreshing them when the underlying data changes.

    CREATE MATERIALIZED VIEW `project.dataset.daily_event_summary`
    AS
    SELECT
      event_date,
      event_type,
      country,
      COUNT(*) as event_count,
      COUNT(DISTINCT user_id) as unique_users
    FROM `project.dataset.events`
    GROUP BY event_date, event_type, country;
    

    When you query the base table with a compatible aggregation, BigQuery automatically routes the query to the materialized view:

    -- BigQuery can transparently rewrite this aggregation to read from the materialized view
    SELECT event_date, COUNT(*) AS total_events
    FROM `project.dataset.events`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    GROUP BY event_date
    

    Materialized views incur storage costs and maintenance overhead, so they are most valuable for frequently run aggregation queries on large tables.

    Auditing Query Costs with INFORMATION_SCHEMA

    BigQuery provides detailed query metadata through the INFORMATION_SCHEMA views, allowing you to identify expensive queries and track cost trends.

    Find the most expensive queries in the last 30 days:

    SELECT
      user_email,
      job_id,
      query,
      total_bytes_processed,
      ROUND(total_bytes_processed / POW(1024, 4), 2) AS tb_processed,
      ROUND(total_bytes_processed / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd,
      creation_time,
      total_slot_ms
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
      AND state = 'DONE'
    ORDER BY total_bytes_processed DESC
    LIMIT 50;
    

    Find cost by user over the last 7 days:

    SELECT
      user_email,
      COUNT(*) AS query_count,
      ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS total_tb_scanned,
      ROUND(SUM(total_bytes_processed) / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
      AND job_type = 'QUERY'
      AND state = 'DONE'
    GROUP BY user_email
    ORDER BY estimated_cost_usd DESC;
    

    Running these audits regularly (weekly or even daily) helps you catch cost anomalies early, identify users or pipelines that need optimization, and track the impact of your optimization efforts over time.

    2.3 Storage Optimization

    BigQuery charges for data stored in its tables. While storage is cheaper than compute, it accumulates over time and can become a significant cost driver if left unmanaged.

    Active vs. Long-Term Storage

    BigQuery automatically classifies table data into two pricing tiers:

    Tier Condition Price (per GB/month)
    Active Modified in the last 90 days ~$0.02
    Long-term Not modified for 90+ days ~$0.01

    The transition to long-term storage is automatic -- you do not need to do anything. However, be aware that modifying a table resets the 90-day clock: for a non-partitioned table, even appending a single row resets the clock for the entire table, while for a partitioned table the clock is tracked per partition. This means a non-partitioned, append-only table with daily loads will always be billed at the active rate, whereas a date-partitioned table lets older partitions age into long-term pricing.

    Strategy: For large historical tables, consider separating current data (which is frequently appended to) from historical data (which is rarely modified). The historical table will naturally transition to long-term storage pricing.
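
    As a rough illustration of the payoff, the sketch below compares a table billed entirely at the active rate with one where older data has aged into long-term pricing. The table size and hot fraction are assumptions; the per-GB rates are the approximate figures from the table above.

    # Illustrative active vs. long-term storage comparison (approximate BigQuery rates).
    ACTIVE_PER_GB = 0.02
    LONG_TERM_PER_GB = 0.01

    total_gb = 50_000        # 50 TB of table data (assumed)
    hot_fraction = 0.10      # only the newest ~10% is still being modified (assumed)

    all_active = total_gb * ACTIVE_PER_GB
    tiered = (total_gb * hot_fraction * ACTIVE_PER_GB
              + total_gb * (1 - hot_fraction) * LONG_TERM_PER_GB)

    print(f"All active: ${all_active:,.0f}/month, tiered: ${tiered:,.0f}/month")
    # -> $1,000 vs. $550 per month in this example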

    Table Expiration Policies

    Setting expiration policies ensures that temporary or staging tables do not accumulate indefinitely:

    -- Set table expiration to 7 days from creation
    CREATE TABLE `project.dataset.staging_events`
    (
      event_id STRING,
      event_data JSON
    )
    OPTIONS (
      expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    );
    
    -- Set default expiration for all new tables in a dataset
    ALTER SCHEMA `project.dataset`
    SET OPTIONS (
      default_table_expiration_days = 30
    );
    

    For partitioned tables, you can set partition-level expiration, which is even more powerful:

    -- Automatically delete partitions older than 365 days
    CREATE TABLE `project.dataset.events`
    (
      event_id STRING,
      event_date DATE
    )
    PARTITION BY event_date
    OPTIONS (
      partition_expiration_days = 365
    );
    

    This approach ensures that your table always contains at most one year of data, with old partitions automatically cleaned up.

    Removing Duplicate Data

    Duplicate data wastes storage and increases query costs. BigQuery does not enforce unique constraints, so duplicates can easily creep in through pipeline retries or overlapping data loads.

    Identify duplicates:

    SELECT event_id, COUNT(*) as duplicate_count
    FROM `project.dataset.events`
    GROUP BY event_id
    HAVING COUNT(*) > 1
    ORDER BY duplicate_count DESC
    LIMIT 100;
    

    Deduplicate by rewriting the table with ROW_NUMBER():

    -- Create a deduplicated version of the table
    CREATE OR REPLACE TABLE `project.dataset.events_deduped` AS
    SELECT * EXCEPT(row_num)
    FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp DESC) as row_num
      FROM `project.dataset.events`
    )
    WHERE row_num = 1;
    

    Choosing the Right Partitioning Strategy

    The choice of partition column and granularity has a significant impact on both query cost and storage efficiency:

    Strategy Best For Partition Count
    Daily partitioning on date Time-series data with daily queries ~365/year
    Monthly partitioning Historical analysis with monthly grain ~12/year
    Integer range partitioning Non-time-series data (e.g., customer segments) Varies
    Ingestion-time partitioning When no natural partition column exists ~365/year

    Guidelines:

    • Choose a column that is frequently used in WHERE clauses
    • Ensure partitions are not too small (under 1 GB) or too large (over 5 TB)
    • Daily partitioning is the most common and works well for most time-series data
    • Always require partition filters on large tables to prevent accidental full scans

    Module 3: Snowflake Cost Optimization

    Snowflake's architecture separates storage and compute, giving you independent control over each. This separation is powerful but introduces a pricing model that differs significantly from both traditional databases and other cloud warehouses. Understanding Snowflake's credit-based system is essential for cost management.

    3.1 Understanding Snowflake Credits

    Snowflake charges for compute using a unit called a "credit." The cost of a credit depends on your edition and region, but typically ranges from $2 to $4 per credit. Credits are consumed whenever a virtual warehouse is running, regardless of whether it is actively processing queries.

    Warehouse Sizes and Credit Consumption

    Each warehouse size consumes a fixed number of credits per hour:

    Warehouse Size Credits/Hour Approx. Nodes
    X-Small (XS) 1 1
    Small (S) 2 2
    Medium (M) 4 4
    Large (L) 8 8
    X-Large (XL) 16 16
    2X-Large 32 32
    3X-Large 64 64
    4X-Large 128 128

    Notice that each size step doubles the credit consumption. This means upgrading from Medium to Large doubles your cost, even if the query only runs marginally faster.

    Calculating Query Costs

    The cost of a query depends on the warehouse size, how long it runs, and the per-credit price. Snowflake bills in one-second increments with a minimum of 60 seconds per warehouse resume.

    Example:

    Warehouse size: Medium (4 credits/hour)
    Query runtime: 45 seconds
    Credit price: $3.00
    
    Cost = (4 credits/hour) x (45 seconds / 3600 seconds) x $3.00
    Cost = 4 x 0.0125 x $3.00
    Cost = $0.15 per query execution
    

    Now consider running the same query on an X-Large warehouse where it completes in 12 seconds:

    Cost = (16 credits/hour) x (12 seconds / 3600 seconds) x $3.00
    Cost = 16 x 0.00333 x $3.00
    Cost = $0.16 per query execution
    

    Even though the XL warehouse is 4x more powerful and finishes 3.75x faster, the cost is nearly identical for this single query. However, for a 10-minute query on Medium that takes 2.5 minutes on XL:

    Medium: 4 x (600/3600) x $3.00 = $2.00
    XL:     16 x (150/3600) x $3.00 = $2.00
    

    The costs remain similar because Snowflake's linear scaling means larger warehouses finish proportionally faster. The real savings come from auto-suspend behavior -- a larger warehouse that finishes quickly and suspends sooner can actually cost less than a smaller warehouse that stays active longer processing a queue of queries.
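
    These per-query calculations generalize into a small helper. The sketch below encodes the credits-per-hour table and the 60-second minimum billed on each resume; the $3.00 credit price is an assumption to replace with your contracted rate.

    # Estimate the warehouse cost of a single Snowflake query run (illustrative).
    CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8,
                        "XL": 16, "2XL": 32, "3XL": 64, "4XL": 128}

    def query_cost(size: str, runtime_seconds: float,
                   credit_price: float = 3.00, just_resumed: bool = False) -> float:
        """Cost of one query; applies the 60-second minimum if the warehouse just resumed."""
        billed_seconds = max(runtime_seconds, 60) if just_resumed else runtime_seconds
        return CREDITS_PER_HOUR[size] * (billed_seconds / 3600) * credit_price

    print(f"Medium, 45s:              ${query_cost('M', 45):.2f}")   # ~$0.15
    print(f"X-Large, 12s:             ${query_cost('XL', 12):.2f}")  # ~$0.16
    print(f"Medium, 45s after resume: ${query_cost('M', 45, just_resumed=True):.2f}")  # ~$0.20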

    3.2 Warehouse Management

    Effective warehouse management is the single most impactful lever for Snowflake cost optimization. A poorly configured warehouse can burn credits 24/7 with no one noticing.

    Auto-Suspend and Auto-Resume

    Auto-suspend shuts down a warehouse after a period of inactivity. Auto-resume starts it automatically when a query arrives. These two settings are your first line of defense against wasted credits.

    -- Create a warehouse with aggressive auto-suspend (60 seconds)
    CREATE WAREHOUSE analytics_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 60        -- Suspend after 60 seconds of inactivity
      AUTO_RESUME = TRUE       -- Automatically resume when queries arrive
      INITIALLY_SUSPENDED = TRUE;
    
    -- Modify an existing warehouse
    ALTER WAREHOUSE etl_wh SET
      AUTO_SUSPEND = 120
      AUTO_RESUME = TRUE;
    

    Recommended auto-suspend values by use case:

    Use Case Auto-Suspend (seconds) Rationale
    ETL / Batch pipelines 60 Jobs are scheduled; no waiting users
    BI / Dashboard queries 300 Users may run follow-up queries
    Ad-hoc analysis 300-600 Analysts explore iteratively
    Data science / ML 600-900 Long think-time between queries

    The 60-second minimum charge per resume means that warehouses resuming frequently for short queries accumulate cost far beyond the compute they actually use. If your ETL pipeline runs 100 short queries, each separated by 2 minutes of inactivity, the warehouse will suspend and resume 100 times, billing roughly 100 minutes (minimum charges plus the idle time before each suspend) for only a few minutes of real work. The most cost-effective fix is to batch the queries so that a single warehouse session covers all of them; lengthening auto-suspend only helps when the idle gaps are shorter than the 60 seconds you would otherwise pay on each resume.
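
    A back-of-the-envelope comparison makes the point. The sketch below assumes 100 five-second queries on an X-Small warehouse at $3.00 per credit with a 60-second auto-suspend; the structure matters more than the exact numbers.

    # Cost of 100 short queries: one resume per query vs. one batched session (illustrative).
    CREDIT_PRICE = 3.00
    XS_CREDITS_PER_HOUR = 1

    def billed_cost(billed_seconds: float) -> float:
        return XS_CREDITS_PER_HOUR * (billed_seconds / 3600) * CREDIT_PRICE

    n_queries, query_s, auto_suspend_s = 100, 5, 60

    # Each query triggers its own resume: billing covers the query plus the idle time
    # before suspend, subject to the 60-second minimum per resume.
    per_query_resumes = n_queries * billed_cost(max(60, query_s + auto_suspend_s))

    # All queries batched back-to-back in a single session, then one suspend.
    batched = billed_cost(n_queries * query_s + auto_suspend_s)

    print(f"One resume per query: ${per_query_resumes:.2f}")  # ~$5.42
    print(f"Batched session:      ${batched:.2f}")            # ~$0.47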

    Right-Sizing Warehouses

    Different workloads have different resource requirements. Using a single warehouse for all workloads means you are either over-provisioning (wasting money) or under-provisioning (degrading performance).

    -- Dedicated warehouse for ETL: needs power, runs intermittently
    CREATE WAREHOUSE etl_wh
      WITH WAREHOUSE_SIZE = 'LARGE'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;
    
    -- Dedicated warehouse for BI dashboards: moderate, concurrent users
    CREATE WAREHOUSE bi_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE;
    
    -- Dedicated warehouse for ad-hoc analysis: small, variable usage
    CREATE WAREHOUSE adhoc_wh
      WITH WAREHOUSE_SIZE = 'SMALL'
      AUTO_SUSPEND = 600
      AUTO_RESUME = TRUE;
    

    How to determine the right size:

    1. Start with a Small warehouse for new workloads
    2. Monitor query performance using the QUERY_HISTORY view
    3. If queries are spilling to remote storage or spending excessive time queued, try the next size up
    4. If the warehouse is idle most of the time, try a smaller size
    5. Check the query profile for "bytes spilled to local storage" and "bytes spilled to remote storage" -- spilling is a sign the warehouse is undersized

    -- Check for spilling in recent queries
    SELECT
      query_id,
      warehouse_name,
      warehouse_size,
      execution_time / 1000 as execution_seconds,
      bytes_spilled_to_local_storage,
      bytes_spilled_to_remote_storage,
      query_text
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND bytes_spilled_to_remote_storage > 0
    ORDER BY bytes_spilled_to_remote_storage DESC
    LIMIT 20;
    

    Multi-Cluster Warehouses

    For workloads with variable concurrency (e.g., many BI users querying simultaneously during business hours), multi-cluster warehouses automatically scale out by adding clusters as demand increases.

    CREATE WAREHOUSE bi_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'STANDARD'
      AUTO_SUSPEND = 300
      AUTO_RESUME = TRUE;
    

    Multi-cluster warehouses prevent query queuing without requiring a permanently large warehouse. The STANDARD scaling policy adds clusters when queries start queuing and removes them after the load decreases.

    Cost implication: A 4-cluster Medium warehouse at peak costs the same as a single X-Large warehouse (both consume 16 credits/hour), but it automatically scales back down to 1 cluster during off-peak hours.

    Resource Monitors

    Resource monitors set credit consumption limits and trigger actions (alerts or warehouse suspension) when thresholds are reached.

    -- Create a resource monitor with a monthly budget
    CREATE RESOURCE MONITOR monthly_etl_budget
      WITH CREDIT_QUOTA = 5000
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS
        ON 75 PERCENT DO NOTIFY
        ON 90 PERCENT DO NOTIFY
        ON 100 PERCENT DO SUSPEND;
    
    -- Apply the monitor to a specific warehouse
    ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_etl_budget;
    
    -- Create an account-level resource monitor
    CREATE RESOURCE MONITOR account_monthly_budget
      WITH CREDIT_QUOTA = 20000
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS
        ON 80 PERCENT DO NOTIFY
        ON 95 PERCENT DO NOTIFY
        ON 100 PERCENT DO SUSPEND_IMMEDIATE;
    

    Resource monitors are essential governance controls. Without them, a runaway query or misconfigured pipeline can burn through your entire budget. The SUSPEND_IMMEDIATE action will cancel running queries, while SUSPEND allows currently running queries to complete before suspending.

    3.3 Query and Storage Optimization

    Beyond warehouse management, Snowflake offers several features for optimizing query performance and storage costs.

    Clustering Keys

    Snowflake automatically organizes data into micro-partitions, but for very large tables (multi-TB), the default organization may not align with your query patterns. Clustering keys tell Snowflake to co-locate related data, improving partition pruning.

    -- Add clustering keys to an existing table
    ALTER TABLE events CLUSTER BY (event_date, country);
    
    -- Verify clustering efficiency
    SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, country)');
    

    Clustering is an ongoing, automatic process that consumes credits in the background. It is most cost-effective for:

    • Tables larger than 1 TB
    • Tables with well-defined query patterns (consistent WHERE clause columns)
    • Tables where the natural data loading order does not match query patterns

    Avoid over-clustering: Adding clustering keys to small tables or tables that are already well-ordered wastes credits. Monitor the automatic clustering costs in your account usage views.

    Micro-Partition Pruning

    Snowflake stores data in micro-partitions (each holding 50-500 MB of uncompressed data). When a query includes a filter, Snowflake can skip entire micro-partitions that do not contain matching data. Effective pruning depends on data co-location, which is influenced by clustering.

    -- Check how effectively recent queries pruned micro-partitions
    SELECT
      query_id,
      query_text,
      partitions_scanned,
      partitions_total,
      ROUND(100 * partitions_scanned / NULLIF(partitions_total, 0), 1) AS pct_partitions_scanned
    FROM snowflake.account_usage.query_history
    WHERE query_text ILIKE '%events%'
      AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY start_time DESC
    LIMIT 10;
    
    -- A low partitions_scanned relative to partitions_total indicates effective pruning
    

    Time Travel and Fail-Safe Cost Implications

    Snowflake's Time Travel feature allows you to access historical data that has been modified or deleted. Fail-safe provides an additional recovery period. Both features consume storage.

    Feature Duration Storage Cost
    Time Travel 0-90 days (configurable) Charged for changed data
    Fail-safe 7 days (non-configurable) Charged for changed data

    For tables with frequent updates or deletes, Time Travel storage can be substantial because Snowflake retains the old versions of changed micro-partitions.

    -- Reduce Time Travel retention for staging tables
    ALTER TABLE staging_events SET DATA_RETENTION_TIME_IN_DAYS = 1;
    
    -- Set to 0 for truly temporary tables (disables Time Travel)
    ALTER TABLE tmp_processing SET DATA_RETENTION_TIME_IN_DAYS = 0;
    
    -- Keep longer retention for critical production tables
    ALTER TABLE production_facts SET DATA_RETENTION_TIME_IN_DAYS = 30;
    

    Recommendation: Use the minimum Time Travel retention that meets your recovery needs. For staging and temporary tables, 0-1 days is usually sufficient. For production tables, 7-14 days provides a reasonable safety net without excessive storage costs.

    Zero-Copy Cloning

    Zero-copy cloning creates an instant copy of a table, schema, or database without physically duplicating the data. The clone shares the underlying micro-partitions with the original, so it consumes no additional storage until data in either the clone or the original is modified.

    -- Clone a production database for development
    CREATE DATABASE dev_analytics CLONE prod_analytics;
    
    -- Clone a specific table for testing
    CREATE TABLE events_test CLONE events;
    

    Zero-copy cloning is a powerful cost optimization tool for:

    • Development environments: Give each developer a full clone of production data without doubling storage costs
    • Testing: Create test datasets that mirror production
    • Backups: Take point-in-time snapshots before risky operations
    • Data sharing: Share data across teams without copying

    The cost only increases as the clone and original diverge (i.e., as modifications create new micro-partitions that are not shared).


    Module 4: AWS Data Service Costs

    AWS offers a wide range of data services, each with its own pricing model. For data engineers, the key services to understand are S3 for storage, and EMR, Glue, and Athena for compute. Making the right choice between these services can save tens of thousands of dollars per month.

    4.1 S3 Storage Economics

    Amazon S3 is the foundation of nearly every AWS data architecture. Understanding its tiered storage model and pricing nuances is essential for cost management.

    Storage Tiers

    S3 offers multiple storage classes optimized for different access patterns:

    Storage Class Monthly Cost (per GB) Min. Storage Duration Retrieval Cost Use Case
    Standard $0.023 None None Frequently accessed data
    Intelligent-Tiering $0.023 + monitoring fee None None Unknown access patterns
    Standard-IA $0.0125 30 days $0.01/GB Infrequent access
    One Zone-IA $0.01 30 days $0.01/GB Non-critical, infrequent
    Glacier Instant $0.004 90 days $0.03/GB Archive, instant access
    Glacier Flexible $0.0036 90 days $0.01/GB (standard) Archive, minutes-hours
    Glacier Deep Archive $0.00099 180 days $0.02/GB Long-term archive

    The cost difference between Standard and Deep Archive is over 23x. For a 100 TB data lake, the annual storage cost difference is:

    Standard:     100,000 GB x $0.023 x 12 = $27,600/year
    Deep Archive: 100,000 GB x $0.00099 x 12 = $1,188/year
    Savings: $26,412/year (96% reduction)
    

    Of course, Deep Archive data takes hours to retrieve, so it is only appropriate for data that is rarely or never accessed.

    Lifecycle Policies

    S3 Lifecycle policies automate the transition of data between storage classes based on age:

    {
      "Rules": [
        {
          "ID": "DataLakeLifecycle",
          "Status": "Enabled",
          "Filter": {
            "Prefix": "raw/"
          },
          "Transitions": [
            {
              "Days": 30,
              "StorageClass": "STANDARD_IA"
            },
            {
              "Days": 90,
              "StorageClass": "GLACIER_IR"
            },
            {
              "Days": 365,
              "StorageClass": "DEEP_ARCHIVE"
            }
          ],
          "Expiration": {
            "Days": 2555
          }
        }
      ]
    }
    

    This policy moves raw data to Standard-IA after 30 days, Glacier Instant Retrieval after 90 days, Deep Archive after 1 year, and deletes it after 7 years. For a data lake ingesting 1 TB/day, this policy can reduce storage costs by 60-80% compared to keeping everything in Standard.

    Terraform implementation:

    resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
      bucket = aws_s3_bucket.data_lake.id
    
      rule {
        id     = "raw-data-lifecycle"
        status = "Enabled"
    
        filter {
          prefix = "raw/"
        }
    
        transition {
          days          = 30
          storage_class = "STANDARD_IA"
        }
    
        transition {
          days          = 90
          storage_class = "GLACIER_IR"
        }
    
        transition {
          days          = 365
          storage_class = "DEEP_ARCHIVE"
        }
    
        expiration {
          days = 2555
        }
      }
    }
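
    To sanity-check the savings claim, the sketch below computes the blended per-GB-month rate implied by this lifecycle schedule, weighting each storage class by the time an object spends in it. Retrieval, request, and minimum-duration charges are ignored, which is why real-world savings usually land in the 60-80% range rather than at this idealized figure.

    # Blended per-GB-month storage rate under the lifecycle policy above (idealized).
    TIERS = [   # (days spent in tier, $ per GB-month)
        (30,   0.023),     # Standard (days 0-30)
        (60,   0.0125),    # Standard-IA (days 30-90)
        (275,  0.004),     # Glacier Instant Retrieval (days 90-365)
        (2190, 0.00099),   # Deep Archive (days 365-2555)
    ]

    total_days = sum(days for days, _ in TIERS)
    blended = sum(days * price for days, price in TIERS) / total_days

    print(f"Blended rate: ${blended:.4f}/GB-month vs ${0.023:.3f}/GB-month in Standard "
          f"({1 - blended / 0.023:.0%} lower)")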
    

    Request Pricing and Data Transfer

    S3 charges not just for storage but also for API requests and data transfer:

    Operation Cost
    PUT, COPY, POST, LIST $0.005 per 1,000 requests
    GET, SELECT $0.0004 per 1,000 requests
    Data Transfer OUT to Internet $0.09/GB (first 10 TB/month)
    Data Transfer to same region Free
    Data Transfer to different region $0.02/GB

    Request costs matter when you have many small files. A pipeline that processes 1 million small files makes 1 million GET requests, costing $0.40 just in request fees. Compacting small files into larger ones (e.g., using Apache Spark's coalesce() or repartition()) reduces both request costs and processing time.

    Data transfer is the hidden cost killer. Moving 10 TB out of AWS to the internet costs $900. Moving data between regions costs $200 per 10 TB. This is why data locality matters -- keep compute and storage in the same region.

    S3 Select

    S3 Select allows you to retrieve a subset of data from an object using SQL expressions, reducing the amount of data transferred and processed:

    import boto3
    
    s3_client = boto3.client('s3')
    
    response = s3_client.select_object_content(
        Bucket='my-data-lake',
        Key='events/2024/01/events.parquet',
        Expression="SELECT user_id, event_type FROM s3object WHERE country = 'US'",
        ExpressionType='SQL',
        InputSerialization={
            'Parquet': {}
        },
        OutputSerialization={
            'JSON': {}
        }
    )
    

    S3 Select is priced at $0.002 per GB scanned and $0.0007 per GB returned. For large Parquet or CSV files where you only need a subset of columns or rows, this can be significantly cheaper than downloading the entire file.

    4.2 Compute: EMR vs. Glue vs. Athena

    AWS offers three primary services for data processing, each with distinct pricing models, operational characteristics, and ideal use cases.

    Amazon EMR (Elastic MapReduce)

    EMR provides managed Hadoop/Spark clusters. You pay for EC2 instances plus an EMR surcharge (typically 15-25% above EC2 pricing).

    Pricing model:

    EMR cost = EC2 instance cost + EMR surcharge
    

    Example: Medium Spark cluster

    3x m5.xlarge instances (1 master + 2 core)
    EC2 cost: 3 x $0.192/hour = $0.576/hour
    EMR surcharge: 3 x $0.048/hour = $0.144/hour
    Total: $0.720/hour
    
    Running 4 hours/day for ETL:
    Monthly cost: $0.720 x 4 x 30 = $86.40/month
    

    With Spot Instances for core/task nodes:

    1x m5.xlarge on-demand (master): $0.192 + $0.048 = $0.240/hour
    2x m5.xlarge spot (core): 2 x ($0.058 + $0.048) = $0.212/hour
    Total: $0.452/hour (37% savings)
    
    Monthly cost: $0.452 x 4 x 30 = $54.24/month
    

    EMR is ideal for complex Spark jobs, custom frameworks, or workloads requiring fine-grained control over the execution environment.

    AWS Glue

    Glue is a fully serverless ETL service. You pay per DPU-hour (Data Processing Unit), where each DPU provides 4 vCPUs and 16 GB of memory.

    Pricing model:

    Glue cost = Number of DPUs x Runtime (hours) x $0.44/DPU-hour
    

    Example:

    10 DPUs running for 30 minutes:
    Cost = 10 x 0.5 x $0.44 = $2.20 per job run
    
    Running daily:
    Monthly cost: $2.20 x 30 = $66/month
    

    Glue 4.0 supports auto-scaling, which means it starts with a minimum number of workers and scales up based on workload. This avoids paying for idle DPUs during the less intensive phases of a job.

    # Glue job configuration with auto-scaling
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)
    
    # Auto-scaling is enabled via job parameters, not in the script:
    # --enable-auto-scaling true
    # --number-of-workers 2   (minimum worker count; Glue scales up as needed)
    # --worker-type G.1X
    

    Glue is ideal for teams that want serverless simplicity, built-in data catalog integration, and do not need fine-grained infrastructure control.

    Amazon Athena

    Athena is a serverless query engine that charges per TB scanned, similar to BigQuery's on-demand model.

    Pricing model:

    Athena cost = $5.00 per TB scanned
    

    Example:

    Query scanning 500 GB of Parquet data:
    Cost = 0.5 TB x $5.00 = $2.50
    
    100 such queries per month:
    Monthly cost: $2.50 x 100 = $250/month
    

    Athena benefits enormously from Parquet/ORC formats (columnar compression reduces data scanned), partitioned data (Hive-style partitions in S3), and bucketing (similar to BigQuery clustering).

    Cost Comparison

    Let us compare the three services for a common scenario: processing 500 GB of data daily.

    Service Configuration Daily Cost Monthly Cost
    EMR (On-Demand) 3x m5.xlarge, 2 hours $1.44 $43.20
    EMR (Spot) 1 on-demand + 2 spot, 2 hours $0.90 $27.12
    Glue 10 DPUs, 1 hour $4.40 $132.00
    Athena 500 GB scan per query $2.50 $75.00

    EMR with Spot instances is the cheapest for large batch workloads, but requires operational overhead for cluster management. Athena is the cheapest for occasional, ad-hoc queries on well-partitioned data. Glue sits in the middle, offering serverless simplicity at a moderate cost.
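
    Because the ranking shifts with runtimes and data volumes, it is worth parameterizing the comparison. A minimal sketch using the rates assumed above:

    # Daily cost of the same workload on EMR, Glue, and Athena (illustrative rates).
    def emr_daily(nodes: int, hours: float, node_hour: float = 0.240) -> float:
        """node_hour assumes m5.xlarge on-demand EC2 plus the EMR surcharge."""
        return nodes * hours * node_hour

    def glue_daily(dpus: int, hours: float, dpu_hour: float = 0.44) -> float:
        return dpus * hours * dpu_hour

    def athena_daily(tb_scanned_per_day: float, per_tb: float = 5.00) -> float:
        return tb_scanned_per_day * per_tb

    print(f"EMR (on-demand): ${emr_daily(3, 2):.2f}/day")    # ~$1.44
    print(f"Glue:            ${glue_daily(10, 1):.2f}/day")  # ~$4.40
    print(f"Athena:          ${athena_daily(0.5):.2f}/day")  # ~$2.50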

    When to use each:

    • EMR: Large-scale batch processing, complex Spark pipelines, ML training, workloads requiring custom libraries or specific Spark configurations
    • Glue: Serverless ETL, tight integration with AWS data catalog, teams without dedicated infrastructure expertise
    • Athena: Ad-hoc SQL queries, interactive analysis, data exploration, lightweight scheduled queries

    4.3 Reserved and Spot Strategies

    For predictable workloads, reserved pricing and spot instances can dramatically reduce compute costs.

    Reserved Instances and Savings Plans

    AWS offers two mechanisms for commitment-based discounts:

    Reserved Instances (RIs):

    • Commit to a specific instance type in a specific region
    • 1-year or 3-year terms
    • Up to 72% discount over on-demand
    • Best for always-on workloads (e.g., persistent EMR clusters)

    Savings Plans:

    • Commit to a dollar amount of compute per hour
    • More flexible than RIs (apply across instance families and regions)
    • Compute Savings Plans: Up to 66% discount, apply to EC2, Fargate, and Lambda
    • EC2 Instance Savings Plans: Up to 72% discount, specific to instance family and region

    Example:

    On-demand: 10x m5.xlarge instances, 24/7
    Cost: 10 x $0.192 x 730 hours = $1,401.60/month
    
    3-year All Upfront Reserved:
    Cost: 10 x $0.074 x 730 hours = $540.20/month
    Savings: $861.40/month (61%)
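
    A commitment only pays off if the capacity is actually used. A quick sanity check is the utilization at which the reservation breaks even against on-demand, sketched below with the rates from this example:

    # Break-even utilization for a reservation vs. on-demand (illustrative rates).
    ON_DEMAND_HOURLY = 0.192   # m5.xlarge on-demand
    RESERVED_HOURLY = 0.074    # effective hourly rate, 3-year all-upfront (from above)

    breakeven_utilization = RESERVED_HOURLY / ON_DEMAND_HOURLY
    print(f"Reservation breaks even at about {breakeven_utilization:.0%} utilization")
    # -> ~39%: if the instances would run less than that on-demand, skip the reservation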
    

    Spot Instances for Batch Processing

    Spot instances are ideal for fault-tolerant data engineering workloads. EMR natively supports Spot instances with automatic fallback to on-demand.

    EMR Spot Fleet configuration (Terraform):

    resource "aws_emr_cluster" "spark_etl" {
      name          = "spark-etl-cluster"
      release_label = "emr-7.0.0"
      applications  = ["Spark"]
    
      master_instance_group {
        instance_type  = "m5.xlarge"
        instance_count = 1
      }
    
      core_instance_group {
        instance_type  = "m5.2xlarge"
        instance_count = 2
        bid_price      = "0.10"  # Spot bid price
    
        ebs_config {
          size = 100
          type = "gp3"
        }
      }
    
      ec2_attributes {
        instance_profile = aws_iam_instance_profile.emr.arn
      }
    
      # Required IAM service role for the EMR control plane (defined elsewhere)
      service_role = aws_iam_role.emr_service.arn
    }
    

    Best practices for Spot in data engineering:

    • Always use on-demand for the master node
    • Use Spot for core and task nodes with instance fleet diversification
    • Configure multiple instance types to increase spot capacity availability
    • Use checkpointing in Spark to handle spot interruptions gracefully
    • Set up CloudWatch alarms for spot interruption notices

    Graviton Instances

    AWS Graviton processors (ARM-based) offer 20-40% better price-performance compared to x86 instances for many data workloads.

    Instance vCPUs Memory On-Demand Price Savings vs x86
    m5.xlarge (x86) 4 16 GB $0.192/hour Baseline
    m6g.xlarge (Graviton2) 4 16 GB $0.154/hour 20%
    m7g.xlarge (Graviton3) 4 16 GB $0.163/hour 15%

    EMR, Glue, and most AWS data services support Graviton instances. Spark, Hive, Presto, and Python workloads generally work without modification on Graviton.

    # Use Graviton instances in EMR for better price-performance
    core_instance_group {
      instance_type  = "m6g.2xlarge"  # Graviton2
      instance_count = 4
    }
    

    Module 5: Cost Monitoring and Governance

    Optimization without monitoring is guesswork. This module covers the tools and practices for maintaining ongoing visibility into cloud costs and ensuring that optimization gains are sustained over time.

    5.1 Tagging Strategy

    Resource tagging is the foundation of cost allocation. Without consistent tags, you cannot attribute costs to teams, projects, or environments, making it impossible to know where money is being spent.

    Designing a Tagging Schema

    A well-designed tagging schema should answer four questions:

    1. Who is responsible for this resource? (team, owner)
    2. What is this resource used for? (project, application)
    3. Where is this resource in the lifecycle? (environment)
    4. Why does this resource exist? (cost center, business unit)

    Recommended tags:

    Tag Key Example Values Purpose
    team data-engineering, analytics, ml Cost attribution to teams
    project customer-360, revenue-pipeline Cost attribution to projects
    environment production, staging, development Separate prod from non-prod costs
    cost-center CC-1234 Finance allocation
    owner jane.doe@company.com Accountability
    managed-by terraform, manual, cloudformation Infrastructure management
    data-classification public, internal, confidential Governance

    Enforcing Tags with Policy

    Tags are only useful if they are consistently applied. Use policy-as-code to enforce tagging compliance:

    AWS - Tag Policy (via AWS Organizations):

    {
      "tags": {
        "team": {
          "tag_key": {
            "@@assign": "team"
          },
          "enforced_for": {
            "@@assign": [
              "s3:bucket",
              "emr:cluster",
              "glue:job"
            ]
          }
        },
        "environment": {
          "tag_key": {
            "@@assign": "environment"
          },
          "tag_value": {
            "@@assign": ["production", "staging", "development"]
          }
        }
      }
    }
    

    Terraform - Required tags module:

    variable "required_tags" {
      type = object({
        team        = string
        project     = string
        environment = string
        cost_center = string
        owner       = string
      })
    }
    
    # Apply to all resources
    resource "aws_s3_bucket" "data_lake" {
      bucket = "my-data-lake"
    
      tags = merge(var.required_tags, {
        Name = "data-lake-bucket"
      })
    }
    
    resource "aws_emr_cluster" "etl" {
      name = "etl-cluster"
    
      tags = merge(var.required_tags, {
        Name = "etl-spark-cluster"
      })
    }
    

    GCP - Organization Policy:

    GCP uses labels (its equivalent of tags); label requirements are typically enforced through organization policy constraints or policy-as-code checks in CI. A schematic policy:

    # gcp-label-policy.yaml
    constraint: constraints/compute.requireLabels
    listPolicy:
      allValues: DENY
      allowedValues:
        - team
        - project
        - environment
    

    5.2 Budgets and Alerts

    Budgets and alerts are your early warning system for cost anomalies. They do not prevent overspending by themselves, but they ensure you know about it quickly enough to take action.

    AWS Budgets

    AWS Budgets allows you to set custom cost thresholds and receive notifications when spending approaches or exceeds those thresholds.

    resource "aws_budgets_budget" "data_engineering_monthly" {
      name         = "data-engineering-monthly"
      budget_type  = "COST"
      limit_amount = "10000"
      limit_unit   = "USD"
      time_unit    = "MONTHLY"
    
      cost_filter {
        name   = "TagKeyValue"
        values = ["user:team$data-engineering"]
      }
    
      notification {
        comparison_operator       = "GREATER_THAN"
        threshold                 = 75
        threshold_type            = "PERCENTAGE"
        notification_type         = "ACTUAL"
        subscriber_email_addresses = ["data-eng-leads@company.com"]
      }
    
      notification {
        comparison_operator       = "GREATER_THAN"
        threshold                 = 90
        threshold_type            = "PERCENTAGE"
        notification_type         = "ACTUAL"
        subscriber_email_addresses = ["data-eng-leads@company.com", "finance@company.com"]
      }
    
      notification {
        comparison_operator       = "GREATER_THAN"
        threshold                 = 100
        threshold_type            = "PERCENTAGE"
        notification_type         = "FORECASTED"
        subscriber_email_addresses = ["data-eng-leads@company.com"]
        subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
      }
    }
    

    GCP Budget Alerts

    resource "google_billing_budget" "data_engineering" {
      billing_account = var.billing_account_id
      display_name    = "Data Engineering Monthly Budget"
    
      budget_filter {
        projects = ["projects/${var.project_id}"]
        labels = {
          team = ["data-engineering"]
        }
      }
    
      amount {
        specified_amount {
          currency_code = "USD"
          units         = "10000"
        }
      }
    
      threshold_rules {
        threshold_percent = 0.5
        spend_basis       = "CURRENT_SPEND"
      }
    
      threshold_rules {
        threshold_percent = 0.75
        spend_basis       = "CURRENT_SPEND"
      }
    
      threshold_rules {
        threshold_percent = 0.9
        spend_basis       = "FORECASTED_SPEND"
      }
    
      threshold_rules {
        threshold_percent = 1.0
        spend_basis       = "CURRENT_SPEND"
      }
    
      all_updates_rule {
        monitoring_notification_channels = [
          google_monitoring_notification_channel.email.name
        ]
        pubsub_topic = google_pubsub_topic.budget_alerts.id
      }
    }
    

    Snowflake Budget Controls

    Snowflake provides resource monitors (covered in Module 3) and account-level spending views:

    -- Monitor daily credit consumption
    SELECT
      DATE_TRUNC('day', start_time) AS usage_date,
      warehouse_name,
      SUM(credits_used) AS credits_consumed,
      SUM(credits_used) * 3.00 AS estimated_cost_usd
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1 DESC, credits_consumed DESC;
    
    -- Detect anomalies: days where spending exceeded 2x the average
    WITH daily_spend AS (
      SELECT
        DATE_TRUNC('day', start_time) AS usage_date,
        SUM(credits_used) AS daily_credits
      FROM snowflake.account_usage.warehouse_metering_history
      WHERE start_time >= DATEADD('day', -90, CURRENT_TIMESTAMP())
      GROUP BY 1
    ),
    stats AS (
      SELECT
        AVG(daily_credits) AS avg_daily,
        STDDEV(daily_credits) AS stddev_daily
      FROM daily_spend
    )
    SELECT
      d.usage_date,
      d.daily_credits,
      s.avg_daily,
      ROUND((d.daily_credits - s.avg_daily) / NULLIF(s.stddev_daily, 0), 2) AS z_score
    FROM daily_spend d
    CROSS JOIN stats s
    WHERE d.daily_credits > s.avg_daily * 2
    ORDER BY d.usage_date DESC;
    

    Automated Responses

    Beyond alerts, you can automate responses to cost anomalies using cloud-native event systems:

    AWS: Lambda triggered by budget alert via SNS:

    import boto3
    import json
    
    def lambda_handler(event, context):
        """
        Triggered when data engineering budget exceeds threshold.
        Suspends non-critical EMR clusters and notifies the team.
        """
        emr_client = boto3.client('emr')
        sns_client = boto3.client('sns')
    
        # List running clusters tagged as non-critical
        clusters = emr_client.list_clusters(
            ClusterStates=['RUNNING', 'WAITING']
        )
    
        terminated = []
        for cluster in clusters['Clusters']:
            tags = emr_client.describe_cluster(
                ClusterId=cluster['Id']
            )['Cluster']['Tags']
    
            tag_dict = {t['Key']: t['Value'] for t in tags}
    
            if tag_dict.get('environment') != 'production':
                emr_client.terminate_job_flows(
                    JobFlowIds=[cluster['Id']]
                )
                terminated.append(cluster['Name'])
    
        # Notify the team
        sns_client.publish(
            TopicArn='arn:aws:sns:us-east-1:123456789:data-eng-alerts',
            Subject='Budget Alert: Non-critical EMR clusters terminated',
            Message=json.dumps({
                'action': 'auto-terminate',
                'clusters': terminated,
                'reason': 'Monthly budget threshold exceeded'
            })
        )
    
        return {'terminated_clusters': terminated}
    

    5.3 Cost Dashboards and Reporting

    Cost dashboards transform raw billing data into actionable insights. A good dashboard answers three questions at a glance: How much are we spending? Where is the money going? Is spending trending up or down?

    AWS Cost Explorer

    AWS Cost Explorer provides built-in visualization of your AWS spending. For data engineering teams, the most useful views are:

    • Service breakdown: See which AWS services (S3, EMR, Glue, Athena) consume the most budget
    • Tag-based filtering: Filter by team, project, or environment tags
    • Daily granularity: Spot anomalies and correlate with pipeline schedules
    • Forecasting: AWS provides 12-month cost forecasts based on historical trends

    For custom dashboards, use the AWS Cost and Usage Report (CUR) exported to S3 and queried via Athena:

    -- Query the Cost and Usage Report in Athena
    SELECT
      line_item_product_code AS service,
      resource_tags_user_team AS team,
      resource_tags_user_project AS project,
      DATE_FORMAT(line_item_usage_start_date, '%Y-%m') AS month,
      SUM(line_item_unblended_cost) AS total_cost
    FROM cost_and_usage_report
    WHERE line_item_usage_start_date >= DATE '2024-01-01'
      AND resource_tags_user_team = 'data-engineering'
    GROUP BY 1, 2, 3, 4
    ORDER BY total_cost DESC;
    

    GCP Billing Reports

    GCP Billing provides similar capabilities with BigQuery export:

    -- Query GCP billing export in BigQuery
    -- Credits are a repeated field; summing them in a correlated subquery avoids the
    -- row fan-out (and inflated SUM(cost)) that a second UNNEST join would cause.
    -- Credit amounts are negative, so adding them to cost yields the net cost.
    SELECT
      service.description AS service_name,
      labels.value AS team,
      FORMAT_TIMESTAMP('%Y-%m', usage_start_time) AS month,
      SUM(cost) AS total_cost,
      SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0)) AS total_credits,
      SUM(cost) + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0)) AS net_cost
    FROM `billing_dataset.gcp_billing_export`
    LEFT JOIN UNNEST(labels) AS labels ON labels.key = 'team'
    WHERE usage_start_time >= TIMESTAMP('2024-01-01')
    GROUP BY 1, 2, 3
    ORDER BY net_cost DESC;
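
    The billing export shows spend by service and label, but it cannot tell you which users or queries drove it. A sketch of per-user attribution using the INFORMATION_SCHEMA jobs view, assuming the US multi-region and the $6.25/TB on-demand rate used throughout this course:

    -- Attribute on-demand BigQuery spend to users over the last 30 days
    SELECT
      user_email,
      COUNT(*) AS query_count,
      ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tb_billed,
      ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
      AND job_type = 'QUERY'
      AND total_bytes_billed IS NOT NULL
    GROUP BY 1
    ORDER BY estimated_cost_usd DESC;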
    

    Snowflake ACCOUNT_USAGE Views

    Snowflake provides comprehensive usage tracking through the SNOWFLAKE.ACCOUNT_USAGE schema:

    -- Weekly cost report by warehouse
    SELECT
      DATE_TRUNC('week', start_time) AS week,
      warehouse_name,
      SUM(credits_used) AS total_credits,
      ROUND(SUM(credits_used) * 3.00, 2) AS estimated_cost_usd,
      COUNT(DISTINCT DATE_TRUNC('day', start_time)) AS active_days,
      ROUND(SUM(credits_used) / COUNT(DISTINCT DATE_TRUNC('day', start_time)), 2) AS avg_credits_per_day
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD('week', -12, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1 DESC, total_credits DESC;
    
    -- Storage cost breakdown
    SELECT
      DATE_TRUNC('month', usage_date) AS month,
      ROUND(AVG(storage_bytes) / POW(1024, 4), 2) AS avg_storage_tb,
      ROUND(AVG(stage_bytes) / POW(1024, 4), 2) AS avg_stage_tb,
      ROUND(AVG(failsafe_bytes) / POW(1024, 4), 2) AS avg_failsafe_tb,
      ROUND((AVG(storage_bytes) + AVG(stage_bytes) + AVG(failsafe_bytes))
        / POW(1024, 4) * 23, 2) AS estimated_monthly_cost_usd  -- assumes ~$23 per TB/month storage pricing
    FROM snowflake.account_usage.storage_usage
    WHERE usage_date >= DATEADD('month', -6, CURRENT_DATE())
    GROUP BY 1
    ORDER BY 1 DESC;
    

    Weekly Cost Review Cadence

    Establishing a regular review cadence ensures cost awareness stays embedded in team culture:

    Weekly (15-minute standup):

    • Review week-over-week cost trend (see the query sketch after these lists)
    • Identify any anomalies or spikes
    • Check if any budget thresholds were triggered
    • Assign follow-up investigation for anomalies

    Monthly (30-minute review):

    • Deep dive into cost by service, team, and project
    • Compare actual vs. budgeted spend
    • Review optimization opportunities identified
    • Track savings from implemented optimizations
    • Plan next month's optimization initiatives

    Quarterly (1-hour planning):

    • Review reserved instance / savings plan coverage
    • Evaluate architecture decisions for cost efficiency
    • Update budgets and forecasts
    • Align with finance on cost allocation and chargeback
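
    For the weekly review, the week-over-week trend can come straight from the metering view used earlier. A Snowflake sketch (the eight-week window is arbitrary):

    -- Week-over-week credit trend for the weekly cost standup
    WITH weekly AS (
      SELECT
        DATE_TRUNC('week', start_time) AS week,
        SUM(credits_used) AS credits
      FROM snowflake.account_usage.warehouse_metering_history
      WHERE start_time >= DATEADD('week', -8, CURRENT_TIMESTAMP())
      GROUP BY 1
    )
    SELECT
      week,
      credits,
      LAG(credits) OVER (ORDER BY week) AS prev_week_credits,
      ROUND(100 * (credits - LAG(credits) OVER (ORDER BY week))
        / NULLIF(LAG(credits) OVER (ORDER BY week), 0), 1) AS pct_change
    FROM weekly
    ORDER BY week DESC;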

    Module 6: Real-World Cost Patterns

    This module applies the concepts from the previous modules to common data engineering patterns. These are the decisions you will face repeatedly in your career, and understanding their cost implications will help you make better choices.

    6.1 Incremental Processing vs. Full Refresh

    One of the most consequential architectural decisions in data engineering is whether to process data incrementally or do a full refresh. This decision has profound cost implications that scale with data volume.

    Full Refresh

    A full refresh drops and recreates a table by processing all source data from scratch.

    Cost characteristics:

    • Processes the entire dataset every run
    • Cost grows linearly with data volume
    • Simple to implement and reason about
    • No state management required
    • Guaranteed correctness (no edge cases with late-arriving data)

    Example (BigQuery on-demand):

    -- Full refresh: scan the entire source table every time
    CREATE OR REPLACE TABLE `project.dataset.daily_metrics` AS
    SELECT
      DATE(event_timestamp) AS metric_date,
      event_type,
      COUNT(*) AS event_count,
      COUNT(DISTINCT user_id) AS unique_users
    FROM `project.dataset.events`  -- 10 TB table
    GROUP BY 1, 2;
    
    -- Cost per run: 10 TB x $6.25/TB = $62.50
    -- Daily cost: $62.50
    -- Monthly cost: $1,875
    -- Annual cost: $22,500
    

    Incremental Processing

    Incremental processing only handles new or changed data since the last run.

    Cost characteristics:

    • Processes only new/changed data
    • Cost is proportional to the data change rate, not total volume
    • More complex to implement (needs watermarks, change detection)
    • Requires state management (tracking what has been processed)
    • Must handle late-arriving data and out-of-order events

    Example (BigQuery on-demand):

    -- Incremental: only scan today's partition
    MERGE INTO `project.dataset.daily_metrics` AS target
    USING (
      SELECT
        DATE(event_timestamp) AS metric_date,
        event_type,
        COUNT(*) AS event_count,
        COUNT(DISTINCT user_id) AS unique_users
      FROM `project.dataset.events`
      WHERE event_date = CURRENT_DATE()  -- Only today's partition (~30 GB)
      GROUP BY 1, 2
    ) AS source
    ON target.metric_date = source.metric_date
      AND target.event_type = source.event_type
    WHEN MATCHED THEN
      UPDATE SET
        event_count = source.event_count,
        unique_users = source.unique_users
    WHEN NOT MATCHED THEN
      INSERT (metric_date, event_type, event_count, unique_users)
      VALUES (source.metric_date, source.event_type, source.event_count, source.unique_users);
    
    -- Cost per run: 30 GB x $6.25/TB = $0.19
    -- Daily cost: $0.19
    -- Monthly cost: $5.70
    -- Annual cost: $68.40
    

    The incremental approach costs $22,431.60 less per year for this single pipeline. Across an organization with hundreds of pipelines, the savings are transformational.
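
    The main operational risk with incremental processing is late-arriving data. A common mitigation is a small lookback window: reprocess the last few daily partitions on every run instead of only today's, trading a little extra scanning for correctness. A sketch of the same MERGE source with a hypothetical three-day lookback:

    -- Reprocess the last few partitions to absorb late-arriving events (~120 GB instead of ~30 GB)
    SELECT
      DATE(event_timestamp) AS metric_date,
      event_type,
      COUNT(*) AS event_count,
      COUNT(DISTINCT user_id) AS unique_users
    FROM `project.dataset.events`
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
    GROUP BY 1, 2;
    
    -- Cost per run: ~120 GB x $6.25/TB = ~$0.73, still a small fraction of the $62.50 full refresh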

    dbt Incremental Models

    dbt (data build tool) provides a framework for incremental processing that handles much of the complexity:

    -- models/daily_metrics.sql
    {{
      config(
        materialized='incremental',
        unique_key=['metric_date', 'event_type'],
        partition_by={
          "field": "metric_date",
          "data_type": "date",
          "granularity": "day"
        },
        cluster_by=['event_type']
      )
    }}
    
    SELECT
      DATE(event_timestamp) AS metric_date,
      event_type,
      COUNT(*) AS event_count,
      COUNT(DISTINCT user_id) AS unique_users
    FROM {{ source('raw', 'events') }}
    
    {% if is_incremental() %}
      WHERE event_date > (SELECT MAX(metric_date) FROM {{ this }})
    {% endif %}
    
    GROUP BY 1, 2
    

    dbt handles the merge logic, partition management, and the incremental/full-refresh toggle automatically. Running dbt run processes incrementally; running dbt run --full-refresh rebuilds from scratch.

    When to Use Each Approach

    • Data volume: full refresh for small tables (< 100 GB); incremental for large ones (> 100 GB)
    • Update frequency: full refresh when most of the dataset changes each run; incremental when only a small % changes
    • Complexity tolerance: full refresh when you want simplicity; incremental when you can manage watermarks and merges
    • Late-arriving data: full refresh handles it naturally; incremental needs explicit handling
    • Cost sensitivity: full refresh when the budget is flexible; incremental when cost is a constraint
    • Source data: full refresh when there is no reliable timestamp or watermark; incremental when change tracking is reliable

    6.2 Materialized Views vs. Live Queries

    The decision to precompute results (materialize) vs. compute on-the-fly (live query) is a classic space-time trade-off with direct cost implications.

    Live Queries

    Live queries compute results from base tables every time they are executed.

    Cost characteristics:

    • No additional storage cost
    • Full compute cost on every execution
    • Always returns the latest data
    • Cost scales with query frequency

    Example:

    -- This query scans 5 TB every time it runs
    SELECT
      DATE(event_timestamp) AS day,
      country,
      COUNT(DISTINCT user_id) AS dau
    FROM `project.dataset.events`
    WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
    GROUP BY 1, 2;
    
    -- If run 50 times/day by different dashboard users:
    -- Daily cost: 50 x 5 TB x $6.25/TB = $1,562.50
    -- Monthly cost: $46,875
    

    Materialized Views / Pre-Aggregated Tables

    Materializing results stores precomputed aggregations that are queried instead of base tables.

    Cost characteristics:

    • Additional storage cost for the materialized data
    • Compute cost for refreshing the materialization
    • Much lower cost per query (smaller data to scan)
    • Data freshness depends on refresh frequency
    • Cost scales with refresh frequency, not query frequency

    Example:

    -- Materialized view refreshed automatically and incrementally (BigQuery)
    -- Incremental materialized views support only a limited set of aggregates;
    -- exact COUNT(DISTINCT) is generally not among them, so the approximate version is used here.
    CREATE MATERIALIZED VIEW `project.dataset.daily_country_dau`
    AS
    SELECT
      DATE(event_timestamp) AS day,
      country,
      APPROX_COUNT_DISTINCT(user_id) AS dau
    FROM `project.dataset.events`
    GROUP BY 1, 2;
    
    -- Storage cost: ~1 GB of aggregated data = $0.02/month
    -- Refresh cost: automatic, incremental
    -- Query cost per execution: 1 GB x $6.25/TB = $0.006
    
    -- If queried 50 times/day:
    -- Daily cost: 50 x $0.006 = $0.30
    -- Monthly cost: $9.00 + refresh costs
    

    The materialized approach reduces costs from $46,875/month to under $100/month for this pattern -- a 99.8% reduction.

    Decision Framework

    Materialize when:

    • The same expensive query is run frequently (more than 5-10 times per day)
    • The base data is much larger than the aggregated result
    • Slight data staleness is acceptable
    • Multiple users or dashboards consume the same aggregation

    Query live when:

    • The query is run rarely (less than once per day)
    • The base data is small
    • Real-time freshness is required
    • The query pattern is ad-hoc and unpredictable

    6.3 Development Environment Cost Control

    Development and testing environments are notorious for wasting money. They often mirror production in scale but receive a fraction of the traffic, resulting in massively over-provisioned resources. Reining them in is one of the quickest wins available: teams commonly trim 30-50% of their non-production spend this way.

    Using Smaller Datasets

    Production-scale data is rarely needed for development. Creating representative subsets reduces both compute and storage costs:

    -- BigQuery: Create a 1% sample of production data for development
    CREATE TABLE `dev_project.dataset.events_sample` AS
    SELECT *
    FROM `prod_project.dataset.events`
    WHERE MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 100) = 0  -- Deterministic 1% sample
      AND event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);  -- Last 30 days only
    
    -- Snowflake: Use SAMPLE for quick development datasets
    CREATE TABLE dev_db.schema.events_sample AS
    SELECT *
    FROM prod_db.schema.events
    SAMPLE (1);  -- 1% sample
    

    Snowflake Zero-Copy Clones for Development

    As covered in Module 3, zero-copy clones provide full-fidelity development environments at minimal cost:

    -- Create a dev database that shares production data through a zero-copy clone
    CREATE DATABASE dev_analytics CLONE prod_analytics;
    
    -- Each developer gets their own clone
    CREATE SCHEMA dev_analytics.developer_jane CLONE prod_analytics.public;
    

    The clone shares storage with production until data is modified, so the incremental cost is minimal. Combined with a small development warehouse (X-Small or Small), this provides a powerful and cost-effective development experience.
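
    A minimal sketch of such a development warehouse, sized small and suspending aggressively so it only bills while queries actually run (the name dev_wh and the one-minute timeout are illustrative):

    -- Small, aggressively auto-suspending warehouse for development use
    CREATE WAREHOUSE IF NOT EXISTS dev_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
      AUTO_RESUME = TRUE
      INITIALLY_SUSPENDED = TRUE;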

    BigQuery Sandboxes and Development Projects

    Use separate GCP projects for development with restricted quotas:

    # Terraform: Create a dedicated dev project so spend can be isolated and capped
    resource "google_project" "data_eng_dev" {
      name       = "data-eng-dev"
      project_id = "data-eng-dev-12345"
      org_id     = var.org_id
    
      labels = {
        team        = "data-engineering"
        environment = "development"
      }
    }
    
    # Enable BigQuery in the dev project; custom query-usage quotas can then be applied to it
    resource "google_project_service" "bigquery" {
      project = google_project.data_eng_dev.project_id
      service = "bigquery.googleapis.com"
    }
    

    Configure BigQuery cost controls to keep developers from accidentally running expensive queries. Project-level custom quotas can cap total daily query usage, and the per-query maximum bytes billed setting makes a query fail before executing, at no cost, if it would scan more than the limit. The setting is available on query jobs via the bq CLI, the API, and the client libraries:

    # Cap a single query at 10 GB billed; the job fails before running if it would exceed the limit
    bq query --use_legacy_sql=false --maximum_bytes_billed=10737418240 \
      'SELECT * FROM `project.dataset.large_table`'
    

    Terraform Workspace Isolation

    Use Terraform workspaces (or an environment variable, as in the example below) to manage separate infrastructure for each environment, with environment-specific sizing:

    # variables.tf
    variable "environment" {
      type = string
    }
    
    locals {
      env_config = {
        production = {
          emr_instance_type  = "m5.2xlarge"
          emr_instance_count = 10
          s3_lifecycle_days  = 365
          glue_workers       = 20
        }
        staging = {
          emr_instance_type  = "m5.xlarge"
          emr_instance_count = 3
          s3_lifecycle_days  = 30
          glue_workers       = 5
        }
        development = {
          emr_instance_type  = "m5.large"
          emr_instance_count = 2
          s3_lifecycle_days  = 7
          glue_workers       = 2
        }
      }
    
      config = local.env_config[var.environment]
    }
    
    resource "aws_emr_cluster" "etl" {
      name = "${var.environment}-etl-cluster"
    
      master_instance_group {
        instance_type  = local.config.emr_instance_type
        instance_count = 1
      }
    
      core_instance_group {
        instance_type  = local.config.emr_instance_type
        instance_count = local.config.emr_instance_count
      }
    
      tags = {
        environment = var.environment
        team        = "data-engineering"
      }
    }
    

    Scheduling Dev Resources to Shut Down Overnight

    Development resources sitting idle overnight and on weekends waste significant money. Automate their shutdown:

    AWS Lambda for scheduled EMR termination:

    import boto3
    from datetime import datetime
    
    def lambda_handler(event, context):
        """
        Runs daily at 8 PM via CloudWatch Events.
        Terminates all non-production EMR clusters.
        """
        emr = boto3.client('emr')
        clusters = emr.list_clusters(ClusterStates=['RUNNING', 'WAITING'])
    
        for cluster in clusters['Clusters']:
            details = emr.describe_cluster(ClusterId=cluster['Id'])
            tags = {t['Key']: t['Value'] for t in details['Cluster']['Tags']}
    
            if tags.get('environment') in ('development', 'staging'):
                emr.terminate_job_flows(JobFlowIds=[cluster['Id']])
                print(f"Terminated {cluster['Name']} ({cluster['Id']})")
    

    Snowflake warehouse scheduling:

    -- Create a task to suspend dev warehouses at 7 PM
    CREATE TASK suspend_dev_warehouses
      WAREHOUSE = admin_wh
      SCHEDULE = 'USING CRON 0 19 * * MON-FRI America/New_York'
    AS
      ALTER WAREHOUSE dev_wh SUSPEND;
    
    -- Create a task to resume dev warehouses at 8 AM
    CREATE TASK resume_dev_warehouses
      WAREHOUSE = admin_wh
      SCHEDULE = 'USING CRON 0 8 * * MON-FRI America/New_York'
    AS
      ALTER WAREHOUSE dev_wh RESUME;
    
    ALTER TASK suspend_dev_warehouses RESUME;
    ALTER TASK resume_dev_warehouses RESUME;
    

    Assuming a Medium dev warehouse (4 credits/hour at roughly $4 per credit, about $16/hour):

    • Running 24/7: $16 x 730 hours = $11,680/month
    • Running business hours only (10 hours/day, weekdays): $16 x 10 x 22 = $3,520/month
    • Savings: $8,160/month (70% reduction)
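
    To see how much a dev warehouse actually burns outside business hours, and to confirm the schedule is working once it is in place, a sketch against the same metering view (assuming the warehouse is named DEV_WH and roughly $4 per credit as above; adjust the hour window for your time zone):

    -- Credits burned by DEV_WH inside vs. outside business hours over the last 30 days
    SELECT
      CASE
        WHEN DAYNAME(start_time) IN ('Sat', 'Sun')
          OR HOUR(start_time) NOT BETWEEN 8 AND 18 THEN 'off_hours'
        ELSE 'business_hours'
      END AS period,
      SUM(credits_used) AS credits,
      ROUND(SUM(credits_used) * 4.00, 2) AS estimated_cost_usd
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE warehouse_name = 'DEV_WH'
      AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1;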

    Next Steps

    After completing this course on cloud cost optimization, you will be equipped to:

    1. Audit your current cloud spend - Use the INFORMATION_SCHEMA queries, Cost Explorer, and ACCOUNT_USAGE views to understand where money is going
    2. Implement quick wins - Partition pruning, column selection, auto-suspend, and lifecycle policies can deliver savings in days
    3. Build a cost monitoring practice - Set up budgets, alerts, and dashboards for ongoing visibility
    4. Advocate for cost-aware architecture - Bring cost analysis into design reviews and architecture decisions
    5. Explore Data Modeling - Understand how data model design impacts query costs
    6. Review Fundamentals - Revisit batch vs. streaming trade-offs with a cost lens

    Recommended Resources:

    1. Cloud FinOps by J.R. Storment and Mike Fuller

      • The definitive guide to the FinOps framework
      • Covers organizational practices, not just technical optimization
      • Essential reading for anyone managing cloud budgets
    2. FinOps Foundation (finops.org)

      • Free resources, frameworks, and community
      • FinOps Certified Practitioner certification
      • Case studies from major organizations
    3. Provider Documentation

    4. dbt Documentation on Incremental Models

    Remember: cloud cost optimization is not a one-time project -- it is an ongoing practice. Data volumes grow, pricing models change, and new services emerge. The data engineers who treat cost as a continuous concern, not an afterthought, deliver the most value to their organizations.
