Cloud Cost Optimization for Data Engineering
Master cloud cost management strategies for BigQuery, Snowflake, and AWS data services. Learn pricing models, optimization techniques, and monitoring best practices.
Module 1: FinOps Fundamentals for Data Engineers
This module introduces the financial dimension of cloud data engineering. As organizations migrate workloads to the cloud and data volumes grow exponentially, cost management has become a core competency for data engineers -- not just a concern for finance teams. Understanding how your architectural decisions translate into cloud bills is what separates a senior data engineer from a junior one.
1.1 Why Cost Matters in Data Engineering
Data engineering is consistently one of the largest contributors to cloud spending in modern organizations. Unlike a web application that serves relatively predictable traffic, data pipelines routinely process terabytes or petabytes of data, and small inefficiencies can compound into enormous costs at scale. A single poorly written query in BigQuery scanning a multi-petabyte table can cost hundreds or even thousands of dollars in a matter of minutes. A Snowflake warehouse left running over a weekend can burn through thousands of credits with no one watching.
The exponential growth of data makes this problem worse every quarter. IDC estimates that global data creation will exceed 180 zettabytes by 2025, and most organizations see their data volumes double every two to three years. As the data grows, so do the compute resources required to process it, the storage costs to retain it, and the network costs to move it between services and regions.
Data engineers sit at the intersection of all three major cloud cost categories:
Compute costs: Running Spark jobs, executing warehouse queries, powering ETL pipelines. Compute is typically the largest line item in a data engineering budget, and it is directly influenced by the efficiency of your code and architecture.
Storage costs: Maintaining raw data lakes, curated data warehouses, staging areas, backups, and historical snapshots. While storage is cheaper per unit than compute, it accumulates relentlessly and is often overlooked because it grows silently.
Network egress costs: Moving data between regions, across clouds, or out to external consumers. Egress charges are the hidden cost that surprises many teams, particularly when architectures span multiple cloud providers or regions.
The data engineer's role in cost management is unique because technical decisions made during pipeline design have a direct and measurable impact on the cloud bill. Choosing the right partitioning strategy, selecting appropriate warehouse sizes, deciding between materialized and live views, and designing incremental processing patterns are all engineering decisions with immediate financial consequences.
Organizations increasingly expect data engineers to think about cost as a first-class requirement alongside correctness, performance, and reliability. This is not about being cheap -- it is about being efficient, ensuring that every dollar spent on cloud infrastructure delivers maximum value to the business.
1.2 Understanding Cloud Pricing Models
Before you can optimize cloud costs, you need to understand how cloud providers charge for their services. While the details vary between providers, there are several universal pricing concepts that apply across all major platforms.
On-Demand Pricing
On-demand pricing is the default and most flexible model. You pay for exactly what you use, with no upfront commitment. Rates are typically billed per second, per minute, or per hour for compute, and per GB-month for storage.
Advantages:
- No upfront commitment or risk
- Scale up and down freely
- Simple to understand and budget for small workloads
- Ideal for variable or unpredictable workloads
Disadvantages:
- Highest per-unit cost
- Costs can spike unpredictably with usage growth
- No incentive to plan capacity
Reserved / Committed Use Pricing
All major cloud providers offer significant discounts (typically 30-70%) in exchange for committing to a certain level of usage over one to three years. AWS calls these Reserved Instances and Savings Plans, GCP offers Committed Use Discounts, and Snowflake offers pre-purchased capacity.
Advantages:
- Substantial cost savings for predictable workloads
- Predictable monthly costs for budgeting
- Can be combined with on-demand for burst capacity
Disadvantages:
- Requires upfront commitment (financial risk if needs change)
- Requires accurate capacity planning
- Unused reservations are wasted money
Spot / Preemptible Pricing
Cloud providers offer deeply discounted compute instances (60-90% off on-demand pricing) that can be reclaimed with short notice. AWS calls these Spot Instances, GCP calls them Preemptible VMs (or Spot VMs), and Azure calls them Spot VMs.
Advantages:
- Dramatic cost savings for fault-tolerant workloads
- Ideal for batch processing, ETL jobs, and data pipelines
- Can be combined with on-demand instances for resilience
Disadvantages:
- Instances can be terminated with as little as two minutes' notice
- Not suitable for long-running, stateful, or latency-sensitive workloads
- Requires application-level fault tolerance
The Three Pillars of Cloud Cost
When analyzing any cloud bill, costs break down into three fundamental categories:
| Category | Description | Typical % of Data Eng. Bill |
|---|---|---|
| Compute | Processing power (CPUs, GPUs, memory) | 50-70% |
| Storage | Data at rest (object storage, block storage, databases) | 15-30% |
| Network | Data transfer (egress, cross-region, cross-AZ) | 5-15% |
Understanding this breakdown for your specific workloads is the first step toward optimization. Many teams discover that one category dominates their bill, and focusing optimization efforts there yields the greatest returns.
Total Cost of Ownership (TCO)
TCO extends beyond the direct cloud bill to include the full cost of running data infrastructure:
- Direct cloud costs: Compute, storage, network as described above
- Engineering time: Hours spent building, maintaining, and debugging pipelines
- Operational overhead: Monitoring, incident response, on-call rotations
- Opportunity cost: What could your team build if they were not managing infrastructure?
- Data quality costs: The business impact of late, incorrect, or missing data
A more expensive managed service that reduces engineering time can have a lower TCO than a cheaper self-managed solution. For example, using AWS Glue at a higher per-DPU cost might be cheaper in TCO than running a self-managed Spark cluster on EMR, once you account for the operational overhead of cluster management.
1.3 The FinOps Framework
FinOps (Cloud Financial Operations) is an evolving discipline and cultural practice that brings financial accountability to the variable-spend model of cloud computing. It provides a structured approach for organizations to manage cloud costs without sacrificing speed or innovation.
The Three Phases of FinOps
Phase 1: Inform
The Inform phase is about creating visibility into cloud spending. You cannot optimize what you cannot see. Key activities include:
- Establishing cost allocation through tagging and account structure
- Building dashboards that show cost trends and anomalies
- Attributing costs to teams, projects, and business units
- Understanding the relationship between technical metrics and financial metrics
- Creating a shared vocabulary for discussing cloud costs
Phase 2: Optimize
The Optimize phase focuses on reducing waste and improving efficiency. This is where data engineers have the most direct impact:
- Right-sizing compute resources to match actual workload requirements
- Eliminating idle or unused resources
- Implementing reserved pricing for predictable baseline workloads
- Optimizing queries and pipelines for cost efficiency
- Choosing the most cost-effective services and architectures
Phase 3: Operate
The Operate phase embeds cost awareness into ongoing operations:
- Setting budgets and alerts for anomaly detection
- Establishing governance policies for resource provisioning
- Integrating cost considerations into design reviews and architecture decisions
- Automating cost optimization actions (e.g., auto-scaling, scheduled shutdowns)
- Conducting regular cost reviews and retrospectives
How Data Teams Fit Into FinOps
Data engineering teams are uniquely positioned in the FinOps model because they are both major consumers of cloud resources and possessors of the technical knowledge needed to optimize those costs. Effective FinOps for data teams requires:
Shared accountability: Every data engineer should understand the cost implications of their design decisions. Cost is not solely the domain of the finance team or a centralized FinOps function.
Cost as a metric: Just as data teams track pipeline latency, data quality, and uptime, they should track cost per pipeline run, cost per TB processed, and cost trends over time.
Architectural decision records: When choosing between approaches (e.g., full refresh vs. incremental, materialized view vs. live query), document the cost implications alongside performance and maintainability.
Collaboration with finance: Data engineers should participate in cloud billing reviews and help translate technical metrics into business context. A line item showing "$15,000 for BigQuery compute" is more actionable when it can be attributed to specific pipelines and business use cases.
Module 2: BigQuery Cost Optimization
Google BigQuery is a fully managed, serverless data warehouse that excels at analyzing large datasets quickly. Its serverless nature means you do not manage infrastructure, but it also means you must understand its pricing model deeply to avoid costly surprises. BigQuery's pricing is fundamentally different from traditional databases, and optimizing for cost requires a different mindset than optimizing for performance alone.
2.1 BigQuery Pricing Models
BigQuery offers two primary pricing models for compute, and understanding when to use each is one of the most impactful cost decisions you will make.
On-Demand Pricing (Per TB Scanned)
With on-demand pricing, you pay based on the amount of data your queries scan. As of the current pricing, this costs $6.25 per TB scanned in most regions. The first 1 TB of data scanned per month is free.
Key characteristics:
- You pay only when you run queries
- Cost is proportional to the data scanned, not the complexity of the query
- No cost when queries are not running
- Ideal for ad-hoc analysis and variable workloads
Example cost calculation:
Suppose you have a 10 TB table and run a query that scans the entire table:
Query cost = 10 TB x $6.25/TB = $62.50 per query execution
If this query runs daily as part of a pipeline:
Monthly cost = $62.50 x 30 days = $1,875/month
Now suppose you add a partition filter that reduces the scan to 500 GB:
Query cost = 0.5 TB x $6.25/TB = $3.13 per query execution
Monthly cost = $3.13 x 30 days = $93.75/month
That single optimization saves $1,781.25 per month -- a 95% reduction.
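The same arithmetic is easy to script when you want to estimate a pipeline's monthly cost before deploying it. A minimal Python sketch, using the $6.25/TB on-demand rate and the scan volumes from the example above:
# Sketch: estimate BigQuery on-demand cost from TB scanned per run.
ON_DEMAND_PRICE_PER_TB = 6.25

def monthly_cost_usd(tb_scanned_per_run: float, runs_per_month: int = 30) -> float:
    """Monthly on-demand cost for a query scanning tb_scanned_per_run TB."""
    return tb_scanned_per_run * ON_DEMAND_PRICE_PER_TB * runs_per_month

full_scan = monthly_cost_usd(10.0)    # $1,875.00
pruned = monthly_cost_usd(0.5)        # $93.75
print(f"full: ${full_scan:,.2f}  pruned: ${pruned:,.2f}  saved: ${full_scan - pruned:,.2f}")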
BigQuery Editions (Slot-Based Pricing)
BigQuery Editions use a capacity-based model where you purchase "slots" -- units of computational capacity. You pay for slot-hours regardless of how much data your queries scan.
BigQuery offers three editions:
| Edition | Use Case | Baseline Pricing (per slot-hour) |
|---|---|---|
| Standard | Development, ad-hoc | ~$0.04 |
| Enterprise | Production workloads | ~$0.06 |
| Enterprise Plus | Mission-critical, multi-region | ~$0.10 |
Slots can be purchased as autoscaling (pay-as-you-go, billed per second) or baseline + autoscaling (committed baseline with burst capacity).
Example cost calculation:
Suppose your team consistently uses 500 slots for 8 hours per day:
Daily cost (autoscaling, Enterprise) = 500 slots x 8 hours x $0.06 = $240/day
Monthly cost = $240 x 30 = $7,200/month
With a committed baseline of 500 slots (annual commitment, ~40% discount):
Monthly cost = 500 slots x 730 hours x $0.06 x 0.6 = ~$13,140/month
Wait -- that looks more expensive. The key insight is that baseline commitments cover 24/7, so they only make sense if you are using those slots around the clock. For 8 hours/day, autoscaling is cheaper.
Choosing the Right Model
Use on-demand when:
- Monthly data scanned is under 5-10 TB
- Workloads are unpredictable or infrequent
- You are in early stages and still learning your usage patterns
- Ad-hoc analysis is the primary use case
Use Editions (slot-based) when:
- Monthly data scanned exceeds 10-20 TB consistently
- You want predictable, budgetable costs
- You have many concurrent users or queries
- You need performance guarantees (slots provide dedicated capacity)
Break-even analysis:
To find your break-even point, compare your monthly on-demand spend with what you would pay for the equivalent slot capacity:
Break-even TB scanned = (Slot cost per month) / $6.25
For example, if 100 Enterprise autoscaling slots cost approximately $4,380/month:
Break-even = $4,380 / $6.25 = ~700 TB scanned per month
If your queries scan more than 700 TB per month, slot-based pricing is likely cheaper.
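To sanity-check the break-even for your own workload, the comparison can be scripted; a small sketch using the same approximate prices (substitute your region's actual rates):
# Sketch: break-even between on-demand ($/TB) and Editions slots ($/slot-hour).
ON_DEMAND_PER_TB = 6.25
HOURS_PER_MONTH = 730

def slot_monthly_cost(slots: int, price_per_slot_hour: float) -> float:
    """Approximate monthly cost of slots running around the clock."""
    return slots * price_per_slot_hour * HOURS_PER_MONTH

def break_even_tb(slot_cost_per_month: float) -> float:
    """TB scanned per month at which slot pricing becomes cheaper than on-demand."""
    return slot_cost_per_month / ON_DEMAND_PER_TB

cost = slot_monthly_cost(100, 0.06)   # 100 Enterprise slots, 24/7: ~$4,380/month
print(f"slot cost: ${cost:,.0f}/month, break-even: {break_even_tb(cost):,.0f} TB/month")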
2.2 Query Optimization for Cost
In BigQuery's on-demand model, cost is directly tied to the amount of data scanned. Every optimization that reduces the bytes scanned reduces your bill. Even in slot-based models, efficient queries complete faster, freeing slots for other workloads.
Avoid SELECT *
The single most impactful habit for BigQuery cost optimization is to never use SELECT * in production queries. Because BigQuery uses columnar storage, it only reads the columns you reference. Selecting all columns forces BigQuery to read the entire table.
Before (expensive):
-- Scans ALL columns in the table
-- If the table has 50 columns but you only need 3, you're paying for 47 unnecessary columns
SELECT *
FROM `project.dataset.events`
WHERE event_date = '2024-01-15'
After (optimized):
-- Scans only the 3 columns you need
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.events`
WHERE event_date = '2024-01-15'
On a table with 50 columns averaging 100 bytes each and 1 billion rows, SELECT * scans approximately 5 TB, while selecting 3 specific columns scans approximately 300 GB -- a 94% reduction in cost.
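Before running an expensive query, you can also ask BigQuery how much it would scan without executing it. A sketch using the google-cloud-bigquery client's dry-run mode (the project and table names are placeholders):
# Sketch: dry-run a query to see bytes scanned (and estimated cost) up front.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID
sql = """
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.events`
WHERE event_date = '2024-01-15'
"""
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)  # nothing runs; only an estimate
tb = job.total_bytes_processed / 1024 ** 4
print(f"Would scan {tb:.3f} TB (~${tb * 6.25:.2f} on-demand)")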
Partition Pruning
Partitioned tables in BigQuery divide data into segments based on a column value (typically a date or timestamp). When your query includes a filter on the partition column, BigQuery only scans the relevant partitions.
Creating a partitioned table:
CREATE TABLE `project.dataset.events`
(
event_id STRING,
user_id STRING,
event_type STRING,
event_timestamp TIMESTAMP,
event_date DATE,
payload JSON
)
PARTITION BY event_date
OPTIONS (
partition_expiration_days = 365,
require_partition_filter = TRUE
);
The require_partition_filter = TRUE option is a powerful safety net -- it prevents queries from running without a partition filter, ensuring that no one accidentally scans the entire table.
Query with partition pruning:
-- Only scans partitions for the specified date range
SELECT user_id, event_type, event_timestamp
FROM `project.dataset.events`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
If the table contains 3 years of data (approximately 1,095 daily partitions), filtering to one month means BigQuery scans roughly 31/1095 = 2.8% of the data.
Clustering for Filtered Queries
Clustering organizes data within each partition based on the values of specified columns. When you filter on clustered columns, BigQuery can skip irrelevant blocks of data within a partition, further reducing the bytes scanned.
CREATE TABLE `project.dataset.events`
(
event_id STRING,
user_id STRING,
event_type STRING,
event_timestamp TIMESTAMP,
event_date DATE,
country STRING
)
PARTITION BY event_date
CLUSTER BY country, event_type;
Query benefiting from clustering:
-- Partition pruning on event_date AND block pruning on country and event_type
SELECT user_id, event_timestamp
FROM `project.dataset.events`
WHERE event_date = '2024-01-15'
AND country = 'US'
AND event_type = 'purchase'
Clustering is most effective when the clustered columns have high cardinality and are frequently used in WHERE clauses or JOIN conditions. You can cluster on up to four columns, and the order matters -- put the most frequently filtered column first.
Materialized Views
Materialized views precompute and store query results, allowing BigQuery to read from the materialized data instead of scanning the base tables. BigQuery automatically maintains materialized views, refreshing them when the underlying data changes.
CREATE MATERIALIZED VIEW `project.dataset.daily_event_summary`
AS
SELECT
event_date,
event_type,
country,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM `project.dataset.events`
GROUP BY event_date, event_type, country;
When you query the base table with a compatible aggregation, BigQuery automatically routes the query to the materialized view:
-- Querying the base table; BigQuery can rewrite this to read the materialized view
SELECT event_date, COUNT(*) AS total_events
FROM `project.dataset.events`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY event_date
Materialized views incur storage costs and maintenance overhead, so they are most valuable for frequently run aggregation queries on large tables.
Auditing Query Costs with INFORMATION_SCHEMA
BigQuery provides detailed query metadata through the INFORMATION_SCHEMA views, allowing you to identify expensive queries and track cost trends.
Find the most expensive queries in the last 30 days:
SELECT
user_email,
job_id,
query,
total_bytes_processed,
ROUND(total_bytes_processed / POW(1024, 4), 2) AS tb_processed,
ROUND(total_bytes_processed / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd,
creation_time,
total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
ORDER BY total_bytes_processed DESC
LIMIT 50;
Find cost by user over the last 7 days:
SELECT
user_email,
COUNT(*) AS query_count,
ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS total_tb_scanned,
ROUND(SUM(total_bytes_processed) / POW(1024, 4) * 6.25, 2) AS estimated_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
AND state = 'DONE'
GROUP BY user_email
ORDER BY estimated_cost_usd DESC;
Running these audits regularly (weekly or even daily) helps you catch cost anomalies early, identify users or pipelines that need optimization, and track the impact of your optimization efforts over time.
2.3 Storage Optimization
BigQuery charges for data stored in its tables. While storage is cheaper than compute, it accumulates over time and can become a significant cost driver if left unmanaged.
Active vs. Long-Term Storage
BigQuery automatically classifies table data into two pricing tiers:
| Tier | Condition | Price (per GB/month) |
|---|---|---|
| Active | Modified in the last 90 days | ~$0.02 |
| Long-term | Not modified for 90+ days | ~$0.01 |
The transition to long-term storage is automatic -- you do not need to do anything. Be aware, however, that modifying data resets the 90-day clock. For a non-partitioned table, any modification (even appending a single row) resets the clock for the entire table, so append-only tables with daily loads stay at the active rate. For a partitioned table, the clock is tracked per partition, so appending to today's partition does not reset older partitions.
Strategy: Partition large historical tables by date so untouched partitions age into long-term pricing automatically. For non-partitioned tables, consider separating current data (which is frequently appended to) from historical data (which is rarely modified) so the historical table can transition to long-term storage.
Table Expiration Policies
Setting expiration policies ensures that temporary or staging tables do not accumulate indefinitely:
-- Set table expiration to 7 days from creation
CREATE TABLE `project.dataset.staging_events`
(
event_id STRING,
event_data JSON
)
OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
);
-- Set default expiration for all new tables in a dataset
ALTER SCHEMA `project.dataset`
SET OPTIONS (
default_table_expiration_days = 30
);
For partitioned tables, you can set partition-level expiration, which is even more powerful:
-- Automatically delete partitions older than 365 days
CREATE TABLE `project.dataset.events`
(
event_id STRING,
event_date DATE
)
PARTITION BY event_date
OPTIONS (
partition_expiration_days = 365
);
This approach ensures that your table always contains at most one year of data, with old partitions automatically cleaned up.
Removing Duplicate Data
Duplicate data wastes storage and increases query costs. BigQuery does not enforce unique constraints, so duplicates can easily creep in through pipeline retries or overlapping data loads.
Identify duplicates:
SELECT event_id, COUNT(*) as duplicate_count
FROM `project.dataset.events`
GROUP BY event_id
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC
LIMIT 100;
Deduplicate using ROW_NUMBER():
-- Create a deduplicated version of the table
CREATE OR REPLACE TABLE `project.dataset.events_deduped` AS
SELECT * EXCEPT(row_num)
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_timestamp DESC) as row_num
FROM `project.dataset.events`
)
WHERE row_num = 1;
Choosing the Right Partitioning Strategy
The choice of partition column and granularity has a significant impact on both query cost and storage efficiency:
| Strategy | Best For | Partition Count |
|---|---|---|
| Daily partitioning on date | Time-series data with daily queries | ~365/year |
| Monthly partitioning | Historical analysis with monthly grain | ~12/year |
| Integer range partitioning | Non-time-series data (e.g., customer segments) | Varies |
| Ingestion-time partitioning | When no natural partition column exists | ~365/year |
Guidelines:
- Choose a column that is frequently used in WHERE clauses
- Ensure partitions are not too small (under 1 GB) or too large (over 5 TB)
- Daily partitioning is the most common and works well for most time-series data
- Always require partition filters on large tables to prevent accidental full scans
Module 3: Snowflake Cost Optimization
Snowflake's architecture separates storage and compute, giving you independent control over each. This separation is powerful but introduces a pricing model that differs significantly from both traditional databases and other cloud warehouses. Understanding Snowflake's credit-based system is essential for cost management.
3.1 Understanding Snowflake Credits
Snowflake charges for compute using a unit called a "credit." The cost of a credit depends on your edition and region, but typically ranges from $2 to $4 per credit. Credits are consumed whenever a virtual warehouse is running, regardless of whether it is actively processing queries.
Warehouse Sizes and Credit Consumption
Each warehouse size consumes a fixed number of credits per hour:
| Warehouse Size | Credits/Hour | Approx. Nodes |
|---|---|---|
| X-Small (XS) | 1 | 1 |
| Small (S) | 2 | 2 |
| Medium (M) | 4 | 4 |
| Large (L) | 8 | 8 |
| X-Large (XL) | 16 | 16 |
| 2X-Large | 32 | 32 |
| 3X-Large | 64 | 64 |
| 4X-Large | 128 | 128 |
Notice that each size step doubles the credit consumption per hour. Upgrading from Medium to Large therefore doubles your hourly spend; the total cost of a query stays flat only if it also finishes roughly twice as fast.
Calculating Query Costs
The cost of a query depends on the warehouse size, how long it runs, and the per-credit price. Snowflake bills in one-second increments with a minimum of 60 seconds per warehouse resume.
Example:
Warehouse size: Medium (4 credits/hour)
Query runtime: 45 seconds
Credit price: $3.00
Cost = (4 credits/hour) x (45 seconds / 3600 seconds) x $3.00
Cost = 4 x 0.0125 x $3.00
Cost = $0.15 per query execution
Now consider running the same query on an X-Large warehouse where it completes in 12 seconds:
Cost = (16 credits/hour) x (12 seconds / 3600 seconds) x $3.00
Cost = 16 x 0.00333 x $3.00
Cost = $0.16 per query execution
Even though the XL warehouse is 4x more powerful and finishes 3.75x faster, the cost is nearly identical for this single query (assuming the warehouse is already running; if it resumes just for this query, the 60-second minimum applies and the 12-second XL run bills as 60 seconds, or $0.80). However, for a 10-minute query on Medium that takes 2.5 minutes on XL:
Medium: 4 x (600/3600) x $3.00 = $2.00
XL: 16 x (150/3600) x $3.00 = $2.00
The costs remain similar because Snowflake's linear scaling means larger warehouses finish proportionally faster. The real savings come from auto-suspend behavior -- a larger warehouse that finishes quickly and suspends sooner can actually cost less than a smaller warehouse that stays active longer processing a queue of queries.
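The rate math, including the 60-second minimum on a cold resume, is easy to wrap in a helper. A sketch assuming $3.00 per credit and the credits-per-hour figures from the table above:
# Sketch: cost of a single query by warehouse size, assuming $3.00/credit.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16,
                    "2XL": 32, "3XL": 64, "4XL": 128}
CREDIT_PRICE = 3.00

def query_cost(size: str, runtime_seconds: float, cold_start: bool = False) -> float:
    """Cost of one query; a cold resume bills a minimum of 60 seconds."""
    billed = max(runtime_seconds, 60) if cold_start else runtime_seconds
    return CREDITS_PER_HOUR[size] * (billed / 3600) * CREDIT_PRICE

print(round(query_cost("M", 600), 2))                   # 2.0
print(round(query_cost("XL", 150), 2))                  # 2.0
print(round(query_cost("XL", 12, cold_start=True), 2))  # 0.8 -- the 60-second minimum dominates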
3.2 Warehouse Management
Effective warehouse management is the single most impactful lever for Snowflake cost optimization. A poorly configured warehouse can burn credits 24/7 with no one noticing.
Auto-Suspend and Auto-Resume
Auto-suspend shuts down a warehouse after a period of inactivity. Auto-resume starts it automatically when a query arrives. These two settings are your first line of defense against wasted credits.
-- Create a warehouse with aggressive auto-suspend (60 seconds)
CREATE WAREHOUSE analytics_wh
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 60 -- Suspend after 60 seconds of inactivity
AUTO_RESUME = TRUE -- Automatically resume when queries arrive
INITIALLY_SUSPENDED = TRUE;
-- Modify an existing warehouse
ALTER WAREHOUSE etl_wh SET
AUTO_SUSPEND = 120
AUTO_RESUME = TRUE;
Recommended auto-suspend values by use case:
| Use Case | Auto-Suspend (seconds) | Rationale |
|---|---|---|
| ETL / Batch pipelines | 60 | Jobs are scheduled; no waiting users |
| BI / Dashboard queries | 300 | Users may run follow-up queries |
| Ad-hoc analysis | 300-600 | Analysts explore iteratively |
| Data science / ML | 600-900 | Long think-time between queries |
The 60-second minimum charge per resume means that warehouses resuming frequently can accumulate costs. If your ETL pipeline runs 100 small queries, each separated by two minutes of inactivity, the warehouse will suspend and resume 100 times, and each cycle bills the query time plus roughly a minute of idle tail and resume minimum -- around 100 minutes of overhead instead of the actual compute time. In this case, batching the queries so they run back-to-back is more cost-effective than paying that overhead on every resume, and it keeps the warehouse cache warm between queries.
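To see the trade-off concretely, here is a rough sketch comparing suspend-after-every-query against running the batch back-to-back (Medium warehouse, $3.00/credit, and ~10-second queries assumed):
# Sketch: 100 short queries with ~2-minute gaps on a Medium warehouse.
CREDITS_PER_HOUR = 4
CREDIT_PRICE = 3.00
rate_per_hour = CREDITS_PER_HOUR * CREDIT_PRICE   # $/hour while running

# Spread out: each query bills ~10 s of work plus a 60 s idle tail before suspend
spread_out = 100 * (10 + 60) / 3600 * rate_per_hour

# Batched: all 100 queries run back-to-back, with a single 60 s idle tail at the end
batched = (100 * 10 + 60) / 3600 * rate_per_hour

print(f"spread out: ${spread_out:.2f}")   # ~$23.33
print(f"batched:    ${batched:.2f}")      # ~$3.53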
Right-Sizing Warehouses
Different workloads have different resource requirements. Using a single warehouse for all workloads means you are either over-provisioning (wasting money) or under-provisioning (degrading performance).
-- Dedicated warehouse for ETL: needs power, runs intermittently
CREATE WAREHOUSE etl_wh
WITH WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;
-- Dedicated warehouse for BI dashboards: moderate, concurrent users
CREATE WAREHOUSE bi_wh
WITH WAREHOUSE_SIZE = 'MEDIUM'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
-- Dedicated warehouse for ad-hoc analysis: small, variable usage
CREATE WAREHOUSE adhoc_wh
WITH WAREHOUSE_SIZE = 'SMALL'
AUTO_SUSPEND = 600
AUTO_RESUME = TRUE;
How to determine the right size:
- Start with a Small warehouse for new workloads
- Monitor query performance using the QUERY_HISTORY view
- If queries are queuing or spilling to remote storage, try the next size up
- If the warehouse is idle most of the time, try a smaller size
- Check the query profile for "bytes spilled to local storage" and "bytes spilled to remote storage" -- spilling is a sign the warehouse is undersized
-- Check for spilling in recent queries
SELECT
query_id,
warehouse_name,
warehouse_size,
execution_time / 1000 as execution_seconds,
bytes_spilled_to_local_storage,
bytes_spilled_to_remote_storage,
query_text
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
AND bytes_spilled_to_remote_storage > 0
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 20;
Multi-Cluster Warehouses
For workloads with variable concurrency (e.g., many BI users querying simultaneously during business hours), multi-cluster warehouses automatically scale out by adding clusters as demand increases.
CREATE WAREHOUSE bi_wh
WITH WAREHOUSE_SIZE = 'MEDIUM'
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 4
SCALING_POLICY = 'STANDARD'
AUTO_SUSPEND = 300
AUTO_RESUME = TRUE;
Multi-cluster warehouses prevent query queuing without requiring a permanently large warehouse. The STANDARD scaling policy adds clusters when queries start queuing and removes them after the load decreases.
Cost implication: A 4-cluster Medium warehouse at peak costs the same as a single X-Large warehouse (both consume 16 credits/hour), but it automatically scales back down to 1 cluster during off-peak hours.
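A rough sketch of that behavior, using an invented hourly profile of active clusters over a 12-hour business day and an assumed $3.00/credit price:
# Sketch: daily credits for an autoscaling 4-cluster Medium vs. a fixed X-Large.
MEDIUM_CREDITS_PER_HOUR, XL_CREDITS_PER_HOUR = 4, 16
CREDIT_PRICE = 3.00

# Hypothetical number of active clusters in each hour of a 12-hour business day
clusters_by_hour = [1, 1, 2, 3, 4, 4, 4, 3, 2, 2, 1, 1]

multi_cluster = sum(c * MEDIUM_CREDITS_PER_HOUR for c in clusters_by_hour)
fixed_xl = XL_CREDITS_PER_HOUR * len(clusters_by_hour)

print(f"multi-cluster Medium: {multi_cluster} credits (${multi_cluster * CREDIT_PRICE:.2f})")
print(f"fixed X-Large:        {fixed_xl} credits (${fixed_xl * CREDIT_PRICE:.2f})")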
Resource Monitors
Resource monitors set credit consumption limits and trigger actions (alerts or warehouse suspension) when thresholds are reached.
-- Create a resource monitor with a monthly budget
CREATE RESOURCE MONITOR monthly_etl_budget
WITH CREDIT_QUOTA = 5000
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 75 PERCENT DO NOTIFY
ON 90 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND;
-- Apply the monitor to a specific warehouse
ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = monthly_etl_budget;
-- Create an account-level resource monitor
CREATE RESOURCE MONITOR account_monthly_budget
WITH CREDIT_QUOTA = 20000
FREQUENCY = MONTHLY
START_TIMESTAMP = IMMEDIATELY
TRIGGERS
ON 80 PERCENT DO NOTIFY
ON 95 PERCENT DO NOTIFY
ON 100 PERCENT DO SUSPEND_IMMEDIATE;
Resource monitors are essential governance controls. Without them, a runaway query or misconfigured pipeline can burn through your entire budget. The SUSPEND_IMMEDIATE action will cancel running queries, while SUSPEND allows currently running queries to complete before suspending.
3.3 Query and Storage Optimization
Beyond warehouse management, Snowflake offers several features for optimizing query performance and storage costs.
Clustering Keys
Snowflake automatically organizes data into micro-partitions, but for very large tables (multi-TB), the default organization may not align with your query patterns. Clustering keys tell Snowflake to co-locate related data, improving partition pruning.
-- Add clustering keys to an existing table
ALTER TABLE events CLUSTER BY (event_date, country);
-- Verify clustering efficiency
SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date, country)');
Clustering is an ongoing, automatic process that consumes credits in the background. It is most cost-effective for:
- Tables larger than 1 TB
- Tables with well-defined query patterns (consistent WHERE clause columns)
- Tables where the natural data loading order does not match query patterns
Avoid over-clustering: Adding clustering keys to small tables or tables that are already well-ordered wastes credits. Monitor the automatic clustering costs in your account usage views.
Micro-Partition Pruning
Snowflake stores data in micro-partitions (50-500 MB of uncompressed data each, stored compressed). When a query includes a filter, Snowflake can skip entire micro-partitions that do not contain matching data. Effective pruning depends on data co-location, which is influenced by clustering.
-- Check pruning effectiveness for recent queries against the table
SELECT
  query_id,
  partitions_scanned,
  partitions_total,
  bytes_scanned,
  start_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%events%'
ORDER BY start_time DESC
LIMIT 10;
-- A low partitions_scanned relative to partitions_total means pruning is working
Time Travel and Fail-Safe Cost Implications
Snowflake's Time Travel feature allows you to access historical data that has been modified or deleted. Fail-safe provides an additional recovery period. Both features consume storage.
| Feature | Duration | Storage Cost |
|---|---|---|
| Time Travel | 0-90 days (configurable) | Charged for changed data |
| Fail-safe | 7 days (non-configurable) | Charged for changed data |
For tables with frequent updates or deletes, Time Travel storage can be substantial because Snowflake retains the old versions of changed micro-partitions.
-- Reduce Time Travel retention for staging tables
ALTER TABLE staging_events SET DATA_RETENTION_TIME_IN_DAYS = 1;
-- Set to 0 for truly temporary tables (disables Time Travel)
ALTER TABLE tmp_processing SET DATA_RETENTION_TIME_IN_DAYS = 0;
-- Keep longer retention for critical production tables
ALTER TABLE production_facts SET DATA_RETENTION_TIME_IN_DAYS = 30;
Recommendation: Use the minimum Time Travel retention that meets your recovery needs. For staging and temporary tables, 0-1 days is usually sufficient. For production tables, 7-14 days provides a reasonable safety net without excessive storage costs.
Zero-Copy Cloning
Zero-copy cloning creates an instant copy of a table, schema, or database without physically duplicating the data. The clone shares the underlying micro-partitions with the original, so it consumes no additional storage until data in either the clone or the original is modified.
-- Clone a production database for development
CREATE DATABASE dev_analytics CLONE prod_analytics;
-- Clone a specific table for testing
CREATE TABLE events_test CLONE events;
Zero-copy cloning is a powerful cost optimization tool for:
- Development environments: Give each developer a full clone of production data without doubling storage costs
- Testing: Create test datasets that mirror production
- Backups: Take point-in-time snapshots before risky operations
- Data sharing: Share data across teams without copying
The cost only increases as the clone and original diverge (i.e., as modifications create new micro-partitions that are not shared).
Module 4: AWS Data Service Costs
AWS offers a wide range of data services, each with its own pricing model. For data engineers, the key services to understand are S3 for storage, and EMR, Glue, and Athena for compute. Making the right choice between these services can save tens of thousands of dollars per month.
4.1 S3 Storage Economics
Amazon S3 is the foundation of nearly every AWS data architecture. Understanding its tiered storage model and pricing nuances is essential for cost management.
Storage Tiers
S3 offers multiple storage classes optimized for different access patterns:
| Storage Class | Monthly Cost (per GB) | Min. Storage Duration | Retrieval Cost | Use Case |
|---|---|---|---|---|
| Standard | $0.023 | None | None | Frequently accessed data |
| Intelligent-Tiering | $0.023 + monitoring fee | None | None | Unknown access patterns |
| Standard-IA | $0.0125 | 30 days | $0.01/GB | Infrequent access |
| One Zone-IA | $0.01 | 30 days | $0.01/GB | Non-critical, infrequent |
| Glacier Instant | $0.004 | 90 days | $0.03/GB | Archive, instant access |
| Glacier Flexible | $0.0036 | 90 days | $0.01/GB (standard retrieval) | Archive, minutes-hours |
| Glacier Deep Archive | $0.00099 | 180 days | $0.02/GB | Long-term archive |
The cost difference between Standard and Deep Archive is over 23x. For a 100 TB data lake, the annual storage cost difference is:
Standard: 100,000 GB x $0.023 x 12 = $27,600/year
Deep Archive: 100,000 GB x $0.00099 x 12 = $1,188/year
Savings: $26,412/year (96% reduction)
Of course, Deep Archive data takes hours to retrieve, so it is only appropriate for data that is rarely or never accessed.
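A quick sketch of annual storage cost by class, using the per-GB prices from the table (retrieval, request, and minimum-duration charges excluded):
# Sketch: annual S3 storage cost per class for a 100 TB data lake.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def annual_storage_cost(gb: float, storage_class: str) -> float:
    return gb * PRICE_PER_GB_MONTH[storage_class] * 12

for cls in PRICE_PER_GB_MONTH:
    print(f"{cls:<13} ${annual_storage_cost(100_000, cls):>10,.2f}/year")
# STANDARD $27,600.00 ... DEEP_ARCHIVE $1,188.00 -- matching the figures above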
Lifecycle Policies
S3 Lifecycle policies automate the transition of data between storage classes based on age:
{
"Rules": [
{
"ID": "DataLakeLifecycle",
"Status": "Enabled",
"Filter": {
"Prefix": "raw/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER_IR"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
],
"Expiration": {
"Days": 2555
}
}
]
}
This policy moves raw data to Standard-IA after 30 days, Glacier Instant Retrieval after 90 days, Deep Archive after 1 year, and deletes it after 7 years. For a data lake ingesting 1 TB/day, this policy can reduce storage costs by 60-80% compared to keeping everything in Standard.
Terraform implementation:
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
rule {
id = "raw-data-lifecycle"
status = "Enabled"
filter {
prefix = "raw/"
}
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER_IR"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
expiration {
days = 2555
}
}
}
Request Pricing and Data Transfer
S3 charges not just for storage but also for API requests and data transfer:
| Operation | Cost |
|---|---|
| PUT, COPY, POST, LIST | $0.005 per 1,000 requests |
| GET, SELECT | $0.0004 per 1,000 requests |
| Data Transfer OUT to Internet | $0.09/GB (first 10 TB/month) |
| Data Transfer to same region | Free |
| Data Transfer to different region | $0.02/GB |
Request costs matter when you have many small files. A pipeline that processes 1 million small files makes 1 million GET requests, costing $0.40 just in request fees. Compacting small files into larger ones (e.g., using Apache Spark's coalesce() or repartition()) reduces both request costs and processing time.
Data transfer is the hidden cost killer. Moving 10 TB out of AWS to the internet costs $900. Moving data between regions costs $200 per 10 TB. This is why data locality matters -- keep compute and storage in the same region.
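A back-of-the-envelope sketch of the request and egress arithmetic above (file counts and sizes are illustrative):
# Sketch: request and transfer costs from the table above (illustrative volumes).
GET_PRICE_PER_1000 = 0.0004
EGRESS_PER_GB = 0.09

def get_request_cost(num_objects: int) -> float:
    return num_objects / 1000 * GET_PRICE_PER_1000

# e.g., 1M ~0.5 MB files vs. the same data compacted into ~4,000 x 128 MB files
print(f"small files:  ${get_request_cost(1_000_000):.2f} in GET fees per full read")
print(f"compacted:    ${get_request_cost(4_000):.4f} in GET fees per full read")

# Egress: sending 10 TB (~10,000 GB) out to the internet
print(f"egress 10 TB: ${10_000 * EGRESS_PER_GB:,.2f}")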
S3 Select
S3 Select allows you to retrieve a subset of data from an object using SQL expressions, reducing the amount of data transferred and processed:
import boto3
s3_client = boto3.client('s3')
response = s3_client.select_object_content(
Bucket='my-data-lake',
Key='events/2024/01/events.parquet',
Expression="SELECT user_id, event_type FROM s3object WHERE country = 'US'",
ExpressionType='SQL',
InputSerialization={
'Parquet': {}
},
OutputSerialization={
'JSON': {}
}
)
S3 Select is priced at $0.002 per GB scanned and $0.0007 per GB returned. For large Parquet or CSV files where you only need a subset of columns or rows, this can be significantly cheaper than downloading the entire file.
4.2 Compute: EMR vs. Glue vs. Athena
AWS offers three primary services for data processing, each with distinct pricing models, operational characteristics, and ideal use cases.
Amazon EMR (Elastic MapReduce)
EMR provides managed Hadoop/Spark clusters. You pay for EC2 instances plus an EMR surcharge (typically 15-25% above EC2 pricing).
Pricing model:
EMR cost = EC2 instance cost + EMR surcharge
Example: Medium Spark cluster
3x m5.xlarge instances (1 master + 2 core)
EC2 cost: 3 x $0.192/hour = $0.576/hour
EMR surcharge: 3 x $0.048/hour = $0.144/hour
Total: $0.720/hour
Running 4 hours/day for ETL:
Monthly cost: $0.720 x 4 x 30 = $86.40/month
With Spot Instances for core/task nodes:
1x m5.xlarge on-demand (master): $0.192 + $0.048 = $0.240/hour
2x m5.xlarge spot (core): 2 x ($0.058 + $0.048) = $0.212/hour
Total: $0.452/hour (37% savings)
Monthly cost: $0.452 x 4 x 30 = $54.24/month
EMR is ideal for complex Spark jobs, custom frameworks, or workloads requiring fine-grained control over the execution environment.
AWS Glue
Glue is a fully serverless ETL service. You pay per DPU-hour (Data Processing Unit), where each DPU provides 4 vCPUs and 16 GB of memory.
Pricing model:
Glue cost = Number of DPUs x Runtime (hours) x $0.44/DPU-hour
Example:
10 DPUs running for 30 minutes:
Cost = 10 x 0.5 x $0.44 = $2.20 per job run
Running daily:
Monthly cost: $2.20 x 30 = $66/month
Glue 4.0 supports auto-scaling, which means it starts with a minimum number of workers and scales up based on workload. This avoids paying for idle DPUs during the less intensive phases of a job.
# Glue job boilerplate with auto-scaling (imports included so the snippet is runnable)
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Auto-scaling is enabled through job parameters, for example:
#   --enable-auto-scaling true
#   --number-of-workers 2   (minimum)
#   --worker-type G.1X
Glue is ideal for teams that want serverless simplicity, built-in data catalog integration, and do not need fine-grained infrastructure control.
Amazon Athena
Athena is a serverless query engine that charges per TB scanned, similar to BigQuery's on-demand model.
Pricing model:
Athena cost = $5.00 per TB scanned
Example:
Query scanning 500 GB of Parquet data:
Cost = 0.5 TB x $5.00 = $2.50
100 such queries per month:
Monthly cost: $2.50 x 100 = $250/month
Athena benefits enormously from Parquet/ORC formats (columnar compression reduces data scanned), partitioned data (Hive-style partitions in S3), and bucketing (similar to BigQuery clustering).
Cost Comparison
Let us compare the three services for a common scenario: processing 500 GB of data daily.
| Service | Configuration | Daily Cost | Monthly Cost |
|---|---|---|---|
| EMR (On-Demand) | 3x m5.xlarge, 2 hours | $1.44 | $43.20 |
| EMR (Spot) | 1 on-demand + 2 spot, 2 hours | $0.90 | $27.12 |
| Glue | 10 DPUs, 1 hour | $4.40 | $132.00 |
| Athena | 500 GB scan per query | $2.50 | $75.00 |
EMR with Spot instances is the cheapest for large batch workloads, but requires operational overhead for cluster management. Athena is the cheapest for occasional, ad-hoc queries on well-partitioned data. Glue sits in the middle, offering serverless simplicity at a moderate cost.
When to use each:
- EMR: Large-scale batch processing, complex Spark pipelines, ML training, workloads requiring custom libraries or specific Spark configurations
- Glue: Serverless ETL, tight integration with AWS data catalog, teams without dedicated infrastructure expertise
- Athena: Ad-hoc SQL queries, interactive analysis, data exploration, lightweight scheduled queries
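The comparison above is easy to parameterize so you can plug in your own runtimes and scan volumes; a sketch using the same assumed list prices:
# Sketch: daily cost of one batch job on EMR, Glue, and Athena (prices as above).
def emr_cost(hours: float, nodes: int = 3, ec2_rate: float = 0.192,
             emr_fee: float = 0.048) -> float:
    return nodes * (ec2_rate + emr_fee) * hours

def glue_cost(dpus: int, hours: float, dpu_rate: float = 0.44) -> float:
    return dpus * hours * dpu_rate

def athena_cost(tb_scanned: float, rate_per_tb: float = 5.00) -> float:
    return tb_scanned * rate_per_tb

print(f"EMR on-demand: ${emr_cost(hours=2):.2f}/day")   # $1.44
print(f"Glue:          ${glue_cost(10, 1):.2f}/day")    # $4.40
print(f"Athena:        ${athena_cost(0.5):.2f}/day")    # $2.50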
4.3 Reserved and Spot Strategies
For predictable workloads, reserved pricing and spot instances can dramatically reduce compute costs.
Reserved Instances and Savings Plans
AWS offers two mechanisms for commitment-based discounts:
Reserved Instances (RIs):
- Commit to a specific instance type in a specific region
- 1-year or 3-year terms
- Up to 72% discount over on-demand
- Best for always-on workloads (e.g., persistent EMR clusters)
Savings Plans:
- Commit to a dollar amount of compute per hour
- More flexible than RIs (apply across instance families and regions)
- Compute Savings Plans: Up to 66% discount, apply to EC2, Fargate, and Lambda
- EC2 Instance Savings Plans: Up to 72% discount, specific to instance family and region
Example:
On-demand: 10x m5.xlarge instances, 24/7
Cost: 10 x $0.192 x 730 hours = $1,401.60/month
3-year All Upfront Reserved:
Cost: 10 x $0.074 x 730 hours = $540.20/month
Savings: $861.40/month (61%)
Spot Instances for Batch Processing
Spot instances are ideal for fault-tolerant data engineering workloads. EMR natively supports Spot instances with automatic fallback to on-demand.
EMR Spot Fleet configuration (Terraform):
resource "aws_emr_cluster" "spark_etl" {
name = "spark-etl-cluster"
release_label = "emr-7.0.0"
applications = ["Spark"]
master_instance_group {
instance_type = "m5.xlarge"
instance_count = 1
}
core_instance_group {
instance_type = "m5.2xlarge"
instance_count = 2
bid_price = "0.10" # Spot bid price
ebs_config {
size = 100
type = "gp3"
}
}
  # EC2 attributes for the cluster nodes (instance profile is required)
  ec2_attributes {
    instance_profile = aws_iam_instance_profile.emr.arn
  }
  # The EMR service role is also required; the role name here is a placeholder
  service_role = aws_iam_role.emr_service.arn
}
Best practices for Spot in data engineering:
- Always use on-demand for the master node
- Use Spot for core and task nodes with instance fleet diversification
- Configure multiple instance types to increase spot capacity availability
- Use checkpointing in Spark to handle spot interruptions gracefully
- Set up CloudWatch alarms for spot interruption notices
Graviton Instances
AWS Graviton processors (ARM-based) offer 20-40% better price-performance compared to x86 instances for many data workloads.
| Instance | vCPUs | Memory | On-Demand Price | Savings vs x86 |
|---|---|---|---|---|
| m5.xlarge (x86) | 4 | 16 GB | $0.192/hour | Baseline |
| m6g.xlarge (Graviton2) | 4 | 16 GB | $0.154/hour | 20% |
| m7g.xlarge (Graviton3) | 4 | 16 GB | $0.163/hour | 15% |
EMR, Glue, and most AWS data services support Graviton instances. Spark, Hive, Presto, and Python workloads generally work without modification on Graviton.
# Use Graviton instances in EMR for better price-performance
core_instance_group {
instance_type = "m6g.2xlarge" # Graviton2
instance_count = 4
}
Module 5: Cost Monitoring and Governance
Optimization without monitoring is guesswork. This module covers the tools and practices for maintaining ongoing visibility into cloud costs and ensuring that optimization gains are sustained over time.
5.1 Tagging Strategy
Resource tagging is the foundation of cost allocation. Without consistent tags, you cannot attribute costs to teams, projects, or environments, making it impossible to know where money is being spent.
Designing a Tagging Schema
A well-designed tagging schema should answer four questions:
- Who is responsible for this resource? (team, owner)
- What is this resource used for? (project, application)
- Where is this resource in the lifecycle? (environment)
- Why does this resource exist? (cost center, business unit)
Recommended tags:
| Tag Key | Example Values | Purpose |
|---|---|---|
| team | data-engineering, analytics, ml | Cost attribution to teams |
| project | customer-360, revenue-pipeline | Cost attribution to projects |
| environment | production, staging, development | Separate prod from non-prod costs |
| cost-center | CC-1234 | Finance allocation |
| owner | jane.doe@company.com | Accountability |
| managed-by | terraform, manual, cloudformation | Infrastructure management |
| data-classification | public, internal, confidential | Governance |
Enforcing Tags with Policy
Tags are only useful if they are consistently applied. Use policy-as-code to enforce tagging compliance:
AWS - Tag Policy (via AWS Organizations):
{
"tags": {
"team": {
"tag_key": {
"@@assign": "team"
},
"enforced_for": {
"@@assign": [
"s3:bucket",
"emr:cluster",
"glue:job"
]
}
},
"environment": {
"tag_key": {
"@@assign": "environment"
},
"tag_value": {
"@@assign": ["production", "staging", "development"]
}
}
}
}
Terraform - Required tags module:
variable "required_tags" {
type = object({
team = string
project = string
environment = string
cost_center = string
owner = string
})
}
# Apply to all resources
resource "aws_s3_bucket" "data_lake" {
bucket = "my-data-lake"
tags = merge(var.required_tags, {
Name = "data-lake-bucket"
})
}
resource "aws_emr_cluster" "etl" {
name = "etl-cluster"
tags = merge(var.required_tags, {
Name = "etl-spark-cluster"
})
}
GCP - Organization Policy:
GCP uses labels (their equivalent of tags) and organization policies to enforce consistent labeling:
# gcp-label-policy.yaml
constraint: constraints/compute.requireLabels
listPolicy:
allValues: DENY
allowedValues:
- team
- project
- environment
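Policies prevent new violations, but it also helps to audit what already exists. A sketch using the AWS Resource Groups Tagging API via boto3 to list resources missing any of the required tags (the tag keys follow the schema above):
# Sketch: find AWS resources missing required cost-allocation tags.
import boto3

REQUIRED_TAGS = {"team", "project", "environment", "cost-center", "owner"}

def untagged_resources(region: str = "us-east-1"):
    client = boto3.client("resourcegroupstaggingapi", region_name=region)
    paginator = client.get_paginator("get_resources")
    missing = []
    for page in paginator.paginate():
        for resource in page["ResourceTagMappingList"]:
            keys = {tag["Key"] for tag in resource.get("Tags", [])}
            if not REQUIRED_TAGS.issubset(keys):
                missing.append((resource["ResourceARN"], REQUIRED_TAGS - keys))
    return missing

for arn, missing_keys in untagged_resources():
    print(f"{arn}: missing {sorted(missing_keys)}")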
5.2 Budgets and Alerts
Budgets and alerts are your early warning system for cost anomalies. They do not prevent overspending by themselves, but they ensure you know about it quickly enough to take action.
AWS Budgets
AWS Budgets allows you to set custom cost thresholds and receive notifications when spending approaches or exceeds those thresholds.
resource "aws_budgets_budget" "data_engineering_monthly" {
name = "data-engineering-monthly"
budget_type = "COST"
limit_amount = "10000"
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "TagKeyValue"
values = ["user:team$data-engineering"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 75
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["data-eng-leads@company.com"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["data-eng-leads@company.com", "finance@company.com"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = ["data-eng-leads@company.com"]
subscriber_sns_topic_arns = [aws_sns_topic.cost_alerts.arn]
}
}
GCP Budget Alerts
resource "google_billing_budget" "data_engineering" {
billing_account = var.billing_account_id
display_name = "Data Engineering Monthly Budget"
budget_filter {
projects = ["projects/${var.project_id}"]
labels = {
team = ["data-engineering"]
}
}
amount {
specified_amount {
currency_code = "USD"
units = "10000"
}
}
threshold_rules {
threshold_percent = 0.5
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 0.75
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 0.9
spend_basis = "FORECASTED_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "CURRENT_SPEND"
}
all_updates_rule {
monitoring_notification_channels = [
google_monitoring_notification_channel.email.name
]
pubsub_topic = google_pubsub_topic.budget_alerts.id
}
}
Snowflake Budget Controls
Snowflake provides resource monitors (covered in Module 3) and account-level spending views:
-- Monitor daily credit consumption
SELECT
DATE_TRUNC('day', start_time) AS usage_date,
warehouse_name,
SUM(credits_used) AS credits_consumed,
SUM(credits_used) * 3.00 AS estimated_cost_usd
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1 DESC, credits_consumed DESC;
-- Detect anomalies: days where spending exceeded 2x the average
WITH daily_spend AS (
SELECT
DATE_TRUNC('day', start_time) AS usage_date,
SUM(credits_used) AS daily_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -90, CURRENT_TIMESTAMP())
GROUP BY 1
),
stats AS (
SELECT
AVG(daily_credits) AS avg_daily,
STDDEV(daily_credits) AS stddev_daily
FROM daily_spend
)
SELECT
d.usage_date,
d.daily_credits,
s.avg_daily,
ROUND((d.daily_credits - s.avg_daily) / NULLIF(s.stddev_daily, 0), 2) AS z_score
FROM daily_spend d
CROSS JOIN stats s
WHERE d.daily_credits > s.avg_daily * 2
ORDER BY d.usage_date DESC;
Automated Responses
Beyond alerts, you can automate responses to cost anomalies using cloud-native event systems:
AWS: Lambda triggered by budget alert via SNS:
import boto3
import json
def lambda_handler(event, context):
"""
Triggered when data engineering budget exceeds threshold.
Suspends non-critical EMR clusters and notifies the team.
"""
emr_client = boto3.client('emr')
sns_client = boto3.client('sns')
# List running clusters tagged as non-critical
clusters = emr_client.list_clusters(
ClusterStates=['RUNNING', 'WAITING']
)
terminated = []
for cluster in clusters['Clusters']:
tags = emr_client.describe_cluster(
ClusterId=cluster['Id']
)['Cluster']['Tags']
tag_dict = {t['Key']: t['Value'] for t in tags}
if tag_dict.get('environment') != 'production':
emr_client.terminate_job_flows(
JobFlowIds=[cluster['Id']]
)
terminated.append(cluster['Name'])
# Notify the team
sns_client.publish(
TopicArn='arn:aws:sns:us-east-1:123456789:data-eng-alerts',
Subject='Budget Alert: Non-critical EMR clusters terminated',
Message=json.dumps({
'action': 'auto-terminate',
'clusters': terminated,
'reason': 'Monthly budget threshold exceeded'
})
)
return {'terminated_clusters': terminated}
5.3 Cost Dashboards and Reporting
Cost dashboards transform raw billing data into actionable insights. A good dashboard answers three questions at a glance: How much are we spending? Where is the money going? Is spending trending up or down?
AWS Cost Explorer
AWS Cost Explorer provides built-in visualization of your AWS spending. For data engineering teams, the most useful views are:
- Service breakdown: See which AWS services (S3, EMR, Glue, Athena) consume the most budget
- Tag-based filtering: Filter by team, project, or environment tags
- Daily granularity: Spot anomalies and correlate with pipeline schedules
- Forecasting: AWS provides 12-month cost forecasts based on historical trends
For custom dashboards, use the AWS Cost and Usage Report (CUR) exported to S3 and queried via Athena:
-- Query the Cost and Usage Report in Athena
SELECT
line_item_product_code AS service,
resource_tags_user_team AS team,
resource_tags_user_project AS project,
DATE_FORMAT(line_item_usage_start_date, '%Y-%m') AS month,
SUM(line_item_unblended_cost) AS total_cost
FROM cost_and_usage_report
WHERE line_item_usage_start_date >= DATE '2024-01-01'
AND resource_tags_user_team = 'data-engineering'
GROUP BY 1, 2, 3, 4
ORDER BY total_cost DESC;
GCP Billing Reports
GCP Billing provides similar capabilities with BigQuery export:
-- Query GCP billing export in BigQuery
SELECT
service.description AS service_name,
labels.value AS team,
FORMAT_TIMESTAMP('%Y-%m', usage_start_time) AS month,
SUM(cost) AS total_cost,
SUM(credits.amount) AS total_credits,
SUM(cost) + SUM(credits.amount) AS net_cost
FROM `billing_dataset.gcp_billing_export`
LEFT JOIN UNNEST(labels) AS labels ON labels.key = 'team'
LEFT JOIN UNNEST(credits) AS credits
WHERE usage_start_time >= TIMESTAMP('2024-01-01')
GROUP BY 1, 2, 3
ORDER BY net_cost DESC;
Snowflake ACCOUNT_USAGE Views
Snowflake provides comprehensive usage tracking through the SNOWFLAKE.ACCOUNT_USAGE schema:
-- Weekly cost report by warehouse
SELECT
DATE_TRUNC('week', start_time) AS week,
warehouse_name,
SUM(credits_used) AS total_credits,
ROUND(SUM(credits_used) * 3.00, 2) AS estimated_cost_usd,
COUNT(DISTINCT DATE_TRUNC('day', start_time)) AS active_days,
ROUND(SUM(credits_used) / COUNT(DISTINCT DATE_TRUNC('day', start_time)), 2) AS avg_credits_per_day
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('week', -12, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1 DESC, total_credits DESC;
-- Storage cost breakdown
SELECT
DATE_TRUNC('month', usage_date) AS month,
ROUND(AVG(storage_bytes) / POW(1024, 4), 2) AS avg_storage_tb,
ROUND(AVG(stage_bytes) / POW(1024, 4), 2) AS avg_stage_tb,
ROUND(AVG(failsafe_bytes) / POW(1024, 4), 2) AS avg_failsafe_tb,
ROUND((AVG(storage_bytes) + AVG(stage_bytes) + AVG(failsafe_bytes))
/ POW(1024, 4) * 23, 2) AS estimated_monthly_cost_usd
FROM snowflake.account_usage.storage_usage
WHERE usage_date >= DATEADD('month', -6, CURRENT_DATE())
GROUP BY 1
ORDER BY 1 DESC;
Weekly Cost Review Cadence
Establishing a regular review cadence ensures cost awareness stays embedded in team culture:
Weekly (15-minute standup):
- Review week-over-week cost trend
- Identify any anomalies or spikes
- Check if any budget thresholds were triggered
- Assign follow-up investigation for anomalies
Monthly (30-minute review):
- Deep dive into cost by service, team, and project
- Compare actual vs. budgeted spend
- Review optimization opportunities identified
- Track savings from implemented optimizations
- Plan next month's optimization initiatives
Quarterly (1-hour planning):
- Review reserved instance / savings plan coverage
- Evaluate architecture decisions for cost efficiency
- Update budgets and forecasts
- Align with finance on cost allocation and chargeback
Module 6: Real-World Cost Patterns
This module applies the concepts from the previous modules to common data engineering patterns. These are the decisions you will face repeatedly in your career, and understanding their cost implications will help you make better choices.
6.1 Incremental Processing vs. Full Refresh
One of the most consequential architectural decisions in data engineering is whether to process data incrementally or do a full refresh. This decision has profound cost implications that scale with data volume.
Full Refresh
A full refresh drops and recreates a table by processing all source data from scratch.
Cost characteristics:
- Processes the entire dataset every run
- Cost grows linearly with data volume
- Simple to implement and reason about
- No state management required
- Guaranteed correctness (no edge cases with late-arriving data)
Example (BigQuery on-demand):
-- Full refresh: scan the entire source table every time
CREATE OR REPLACE TABLE `project.dataset.daily_metrics` AS
SELECT
DATE(event_timestamp) AS metric_date,
event_type,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM `project.dataset.events` -- 10 TB table
GROUP BY 1, 2;
-- Cost per run: 10 TB x $6.25 = $62.50
-- Daily cost: $62.50
-- Monthly cost: $1,875
-- Annual cost: $22,500
Incremental Processing
Incremental processing only handles new or changed data since the last run.
Cost characteristics:
- Processes only new/changed data
- Cost is proportional to the data change rate, not total volume
- More complex to implement (needs watermarks, change detection)
- Requires state management (tracking what has been processed)
- Must handle late-arriving data and out-of-order events
Example (BigQuery on-demand):
-- Incremental: only scan today's partition
MERGE INTO `project.dataset.daily_metrics` AS target
USING (
SELECT
DATE(event_timestamp) AS metric_date,
event_type,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM `project.dataset.events`
WHERE event_date = CURRENT_DATE() -- Only today's partition (~30 GB)
GROUP BY 1, 2
) AS source
ON target.metric_date = source.metric_date
AND target.event_type = source.event_type
WHEN MATCHED THEN
UPDATE SET
event_count = source.event_count,
unique_users = source.unique_users
WHEN NOT MATCHED THEN
INSERT (metric_date, event_type, event_count, unique_users)
VALUES (source.metric_date, source.event_type, source.event_count, source.unique_users);
-- Cost per run: 30 GB x $6.25/TB = $0.19
-- Daily cost: $0.19
-- Monthly cost: $5.70
-- Annual cost: $68.40
The incremental approach costs $22,431.60 less per year for this single pipeline. Across an organization with hundreds of pipelines, the savings are transformational.
dbt Incremental Models
dbt (data build tool) provides a framework for incremental processing that handles much of the complexity:
-- models/daily_metrics.sql
{{
config(
materialized='incremental',
unique_key=['metric_date', 'event_type'],
partition_by={
"field": "metric_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['event_type']
)
}}
SELECT
DATE(event_timestamp) AS metric_date,
event_type,
COUNT(*) AS event_count,
COUNT(DISTINCT user_id) AS unique_users
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
WHERE event_date > (SELECT MAX(metric_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2
dbt handles the merge logic, partition management, and the incremental/full-refresh toggle automatically. Running dbt run processes incrementally; running dbt run --full-refresh rebuilds from scratch.
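Late-arriving events are the main correctness risk called out above. A common mitigation, sketched here as a variation on the incremental filter in the model, is to reprocess a short lookback window on every run so stragglers still get merged in; the three-day window is an assumption to tune against your data's lateness profile and the extra scan cost you are willing to pay.
{% if is_incremental() %}
  -- Reprocess the last 3 days so late-arriving events are merged in;
  -- a wider window catches later stragglers but scans (and costs) more
  WHERE event_date >= DATE_SUB((SELECT MAX(metric_date) FROM {{ this }}), INTERVAL 3 DAY)
{% endif %}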
When to Use Each Approach
| Factor | Full Refresh | Incremental |
|---|---|---|
| Data volume | Small (< 100 GB) | Large (> 100 GB) |
| Update frequency | Entire dataset changes | Small % of data changes per run |
| Complexity tolerance | Low (want simplicity) | High (can handle complexity) |
| Late-arriving data | Naturally handles it | Needs explicit handling |
| Cost sensitivity | Low (budget is flexible) | High (cost is a constraint) |
| Source data | No reliable timestamp/watermark | Has reliable change tracking |
6.2 Materialized Views vs. Live Queries
The decision to precompute results (materialize) vs. compute on-the-fly (live query) is a classic space-time trade-off with direct cost implications.
Live Queries
Live queries compute results from base tables every time they are executed.
Cost characteristics:
- No additional storage cost
- Full compute cost on every execution
- Always returns the latest data
- Cost scales with query frequency
Example:
-- This query scans 5 TB every time it runs
SELECT
DATE(event_timestamp) AS day,
country,
COUNT(DISTINCT user_id) AS dau
FROM `project.dataset.events`
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE()
GROUP BY 1, 2;
-- If run 50 times/day by different dashboard users:
-- Daily cost: 50 x 5 TB x $6.25/TB = $1,562.50
-- Monthly cost: $46,875
Materialized Views / Pre-Aggregated Tables
Materializing results stores precomputed aggregations that are queried instead of base tables.
Cost characteristics:
- Additional storage cost for the materialized data
- Compute cost for refreshing the materialization
- Much lower cost per query (smaller data to scan)
- Data freshness depends on refresh frequency
- Cost scales with refresh frequency, not query frequency
Example:
-- Materialized view refreshed automatically (BigQuery)
CREATE MATERIALIZED VIEW `project.dataset.daily_country_dau`
AS
SELECT
DATE(event_timestamp) AS day,
country,
  -- Incremental BigQuery materialized views don't support exact COUNT(DISTINCT);
  -- APPROX_COUNT_DISTINCT keeps the view incrementally refreshable. Use a
  -- pre-aggregated table instead if exact distinct counts are required.
  APPROX_COUNT_DISTINCT(user_id) AS dau
FROM `project.dataset.events`
GROUP BY 1, 2;
-- Storage cost: ~1 GB of aggregated data = $0.02/month
-- Refresh cost: automatic, incremental
-- Query cost per execution: 1 GB x $6.25/TB = $0.006
-- If queried 50 times/day:
-- Daily cost: 50 x $0.006 = $0.30
-- Monthly cost: $9.00 + refresh costs
The materialized approach reduces costs from $46,875/month to under $100/month for this pattern -- a 99.8% reduction.
Decision Framework
Materialize when:
- The same expensive query is run frequently (more than 5-10 times per day); the sketch after this framework shows one way to find such candidates
- The base data is much larger than the aggregated result
- Slight data staleness is acceptable
- Multiple users or dashboards consume the same aggregation
Query live when:
- The query is run rarely (less than once per day)
- The base data is small
- Real-time freshness is required
- The query pattern is ad-hoc and unpredictable
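To find materialization candidates in an existing project, a query along these lines against BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT view surfaces query texts that are both expensive and frequently repeated. This is a sketch: the region qualifier, 30-day window, and 10-run threshold are assumptions to adjust, and grouping on the raw query text only catches exact repeats (as issued by dashboards, for example).
-- Frequently repeated, expensive queries: candidates for materialization
SELECT
    query,
    COUNT(*) AS run_count,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS total_tb_billed,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND statement_type = 'SELECT'
GROUP BY query
HAVING COUNT(*) >= 10
ORDER BY approx_cost_usd DESC
LIMIT 20;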
6.3 Development Environment Cost Control
Development and testing environments are notorious for wasting money. They often mirror production in scale but receive a fraction of the traffic, resulting in massively over-provisioned resources. Controlling dev environment costs can save 30-50% of total cloud spend.
Using Smaller Datasets
Production-scale data is rarely needed for development. Creating representative subsets reduces both compute and storage costs:
-- BigQuery: Create a 1% sample of production data for development
CREATE TABLE `dev_project.dataset.events_sample` AS
SELECT *
FROM `prod_project.dataset.events`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(user_id AS STRING))), 100) = 0 -- Deterministic 1% sample (CAST handles non-string IDs)
AND event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY); -- Last 30 days only
-- Snowflake: Use SAMPLE for quick development datasets
CREATE TABLE dev_db.schema.events_sample AS
SELECT *
FROM prod_db.schema.events
SAMPLE (1); -- 1% sample
Snowflake Zero-Copy Clones for Development
As covered in Module 3, zero-copy clones provide full-fidelity development environments at minimal cost:
-- Create a dev environment that shares production's structure and data at no additional storage cost (until data is modified)
CREATE DATABASE dev_analytics CLONE prod_analytics;
-- Each developer gets their own clone
CREATE SCHEMA dev_analytics.developer_jane CLONE prod_analytics.public;
The clone shares storage with production until data is modified, so the incremental cost is minimal. Combined with a small development warehouse (X-Small or Small), this provides a powerful and cost-effective development experience.
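A minimal sketch of such a development warehouse, assuming the dev_wh name used later in this module, pairs the smallest size with aggressive auto-suspend so it only accrues credits while queries are actually running:
-- X-Small dev warehouse that suspends after 60 seconds of inactivity
CREATE WAREHOUSE IF NOT EXISTS dev_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE;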
BigQuery Sandboxes and Development Projects
Use separate GCP projects for development with restricted quotas:
# Terraform: Create a dedicated dev project so spend can be isolated, labeled, and capped
resource "google_project" "data_eng_dev" {
name = "data-eng-dev"
project_id = "data-eng-dev-12345"
org_id = var.org_id
labels = {
team = "data-engineering"
environment = "development"
}
}
# Enable the BigQuery API in the dev project. Per-project "query usage per day"
# custom quotas (BigQuery custom cost controls) are configured separately.
resource "google_project_service" "bigquery" {
project = google_project.data_eng_dev.project_id
service = "bigquery.googleapis.com"
}
Configure BigQuery custom cost controls to stop developers from accidentally running expensive queries. The per-query byte cap is a job-level setting (not a SQL clause), set via the bq CLI, client libraries, or job configuration; queries that would exceed the limit fail before executing and are not billed:
# Cap a single query at 10 GB billed (bq CLI)
bq query \
  --use_legacy_sql=false \
  --maximum_bytes_billed=10737418240 \
  'SELECT event_type, COUNT(*) AS events
   FROM `project.dataset.large_table`
   GROUP BY event_type'
Terraform Workspace Isolation
Use Terraform workspaces to manage separate infrastructure for each environment, with environment-specific sizing:
# variables.tf
variable "environment" {
type = string
}
locals {
env_config = {
production = {
emr_instance_type = "m5.2xlarge"
emr_instance_count = 10
s3_lifecycle_days = 365
glue_workers = 20
}
staging = {
emr_instance_type = "m5.xlarge"
emr_instance_count = 3
s3_lifecycle_days = 30
glue_workers = 5
}
development = {
emr_instance_type = "m5.large"
emr_instance_count = 2
s3_lifecycle_days = 7
glue_workers = 2
}
}
config = local.env_config[var.environment]
}
resource "aws_emr_cluster" "etl" {
name = "${var.environment}-etl-cluster"
# release_label, applications, service_role, and ec2_attributes omitted for brevity
master_instance_group {
instance_type = local.config.emr_instance_type
instance_count = 1
}
core_instance_group {
instance_type = local.config.emr_instance_type
instance_count = local.config.emr_instance_count
}
tags = {
environment = var.environment
team = "data-engineering"
}
}
Scheduling Dev Resources to Shut Down Overnight
Development resources sitting idle overnight and on weekends waste significant money. Automate their shutdown:
AWS Lambda for scheduled EMR termination:
import boto3

def lambda_handler(event, context):
    """
    Runs daily at 8 PM via an EventBridge (CloudWatch Events) schedule.
    Terminates all non-production EMR clusters.
    """
    emr = boto3.client('emr')
    # list_clusters is paginated, so walk every page of active clusters
    paginator = emr.get_paginator('list_clusters')
    for page in paginator.paginate(ClusterStates=['RUNNING', 'WAITING']):
        for cluster in page['Clusters']:
            details = emr.describe_cluster(ClusterId=cluster['Id'])
            tags = {t['Key']: t['Value'] for t in details['Cluster']['Tags']}
            if tags.get('environment') in ('development', 'staging'):
                emr.terminate_job_flows(JobFlowIds=[cluster['Id']])
                print(f"Terminated {cluster['Name']} ({cluster['Id']})")
Snowflake warehouse scheduling:
-- Create a task to suspend dev warehouses at 7 PM
CREATE TASK suspend_dev_warehouses
WAREHOUSE = admin_wh
SCHEDULE = 'USING CRON 0 19 * * MON-FRI America/New_York'
AS
ALTER WAREHOUSE dev_wh SUSPEND;
-- Create a task to resume dev warehouses at 8 AM
CREATE TASK resume_dev_warehouses
WAREHOUSE = admin_wh
SCHEDULE = 'USING CRON 0 8 * * MON-FRI America/New_York'
AS
ALTER WAREHOUSE dev_wh RESUME;
ALTER TASK suspend_dev_warehouses RESUME;
ALTER TASK resume_dev_warehouses RESUME;
Assuming a Medium dev warehouse (4 credits/hour x $4/credit = $16/hour equivalent):
- Running 24/7: $16 x 730 hours = $11,680/month
- Running business hours only (10 hours/day, weekdays): $16 x 10 x 22 = $3,520/month
- Savings: $8,160/month (70% reduction)
Next Steps
After completing this course on cloud cost optimization, you will be equipped to:
- Audit your current cloud spend - Use the INFORMATION_SCHEMA queries, Cost Explorer, and ACCOUNT_USAGE views to understand where money is going
- Implement quick wins - Partition pruning, column selection, auto-suspend, and lifecycle policies can deliver savings in days
- Build a cost monitoring practice - Set up budgets, alerts, and dashboards for ongoing visibility
- Advocate for cost-aware architecture - Bring cost analysis into design reviews and architecture decisions
- Explore Data Modeling - Understand how data model design impacts query costs
- Review Fundamentals - Revisit batch vs. streaming trade-offs with a cost lens
Recommended Resources:
Cloud FinOps by J.R. Storment and Mike Fuller
- The definitive guide to the FinOps framework
- Covers organizational practices, not just technical optimization
- Essential reading for anyone managing cloud budgets
FinOps Foundation (finops.org)
- Free resources, frameworks, and community
- FinOps Certified Practitioner certification
- Case studies from major organizations
Provider Documentation
- BigQuery Pricing - Always check current pricing
- Snowflake Credit Usage - Understand credit consumption
- AWS Cost Management - Tools and best practices
dbt Documentation on Incremental Models
- dbt Incremental Models - Framework for cost-effective data transformations
Remember: cloud cost optimization is not a one-time project -- it is an ongoing practice. Data volumes grow, pricing models change, and new services emerge. The data engineers who treat cost as a continuous concern, not an afterthought, deliver the most value to their organizations.