🛠 Project: ETL Pipeline Orchestration with Apache Airflow
📌 Project Overview
In this project, you'll design and implement an orchestrated ETL pipeline using Apache Airflow to extract data from a public API, transform it, and load it into a data warehouse (e.g. BigQuery or DuckDB). You'll learn how to build modular, testable, and observable workflows that can run on a schedule and handle production-like errors and dependencies.
This project simulates a production-ready orchestration setup for modern data teams.
🎯 Learning Objectives
- Understand key concepts of workflow orchestration (DAGs, operators, scheduling)
- Learn the architecture of Airflow (scheduler, webserver, metadata DB, workers)
- Develop a maintainable ETL pipeline using Airflow DAGs
- Implement monitoring, alerting, and task-level retries
- Optionally set up a warehouse for testing (e.g., DuckDB locally or BigQuery in the cloud)
⌛ Estimated Duration
🕒 8–12 hours
🧠 Difficulty: Intermediate
📦 Dataset / API Recommendation
Dataset: Open Meteo API
- No authentication required
- Offers hourly/daily historical + forecast weather data
- JSON-based responses
Example:
https://api.open-meteo.com/v1/forecast?latitude=40.71&longitude=-74.01&hourly=temperature_2m,precipitation
🗂 Suggested Project Structure
airflow-etl-project/
├── dags/
│ ├── hello_world_dag.py
│ ├── weather_etl_dag.py
├── plugins/
│ └── custom_operators/
├── include/
│ └── templates/
│ └── email_alert.html
├── data/
│ └── weather_staging.csv
├── docker-compose.yaml
├── .env
└── README.md
🔄 Step-by-Step Guide
1. ⚙️ Set Up Airflow with Docker
- Use the official Docker Compose setup
- Create a `.env` file with:
  AIRFLOW_UID=$(id -u)
  AIRFLOW_HOME=./airflow
- Start Airflow:
  docker-compose up airflow-init
  docker-compose up
- Visit the Airflow UI at http://localhost:8080
2. ✨ Build a Simple DAG
- Create `hello_world_dag.py` (a minimal sketch follows this list)
- Schedule it to run every 5 minutes
- Use `BashOperator` and `PythonOperator`
- Observe logs and task runs in the UI
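A minimal sketch of what `hello_world_dag.py` could look like, assuming Airflow 2.x (the task ids and the printed message are illustrative):

```python
# dags/hello_world_dag.py -- minimal example DAG (illustrative sketch)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    """Print a greeting; the output appears in the task log."""
    print("Hello from Airflow!")


with DAG(
    dag_id="hello_world",
    start_date=datetime(2024, 1, 1),
    schedule=timedelta(minutes=5),  # run every 5 minutes
    catchup=False,                  # don't backfill missed runs
) as dag:
    print_date = BashOperator(task_id="print_date", bash_command="date")
    greet = PythonOperator(task_id="say_hello", python_callable=say_hello)

    print_date >> greet  # print_date runs before say_hello
```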
3. 🌦 Build the Weather Data Ingestion DAG
- Create `weather_etl_dag.py` (a sketch follows this list)
- Schedule: every 6 hours
- Use `PythonOperator` tasks to:
  - Call the API
  - Parse and clean the JSON
  - Save to `data/weather_staging.csv`
- Pass filenames between tasks with XCom if needed
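A possible skeleton for the extract/transform part of `weather_etl_dag.py` (the coordinates come from the example URL above; the task split and file path are assumptions):

```python
# dags/weather_etl_dag.py -- extract/transform sketch (illustrative)
import csv
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

API_URL = (
    "https://api.open-meteo.com/v1/forecast"
    "?latitude=40.71&longitude=-74.01&hourly=temperature_2m,precipitation"
)
STAGING_PATH = "data/weather_staging.csv"  # assumed staging location


def extract_and_transform(**context):
    """Call the API, flatten the hourly JSON, and write a staging CSV."""
    hourly = requests.get(API_URL, timeout=30).json()["hourly"]
    rows = zip(hourly["time"], hourly["temperature_2m"], hourly["precipitation"])
    with open(STAGING_PATH, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "temperature_2m", "precipitation"])
        writer.writerows(rows)
    # Push the filename so a downstream load task can fetch it via XCom
    context["ti"].xcom_push(key="staging_file", value=STAGING_PATH)


with DAG(
    dag_id="weather_etl",
    start_date=datetime(2024, 1, 1),
    schedule="0 */6 * * *",  # every 6 hours
    catchup=False,
) as dag:
    extract_transform = PythonOperator(
        task_id="extract_and_transform",
        python_callable=extract_and_transform,
    )
    # A load task (DuckDB or BigQuery, see step 4) would be chained here:
    # extract_transform >> load_to_warehouse
```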
4. 📥 Load Data into a Warehouse
Option A: DuckDB (local)
- Use `duckdb` in Python (see the sketch after this step)
- Load from CSV into a `weather_hourly` table
Option B: BigQuery
- Use `BigQueryInsertJobOperator`
- Add row-count validation
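For Option A, the load callable can stay very small. A sketch, assuming the staging CSV from step 3 and a local `data/warehouse.duckdb` file (both names are illustrative):

```python
# Load the staging CSV into a local DuckDB file (illustrative sketch)
import duckdb


def load_to_duckdb(staging_path: str = "data/weather_staging.csv") -> None:
    con = duckdb.connect("data/warehouse.duckdb")  # assumed warehouse file
    # Full refresh for simplicity; switch to INSERT for incremental loads
    con.execute(
        f"CREATE OR REPLACE TABLE weather_hourly AS "
        f"SELECT * FROM read_csv_auto('{staging_path}')"
    )
    # Basic row-count validation
    count = con.execute("SELECT count(*) FROM weather_hourly").fetchone()[0]
    print(f"weather_hourly now has {count} rows")
    con.close()
```

Wrap this in a `PythonOperator` and chain it after the extract task. For Option B, the Google provider's `BigQueryInsertJobOperator` replaces the callable with a declarative job configuration.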
5. 🚨 Add Observability
- Add retries to your task arguments:
  retries=3,
  retry_delay=timedelta(minutes=5),
- Add an `on_failure_callback` for alerts (see the sketch after this list)
- Monitor with logs, Gantt charts, and the Airflow UI
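A sketch of how retries and a failure callback can be attached through `default_args` (the callback body just logs; swap in email or Slack as needed):

```python
# Retry and alerting configuration (illustrative sketch)
from datetime import timedelta


def notify_failure(context):
    """Invoked by Airflow when a task fails; replace the print with a real alert."""
    ti = context["task_instance"]
    print(f"Task {ti.task_id} in DAG {ti.dag_id} failed for run {context['ds']}")


default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_failure,
}

# Pass default_args=default_args to the DAG constructor so every task
# inherits the retry policy and the failure callback.
```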
6. 🚦 Advanced DAG Features (Optional)
- Task branching (skip if no new data; see the sketch after this list)
- Jinja templating for dynamic API calls
- Custom operators
- File sensors or upstream DAG dependency
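If you try the branching idea, a minimal sketch with `BranchPythonOperator` (the freshness check is a placeholder; `EmptyOperator` requires Airflow 2.3+, older versions use `DummyOperator`):

```python
# Skip the load when there is no new data (illustrative sketch)
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def choose_branch(**context):
    """Return the task_id to follow; the other branch is skipped."""
    staging_file = context["ti"].xcom_pull(key="staging_file")
    has_new_data = staging_file is not None  # placeholder freshness check
    return "load_to_warehouse" if has_new_data else "skip_load"


# Inside the DAG definition:
# check = BranchPythonOperator(task_id="check_for_new_data",
#                              python_callable=choose_branch)
# skip_load = EmptyOperator(task_id="skip_load")
# check >> [load_to_warehouse, skip_load]
```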
✅ Deliverables
- `weather_etl_dag.py`
- Docker Compose Airflow setup
- Ingestion and transformation Python scripts
- Screenshots of DAG in UI
- README with:
  - Setup instructions
  - DAG diagram
  - Error handling notes
🚀 Optional Extensions
- Add Great Expectations validation
- Use dbt for modeling post-load
- Parametrize DAG for different cities
- Add CI/CD with GitHub Actions