ETL Pipeline Orchestration with Apache Airflow

    Design and implement an orchestrated ETL pipeline using Apache Airflow to extract, transform, and load weather data from a public API into a data warehouse.

    ✓ Expert-Designed Project • Industry-Validated Implementation • Production-Ready Architecture

    This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master Airflow, Docker, Python, APIs, DuckDB, and BigQuery through hands-on implementation. Rated intermediate, with comprehensive documentation and starter code.

    🛠 Project: ETL Pipeline Orchestration with Apache Airflow

    📌 Project Overview

    In this project, you'll design and implement an orchestrated ETL pipeline using Apache Airflow to extract data from a public API, transform it, and load it into a data warehouse (e.g. BigQuery or DuckDB). You'll learn how to build modular, testable, and observable workflows that can run on a schedule and handle production-like errors and dependencies.

    This project simulates a production-ready orchestration setup for modern data teams.

    🎯 Learning Objectives

    • Understand key concepts of workflow orchestration (DAGs, operators, scheduling)
    • Learn the architecture of Airflow (scheduler, webserver, metadata DB, workers)
    • Develop a maintainable ETL pipeline using Airflow DAGs
    • Implement monitoring, alerting, and task-level retries
    • Optionally deploy a local data warehouse for testing (e.g., DuckDB or BigQuery)

    ⌛ Estimated Duration

    🕒 8–12 hours
    🧠 Difficulty: Intermediate

    📦 Dataset / API Recommendation

    Dataset: Open-Meteo API

    • No authentication required
    • Offers hourly/daily historical + forecast weather data
    • JSON-based responses

    Example:

    https://api.open-meteo.com/v1/forecast?latitude=40.71&longitude=-74.01&hourly=temperature_2m,precipitation
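
    To sanity-check the endpoint outside Airflow, here is a minimal sketch using the requests package (the coordinates are the ones from the example URL; the hourly fields come back as lists aligned with the time list):

    import requests

    resp = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={
            "latitude": 40.71,
            "longitude": -74.01,
            "hourly": "temperature_2m,precipitation",
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Each hourly field is a list aligned with data["hourly"]["time"].
    print(data["hourly"]["time"][:3], data["hourly"]["temperature_2m"][:3])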
    

    🗂 Suggested Project Structure

    airflow-etl-project/
    ├── dags/
    │   ├── hello_world_dag.py
    │   ├── weather_etl_dag.py
    ├── plugins/
    │   └── custom_operators/
    ├── include/
    │   └── templates/
    │       └── email_alert.html
    ├── data/
    │   └── weather_staging.csv
    ├── docker-compose.yaml
    ├── .env
    └── README.md
    

    🔄 Step-by-Step Guide

    1. ⚙️ Set Up Airflow with Docker

    • Create a .env file in the project root; generate it from your shell so $(id -u) expands to your host user ID:
    AIRFLOW_UID=$(id -u)
    AIRFLOW_HOME=./airflow
    
    • Start Airflow:
    docker-compose up airflow-init
    docker-compose up
    
    • Visit Airflow UI: http://localhost:8080

    2. ✨ Build a Simple DAG

    • Create hello_world_dag.py
    • Schedule it every 5 minutes
    • Use BashOperator and PythonOperator (a minimal sketch follows below)
    • Observe logs and task runs in the UI
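
    A minimal sketch of hello_world_dag.py, assuming Airflow 2.4+ (older 2.x versions use schedule_interval instead of schedule):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator


    def say_hello():
        print("Hello from Airflow!")


    with DAG(
        dag_id="hello_world",
        start_date=datetime(2024, 1, 1),
        schedule="*/5 * * * *",  # every 5 minutes
        catchup=False,
    ) as dag:
        bash_hello = BashOperator(task_id="bash_hello", bash_command="echo 'Hello from Bash'")
        python_hello = PythonOperator(task_id="python_hello", python_callable=say_hello)

        bash_hello >> python_hello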

    3. 🌦 Build the Weather Data Ingestion DAG

    • Create weather_etl_dag.py
    • Schedule: every 6 hours
    • Use PythonOperator to:
      • Call the API
      • Parse and clean JSON
      • Save to data/weather_staging.csv
    • Pass filenames between tasks with XCom if needed (see the sketch below)
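
    A sketch of the extract/transform part of weather_etl_dag.py. The staging path is an assumption; adjust it to how your data/ folder is mounted into the Airflow containers:

    import csv
    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    API_URL = (
        "https://api.open-meteo.com/v1/forecast"
        "?latitude=40.71&longitude=-74.01&hourly=temperature_2m,precipitation"
    )
    STAGING_PATH = "/opt/airflow/data/weather_staging.csv"  # assumed mount point


    def extract_and_transform():
        payload = requests.get(API_URL, timeout=30).json()
        hourly = payload["hourly"]
        rows = zip(hourly["time"], hourly["temperature_2m"], hourly["precipitation"])
        with open(STAGING_PATH, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["time", "temperature_2m", "precipitation"])
            writer.writerows(rows)
        # Returning the path pushes it to XCom for downstream load tasks.
        return STAGING_PATH


    with DAG(
        dag_id="weather_etl",
        start_date=datetime(2024, 1, 1),
        schedule="0 */6 * * *",  # every 6 hours (Airflow 2.4+)
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_and_transform",
            python_callable=extract_and_transform,
        )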

    4. 📥 Load Data into a Warehouse

    Option A: DuckDB (local)

    • Use the duckdb Python package
    • Load the staged CSV into a weather_hourly table (see the sketch after Option B)

    Option B: BigQuery

    • Use BigQueryInsertJobOperator
    • Add row-count validation
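
    For Option A, a sketch of the load step with the duckdb package. The database path is an assumption; the function can be wired into the DAG with a PythonOperator that pulls the CSV path from XCom:

    import duckdb


    def load_to_duckdb(csv_path, db_path="/opt/airflow/data/warehouse.duckdb"):
        con = duckdb.connect(db_path)
        # Create the table from the CSV schema on first run, then append.
        con.execute(
            f"CREATE TABLE IF NOT EXISTS weather_hourly AS "
            f"SELECT * FROM read_csv_auto('{csv_path}') LIMIT 0"
        )
        con.execute(f"INSERT INTO weather_hourly SELECT * FROM read_csv_auto('{csv_path}')")
        row_count = con.execute("SELECT COUNT(*) FROM weather_hourly").fetchone()[0]
        con.close()
        return row_count  # handy for a simple row-count validation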

    5. 🚨 Add Observability

    • Add retries:
    retries=3,
    retry_delay=timedelta(minutes=5),
    
    • Add an on_failure_callback for alerts (see the sketch below)
    • Monitor with logs, Gantt charts, and the UI
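
    A sketch of how these settings fit together, with a placeholder callback (swap the print for Slack, email, or whatever alerting you use):

    from datetime import timedelta


    def notify_on_failure(context):
        # Airflow passes the task context; task_instance and exception describe the failure.
        ti = context["task_instance"]
        print(f"Task {ti.task_id} in DAG {ti.dag_id} failed: {context.get('exception')}")


    default_args = {
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "on_failure_callback": notify_on_failure,
    }

    Pass default_args=default_args to the DAG so every task inherits the retry and alerting behavior.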

    6. 🚦 Advanced DAG Features (Optional)

    • Task branching (skip the load when there is no new data; see the sketch below)
    • Jinja templating for dynamic API calls
    • Custom operators
    • File sensors or upstream DAG dependency
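
    A branching sketch for the "skip if no new data" idea; the downstream task_ids (load_to_warehouse, skip_load) are hypothetical names for tasks you would define yourself:

    import os

    from airflow.operators.python import BranchPythonOperator


    def choose_branch(**context):
        # Pull the staging-file path pushed by the extract task.
        path = context["ti"].xcom_pull(task_ids="extract_and_transform")
        if path and os.path.exists(path):
            with open(path) as f:
                if sum(1 for _ in f) > 1:  # more than just the header row
                    return "load_to_warehouse"
        return "skip_load"


    branch = BranchPythonOperator(task_id="check_new_data", python_callable=choose_branch)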

    ✅ Deliverables

    • weather_etl_dag.py
    • Docker Compose Airflow setup
    • Ingestion and transformation Python scripts
    • Screenshots of DAG in UI
    • README with:
      • Setup
      • DAG diagram
      • Error handling notes

    🚀 Optional Extensions

    • Add Great Expectations validation
    • Use dbt for modeling post-load
    • Parametrize the DAG for different cities (see the sketch below)
    • Add CI/CD with GitHub Actions
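
    For the city-parametrization idea, a sketch using DAG params and Jinja templating (the param names and dag_id are assumptions; any templated field can reference them via {{ params.* }}):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="weather_etl_city",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # trigger manually with a config, e.g. {"latitude": 51.5, "longitude": -0.12}
        params={"latitude": 40.71, "longitude": -74.01},
    ) as dag:
        show_city = BashOperator(
            task_id="show_city",
            bash_command="echo 'Fetching weather for {{ params.latitude }},{{ params.longitude }}'",
        )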

    Project Details

    Tools & Technologies

    Airflow
    Docker
    Python
    APIs
    DuckDB
    BigQuery

