Tourism Recovery Dashboard (SQL + Power BI)

    Answer a real business question end to end: load Eurostat regional tourism data into DuckDB, model the metrics in SQL, and ship a one-page Power BI dashboard explaining where tourism recovered fastest between 2022 and 2025.

    โœ“ Expert-Designed Projectโ€ข Industry-Validated Implementationโ€ข Production-Ready Architecture

    This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master DuckDB, SQL, Power BI and 2 more technologies through hands-on implementation. Rated beginner level with comprehensive documentation and starter code.

    Beginner
    6-8 hours

    ๐Ÿ‡ฎ๐Ÿ‡น Project: Italian Tourism Recovery Dashboard

    ๐Ÿ“Œ Project Overview

    A real, scoped business question with a clean public dataset and a polished Power BI deliverable โ€” the kind of portfolio piece a recruiter actually reads.

    You will load Eurostat tourism statistics into a local DuckDB database, model the data in SQL, then build a single-page Power BI dashboard that answers:

    Which Italian regions had the steepest tourism recovery between 2022 and 2025, and what does it imply for a hotel chain planning 2027 capacity expansion?

    By the end you have a public GitHub repo with the SQL, the Power BI file, a written conclusion, and a dashboard screenshot โ€” exactly the package recruiters look for at first-screen.


    ๐ŸŽฏ Learning Objectives

    • Pull a real EU public dataset and load it into a local analytical database
    • Write SQL with CTEs and window functions to compute recovery metrics
    • Model fact and dimension tables for a BI tool
    • Build a Power BI report with DAX measures, slicers, and a mobile layout
    • Tell a one-page data story end to end (question โ†’ analysis โ†’ recommendation)
    • Package the project on GitHub with a recruiter-ready README

    ๐Ÿงฐ Prerequisites

    • Power BI Desktop (Windows only; macOS users can run it in a Windows VM or use a Windows VM in Azure free tier)
    • DuckDB CLI or Python โ€” install with pip install duckdb or download from duckdb.org
    • Python 3.10+ with pandas (just for the ingestion script)
    • Git + a GitHub account
    • Basic SQL: SELECT, JOIN, GROUP BY, CTE. Window functions are a stretch goal.

    ๐Ÿ“Š Dataset

    Eurostat: Nights spent at tourist accommodation establishments by NUTS 2 region

    • Direct browser link: tour_occ_arn2 on Eurostat databrowser
    • Download format: TSV via the "Download" button (full dataset, then filter to Italy by geo codes starting with IT)
    • Time coverage: 2012โ€“present, monthly. We'll restrict to 2022-01 onward.
    • Geographic granularity: NUTS 2 (20 Italian regions: Lombardia, Lazio, Veneto, Sicilia, etc.).

    Why this dataset? It's authoritative (official EU statistics), regularly refreshed, covers exactly the question we're asking, and is at the right grain for a dashboard. The mess is real but bounded โ€” perfect for showing cleaning skill without spending three weeks on it.


    โŒ› Estimated Time

    Duration: 6โ€“8 hours
    Difficulty: Beginner


    ๐Ÿ“‚ Suggested Project Structure

    tourism-recovery-dashboard/
    โ”œโ”€โ”€ data/
    โ”‚   โ”œโ”€โ”€ raw/
    โ”‚   โ”‚   โ””โ”€โ”€ tour_occ_arn2.tsv         # Eurostat download
    โ”‚   โ””โ”€โ”€ tourism.duckdb                # local analytics DB (gitignored)
    โ”œโ”€โ”€ ingest/
    โ”‚   โ””โ”€โ”€ load_eurostat.py              # parse TSV โ†’ DuckDB
    โ”œโ”€โ”€ sql/
    โ”‚   โ”œโ”€โ”€ 01_stg_tourism.sql            # cleaned staging table
    โ”‚   โ”œโ”€โ”€ 02_dim_region.sql             # NUTS 2 โ†’ region name map
    โ”‚   โ”œโ”€โ”€ 03_fct_monthly_nights.sql     # monthly fact table
    โ”‚   โ””โ”€โ”€ 04_recovery_index.sql         # the analysis query
    โ”œโ”€โ”€ powerbi/
    โ”‚   โ”œโ”€โ”€ tourism_recovery.pbix         # the dashboard
    โ”‚   โ””โ”€โ”€ screenshots/
    โ”‚       โ””โ”€โ”€ headline.png
    โ”œโ”€โ”€ README.md
    โ””โ”€โ”€ .gitignore
    

    ๐Ÿ”„ Step-by-Step Guide

    1. ๐Ÿ“ฅ Pull the data

    1. Open Eurostat: tour_occ_arn2.
    2. Click Download โ†’ Full dataset โ†’ TSV (one file, around 6โ€“10 MB).
    3. Save as data/raw/tour_occ_arn2.tsv.

    The file is wide-format and uses Eurostat's TSV conventions: the first column has comma-separated dimension values, then one column per year/month period. Yes, it's awkward โ€” that's the point.


    2. ๐Ÿงฑ Load into DuckDB

    Eurostat TSV is too messy for DuckDB's read_csv_auto alone. Use a tiny Python script to reshape:

    # ingest/load_eurostat.py
    import pandas as pd
    import duckdb
    
    df = pd.read_csv(
        "data/raw/tour_occ_arn2.tsv",
        sep="\t",
        na_values=[":"],            # Eurostat marks missing as ":"
    )
    
    # Split the first column "freq,c_resid,unit,nace_r2,geo\TIME_PERIOD"
    key_col = df.columns[0]
    keys = df[key_col].str.split(",", expand=True)
    keys.columns = key_col.split("\\")[0].split(",")
    df = pd.concat([keys, df.drop(columns=[key_col])], axis=1)
    
    # Wide โ†’ long
    df_long = df.melt(
        id_vars=keys.columns.tolist(),
        var_name="period",
        value_name="nights",
    )
    df_long["nights"] = pd.to_numeric(df_long["nights"], errors="coerce")
    
    con = duckdb.connect("data/tourism.duckdb")
    con.execute("CREATE OR REPLACE TABLE raw_eurostat AS SELECT * FROM df_long")
    print(con.execute("SELECT COUNT(*) FROM raw_eurostat").fetchone())
    

    Run it: python ingest/load_eurostat.py. You should see a row count in the hundreds of thousands.


    3. ๐Ÿงน Stage and model in SQL

    Open DuckDB:

    duckdb data/tourism.duckdb
    

    01_stg_tourism.sql โ€” clean staging

    CREATE OR REPLACE TABLE stg_tourism AS
    SELECT
      geo                                              AS nuts2_code,
      unit,
      c_resid                                          AS residence,
      CAST(SUBSTR(period, 1, 4) AS INT)                AS year,
      CAST(SUBSTR(period, 6, 2) AS INT)                AS month,
      nights
    FROM raw_eurostat
    WHERE geo LIKE 'IT%'              -- Italy only
      AND period LIKE '20__-__'       -- monthly rows
      AND nights IS NOT NULL
      AND c_resid = 'TOTAL';          -- residents + non-residents combined
    

    02_dim_region.sql โ€” readable region names

    CREATE OR REPLACE TABLE dim_region AS
    SELECT * FROM (VALUES
      ('ITC1', 'Piemonte'),
      ('ITC2', 'Valle d''Aosta'),
      ('ITC3', 'Liguria'),
      ('ITC4', 'Lombardia'),
      ('ITF1', 'Abruzzo'),
      ('ITF2', 'Molise'),
      ('ITF3', 'Campania'),
      ('ITF4', 'Puglia'),
      ('ITF5', 'Basilicata'),
      ('ITF6', 'Calabria'),
      ('ITG1', 'Sicilia'),
      ('ITG2', 'Sardegna'),
      ('ITH1', 'Bolzano'),
      ('ITH2', 'Trento'),
      ('ITH3', 'Veneto'),
      ('ITH4', 'Friuli-Venezia Giulia'),
      ('ITH5', 'Emilia-Romagna'),
      ('ITI1', 'Toscana'),
      ('ITI2', 'Umbria'),
      ('ITI3', 'Marche'),
      ('ITI4', 'Lazio')
    ) AS t(nuts2_code, region_name);
    

    03_fct_monthly_nights.sql โ€” the fact table

    CREATE OR REPLACE TABLE fct_monthly_nights AS
    SELECT
      s.nuts2_code,
      d.region_name,
      s.year,
      s.month,
      DATE_TRUNC('month', MAKE_DATE(s.year, s.month, 1)) AS month_start,
      s.nights
    FROM stg_tourism s
    JOIN dim_region d ON d.nuts2_code = s.nuts2_code
    WHERE s.year BETWEEN 2019 AND 2025;        -- pre-COVID baseline through latest
    

    04_recovery_index.sql โ€” the analysis

    We define "recovery index" as 2025 nights / 2019 nights per region, where 2019 is the last full pre-COVID year. > 1 means the region is past pre-pandemic levels.

    WITH yearly AS (
      SELECT region_name, year, SUM(nights) AS yearly_nights
      FROM   fct_monthly_nights
      GROUP BY region_name, year
    ),
    pivoted AS (
      SELECT
        region_name,
        SUM(CASE WHEN year = 2019 THEN yearly_nights END) AS nights_2019,
        SUM(CASE WHEN year = 2022 THEN yearly_nights END) AS nights_2022,
        SUM(CASE WHEN year = 2025 THEN yearly_nights END) AS nights_2025
      FROM yearly
      GROUP BY region_name
    )
    SELECT
      region_name,
      nights_2019,
      nights_2022,
      nights_2025,
      ROUND(nights_2022 / nights_2019, 3) AS recovery_2022,
      ROUND(nights_2025 / nights_2019, 3) AS recovery_2025,
      ROUND((nights_2025 - nights_2022) / nights_2019, 3) AS lift_22_to_25
    FROM pivoted
    ORDER BY recovery_2025 DESC NULLS LAST;
    

    Save the result to a recovery_index view โ€” Power BI will hit it next.


    4. ๐Ÿ“Š Build the Power BI dashboard

    1. Install Power BI Desktop โ€” download here (Windows only).
    2. Connect to DuckDB: install the DuckDB ODBC driver and point Power BI โ†’ Get Data โ†’ ODBC at your tourism.duckdb file. (Alternative: export each table to Parquet with COPY fct_monthly_nights TO 'fct.parquet' and import the parquets โ€” simpler, slightly less elegant.)
    3. Build the model: in the model view, confirm fct_monthly_nights[nuts2_code] joins to dim_region[nuts2_code].
    4. DAX measures (Modeling tab โ†’ New measure):
    Total Nights = SUM ( fct_monthly_nights[nights] )
    
    Nights 2019 = CALCULATE ( [Total Nights], fct_monthly_nights[year] = 2019 )
    Nights 2025 = CALCULATE ( [Total Nights], fct_monthly_nights[year] = 2025 )
    
    Recovery Index = DIVIDE ( [Nights 2025], [Nights 2019] )
    Lift 22โ†’25     = DIVIDE ( [Nights 2025] - CALCULATE([Total Nights], fct_monthly_nights[year] = 2022), [Nights 2019] )
    
    1. Page layout (one page, mobile-friendly):
      • Headline card (top-left, large): Recovery Index for all Italy.
      • Bar chart (left): Recovery Index by region, sorted descending. Conditional color (red < 0.9, gray 0.9โ€“1.0, green > 1.0).
      • Line chart (right): Monthly nights, 2019 vs 2022 vs 2025, one line each.
      • Map (bottom): NUTS 2 regions colored by Recovery Index. Use the Filled Map visual with region_name on Location.
    2. Slicers: Year (multi-select) and Region (dropdown).
    3. Mobile layout (View โ†’ Mobile layout): stack the card, the bar chart, and the line chart in a single column. Skip the map on mobile.
    4. Title and subtitle at the top of the page:
      • "Italian tourism in 2025: recovered above pre-pandemic in 14 of 20 regions, with the South leading." (or whatever your data says)
      • Source line: "Source: Eurostat tour_occ_arn2, accessed [date]"

    5. โœ๏ธ Write the conclusion

    In the README, in a "Findings" section, write three to five lines following the SCQA structure (Situation, Complication, Question, Answer):

    Italian regional tourism recovery has been uneven since the 2020 collapse. Eurostat data shows that as of 2025, [N] of 20 regions have surpassed their 2019 baseline, while [M] remain at [X]% of pre-pandemic volume. The South of Italy outpaced the North for the first time in modern record, driven by [Calabria, Sicilia, Puglia]. For a hotel chain planning 2027 expansion, the recovery-laggard regions concentrate in [the Alpine and northern industrial belt], suggesting headroom there has been underestimated.

    Tailor to what your actual data shows. The recommendation matters more than the exact numbers.


    6. ๐Ÿ“ฆ Package on GitHub

    README.md template:

    # Italian Tourism Recovery Dashboard (SQL + Power BI)
    
    **Question:** Which Italian regions had the steepest tourism recovery 2022 โ†’ 2025?
    
    **Headline finding:** [one sentence with the result]
    
    ![dashboard screenshot](powerbi/screenshots/headline.png)
    
    ## Tech
    - DuckDB + SQL (CTEs, pivots, ratio analysis)
    - Power BI Desktop with DAX measures
    - Python (only for the Eurostat TSV ingest)
    
    ## Reproduce
    1. Clone this repo
    2. `pip install duckdb pandas`
    3. Download `tour_occ_arn2.tsv` from [Eurostat](https://ec.europa.eu/eurostat/databrowser/view/tour_occ_arn2/default/table) into `data/raw/`
    4. `python ingest/load_eurostat.py`
    5. `duckdb data/tourism.duckdb < sql/01_stg_tourism.sql sql/02_dim_region.sql sql/03_fct_monthly_nights.sql`
    6. Open `powerbi/tourism_recovery.pbix` and refresh
    
    ## Findings
    [Your 4-line SCQA paragraph here]
    
    ## Data source
    [Eurostat โ€” Nights spent at tourist accommodation establishments by NUTS 2](https://ec.europa.eu/eurostat/databrowser/view/tour_occ_arn2/default/table) (CC-BY 4.0).
    

    Commit everything except the .duckdb file (gitignore it โ€” too big and easy to rebuild from the script).


    โœ… Definition of Done

    • GitHub repo exists with README, SQL files, ingestion script, and a .pbix
    • README headline answers the question in one sentence
    • One screenshot embedded in the README shows the dashboard
    • Three or more SQL files using CTEs and joins
    • At least three DAX measures in the Power BI file
    • Mobile layout exists in the Power BI file
    • You can talk through the dashboard in 2 minutes out loud

    ๐Ÿš€ Stretch Goals

    • Add a window function to compute year-over-year growth per region per month (LAG over a region-partition).
    • Add a drillthrough page that filters to a single region and shows its monthly history.
    • Publish to Power BI Service (free with a Microsoft 365 trial) and share a live link in the README.
    • Replicate the dashboard in Tableau Public for a side-by-side BI comparison โ€” useful interview talking point.

    ๐ŸŽฏ Why this is portfolio-grade

    It demonstrates the full analyst loop in one repo: question framing, data ingestion, SQL modeling, BI delivery, and written communication. The dataset is authoritative and re-runnable, the question is genuinely interesting, and the recommendation is specific. That's the package a hiring manager wants to see in 30 seconds.

    Project Details

    Tools & Technologies

    DuckDB
    SQL
    Power BI
    DAX
    Python

    Difficulty Level

    Beginner

    Estimated Duration

    6-8 hours

    Sign in to submit projects and track your progress

    More Projects