Tourism Recovery Dashboard (SQL + Power BI)
Answer a real business question end to end: load Eurostat regional tourism data into DuckDB, model the metrics in SQL, and ship a one-page Power BI dashboard explaining where tourism recovered fastest between 2022 and 2025.
This project was designed by data engineering professionals to simulate real-world scenarios used at companies like Netflix, Airbnb, and Spotify. Master DuckDB, SQL, Power BI and 2 more technologies through hands-on implementation. Rated beginner level with comprehensive documentation and starter code.
๐ฎ๐น Project: Italian Tourism Recovery Dashboard
๐ Project Overview
A real, scoped business question with a clean public dataset and a polished Power BI deliverable โ the kind of portfolio piece a recruiter actually reads.
You will load Eurostat tourism statistics into a local DuckDB database, model the data in SQL, then build a single-page Power BI dashboard that answers:
Which Italian regions had the steepest tourism recovery between 2022 and 2025, and what does it imply for a hotel chain planning 2027 capacity expansion?
By the end you have a public GitHub repo with the SQL, the Power BI file, a written conclusion, and a dashboard screenshot โ exactly the package recruiters look for at first-screen.
๐ฏ Learning Objectives
- Pull a real EU public dataset and load it into a local analytical database
- Write SQL with CTEs and window functions to compute recovery metrics
- Model fact and dimension tables for a BI tool
- Build a Power BI report with DAX measures, slicers, and a mobile layout
- Tell a one-page data story end to end (question โ analysis โ recommendation)
- Package the project on GitHub with a recruiter-ready README
๐งฐ Prerequisites
- Power BI Desktop (Windows only; macOS users can run it in a Windows VM or use a Windows VM in Azure free tier)
- DuckDB CLI or Python โ install with
pip install duckdbor download from duckdb.org - Python 3.10+ with
pandas(just for the ingestion script) - Git + a GitHub account
- Basic SQL: SELECT, JOIN, GROUP BY, CTE. Window functions are a stretch goal.
๐ Dataset
Eurostat: Nights spent at tourist accommodation establishments by NUTS 2 region
- Direct browser link: tour_occ_arn2 on Eurostat databrowser
- Download format: TSV via the "Download" button (full dataset, then filter to Italy by
geocodes starting withIT) - Time coverage: 2012โpresent, monthly. We'll restrict to 2022-01 onward.
- Geographic granularity: NUTS 2 (20 Italian regions: Lombardia, Lazio, Veneto, Sicilia, etc.).
Why this dataset? It's authoritative (official EU statistics), regularly refreshed, covers exactly the question we're asking, and is at the right grain for a dashboard. The mess is real but bounded โ perfect for showing cleaning skill without spending three weeks on it.
โ Estimated Time
Duration: 6โ8 hours
Difficulty: Beginner
๐ Suggested Project Structure
tourism-recovery-dashboard/
โโโ data/
โ โโโ raw/
โ โ โโโ tour_occ_arn2.tsv # Eurostat download
โ โโโ tourism.duckdb # local analytics DB (gitignored)
โโโ ingest/
โ โโโ load_eurostat.py # parse TSV โ DuckDB
โโโ sql/
โ โโโ 01_stg_tourism.sql # cleaned staging table
โ โโโ 02_dim_region.sql # NUTS 2 โ region name map
โ โโโ 03_fct_monthly_nights.sql # monthly fact table
โ โโโ 04_recovery_index.sql # the analysis query
โโโ powerbi/
โ โโโ tourism_recovery.pbix # the dashboard
โ โโโ screenshots/
โ โโโ headline.png
โโโ README.md
โโโ .gitignore
๐ Step-by-Step Guide
1. ๐ฅ Pull the data
- Open Eurostat: tour_occ_arn2.
- Click Download โ Full dataset โ TSV (one file, around 6โ10 MB).
- Save as
data/raw/tour_occ_arn2.tsv.
The file is wide-format and uses Eurostat's TSV conventions: the first column has comma-separated dimension values, then one column per year/month period. Yes, it's awkward โ that's the point.
2. ๐งฑ Load into DuckDB
Eurostat TSV is too messy for DuckDB's read_csv_auto alone. Use a tiny Python script to reshape:
# ingest/load_eurostat.py
import pandas as pd
import duckdb
df = pd.read_csv(
"data/raw/tour_occ_arn2.tsv",
sep="\t",
na_values=[":"], # Eurostat marks missing as ":"
)
# Split the first column "freq,c_resid,unit,nace_r2,geo\TIME_PERIOD"
key_col = df.columns[0]
keys = df[key_col].str.split(",", expand=True)
keys.columns = key_col.split("\\")[0].split(",")
df = pd.concat([keys, df.drop(columns=[key_col])], axis=1)
# Wide โ long
df_long = df.melt(
id_vars=keys.columns.tolist(),
var_name="period",
value_name="nights",
)
df_long["nights"] = pd.to_numeric(df_long["nights"], errors="coerce")
con = duckdb.connect("data/tourism.duckdb")
con.execute("CREATE OR REPLACE TABLE raw_eurostat AS SELECT * FROM df_long")
print(con.execute("SELECT COUNT(*) FROM raw_eurostat").fetchone())
Run it: python ingest/load_eurostat.py. You should see a row count in the hundreds of thousands.
3. ๐งน Stage and model in SQL
Open DuckDB:
duckdb data/tourism.duckdb
01_stg_tourism.sql โ clean staging
CREATE OR REPLACE TABLE stg_tourism AS
SELECT
geo AS nuts2_code,
unit,
c_resid AS residence,
CAST(SUBSTR(period, 1, 4) AS INT) AS year,
CAST(SUBSTR(period, 6, 2) AS INT) AS month,
nights
FROM raw_eurostat
WHERE geo LIKE 'IT%' -- Italy only
AND period LIKE '20__-__' -- monthly rows
AND nights IS NOT NULL
AND c_resid = 'TOTAL'; -- residents + non-residents combined
02_dim_region.sql โ readable region names
CREATE OR REPLACE TABLE dim_region AS
SELECT * FROM (VALUES
('ITC1', 'Piemonte'),
('ITC2', 'Valle d''Aosta'),
('ITC3', 'Liguria'),
('ITC4', 'Lombardia'),
('ITF1', 'Abruzzo'),
('ITF2', 'Molise'),
('ITF3', 'Campania'),
('ITF4', 'Puglia'),
('ITF5', 'Basilicata'),
('ITF6', 'Calabria'),
('ITG1', 'Sicilia'),
('ITG2', 'Sardegna'),
('ITH1', 'Bolzano'),
('ITH2', 'Trento'),
('ITH3', 'Veneto'),
('ITH4', 'Friuli-Venezia Giulia'),
('ITH5', 'Emilia-Romagna'),
('ITI1', 'Toscana'),
('ITI2', 'Umbria'),
('ITI3', 'Marche'),
('ITI4', 'Lazio')
) AS t(nuts2_code, region_name);
03_fct_monthly_nights.sql โ the fact table
CREATE OR REPLACE TABLE fct_monthly_nights AS
SELECT
s.nuts2_code,
d.region_name,
s.year,
s.month,
DATE_TRUNC('month', MAKE_DATE(s.year, s.month, 1)) AS month_start,
s.nights
FROM stg_tourism s
JOIN dim_region d ON d.nuts2_code = s.nuts2_code
WHERE s.year BETWEEN 2019 AND 2025; -- pre-COVID baseline through latest
04_recovery_index.sql โ the analysis
We define "recovery index" as 2025 nights / 2019 nights per region, where 2019 is the last full pre-COVID year. > 1 means the region is past pre-pandemic levels.
WITH yearly AS (
SELECT region_name, year, SUM(nights) AS yearly_nights
FROM fct_monthly_nights
GROUP BY region_name, year
),
pivoted AS (
SELECT
region_name,
SUM(CASE WHEN year = 2019 THEN yearly_nights END) AS nights_2019,
SUM(CASE WHEN year = 2022 THEN yearly_nights END) AS nights_2022,
SUM(CASE WHEN year = 2025 THEN yearly_nights END) AS nights_2025
FROM yearly
GROUP BY region_name
)
SELECT
region_name,
nights_2019,
nights_2022,
nights_2025,
ROUND(nights_2022 / nights_2019, 3) AS recovery_2022,
ROUND(nights_2025 / nights_2019, 3) AS recovery_2025,
ROUND((nights_2025 - nights_2022) / nights_2019, 3) AS lift_22_to_25
FROM pivoted
ORDER BY recovery_2025 DESC NULLS LAST;
Save the result to a recovery_index view โ Power BI will hit it next.
4. ๐ Build the Power BI dashboard
- Install Power BI Desktop โ download here (Windows only).
- Connect to DuckDB: install the DuckDB ODBC driver and point Power BI โ Get Data โ ODBC at your
tourism.duckdbfile. (Alternative: export each table to Parquet withCOPY fct_monthly_nights TO 'fct.parquet'and import the parquets โ simpler, slightly less elegant.) - Build the model: in the model view, confirm
fct_monthly_nights[nuts2_code]joins todim_region[nuts2_code]. - DAX measures (Modeling tab โ New measure):
Total Nights = SUM ( fct_monthly_nights[nights] )
Nights 2019 = CALCULATE ( [Total Nights], fct_monthly_nights[year] = 2019 )
Nights 2025 = CALCULATE ( [Total Nights], fct_monthly_nights[year] = 2025 )
Recovery Index = DIVIDE ( [Nights 2025], [Nights 2019] )
Lift 22โ25 = DIVIDE ( [Nights 2025] - CALCULATE([Total Nights], fct_monthly_nights[year] = 2022), [Nights 2019] )
- Page layout (one page, mobile-friendly):
- Headline card (top-left, large): Recovery Index for all Italy.
- Bar chart (left): Recovery Index by region, sorted descending. Conditional color (red < 0.9, gray 0.9โ1.0, green > 1.0).
- Line chart (right): Monthly nights, 2019 vs 2022 vs 2025, one line each.
- Map (bottom): NUTS 2 regions colored by Recovery Index. Use the Filled Map visual with
region_nameon Location.
- Slicers: Year (multi-select) and Region (dropdown).
- Mobile layout (View โ Mobile layout): stack the card, the bar chart, and the line chart in a single column. Skip the map on mobile.
- Title and subtitle at the top of the page:
- "Italian tourism in 2025: recovered above pre-pandemic in 14 of 20 regions, with the South leading." (or whatever your data says)
- Source line: "Source: Eurostat tour_occ_arn2, accessed [date]"
5. โ๏ธ Write the conclusion
In the README, in a "Findings" section, write three to five lines following the SCQA structure (Situation, Complication, Question, Answer):
Italian regional tourism recovery has been uneven since the 2020 collapse. Eurostat data shows that as of 2025, [N] of 20 regions have surpassed their 2019 baseline, while [M] remain at [X]% of pre-pandemic volume. The South of Italy outpaced the North for the first time in modern record, driven by [Calabria, Sicilia, Puglia]. For a hotel chain planning 2027 expansion, the recovery-laggard regions concentrate in [the Alpine and northern industrial belt], suggesting headroom there has been underestimated.
Tailor to what your actual data shows. The recommendation matters more than the exact numbers.
6. ๐ฆ Package on GitHub
README.md template:
# Italian Tourism Recovery Dashboard (SQL + Power BI)
**Question:** Which Italian regions had the steepest tourism recovery 2022 โ 2025?
**Headline finding:** [one sentence with the result]

## Tech
- DuckDB + SQL (CTEs, pivots, ratio analysis)
- Power BI Desktop with DAX measures
- Python (only for the Eurostat TSV ingest)
## Reproduce
1. Clone this repo
2. `pip install duckdb pandas`
3. Download `tour_occ_arn2.tsv` from [Eurostat](https://ec.europa.eu/eurostat/databrowser/view/tour_occ_arn2/default/table) into `data/raw/`
4. `python ingest/load_eurostat.py`
5. `duckdb data/tourism.duckdb < sql/01_stg_tourism.sql sql/02_dim_region.sql sql/03_fct_monthly_nights.sql`
6. Open `powerbi/tourism_recovery.pbix` and refresh
## Findings
[Your 4-line SCQA paragraph here]
## Data source
[Eurostat โ Nights spent at tourist accommodation establishments by NUTS 2](https://ec.europa.eu/eurostat/databrowser/view/tour_occ_arn2/default/table) (CC-BY 4.0).
Commit everything except the .duckdb file (gitignore it โ too big and easy to rebuild from the script).
โ Definition of Done
- GitHub repo exists with README, SQL files, ingestion script, and a
.pbix - README headline answers the question in one sentence
- One screenshot embedded in the README shows the dashboard
- Three or more SQL files using CTEs and joins
- At least three DAX measures in the Power BI file
- Mobile layout exists in the Power BI file
- You can talk through the dashboard in 2 minutes out loud
๐ Stretch Goals
- Add a window function to compute year-over-year growth per region per month (LAG over a region-partition).
- Add a drillthrough page that filters to a single region and shows its monthly history.
- Publish to Power BI Service (free with a Microsoft 365 trial) and share a live link in the README.
- Replicate the dashboard in Tableau Public for a side-by-side BI comparison โ useful interview talking point.
๐ฏ Why this is portfolio-grade
It demonstrates the full analyst loop in one repo: question framing, data ingestion, SQL modeling, BI delivery, and written communication. The dataset is authoritative and re-runnable, the question is genuinely interesting, and the recommendation is specific. That's the package a hiring manager wants to see in 30 seconds.