🇮🇹 Project: Italian Tourism Recovery Dashboard

📌 Project Overview

A real, scoped business question with a clean public dataset and a polished Power BI deliverable — the kind of portfolio piece a recruiter actually reads.

You will load Eurostat tourism statistics into a local DuckDB database, model the data in SQL, then build a single-page Power BI dashboard that answers:

Which Italian regions had the steepest tourism recovery between 2022 and 2025, and what does it imply for a hotel chain planning 2027 capacity expansion?

By the end you have a public GitHub repo with the SQL, the Power BI file, a written conclusion, and a dashboard screenshot: the package recruiters look for at a first screening.

🎯 Learning Objectives

Pull a real EU public dataset and load it into a local analytical database
Write SQL with CTEs and conditional aggregation to compute recovery metrics (window functions are a stretch goal)
Model fact and dimension tables for a BI tool
Build a Power BI report with DAX measures, slicers, and a mobile layout
Tell a one-page data story end to end (question → analysis → recommendation)
Package the project on GitHub with a recruiter-ready README

🧰 Prerequisites

Power BI Desktop (Windows only; macOS users can run it in a local Windows VM or in a Windows VM on the Azure free tier)
DuckDB CLI or Python — install with pip install duckdb or download from duckdb.org
Python 3.10+ with pandas (just for the ingestion script)
Git + a GitHub account
Basic SQL: SELECT, JOIN, GROUP BY, CTE. Window functions are a stretch goal.

📊 Dataset

Eurostat: Arrivals at tourist accommodation establishments by NUTS 2 region

Direct browser link: tour_occ_arn2 on Eurostat databrowser
Metric: arrivals (check-ins), NOT nights spent. Eurostat's nights-spent datasets are tour_occ_nin2 (NUTS 3, annual) and tour_occ_nin2m (NUTS 2, monthly, but only from 2020 — no pre-COVID baseline). Arrivals at NUTS 2 with a 2019 baseline is the cleanest cut for this question; just call the metric what it is.
Download format: TSV via the "Download" button (full dataset, then filter to Italy by geo codes starting with IT)
Time coverage: annual, 2012–present. The recovery analysis uses 2019 (pre-COVID baseline) through the latest full year.
Geographic granularity: NUTS 2 (Italian regions: Lombardia, Lazio, Veneto, Sicilia, etc., with Bolzano and Trento as separate NUTS 2 units).

Why this dataset? It's authoritative (official EU statistics), regularly refreshed, covers exactly the question we're asking, and is at the right grain for a dashboard. The mess is real but bounded — perfect for showing cleaning skill without spending three weeks on it. One real-world trap included for free: the dataset code says "arn2" (ARrivals), and half the internet mislabels it as nights. Verify your metric against a known number before you trust any label, including ours.

⌛ Estimated Time

Duration: 6–8 hours
Difficulty: Beginner

📂 Suggested Project Structure

tourism-recovery-dashboard/
├── data/
│   ├── raw/
│   │   └── tour_occ_arn2.tsv         # Eurostat download
│   └── tourism.duckdb                # local analytics DB (gitignored)
├── ingest/
│   └── load_eurostat.py              # parse TSV → DuckDB
├── sql/
│   ├── 01_stg_tourism.sql            # cleaned staging table
│   ├── 02_dim_region.sql             # NUTS 2 → region name map
│   ├── 03_fct_yearly_arrivals.sql    # yearly fact table
│   └── 04_recovery_index.sql         # the analysis query
├── powerbi/
│   ├── tourism_recovery.pbix         # the dashboard
│   └── screenshots/
│       └── headline.png
├── README.md
└── .gitignore

🔄 Step-by-Step Guide

1. 📥 Pull the data

Open Eurostat: tour_occ_arn2.
Click Download → Full dataset → TSV (one file, around 6–10 MB).
Save as data/raw/tour_occ_arn2.tsv.

The file is wide-format and uses Eurostat's TSV conventions: the first column has comma-separated dimension values, then one column per year/month period. Yes, it's awkward — that's the point.
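To make that layout concrete, here is a minimal sketch that parses a two-row sample of the same shape using only the standard library. The sample values, the flag letter `e`, and the helpers `parse_cell` and `to_long` are invented for illustration; they are not taken from the real file.

```python
import csv
import io
import re

# Hypothetical sample in Eurostat's TSV shape; the numbers are made up.
SAMPLE = (
    "freq,c_resid,unit,nace_r2,geo\\TIME_PERIOD\t2019 \t2020 \n"
    "A,TOTAL,NR,I551-I553,ITC4\t100 \t40 e\n"
    "A,TOTAL,NR,I551-I553,ITF6\t: \t12 \n"
)

def parse_cell(cell):
    """Return the leading number of a cell, or None for ':' (missing)."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", cell)
    return float(match.group(1)) if match else None

def to_long(text):
    """Reshape wide Eurostat-style TSV text into one dict per (key, period)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    dims = header[0].split("\\")[0].split(",")   # dimension names
    periods = [p.strip() for p in header[1:]]    # "2019 " -> "2019"
    records = []
    for row in body:
        key = dict(zip(dims, row[0].split(",")))
        for period, cell in zip(periods, row[1:]):
            records.append({**key, "period": period, "arrivals": parse_cell(cell)})
    return records

records = to_long(SAMPLE)
for rec in records:
    print(rec["geo"], rec["period"], rec["arrivals"])
```

The sketch shows the two things any loader has to handle: the packed first column, and cells that carry a trailing flag or a ":" placeholder instead of a clean number.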

2. 🧱 Load into DuckDB

Eurostat TSV is too messy for DuckDB's read_csv_auto alone. Use a tiny Python script to reshape:

# ingest/load_eurostat.py
import duckdb
import pandas as pd

# Read every cell as text: Eurostat marks missing values as ":" and can
# append flag letters to a number (for example "1234 e").
df = pd.read_csv(
    "data/raw/tour_occ_arn2.tsv",
    sep="\t",
    dtype=str,
    keep_default_na=False,
)

# Split the first column "freq,c_resid,unit,nace_r2,geo\TIME_PERIOD"
key_col = df.columns[0]
keys = df[key_col].str.split(",", expand=True)
keys.columns = key_col.split("\\")[0].split(",")
df = pd.concat([keys, df.drop(columns=[key_col])], axis=1)

# Wide → long
df_long = df.melt(
    id_vars=keys.columns.tolist(),
    var_name="period",
    value_name="arrivals",
)
df_long["period"] = df_long["period"].str.strip()

# Keep the leading number and drop any flag; ":" has no digits and becomes NaN
df_long["arrivals"] = pd.to_numeric(
    df_long["arrivals"].str.extract(r"(-?\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)

con = duckdb.connect("data/tourism.duckdb")
con.execute("CREATE OR REPLACE TABLE raw_eurostat AS SELECT * FROM df_long")
print(con.execute("SELECT COUNT(*) FROM raw_eurostat").fetchone())

Run it: python ingest/load_eurostat.py. You should see a row count in the hundreds of thousands.

3. 🧹 Stage and model in SQL

Open DuckDB:

duckdb data/tourism.duckdb

`01_stg_tourism.sql` — clean staging

CREATE OR REPLACE TABLE stg_tourism AS
SELECT
  geo                                              AS nuts2_code,
  unit,
  c_resid                                          AS residence,
  CAST(TRIM(period) AS INT)                        AS year,
  arrivals
FROM raw_eurostat
WHERE geo LIKE 'IT%'              -- Italy only
  AND arrivals IS NOT NULL
  AND c_resid = 'TOTAL'           -- residents + non-residents combined
  AND unit = 'NR'                 -- absolute numbers, not % changes
  AND nace_r2 = 'I551-I553';      -- aggregated accommodation sector: the
                                  -- individual I551/I552/I553 rows overlap
                                  -- with this total, so keeping them all
                                  -- would double count
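The double-counting risk described in the comment above can be tested by checking the grain of the staging table: after filtering, each (nuts2_code, year) pair should appear exactly once. A minimal sketch in plain Python, using made-up rows and a hypothetical helper named `duplicate_keys`:

```python
from collections import Counter

def duplicate_keys(rows):
    """Return the (nuts2_code, year) pairs that occur more than once."""
    counts = Counter((r["nuts2_code"], r["year"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Made-up rows: the second list keeps a sub-sector row next to the total.
filtered = [
    {"nuts2_code": "ITC4", "year": 2019, "nace_r2": "I551-I553", "arrivals": 100.0},
    {"nuts2_code": "ITC4", "year": 2020, "nace_r2": "I551-I553", "arrivals": 40.0},
]
unfiltered = filtered + [
    {"nuts2_code": "ITC4", "year": 2019, "nace_r2": "I551", "arrivals": 70.0},
]

print(duplicate_keys(filtered))    # [] means one row per region and year
print(duplicate_keys(unfiltered))  # [('ITC4', 2019)] signals overlapping rows
```

The equivalent check in DuckDB is SELECT nuts2_code, year, COUNT(*) FROM stg_tourism GROUP BY nuts2_code, year HAVING COUNT(*) > 1, which should return no rows.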

`02_dim_region.sql` — readable region names

CREATE OR REPLACE TABLE dim_region AS
SELECT * FROM (VALUES
  ('ITC1', 'Piemonte'),
  ('ITC2', 'Valle d''Aosta'),
  ('ITC3', 'Liguria'),
  ('ITC4', 'Lombardia'),
  ('ITF1', 'Abruzzo'),
  ('ITF2', 'Molise'),
  ('ITF3', 'Campania'),
  ('ITF4', 'Puglia'),
  ('ITF5', 'Basilicata'),
  ('ITF6', 'Calabria'),
  ('ITG1', 'Sicilia'),
  ('ITG2', 'Sardegna'),
  ('ITH1', 'Bolzano'),
  ('ITH2', 'Trento'),
  ('ITH3', 'Veneto'),
  ('ITH4', 'Friuli-Venezia Giulia'),
  ('ITH5', 'Emilia-Romagna'),
  ('ITI1', 'Toscana'),
  ('ITI2', 'Umbria'),
  ('ITI3', 'Marche'),
  ('ITI4', 'Lazio')
) AS t(nuts2_code, region_name);

`03_fct_yearly_arrivals.sql` — the fact table

CREATE OR REPLACE TABLE fct_yearly_arrivals AS
SELECT
  s.nuts2_code,
  d.region_name,
  s.year,
  s.arrivals
FROM stg_tourism s
JOIN dim_region d ON d.nuts2_code = s.nuts2_code
WHERE s.year BETWEEN 2019 AND 2025;        -- pre-COVID baseline through latest

`04_recovery_index.sql` — the analysis

We define the "recovery index" as 2025 arrivals / 2019 arrivals per region, where 2019 is the last full pre-COVID year. A value above 1 means the region is past pre-pandemic levels. Every recovery column divides by the same 2019 baseline; resist the temptation to chain period-over-period ratios, because they answer a different question (growth since 2022 rather than distance from the pre-pandemic level).

WITH yearly AS (
  SELECT region_name, year, SUM(arrivals) AS yearly_arrivals
  FROM   fct_yearly_arrivals
  GROUP BY region_name, year
),
pivoted AS (
  SELECT
    region_name,
    SUM(CASE WHEN year = 2019 THEN yearly_arrivals END) AS arrivals_2019,
    SUM(CASE WHEN year = 2022 THEN yearly_arrivals END) AS arrivals_2022,
    SUM(CASE WHEN year = 2025 THEN yearly_arrivals END) AS arrivals_2025
  FROM yearly
  GROUP BY region_name
)
SELECT
  region_name,
  arrivals_2019,
  arrivals_2022,
  arrivals_2025,
  ROUND(arrivals_2022 / arrivals_2019, 3) AS recovery_2022,
  ROUND(arrivals_2025 / arrivals_2019, 3) AS recovery_2025,
  ROUND((arrivals_2025 - arrivals_2022) / arrivals_2019, 3) AS lift_22_to_25
FROM pivoted
ORDER BY recovery_2025 DESC NULLS LAST;

Save the result as a view named recovery_index by prefixing the query with CREATE OR REPLACE VIEW recovery_index AS; Power BI reads it in the next step.
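A quick worked example of the three ratios, with arrivals invented for a single region so the arithmetic is easy to follow. The function name `recovery_metrics` is hypothetical; the formulas mirror the SELECT list of the analysis query, plus the chained 2025/2022 ratio for contrast.

```python
def recovery_metrics(a2019, a2022, a2025):
    """Ratios against the fixed 2019 baseline, rounded to three decimals."""
    recovery_2022 = round(a2022 / a2019, 3)
    recovery_2025 = round(a2025 / a2019, 3)
    lift_22_to_25 = round((a2025 - a2022) / a2019, 3)
    chained_25_over_22 = round(a2025 / a2022, 3)  # answers a different question
    return recovery_2022, recovery_2025, lift_22_to_25, chained_25_over_22

# Invented arrivals for one region, not Eurostat figures.
print(recovery_metrics(1000, 800, 1100))
# (0.8, 1.1, 0.3, 1.375)
```

Because both recovery columns share the 2019 denominator, the lift is simply their difference (1.1 - 0.8 = 0.3). The chained ratio of 1.375 says 2025 was 37.5% above 2022, which measures growth since 2022, not position against the pre-pandemic level.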

4. 📊 Build the Power BI dashboard

Install Power BI Desktop — download here (Windows only).
Connect to DuckDB: install the DuckDB ODBC driver and point Power BI → Get Data → ODBC at your tourism.duckdb file. (Alternative: export each table to Parquet with COPY fct_yearly_arrivals TO 'fct.parquet' and import the parquets — simpler, slightly less elegant.)
Build the model: in the model view, confirm fct_yearly_arrivals[nuts2_code] joins to dim_region[nuts2_code].
DAX measures (Modeling tab → New measure):

Total Arrivals = SUM ( fct_yearly_arrivals[arrivals] )

Arrivals 2019 = CALCULATE ( [Total Arrivals], fct_yearly_arrivals[year] = 2019 )
Arrivals 2025 = CALCULATE ( [Total Arrivals], fct_yearly_arrivals[year] = 2025 )

Recovery Index = DIVIDE ( [Arrivals 2025], [Arrivals 2019] )
Lift 22→25     = DIVIDE ( [Arrivals 2025] - CALCULATE([Total Arrivals], fct_yearly_arrivals[year] = 2022), [Arrivals 2019] )

Page layout (one page, mobile-friendly):
- Headline card (top-left, large): Recovery Index for all Italy.
- Bar chart (left): Recovery Index by region, sorted descending. Conditional color (red < 0.9, gray 0.9–1.0, green > 1.0).
- Line chart (right): yearly arrivals 2019 through 2025, one line for Italy overall (regions via the slicer). Power BI's built-in forecast (Analytics pane) extending it 2 to 3 years is a nice touch here.
- Map (bottom): NUTS 2 regions colored by Recovery Index. Use the Filled Map visual with region_name on Location.
Slicers: Year (multi-select) and Region (dropdown).
Mobile layout (View → Mobile layout): stack the card, the bar chart, and the line chart in a single column. Skip the map on mobile.
Title and subtitle at the top of the page:
- "Italian tourism in 2025: recovered above pre-pandemic levels in 14 of 21 regions, with the South leading." (or whatever your data says; the count is out of 21 because Bolzano and Trento are separate NUTS 2 units)
- Source line: "Source: Eurostat tour_occ_arn2, accessed [date]"
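For the Parquet alternative mentioned in the connection step, the export can be scripted instead of typed table by table. This is a minimal sketch under assumptions: the helper `export_tables`, the stand-in class `RecordingConnection`, and the output folder `data/parquet` are all hypothetical and not part of the project structure above. The COPY statement is the one from the connection step with the format spelled out.

```python
def export_tables(con, tables, out_dir="data/parquet"):
    """Run one COPY ... TO per table and return the statements executed.

    `con` is any object with an execute(sql) method, such as a DuckDB
    connection. The output folder (an assumed path) must already exist.
    """
    statements = []
    for table in tables:
        sql = f"COPY {table} TO '{out_dir}/{table}.parquet' (FORMAT PARQUET)"
        con.execute(sql)
        statements.append(sql)
    return statements

class RecordingConnection:
    """Stand-in that records SQL so the statements can be previewed."""
    def __init__(self):
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)

preview = RecordingConnection()
for stmt in export_tables(preview, ["fct_yearly_arrivals", "dim_region"]):
    print(stmt)

# Against the real database (requires the duckdb package):
#   import duckdb
#   export_tables(duckdb.connect("data/tourism.duckdb"),
#                 ["fct_yearly_arrivals", "dim_region"])
```

Power BI can then load the resulting files through Get Data → Parquet, one file per table.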

5. ✍️ Write the conclusion

In the README, in a "Findings" section, write three to five lines following the SCQA structure (Situation, Complication, Question, Answer):

Italian regional tourism recovery has been uneven since the 2020 collapse. Eurostat data shows that as of 2025, [N] of 21 NUTS 2 regions have surpassed their 2019 baseline, while [M] remain at [X]% of pre-pandemic volume. The South of Italy outpaced the North for the first time in modern record, driven by [Calabria, Sicilia, Puglia]. For a hotel chain planning 2027 expansion, the recovery-laggard regions concentrate in [the Alpine and northern industrial belt], suggesting headroom there has been underestimated.

Tailor to what your actual data shows. The recommendation matters more than the exact numbers.
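The bracketed placeholders in the paragraph can be filled from the recovery index rather than read off the chart. A minimal sketch, assuming a dict of region name to index value; the helper `summarize` and the example numbers are invented and are not Eurostat results.

```python
def summarize(recovery_by_region):
    """Count regions above the 2019 baseline and describe the weakest one."""
    above = sorted(r for r, idx in recovery_by_region.items() if idx > 1.0)
    below = {r: idx for r, idx in recovery_by_region.items() if idx <= 1.0}
    worst = min(below, key=below.get) if below else None
    return {
        "regions_total": len(recovery_by_region),
        "regions_above_baseline": len(above),
        "regions_below_baseline": len(below),
        "weakest_region": worst,
        "weakest_pct_of_2019": round(below[worst] * 100, 1) if worst else None,
    }

# Invented index values for four placeholder regions.
example = {"Region A": 1.12, "Region B": 0.94, "Region C": 1.03, "Region D": 0.88}
print(summarize(example))
```

In practice the input would come from the recovery_2025 column, so the headline sentence and the dashboard always agree.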

6. 📦 Package on GitHub

README.md template:

# Italian Tourism Recovery Dashboard (SQL + Power BI)

**Question:** Which Italian regions had the steepest tourism recovery 2022 → 2025?

**Headline finding:** [one sentence with the result]

![dashboard screenshot](powerbi/screenshots/headline.png)

## Tech
- DuckDB + SQL (CTEs, pivots, ratio analysis)
- Power BI Desktop with DAX measures
- Python (only for the Eurostat TSV ingest)

## Reproduce
1. Clone this repo
2. `pip install duckdb pandas`
3. Download `tour_occ_arn2.tsv` from [Eurostat](https://ec.europa.eu/eurostat/databrowser/view/tour_occ_arn2/default/table) into `data/raw/`
4. `python ingest/load_eurostat.py`
5. Run the SQL files in order: `for f in sql/*.sql; do duckdb data/tourism.duckdb < "$f"; done`
6. Open `powerbi/tourism_recovery.pbix` and refresh

## Findings
[Your 4-line SCQA paragraph here]

## Data source
[Eurostat — Arrivals at tourist accommodation establishments by NUTS 2 region](https://ec.europa.eu/eurostat/databrowser/view/tour_occ_arn2/default/table) (CC-BY 4.0).

Commit everything except the .duckdb file (gitignore it — too big and easy to rebuild from the script).

✅ Definition of Done

GitHub repo exists with README, SQL files, ingestion script, and a .pbix
README headline answers the question in one sentence
One screenshot embedded in the README shows the dashboard
Three or more SQL files using CTEs and joins
At least three DAX measures in the Power BI file
Mobile layout exists in the Power BI file
You can talk through the dashboard in 2 minutes out loud

🚀 Stretch Goals

Add a window function to compute year-over-year growth per region (LAG over a region-partition).
Add a drillthrough page that filters to a single region and shows its yearly history.
Rebuild the analysis on nights spent using tour_occ_nin2m (monthly, NUTS 2, from 2020): no 2019 baseline, so reframe the question around seasonality instead of recovery. Comparing arrivals vs nights recovery is a great interview talking point (day-trippers vs long stays).
Publish to Power BI Service (free with a Microsoft 365 trial) and share a live link in the README.
Replicate the dashboard in Tableau Public for a side-by-side BI comparison — useful interview talking point.
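For the first stretch goal, year-over-year growth with LAG could look roughly like this. The sketch runs the query through Python's built-in sqlite3 module only so that it works without extra installs; the rows are invented, and the same window expression is expected to work in DuckDB against the real fct_yearly_arrivals table.

```python
import sqlite3

# Invented arrivals for two placeholder regions.
rows = [
    ("Region A", 2019, 1000.0), ("Region A", 2020, 400.0), ("Region A", 2021, 600.0),
    ("Region B", 2019, 500.0), ("Region B", 2020, 250.0), ("Region B", 2021, 400.0),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_yearly_arrivals (region_name TEXT, year INT, arrivals REAL)")
con.executemany("INSERT INTO fct_yearly_arrivals VALUES (?, ?, ?)", rows)

YOY_SQL = """
SELECT
  region_name,
  year,
  arrivals,
  ROUND(
    arrivals / LAG(arrivals) OVER (PARTITION BY region_name ORDER BY year) - 1,
    3
  ) AS yoy_growth              -- NULL for each region's first year
FROM fct_yearly_arrivals
ORDER BY region_name, year
"""

yoy = con.execute(YOY_SQL).fetchall()
for row in yoy:
    print(row)
```

Partitioning by region keeps each region's series separate, so the first year of every region has no previous row and its growth stays NULL instead of borrowing a value from another region.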

🎯 Why this is portfolio-grade

It demonstrates the full analyst loop in one repo: question framing, data ingestion, SQL modeling, BI delivery, and written communication. The dataset is authoritative and re-runnable, the question is genuinely interesting, and the recommendation is specific. That's the package a hiring manager wants to see in 30 seconds.
