🏠 Project: Milan Airbnb EDA

📌 Project Overview

A second portfolio piece that demonstrates the side of analyst work BI dashboards don't: cleaning messy real-world data and writing structured analysis in Python.

You'll work with the Inside Airbnb public dataset for Milan: a real, messy snapshot of the Airbnb listings scraped for the city. You'll clean it, run an exploratory data analysis in a Jupyter notebook, and answer:

What drives Airbnb listing prices in Milan, and where are the under-priced opportunities for a guest looking for value?

By the end you'll have a public GitHub repo with a polished notebook, a README that recruiters can read in 30 seconds, and one publication-quality chart that summarizes the story.

🎯 Learning Objectives

Pull a real, messy public dataset and load it in pandas
Apply the full data-cleaning loop: missing values, type coercion, outliers, string normalization, datetime parsing
Run an EDA with at least three visualizations
Test a hypothesis against the data and check counter-evidence
Write a one-paragraph SCQA conclusion that lands the recommendation
Package the notebook on GitHub with a recruiter-ready README

🧰 Prerequisites

Python 3.10+ (or Anaconda)
A working Jupyter install (pip install jupyterlab or use Google Colab)
Git + GitHub account
Comfort with the pandas basics: load, filter, groupby, merge

If you completed da-4-2 and da-4-3 of the Data Analyst roadmap, you're ready.

📊 Dataset

Inside Airbnb: Milan listings

Source: insideairbnb.com/get-the-data → scroll to Milan, Italy → download listings.csv.gz (the detailed listings file, ~10–15 MB compressed).
Snapshot: scraped quarterly; pick the most recent one available.
Rows: ~25,000 Milan listings.
Columns: ~75 — including price (as a messy string with $ and commas), neighbourhood, room type, host info, latitude/longitude, review scores.

Why this dataset? It's famously messy in real-world ways: prices stored as strings, missing values everywhere, inconsistent text fields, geo coordinates that need projecting. Cleaning it well is the demonstration. The dataset is also genuinely interesting — anyone who's looked for a place in Milan has an intuition you can check against the data.

License: Inside Airbnb data is published under CC-BY 4.0. Credit them in your README.

⌛ Estimated Time

Duration: 5–7 hours
Difficulty: Beginner

📂 Suggested Project Structure

milan-airbnb-eda/
├── data/
│   └── raw/
│       └── listings.csv.gz       # the Inside Airbnb download
├── notebook.ipynb                 # the analysis
├── assets/
│   └── headline-chart.png         # exported for the README
├── requirements.txt
├── README.md
└── .gitignore
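
If you'd rather create this layout from Python than by hand, a minimal sketch follows. The `scaffold` helper and its default folder name are illustrative assumptions, not part of any tool. It only creates the folders and empty text files listed above, and it leaves `notebook.ipynb` for Jupyter to create, because an empty file is not a valid notebook.

```python
from pathlib import Path


def scaffold(root="milan-airbnb-eda"):
    """Create the folders and empty text files from the layout above."""
    root = Path(root)
    # Folders first: data/raw for the download, assets for exported charts.
    for folder in ("data/raw", "assets"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Empty placeholders; the notebook itself is created later from Jupyter.
    for name in ("requirements.txt", "README.md", ".gitignore"):
        (root / name).touch()
    return root
```

Call `scaffold()` once from the directory where you want the project to live; running it again is harmless because existing folders and files are kept.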

🔄 Step-by-Step Guide

1. 🎯 Frame the question

Open the notebook. The very first cell is a markdown cell with the question:

# Milan Airbnb — What drives price, and where's the value?

**Audience:** a guest planning a 3-day Milan trip in 2026, budget-conscious, willing to walk 10–15 minutes for value.

**Concrete sub-questions:**
1. Which neighbourhoods have the highest median nightly price?
2. After controlling for room type and capacity, which neighbourhoods are underpriced?
3. Is there a review-score floor below which price drops sharply?

Writing the question first stops you from drifting into "let me explore everything".

2. 📥 Load and sniff

import pandas as pd
pd.set_option("display.max_columns", 80)

df = pd.read_csv("data/raw/listings.csv.gz", compression="gzip")
print(df.shape)
df.info()
df.head(3)  # keep this last: Jupyter only renders the final expression in a cell

You should see ~25k rows and ~75 columns. Note the columns you'll keep: id, neighbourhood_cleansed, room_type, accommodates, bathrooms, bedrooms, price, number_of_reviews, review_scores_rating, latitude, longitude. You can drop the rest.
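
Before deciding what to keep, it helps to see how sparse each column is. The following is a minimal sketch: `missing_report` is a hypothetical helper name, and the demo frame is made-up data standing in for the real file so the block runs by itself.

```python
import pandas as pd


def missing_report(df):
    """Return null count and null share per column, sparsest columns first."""
    report = pd.DataFrame({
        "nulls": df.isna().sum(),
        "share": df.isna().mean(),
    })
    return report.sort_values("share", ascending=False)


# Made-up rows standing in for the real listings file.
demo = pd.DataFrame({
    "price": ["$120.00", None, "$95.00", "$1,250.00"],
    "bedrooms": [1.0, None, None, 2.0],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Entire home/apt"],
})
report = missing_report(demo)
print(report)
```

On the real data, call `missing_report(df)` right after loading and note in a markdown cell which sparse columns you are choosing to drop or keep.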

3. 🧹 Clean

Each cleanup gets a markdown header in the notebook explaining the decision.

Drop columns you don't need

keep = [
    "id", "neighbourhood_cleansed", "room_type", "accommodates",
    "bathrooms", "bedrooms", "price", "number_of_reviews",
    "review_scores_rating", "latitude", "longitude",
]
df = df[keep].copy()
df.columns = df.columns.str.replace("_cleansed", "")  # cleaner names

Parse the messy `price` column

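# Looks like "$120.00" with commas for thousands. Coerce to float.
# Inside Airbnb's data dictionary describes price as local currency, so treat
# the "$" as an export artifact (euros for Milan), not US dollars.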
df["price"] = (
    df["price"]
    .astype(str)
    .str.replace(r"[$,]", "", regex=True)
    .pipe(pd.to_numeric, errors="coerce")
)

print("nulls in price:", df["price"].isna().sum())
df["price"].describe()
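
To convince yourself the coercion behaves as intended, the same steps can be wrapped in a function and run on a few hand-written strings. This is a sketch: `parse_price` is a hypothetical name, and the sample values are invented rather than taken from the dataset.

```python
import pandas as pd


def parse_price(raw):
    """Strip '$' and thousands commas, then coerce to float (bad values become NaN)."""
    cleaned = raw.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Invented inputs: two valid prices, one missing value, two unparseable strings.
samples = pd.Series(["$120.00", "$1,250.00", None, "n/a", ""])
parsed = parse_price(samples)
print(parsed.tolist())
```

The two valid strings come back as floats and everything else becomes NaN, which is why the null count printed after parsing can be higher than the null count in the raw column.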

Drop pathological rows

before = len(df)
df = df.dropna(subset=["price", "neighbourhood", "room_type"])
df = df[df["price"] > 0]                          # zero-price listings are bugs
df = df[df["price"] < df["price"].quantile(0.99)]  # cut the long tail (mansions, errors)
print(f"dropped {before - len(df)} rows ({(before - len(df)) / before:.1%})")

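The 1% trim is a judgment call, not a rule: listing prices have a long right tail (luxury properties plus placeholder or mistyped prices) that inflates means and stretches every chart axis. Medians are far less sensitive to that tail, so state the cutoff in the notebook and check that your conclusions hold without it.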
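
One way to back up the trim is to show what it changes. The sketch below compares mean and median before and after the cut; `trim_report` is a hypothetical helper, and the demo prices are invented to include a single implausible value.

```python
import pandas as pd


def trim_report(prices, q=0.99):
    """Compare summary statistics before and after dropping prices at or above quantile q."""
    cutoff = prices.quantile(q)
    kept = prices[prices < cutoff]
    return {
        "cutoff": float(cutoff),
        "rows_dropped": int(len(prices) - len(kept)),
        "median_before": float(prices.median()),
        "median_after": float(kept.median()),
        "mean_before": float(prices.mean()),
        "mean_after": float(kept.mean()),
    }


# Invented prices: 99 ordinary listings plus one implausible 10,000 entry.
demo_prices = pd.Series(list(range(1, 100)) + [10_000], dtype=float)
summary = trim_report(demo_prices)
print(summary)
```

In this toy example the mean drops from 149.5 to 50.0 while the median only moves from 50.5 to 50.0. Run `trim_report(df["price"])` before applying the filter to see how large the effect is in your snapshot.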

Type the rest

df["bedrooms"]      = df["bedrooms"].fillna(0).astype(int)  # treats missing as 0 bedrooms; note this assumption
df["accommodates"]  = df["accommodates"].astype(int)
df["room_type"]     = df["room_type"].astype("category")
df["neighbourhood"] = df["neighbourhood"].astype("category")

4. 📈 Describe

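from IPython.display import display  # a cell only auto-renders its last expression

display(df.describe(include="all").T)
display(df["neighbourhood"].value_counts().head(10))
display(df["room_type"].value_counts(normalize=True))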

Comment on what you see in a markdown cell: how many neighbourhoods there are, how concentrated the listings are, and what fraction is "Entire home/apt" vs "Private room".
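
"How concentrated" is easier to comment on with a number. A minimal sketch, assuming a Series of neighbourhood labels; `top_share` is a hypothetical helper and the demo labels are invented.

```python
import pandas as pd


def top_share(labels, n=10):
    """Share of all non-null rows that fall in the n most common labels."""
    counts = labels.value_counts()
    return counts.head(n).sum() / counts.sum()


# Invented neighbourhood labels, just to exercise the function.
demo = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
print(top_share(demo, n=1))
print(top_share(demo, n=2))
```

On the real data, `top_share(df["neighbourhood"], n=10)` gives the fraction of listings that sit in the ten largest neighbourhoods, which you can quote directly in the markdown cell.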

5. 🎨 Visualize three things

import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

Chart 1 — Median price by neighbourhood (sorted bar)

from pathlib import Path

top = (
    df.groupby("neighbourhood", observed=True)["price"]
      .median()
      .sort_values(ascending=False)
      .head(15)
)
fig, ax = plt.subplots(figsize=(10, 6))
top.plot(kind="barh", ax=ax, color="#3b82f6")
ax.invert_yaxis()  # highest median at the top
ax.set_xlabel("Median nightly price (€)")
ax.set_ylabel("")
ax.set_title("Top 15 Milan neighbourhoods by median Airbnb price")
plt.tight_layout()
Path("assets").mkdir(exist_ok=True)  # savefig raises if the folder is missing
plt.savefig("assets/headline-chart.png", dpi=150)
plt.show()

This is your headline chart. Export it for the README.

Chart 2 — Price distribution by room type

fig, ax = plt.subplots(figsize=(9, 5))
sns.boxplot(data=df, x="room_type", y="price", ax=ax, showfliers=False)
ax.set_title("Price distribution by room type (outliers hidden)")
ax.set_ylabel("Nightly price (€)")
plt.tight_layout()
plt.show()

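Caption: look at the overlap between the "Entire home/apt" lower quartile and the "Private room" upper quartile. Where the two boxes overlap, the same nightly price buys either a whole place or a single room, and that band is where the value is.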
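
The overlap in the box plot can be read off as numbers too. The sketch below computes the quartiles per room type; `room_type_quartiles` is a hypothetical helper, and the demo prices are invented so the block is self-contained.

```python
import pandas as pd


def room_type_quartiles(df):
    """Lower quartile, median and upper quartile of price for each room type."""
    q = (
        df.groupby("room_type", observed=True)["price"]
          .quantile([0.25, 0.5, 0.75])
          .unstack()
    )
    q.columns = ["q25", "median", "q75"]
    return q


# Invented prices for two room types.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 5,
    "price": [80, 100, 120, 140, 160, 40, 60, 80, 100, 120],
})
q = room_type_quartiles(demo)
# True when the upper quartile of private rooms reaches the lower quartile of entire homes.
overlap = q.loc["Private room", "q75"] >= q.loc["Entire home/apt", "q25"]
print(q)
print("overlap:", overlap)
```

If `overlap` is False on the real data, the caption should say so instead of claiming a band that isn't there.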

Chart 3 — Price vs review score (does quality command a premium?)

fig, ax = plt.subplots(figsize=(9, 6))
sns.scatterplot(
    data=df.sample(min(2000, len(df)), random_state=42),  # subsample for legibility
    x="review_scores_rating", y="price",
    hue="room_type", alpha=0.5, ax=ax,
)
ax.set_xlim(2.5, 5.05)
ax.set_ylim(0, df["price"].quantile(0.95))
ax.set_xlabel("Average review score (out of 5)")
ax.set_ylabel("Nightly price (€)")
ax.set_title("Price vs review score, by room type")
plt.tight_layout()
plt.show()

Caption: comment on whether the slope is flat or rising — does the market pay for quality, or is it priced on location and capacity alone?
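
A scatter plot can hide a step change, so a table of median price per rating band is a useful companion for the "review-score floor" sub-question. This is a sketch under stated assumptions: `median_price_by_rating_band` is a hypothetical helper, the band edges are arbitrary choices you should adjust, and the demo rows are invented.

```python
import pandas as pd


def median_price_by_rating_band(df, edges=(0, 4.0, 4.5, 4.8, 5.0)):
    """Listing count and median price within each review-score band."""
    rated = df.dropna(subset=["review_scores_rating", "price"])
    # Right-closed bands, e.g. (4.5, 4.8]; include_lowest keeps a score of 0.
    bands = pd.cut(rated["review_scores_rating"], bins=list(edges), include_lowest=True)
    return rated.groupby(bands, observed=True)["price"].agg(
        listings="count", median_price="median"
    )


# Invented listings; the last one has no rating and is ignored.
demo = pd.DataFrame({
    "review_scores_rating": [3.5, 3.9, 4.2, 4.6, 4.7, 4.9, 5.0, None],
    "price": [50, 60, 80, 90, 110, 120, 140, 100],
})
bands = median_price_by_rating_band(demo)
print(bands)
```

Check the `listings` column before reading anything into a band: a median based on a handful of listings is not evidence of a floor.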

6. 🧠 Hypothesis + counter-evidence

Write your hypothesis in markdown, then deliberately go looking for what would disprove it. Example:

Hypothesis: The Brera and Duomo neighbourhoods command the highest prices because of central location, but Isola and Porta Romana offer better value (lower price for similar quality and capacity).

Counter-evidence to check:

Are Isola listings systematically smaller? (Compare accommodates distribution.)

Do Isola listings have worse review scores? (Compare review_scores_rating.)

Is there a sample-size issue? (How many listings does each neighbourhood have?)

Write the small follow-up cells that check each one. Write what you find honestly — if the counter-evidence wins, that's part of the analysis.

focus = df[df["neighbourhood"].isin(["Brera", "Duomo", "Isola", "Porta Romana"])]
focus.groupby("neighbourhood", observed=True).agg(
    listings=("id", "count"),
    median_price=("price", "median"),
    median_accommodates=("accommodates", "median"),
    median_rating=("review_scores_rating", "median"),
)

(Neighbourhood spellings come from Inside Airbnb's neighbourhood_cleansed. Check the actual values in your data and adjust the list.)
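
Checking the spellings can be done in code. The sketch below maps the names you have in mind to whatever spelling is actually present, ignoring case and stray spaces. `match_labels` is a hypothetical helper, and the demo labels are invented, so don't read them as the dataset's real values.

```python
import pandas as pd


def match_labels(values, wanted):
    """Map each wanted name to the spelling found in values (None if absent)."""
    present = {str(v).strip().casefold(): v for v in values.dropna().unique()}
    return {name: present.get(name.strip().casefold()) for name in wanted}


# Invented labels with inconsistent case and a trailing space.
demo = pd.Series(["BRERA", "DUOMO", "ISOLA ", "BRERA", "NAVIGLI"])
found = match_labels(demo, ["Brera", "Duomo", "Isola", "Porta Romana"])
print(found)
```

Any name that maps to `None` is not in your snapshot under that spelling; pass the non-null values from `found` to `.isin(...)` when building `focus`.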

7. ✅ Conclusion

A final markdown cell with the SCQA framing in 4–6 lines:

Situation: Milan Airbnb supply is concentrated in central neighbourhoods (Duomo, Brera) where median prices run €X.
Complication: Within walking distance, Isola and Porta Romana offer comparable room types at €Y less (~Z%).
Question: Where's the value play for a 3-day trip?
Answer: Isola for entire-home stays with high review scores (>4.8); Porta Romana for private rooms. Avoid Duomo unless the trip is under 2 nights and the convenience premium is worth €Y/night.

The exact numbers are what you found. The point is the structure.
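
The X, Y and Z placeholders should come straight from the data, not from memory. A minimal sketch of that calculation follows; `price_gap` is a hypothetical helper, it assumes the cleaned `neighbourhood` and `price` columns from earlier, and the demo rows are invented.

```python
import pandas as pd


def price_gap(df, central, value):
    """Median price in two groups of neighbourhoods, plus the gap in euros and percent."""
    central_median = df.loc[df["neighbourhood"].isin(central), "price"].median()
    value_median = df.loc[df["neighbourhood"].isin(value), "price"].median()
    gap = central_median - value_median
    return {
        "central_median": float(central_median),   # X in the template
        "value_median": float(value_median),
        "gap_eur": float(gap),                      # Y in the template
        "gap_pct": float(gap / central_median * 100),  # Z in the template
    }


# Invented listings for two neighbourhoods.
demo = pd.DataFrame({
    "neighbourhood": ["Duomo"] * 3 + ["Isola"] * 3,
    "price": [200, 180, 220, 120, 150, 90],
})
print(price_gap(demo, central=["Duomo"], value=["Isola"]))
```

This compares raw medians only. If the counter-evidence step showed that the cheaper neighbourhoods have smaller listings, filter to one room type and capacity before computing the gap.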

8. 📦 Package on GitHub

README.md template:

# Milan Airbnb EDA

**Question:** What drives Airbnb listing prices in Milan, and where's the value?

**Headline finding:** [one sentence with the result]

![Median price by neighbourhood](assets/headline-chart.png)

Full analysis: [notebook.ipynb](./notebook.ipynb)

## Tech
- Python 3.10, pandas, seaborn, matplotlib
- Jupyter notebook with markdown narration
- One publication-quality chart exported for this README

## Reproduce
1. Clone this repo
2. `pip install -r requirements.txt`
3. Download `listings.csv.gz` from [Inside Airbnb — Milan](https://insideairbnb.com/get-the-data/) into `data/raw/`
4. `jupyter lab notebook.ipynb` and Run All

## Findings
[Your 5-line SCQA paragraph here]

## Data source
[Inside Airbnb — Milan](https://insideairbnb.com/) (CC-BY 4.0).

requirements.txt:

pandas>=2.1
seaborn>=0.13
matplotlib>=3.8
jupyterlab>=4.0

.gitignore:

data/raw/*.gz
.ipynb_checkpoints/
__pycache__/
.venv/

✅ Definition of Done

GitHub repo exists with notebook, README, requirements, and assets folder
README headline answers the question in one sentence
Headline chart embedded in the README via the assets folder
Notebook has the 7-step structure (question, load, clean, describe, visualize, hypothesis, conclusion)
At least three labelled visualizations
The notebook renders cleanly on GitHub (no broken outputs, no exceptions in cells)
You can present the analysis in 2 minutes out loud
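
The "no exceptions in cells" item can be checked mechanically, because a saved `.ipynb` file is plain JSON. The sketch below assumes the nbformat 4 layout, where each code cell stores an `outputs` list and a failed cell contains an output with `output_type` set to `"error"`. The function names are hypothetical.

```python
import json


def cells_with_errors(notebook):
    """Return the positions of code cells whose saved outputs include an error."""
    bad = []
    for position, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        if any(out.get("output_type") == "error" for out in cell.get("outputs", [])):
            bad.append(position)
    return bad


def check_file(path="notebook.ipynb"):
    """Load a saved notebook from disk and list the failing cell positions."""
    with open(path, encoding="utf-8") as handle:
        return cells_with_errors(json.load(handle))
```

Run `check_file()` after a Restart and Run All, then save; an empty list means no saved cell output contains a traceback. It does not tell you whether cells were run in order.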

🚀 Stretch Goals

Geo plot with latitude/longitude — use folium for an interactive map of price by listing, color-coded.
Price regression: build a simple linear regression of price on room_type, accommodates, bedrooms, neighbourhood using statsmodels. Identify the most under-/over-priced listings as the largest residuals.
Compare two cities: repeat for Rome or Florence and write a short comparison.
Ship as a Streamlit app so a non-technical reader can filter and explore interactively.
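
For the price-regression goal, the residual idea can be sketched in a few lines. The text above suggests statsmodels; to stay dependency-light this sketch swaps in NumPy's least-squares solver, which gives the same ordinary least squares coefficients but none of the statistical summary. `price_residuals` is a hypothetical helper, `neighbourhood` is left out of the default features for brevity, and the demo rows are invented.

```python
import numpy as np
import pandas as pd


def price_residuals(df, numeric=("accommodates", "bedrooms"), categorical=("room_type",)):
    """Fit price on the given features by least squares; return listings sorted by residual."""
    # One-hot encode categoricals, dropping one level to avoid collinearity with the intercept.
    dummies = pd.get_dummies(df[list(categorical)], drop_first=True)
    X = pd.concat([df[list(numeric)], dummies], axis=1).astype(float)
    X.insert(0, "intercept", 1.0)
    y = df["price"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
    out = df[["id", "price"]].copy()
    out["fitted"] = X.to_numpy() @ coef
    out["residual"] = out["price"] - out["fitted"]
    # Most negative residual first: cheaper than the model expects.
    return out.sort_values("residual").reset_index(drop=True)


# Invented listings; id 4 is priced 30 below the pattern the others follow.
demo = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "accommodates": [2, 2, 4, 4, 2, 2, 4, 4],
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "price": [90, 90, 130, 100, 60, 60, 100, 100],
})
ranked = price_residuals(demo, numeric=["accommodates"])
print(ranked.head(3))
```

The first rows are the candidates for "under-priced", the last rows for "over-priced". A large residual can also mean the model is missing a feature, so inspect a few listings by hand before calling them bargains.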

🎯 Why this is portfolio-grade

Recruiters tend to look for two things in a Python notebook: careful cleaning and a real recommendation. This project enforces both. Inside Airbnb is a widely used public dataset that many hiring managers will recognize, which makes the analysis legible at a glance. The notebook structure carries over to any future EDA, so you leave with a personal template you can reuse for years.
