Airbnb Listings EDA (Python + pandas)
Clean a real, messy Inside Airbnb listings dataset, run an exploratory analysis in a Jupyter notebook, and ship the result on GitHub with a README and one publication-quality chart.
Project: Milan Airbnb EDA
Project Overview
A second portfolio piece that demonstrates the side of analyst work BI dashboards don't: cleaning messy real-world data and writing structured analysis in Python.
You'll work with the Inside Airbnb public dataset for Milan: a real, intentionally messy snapshot of every active Airbnb listing in the city. You'll clean it, run an exploratory data analysis in a Jupyter notebook, and answer:
What drives Airbnb listing prices in Milan, and where are the under-priced opportunities for a guest looking for value?
By the end you have a public GitHub repo with a polished notebook, a README that recruiters can read in 30 seconds, and one publication-quality chart that summarizes the story.
Learning Objectives
- Pull a real, messy public dataset and load it in pandas
- Apply the full data-cleaning loop: missing values, type coercion, outliers, string normalization, datetime parsing
- Run an EDA with at least three visualizations
- Test a hypothesis against the data and check counter-evidence
- Write a one-paragraph SCQA conclusion that lands the recommendation
- Package the notebook on GitHub with a recruiter-ready README
Prerequisites
- Python 3.10+ (or Anaconda)
- A working Jupyter install (`pip install jupyterlab` or use Google Colab)
- Git + GitHub account
- Comfort with the pandas basics: load, filter, groupby, merge
If you completed da-4-2 and da-4-3 of the Data Analyst roadmap, you're ready.
Dataset
Inside Airbnb: Milan listings
- Source: insideairbnb.com/get-the-data. Scroll to Milan, Italy, and download `listings.csv.gz` (the detailed listings file, ~10-15 MB compressed).
- Snapshot: scraped quarterly; pick the most recent one available.
- Rows: ~25,000 Milan listings.
- Columns: ~75, including price (as a messy string with `$` and commas), neighbourhood, room type, host info, latitude/longitude, review scores.
Why this dataset? It's famously messy in real-world ways: prices stored as strings, missing values everywhere, inconsistent text fields, geo coordinates that need projecting. Cleaning it well is the demonstration. The dataset is also genuinely interesting: anyone who's looked for a place in Milan has an intuition you can check against the data.
License: Inside Airbnb data is published under CC-BY 4.0. Credit them in your README.
Estimated Time
Duration: 5-7 hours
Difficulty: Beginner
Suggested Project Structure
milan-airbnb-eda/
├── data/
│   └── raw/
│       └── listings.csv.gz   # the Inside Airbnb download
├── notebook.ipynb            # the analysis
├── assets/
│   └── headline-chart.png    # exported for the README
├── requirements.txt
├── README.md
└── .gitignore
Step-by-Step Guide
1. Frame the question
Open the notebook. The very first cell is a markdown cell with the question:
# Milan Airbnb: What drives price, and where's the value?
**Audience:** a guest planning a 3-day Milan trip in 2026, budget-conscious, willing to walk 10-15 minutes for value.
**Concrete sub-questions:**
1. Which neighbourhoods have the highest median nightly price?
2. After controlling for room type and capacity, which neighbourhoods are underpriced?
3. Is there a review-score floor below which price drops sharply?
Writing the question first stops you from drifting into "let me explore everything".
2. Load and sniff
import pandas as pd
pd.set_option("display.max_columns", 80)
df = pd.read_csv("data/raw/listings.csv.gz", compression="gzip")
print(df.shape)
df.head(3)
df.info()
You should see ~25k rows and ~75 columns. Note the columns you'll keep: id, neighbourhood_cleansed, room_type, accommodates, bathrooms, bedrooms, price, number_of_reviews, review_scores_rating, latitude, longitude. You can drop the rest.
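Before deciding which columns to keep, it can help to see how much of each one is missing. The following is a minimal sketch on a tiny stand-in frame rather than the real download; `missing_report` is a helper name made up for illustration, and the values are invented.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype and share of missing values per column, worst first."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
    })
    return report.sort_values("missing_share", ascending=False)


# Tiny stand-in for the real listings frame (values are invented).
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$120.00", None, "$95.00", "$1,250.00"],
    "bathrooms": [1.0, None, None, None],
})
report = missing_report(sample)
print(report)
```

On the real data, calling `missing_report(df)` right after loading gives a quick view of which columns are too sparse to rely on.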
3. Clean
Each cleanup gets a markdown header in the notebook explaining the decision.
Drop columns you don't need
keep = [
"id", "neighbourhood_cleansed", "room_type", "accommodates",
"bathrooms", "bedrooms", "price", "number_of_reviews",
"review_scores_rating", "latitude", "longitude",
]
df = df[keep].copy()
df.columns = df.columns.str.replace("_cleansed", "") # cleaner names
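The column selection can also happen at read time, which may help if memory or load time is a concern. The sketch below uses an in-memory CSV as a stand-in for `listings.csv.gz` and a shortened, illustrative `KEEP` list. Passing a callable to `usecols` skips a wanted column that is absent instead of raising, so the check for missing columns is done explicitly.

```python
import io

import pandas as pd

# Shortened, illustrative list; "bedrooms" is deliberately absent from the toy file.
KEEP = ["id", "neighbourhood_cleansed", "room_type", "price", "bedrooms"]

# Stand-in for the real file; the real call would pass the .csv.gz path instead.
raw_csv = io.StringIO(
    "id,name,neighbourhood_cleansed,room_type,price\n"
    "1,Loft,BRERA,Entire home/apt,$120.00\n"
    "2,Room,ISOLA,Private room,$55.00\n"
)

# The callable keeps only wanted columns and tolerates absent ones.
df_small = pd.read_csv(raw_csv, usecols=lambda col: col in KEEP)

# Report anything we expected but did not get.
missing_cols = sorted(set(KEEP) - set(df_small.columns))
print("loaded:", list(df_small.columns))
print("missing from this file:", missing_cols)
```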
Parse the messy price column
# Looks like "$120.00" with commas for thousands. Coerce to float.
df["price"] = (
df["price"]
.astype(str)
.str.replace(r"[$,]", "", regex=True)
.pipe(pd.to_numeric, errors="coerce")
)
print("nulls in price:", df["price"].isna().sum())
df["price"].describe()
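To see what the parsing step does to individual values, here is a small check on hand-written strings. The sample values are illustrative, not taken from the dataset; the regex and `errors="coerce"` mirror the step above.

```python
import pandas as pd

raw = pd.Series(["$120.00", "$1,250.00", "", None, "n/a"])

parsed = (
    raw.astype(str)                            # normalise to strings before using .str
    .str.replace(r"[$,]", "", regex=True)      # drop the currency sign and thousands commas
    .pipe(pd.to_numeric, errors="coerce")      # anything non-numeric becomes NaN
)
print(parsed.tolist())
print("unparseable values:", parsed.isna().sum())
```

Counting the NaNs before and after parsing is a cheap way to confirm the step did not silently discard valid prices.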
Drop pathological rows
before = len(df)
df = df.dropna(subset=["price", "neighbourhood", "room_type"])
df = df[df["price"] > 0] # zero-price listings are bugs
df = df[df["price"] < df["price"].quantile(0.99)] # cut the long tail (mansions, errors)
print(f"dropped {before - len(df)} rows ({(before - len(df)) / before:.1%})")
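The 1% trim is justified: Inside Airbnb data has a long right tail of unrealistically high prices (placeholder or "test" listings) that inflate means and stretch chart axes, even though medians are fairly robust to them. State the cutoff in the notebook so readers can see what was removed.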
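One way to back up the trim with numbers is to compare summary statistics before and after the cut. The sketch below uses invented prices with two extreme values; `trim_effect` is a hypothetical helper, and the 0.99 cutoff mirrors the code above.

```python
import pandas as pd


def trim_effect(prices: pd.Series, q: float = 0.99) -> pd.DataFrame:
    """Compare mean and median before and after dropping prices at or above quantile q."""
    cutoff = prices.quantile(q)
    trimmed = prices[prices < cutoff]
    return pd.DataFrame(
        {
            "rows": [len(prices), len(trimmed)],
            "mean": [prices.mean(), trimmed.mean()],
            "median": [prices.median(), trimmed.median()],
        },
        index=["before", "after"],
    )


# Invented data: 198 ordinary prices (50..247) plus two implausible ones.
prices = pd.Series([float(p) for p in range(50, 248)] + [9999.0, 9999.0])
effect = trim_effect(prices)
print(effect)
```

On this toy input the mean drops sharply while the median moves by one unit. Running the same comparison on the real `df["price"]` shows how much each statistic depends on the tail.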
Type the rest
df["bedrooms"] = df["bedrooms"].fillna(0).astype(int)
df["accommodates"] = df["accommodates"].astype(int)
df["room_type"] = df["room_type"].astype("category")
df["neighbourhood"] = df["neighbourhood"].astype("category")
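Filling missing `bedrooms` with 0 treats an unknown value the same as a studio. If that distinction matters for your analysis, one alternative (not what the step above does) is pandas' nullable integer dtype, which keeps the gap as `<NA>`. A minimal sketch on invented values:

```python
import pandas as pd

bedrooms = pd.Series([1.0, 2.0, None, 0.0])

# Option used above: missing becomes 0 and the column is a plain int.
filled = bedrooms.fillna(0).astype(int)

# Alternative: the nullable integer dtype keeps missing values distinguishable.
nullable = bedrooms.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum(), "missing value(s) preserved")
```

Either choice is defensible; the point is to write down which one you made and why.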
4. Describe
df.describe(include="all").T
df["neighbourhood"].value_counts().head(10)
df["room_type"].value_counts(normalize=True)
Comment on what you see in a markdown cell: how many neighbourhoods, how concentrated the listings are, what fraction is "Entire home" vs "Private room".
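The concentration question can be answered with a cumulative share. This is a sketch that assumes a cleaned `neighbourhood` column like the one above and uses made-up labels in place of the real ones; `top_n_share` is a name introduced here.

```python
import pandas as pd


def top_n_share(labels: pd.Series, n: int = 10) -> float:
    """Share of rows that fall in the n most frequent categories."""
    return float(labels.value_counts(normalize=True).head(n).sum())


# Made-up neighbourhood labels standing in for df["neighbourhood"].
labels = pd.Series(["A"] * 5 + ["B"] * 3 + ["C"] * 1 + ["D"] * 1)
print(f"{labels.nunique()} neighbourhoods")
print(f"top 2 hold {top_n_share(labels, n=2):.0%} of listings")
```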
5. Visualize three things
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
Chart 1: Median price by neighbourhood (sorted bar)
top = (
df.groupby("neighbourhood", observed=True)["price"]
.median()
.sort_values(ascending=False)
.head(15)
)
fig, ax = plt.subplots(figsize=(10, 6))
top.plot(kind="barh", ax=ax, color="#3b82f6")
ax.invert_yaxis()
ax.set_xlabel("Median nightly price (€)")
ax.set_ylabel("")
ax.set_title("Top 15 Milan neighbourhoods by median Airbnb price")
plt.tight_layout()
plt.savefig("assets/headline-chart.png", dpi=150)
plt.show()
This is your headline chart. Export it for the README.
Chart 2: Price distribution by room type
fig, ax = plt.subplots(figsize=(9, 5))
sns.boxplot(data=df, x="room_type", y="price", ax=ax, showfliers=False)
ax.set_title("Price distribution by room type (outliers hidden)")
ax.set_ylabel("Nightly price (€)")
plt.tight_layout()
plt.show()
Caption: notice the overlap between the "Entire home" lower quartile and the "Private room" upper quartile: there's an arbitrage band there.
Chart 3: Price vs review score (does quality command a premium?)
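To put numbers on that overlap, the quartiles per room type can be tabulated. This is a sketch on invented prices. The labels `Entire home/apt` and `Private room` are the ones Inside Airbnb typically uses, but that is an assumption worth checking with `df["room_type"].unique()` in your snapshot.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 5,
    "price": [70, 90, 110, 130, 150, 40, 60, 80, 100, 120],
})

# 25th, 50th and 75th percentile of price within each room type.
quartiles = (
    toy.groupby("room_type")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)

# The overlap band runs from the entire-home Q1 up to the private-room Q3.
band_low = quartiles.loc["Entire home/apt", 0.25]
band_high = quartiles.loc["Private room", 0.75]
if band_low < band_high:
    print(f"overlap band: {band_low:.0f} to {band_high:.0f}")
else:
    print("no overlap between the two quartiles")
```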
fig, ax = plt.subplots(figsize=(9, 6))
sns.scatterplot(
data=df.sample(min(2000, len(df)), random_state=42), # subsample for legibility
x="review_scores_rating", y="price",
hue="room_type", alpha=0.5, ax=ax,
)
ax.set_xlim(2.5, 5.05)
ax.set_ylim(0, df["price"].quantile(0.95))
ax.set_xlabel("Average review score (out of 5)")
ax.set_ylabel("Nightly price (€)")
ax.set_title("Price vs review score, by room type")
plt.tight_layout()
plt.show()
Caption: comment on whether the slope is flat or rising. Does the market pay for quality, or is it priced on location and capacity alone?
6. Hypothesis + counter-evidence
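A scatter can hide a threshold. One way to look for the review-score floor from sub-question 3 is to bin ratings and compare median prices per bin. This is a sketch on invented data; the bin edges are arbitrary choices, not recommendations.

```python
import pandas as pd

toy = pd.DataFrame({
    "review_scores_rating": [3.2, 3.9, 4.1, 4.4, 4.6, 4.7, 4.85, 4.9, 5.0, None],
    "price": [50, 55, 60, 80, 95, 100, 110, 120, 115, 90],
})

# Arbitrary rating bands; each band includes its upper edge.
edges = [0, 4.0, 4.5, 4.8, 5.0]
toy["rating_band"] = pd.cut(toy["review_scores_rating"], bins=edges)

# Listings without a rating fall outside every band and are dropped by groupby.
by_band = toy.groupby("rating_band", observed=True)["price"].agg(["count", "median"])
print(by_band)
```

Keeping the `count` column next to the median matters here: a sharp drop in a band with only a handful of listings is weak evidence.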
Write your hypothesis in markdown, then deliberately go looking for what would disprove it. Example:
Hypothesis: The Brera and Duomo neighbourhoods command the highest prices because of central location, but Isola and Porta Romana offer better value (lower price for similar quality and capacity).
Counter-evidence to check:
- Are Isola listings systematically smaller? (Compare `accommodates` distribution.)
- Do Isola listings have worse review scores? (Compare `review_scores_rating`.)
- Is there a sample-size issue? (How many listings does each neighbourhood have?)
Write the small follow-up cells that check each one. Write what you find honestly: if the counter-evidence wins, that's part of the analysis.
focus = df[df["neighbourhood"].isin(["Brera", "Duomo", "Isola", "Porta Romana"])]
focus.groupby("neighbourhood", observed=True).agg(
listings=("id", "count"),
median_price=("price", "median"),
median_accommodates=("accommodates", "median"),
median_rating=("review_scores_rating", "median"),
)
(Neighbourhood spellings come from Inside Airbnb's neighbourhood_cleansed. Check the actual values in your data and adjust the list.)
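Raw medians mix neighbourhoods with different room types and sizes. A rough way to approach the "controlling for room type and capacity" sub-question is to divide each price by the median price of listings with the same `room_type` and `accommodates`, then take the median of that ratio per neighbourhood. This is a sketch of one possible approach on invented rows, not a substitute for a proper model; `price_ratio` and the neighbourhood names are introduced here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood": ["Centre", "Centre", "Centre", "Edge", "Edge", "Edge"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Entire home/apt", "Private room"],
    "accommodates": [2, 4, 2, 2, 4, 2],
    "price": [150, 220, 80, 100, 180, 60],
})

# Median price among comparable listings (same room type and capacity).
peer_median = toy.groupby(["room_type", "accommodates"])["price"].transform("median")

# A ratio below 1 means cheaper than comparable listings across the city.
toy["price_ratio"] = toy["price"] / peer_median

relative = toy.groupby("neighbourhood")["price_ratio"].median().sort_values()
print(relative)
```

Neighbourhoods at the top of `relative` are the candidates for "underpriced", subject to the same sample-size caveat as the counter-evidence checks.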
7. Conclusion
A final markdown cell with the SCQA framing in 4-6 lines:
Situation: Milan Airbnb supply is concentrated in central neighbourhoods (Duomo, Brera) where median prices run €X.
Complication: Within walking distance, Isola and Porta Romana offer comparable room types at €Y less (~Z%).
Question: Where's the value play for a 3-day trip?
Answer: Isola for entire-home stays with high review scores (>4.8); Porta Romana for private rooms. Avoid Duomo unless the trip is < 2 nights and the convenience premium is worth €X/night.
The exact numbers are what you found. The point is the structure.
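The placeholders X, Y and Z can be filled from a small summary rather than by eye. This sketch assumes a frame with `neighbourhood` and `price` columns; the neighbourhood names and prices below are invented, and `price_gap` is a hypothetical helper.

```python
import pandas as pd


def price_gap(df: pd.DataFrame, central: list[str], value: list[str]) -> dict:
    """Median price of two neighbourhood groups plus the absolute and relative gap."""
    central_median = df.loc[df["neighbourhood"].isin(central), "price"].median()
    value_median = df.loc[df["neighbourhood"].isin(value), "price"].median()
    gap = central_median - value_median
    return {
        "central_median": central_median,  # the X in the Situation line
        "gap": gap,                        # the Y
        "gap_pct": gap / central_median,   # the Z
    }


toy = pd.DataFrame({
    "neighbourhood": ["Centre", "Centre", "Centre", "Edge", "Edge", "Edge"],
    "price": [140, 160, 200, 100, 120, 130],
})
numbers = price_gap(toy, central=["Centre"], value=["Edge"])
print(
    f"central median {numbers['central_median']:.0f}, "
    f"value areas {numbers['gap']:.0f} less ({numbers['gap_pct']:.0%})"
)
```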
8. Package on GitHub
README.md template:
# Milan Airbnb EDA
**Question:** What drives Airbnb listing prices in Milan, and where's the value?
**Headline finding:** [one sentence with the result]
![Top 15 Milan neighbourhoods by median Airbnb price](./assets/headline-chart.png)
Full analysis: [notebook.ipynb](./notebook.ipynb)
## Tech
- Python 3.10, pandas, seaborn, matplotlib
- Jupyter notebook with markdown narration
- One publication-quality chart exported for this README
## Reproduce
1. Clone this repo
2. `pip install -r requirements.txt`
3. Download `listings.csv.gz` from [Inside Airbnb: Milan](https://insideairbnb.com/get-the-data/) into `data/raw/`
4. `jupyter lab notebook.ipynb` and Run All
## Findings
[Your 5-line SCQA paragraph here]
## Data source
[Inside Airbnb: Milan](https://insideairbnb.com/) (CC-BY 4.0).
requirements.txt:
pandas>=2.1
seaborn>=0.13
matplotlib>=3.8
jupyterlab>=4.0
.gitignore:
data/raw/*.gz
.ipynb_checkpoints/
__pycache__/
.venv/
Definition of Done
- GitHub repo exists with notebook, README, requirements, and assets folder
- README headline answers the question in one sentence
- Headline chart embedded in the README via the assets folder
- Notebook has the 7-step structure (question, load, clean, describe, visualize, hypothesis, conclusion)
- At least three labelled visualizations
- The notebook renders cleanly on GitHub (no broken outputs, no exceptions in cells)
- You can present the analysis in 2 minutes out loud
Stretch Goals
- Geo plot with `latitude`/`longitude`: use folium for an interactive map of price by listing, color-coded.
- Price regression: build a simple linear regression of price on `room_type`, `accommodates`, `bedrooms`, `neighbourhood` using `statsmodels`. Identify the most under-/over-priced listings as the largest residuals.
- Compare two cities: repeat for Rome or Florence and write a short comparison.
- Ship as a Streamlit app so a non-technical reader can filter and explore interactively.
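For the price-regression stretch goal, the idea of ranking listings by residual can be sketched without extra dependencies. The block below fits ordinary least squares with `numpy.linalg.lstsq` on one-hot encoded columns, as a stand-in for the `statsmodels` model suggested above. The data is invented and far too small to mean anything, and only two of the four suggested predictors are used.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Entire home/apt", "Private room"],
    "accommodates": [2, 4, 1, 2, 3, 2],
    "price": [110.0, 190.0, 45.0, 60.0, 90.0, 95.0],
})

# Design matrix: intercept + capacity + one-hot room type (first level dropped).
X = pd.get_dummies(toy[["room_type", "accommodates"]], columns=["room_type"], drop_first=True)
X.insert(0, "intercept", 1.0)
X = X.astype(float)

# Least-squares fit, then residual = actual price minus fitted price.
coef, *_ = np.linalg.lstsq(X.to_numpy(), toy["price"].to_numpy(), rcond=None)
toy["residual"] = toy["price"] - X.to_numpy() @ coef

# The most negative residual is the listing cheapest relative to the fit.
print(toy.sort_values("residual")[["room_type", "accommodates", "price", "residual"]])
```

With the real data, `statsmodels` adds standard errors and a summary table, which make it easier to judge whether a residual is large enough to be interesting.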
Why this is portfolio-grade
Recruiters look for two things in a Python notebook: clean cleaning and a real recommendation. This project enforces both. The dataset is the standard Inside Airbnb data that hiring managers recognize, which makes the analysis legible at a glance. The notebook structure is replicable for any future EDA: you build a personal template here that you'll use for years.