Airbnb Listings EDA (Python + pandas)

    Clean a real, messy Inside Airbnb listings dataset, run an exploratory analysis in a Jupyter notebook, and ship the result on GitHub with a README and one publication-quality chart.



    Beginner
    5-7 hours

    🏠 Project: Milan Airbnb EDA

    📌 Project Overview

    A second portfolio piece that demonstrates the side of analyst work BI dashboards don't: cleaning messy real-world data and writing structured analysis in Python.

    You'll work with the Inside Airbnb public dataset for Milan, a real and genuinely messy snapshot of every active Airbnb listing in the city. You'll clean it, run an exploratory data analysis in a Jupyter notebook, and answer:

    What drives Airbnb listing prices in Milan, and where are the under-priced opportunities for a guest looking for value?

    By the end you'll have a public GitHub repo with a polished notebook, a README that recruiters can read in 30 seconds, and one publication-quality chart that summarizes the story.


    🎯 Learning Objectives

    • Pull a real, messy public dataset and load it in pandas
    • Apply the full data-cleaning loop: missing values, type coercion, outliers, string normalization, datetime parsing
    • Run an EDA with at least three visualizations
    • Test a hypothesis against the data and check counter-evidence
    • Write a one-paragraph SCQA conclusion that lands the recommendation
    • Package the notebook on GitHub with a recruiter-ready README

    🧰 Prerequisites

    • Python 3.10+ (or Anaconda)
    • A working Jupyter install (pip install jupyterlab or use Google Colab)
    • Git + GitHub account
    • Comfort with the pandas basics: load, filter, groupby, merge

    If you completed da-4-2 and da-4-3 of the Data Analyst roadmap, you're ready.


    📊 Dataset

    Inside Airbnb: Milan listings

    • Source: insideairbnb.com/get-the-data → scroll to Milan, Italy → download listings.csv.gz (the detailed listings file, ~10–15 MB compressed).
    • Snapshot: scraped quarterly; pick the most recent one available.
    • Rows: ~25,000 Milan listings.
    • Columns: ~75, including price (as a messy string with $ and commas), neighbourhood, room type, host info, latitude/longitude, review scores.

    Why this dataset? It's famously messy in real-world ways: prices stored as strings, missing values everywhere, inconsistent text fields, geo coordinates that need projecting. Cleaning it well is the demonstration. The dataset is also genuinely interesting: anyone who's looked for a place in Milan has an intuition you can check against the data.

    License: Inside Airbnb data is published under CC-BY 4.0. Credit them in your README.


    ⌛ Estimated Time

    Duration: 5–7 hours
    Difficulty: Beginner


    📂 Suggested Project Structure

    milan-airbnb-eda/
    ├── data/
    │   └── raw/
    │       └── listings.csv.gz       # the Inside Airbnb download
    ├── notebook.ipynb                 # the analysis
    ├── assets/
    │   └── headline-chart.png         # exported for the README
    ├── requirements.txt
    ├── README.md
    └── .gitignore
    

    🔄 Step-by-Step Guide

    1. 🎯 Frame the question

    Open the notebook. The very first cell is a markdown cell with the question:

    # Milan Airbnb: What drives price, and where's the value?
    
    **Audience:** a guest planning a 3-day Milan trip in 2026, budget-conscious, willing to walk 10–15 minutes for value.
    
    **Concrete sub-questions:**
    1. Which neighbourhoods have the highest median nightly price?
    2. After controlling for room type and capacity, which neighbourhoods are underpriced?
    3. Is there a review-score floor below which price drops sharply?
    

    Writing the question first stops you from drifting into "let me explore everything".


    2. 📥 Load and sniff

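    import pandas as pd
    from IPython.display import display
    pd.set_option("display.max_columns", 80)
    
    df = pd.read_csv("data/raw/listings.csv.gz", compression="gzip")
    print(df.shape)
    display(df.head(3))  # a bare df.head(3) mid-cell is not rendered; only a cell's last expression is
    df.info()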
    

    You should see ~25k rows and ~75 columns. Note the columns you'll keep: id, neighbourhood_cleansed, room_type, accommodates, bathrooms, bedrooms, price, number_of_reviews, review_scores_rating, latitude, longitude. You can drop the rest.
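
    Before deciding what to keep, it can help to put a number on how messy each column is. A minimal sketch, assuming a hypothetical helper named null_report and a tiny made-up frame standing in for the real download:

```python
import numpy as np
import pandas as pd


def null_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype and share of missing values, worst column first."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null_share": frame.isna().mean(),
    })
    return report.sort_values("null_share", ascending=False)


# Stand-in data (made up); in the notebook, call null_report(df) instead.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$120.00", None, "$1,050.00", "$80.00"],
    "bedrooms": [1.0, np.nan, np.nan, 2.0],
})
report = null_report(sample)
print(report)
```

    Columns that are almost entirely null are usually safe to drop; columns with a moderate share of nulls deserve a sentence in the notebook about how you handle them.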


    3. 🧹 Clean

    Each cleanup gets a markdown header in the notebook explaining the decision.

    Drop columns you don't need

    keep = [
        "id", "neighbourhood_cleansed", "room_type", "accommodates",
        "bathrooms", "bedrooms", "price", "number_of_reviews",
        "review_scores_rating", "latitude", "longitude",
    ]
    df = df[keep].copy()
    df.columns = df.columns.str.replace("_cleansed", "")  # cleaner names
    

    Parse the messy price column

    # Looks like "$120.00" with commas for thousands. Coerce to float.
    df["price"] = (
        df["price"]
        .astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .pipe(pd.to_numeric, errors="coerce")
    )
    
    print("nulls in price:", df["price"].isna().sum())
    df["price"].describe()
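
    To convince yourself the parsing step handles the awkward cases, you can run the same chain on a handful of hand-written strings. This is only a sketch: parse_price is a hypothetical helper name, and the sample values are made up rather than taken from the dataset.

```python
import pandas as pd


def parse_price(raw: pd.Series) -> pd.Series:
    """Strip '$' and ',' then coerce to float; anything unparseable becomes NaN."""
    return (
        raw.astype(str)
           .str.replace(r"[$,]", "", regex=True)
           .pipe(pd.to_numeric, errors="coerce")
    )


# Made-up examples of the shapes a price column can take.
raw = pd.Series(["$120.00", "$1,250.00", None, "", "N/A"])
parsed = parse_price(raw)
print(parsed.tolist())
```

    Missing values, empty strings, and free text all end up as NaN, which is why the null count printed after parsing is worth reading.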
    

    Drop pathological rows

    before = len(df)
    df = df.dropna(subset=["price", "neighbourhood", "room_type"])
    df = df[df["price"] > 0]                          # zero-price listings are bugs
    df = df[df["price"] < df["price"].quantile(0.99)]  # cut the long tail (mansions, errors)
    print(f"dropped {before - len(df)} rows ({(before - len(df)) / before:.1%})")
    

    The 1% trim is a judgment call, so state it in the notebook: Inside Airbnb data has a known long right tail of unrealistically high "test" listings that distort means and stretch chart axes.
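
    One way to see what the trim actually changes is to compare summary statistics before and after it. A sketch on made-up prices (99 ordinary listings plus one absurd value), not on the real data:

```python
import pandas as pd

# Made-up prices: 99 ordinary listings (50..148) plus one absurd outlier.
prices = pd.Series([float(p) for p in range(50, 149)] + [50_000.0])

cutoff = prices.quantile(0.99)
trimmed = prices[prices < cutoff]

summary = pd.DataFrame(
    {
        "mean": [prices.mean(), trimmed.mean()],
        "median": [prices.median(), trimmed.median()],
        "rows": [len(prices), len(trimmed)],
    },
    index=["raw", "trimmed"],
)
print(summary)
```

    In this toy case the median hardly moves while the mean changes a lot, so it is worth reporting in the notebook which statistics you rely on and how many rows the trim removed.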

    Type the rest

    df["bedrooms"]      = df["bedrooms"].fillna(0).astype(int)
    df["accommodates"]  = df["accommodates"].astype(int)
    df["room_type"]     = df["room_type"].astype("category")
    df["neighbourhood"] = df["neighbourhood"].astype("category")
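
    The learning objectives also list string normalization and datetime parsing, which the columns kept above do not exercise. A sketch of both, assuming you also keep a date column such as last_review (an assumption; check df.columns in your snapshot) and using made-up values:

```python
import pandas as pd

# Made-up rows with the inconsistencies free-text and date columns tend to have.
raw = pd.DataFrame({
    "neighbourhood": ["  BRERA", "Brera ", "porta  romana", "Isola"],
    "last_review": ["2024-03-01", "2023-12-15", None, "not a date"],
})

clean = raw.assign(
    neighbourhood=(
        raw["neighbourhood"]
        .str.strip()                            # drop leading/trailing whitespace
        .str.replace(r"\s+", " ", regex=True)   # collapse repeated spaces
        .str.title()                            # consistent casing
    ),
    # Unparseable or missing dates become NaT instead of raising.
    last_review=pd.to_datetime(raw["last_review"], format="%Y-%m-%d", errors="coerce"),
)
print(clean)
```

    Normalize text before converting a column to category; otherwise "BRERA" and "Brera " become two separate categories.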
    

    4. 📈 Describe

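    from IPython.display import display
    # Wrap each result in display(): a notebook cell only renders its last bare expression.
    display(df.describe(include="all").T)
    display(df["neighbourhood"].value_counts().head(10))
    display(df["room_type"].value_counts(normalize=True))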
    

    Comment on what you see in a markdown cell: how many neighbourhoods, how concentrated the listings are, what fraction is "Entire home" vs "Private room".
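
    "How concentrated" is easier to comment on with a single number. A small sketch, assuming a hypothetical helper top_n_share and made-up labels in place of the real neighbourhood column:

```python
import pandas as pd


def top_n_share(labels: pd.Series, n: int = 10) -> float:
    """Fraction of all rows that fall in the n most common labels."""
    counts = labels.value_counts()
    return counts.head(n).sum() / counts.sum()


# Made-up neighbourhood labels; in the notebook pass df["neighbourhood"].
labels = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
print(labels.nunique(), "neighbourhoods")
print(f"top 2 hold {top_n_share(labels, n=2):.0%} of listings")
```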


    5. 🎨 Visualize three things

    import seaborn as sns
    import matplotlib.pyplot as plt
    sns.set_theme(style="whitegrid")
    

    Chart 1: Median price by neighbourhood (sorted bar)

    top = (
        df.groupby("neighbourhood", observed=True)["price"]
          .median()
          .sort_values(ascending=False)
          .head(15)
    )
    fig, ax = plt.subplots(figsize=(10, 6))
    top.plot(kind="barh", ax=ax, color="#3b82f6")
    ax.invert_yaxis()
    ax.set_xlabel("Median nightly price (€)")
    ax.set_ylabel("")
    ax.set_title("Top 15 Milan neighbourhoods by median Airbnb price")
    plt.tight_layout()
    plt.savefig("assets/headline-chart.png", dpi=150)  # the assets/ folder must already exist
    plt.show()
    

    This is your headline chart. Export it for the README.

    Chart 2: Price distribution by room type

    fig, ax = plt.subplots(figsize=(9, 5))
    sns.boxplot(data=df, x="room_type", y="price", ax=ax, showfliers=False)
    ax.set_title("Price distribution by room type (outliers hidden)")
    ax.set_ylabel("Nightly price (€)")
    plt.tight_layout()
    plt.show()
    

    Caption: check whether the "Entire home" lower quartile overlaps the "Private room" upper quartile. If it does, there's an arbitrage band there.
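
    The boxplot shows the overlap visually; the quartiles themselves give you numbers to quote in the caption. A sketch on made-up prices, assuming the room type labels "Entire home/apt" and "Private room" (check the actual values in your data):

```python
import pandas as pd

# Made-up prices; in the notebook use df itself.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Private room"] * 5,
    "price": [80, 100, 120, 150, 200, 40, 60, 80, 110, 130],
})

quartiles = (
    demo.groupby("room_type", observed=True)["price"]
        .quantile([0.25, 0.75])
        .unstack()                                   # columns: 0.25 and 0.75
        .rename(columns={0.25: "q25", 0.75: "q75"})
)
print(quartiles)

# A band exists when the cheaper quarter of entire homes dips below
# the pricier quarter of private rooms.
low = quartiles.loc["Entire home/apt", "q25"]
high = quartiles.loc["Private room", "q75"]
print("overlap band:", (low, high) if low < high else None)
```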

    Chart 3: Price vs review score (does quality command a premium?)

    fig, ax = plt.subplots(figsize=(9, 6))
    sns.scatterplot(
        data=df.sample(min(2000, len(df)), random_state=42),  # subsample for legibility
        x="review_scores_rating", y="price",
        hue="room_type", alpha=0.5, ax=ax,
    )
    ax.set_xlim(2.5, 5.05)
    ax.set_ylim(0, df["price"].quantile(0.95))
    ax.set_xlabel("Average review score (out of 5)")
    ax.set_ylabel("Nightly price (€)")
    ax.set_title("Price vs review score, by room type")
    plt.tight_layout()
    plt.show()
    

    Caption: comment on whether the slope is flat or rising. Does the market pay for quality, or is it priced on location and capacity alone?
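
    A scatterplot alone makes "flat or rising" hard to judge, and it does not directly answer the review-score floor sub-question. One possible check is to bin the scores and compare median prices per bin, alongside a rank correlation. A sketch on made-up data; the bin edges are an arbitrary choice, not a standard:

```python
import pandas as pd

# Made-up ratings and prices; in the notebook use df itself.
demo = pd.DataFrame({
    "review_scores_rating": [3.2, 4.1, 4.4, 4.6, 4.7, 4.8, 4.9, 5.0, None],
    "price": [55, 70, 72, 95, 98, 102, 105, 110, 90],
})

rated = demo.dropna(subset=["review_scores_rating"])
bins = [0, 4.0, 4.5, 4.8, 5.0]  # bin edges are a choice; adjust to your data
rated = rated.assign(score_band=pd.cut(rated["review_scores_rating"], bins=bins))

by_band = rated.groupby("score_band", observed=True)["price"].agg(["count", "median"])
print(by_band)

# Spearman rank correlation, computed as Pearson on ranks (avoids a SciPy dependency).
rho = rated["review_scores_rating"].rank().corr(rated["price"].rank())
print(f"rank correlation: {rho:.2f}")
```

    A sharp drop in the median between two adjacent bands would be evidence for a floor; always read it next to the count column, since thin bands are noisy.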


    6. 🧠 Hypothesis + counter-evidence

    Write your hypothesis in markdown, then deliberately go looking for what would disprove it. Example:

    Hypothesis: The Brera and Duomo neighbourhoods command the highest prices because of central location, but Isola and Porta Romana offer better value (lower price for similar quality and capacity).

    Counter-evidence to check:

    1. Are Isola listings systematically smaller? (Compare accommodates distribution.)
    2. Do Isola listings have worse review scores? (Compare review_scores_rating.)
    3. Is there a sample-size issue? (How many listings does each neighbourhood have?)

    Write the small follow-up cells that check each one. Write what you find honestly: if the counter-evidence wins, that's part of the analysis.

    focus = df[df["neighbourhood"].isin(["Brera", "Duomo", "Isola", "Porta Romana"])]
    focus.groupby("neighbourhood", observed=True).agg(
        listings=("id", "count"),
        median_price=("price", "median"),
        median_accommodates=("accommodates", "median"),
        median_rating=("review_scores_rating", "median"),
    )
    

    (Neighbourhood spellings come from Inside Airbnb's neighbourhood_cleansed. Check the actual values in your data and adjust the list.)
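
    Raw medians mix together different room types and sizes, so a like-for-like comparison is a useful extra check. One simple approach (not the only one) is to divide each price by the median price of listings with the same room type and capacity, then compare neighbourhoods on that ratio. A sketch on made-up data; MIN_LISTINGS is a hypothetical threshold you should choose yourself:

```python
import pandas as pd

# Made-up listings; in the notebook use df itself.
demo = pd.DataFrame({
    "neighbourhood": ["Centre"] * 4 + ["Edge"] * 4,
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room"] * 2,
    "accommodates": [2, 2, 2, 2] * 2,
    "price": [200, 220, 100, 120, 140, 160, 70, 90],
})

# Expected price = median price of all listings with the same room type and capacity.
expected = (
    demo.groupby(["room_type", "accommodates"], observed=True)["price"]
        .transform("median")
)
demo = demo.assign(price_ratio=demo["price"] / expected)

MIN_LISTINGS = 3  # hypothetical threshold; small neighbourhoods give noisy medians
value = (
    demo.groupby("neighbourhood", observed=True)["price_ratio"]
        .agg(listings="count", median_ratio="median")
        .loc[lambda t: t["listings"] >= MIN_LISTINGS]
        .sort_values("median_ratio")
)
print(value)
```

    A median ratio below 1 means listings there tend to cost less than comparable listings citywide, which is the sense of "underpriced" in sub-question 2.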


    7. ✅ Conclusion

    A final markdown cell with the SCQA framing in 4–6 lines:

    Situation: Milan Airbnb supply is concentrated in central neighbourhoods (Duomo, Brera) where median prices run €X.
    Complication: Within walking distance, Isola and Porta Romana offer comparable room types at €Y less (~Z%).
    Question: Where's the value play for a 3-day trip?
    Answer: Isola for entire-home stays with high review scores (>4.8); Porta Romana for private rooms. Avoid Duomo unless the trip is < 2 nights and the convenience premium is worth €Y/night.

    The exact numbers are what you found. The point is the structure.
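
    The X, Y, and Z placeholders can be computed rather than read off a chart. A sketch, assuming a hypothetical helper median_gap and made-up prices; the neighbourhood labels are placeholders for whatever spellings your data uses:

```python
import pandas as pd


def median_gap(frame: pd.DataFrame, central: str, alternative: str) -> dict:
    """Median price in two neighbourhoods plus the absolute and relative gap."""
    medians = frame.groupby("neighbourhood", observed=True)["price"].median()
    x, alt = medians[central], medians[alternative]
    return {"central": x, "alternative": alt, "gap": x - alt, "gap_pct": (x - alt) / x}


# Made-up prices; replace with your cleaned df.
demo = pd.DataFrame({
    "neighbourhood": ["Duomo", "Duomo", "Duomo", "Isola", "Isola", "Isola"],
    "price": [180, 200, 220, 140, 150, 160],
})
gap = median_gap(demo, "Duomo", "Isola")
print(
    f"€{gap['central']:.0f} vs €{gap['alternative']:.0f}: "
    f"€{gap['gap']:.0f} less (~{gap['gap_pct']:.0%})"
)
```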


    8. 📦 Package on GitHub

    README.md template:

    # Milan Airbnb EDA
    
    **Question:** What drives Airbnb listing prices in Milan, and where's the value?
    
    **Headline finding:** [one sentence with the result]
    
    ![Median price by neighbourhood](assets/headline-chart.png)
    
    Full analysis: [notebook.ipynb](./notebook.ipynb)
    
    ## Tech
    - Python 3.10, pandas, seaborn, matplotlib
    - Jupyter notebook with markdown narration
    - One publication-quality chart exported for this README
    
    ## Reproduce
    1. Clone this repo
    2. `pip install -r requirements.txt`
    3. Download `listings.csv.gz` from [Inside Airbnb: Milan](https://insideairbnb.com/get-the-data/) into `data/raw/`
    4. `jupyter lab notebook.ipynb` and Run All
    
    ## Findings
    [Your 5-line SCQA paragraph here]
    
    ## Data source
    [Inside Airbnb: Milan](https://insideairbnb.com/) (CC-BY 4.0).
    

    requirements.txt:

    pandas>=2.1
    seaborn>=0.13
    matplotlib>=3.8
    jupyterlab>=4.0
    

    .gitignore:

    data/raw/*.gz
    .ipynb_checkpoints/
    __pycache__/
    .venv/
    

    ✅ Definition of Done

    • GitHub repo exists with notebook, README, requirements, and assets folder
    • README headline answers the question in one sentence
    • Headline chart embedded in the README via the assets folder
    • Notebook has the 7-step structure (question, load, clean, describe, visualize, hypothesis, conclusion)
    • At least three labelled visualizations
    • The notebook renders cleanly on GitHub (no broken outputs, no exceptions in cells)
    • You can present the analysis in 2 minutes out loud

    🚀 Stretch Goals

    • Geo plot with latitude/longitude: use folium for an interactive map of price by listing, color-coded.
    • Price regression: build a simple linear regression of price on room_type, accommodates, bedrooms, neighbourhood using statsmodels. Identify the most under-/over-priced listings as the largest residuals.
    • Compare two cities: repeat for Rome or Florence and write a short comparison.
    • Ship as a Streamlit app so a non-technical reader can filter and explore interactively.
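
    For the price regression stretch goal, the idea can be prototyped before installing anything new. The sketch below swaps in NumPy's least-squares solver in place of statsmodels, so it gives coefficients and residuals but none of the statistical summary statsmodels would report. The data is made up and the model is deliberately tiny:

```python
import numpy as np
import pandas as pd

# Made-up listings; in the notebook use df itself.
demo = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room"] * 4,
    "accommodates": [2, 2, 4, 2, 4, 1, 6, 2],
    "price": [120, 60, 180, 65, 170, 45, 260, 95],
})

# Design matrix: intercept, numeric capacity, dummy-coded room type.
X = pd.get_dummies(demo[["room_type", "accommodates"]], columns=["room_type"], drop_first=True)
X.insert(0, "intercept", 1.0)
X = X.astype(float)

# Ordinary least squares fit of price on the design matrix.
coef, *_ = np.linalg.lstsq(X.to_numpy(), demo["price"].to_numpy(dtype=float), rcond=None)
demo["predicted"] = X.to_numpy() @ coef
demo["residual"] = demo["price"] - demo["predicted"]

# Most negative residual = cheapest relative to what the model expects.
print(demo.sort_values("residual").head(3))
```

    On the real data you would add bedrooms and neighbourhood to the design matrix in the same way, and treat the largest residuals as candidates to inspect by hand rather than as proven bargains.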

    🎯 Why this is portfolio-grade

    Recruiters look for two things in a Python notebook: careful cleaning and a real recommendation. This project enforces both. Inside Airbnb is a standard dataset that hiring managers recognize, which makes the analysis legible at a glance. The notebook structure is replicable for any future EDA: you build a personal template here that you'll use for years.
