A machine learning pipeline designed to predict Airbnb listing prices based on features like location, room type, and reviews. This project utilizes Scikit-Learn for modeling and MLflow for end-to-end machine learning lifecycle management, including experiment tracking, model logging, and performance comparison.
The goal of this project is to build a regression model that accurately estimates the price of an Airbnb rental. The workflow includes:
- Data Ingestion: Loading data from AWS S3 (or local source).
- Exploratory Data Analysis (EDA): Visualizing price distributions and correlations.
- Preprocessing:
  - Handling missing values.
  - Capping outliers (99th percentile).
  - Log-transforming skewed features (`number_of_reviews`, `minimum_nights`).
  - One-Hot Encoding categorical variables (`neighbourhood`, `room_type`).
- Modeling: Training Linear Regression, Random Forest, and Gradient Boosting models.
- MLflow Tracking: Logging hyperparameters, evaluation metrics (RMSE, MAE, R2), residual plots, and model artifacts.
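The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, the choice to cap `price`, the median fill for missing values, and the use of `log1p` (so that zero reviews stay defined) are all assumptions, and the small data frame holds made-up toy values.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame, target: str = "price") -> pd.DataFrame:
    """Hypothetical helper mirroring the preprocessing bullets above."""
    # Missing values (assumption): drop rows without a target,
    # then fill remaining numeric gaps with the column median.
    out = df.dropna(subset=[target]).copy()
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # Cap outliers at the 99th percentile (assumed to apply to the target).
    cap = out[target].quantile(0.99)
    out[target] = out[target].clip(upper=cap)

    # Log-transform the skewed count features; log1p keeps zeros valid.
    for col in ["number_of_reviews", "minimum_nights"]:
        out[col] = np.log1p(out[col])

    # One-hot encode the categorical variables.
    return pd.get_dummies(out, columns=["neighbourhood", "room_type"])


# Toy rows for illustration only (not taken from the real dataset).
toy = pd.DataFrame(
    {
        "price": [80.0, 120.0, 95.0, 5000.0, None],
        "number_of_reviews": [0, 12, 150, 3, 7],
        "minimum_nights": [1, 2, 30, None, 3],
        "neighbourhood": ["north", "south", "north", "south", "north"],
        "room_type": [
            "Private room",
            "Entire home/apt",
            "Private room",
            "Entire home/apt",
            "Shared room",
        ],
    }
)
clean = preprocess(toy)
```

On a frame this small the 99th percentile cap barely moves the extreme value; on the full dataset it trims only the top 1% of prices.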
- Language: Python 3.10+
- Data Manipulation: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-Learn
- Experiment Tracking: MLflow
- Cloud Storage: AWS S3 (via Boto3)
This project uses MLflow to track every run. Below are insights from the experiment logs.
Overview of the different runs (Linear Regression, Random Forest, Gradient Boosting) with their respective metrics.

Detailed view of the logged parameters and custom artifacts (such as Actual vs. Predicted plots and Feature Importance) saved for every run.

Programmatic selection of the best model based on the lowest RMSE.

git clone https://github.com/Dhruvrana8/airbnb-price-prediction-mlflow
cd airbnb-price-prediction-mlflow
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
Ensure mlflow, scikit-learn, pandas, boto3, and python-dotenv are listed in requirements.txt.
Create a .env file in the root directory if you are loading data from S3:
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY
region=us-east-1
bucket_name=your-bucket-name
object_key=your-data-file.csv
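
A hedged sketch of how these settings might be consumed is shown here. The helper names `read_s3_settings` and `fetch_listings` are hypothetical and not taken from the notebook; the key names match the `.env` entries above, and the S3 call is the standard `get_object` from Boto3.

```python
import io
import os

import pandas as pd

REQUIRED_KEYS = [
    "aws_access_key_id",
    "aws_secret_access_key",
    "region",
    "bucket_name",
    "object_key",
]


def read_s3_settings(env=os.environ) -> dict:
    """Collect the .env keys above from a mapping and fail early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


def fetch_listings(settings: dict) -> pd.DataFrame:
    """Download the CSV object from S3 and parse it into a DataFrame."""
    import boto3  # imported here so the settings helper works without the AWS SDK

    s3 = boto3.client(
        "s3",
        aws_access_key_id=settings["aws_access_key_id"],
        aws_secret_access_key=settings["aws_secret_access_key"],
        region_name=settings["region"],
    )
    obj = s3.get_object(Bucket=settings["bucket_name"], Key=settings["object_key"])
    return pd.read_csv(io.BytesIO(obj["Body"].read()))
```

In practice one would call `load_dotenv()` from python-dotenv first so the `.env` values are available in `os.environ`, then run `fetch_listings(read_s3_settings())`.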
- Run the Notebook: Open `index.ipynb` in Jupyter or VS Code and execute the cells to preprocess data and train models.
- Launch MLflow UI: To view the dashboard shown in the screenshots above, run the following command in your terminal:
  `mlflow ui`
- Access Dashboard: Open your browser and navigate to:
  http://127.0.0.1:5000
Based on the validation set, the models performed as follows:
| Model | RMSE | MAE | R2 Score |
|---|---|---|---|
| Random Forest | 78.82 | 46.63 | 0.475 |
| Gradient Boosting | 80.04 | 47.79 | 0.459 |
| Linear Regression | 84.44 | 51.33 | 0.397 |
Conclusion: The Random Forest model had the lowest RMSE and MAE and the highest R2 of the three, which suggests it captures non-linear relationships in the listing data more effectively than the other two models.
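
For reference, a comparison of this kind can be computed along the following lines. This is only a sketch on synthetic data (generated below, unrelated to the Airbnb dataset), with assumed hyperparameters and an assumed 80/20 split, so its numbers will not match the table above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data with a non-linear term (illustration only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(400, 3))
y = 100 + 80 * X[:, 0] + 40 * np.sin(6 * X[:, 1]) + rng.normal(0, 5, size=400)

# Assumed 80/20 train/validation split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}


def evaluate(model, X_val, y_val) -> dict:
    """Return RMSE, MAE, and R2 for a fitted model on the validation set."""
    pred = model.predict(X_val)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_val, pred))),
        "mae": float(mean_absolute_error(y_val, pred)),
        "r2": float(r2_score(y_val, pred)),
    }


results = {
    name: evaluate(model.fit(X_train, y_train), X_val, y_val)
    for name, model in models.items()
}

for name, metrics in results.items():
    print(f"{name}: {metrics}")
```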
├── assets
│   └── images
│       ├── AirBnb Price Prediction Model.png
│       ├── Ml Flow Main Screen.png
│       └── Random Forest Model.png
├── mlruns/ # MLflow local tracking logs
├── notebook
│   └── index.ipynb # Main Python Code
├── requirements.txt # Python dependencies
├── README.md # Project documentation
└── .env # AWS Credentials (Not committed)