This project trains a machine learning model to predict taxi fares in New York City using a dataset from Kaggle. It is a learning project; the competition has already ended.
The dataset is sourced from the Kaggle competition: New York City Taxi Fare Prediction. It contains:
- Training and test data with trip location and fare information.
- Date and time of taxi trips.
The project covers the following work:
- Sampled 10% of the training data to reduce runtime.
- Addressed missing values and outliers.
- Engineered features like trip distance, pickup/dropoff landmarks, and datetime components.
- Identified data distributions, ranges, and outliers.
- Observed that some latitude and longitude values in the dataset were erroneous.
- Implemented baseline models (e.g., Mean Regressor).
- Experimented with multiple algorithms:
  - Linear Regression
  - Ridge Regression
  - Lasso
  - Random Forest
  - Elastic Net
- Compared model performances based on RMSE and selected the best-performing model for final tuning.
- Utilized grid search and manual tuning to optimize model parameters.
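The cleaning and feature-engineering items in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names follow the Kaggle competition files, while the bounding box, the choice of JFK airport as a landmark, the helper names `haversine_km` and `clean_and_engineer`, and the three-row sample frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Rough bounding box around New York City (assumed values for this sketch).
LON_MIN, LON_MAX = -75.0, -72.0
LAT_MIN, LAT_MAX = 40.0, 42.0
# Approximate JFK airport coordinates, used as one example landmark.
JFK_LON, JFK_LAT = -73.7781, 40.6413


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


def clean_and_engineer(df):
    """Drop missing or out-of-range rows, then add distance and datetime features."""
    df = df.dropna().copy()
    in_box = (
        df["pickup_longitude"].between(LON_MIN, LON_MAX)
        & df["dropoff_longitude"].between(LON_MIN, LON_MAX)
        & df["pickup_latitude"].between(LAT_MIN, LAT_MAX)
        & df["dropoff_latitude"].between(LAT_MIN, LAT_MAX)
    )
    df = df[in_box].copy()

    # Trip distance and distance from the dropoff point to the landmark.
    df["trip_distance_km"] = haversine_km(
        df["pickup_longitude"], df["pickup_latitude"],
        df["dropoff_longitude"], df["dropoff_latitude"],
    )
    df["dropoff_to_jfk_km"] = haversine_km(
        df["dropoff_longitude"], df["dropoff_latitude"], JFK_LON, JFK_LAT
    )

    # Datetime components parsed from strings like "2009-06-15 17:26:21 UTC".
    ts = pd.to_datetime(df["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC")
    df["pickup_year"] = ts.dt.year
    df["pickup_hour"] = ts.dt.hour
    df["pickup_weekday"] = ts.dt.dayofweek
    return df


# Tiny made-up sample: one valid trip, one with a bad latitude, one with a missing fare.
sample = pd.DataFrame({
    "fare_amount": [45.0, 7.5, np.nan],
    "pickup_datetime": ["2009-06-15 17:26:21 UTC"] * 3,
    "pickup_longitude": [-73.9857, -73.99, -73.98],
    "pickup_latitude": [40.7484, 0.0, 40.75],
    "dropoff_longitude": [-73.7781, -73.97, -73.97],
    "dropoff_latitude": [40.6413, 40.76, 40.76],
})
features = clean_and_engineer(sample)
```

On the real training file, the 10% sample mentioned above could be drawn first with `df.sample(frac=0.1, random_state=42)` before calling the helper; the seed is arbitrary.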
Ensure Python and pip are installed. Install the required libraries using:
pip install pandas numpy scikit-learn xgboost matplotlib opendatasets
- Download and load the data: Use the opendatasets library to download data directly from Kaggle.
- Run preprocessing and feature engineering scripts: Prepare the data by cleaning and creating new features necessary for the models.
- Model training and evaluation: Train various models and evaluate them to select the best one.
- Hyperparameter tuning: Fine-tune the chosen model to improve accuracy.
- Predict and generate submission file: Predict taxi fares for the test dataset and generate a submission file for Kaggle.
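The training, evaluation, tuning, and submission steps above might look something like the sketch below. It runs on synthetic data so that it is self-contained; the feature columns, the parameter grid, the row keys, and the use of the validation split as a stand-in for the Kaggle test set are all assumptions for illustration, and the project's real features and tuned model may differ. The submission layout (`key`, `fare_amount`) follows the competition's sample submission.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered features (assumed, not the real data).
rng = np.random.default_rng(42)
distance_km = rng.uniform(0.5, 20.0, size=400)
hour = rng.integers(0, 24, size=400)
X = np.column_stack([distance_km, hour])
y = 2.5 + 1.8 * distance_km + rng.normal(0.0, 1.0, size=400)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def rmse(model):
    """Fit on the training split and return RMSE on the validation split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))


# Mean-regressor baseline next to a few of the candidate algorithms.
scores = {
    "mean_baseline": rmse(DummyRegressor(strategy="mean")),
    "linear": rmse(LinearRegression()),
    "ridge": rmse(Ridge(alpha=1.0)),
    "random_forest": rmse(RandomForestRegressor(n_estimators=50, random_state=42)),
}
best_name = min(scores, key=scores.get)

# Grid search over a small, arbitrary grid for one candidate model.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [4, 8]},
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
tuned_rmse = float(np.sqrt(mean_squared_error(y_val, search.predict(X_val))))

# Submission frame; X_val and the keys below are placeholders for the test set.
test_keys = [f"row_{i}" for i in range(len(X_val))]
submission = pd.DataFrame(
    {"key": test_keys, "fare_amount": search.predict(X_val)}
)
# Passing a path such as "submission.csv" would write the file to disk instead.
csv_text = submission.to_csv(index=False)
```

Comparing every score against `mean_baseline` is the point of the baseline: a model that cannot beat the constant mean prediction is not worth tuning.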