This repository showcases an end-to-end TensorFlow pipeline for training an ML model that predicts taxi trip duration. The "Chicago Taxi Trips" Kaggle dataset is used for training and evaluation.
- Getting started
- Executing the pipeline
- Evaluation and model analysis
- Making predictions
- Contributing
To get your environment set up and start using this model, please follow the step-by-step instructions provided below.
First, you'll need to clone this repository to your local machine or development environment. Open your terminal, navigate to the directory where you want to clone the repository, and run the following command:
```
git clone <repository-url>
```
Replace <repository-url> with the actual URL of this repository. Once cloned, navigate into the repository's directory with cd <repository-name>.
Within the root directory of the cloned repository, create a .env file to store your project configurations. This file should include the following environment variables tailored to your project:
```
PROJECT_ID=<project_id>
REGION=<region>  # example: us-central1
```
Make sure to replace the placeholders with your specific project details.
To interact with Google Cloud resources, you need to install the Google Cloud Command Line Interface (CLI) on your system. Follow the detailed installation instructions provided in the official documentation here.
To authenticate to Google Cloud services from your development environment, configure Application Default Credentials (ADC) by following the guide here.
To create a JSON file that stores the service account credentials, follow the section "Create a service account key" in the guide here.
- Install Python 3.9.5 with
```
pyenv install 3.9.5
```
- Install Python dependencies with
```
poetry install
```
Executing the pipeline ingests and transforms the data, performs hyperparameter tuning and model training, evaluates the model and registers it to the model registry, creates a Vertex endpoint, and deploys the model to make it available for predictions. To run the pipeline on Vertex AI, use the following command:
```
python chicago_taxis/kubeflow_v2_runner.py
```
When the pipeline executes, the MAPE (mean absolute percentage error) on the test set is logged to a file in GCP. If the MAPE beats the current best (that of the blessed model), the trained model is registered in GCP along with its MAPE.
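The metric and the registration rule described above can be sketched as follows; the function names are illustrative, not the repository's actual implementation:

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (assumes no zero targets)."""
    errors = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)

def should_register(candidate_mape, blessed_mape):
    """Register the candidate only if it beats the current blessed model,
    or if no model has been blessed yet."""
    return blessed_mape is None or candidate_mape < blessed_mape
```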
Predictions are made with the predict.py script by running:
```
python chicago_taxis/predict.py \
    --endpoint_id <endpoint_id> \
    --input_file chicago_taxis/data/prediction_sample.json
```
Our project embraces a streamlined workflow that ensures high-quality software development and efficient collaboration among team members. To maintain this standard, we follow a specific branching strategy and commit convention outlined in our CONTRIBUTING.md file.
We highly encourage all contributors to familiarize themselves with these guidelines. Adhering to the outlined practices helps us keep our codebase organized, facilitates easier code reviews, and accelerates the development process. For detailed information on our branching strategy and how we commit changes, please refer to the CONTRIBUTING.md file.