This repository showcases a custom-trained machine learning model for predicting how much a user will spend on a given product. The "Black Friday" Kaggle dataset was used for training and evaluation.
- Getting started
- Submitting training jobs
- Model deployment
- Making predictions
- Running preprocessing independently
- Contributing
To get your environment set up and start using this model, please follow the step-by-step instructions provided below.
First, you'll need to clone this repository to your local machine or development environment. Open your terminal, navigate to the directory where you want to clone the repository, and run the following command:
git clone <repository-url>
Replace <repository-url> with the actual URL of this repository. Once cloned, navigate into the repository's directory with cd <repository-name>.
Within the root directory of the cloned repository, create a .env file to store your project configurations. This file should include the following environment variables tailored to your project:
PROJECT_ID=<project_id>
REGION=<region> # example: us-central1
BUCKET_NAME=<bucket_name>
FILE_NAME=<file_name> # file inside your bucket, example: data/your_file.csv
Make sure to replace the placeholder values with your specific project details.
To interact with Google Cloud resources, you need to install the Google Cloud Command Line Interface (CLI) on your system. Follow the detailed installation instructions provided in the official documentation here.
To authenticate to Google Cloud services from your development environment, configure Application Default Credentials (ADC) by following the guide here.
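If you want to confirm that ADC is picked up correctly before moving on, a quick check from Python (assuming the `google-auth` package is installed, as it ships with the Google Cloud client libraries used here) is:

```python
# Sanity check that Application Default Credentials are available.
# Assumes the google-auth package is installed.
import google.auth

credentials, project_id = google.auth.default()
print(f"ADC found for project: {project_id}")
```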
Modify the `config.json` file to match your dataset and your preprocessing and training needs.
Preprocessing Parameters
These settings define how the input data will be processed before being used for training or analysis.
- `categorical_columns`: A list of column names in the dataset that should be treated as categorical variables. These columns will be transformed using encoders to convert categorical values into a format that can be better utilized by machine learning algorithms.
- `target_column`: Specifies the column name that will be used as the target variable for predictions. This column is what the model will try to predict, and it will be separated from the feature set during preprocessing.
- `columns_to_drop`: A list of column names that should be removed from the dataset before processing. These might be columns that are irrelevant to the model, contain sensitive information, or could lead to data leakage.
- `split_ratios`: Defines the proportions in which the data will be split into training, validation, and test sets. The values represent the fraction of data used for training, validation, and testing, respectively. For example, `[0.8, 0.1, 0.1]` means 80% of the data is used for training and 10% each for validation and testing.
Regression Model Parameters
These settings define the configuration of the regression model used for predicting the target variable.
- `n_estimators`: The number of trees in the model's ensemble. Increasing the number of estimators can improve the model's accuracy but also increases the computational load.
- `max_depth`: The maximum depth of each tree. Deeper trees can learn more detailed data patterns but can lead to overfitting if not controlled. Adjusting this parameter helps in managing the trade-off between bias and variance.
- `learning_rate`: Scales the contribution of each tree by a factor of `learning_rate`. There is a trade-off between learning rate and number of trees: smaller values make the model less sensitive to the structure of any individual tree and thus help it generalize well, but require more trees to be effective.
- `verbosity`: The degree of verbosity in the output logs generated during model training. A value of 0 suppresses logs; higher values provide more detailed logs.
- `objective`: Defines the loss function to be minimized. For regression tasks, a typical value is `reg:squarederror` for mean squared error (`reg:linear` is its deprecated alias).
- `booster`: The type of model to run at each iteration. Common options are `gbtree` (tree-based models), `gblinear` (linear models), and `dart` (Dropouts meet Multiple Additive Regression Trees).
- `tree_method`: The algorithm used to build trees. Options include `auto`, `exact`, `approx`, `hist`, and others; `auto` lets XGBoost choose the most appropriate algorithm based on the dataset characteristics.
The parameters n_estimators, max_depth, and learning_rate are crucial for model performance and will be optimized together through hyperparameter tuning during the training job to achieve the best results.
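To illustrate how these settings map onto the underlying model, the sketch below feeds them into an XGBoost regressor. It reads `config.json` with flat keys, which is an assumption for illustration only; use the actual schema defined in this repository's `config.json`.

```python
# Illustrative only: shows how the documented parameters configure an
# XGBoost regressor. The flat config keys below are an assumption; follow
# the actual schema of this repository's config.json.
import json

from xgboost import XGBRegressor

with open("config.json") as f:
    config = json.load(f)

model = XGBRegressor(
    n_estimators=config.get("n_estimators", 100),
    max_depth=config.get("max_depth", 6),
    learning_rate=config.get("learning_rate", 0.1),
    verbosity=config.get("verbosity", 1),
    objective=config.get("objective", "reg:squarederror"),
    booster=config.get("booster", "gbtree"),
    tree_method=config.get("tree_method", "auto"),
)
```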
In your command line, run the following commands:
source .env
Create a repo in Artifact Registry
REPO_NAME='your_repo_name'
gcloud artifacts repositories create $REPO_NAME --repository-format=docker \
--location=$REGION --description='your_description'
Define a variable with the URI of your container image in Google Artifact Registry
TRAIN_IMAGE_NAME='your_image_name'
TRAIN_IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${TRAIN_IMAGE_NAME}:latest"
Configure Docker
gcloud auth configure-docker \
$REGION-docker.pkg.dev
Build and push the image using the `cloudbuild-train.yaml` file by running this from the root of the project:
gcloud builds submit --config cloudbuild-train.yaml --substitutions=_REGION=$REGION,_PROJECT_ID=$PROJECT_ID,_REPO_NAME=$REPO_NAME,_TRAIN_IMAGE_NAME=$TRAIN_IMAGE_NAME,_BUCKET_NAME=$BUCKET_NAME,_FILE_NAME=$FILE_NAME .
Submitting a training job will train the model, save it to GCS, and evaluate it against the train, validation, and test sets. The evaluation metrics are also saved in GCS.
python black_friday_profits/submit_training_job.py \
--display_name black_friday --train_image_uri $TRAIN_IMAGE_URI --bucket_name $BUCKET_NAME
If you wish, you can add more arguments to modify the training job. Run the following to see which are available:
python black_friday_profits/submit_training_job.py --help
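For orientation, submitting a custom-container training job through the Vertex AI SDK looks roughly like the sketch below. This is not the repository's `submit_training_job.py` (which remains the supported entry point); the project, region, image URI, and machine type are placeholders.

```python
# Rough sketch of a custom-container training job submission with the
# Vertex AI SDK. Values are placeholders, not the script's actual arguments.
from google.cloud import aiplatform

aiplatform.init(
    project="<project_id>",
    location="<region>",
    staging_bucket="gs://<bucket_name>",
)

job = aiplatform.CustomContainerTrainingJob(
    display_name="black_friday",
    container_uri="<train_image_uri>",  # e.g. the $TRAIN_IMAGE_URI built above
)

# Runs the training container on a managed node; machine type is an example.
job.run(
    machine_type="n1-standard-4",
    replica_count=1,
)
```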
To perform model deployment and use the model to generate online or batch predictions, we first need to build the Docker image used in the deployment process. For this purpose, we created the `DockerfileApp` file to build a custom container with all the necessary dependencies to use the BlackFridayRegressor model. It defines a multi-stage build composed of two stages:
- STAGE 1: Install Poetry and build the package based on the pyproject.toml.
- STAGE 2: Install the package built in the previous step and run the application.
Local Deployment
To run the application locally and test it, we first need to create a new service account and grant permissions to establish the connection between the application and the Google APIs. To do that, we can follow this Google documentation.
- Create the service account:
  gcloud iam service-accounts create <NAME>
- Add the permissions policy for the service account:
  gcloud projects add-iam-policy-binding <PROJECT_ID> --member="serviceAccount:<SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" --role=<ROLE>
- Create the service account JSON file:
  gcloud iam service-accounts keys create black-friday-account-keys.json --iam-account=<SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com
- Uncomment the code snippet in the `DockerfileApp` file tagged as "FOR LOCAL TESTING".
- Build the image:
  docker build -f DockerfileApp -t ${DEPLOY_IMAGE_URI} ./
- Stop any existing running application:
  docker stop <CONTAINER_NAME>
- Delete any existing built container:
  docker rm <CONTAINER_NAME>
- Run the container:
  docker run -d -p 8080:8080 \
    --name=<CONTAINER_NAME> \
    -e AIP_HTTP_PORT=8080 \
    -e AIP_HEALTH_ROUTE=/health \
    -e AIP_PREDICT_ROUTE=/predict \
    -e AIP_STORAGE_URI=gs://$BUCKET_URI/$MODEL_ARTIFACT_DIR \
    $DEPLOY_IMAGE_URI
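Once the container is up, you can smoke-test it locally. The request shape below follows the common Vertex AI custom-container convention of an `instances` list, and the feature names are illustrative Black Friday columns; adjust both to whatever the serving application actually expects.

```python
# Local smoke test against the container started above (port 8080).
# The payload format and feature names are assumptions; adapt them to the
# schema expected by the deployed application.
import requests

health = requests.get("http://localhost:8080/health", timeout=10)
print("health:", health.status_code)

payload = {
    "instances": [
        {
            "Gender": "M",
            "Age": "26-35",
            "Occupation": 7,
            "City_Category": "B",
            "Stay_In_Current_City_Years": "2",
            "Marital_Status": 0,
            "Product_Category_1": 1,
        }
    ]
}
response = requests.post("http://localhost:8080/predict", json=payload, timeout=30)
print("prediction:", response.json())
```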
GCP Deployment
To run the application in GCP, we can use the DockerfileApp without modifications and run the following lines to build and push the image to GCP Artifact Registry:
- Build and push the image using the `cloudbuild-deploy.yaml` file by running this from the root of the project:
gcloud builds submit --config cloudbuild-deploy.yaml --substitutions=_REGION=$REGION,_PROJECT_ID=$PROJECT_ID,_REPO_NAME=$REPO_NAME,_DEPLOY_IMAGE_NAME=$DEPLOY_IMAGE_NAME .
Now that the deployment image is built, the model deployment process uses the `deploy_model.py` script, which is responsible for registering the model in the Vertex AI Model Registry, deploying it to a Vertex AI Endpoint with dedicated resources, and creating a model monitoring job.
python black_friday_profits/deployment/deploy_model.py \
--artifact_uri model_uri_in_gcs \
--machine_type "machine-type" \
--endpoint_name "your-endpoint-name" \
--model_name "your-model-name" \
--deploy_image_uri $DEPLOY_IMAGE_URI \
--monitoring_email "[email protected]" \
--train_data_gcs_uri "your_train_data_gcs_uri" \
--log_sampling_rate sampling_rate \
--monitor_interval interval_hours \
--target_column "your_target_column"
Make sure to replace the parameter values with your own information. This command will create a new endpoint and deploy your model to it, making it ready to serve online predictions.
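For orientation, the core Vertex AI SDK calls involved in this kind of deployment look roughly like the sketch below. It is a simplified outline, not the contents of `deploy_model.py` (which additionally configures model monitoring); all values are placeholders.

```python
# Simplified sketch of registering and deploying a model with the Vertex AI
# SDK. deploy_model.py remains the supported entry point; values are examples.
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

model = aiplatform.Model.upload(
    display_name="your-model-name",
    artifact_uri="gs://<bucket_name>/<model_artifact_dir>",
    serving_container_image_uri="<deploy_image_uri>",
)

endpoint = aiplatform.Endpoint.create(display_name="your-endpoint-name")

# Attach the model to the endpoint with dedicated resources.
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)
```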
The prediction process is facilitated through the predict.py script, which supports both online and batch prediction modes for models deployed on Google Cloud's Vertex AI.
python black_friday_profits/predictions/predict.py --online \
--endpoint_id "your-endpoint-id" \
--input_file "sample_data/your_input_file.json"
This command performs real-time, online predictions by sending input data to a deployed model's endpoint. It's ideal for applications requiring immediate inference.
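For reference, the same online call made directly with the Vertex AI SDK boils down to roughly the sketch below; the endpoint identifier and instance fields are illustrative and should match the features your model was trained on. The script's batch mode is shown next.

```python
# Minimal online-prediction call with the Vertex AI SDK; the instance keys
# are illustrative Black Friday features, not a guaranteed schema.
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

endpoint = aiplatform.Endpoint("your-endpoint-id")
prediction = endpoint.predict(
    instances=[{"Gender": "F", "Age": "18-25", "Occupation": 4, "City_Category": "A"}]
)
print(prediction.predictions)
```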
python black_friday_profits/predictions/predict.py --batch \
--model_id "your-model-id" \
--gcs_batch_pred_source "your-test-data-uri" \
--gcs_destination "your-results-destination-uri"
This command initiates a batch prediction job, allowing for the processing of large volumes of data. The results are stored in a specified Google Cloud Storage location, making it suitable for asynchronous prediction tasks on bulk data.
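The equivalent batch flow with the Vertex AI SDK, shown for orientation only (the `predict.py` script is the supported path), is roughly:

```python
# Rough sketch of a Vertex AI batch prediction job; all values are examples.
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

model = aiplatform.Model("your-model-id")
batch_job = model.batch_predict(
    job_display_name="black-friday-batch",
    gcs_source="gs://<bucket_name>/<test_data>.jsonl",
    gcs_destination_prefix="gs://<bucket_name>/<results_dir>",
    machine_type="n1-standard-4",
)
batch_job.wait()  # block until the job finishes, then inspect its state
print(batch_job.state)
```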
If you need to run the data preprocessing script independently, perhaps for testing or separate analysis purposes, follow these steps:
- Prepare your data:
  Ensure your data is accessible as a CSV file in a Cloud Storage bucket.
- Configure the script:
  Review and update the `config.json` file to match the preprocessing requirements specific to your data. This includes specifying the target column, features to drop, and categorical columns for encoding.
- Execute the preprocessing script:
  Run the script from the command line, providing the necessary parameters:
  python black_friday_profits/trainer/preprocessing.py \
    --bucket_name your_bucket_name \
    --file_name your_data_file.csv
  This script will output the processed data, ready for further use in training or analysis, in a folder in the specified bucket.
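Conceptually, the preprocessing stage applies the settings described earlier in `config.json`: drop the configured columns, encode the categorical ones, and split by the configured ratios. A minimal local sketch of that logic is shown below; the column names and ratios are assumptions for illustration, and the repository's `preprocessing.py` (which also reads from and writes to GCS) is the authoritative implementation.

```python
# Minimal, local sketch of the preprocessing steps described above:
# drop columns, encode categoricals, split by the configured ratios.
# Column names and ratios are illustrative; preprocessing.py is authoritative.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("your_data_file.csv")

columns_to_drop = ["User_ID", "Product_ID"]                # assumption
categorical_columns = ["Gender", "Age", "City_Category"]   # assumption
target_column = "Purchase"                                 # Black Friday target
split_ratios = [0.8, 0.1, 0.1]

df = df.drop(columns=columns_to_drop)
for col in categorical_columns:
    df[col] = LabelEncoder().fit_transform(df[col])

features = df.drop(columns=[target_column])
target = df[target_column]

# Carve out the training set first, then split the remainder between
# validation and test according to split_ratios.
x_train, x_rest, y_train, y_rest = train_test_split(
    features, target, train_size=split_ratios[0], random_state=42
)
val_fraction = split_ratios[1] / (split_ratios[1] + split_ratios[2])
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, train_size=val_fraction, random_state=42
)
```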
Our project embraces a streamlined workflow that ensures high-quality software development and efficient collaboration among team members. To maintain this standard, we follow a specific branching strategy and commit convention outlined in our CONTRIBUTING.md file.
We highly encourage all contributors to familiarize themselves with these guidelines. Adhering to the outlined practices helps us keep our codebase organized, facilitates easier code reviews, and accelerates the development process. For detailed information on our branching strategy and how we commit changes, please refer to the CONTRIBUTING.md file.