
Google Partners Capabilities Assessment - Demo 2

This repository showcases a custom-trained machine learning model for predicting how much a user will spend on a given product. The "Black Friday" Kaggle dataset was used for training and evaluation.

Table of contents

  • Getting started
  • Submitting training jobs
  • Model Deployment
  • Making predictions
  • Running preprocessing independently
  • Contributing

Getting started

To get your environment set up and start using this model, please follow the step-by-step instructions provided below.

Step 1: Clone the Repository

First, you'll need to clone this repository to your local machine or development environment. Open your terminal, navigate to the directory where you want to clone the repository, and run the following command:

git clone <repository-url>
Replace <repository-url> with the actual URL of this repository. Once cloned, navigate into the repository's directory with cd <repository-name>.

Step 2: Create a .env File

Within the root directory of the cloned repository, create a .env file to store your project configurations. This file should include the following environment variables tailored to your project:

PROJECT_ID=<project_id>
REGION=<region> # example: us-central1
BUCKET_NAME=<bucket_name>
FILE_NAME=<file_name> # file inside your bucket, example: data/your_file.csv

Make sure to replace the placeholder values with your specific project details.

Step 3: Install the Google Cloud CLI

To interact with Google Cloud resources, you need to install the Google Cloud Command Line Interface (CLI) on your system. Follow the installation instructions in the official Google Cloud documentation.

Step 4: Set Up Application Default Credentials

To authenticate to Google Cloud services from your development environment, configure Application Default Credentials (ADC) by following the official guide. For local development, this typically means running gcloud auth application-default login.

Step 5: Set Up Your Configuration

Modify the config.json file to match your dataset and your preprocessing and training needs.

Preprocessing Parameters

These settings define how the input data will be processed before being used for training or analysis.

  • categorical_columns: A list of column names in the dataset that should be treated as categorical variables. These columns will be transformed using encoders to convert categorical values into a format that can be better utilized by machine learning algorithms.

  • target_column: Specifies the column name that will be used as the target variable for predictions. This column is what the model will try to predict, and it will be separated from the feature set during preprocessing.

  • columns_to_drop: A list of column names that should be removed from the dataset before processing. These might be columns that are irrelevant to the model, contain sensitive information, or could lead to data leakage.

  • split_ratios: Defines the proportions in which the data will be split into training, validation, and test sets. The values represent the fraction of data used for training, validation, and testing, respectively. For example, [0.8, 0.1, 0.1] means 80% of the data is used for training, and 10% each for validation and testing.

Regression Model Parameters

These settings define the configuration of the regression model used for predicting the target variable.

  • n_estimators: The number of gradient-boosted trees (boosting rounds) in the model. Increasing the number of estimators can improve the model's accuracy but also increases the computational load.

  • max_depth: The maximum depth of each tree. Deeper trees can learn more detailed data patterns but can lead to overfitting if not controlled. Adjusting this parameter helps in managing the trade-off between bias and variance.

  • learning_rate: Scales the contribution of each tree by a factor of learning_rate. There is a trade-off between the learning rate and the number of trees: smaller values make the model less sensitive to any individual tree and tend to generalize better, but require more trees to be effective.

  • verbosity: The degree of verbosity in the output logs generated during model training. A value of 0 suppresses logs; higher values provide more detailed logs.

  • objective: Defines the loss function to be minimized. For regression tasks, the typical value is reg:squarederror (squared error loss); reg:linear is an older, deprecated alias for the same objective.

  • booster: Type of model to run at each iteration. Common options are gbtree (tree-based models), gblinear (linear models), or dart (Dropouts meet Multiple Additive Regression Trees).

  • tree_method: The type of algorithm used to build trees. Options include auto, exact, approx, hist, and others. auto lets XGBoost choose the most appropriate algorithm based on the dataset characteristics.

The parameters n_estimators, max_depth, and learning_rate are crucial for model performance and will be optimized together through hyperparameter tuning during the training job to achieve the best results.
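
For orientation, the snippet below sketches what such a configuration could look like, written as a small Python script that dumps it to config.json. The key layout, column names, and parameter values are illustrative assumptions based on the Black Friday dataset, not the repository's actual config.json contents.

    import json

    # Hypothetical configuration illustrating the keys described above;
    # the real config.json in this repository may group or name them differently.
    config = {
        "categorical_columns": ["Gender", "Age", "City_Category"],
        "target_column": "Purchase",
        "columns_to_drop": ["User_ID", "Product_ID"],
        "split_ratios": [0.8, 0.1, 0.1],
        "n_estimators": 100,
        "max_depth": 6,
        "learning_rate": 0.1,
        "verbosity": 1,
        "objective": "reg:squarederror",
        "booster": "gbtree",
        "tree_method": "auto",
    }

    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)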

Submitting training jobs

In your command line, run the following commands:

Step 1. Source environment variables

source .env

Step 2. Build the training container

Create a repo in Artifact Registry

REPO_NAME='your_repo_name'

gcloud artifacts repositories create $REPO_NAME --repository-format=docker \
--location=$REGION --description='your_description'

Define a variable with the URI of your container image in Google Artifact Registry

TRAIN_IMAGE_NAME='your_image_name'

TRAIN_IMAGE_URI="${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO_NAME}/${TRAIN_IMAGE_NAME}:latest"

Configure Docker authentication for Artifact Registry

gcloud auth configure-docker \
 $REGION-docker.pkg.dev

Build and push the image using the cloudbuild-train.yaml file by running the following from the root of the project:

gcloud builds submit --config cloudbuild-train.yaml --substitutions=_REGION=$REGION,_PROJECT_ID=$PROJECT_ID,_REPO_NAME=$REPO_NAME,_TRAIN_IMAGE_NAME=$TRAIN_IMAGE_NAME,_BUCKET_NAME=$BUCKET_NAME,_FILE_NAME=$FILE_NAME .

Step 3. Submit a training job

Submitting a training job trains the model, saves it to GCS, and evaluates it against the train, validation, and test sets. The evaluation metrics are saved to GCS as well.

python black_friday_profits/submit_training_job.py \
 --display_name black_friday --train_image_uri $TRAIN_IMAGE_URI --bucket_name $BUCKET_NAME

If you wish, you can add more arguments to modify the training job. Run the following to see the available options:

python black_friday_profits/submit_training_job.py --help
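
Under the hood, submitting a custom training job on Vertex AI roughly corresponds to the following google-cloud-aiplatform calls. This is a simplified, hypothetical sketch: the actual submit_training_job.py may structure the call differently (for example, to add hyperparameter tuning), and TRAIN_IMAGE_URI is assumed to be exported as an environment variable.

    import os
    from google.cloud import aiplatform

    # Assumes the variables from .env (and TRAIN_IMAGE_URI) are exported in the shell.
    aiplatform.init(
        project=os.environ["PROJECT_ID"],
        location=os.environ["REGION"],
        staging_bucket=f"gs://{os.environ['BUCKET_NAME']}",
    )

    # Custom training job running the container built in Step 2.
    job = aiplatform.CustomContainerTrainingJob(
        display_name="black_friday",
        container_uri=os.environ["TRAIN_IMAGE_URI"],
    )

    # Launch on a single machine; the machine type is an illustrative choice.
    job.run(replica_count=1, machine_type="n1-standard-4", sync=True)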

Model Deployment

Step 1. Build deployment container

To perform model deployment and use the model to generate online or batch predictions, we first need to build the Docker image that we'll use in the deployment process. For this purpose, we created the DockerfileApp to build a custom container with all the necessary dependencies to use the BlackFridayRegressor model. This file defines a multi-stage build with two stages:

  • STAGE 1: Install Poetry and build the package based on the pyproject.toml.
  • STAGE 2: Install the package built in the previous step and run the application.

Local Deployment

To run the application locally and test it, we first need to create a new service account and grant it the permissions required to connect the application to the Google APIs. To do that, we can follow the official Google Cloud documentation on creating and managing service accounts.

  1. Create the service account: gcloud iam service-accounts create <NAME>

  2. Add the permissions policy for the service account: gcloud projects add-iam-policy-binding <PROJECT_ID> --member="serviceAccount:<SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com" --role=<ROLE>

  3. Create the service account JSON file: gcloud iam service-accounts keys create black-friday-account-keys.json --iam-account=<SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com

  4. Uncomment the code snippet in the DockerfileApp file tagged as "FOR LOCAL TESTING".

  5. Build the image (define DEPLOY_IMAGE_URI analogously to TRAIN_IMAGE_URI, pointing to your deployment image in Artifact Registry): docker build -f DockerfileApp -t ${DEPLOY_IMAGE_URI} ./

  6. Stop any existing application running: docker stop <CONTAINER_NAME>

  7. Delete any existing built container: docker rm <CONTAINER_NAME>

  8. Run the container:

    docker run -d -p 8080:8080 \
    --name=<CONTAINER_NAME> \
     -e AIP_HTTP_PORT=8080 \
     -e AIP_HEALTH_ROUTE=/health \
     -e AIP_PREDICT_ROUTE=/predict \
     -e AIP_STORAGE_URI=gs://$BUCKET_NAME/$MODEL_ARTIFACT_DIR \
     $DEPLOY_IMAGE_URI
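
Once the container is running, you can sanity-check it from Python, for example with the requests library. The feature names and values below are placeholders; they must match whatever the BlackFridayRegressor was trained on.

    import requests

    # Health check against the locally running container (port 8080 mapped above).
    print(requests.get("http://localhost:8080/health").status_code)

    # Hypothetical prediction request in the Vertex AI "instances" format.
    payload = {"instances": [{"Gender": "M", "Age": "26-35", "City_Category": "A"}]}
    response = requests.post("http://localhost:8080/predict", json=payload)
    print(response.json())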
    

GCP Deployment

To run the application in GCP, we can use the DockerfileApp without modifications and run the following lines to build and push the image to GCP Artifact Registry:

  1. Define DEPLOY_IMAGE_NAME (analogously to TRAIN_IMAGE_NAME) and build and push the image using the cloudbuild-deploy.yaml file by running the following from the root of the project:
 gcloud builds submit --config cloudbuild-deploy.yaml --substitutions=_REGION=$REGION,_PROJECT_ID=$PROJECT_ID,_REPO_NAME=$REPO_NAME,_DEPLOY_IMAGE_NAME=$DEPLOY_IMAGE_NAME .

Step 2. Register and deploy model

Now, with the deployment image built, the model deployment process uses the deploy_model.py script. This script registers the model in the Vertex AI Model Registry, deploys it to a Vertex AI Endpoint with dedicated resources, and creates a model monitoring job.

 python black_friday_profits/deployment/deploy_model.py \
  --artifact_uri model_uri_in_gcs \
  --machine_type "machine-type" \
  --endpoint_name "your-endpoint-name" \
  --model_name "your-model-name" \
  --deploy_image_uri $DEPLOY_IMAGE_URI \
  --monitoring_email "[email protected]" \
  --train_data_gcs_uri "your_train_data_gcs_uri" \
  --log_sampling_rate sampling_rate \
  --monitor_interval interval_hours \
  --target_column "your_target_column"

Make sure to change the parameters with your information. This command will create a new endpoint and deploy your model to it, making it ready for serving online predictions.
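
For reference, the registration and deployment performed by deploy_model.py roughly map to the following google-cloud-aiplatform calls. This is a simplified sketch with placeholder values; the monitoring-job setup that the script also performs is omitted here.

    from google.cloud import aiplatform

    aiplatform.init(project="your-project-id", location="us-central1")

    # Register the model in the Vertex AI Model Registry.
    model = aiplatform.Model.upload(
        display_name="your-model-name",
        artifact_uri="gs://your-bucket/path/to/model",        # --artifact_uri
        serving_container_image_uri="your-deploy-image-uri",  # --deploy_image_uri
        serving_container_predict_route="/predict",
        serving_container_health_route="/health",
    )

    # Create an endpoint and deploy the model with dedicated resources.
    endpoint = aiplatform.Endpoint.create(display_name="your-endpoint-name")
    model.deploy(
        endpoint=endpoint,
        machine_type="n1-standard-4",  # --machine_type
        min_replica_count=1,
        max_replica_count=1,
    )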

Making predictions

The prediction process is facilitated through the predict.py script, which supports both online and batch prediction modes for models deployed on Google Cloud's Vertex AI.

Online Prediction

python black_friday_profits/predictions/predict.py --online \
    --endpoint_id "your-endpoint-name" \
    --input_file "sample_data/your_input_file.json"

This command performs real-time, online predictions by sending input data to a deployed model's endpoint. It's ideal for applications requiring immediate inference.
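
In terms of the Vertex AI SDK, an online prediction against the deployed endpoint amounts to roughly the following (placeholder IDs and feature values; the script's internals may differ):

    from google.cloud import aiplatform

    aiplatform.init(project="your-project-id", location="us-central1")

    # Look up the deployed endpoint by its ID and request a prediction.
    endpoint = aiplatform.Endpoint("your-endpoint-id")
    prediction = endpoint.predict(instances=[{"Gender": "M", "Age": "26-35"}])
    print(prediction.predictions)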

Batch Prediction

python black_friday_profits/predictions/predict.py --batch \
   --model_id "your-model-id" \
   --gcs_batch_pred_source "your-test-data-uri" \
   --gcs_destination "your-results-destination-uri"

This command initiates a batch prediction job, allowing for the processing of large volumes of data. The results are stored in a specified Google Cloud Storage location, making it suitable for asynchronous prediction tasks on bulk data.
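
The SDK equivalent is roughly Model.batch_predict, again sketched with placeholder values and an assumed JSONL input format:

    from google.cloud import aiplatform

    aiplatform.init(project="your-project-id", location="us-central1")

    model = aiplatform.Model("your-model-id")

    # Launch an asynchronous batch prediction job reading from and writing to GCS.
    batch_job = model.batch_predict(
        job_display_name="black-friday-batch-prediction",
        gcs_source="gs://your-bucket/path/to/test_data.jsonl",
        gcs_destination_prefix="gs://your-bucket/batch_predictions/",
        instances_format="jsonl",  # adjust to "csv" if your input is CSV
        machine_type="n1-standard-4",
    )
    batch_job.wait()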

Running preprocessing independently

If you need to run the data preprocessing script independently, perhaps for testing or separate analysis purposes, follow these steps:

  1. Prepare your data:
    Ensure your data is accessible as a CSV file in a cloud storage bucket.

  2. Configure the script:
    Review and update the config.json file to match the preprocessing requirements specific to your data. This includes specifying the target column, features to drop, and categorical columns for encoding.

  3. Execute the preprocessing script:
    Run the script from the command line by providing the necessary parameters:

    python black_friday_profits/trainer/preprocessing.py \
    --bucket_name your_bucket_name \
    --file_name your_data_file.csv
    

    This script will output the processed data, ready for further use in training or analysis, in a folder in the specified bucket.
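
For orientation, the transformations driven by config.json can be sketched as follows. This is a simplified, hypothetical reimplementation using pandas and scikit-learn; the actual preprocessing.py reads from and writes to GCS and may use different encoders or splitting logic.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OrdinalEncoder

    # Illustrative values mirroring the config keys described earlier.
    categorical_columns = ["Gender", "Age", "City_Category"]
    columns_to_drop = ["User_ID", "Product_ID"]
    target_column = "Purchase"
    split_ratios = [0.8, 0.1, 0.1]

    df = pd.read_csv("your_data_file.csv").drop(columns=columns_to_drop)

    # Encode categorical columns as integer codes.
    df[categorical_columns] = OrdinalEncoder().fit_transform(df[categorical_columns])

    # Separate features and target, then split into train / validation / test sets.
    X, y = df.drop(columns=[target_column]), df[target_column]
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=split_ratios[0])
    val_fraction = split_ratios[1] / (split_ratios[1] + split_ratios[2])
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=val_fraction)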

Contributing

Our project embraces a streamlined workflow that ensures high-quality software development and efficient collaboration among team members. To maintain this standard, we follow a specific branching strategy and commit convention outlined in our CONTRIBUTING.md file.

We highly encourage all contributors to familiarize themselves with these guidelines. Adhering to the outlined practices helps us keep our codebase organized, facilitates easier code reviews, and accelerates the development process. For detailed information on our branching strategy and how we commit changes, please refer to the CONTRIBUTING.md file.

About

Example of an end-to-end machine learning pipeline using the Black Friday dataset to increase profits. This demo can be implemented using either Vertex AI (or Endpoints) or Dataproc, and can utilize any available machine learning library on Google Cloud (for example, XGBoost, scikit-learn, tf.Keras, Spark machine learning).
