This quick-start deployment guide can be used to set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.
NOTE: This environment is not intended to be a long-lived environment. It is intended for temporary demonstration and learning purposes.
For more information about the architecture, see the Playground AI/ML Platform on GKE: Architecture document.
For an outline of products and features used in the platform, see the Platform Products and Features document.
In this guide you can choose to bring your own project (BYOP) or have Terraform create a new project for you. The requirements differ based on the option that you choose.
Option 1: Bring your own project (BYOP)

- Project ID of a new Google Cloud Project, preferably with no APIs enabled
- roles/owner IAM permissions on the project
- GitHub Personal Access Token, steps to create the token are provided below

Option 2: Terraform managed project

- Billing account ID
- Organization or folder ID
- roles/billing.user IAM permissions on the billing account specified
- roles/resourcemanager.projectCreator IAM permissions on the organization or folder specified
- GitHub Personal Access Token, steps to create the token are provided below
The default quota given to a project should be sufficient for this guide.
NOTE: This tutorial is designed to be run from Cloud Shell in the Google Cloud Console.
-
Clone the repository and change directory to the guide directory
git clone https://github.com/GoogleCloudPlatform/accelerated-platforms && \
cd accelerated-platforms
-
Set environment variables
export MLP_BASE_DIR=$(pwd) && \
sed -n -i -e '/^export MLP_BASE_DIR=/!p' -i -e '$aexport MLP_BASE_DIR="'"${MLP_BASE_DIR}"'"' ${HOME}/.bashrc
cd platforms/gke-aiml/playground && \
export MLP_TYPE_BASE_DIR=$(pwd) && \
sed -n -i -e '/^export MLP_TYPE_BASE_DIR=/!p' -i -e '$aexport MLP_TYPE_BASE_DIR="'"${MLP_TYPE_BASE_DIR}"'"' ${HOME}/.bashrc
-
Create a Personal Access Token in GitHub:
Note: It is recommended to use a machine user account for this but you can use a personal user account just to try this reference architecture. Ensure that your organization allows access via the access token type you select.
Fine-grained personal access token
- Go to https://github.com/settings/tokens and log in using your credentials
- Click "Generate new token" >> "Generate new token (Beta)".
- Enter a Token name.
- Select the expiration.
- Select the Resource owner.
- Select All repositories
- Set the following Permissions:
  - Repository permissions:
    - Administration: Read and write
    - Contents: Read and write
- Click "Generate token"
Personal access tokens (classic)
- Go to https://github.com/settings/tokens and log in using your credentials
- Click "Generate new token" >> "Generate new token (classic)".
- You will be directed to a screen to create the new token. Provide a note and an expiration.
- Select the following two scopes:
- repo - Full control of private repositories
- delete_repo - Delete repositories
- Click "Generate token"
-
Store the token in a secure file.
# Create a secure directory
mkdir -p ${HOME}/secrets/
chmod go-rwx ${HOME}/secrets

# Create a secure file
touch ${HOME}/secrets/mlp-github-token
chmod go-rwx ${HOME}/secrets/mlp-github-token

# Put the token in the secure file using your preferred editor
nano ${HOME}/secrets/mlp-github-token
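Optionally, you can confirm the token works before continuing. This check is not part of the guide's tooling; it simply calls the GitHub API with the stored token and should return your user profile as JSON (a bad token returns "Bad credentials"):

```sh
# Optional check: verify the stored token against the GitHub API.
curl -s -H "Authorization: Bearer $(cat ${HOME}/secrets/mlp-github-token)" https://api.github.com/user
```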
-
Set the Git environment variables in Cloud Shell
Replace the following values:
- <GIT_NAMESPACE>: the GitHub organization or user namespace to use for the repositories
- <GIT_USER_EMAIL>: the email address to use for commits
- <GIT_USER_NAME>: the GitHub account to use for authentication
export MLP_GIT_NAMESPACE="<GIT_NAMESPACE>"
export MLP_GIT_USER_EMAIL="<GIT_USER_EMAIL>"
export MLP_GIT_USER_NAME="<GIT_USER_NAME>"
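For example, with hypothetical values (substitute your own GitHub namespace, email, and username):

```sh
# Hypothetical example values, shown for illustration only.
export MLP_GIT_NAMESPACE="my-github-org"
export MLP_GIT_USER_EMAIL="ml-admin@example.com"
export MLP_GIT_USER_NAME="ml-admin"
```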
-
Set the configuration variables
sed -i "s/YOUR_GIT_NAMESPACE/${MLP_GIT_NAMESPACE}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars sed -i "s/YOUR_GIT_USER_EMAIL/${MLP_GIT_USER_EMAIL}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars sed -i "s/YOUR_GIT_USER_NAME/${MLP_GIT_USER_NAME}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars
You only need to complete the section for the option that you have selected: option 1 (bring your own project) or option 2 (Terraform managed project).
-
Set the project environment variables in Cloud Shell
Replace the following value:
- <PROJECT_ID>: the ID of your existing Google Cloud project
export MLP_PROJECT_ID="<PROJECT_ID>"
export MLP_STATE_BUCKET="${MLP_PROJECT_ID}-terraform"
-
Set the default gcloud project

gcloud config set project ${MLP_PROJECT_ID}
-
Authorize gcloud
gcloud auth login --activate --no-launch-browser --quiet --update-adc
-
Create a Cloud Storage bucket to store the Terraform state
gcloud storage buckets create gs://${MLP_STATE_BUCKET} --project ${MLP_PROJECT_ID}
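Optionally (not required by this guide), you can enable object versioning on the state bucket so that earlier versions of the Terraform state are retained:

```sh
# Optional: retain previous versions of the Terraform state objects in the bucket created above.
gcloud storage buckets update gs://${MLP_STATE_BUCKET} --versioning
```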
-
Set the configuration variables
sed -i "s/YOUR_STATE_BUCKET/${MLP_STATE_BUCKET}/g" ${MLP_TYPE_BASE_DIR}/backend.tf sed -i "s/YOUR_PROJECT_ID/${MLP_PROJECT_ID}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars
You can now proceed to the Configure Identity-Aware Proxy (IAP) section.
-
Set the configuration variables
nano ${MLP_BASE_DIR}/terraform/features/initialize/initialize.auto.tfvars
environment_name = "dev" iap_support_email = "" project = { billing_account_id = "XXXXXX-XXXXXX-XXXXXX" folder_id = "############" name = "mlp" org_id = "############" }
- environment_name: the name of the environment
- iap_support_email: the email to use as the support contact for the IAP brand
- project.billing_account_id: the billing account ID
- project.name: the prefix for the display name of the project, the full name will be <project.name>-<environment_name>
- Enter either project.folder_id OR project.org_id (see the example below):
  - project.folder_id: the Google Cloud folder ID
  - project.org_id: the Google Cloud organization ID
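As an illustration, a filled-in initialize.auto.tfvars for an organization-level deployment might look like the following. All values are hypothetical, and whether the unused folder_id field should be left empty or removed depends on the module's variable definitions, so check the file's comments if unsure.

```
# Hypothetical example values; replace with your own billing account, support email, and organization ID.
environment_name  = "dev"
iap_support_email = "platform-admins@example.com"

project = {
  billing_account_id = "012345-ABCDEF-678901"
  folder_id          = ""             # left empty here because org_id is set (assumption)
  name               = "mlp"
  org_id             = "123456789012"
}
```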
-
Authorize gcloud
gcloud auth login --activate --no-launch-browser --quiet --update-adc
-
Create a new project
cd ${MLP_BASE_DIR}/terraform/features/initialize

terraform init && \
terraform plan -input=false -out=tfplan && \
terraform apply -input=false tfplan && \
rm tfplan && \
terraform init -force-copy -migrate-state && \
rm -rf state
-
Set the project environment variables in Cloud Shell
MLP_PROJECT_ID=$(grep environment_project_id ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
You can now proceed to the Create the resources section.
Identity-Aware Proxy (IAP) lets you establish a central authorization layer for applications accessed by HTTPS, so you can use an application-level access control model instead of relying on network-level firewalls.
IAP policies scale across your organization. You can define access policies centrally and apply them to all of your applications and resources. When you assign a dedicated team to create and enforce policies, you protect your project from incorrect policy definition or implementation in any application.
For more information on IAP, see the Identity-Aware Proxy documentation
For this guide we will configure a generic OAuth consent screen for internal use. Internal use means that only users within your organization can be granted IAM permissions to access the IAP secured applications and resources.
See the Configuring the OAuth consent screen documentation for additional information
NOTE: These steps only need to be completed once for a project. If you are using the Terraform managed project option, this has already been completed for you.
- Go to APIs & Services > OAuth consent screen configuration page.
- Select Internal for the User Type
- Click CREATE
- Enter IAP Secured Application for the App name
- Enter an email address for the User support email
- Enter an email address for the Developer contact information
- Click SAVE AND CONTINUE
- Leave the default values for Scopes
- Click SAVE AND CONTINUE
- On the Summary page, click BACK TO DASHBOARD
- The OAuth consent screen for the project is now configured.
For simplicity, in this guide access to the IAP secured applications will be configured to allow all users in the organization. Access can also be configured per IAP application or resource.
-
Set the IAP allow domain
MLP_IAP_DOMAIN=$(gcloud auth list --filter=status:ACTIVE --format="value(account)" | awk -F@ '{print $2}')
echo "MLP_IAP_DOMAIN=${MLP_IAP_DOMAIN}"
If the domain of the active gcloud user is different from the organization that the MLP_PROJECT_ID project is in, you will need to manually set the MLP_IAP_DOMAIN environment variable:

MLP_IAP_DOMAIN=<MLP_PROJECT_ID organization domain>
-
Set the IAP domain in the configuration file
sed -i '/^iap_domain[[:blank:]]*=/{h;s/=.*/= "'"${MLP_IAP_DOMAIN}"'"/};${x;/^$/{s//iap_domain = "'"${MLP_IAP_DOMAIN}"'"/;H};x}' ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars
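As an optional sanity check (not part of the guide's tooling), you can confirm the value that the command above wrote:

```sh
# Show the iap_domain line that was added or updated in the tfvars file.
grep "^iap_domain" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars
```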
Before running Terraform, make sure that the Service Usage API and Service Management API are enabled.
-
Enable Service Usage API
gcloud services enable serviceusage.googleapis.com
-
Enable the Service Management API
gcloud services enable servicemanagement.googleapis.com
-
Ensure the endpoints are not in a deleted state
MLP_ENVIRONMENT_NAME=$(grep environment_name ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
MLP_PROJECT_ID=$(grep environment_project_id ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)

gcloud endpoints services undelete gradio.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet 2>/dev/null
gcloud endpoints services undelete locust.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet 2>/dev/null
gcloud endpoints services undelete mlflow-tracking.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet 2>/dev/null
gcloud endpoints services undelete ray-dashboard.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet 2>/dev/null
-
Create the resources
cd ${MLP_TYPE_BASE_DIR} && \
terraform init && \
terraform plan -input=false -var git_token="$(tr --delete '\n' < ${HOME}/secrets/mlp-github-token)" -out=tfplan && \
terraform apply -input=false tfplan && \
rm tfplan
See Create resources errors in the Troubleshooting section if the apply does not complete successfully.
-
Create your environment configuration file
export MLP_ENVIRONMENT_FILE="${HOME}/mlp-${MLP_PROJECT_ID}-${MLP_ENVIRONMENT_NAME}.env" && \
sed -n -i -e '/^export MLP_ENVIRONMENT_FILE=/!p' -i -e '$aexport MLP_ENVIRONMENT_FILE="'"${MLP_ENVIRONMENT_FILE}"'"' ${HOME}/.bashrc && \
terraform output -raw environment_configuration > ${MLP_ENVIRONMENT_FILE} && \
source ${MLP_ENVIRONMENT_FILE}
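Optionally, confirm the configuration was sourced into your shell; the environment file defines the MLP_-prefixed variables used in the remaining steps:

```sh
# List the MLP_* variables loaded from the environment configuration file.
env | grep '^MLP_' | sort
```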
-
Go to Google Cloud Console, click on the navigation menu and click on Kubernetes Engine > Clusters. You should see one cluster.
-
Go to Google Cloud Console, click on the navigation menu and click on Kubernetes Engine > Config. If you haven't enabled GKE Enterprise in the project earlier, click the LEARN AND ENABLE button and then ENABLE GKE ENTERPRISE. You should see a RootSync and RepoSync object.
For the playground configuration, Ray and MLflow are installed by default.
You can check the installation by executing the following commands in Cloud Shell:
-
Get cluster credentials:
gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME}
The output will be similar to the following:
Starting to build Gateway kubeconfig...
Current project_id: mlops-platform-417609
A new kubeconfig entry "connectgateway_mlops-platform-417609_global_mlp-dev" has been generated and set as the current context.
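As an optional check, you can confirm that kubectl is now using the Connect gateway context for the playground cluster:

```sh
# Prints the active kubectl context; it should match the connectgateway entry shown above.
kubectl config current-context
```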
-
Fetch KubeRay operator CRDs
kubectl get crd | grep ray
The output will be similar to the following:
rayclusters.ray.io    2024-02-12T21:19:06Z
rayjobs.ray.io        2024-02-12T21:19:09Z
rayservices.ray.io    2024-02-12T21:19:12Z
-
Fetch KubeRay operator pod
kubectl get pods
The output will be similar to the following:
NAME                                READY   STATUS    RESTARTS   AGE
kuberay-operator-56b8d98766-2nvht   1/1     Running   0          6m26s
-
Check the namespace created:
kubectl get ns ${MLP_KUBERNETES_NAMESPACE}
The output will be similar to the following:
NAME      STATUS   AGE
ml-team   Active   ##m
-
Check the RepoSync object created for the namespace:
kubectl get reposync -n ${MLP_KUBERNETES_NAMESPACE}
-
Check the raycluster in the namespace

kubectl get raycluster -n ${MLP_KUBERNETES_NAMESPACE}
The output will be similar to the following:
NAME                  DESIRED WORKERS   AVAILABLE WORKERS   STATUS   AGE
ray-cluster-kuberay   1                 1                   ready    29m
-
Check the head and worker pods of kuberay in the namespace
kubectl get pods -n ${MLP_KUBERNETES_NAMESPACE}
The output will be similar to the following:
NAME                                           READY   STATUS    RESTARTS   AGE
ray-cluster-kuberay-head-sp6dg                 2/2     Running   0          3m21s
ray-cluster-kuberay-worker-workergroup-rzpjw   2/2     Running   0          3m21s
mlflow-tracking-6f9bb844f9-4749n               2/2     Running   0          3m13s
-
Open the namespace's Ray dashboard
echo -e "\n${MLP_KUBERNETES_NAMESPACE} Ray dashboard: ${MLP_RAY_DASHBOARD_NAMESPACE_ENDPOINT}\n"
-
Open the namespace's MLFlow Tracking server
echo -e "\n${MLP_KUBERNETES_NAMESPACE} MLFlow Tracking URL: ${MLP_MLFLOW_TRACKING_NAMESPACE_ENDPOINT}\n"
If you get ERR_CONNECTION_CLOSED or ERR_CONNECTION_RESET when trying to go to the links, the Gateway is still being provisioned. Retry in a couple of minutes.

If you get ERR_SSL_VERSION_OR_CIPHER_MISMATCH when trying to go to the links, the SSL certificate is still being provisioned. Retry in a couple of minutes.
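If you would rather wait from the terminal than keep refreshing the browser, a rough readiness check like the following can help. This is an illustrative sketch, not part of the guide's tooling: <ENDPOINT_URL> is a placeholder for the https:// URL printed by the echo commands above, and an HTTP status of 000 simply means the load balancer or certificate is not serving yet.

```sh
# Poll an IAP-protected endpoint until the load balancer and certificate are provisioned.
# Replace <ENDPOINT_URL> with the full https:// URL of the endpoint you are waiting for.
until [ "$(curl -s -o /dev/null -w '%{http_code}' "<ENDPOINT_URL>")" != "000" ]; do
  echo "Endpoint not reachable yet, retrying in 30 seconds..."
  sleep 30
done
echo "Endpoint is responding."
```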
-
Destroy the resources
cd ${MLP_TYPE_BASE_DIR} && \
terraform init && \
terraform destroy -auto-approve -var git_token="$(tr --delete '\n' < ${HOME}/secrets/mlp-github-token)" && \
rm -rf .terraform .terraform.lock.hcl
See Cleanup resources errors in the Troubleshooting section if the destroy does not complete successfully.
You only need to complete the section for the option that you selected: delete the project directly if you brought your own project (option 1), or destroy it with Terraform if Terraform created the project for you (option 2).
-
Delete the project
gcloud projects delete ${MLP_PROJECT_ID}
-
Destroy the project
cd ${MLP_BASE_DIR}/terraform/features/initialize && \
TERRAFORM_BUCKET_NAME=$(grep bucket backend.tf | awk -F"=" '{print $2}' | xargs) && \
cp backend.tf.local backend.tf && \
terraform init -force-copy -lock=false -migrate-state && \
gsutil -m rm -rf gs://${TERRAFORM_BUCKET_NAME}/* && \
terraform init && \
terraform destroy -auto-approve && \
rm -rf .terraform .terraform.lock.hcl state/
-
Delete the environment configuration file
rm -f ${MLP_ENVIRONMENT_FILE}
-
Remove the environment configuration environment variable
sed -i -e '/^export MLP_ENVIRONMENT_FILE=/d' ${HOME}/.bashrc
-
Restore modified files
cd ${MLP_BASE_DIR} && \
git restore \
  platforms/gke-aiml/playground/backend.tf \
  platforms/gke-aiml/playground/mlp.auto.tfvars \
  terraform/features/initialize/backend.tf \
  terraform/features/initialize/backend.tf.bucket \
  terraform/features/initialize/initialize.auto.tfvars
-
Remove Terraform files and temporary files
cd ${MLP_BASE_DIR} && \
rm -rf \
  platforms/gke-aiml/playground/.terraform \
  platforms/gke-aiml/playground/.terraform.lock.hcl \
  terraform/features/initialize/.terraform \
  terraform/features/initialize/.terraform.lock.hcl \
  terraform/features/initialize/backend.tf.local \
  terraform/features/initialize/state
-
Remove the environment variables
sed \
  -i -e '/^export MLP_BASE_DIR=/d' \
  -i -e '/^export MLP_TYPE_BASE_DIR=/d' \
  ${HOME}/.bashrc
│ Error: Error creating Client: googleapi: Error 404: Requested entity was not found.
│
│ with google_iap_client.ray_head_client,
│ on gateway.tf line ###, in resource "google_iap_client" "ray_head_client":
│ ###: resource "google_iap_client" "ray_head_client" {
│
The OAuth consent screen was not configured; see the Configure OAuth consent screen for IAP section.
│ Error: googleapi: Error 400: Service ray-dashboard.ml-team.mlp-<environment_name>.endpoints.<project_id>.cloud.goog has been deleted and
will be purged after 30 days. To reuse this service, please undelete the service following https://cloud.google.com/service-infrastructure/docs/create-services#undeleting., failedPrecondition
│
│ with google_endpoints_service.ray_dashboard_https,
│ on gateway.tf line ##, in resource "google_endpoints_service" "ray_dashboard_https":
│ ##: resource "google_endpoints_service" "ray_dashboard_https" {
│
The endpoint is in a deleted state and needs to be undeleted. Run the following commands and then rerun the Terraform apply.
MLP_ENVIRONMENT_NAME=$(grep environment_name ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
MLP_PROJECT_ID=$(grep environment_project_id ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
gcloud endpoints services undelete ray-dashboard.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet
│ Error: Error waiting for Deleting Network: The network resource 'projects/<project_id>/global/networks/ml-vpc-dev'
is already being used by 'projects/<project_id>/zones/us-central1-a/networkEndpointGroups/k8s1-XXXXXXXX-ml-team-ray-cluster-kuberay-head-svc-XXX-XXXXXXXX'
│
│
│
There were orphaned network endpoint groups (NEGs) in the project. Delete the network endpoint groups and retry the Terraform destroy.
gcloud compute network-endpoint-groups list --project ${MLP_PROJECT_ID}
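Once the orphaned NEGs are identified from the list output, they can be deleted. The name and zone below are placeholders; substitute the values reported for your project:

```sh
# Delete an orphaned network endpoint group; replace <NEG_NAME> and <ZONE> with values from the list output.
gcloud compute network-endpoint-groups delete <NEG_NAME> --zone <ZONE> --project ${MLP_PROJECT_ID} --quiet
```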
│ Error: Error waiting for Deleting Network: The network resource 'projects/<project_id>/global/networks/ml-vpc-dev'
is already being used by 'projects/<project-id>/global/firewalls/gkegw1-XXXX-l7-ml-vpc-dev-global'
│
│
│
There were orphaned VPC firewall rules in the project. Delete the VPC firewall rules and retry the Terraform destroy.
gcloud compute firewall-rules list --project ${MLP_PROJECT_ID}
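Orphaned rules found in the list output can then be deleted. The rule name below is a placeholder; substitute the name reported for your project:

```sh
# Delete an orphaned VPC firewall rule; replace <FIREWALL_RULE_NAME> with a name from the list output.
gcloud compute firewall-rules delete <FIREWALL_RULE_NAME> --project ${MLP_PROJECT_ID} --quiet
```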