Commit 2561974

[TPU Webhook] Update Helm chart with Cert duration and renewBefore fields (GoogleCloudPlatform#1113)
Re-open helm PR

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
1 parent: c62db2c

File tree (5 files changed: +24 / -49 lines)

- ray-on-gke/guides/tpu/README.md
- ray-on-gke/tpu/kuberay-tpu-webhook/certs/cert.yaml
- ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart/Chart.yaml
- ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart/templates/cert.yaml
- ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart/values.yaml


ray-on-gke/guides/tpu/README.md

Lines changed: 12 additions & 47 deletions
@@ -11,42 +11,12 @@ For addition useful information about TPUs on GKE (such as topology configuratio
 
 In addition, please ensure the following are installed on your local development environment:
 * Helm (v3.9.3)
-* Terraform (v1.7.4)
 * Kubectl
 
-### Provisioning a GKE Cluster with Terraform (Optional)
-
-Skip this section if you already have a GKE cluster with TPUs (cluster version should be 1.28 or later).
-
-1. `git clone https://github.com/GoogleCloudPlatform/ai-on-gke`
-
-2. `cd ai-on-gke/infrastructure`
-
-3. Edit `platform.tfvars` with your GCP settings.
-
-4. Change the region or zone to one where TPUs are available (see [this link](https://cloud.google.com/tpu/docs/regions-zones) for details.
-For v4 TPUs (the default type), the region should be set to `us-central2` or `us-central2-b`.
-
-5. Set the following flags (note that TPUs are currently only supported on GKE standard):
-
-```
-autopilot_cluster = false
-...
-enable_tpu = true
-```
-
-6. Change the following lines in the `tpu_pools` configuration to match your desired [TPU accelerator](https://cloud.google.com/tpu/docs/supported-tpu-configurations#using-accelerator-type).
-```
-accelerator_count = 2
-accelerator_type = "nvidia-tesla-t4"
-```
-
-7. Run `terraform init && terraform apply -var-file platform.tfvars`
-
 
 ### Manually Installing the TPU Initialization Webhook
 
-The TPU Initialization Webhook automatically bootstraps the TPU environment for TPU clusters. The webhook needs to be installed once per GKE cluster and requires a Kuberay Operator running v1.1+ and GKE cluster version of 1.28+. The webhook requires [cert-manager](https://github.com/cert-manager/cert-manager) to be installed in-cluster to handle TLS certificate injection. cert-manager can be installed in both GKE standard and autopilot clusters using the following helm commands:
+The TPU Initialization Webhook automatically bootstraps the TPU environment for TPU clusters. The webhook needs to be installed once per GKE cluster and requires a KubeRay Operator running v1.1+ and GKE cluster version of 1.28+. The webhook requires [cert-manager](https://github.com/cert-manager/cert-manager) to be installed in-cluster to handle TLS certificate injection. cert-manager can be installed in both GKE standard and autopilot clusters using the following helm commands:
 ```
 helm repo add jetstack https://charts.jetstack.io
 helm repo update
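Note: the hunk's context ends at `helm repo update`, before the README's actual cert-manager install command. For orientation only, the standard upstream install looks like the sketch below; this is not necessarily the exact command the README uses.

```
# Install cert-manager into its own namespace; installCRDs=true lets the chart
# manage cert-manager's CRDs.
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true
```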
@@ -62,27 +32,23 @@ Installing the webhook:
 - to change the namespace, edit the "namespace" value in each .yaml in deployments/ and certs/
 4. `make deploy-cert`
 
-For common errors encountered when deploying the webhook, see the [Troubleshooting guide](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/applications/ray/kuberay-tpu-webhook/Troubleshooting.md).
+The webhook can also be installed using the [Helm chart](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart), enabling users to easily edit the webhook configuration. This helm package is stored on Artifact Registry and can be installed with the following commands:
+1. Ensure you are authenticated with gcloud:
+- `gcloud auth login`
+- `gcloud auth configure-docker us-docker.pkg.dev`
+3. `helm install kuberay-tpu-webhook oci://us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook-helm/kuberay-tpu-webhook`
 
-### Creating the Kuberay Cluster
+The above command can be edited with `-f` or `--set` flags to pass in a custom values file or key-value pair respectively for the chart (i.e. `--set tpuWebhook.image.tag=v1.2.3-gke.0`).
 
-You can find sample TPU cluster manifests for [single-host](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.tpu-v4-singlehost.yaml) and [multi-host](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.tpu-v4-multihost.yaml) here.
-
-If you are using Terraform:
-
-1. Get the GKE cluster name and location/region from `infrastructure/platform.tfvars`.
-Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`.
-Configuring `gcloud` [instructions](https://cloud.google.com/sdk/docs/initializing)
+For common errors encountered when deploying the webhook, see the [Troubleshooting guide](https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/applications/ray/kuberay-tpu-webhook/Troubleshooting.md).
 
-2. `cd ../applications/ray`
 
-3. Edit `workloads.tfvars` with your GCP settings. Replace `<your project ID>` and `<your cluster name>` with the names you used in `platform.tfvars`.
+### Creating the KubeRay Cluster
 
-4. Run `terraform init && terraform apply -var-file workloads.tfvars`
+You can find sample TPU cluster manifests for [single-host](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.tpu-v4-singlehost.yaml) and [multi-host](https://github.com/ray-project/kuberay/blob/master/ray-operator/config/samples/ray-cluster.tpu-v4-multihost.yaml) here.
 
-This should deploy a Kuberay cluster with a single TPU worker node (v4 TPU with `2x2x1` topology).
+For a quick-start guide to using TPUs with KubeRay, see [Use TPUs with KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tpu.html).
 
-To deploy a multi-host Ray Cluster, modify the `worker` spec [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-cluster/kuberay-tpu-values.yaml) by changing the `cloud.google.com/gke-tpu-topology` `nodeSelector` to a multi-host topology. Set the `numOfHosts` field in the `worker` spec to the number of hosts specified by your chosen topology. For v4 TPUs, each TPU VM has access to 4 TPU chips. Therefore, you can calculate the number of TPU VM hosts by taking the product of the topology and dividing by 4 (i.e. a 2x2x4 TPU podslice will have 4 TPU VM hosts).
 
 ### Running Sample Workloads
 
@@ -114,6 +80,5 @@ print(ray.get(result))
 3. `export RAY_ADDRESS=http://localhost:8265`
 4. `ray job submit --runtime-env-json='{"working_dir": "."}' -- python test_tpu.py`
 
-For a more advanced workload running Stable Diffusion on TPUs, see [here](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/applications/ray/example_notebooks/stable-diffusion-tpu.ipynb).
-
+For a more advanced workload running Stable Diffusion on TPUs, see [here](https://cloud.google.com/kubernetes-engine/docs/add-on/ray-on-gke/tutorials/deploy-ray-serve-stable-diffusion-tpu). For an example of serving a LLM with TPUs, RayServe, and KubeRay, see [here](https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-lllm-tpu-ray).
 
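To make the README's new `-f`/`--set` guidance concrete, a hypothetical install that overrides chart values might look like the following. The key names (`tpuWebhook.image.tag`, `tpuWebhook.cert.duration`) come from this chart; the tag and duration shown are illustrative placeholders, not values from this commit.

```
# Install the webhook chart from Artifact Registry with value overrides.
helm install kuberay-tpu-webhook \
  oci://us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook-helm/kuberay-tpu-webhook \
  --set tpuWebhook.image.tag=v1.2.3-gke.0 \
  --set tpuWebhook.cert.duration=4320h
```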
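The final hunk references a `test_tpu.py` job script that is not shown in this diff. A minimal sketch of what such a script could look like, assuming the worker image ships JAX and that KubeRay exposes a `TPU` resource to Ray (both assumptions, not taken from this commit):

```
import ray

ray.init()

# Ask for the 4 TPU chips that a single v4 TPU VM exposes.
@ray.remote(resources={"TPU": 4})
def tpu_check():
    import jax  # assumes JAX is installed in the Ray worker image
    return jax.device_count()

result = tpu_check.remote()
print(ray.get(result))
```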

ray-on-gke/tpu/kuberay-tpu-webhook/certs/cert.yaml

Lines changed: 2 additions & 0 deletions
@@ -31,4 +31,6 @@ spec:
   - kuberay-tpu-webhook.ray-system.svc
   - kuberay-tpu-webhook.ray-system.svc.cluster.local
   issuerRef:
+    kind: Issuer
+    group: cert-manager.io
     name: selfsigned-issuer
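The two added fields spell out the `kind` and `group` of the referenced issuer instead of relying on cert-manager's defaults. For reference, a self-signed Issuer that this reference would resolve to might look like the sketch below; the actual `selfsigned-issuer` manifest lives elsewhere in the repo and is not part of this diff.

```
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-issuer
  namespace: ray-system  # same namespace as the webhook service in the dnsNames above
spec:
  selfSigned: {}
```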

ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -6,10 +6,10 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 0.2.3
+version: 0.2.4
 
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
 # It is recommended to use it with quotes.
-appVersion: "1.2.3"
+appVersion: "1.2.4"
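With the chart version bumped to 0.2.4, an install can be pinned to the release that carries the new cert fields. A hypothetical pinned install, reusing the OCI path from the README change above:

```
helm install kuberay-tpu-webhook \
  oci://us-docker.pkg.dev/ai-on-gke/kuberay-tpu-webhook-helm/kuberay-tpu-webhook \
  --version 0.2.4
```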

ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart/templates/cert.yaml

Lines changed: 4 additions & 0 deletions
@@ -26,9 +26,13 @@ metadata:
   name: kuberay-tpu-webhook-certs
   namespace: {{ .Values.tpuWebhook.namespace.name }}
 spec:
+  duration: {{ .Values.tpuWebhook.cert.duration }}
+  renewBefore: {{ .Values.tpuWebhook.cert.renewBefore }}
   secretName: kuberay-tpu-webhook-certs
   dnsNames:
   - kuberay-tpu-webhook.{{ .Values.tpuWebhook.namespace.name }}.svc
   - kuberay-tpu-webhook.{{ .Values.tpuWebhook.namespace.name }}.svc.cluster.local
   issuerRef:
+    kind: Issuer
+    group: cert-manager.io
     name: selfsigned-issuer
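With the defaults added to values.yaml below, this template should render roughly the following Certificate. This is a sketch of the expected output, assuming the chart's default `tpuWebhook.namespace.name` is `ray-system`, consistent with the non-templated certs/cert.yaml above.

```
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kuberay-tpu-webhook-certs
  namespace: ray-system
spec:
  duration: 2160h     # 90d, from tpuWebhook.cert.duration
  renewBefore: 360h   # 15d, from tpuWebhook.cert.renewBefore
  secretName: kuberay-tpu-webhook-certs
  dnsNames:
  - kuberay-tpu-webhook.ray-system.svc
  - kuberay-tpu-webhook.ray-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    group: cert-manager.io
    name: selfsigned-issuer
```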

ray-on-gke/tpu/kuberay-tpu-webhook/helm-chart/values.yaml

Lines changed: 4 additions & 0 deletions
@@ -17,3 +17,7 @@ tpuWebhook:
   service:
     type: ClusterIP
     port: 443
+
+  cert:
+    duration: 2160h # 90d
+    renewBefore: 360h # 15d
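These defaults can be overridden without editing the chart. A hypothetical values file to pass with `-f` (key names from this commit, durations illustrative):

```
# my-values.yaml: longer certificate lifetime and a wider renewal window
tpuWebhook:
  cert:
    duration: 4320h    # 180d
    renewBefore: 720h  # 30d
```

Installing with `helm install ... -f my-values.yaml` then applies these values in place of the chart defaults, per the README note above.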
