
Commit 7995f45

Merge pull request #3915 from cdunbar13/release-candidate
Updated GKE-A4 docs, and GKE-A3U to mirror the A4 doc
2 parents 6a08bab + 0288e1a commit 7995f45

File tree: 2 files changed, +5 -232 lines changed

examples/gke-a3-ultragpu/README.md

Lines changed: 3 additions & 1 deletion
@@ -1 +1,3 @@

Removed:

Refer to [AI Hypercomputer Documentation](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#create-cluster) for instructions.

Added:

Refer to [Create an AI-optimized GKE cluster with default configuration](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#use-cluster-toolkit) for instructions on creating the GKE-A3U cluster.

Refer to [Deploy and run NCCL test with Topology Aware Scheduling (TAS)](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#deploy-run-nccl-tas-test) for instructions on running a NCCL test on the GKE-A3U cluster.

examples/gke-a4/README.md

Lines changed: 2 additions & 231 deletions
@@ -1,232 +1,3 @@

Added:

Refer to [Create an AI-optimized GKE cluster with default configuration](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#use-cluster-toolkit) for instructions on creating the GKE-A4 cluster.

Refer to [Deploy and run NCCL test with Topology Aware Scheduling (TAS)](https://cloud.google.com/ai-hypercomputer/docs/create/gke-ai-hypercompute#deploy-run-nccl-tas-test) for instructions on running a NCCL test on the GKE-A4 cluster.

Removed (the previous README content):

# Create a GKE Cluster with A4 nodes

This example shows how to create your own [Hypercompute Cluster](https://cloud.google.com/ai-hypercomputer/docs/hypercompute-cluster) with Google Kubernetes Engine (GKE) to support your AI and ML workloads, using A4 GPUs.

GKE is the open, portable, extensible, and highly scalable platform for Hypercompute Cluster. GKE provides a single platform surface to run a diverse set of workloads for your organization's needs. This includes high-performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. GKE reduces the operational burden of managing multiple platforms.

The following instructions use [Cluster Toolkit](https://cloud.google.com/cluster-toolkit/docs/overview), which lets you create your GKE cluster quickly while incorporating best practices. Through Cluster Toolkit, you have access to reference design blueprints that codify the Hypercompute Cluster environment on GKE, including compute, storage, and networking resources. Additionally, Cluster Toolkit sets up the cluster to use GPUDirect RDMA over Converged Ethernet (RoCE) for distributed AI workloads.
## Before you begin
Before you start, make sure that you have performed the following tasks:

* Enable the Google Kubernetes Engine API.

* If you want to use the Google Cloud CLI for this task, [install](https://cloud.google.com/sdk/docs/install) and then [initialize](https://cloud.google.com/sdk/docs/initializing) the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running `gcloud components update`.

  > **NOTE:** For existing gcloud CLI installations, make sure to set the `compute/region` and `compute/zone` properties. By setting default locations, you can avoid gcloud CLI errors like the following: `One of [--zone, --region] must be supplied: Please specify location.`

* Ensure that you have enough quota for A4 GPUs. To request more quota, follow the instructions in [GPU quota](https://cloud.google.com/compute/resource-usage#gpu_quota). To ensure that your cluster has capacity, you can follow the instructions to [reserve capacity](#reserve-capacity).

* Ensure that you have the following roles enabled (see the sketch after this list):

  * `roles/editor`
  * `roles/container.clusterAdmin`
  * `roles/iam.serviceAccountAdmin`
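As a minimal sketch of the first and last items in this list (the project ID and user email below are placeholders, not values from this guide), you could enable the API and grant the roles with the gcloud CLI:

```sh
# Placeholders; substitute your own project and user.
PROJECT_ID=my-project
USER_EMAIL=user@example.com

# Enable the Google Kubernetes Engine API.
gcloud services enable container.googleapis.com --project="${PROJECT_ID}"

# Grant each required role to the account that will run Cluster Toolkit.
for role in roles/editor roles/container.clusterAdmin roles/iam.serviceAccountAdmin; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="user:${USER_EMAIL}" --role="${role}"
done
```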
### Requirements
The following requirements apply to GKE Hypercompute Cluster:

* The B200 GPUs in A4 VMs require GPU driver version 570 or later, which is available in GKE 1.32 as the `LATEST` driver version. For A4, you must set `gpu_driver_version: "LATEST"` with GKE 1.32.
* To use GPUDirect RDMA, use GKE patch version 1.32.1-gke.1420000 or later.
* To use GPUDirect RDMA, the GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
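To check which GKE versions are currently offered in your region before creating the cluster, one option (placeholder region assumed) is to query the server config:

```sh
# List available GKE versions in your region; look for
# 1.32.1-gke.1420000 or later.
gcloud container get-server-config --region=us-central1 \
    --format="yaml(validMasterVersions)"
```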
## Reserve capacity
To ensure that your workloads have the A4 GPU resources required for these instructions, you can create a [future reservation request](https://cloud.google.com/compute/docs/instances/future-reservations-overview). With this request, you can reserve blocks of capacity for a defined duration in the future. At that date and time, Compute Engine automatically provisions the blocks of capacity by creating on-demand reservations that you can immediately consume by provisioning node pools for this cluster.

Additionally, as your reserved capacity might span multiple [blocks](https://cloud.google.com/ai-hypercomputer/docs/terminology#block), we recommend that you create GKE nodes on a specific block within your reservation.

Do the following steps to request capacity and gather the required information to create nodes on a specific block within your reservation:

1. [Request capacity](https://cloud.google.com/ai-hypercomputer/docs/request-capacity).

1. To get the names of the blocks that are available for your reservation, run the following command:

    ```sh
    gcloud beta compute reservations blocks list RESERVATION_NAME \
        --zone=COMPUTE_ZONE --format "value(name)"
    ```

    Replace the following:

    * `RESERVATION_NAME`: the name of your reservation.
    * `COMPUTE_ZONE`: the compute zone of your reservation.

    The output has the following format: `BLOCK_NAME`. For example, the output might be similar to the following: `example-res1-block-0001`.

1. If you want to target specific blocks within a reservation when provisioning GKE node pools, you must specify the full reference to your block as follows:

    ```none
    RESERVATION_NAME/reservationBlocks/BLOCK_NAME
    ```

    For example, using the example output in the preceding step, the full path is as follows: `example-res1/reservationBlocks/example-res1-block-0001`
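Putting the last two steps together, a small sketch (same placeholder reservation and zone as above) that captures the first available block and prints the full reference:

```sh
# Placeholders; substitute your reservation and zone.
RESERVATION_NAME=example-res1
COMPUTE_ZONE=us-central1-a

# Grab the first block in the reservation and compose the full reference
# used when targeting node pools.
BLOCK_NAME=$(gcloud beta compute reservations blocks list "${RESERVATION_NAME}" \
  --zone="${COMPUTE_ZONE}" --format="value(name)" | head -n 1)
echo "${RESERVATION_NAME}/reservationBlocks/${BLOCK_NAME}"
```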
## Create a cluster using Cluster Toolkit
This section guides you through the cluster creation process, ensuring that your project follows best practices and meets the [requirements](#requirements) for GKE Hypercompute Cluster.

> **NOTE:** If you would like to create more than one cluster in a project, make sure you update the deployment name.

1. [Launch Cloud Shell](https://cloud.google.com/shell/docs/launching-cloud-shell). You can use a different environment; however, we recommend Cloud Shell because the dependencies are already pre-installed for Cluster Toolkit. If you don't want to use Cloud Shell, follow the instructions to [install dependencies](https://cloud.google.com/cluster-toolkit/docs/setup/install-dependencies) to prepare a different environment.

1. Clone the Cluster Toolkit from the git repository:

    ```sh
    cd ~
    git clone https://github.com/GoogleCloudPlatform/cluster-toolkit.git
    ```

1. Install the Cluster Toolkit:

    ```sh
    cd cluster-toolkit && git checkout main && make
    ```

1. Create a Cloud Storage bucket to store the state of the Terraform deployment:

    ```sh
    gcloud storage buckets create gs://BUCKET_NAME \
        --default-storage-class=STANDARD \
        --location=COMPUTE_REGION \
        --uniform-bucket-level-access
    gcloud storage buckets update gs://BUCKET_NAME --versioning
    ```

    Replace the following variables:

    * `BUCKET_NAME`: the name of the new Cloud Storage bucket.
    * `COMPUTE_REGION`: the compute region where you want to store the state of the Terraform deployment.

1. In the [`examples/gke-a4/gke-a4-deployment.yaml`](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/develop/examples/gke-a4/gke-a4-deployment.yaml) file, replace the following variables in the `terraform_backend_defaults` and `vars` sections to match the specific values for your deployment (see the sketch after this step):

    * `bucket`: the name of the Cloud Storage bucket you created in the previous step.
    * `project_id`: your Google Cloud project ID.
    * `region`: the compute region for the cluster.
    * `zone`: the compute zone for the node pool of A4 machines.
    * `authorized_cidr`: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine that calls Terraform.
    * `extended_reservation`: the name of your reservation in the form `<project>/<reservation-name>/reservationBlocks/<reservation-block-name>`.
    * `static_node_count`: the number of A4 nodes in your cluster.

    To modify advanced settings, edit `examples/gke-a4/gke-a4.yaml`.
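    For illustration, a filled-in deployment file might look roughly like the following. Every value is a placeholder, and any field not listed above (such as `deployment_name`) is an assumption about the shipped file rather than a documented requirement; the heredoc just keeps the sketch in shell, and in practice you would edit the file directly:

    ```sh
    # Hypothetical values only -- substitute your own. Check the shipped
    # gke-a4-deployment.yaml for the authoritative field layout.
    cat > examples/gke-a4/gke-a4-deployment.yaml <<'EOF'
    terraform_backend_defaults:
      type: gcs
      configuration:
        bucket: my-tf-state-bucket
    vars:
      deployment_name: gke-a4
      project_id: my-project
      region: us-central1
      zone: us-central1-b
      authorized_cidr: 203.0.113.0/32
      extended_reservation: my-project/example-res1/reservationBlocks/example-res1-block-0001
      static_node_count: 2
    EOF
    ```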
1. Generate [Application Default Credentials (ADC)](https://cloud.google.com/docs/authentication/provide-credentials-adc#google-idp) to provide access to Terraform.
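    If you are working outside Cloud Shell, this is typically done with:

    ```sh
    # Create Application Default Credentials for Terraform to use.
    gcloud auth application-default login
    ```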
1. Deploy the blueprint to provision the GKE infrastructure using A4 machine types:

    ```sh
    cd ~/cluster-toolkit
    ./gcluster deploy -d \
        examples/gke-a4/gke-a4-deployment.yaml \
        examples/gke-a4/gke-a4.yaml
    ```
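    After the deployment completes, a quick sanity check (assuming the cluster is named `gke-a4`, matching the deployment name) is to list it:

    ```sh
    # Confirm that the new cluster is up.
    gcloud container clusters list --filter="name=gke-a4"
    ```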
## Deploy and run NCCL test with Topology Aware Scheduling (TAS)
To validate the functionality of the provisioned cluster, you can run a [NCCL test](https://github.com/NVIDIA/nccl-tests). To run a NCCL test with [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/), complete the following steps.

1. Connect to your cluster:

    ```sh
    gcloud container clusters get-credentials gke-a4
    ```

1. Deploy an all-gather NCCL performance test with Topology Aware Scheduling enabled by using the [nccl-jobset-example.yaml](https://github.com/GoogleCloudPlatform/cluster-toolkit/blob/develop/examples/gke-a4/nccl-jobset-example.yaml) file.

    By default, this test uses two nodes. To change the number of nodes, modify the YAML file to change the following values from `2` to your required number of nodes (see the sketch after this list):

    * `parallelism`
    * `completions`
    * `N_NODES`
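    As a rough sketch of that edit (the exact layout of the manifest is an assumption; verify the matched lines before using the result), the three values could be bumped in one pass:

    ```sh
    # Hypothetical: scale the two-node default up to four nodes, assuming
    # the only matches are 'parallelism: 2', 'completions: 2', and the
    # N_NODES env var's 'value: "2"'. Check nccl-jobset-example.yaml first.
    N=4
    sed -e "s/parallelism: 2/parallelism: ${N}/" \
        -e "s/completions: 2/completions: ${N}/" \
        -e "s/value: \"2\"/value: \"${N}\"/" \
        ~/cluster-toolkit/examples/gke-a4/nccl-jobset-example.yaml \
        > /tmp/nccl-jobset-example-scaled.yaml
    ```

    You would then pass the rewritten file to `kubectl create` in the next step instead of the original manifest.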
    Create the resources to run the test:

    ```sh
    kubectl create -f ~/cluster-toolkit/examples/gke-a4/nccl-jobset-example.yaml
    ```

    This command returns a JobSet name.

    The output should be similar to the following:

    ```sh
    jobset.jobset.x-k8s.io/ag-4-9lkmq created
    ```

1. To view the results of the NCCL test, run this command to view all of the running Pods:

    ```sh
    kubectl get pods
    ```

    The output should be similar to the following:

    ```sh
    NAME                     READY   STATUS      RESTARTS   AGE
    ag-2-jnftb-w-0-0-8wrqq   0/1     Completed   0          74s
    ag-2-jnftb-w-0-1-kcxjj   0/1     Completed   0          74s
    ```
1. Find a Pod name matching the pattern `jobset-name-w-0-0-*`. The logs of this Pod contain the results of the NCCL test.

    To fetch the logs for this Pod, run this command:

    ```sh
    kubectl logs ag-2-jnftb-w-0-0-8wrqq
    ```

    The output should be similar to the following:

    ```sh
    #       size        count     type   redop    root      time   algbw   busbw  #wrong      time   algbw   busbw  #wrong
    #        (B)   (elements)                                (us)  (GB/s)  (GB/s)              (us)  (GB/s)  (GB/s)
            1024           16    float    none      -1     39.23    0.03    0.02       0     35.16    0.03    0.03       0
            2048           32    float    none      -1     36.35    0.06    0.05       0     35.80    0.06    0.05       0
            4096           64    float    none      -1     36.21    0.11    0.11       0     35.88    0.11    0.11       0
            8192          128    float    none      -1     36.87    0.22    0.21       0     36.60    0.22    0.21       0
           16384          256    float    none      -1     37.41    0.44    0.41       0     37.16    0.44    0.41       0
           32768          512    float    none      -1     39.60    0.83    0.78       0     39.18    0.84    0.78       0
           65536         1024    float    none      -1     40.90    1.60    1.50       0     41.00    1.60    1.50       0
          131072         2048    float    none      -1     45.50    2.88    2.70       0     41.97    3.12    2.93       0
          262144         4096    float    none      -1     46.80    5.60    5.25       0     43.63    6.01    5.63       0
          524288         8192    float    none      -1     46.44   11.29   10.58       0     48.86   10.73   10.06       0
         1048576        16384    float    none      -1     81.56   12.86   12.05       0     80.30   13.06   12.24       0
         2097152        32768    float    none      -1     86.29   24.30   22.78       0     84.16   24.92   23.36       0
         4194304        65536    float    none      -1     95.18   44.07   41.31       0     89.88   46.67   43.75       0
         8388608       131072    float    none      -1     103.9   80.75   75.70       0     103.7   80.88   75.82       0
        16777216       262144    float    none      -1     132.9  126.23  118.34       0     132.4  126.72  118.80       0
        33554432       524288    float    none      -1     185.7  180.69  169.39       0     183.7  182.65  171.23       0
        67108864      1048576    float    none      -1     285.6  235.01  220.32       0     292.3  229.59  215.24       0
       134217728      2097152    float    none      -1     477.4  281.17  263.60       0     470.8  285.10  267.28       0
       268435456      4194304    float    none      -1     792.9  338.55  317.40       0     775.8  346.02  324.40       0
       536870912      8388608    float    none      -1    1456.3  368.65  345.61       0    1446.0  371.28  348.07       0
      1073741824     16777216    float    none      -1    2809.4  382.20  358.32       0    2788.3  385.08  361.02       0
      2147483648     33554432    float    none      -1    5548.2  387.06  362.87       0    5457.9  393.46  368.87       0
      4294967296     67108864    float    none      -1     11017  389.83  365.47       0     10806  397.48  372.63       0
      8589934592    134217728    float    none      -1     21986  390.71  366.29       0     21499  399.55  374.57       0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 128.335
    ```
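    Because the JobSet name suffix is generated, one hedged convenience (assuming only this JobSet's Pods are in the namespace) is to look the Pod up instead of typing the generated name:

    ```sh
    # Fetch the NCCL results from the first worker pod without copying
    # the generated pod name by hand.
    POD=$(kubectl get pods -o name | grep -- '-w-0-0-' | head -n 1)
    kubectl logs "${POD}"
    ```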
## Clean up
227-
228-
To avoid recurring charges for the resources used on this page, clean up the resources provisioned by Cluster Toolkit, including the VPC networks and GKE cluster:
229-
230-
```sh
231-
./gcluster destroy gke-a4/
232-
```
