# AICR - Critical User Journey (CUJ) 1 — GKE

## Assumptions

* The user is already authenticated to a GKE cluster with 2+ H100 (a3-megagpu-8g) nodes.
* The GKE cluster runs Container-Optimized OS (COS) with GPU drivers pre-installed.
* The values used in the `--accelerated-node-selector` and `--accelerated-node-toleration` flags are examples only; update them to match your cluster.
* System nodes have no custom taints (GKE-managed pods don't tolerate them).

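To find the selector and toleration values that actually apply to your cluster, you can list the GPU nodes' labels and taints. The accelerator label below is the one GKE normally sets on a3-megagpu-8g node pools; adjust it if your node pools are labeled differently:

```shell
# List H100 Mega GPU nodes with the taint keys you may need to tolerate.
# GKE typically labels these nodes cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb.
kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb \
  -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```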
## Snapshot

```shell
aicr snapshot \
  --namespace aicr-validation \
  --node-selector nodeGroup=gpu-worker \
  --toleration dedicated=gpu-workload:NoSchedule \
  --toleration nvidia.com/gpu=present:NoSchedule \
  --output snapshot.yaml
```

## Gen Recipe

```shell
aicr recipe \
  --service gke \
  --accelerator h100 \
  --intent training \
  --os cos \
  --platform kubeflow \
  --output recipe.yaml
```

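Before validating, a quick, schema-agnostic sanity check confirms the generated recipe is well-formed YAML (this assumes `python3` with PyYAML is available locally; it makes no assumption about the recipe's contents):

```shell
# Fails loudly if recipe.yaml is not valid YAML (requires PyYAML).
python3 -c 'import yaml; yaml.safe_load(open("recipe.yaml")); print("recipe.yaml parses OK")'
```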
## Validate Recipe Constraints

```shell
aicr validate \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml \
  --no-cluster \
  --phase deployment \
  --output dry-run.json
```

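The report schema depends on the aicr version, so the simplest first look at the dry-run output is a schema-agnostic pretty-print:

```shell
# Pretty-print the dry-run report for a quick visual scan.
python3 -m json.tool dry-run.json | head -40
```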
## Generate Bundle

```shell
aicr bundle \
  --recipe recipe.yaml \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=gpu-workload:NoSchedule \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
  --system-node-selector nodeGroup=system-worker \
  --output bundle
```

> Note: GKE system nodes should not have custom taints, which break konnectivity-agent and other GKE-managed pods. Only `--system-node-selector` is needed; `--system-node-toleration` is not.

## Install Bundle into the Cluster

```shell
cd ./bundle && chmod +x deploy.sh && ./deploy.sh
```

> Note: If the skyhook-operator is already installed on the cluster, comment out or skip the skyhook-operator and skyhook-customizations sections in deploy.sh to avoid upgrade conflicts.

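Before moving on to cluster validation, it can help to confirm that the bundle's DaemonSets referenced later in this doc (the TCPXO installer and NRI device injector) came up on the GPU nodes. The exact names and namespaces depend on the generated bundle:

```shell
# Confirm the TCPXO installer and NRI device injector DaemonSets are running
# (names/namespaces may differ depending on the generated bundle).
kubectl get daemonsets -A | grep -E 'tcpxo|nri-device-injector'
```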
## Validate Cluster

```shell
aicr validate \
  --recipe recipe.yaml \
  --toleration dedicated=gpu-workload:NoSchedule \
  --toleration nvidia.com/gpu=present:NoSchedule \
  --phase conformance \
  --output report.json
```

## Run Job

Run a simple distributed PyTorch training job using the [Kubeflow TrainJob API](https://blog.kubeflow.org/trainer/intro/):

```shell
# Create the TrainJob
kubectl apply -f - <<EOF
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-mnist
  namespace: kubeflow
spec:
  trainer:
    numNodes: 1
    image: kubeflow/pytorch-dist-mnist:v1-9e12c68
    command:
    - "python3"
    - "/opt/mnist/src/mnist.py"
    - "--epochs=1"
    resourcesPerNode:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  podTemplateOverrides:
  - targetJobs:
    - name: node
    spec:
      tolerations:
      - operator: Exists
  runtimeRef:
    name: torch-distributed
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
EOF

# Monitor the TrainJob
kubectl get trainjobs -n kubeflow
kubectl get pods -n kubeflow -l trainer.kubeflow.org/job-name=pytorch-mnist
kubectl logs -f -n kubeflow -l trainer.kubeflow.org/job-name=pytorch-mnist
```

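To block until the job finishes rather than polling, `kubectl wait` can be used on the TrainJob's status conditions. The `Complete` condition type is an assumption here; check `kubectl get trainjob pytorch-mnist -n kubeflow -o yaml` for the exact condition names your Kubeflow Trainer version reports:

```shell
# Wait for the TrainJob to report completion (condition name assumed;
# verify against your Kubeflow Trainer version).
kubectl wait trainjob/pytorch-mnist -n kubeflow \
  --for=condition=Complete --timeout=30m
```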
## Performance Validation

> **Note:** `aicr validate --phase performance` is not yet automated for GKE.
> The GKE NCCL test uses raw Pods with a TCPXO daemon sidecar (required for GPUDirect),
> which differs from the EKS TrainJob-based approach. Run the test manually as shown below.
> Automated support is tracked as a follow-up.

### Option 1: Using testdata manifests (matches validator framework)

```shell
export NAMESPACE=nccl-perf
export GPU_COUNT_PER_NODE=8
export GPU_COUNT=16
export WORKER_COUNT=2
export TEST_TYPE=all_reduce_perf
export MIN_MESSAGE_SIZE=1M
export MAX_MESSAGE_SIZE=8G

kubectl create ns $NAMESPACE
envsubst < validators/performance/testdata/h100/gke/runtime.yaml | kubectl apply -f -

# Wait for pods to be 2/2 Running
kubectl get pods -n $NAMESPACE -o wide -w

# Trigger the AllReduce benchmark from host-1
kubectl exec nccl-test-host-1 -n $NAMESPACE -c nccl-test -- \
  /scripts/allreduce.sh nccl-host-1 nccl-host-2

# Expected: ~335 GB/s busBW at 8 GB (AllReduce), ~87 GB/s avg
# Clean up
kubectl delete ns $NAMESPACE
```

### Option 2: Using standalone demo manifest

```shell
kubectl create ns nccl-test
kubectl apply -f demos/workloads/training/gke-nccl-test-tcpxo.yaml -n nccl-test

# Wait for pods to be 2/2 Running
kubectl get pods -n nccl-test -o wide -w

# Trigger the AllReduce benchmark from host-1
kubectl exec nccl-test-host-1 -n nccl-test -c nccl-test -- bash -c '
  /scripts/init_ssh.sh nccl-host-1 nccl-host-2 &&
  pushd /scripts && /scripts/gen_hostfiles.sh nccl-host-1 nccl-host-2 && popd &&
  BENCHMARK=all_reduce_perf NHOSTS=2 NCCL_LIB_DIR="/usr/local/nvidia/lib64" \
  LD_LIBRARY_PATH="/usr/local/nvidia/lib64" /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh'

# Expected: ~335 GB/s busBW at 8 GB (AllReduce), ~87 GB/s avg
# Clean up
kubectl delete ns nccl-test
```

### Prerequisites

- GKE cluster with multi-NIC networking (8 GPU NICs per a3-megagpu-8g node)
- `Network` + `GKENetworkParamSet` CRs configured for GPU NICs (infrastructure, cluster-specific)
- `nccl-tcpxo-installer` DaemonSet deployed on GPU nodes (included in AICR bundle)
- `nri-device-injector` DaemonSet deployed on GPU nodes (included in AICR bundle)
- Without multi-NIC, NCCL falls back to TCP (~4 GB/s vs ~335 GB/s with TCPXO)

### TCPXO Runtime Requirements

Each workload pod that needs GPUDirect TCPXO must include a `tcpxo-daemon` sidecar container.

**Recommended profile** (validated on GKE 1.35 / a3-megagpu-8g):

- `hostNetwork: true` — required for PCI sysfs visibility
- `privileged: false` — not needed with NRI device injection
- NRI annotations on the pod: `devices.gke.io/container.tcpxo-daemon` (GPU devices) and `networking.gke.io/interfaces` (multi-NIC mapping with cluster-specific network names)
- `securityContext.capabilities: [NET_ADMIN, NET_BIND_SERVICE]` on the tcpxo-daemon container
- Requires the NRI device injector DaemonSet deployed on GPU nodes

**Fallback profile** (if the NRI injector is not available):

- `hostNetwork: true` + `privileged: true`
- No annotations needed

> **Known issue:** Without `hostNetwork: true`, the TCPXO daemon cannot enumerate all GPUs via PCI sysfs — the container runtime restricts sysfs visibility, causing the daemon to detect fewer GPUs in the PCI tree than CUDA reports, and exit. NRI annotations provide `/dev/nvidia*` device access but do not restore full PCI sysfs visibility. This is a GKE container runtime limitation.

### Understanding the results

Each pod runs two containers: a `tcpxo-daemon` sidecar (manages the GPUDirect TCPX data path) and the `nccl-test` container. The TCPXO sidecar is required for any workload that needs high-speed inter-node GPU communication on GKE.

| Metric | Without TCPXO | With TCPXO |
|--------|---------------|------------|
| AllReduce busBW (8 GB) | ~4 GB/s | ~335 GB/s |
| AllReduce avg busBW | ~4 GB/s | ~87 GB/s |

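As a sanity check on these numbers: nccl-tests reports both algorithmic bandwidth (algbw) and bus bandwidth (busBW), where for AllReduce busBW = algbw × 2(n−1)/n with n total ranks. With 16 ranks the factor is 1.875, so the ~335 GB/s busBW target corresponds to roughly 179 GB/s algbw. The algbw value below is illustrative, not a measured result:

```shell
# AllReduce: busBW = algbw * 2*(n-1)/n; with n=16 ranks the factor is 1.875.
awk -v n=16 -v algbw=178.7 'BEGIN { printf "%.1f GB/s busBW\n", algbw * 2 * (n - 1) / n }'
# → 335.1 GB/s busBW
```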
## Success

Success criteria: the training job completes and fabric bandwidth is within the expected range.

> This is a synthetic workload; performance checks beyond basic fabric validation are out of scope here.