Commit 5033975

feat(recipes): add GKE COS training overlays for H100

Add a complete GKE Container-Optimized OS (COS) training recipe chain with GPUDirect TCPXO networking, NRI device injection, and Kubeflow Trainer support.

Recipe chain: base → gke-cos → gke-cos-training → h100-gke-cos-training → h100-gke-cos-training-kubeflow

New overlays:
- gke-cos-training: GKE COS + training intent with GPU Operator values
- h100-gke-cos-training: H100-specific, with TCPXO, NRI, and skyhook tuning
- h100-gke-cos-training-kubeflow: adds Kubeflow Trainer for TrainJob

New components:
- gke-nccl-tcpxo: NCCL TCPXO installer + NRI device injector manifests
- gpu-operator/values-gke-cos-training.yaml: training GPU Operator values
- gpu-operator/manifests/gke-resource-quota.yaml: system-critical quota
- skyhook-customizations/manifests/tuning-gke.yaml: COS kernel tuning

Validator changes:
- Skip the GKE NCCL performance test with an informative warning (not yet automated; requires raw Pods with a TCPXO sidecar)
- Add GKE H100 testdata for manual execution

Evidence collection:
- Auto-detect the cluster description from node metadata instead of hardcoding the recipe name

Also includes:
- Demo workloads for the GKE NCCL TCPXO benchmark and a CUJ1 guide
- Fix vllm-agg tolerations/nodeSelectors to match the AICR convention
- Skyhook no-op: runtimeRequired/autoTaintNewNodes default to false

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

1 parent 8550939 commit 5033975

File tree

21 files changed: +1586 −28 lines

.github/workflows/vuln-scan.yaml

Lines changed: 1 addition & 0 deletions

```diff
@@ -69,6 +69,7 @@ jobs:
           output: ${{ env.SARIF_OUTPUT }}
           severity: ${{ env.SEVERITY_LEVELS }}
           skip-dirs: 'vendor,node_modules,distros/kubernetes,tests,tilt'
+          trivyignores: '.trivyignore.yaml'
           limit-severities-for-sarif: true

       - name: Check SARIF file exists
```

.trivyignore.yaml

Lines changed: 100 additions & 0 deletions

```yaml
misconfigurations:
  - id: AVD-KSV-0009
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "GPUDirect TCPXO/NRI requires hostNetwork on GKE for PCI/sysfs visibility and runtime networking behavior."

  - id: AVD-KSV-0010
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
    statement: "Host PID namespace is required for nsenter and host runtime/device integration in TCPXO/NRI daemonsets."

  - id: AVD-KSV-0017
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "Privileged mode is required by the upstream GPUDirect TCPXO/NRI setup and validation workloads."

  - id: AVD-KSV-0014
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "These installer/test containers write runtime files and cannot run with readOnlyRootFilesystem=true."

  - id: AVD-KSV-0118
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "Root/default security context is required by current GPUDirect TCPXO/NRI images and bootstrap flow."

  - id: AVD-KSV-0121
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
    statement: "/dev hostPath mount is required for NRI device injection."

  - id: AVD-KSV-0023
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "hostPath mounts are required for GPUDirect library/device access and installer behavior."

  - id: AVD-KSV-0037
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
    statement: "These are node-level system daemonsets and must run in kube-system."

  - id: AVD-KSV-0001
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "allowPrivilegeEscalation=false is not compatible with these upstream GPUDirect setup/test images."

  - id: AVD-KSV-0012
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "runAsNonRoot is not currently compatible with required GPUDirect setup/test operations."

  - id: AVD-KSV-0013
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
    statement: "Upstream images currently use unpinned/latest tags in these manifests."

  - id: AVD-KSV-0104
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "recipes/components/gke-nccl-tcpxo/manifests/nri-device-injector.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "seccomp profile hardening is deferred for these compatibility-sensitive upstream GPUDirect manifests."

  - id: AVD-KSV-0125
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
      - "validators/performance/testdata/h100/gke/runtime.yaml"
      - "validators/performance/testdata/h100/gke/trainjob.yaml"
      - "demos/workloads/training/gke-nccl-test-tcpxo.yaml"
    statement: "Image sources are inherited from upstream GKE GPUDirect TCPXO examples."
```
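Each waiver pairs rule IDs and paths with a justification statement. A minimal pure-shell sanity check of that shape (a sketch: the inline excerpt is a hypothetical one-entry copy written to a temp path, and the `trivy config --ignorefile` invocation in the trailing comment is the assumed way to verify the waivers against the real manifests):

```shell
# Write a one-entry excerpt mirroring the ignore-file shape (hypothetical temp path).
cat > /tmp/trivyignore-sample.yaml <<'EOF'
misconfigurations:
  - id: AVD-KSV-0009
    paths:
      - "recipes/components/gke-nccl-tcpxo/manifests/nccl-tcpxo-installer.yaml"
    statement: "GPUDirect TCPXO/NRI requires hostNetwork on GKE."
EOF

# Every waived rule should carry an explanatory statement.
ids=$(grep -c 'id: AVD-' /tmp/trivyignore-sample.yaml)
statements=$(grep -c 'statement:' /tmp/trivyignore-sample.yaml)
[ "$ids" -eq "$statements" ] && echo "every waiver has a statement"

# To check the waivers against the real repo (assumed invocation; needs Trivy installed):
# trivy config --ignorefile .trivyignore.yaml recipes/components/gke-nccl-tcpxo/manifests/
```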

demos/cuj1-gke.md

Lines changed: 212 additions & 0 deletions
# AICR - Critical User Journey (CUJ) 1 — GKE

## Assumptions

* The user is already authenticated to a GKE cluster with 2+ H100 (a3-megagpu-8g) nodes.
* The GKE cluster runs Container-Optimized OS (COS) with GPU drivers pre-installed.
* Values passed to the `--accelerated-node-selector` and `--accelerated-node-toleration` flags are examples only; update them to match your cluster.
* System nodes have no custom taints (GKE-managed pods don't tolerate them).
## Snapshot

```shell
aicr snapshot \
  --namespace aicr-validation \
  --node-selector nodeGroup=gpu-worker \
  --toleration dedicated=gpu-workload:NoSchedule \
  --toleration nvidia.com/gpu=present:NoSchedule \
  --output snapshot.yaml
```

## Gen Recipe

```shell
aicr recipe \
  --service gke \
  --accelerator h100 \
  --intent training \
  --os cos \
  --platform kubeflow \
  --output recipe.yaml
```

## Validate Recipe Constraints

```shell
aicr validate \
  --recipe recipe.yaml \
  --snapshot snapshot.yaml \
  --no-cluster \
  --phase deployment \
  --output dry-run.json
```

## Generate Bundle

```shell
aicr bundle \
  --recipe recipe.yaml \
  --accelerated-node-selector nodeGroup=gpu-worker \
  --accelerated-node-toleration dedicated=gpu-workload:NoSchedule \
  --accelerated-node-toleration nvidia.com/gpu=present:NoSchedule \
  --system-node-selector nodeGroup=system-worker \
  --output bundle
```

> Note: GKE system nodes should not carry custom taints (they break konnectivity-agent and other GKE-managed pods). Only `--system-node-selector` is needed; no `--system-node-toleration`.
## Install Bundle into the Cluster

```shell
cd ./bundle && chmod +x deploy.sh && ./deploy.sh
```

> Note: If skyhook-operator is already installed on the cluster, comment out or skip the skyhook-operator and skyhook-customizations sections in deploy.sh to avoid upgrade conflicts.

## Validate Cluster

```shell
aicr validate \
  --recipe recipe.yaml \
  --toleration dedicated=gpu-workload:NoSchedule \
  --toleration nvidia.com/gpu=present:NoSchedule \
  --phase conformance \
  --output report.json
```

## Run Job

Run a simple distributed PyTorch training job using the [Kubeflow TrainJob API](https://blog.kubeflow.org/trainer/intro/):

```shell
# Create the TrainJob
kubectl apply -f - <<EOF
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: pytorch-mnist
  namespace: kubeflow
spec:
  trainer:
    numNodes: 1
    image: kubeflow/pytorch-dist-mnist:v1-9e12c68
    command:
      - "python3"
      - "/opt/mnist/src/mnist.py"
      - "--epochs=1"
    resourcesPerNode:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
  podTemplateOverrides:
    - targetJobs:
        - name: node
      spec:
        tolerations:
          - operator: Exists
  runtimeRef:
    name: torch-distributed
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
EOF

# Monitor the TrainJob
kubectl get trainjobs -n kubeflow
kubectl get pods -n kubeflow -l trainer.kubeflow.org/job-name=pytorch-mnist
kubectl logs -f -n kubeflow -l trainer.kubeflow.org/job-name=pytorch-mnist
```
## Performance Validation

> **Note:** `aicr validate --phase performance` is not yet automated for GKE.
> The GKE NCCL test uses raw Pods with a TCPXO daemon sidecar (required for GPUDirect),
> which differs from the EKS TrainJob-based approach. Run the test manually as shown below.
> Automated support is tracked as a follow-up.

### Option 1: Using testdata manifests (matches validator framework)

```shell
export NAMESPACE=nccl-perf
export GPU_COUNT_PER_NODE=8
export GPU_COUNT=16
export WORKER_COUNT=2
export TEST_TYPE=all_reduce_perf
export MIN_MESSAGE_SIZE=1M
export MAX_MESSAGE_SIZE=8G

kubectl create ns $NAMESPACE
envsubst < validators/performance/testdata/h100/gke/runtime.yaml | kubectl apply -f -

# Wait for pods to be 2/2 Running
kubectl get pods -n $NAMESPACE -o wide -w

# Trigger the AllReduce benchmark from host-1
kubectl exec nccl-test-host-1 -n $NAMESPACE -c nccl-test -- \
  /scripts/allreduce.sh nccl-host-1 nccl-host-2

# Expected: ~335 GB/s busBW at 8 GB (AllReduce), ~87 GB/s avg
# Clean up
kubectl delete ns $NAMESPACE
```
### Option 2: Using standalone demo manifest

```shell
kubectl create ns nccl-test
kubectl apply -f demos/workloads/training/gke-nccl-test-tcpxo.yaml -n nccl-test

# Wait for pods to be 2/2 Running
kubectl get pods -n nccl-test -o wide -w

# Trigger the AllReduce benchmark from host-1
kubectl exec nccl-test-host-1 -n nccl-test -c nccl-test -- bash -c '
  /scripts/init_ssh.sh nccl-host-1 nccl-host-2 &&
  pushd /scripts && /scripts/gen_hostfiles.sh nccl-host-1 nccl-host-2 && popd &&
  BENCHMARK=all_reduce_perf NHOSTS=2 NCCL_LIB_DIR="/usr/local/nvidia/lib64" \
  LD_LIBRARY_PATH="/usr/local/nvidia/lib64" /scripts/demo-run-nccl-test-tcpxo-via-mpi.sh'

# Expected: ~335 GB/s busBW at 8 GB (AllReduce), ~87 GB/s avg
# Clean up
kubectl delete ns nccl-test
```
### Prerequisites

- GKE cluster with multi-NIC networking (8 GPU NICs per a3-megagpu-8g node)
- `Network` + `GKENetworkParamSet` CRs configured for GPU NICs (infrastructure, cluster-specific)
- `nccl-tcpxo-installer` DaemonSet deployed on GPU nodes (included in the AICR bundle)
- `nri-device-injector` DaemonSet deployed on GPU nodes (included in the AICR bundle)
- Without multi-NIC, NCCL falls back to TCP (~4 GB/s vs ~335 GB/s with TCPXO)
### TCPXO Runtime Requirements

Each workload pod that needs GPUDirect TCPXO must include a `tcpxo-daemon` sidecar container.

**Recommended profile** (validated on GKE 1.35 / a3-megagpu-8g):

- `hostNetwork: true` — required for PCI sysfs visibility
- `privileged: false` — not needed with NRI device injection
- NRI annotations on the pod: `devices.gke.io/container.tcpxo-daemon` (GPU devices) and `networking.gke.io/interfaces` (multi-NIC mapping with cluster-specific network names)
- `securityContext.capabilities: [NET_ADMIN, NET_BIND_SERVICE]` on the tcpxo-daemon container
- Requires the NRI device injector DaemonSet deployed on GPU nodes

**Fallback profile** (if the NRI injector is not available):

- `hostNetwork: true` + `privileged: true`
- No annotations needed

> **Known issue:** Without `hostNetwork: true`, the TCPXO daemon cannot enumerate all GPUs via PCI sysfs: the container runtime restricts sysfs visibility, so the daemon detects fewer GPUs in the PCI tree than CUDA reports and exits. NRI annotations provide `/dev/nvidia*` device access but do not restore full PCI sysfs visibility. This is a GKE container runtime limitation.
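Put together, the recommended profile looks roughly like the sketch below. This is illustrative only: the pod name, images, annotation payloads, and network names are placeholders and cluster-specific, not taken from this repo's manifests.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tcpxo-workload                  # hypothetical name
  annotations:
    # NRI device-injection annotation; the exact payload is cluster-specific.
    devices.gke.io/container.tcpxo-daemon: |
      - path: /dev/nvidia0              # truncated illustrative device list
    # Multi-NIC mapping; network names must match your GKENetworkParamSet CRs.
    networking.gke.io/interfaces: |
      [{"interfaceName":"eth1","network":"gpu-net-1"}]
spec:
  hostNetwork: true                     # required for PCI sysfs visibility
  containers:
    - name: tcpxo-daemon                # GPUDirect TCPXO sidecar
      image: <tcpxo-daemon-image>       # placeholder
      securityContext:
        privileged: false               # not needed with NRI device injection
        capabilities:
          add: ["NET_ADMIN", "NET_BIND_SERVICE"]
    - name: trainer                     # the actual workload container
      image: <workload-image>           # placeholder
      resources:
        limits:
          nvidia.com/gpu: 8
```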
### Understanding the results

Each pod runs two containers: a `tcpxo-daemon` sidecar (manages the GPUDirect TCPX data path) and the `nccl-test` container. The TCPXO sidecar is required for any workload that needs high-speed inter-node GPU communication on GKE.

| Metric | Without TCPXO | With TCPXO |
|--------|---------------|------------|
| AllReduce busBW (8 GB) | ~4 GB/s | ~335 GB/s |
| AllReduce avg busBW | ~4 GB/s | ~87 GB/s |
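The busBW figures come from nccl-tests' bus-bandwidth convention: for ring AllReduce, busBW = algBW × 2(N−1)/N, where N is the total GPU count. A minimal shell sketch of the arithmetic (the algBW value here is hypothetical, chosen only to show the scaling factor, not a measurement from this setup):

```shell
N=16           # total GPUs: 2 nodes x 8 H100s
ALGBW=160      # hypothetical algorithm bandwidth, GB/s
# Ring AllReduce bus bandwidth: busBW = algBW * 2*(N-1)/N
BUSBW=$(awk -v n="$N" -v a="$ALGBW" 'BEGIN { printf "%.1f", a * 2 * (n - 1) / n }')
echo "busBW = ${BUSBW} GB/s"    # with these inputs: 300.0 GB/s
```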
## Success

Success criteria: the job completes and fabric bandwidth is within the expected range.

> This is a synthetic workload; performance checks beyond basic fabric validation are out of scope here.

demos/workloads/inference/vllm-agg.yaml

Lines changed: 6 additions & 6 deletions

```diff
@@ -64,15 +64,15 @@ spec:
             value: zmq
         extraPodSpec:
           nodeSelector:
-            dedicated: cpu-workload
+            nodeGroup: cpu-worker
           tolerations:
             - key: dedicated
               operator: Equal
-              value: cpu-workload
+              value: worker-workload
               effect: NoSchedule
             - key: dedicated
               operator: Equal
-              value: cpu-workload
+              value: worker-workload
               effect: NoExecute
         mainContainer:
           image: nvcr.io/nvidia/ai-dynamo/dynamo-frontend:0.9.0
@@ -89,15 +89,15 @@ spec:
             gpu: "1"
         extraPodSpec:
           nodeSelector:
-            dedicated: gpu-workload
+            nodeGroup: gpu-worker
           tolerations:
             - key: dedicated
               operator: Equal
-              value: gpu-workload
+              value: worker-workload
               effect: NoSchedule
             - key: dedicated
               operator: Equal
-              value: gpu-workload
+              value: worker-workload
               effect: NoExecute
         mainContainer:
           image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0
```
