Problem
The h100-aks-ubuntu-training overlay references nccl-all-reduce-bw in its validation performance checks, added in #415, but there is no AKS TrainingRuntime template at validators/performance/testdata/h100/aks/runtime.yaml. The NCCL validator would fail at resource application time because it cannot find a service-specific runtime for AKS.
Currently only EKS and GKE have runtime templates:
- validators/performance/testdata/h100/eks/runtime.yaml (EFA)
- validators/performance/testdata/h100/gke/runtime.yaml (TCPXO/FasTrak)
What's needed
AKS H100 (ND H100 v5 / ND H200 v5) uses InfiniBand for GPU-to-GPU communication. The AKS runtime needs:
NCCL_IB_HCA, NCCL_IB_GID_INDEX, etc.)
Affected overlay
recipes/overlays/h100-aks-ubuntu-training.yaml#L58-L60
References
- validators/performance/testdata/h100/eks/runtime.yaml
- validators/performance/testdata/h100/gke/runtime.yaml
- validators/performance/nccl_all_reduce_bw_constraint.go