description: >-
  Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are CNCF
  AI Conformant. AICR generates validated, GPU-accelerated Kubernetes
  configurations that satisfy all CNCF AI Conformance requirements.
    notes: >-
      DRA API (resource.k8s.io/v1) is enabled with DeviceClass, ResourceClaim,
      ResourceClaimTemplate, and ResourceSlice resources available. The NVIDIA
      DRA driver runs as controller and kubelet-plugin pods, advertising
      individual H100 GPU devices via ResourceSlices with unique UUIDs, PCI
      bus IDs, CUDA compute capability, and memory capacity. GPU allocation to
      pods is mediated through ResourceClaims.
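    # Illustrative sketch (not part of the conformance data): how a pod might
    # request one GPU through DRA under the setup described above. The
    # DeviceClass name "gpu.nvidia.com", the image, and the v1 request shape
    # are assumptions about the installed driver version.
    #
    #   apiVersion: resource.k8s.io/v1
    #   kind: ResourceClaimTemplate
    #   metadata:
    #     name: single-gpu
    #   spec:
    #     spec:
    #       devices:
    #         requests:
    #           - name: gpu
    #             exactly:
    #               deviceClassName: gpu.nvidia.com
    #   ---
    #   apiVersion: v1
    #   kind: Pod
    #   metadata:
    #     name: cuda-test
    #   spec:
    #     containers:
    #       - name: cuda
    #         image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    #         resources:
    #           claims:
    #             - name: gpu    # container sees only the claimed device
    #     resourceClaims:
    #       - name: gpu
    #         resourceClaimTemplateName: single-gpu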
networking:
  - id: ai_inference
    description: >-
      Support the Kubernetes Gateway API with an implementation for advanced
      traffic management for inference services, which enables capabilities
      like weighted traffic splitting, header-based routing (for OpenAI
      protocol headers), and optional integration with service meshes.
    notes: >-
      kgateway controller is deployed with full Gateway API CRD support
      (GatewayClass, Gateway, HTTPRoute, GRPCRoute, ReferenceGrant). Inference
      extension CRDs (InferencePool, InferenceModelRewrite,
      InferenceObjective) are registered. An active inference gateway is
      verified with GatewayClass Accepted=True and Gateway Programmed=True
      conditions.
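    # Illustrative sketch (not part of the conformance data): an HTTPRoute
    # combining the capabilities above, with header-based routing plus a
    # 90/10 weighted split. Service names, ports, and the header are
    # hypothetical.
    #
    #   apiVersion: gateway.networking.k8s.io/v1
    #   kind: HTTPRoute
    #   metadata:
    #     name: llm-route
    #   spec:
    #     parentRefs:
    #       - name: inference-gateway
    #     rules:
    #       - matches:
    #           - headers:
    #               - name: x-model-name
    #                 value: llama-3
    #         backendRefs:
    #           - name: llm-stable
    #             port: 8000
    #             weight: 90
    #           - name: llm-canary
    #             port: 8000
    #             weight: 10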
schedulingOrchestration:
  - id: gang_scheduling
    description: >-
      The platform must allow for the installation and successful operation
      of at least one gang scheduling solution that ensures all-or-nothing
      scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.). To
      be conformant, the vendor must demonstrate that their platform can
      successfully run at least one such solution.
    notes: >-
      KAI Scheduler is deployed with operator, scheduler, admission
      controller, pod-grouper, and queue-controller components. PodGroup CRD
      (scheduling.run.ai) is registered. Gang scheduling is verified by
      deploying a PodGroup with minMember=2 and two GPU pods, demonstrating
      all-or-nothing atomic scheduling.
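    # Illustrative sketch (not part of the conformance data): the
    # all-or-nothing test shape described above. The apiVersion and the way
    # member pods are associated with the group are assumptions about the
    # installed KAI Scheduler release.
    #
    #   apiVersion: scheduling.run.ai/v2alpha2
    #   kind: PodGroup
    #   metadata:
    #     name: gang-demo
    #   spec:
    #     minMember: 2   # neither pod is scheduled until both can be placed
    #
    # Each member pod then sets schedulerName to the KAI scheduler and
    # carries the pod-group label, so the two GPU pods are placed as one unit.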
  - id: cluster_autoscaling
    description: >-
      If the platform provides a cluster autoscaler or an equivalent
      mechanism, it must be able to scale up/down node groups containing
      specific accelerator types based on pending pods requesting those
      accelerators.
    notes: >-
      Demonstrated on EKS with a GPU Auto Scaling Group (p5.48xlarge, 8x H100
      per node). The ASG is tagged for Cluster Autoscaler discovery
      (k8s.io/cluster-autoscaler/enabled,
      k8s.io/cluster-autoscaler/<cluster>=owned) and supports scaling from
      min=1 to max=2 GPU nodes based on pending pod demand.
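    # Illustrative sketch (not part of the conformance data): with the tags
    # above, the Cluster Autoscaler can find the ASG via auto-discovery,
    # along the lines of:
    #
    #   --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster>
    #
    # The <cluster> placeholder stands for the cluster name; the min/max
    # bounds (here 1 and 2) are read from the ASG itself.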
  - id: pod_autoscaling
    description: >-
      If the platform supports the HorizontalPodAutoscaler, it must function
      correctly for pods utilizing accelerators. This includes the ability to
      scale these Pods based on custom metrics relevant to AI/ML workloads.
    notes: >-
      Prometheus adapter exposes GPU custom metrics (gpu_utilization,
      gpu_memory_used, gpu_power_usage) via the Kubernetes custom metrics
      API. HPA is configured to target gpu_utilization at a 50% threshold.
      Under GPU stress testing (CUDA N-Body Simulation), HPA successfully
      scales replicas from 1 to 2 pods when utilization exceeds the target,
      and scales back down when GPU load is removed.
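    # Illustrative sketch (not part of the conformance data): an HPA
    # targeting the gpu_utilization custom metric described above. The
    # workload name is hypothetical; a Pods metric expresses the 50% target
    # as an average value per pod.
    #
    #   apiVersion: autoscaling/v2
    #   kind: HorizontalPodAutoscaler
    #   metadata:
    #     name: gpu-workload
    #   spec:
    #     scaleTargetRef:
    #       apiVersion: apps/v1
    #       kind: Deployment
    #       name: gpu-workload
    #     minReplicas: 1
    #     maxReplicas: 2
    #     metrics:
    #       - type: Pods
    #         pods:
    #           metric:
    #             name: gpu_utilization
    #           target:
    #             type: AverageValue
    #             averageValue: "50"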
observability:
  - id: accelerator_metrics
    description: >-
      For supported accelerator types, the platform must allow for the
      installation and successful operation of at least one accelerator
      metrics solution that exposes fine-grained performance metrics via a
      standardized, machine-readable metrics endpoint. This must include a
      core set of metrics for per-accelerator utilization and memory usage.
      Additionally, other relevant metrics such as temperature, power draw,
      and interconnect bandwidth should be exposed if the underlying hardware
      or virtualization layer makes them available. The list of metrics
      should align with emerging standards, such as OpenTelemetry metrics, to
      ensure interoperability. The platform may provide a managed solution,
      but this is not required for conformance.
    notes: >-
      DCGM Exporter runs on GPU nodes exposing metrics at :9400/metrics in
      Prometheus format. Per-GPU metrics include utilization, memory usage,
      temperature (26-31C), and power draw (66-115W). Metrics include
      pod/namespace/container labels for per-workload attribution. Prometheus
      actively scrapes DCGM metrics via ServiceMonitor.
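    # Illustrative sketch (not part of the conformance data): typical DCGM
    # Exporter series behind the figures above, as scraped by Prometheus.
    # Exact names depend on the exporter's configured field list; label
    # values here are placeholders.
    #
    #   DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="...",namespace="..."}      # utilization %
    #   DCGM_FI_DEV_FB_USED{gpu="0",pod="...",namespace="..."}       # framebuffer MiB
    #   DCGM_FI_DEV_GPU_TEMP{gpu="0",pod="...",namespace="..."}      # temperature C
    #   DCGM_FI_DEV_POWER_USAGE{gpu="0",pod="...",namespace="..."}   # power draw W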
  - id: ai_service_metrics
    description: >-
      Provide a monitoring system capable of discovering and collecting
      metrics from workloads that expose them in a standard format (e.g.
      Prometheus exposition format). This ensures easy integration for
      collecting key metrics from common AI frameworks and servers.
    notes: >-
      Prometheus and Grafana are deployed as the monitoring stack. Prometheus
      discovers and scrapes workloads exposing metrics in Prometheus
      exposition format via ServiceMonitors. The prometheus-adapter bridges
      these metrics into the Kubernetes custom metrics API for consumption by
      HPA and other controllers.
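    # Illustrative sketch (not part of the conformance data): a
    # ServiceMonitor that lets Prometheus discover a model server exposing
    # metrics in exposition format. The label selector and port name are
    # hypothetical.
    #
    #   apiVersion: monitoring.coreos.com/v1
    #   kind: ServiceMonitor
    #   metadata:
    #     name: model-server
    #   spec:
    #     selector:
    #       matchLabels:
    #         app: model-server
    #     endpoints:
    #       - port: metrics
    #         interval: 30s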
security:
  - id: secure_accelerator_access
    description: >-
      Ensure that access to accelerators from within containers is properly
      isolated and mediated by the Kubernetes resource management framework
      (device plugin or DRA) and container runtime, preventing unauthorized
      access or interference between workloads.
    notes: >-
      GPU Operator manages all GPU lifecycle components (driver,
      device-plugin, DCGM, toolkit, validator, MIG manager). 8x H100 GPUs are
      individually advertised via ResourceSlices with DRA. Pod volumes
      contain only kube-api-access projected tokens, with no hostPath mounts
      to /dev/nvidia devices. Device isolation is verified: a test pod
      requesting 1 GPU sees only the single allocated device.
operator:
  - id: robust_controller
    description: >-
      The platform must prove that at least one complex AI operator with a
      CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This
      includes verifying that the operator's pods run correctly, its webhooks
      are operational, and its custom resources can be reconciled.
    notes: >-
      NVIDIA Dynamo operator is deployed with 6 CRDs (DynamoGraphDeployment,
      DynamoComponentDeployment, DynamoGraphDeploymentRequest,
      DynamoGraphDeploymentScalingAdapter, DynamoModel, DynamoWorkerMetadata).
      Validating webhooks are active and verified via rejection test (invalid
      CR correctly denied). A DynamoGraphDeployment custom resource is
      reconciled with frontend and GPU-enabled worker pods running
      successfully.