Commit ce95913

Merge branch 'main' into feat/helm-values-check
2 parents 3363b0f + 227a992 commit ce95913

File tree: 2 files changed, +199 -0 lines changed

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
metadata:
  kubernetesVersion: v1.34
  platformName: "Kubernetes Platforms Powered by NVIDIA AI Cluster Runtime (AICR)"
  platformVersion: "0.7.7"
  vendorName: "NVIDIA"
  websiteUrl: "https://github.com/NVIDIA/aicr"
  repoUrl: "https://github.com/NVIDIA/aicr"
  documentationUrl: "https://github.com/NVIDIA/aicr/blob/main/README.md"
  productLogoUrl: "https://www.nvidia.com/favicon.ico"
  description: >-
    Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are CNCF AI
    Conformant. AICR generates validated, GPU-accelerated Kubernetes
    configurations that satisfy all CNCF AI Conformance requirements.
  contactEmailAddress: "aicr-maintainers@nvidia.com"
spec:
  accelerators:
    - id: dra_support
      description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/dra-support.md"
      notes: >-
        DRA API (resource.k8s.io/v1) is enabled with DeviceClass, ResourceClaim,
        ResourceClaimTemplate, and ResourceSlice resources available. The NVIDIA
        DRA driver runs as controller and kubelet-plugin pods, advertising
        individual H100 GPU devices via ResourceSlices with unique UUIDs, PCI
        bus IDs, CUDA compute capability, and memory capacity. GPU allocation to
        pods is mediated through ResourceClaims.
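
A minimal sketch of the allocation flow described above. The `gpu.nvidia.com` DeviceClass name follows the NVIDIA DRA driver's convention, but the claim, pod, and image names here are illustrative:

```yaml
# Claim one GPU via DRA (resource.k8s.io/v1, available in Kubernetes v1.34).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # DeviceClass published by the NVIDIA DRA driver
---
# Pod whose container sees exactly the one allocated device.
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi", "-L"]
      resources:
        claims:
          - name: gpu
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
```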
  networking:
    - id: ai_inference
      description: >-
        Support the Kubernetes Gateway API with an implementation for advanced
        traffic management for inference services, which enables capabilities
        like weighted traffic splitting, header-based routing (for OpenAI
        protocol headers), and optional integration with service meshes.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/inference-gateway.md"
      notes: >-
        kgateway controller is deployed with full Gateway API CRD support
        (GatewayClass, Gateway, HTTPRoute, GRPCRoute, ReferenceGrant). Inference
        extension CRDs (InferencePool, InferenceModelRewrite,
        InferenceObjective) are registered. An active inference gateway is
        verified with GatewayClass Accepted=True and Gateway Programmed=True
        conditions.
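
A sketch of the weighted, header-based routing the requirement calls for, assuming a `kgateway` GatewayClass; the route name, header name, and backend Services are hypothetical:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: kgateway
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-route
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - headers:
            - name: x-model-name   # hypothetical OpenAI-protocol routing header
              value: my-model
      backendRefs:
        - name: model-stable       # hypothetical Services; 90/10 traffic split
          port: 8000
          weight: 90
        - name: model-canary
          port: 8000
          weight: 10
```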
  schedulingOrchestration:
    - id: gang_scheduling
      description: >-
        The platform must allow for the installation and successful operation of
        at least one gang scheduling solution that ensures all-or-nothing
        scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.). To
        be conformant, the vendor must demonstrate that their platform can
        successfully run at least one such solution.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/gang-scheduling.md"
      notes: >-
        KAI Scheduler is deployed with operator, scheduler, admission
        controller, pod-grouper, and queue-controller components. The PodGroup
        CRD (scheduling.run.ai) is registered. Gang scheduling is verified by
        deploying a PodGroup with minMember=2 and two GPU pods, demonstrating
        all-or-nothing atomic scheduling.
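
The verification described above can be sketched roughly as follows. The PodGroup apiVersion and the label linking pods to the group are assumptions about the KAI Scheduler's conventions, not taken from the evidence:

```yaml
apiVersion: scheduling.run.ai/v2alpha2   # group from the notes; version is an assumption
kind: PodGroup
metadata:
  name: distributed-job
spec:
  minMember: 2              # schedule both pods atomically, or neither
---
apiVersion: v1
kind: Pod
metadata:
  name: distributed-job-worker-0
  labels:
    pod-group-name: distributed-job   # label key is an assumption
spec:
  schedulerName: kai-scheduler
  containers:
    - name: worker
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
      command: ["sleep", "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
# A second pod, worker-1, is identical apart from its name.
```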
    - id: cluster_autoscaling
      description: >-
        If the platform provides a cluster autoscaler or an equivalent
        mechanism, it must be able to scale up/down node groups containing
        specific accelerator types based on pending pods requesting those
        accelerators.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/cluster-autoscaling.md"
      notes: >-
        Demonstrated on EKS with a GPU Auto Scaling Group (p5.48xlarge, 8x H100
        per node). The ASG is tagged for Cluster Autoscaler discovery
        (k8s.io/cluster-autoscaler/enabled,
        k8s.io/cluster-autoscaler/<cluster>=owned) and supports scaling from
        min=1 to max=2 GPU nodes based on pending pod demand.
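
The discovery setup from the notes amounts to tagging the ASG and pointing Cluster Autoscaler at those tags; the `<cluster>` placeholder is kept as in the notes, and the flag wiring is a sketch of the standard AWS auto-discovery configuration:

```yaml
# Tags on the GPU Auto Scaling Group (min=1, max=2):
#   k8s.io/cluster-autoscaler/enabled = true
#   k8s.io/cluster-autoscaler/<cluster> = owned
#
# Matching flags in the Cluster Autoscaler container spec:
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster>
```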
    - id: pod_autoscaling
      description: >-
        If the platform supports the HorizontalPodAutoscaler, it must function
        correctly for pods utilizing accelerators. This includes the ability to
        scale these Pods based on custom metrics relevant to AI/ML workloads.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/pod-autoscaling.md"
      notes: >-
        Prometheus adapter exposes GPU custom metrics (gpu_utilization,
        gpu_memory_used, gpu_power_usage) via the Kubernetes custom metrics API.
        HPA is configured to target gpu_utilization at a 50% threshold. Under
        GPU stress testing (CUDA N-body simulation), HPA successfully scales
        replicas from 1 to 2 pods when utilization exceeds the target, and
        scales back down when GPU load is removed.
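
The HPA configuration described above could look roughly like this; the target Deployment name is hypothetical, and `gpu_utilization` is assumed to be served by prometheus-adapter as a per-pod metric:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload        # hypothetical GPU workload
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "50"  # scale out above 50% average GPU utilization
```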
  observability:
    - id: accelerator_metrics
      description: >-
        For supported accelerator types, the platform must allow for the
        installation and successful operation of at least one accelerator
        metrics solution that exposes fine-grained performance metrics via a
        standardized, machine-readable metrics endpoint. This must include a
        core set of metrics for per-accelerator utilization and memory usage.
        Additionally, other relevant metrics such as temperature, power draw,
        and interconnect bandwidth should be exposed if the underlying hardware
        or virtualization layer makes them available. The list of metrics should
        align with emerging standards, such as OpenTelemetry metrics, to ensure
        interoperability. The platform may provide a managed solution, but this
        is not required for conformance.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/accelerator-metrics.md"
      notes: >-
        DCGM Exporter runs on GPU nodes, exposing metrics at :9400/metrics in
        Prometheus format. Per-GPU metrics include utilization, memory usage,
        temperature (26-31C), and power draw (66-115W). Metrics include
        pod/namespace/container labels for per-workload attribution. Prometheus
        actively scrapes DCGM metrics via a ServiceMonitor.
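
The scrape wiring could be sketched as a ServiceMonitor like the one below; the selector label and port name are assumptions about how the DCGM Exporter Service is labeled:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter   # assumed Service label
  endpoints:
    - port: metrics               # assumed port name; the exporter listens on :9400
      path: /metrics
      interval: 30s
```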
    - id: ai_service_metrics
      description: >-
        Provide a monitoring system capable of discovering and collecting
        metrics from workloads that expose them in a standard format (e.g.
        Prometheus exposition format). This ensures easy integration for
        collecting key metrics from common AI frameworks and servers.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/accelerator-metrics.md"
      notes: >-
        Prometheus and Grafana are deployed as the monitoring stack. Prometheus
        discovers and scrapes workloads exposing metrics in Prometheus
        exposition format via ServiceMonitors. The prometheus-adapter bridges
        these metrics into the Kubernetes custom metrics API for consumption by
        HPA and other controllers.
  security:
    - id: secure_accelerator_access
      description: >-
        Ensure that access to accelerators from within containers is properly
        isolated and mediated by the Kubernetes resource management framework
        (device plugin or DRA) and container runtime, preventing unauthorized
        access or interference between workloads.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/secure-accelerator-access.md"
      notes: >-
        GPU Operator manages all GPU lifecycle components (driver, device-plugin,
        DCGM, toolkit, validator, MIG manager). 8x H100 GPUs are individually
        advertised via ResourceSlices with DRA. Pod volumes contain only
        kube-api-access projected tokens, with no hostPath mounts to /dev/nvidia
        devices. Device isolation is verified: a test pod requesting 1 GPU sees
        only the single allocated device.
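
The isolation check described in the notes can be sketched as a one-GPU test pod; if access is properly mediated, `nvidia-smi -L` lists exactly one device. The pod and image names are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-isolation-test
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image
      command: ["nvidia-smi", "-L"]   # should print a single GPU line
      resources:
        limits:
          nvidia.com/gpu: 1           # allocation mediated by the device plugin / DRA
```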
  operator:
    - id: robust_controller
      description: >-
        The platform must prove that at least one complex AI operator with a
        CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This
        includes verifying that the operator's pods run correctly, its webhooks
        are operational, and its custom resources can be reconciled.
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/robust-operator.md"
      notes: >-
        NVIDIA Dynamo operator is deployed with 6 CRDs (DynamoGraphDeployment,
        DynamoComponentDeployment, DynamoGraphDeploymentRequest,
        DynamoGraphDeploymentScalingAdapter, DynamoModel, DynamoWorkerMetadata).
        Validating webhooks are active and verified via a rejection test (an
        invalid CR is correctly denied). A DynamoGraphDeployment custom resource
        is reconciled with frontend and GPU-enabled worker pods running
        successfully.
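
Reconciliation follows the usual CRD pattern; the skeleton below only illustrates the shape of such a resource, since the DynamoGraphDeployment group/version and spec fields are not given in the notes and are placeholders here:

```yaml
# Placeholder group/version and empty spec; real fields come from the
# operator's schema, and an invalid spec is rejected by the validating webhook.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: example-graph
spec: {}
```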
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Kubernetes Platforms Powered by NVIDIA AI Cluster Runtime (AICR)

Kubernetes platforms powered by [NVIDIA AI Cluster Runtime (AICR)](https://github.com/NVIDIA/aicr) are CNCF AI Conformant. AICR generates validated, GPU-accelerated Kubernetes configurations that satisfy all CNCF AI Conformance requirements for accelerator management, scheduling, observability, security, and inference networking.

## Conformance Submission

- [PRODUCT.yaml](PRODUCT.yaml)

## Evidence

Evidence was collected on a Kubernetes v1.34 cluster with NVIDIA H100 80GB HBM3 GPUs using the AICR recipe `h100-eks-ubuntu-inference-dynamo`.

| # | Requirement | Feature | Result | Evidence |
|---|-------------|---------|--------|----------|
| 1 | `dra_support` | Dynamic Resource Allocation | PASS | [dra-support.md](../evidence/dra-support.md) |
| 2 | `gang_scheduling` | Gang Scheduling (KAI Scheduler) | PASS | [gang-scheduling.md](../evidence/gang-scheduling.md) |
| 3 | `secure_accelerator_access` | Secure Accelerator Access | PASS | [secure-accelerator-access.md](../evidence/secure-accelerator-access.md) |
| 4 | `accelerator_metrics` / `ai_service_metrics` | Accelerator & AI Service Metrics | PASS | [accelerator-metrics.md](../evidence/accelerator-metrics.md) |
| 5 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](../evidence/inference-gateway.md) |
| 6 | `robust_controller` | Robust AI Operator (Dynamo) | PASS | [robust-operator.md](../evidence/robust-operator.md) |
| 7 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU Metrics) | PASS | [pod-autoscaling.md](../evidence/pod-autoscaling.md) |
| 8 | `cluster_autoscaling` | Cluster Autoscaling (EKS ASG) | PASS | [cluster-autoscaling.md](../evidence/cluster-autoscaling.md) |

All 9 conformance requirement IDs across 8 evidence files are **Implemented** (`accelerator_metrics` and `ai_service_metrics` share a single evidence file).
