
Commit d3a0ad3

feat(docs): add CNCF AI conformance submission for v1.34 (#206)
Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Parent commit: 5f2a827


2 files changed: 106 additions, 0 deletions

Lines changed: 82 additions & 0 deletions
metadata:
  kubernetesVersion: v1.34
  platformName: "Kubernetes Platforms Powered by NVIDIA AI Cluster Runtime (AICR)"
  platformVersion: "0.7.7"
  vendorName: "NVIDIA"
  websiteUrl: "https://github.com/NVIDIA/aicr"
  repoUrl: "https://github.com/NVIDIA/aicr"
  documentationUrl: "https://github.com/NVIDIA/aicr/blob/main/README.md"
  productLogoUrl: "https://www.nvidia.com/favicon.ico"
  description: "Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are CNCF AI Conformant. AICR generates validated, GPU-accelerated Kubernetes configurations that satisfy all CNCF AI Conformance requirements."
  contactEmailAddress: "aicr-maintainers@nvidia.com"

spec:
  accelerators:
    - id: dra_support
      description: "Support Dynamic Resource Allocation (DRA) APIs to enable more flexible and fine-grained resource requests beyond simple counts."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/dra-support.md"
      notes: "DRA API (resource.k8s.io/v1) is enabled with DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice resources available. The NVIDIA DRA driver runs as controller and kubelet-plugin pods, advertising individual H100 GPU devices via ResourceSlices with unique UUIDs, PCI bus IDs, CUDA compute capability, and memory capacity. GPU allocation to pods is mediated through ResourceClaims."
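The DRA workflow described in these notes can be sketched roughly as follows. This is an illustrative fragment, not taken from the AICR repo; the DeviceClass name `gpu.nvidia.com` and all object names are assumptions:

```yaml
# Hypothetical sketch: request one GPU through DRA rather than a count-based limit.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # assumed DeviceClass published by the DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: cuda-pod
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        claims:
          - name: gpu   # container is granted only the device allocated to this claim
```

The scheduler allocates a device satisfying the claim before binding the pod, which is what lets DRA express requests richer than a simple `nvidia.com/gpu: 1` count.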
  networking:
    - id: ai_inference
      description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/inference-gateway.md"
      notes: "kgateway controller is deployed with full Gateway API CRD support (GatewayClass, Gateway, HTTPRoute, GRPCRoute, ReferenceGrant). Inference extension CRDs (InferencePool, InferenceModelRewrite, InferenceObjective) are registered. An active inference gateway is verified with GatewayClass Accepted=True and Gateway Programmed=True conditions."
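A minimal sketch of the traffic-management capabilities named in the description, assuming a Gateway called `inference-gateway` and two backend Services (`llm-v1`, `llm-v2`); the header name is a hypothetical stand-in for an OpenAI-protocol header:

```yaml
# Hypothetical sketch: header-based routing plus weighted traffic splitting.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway          # assumed Gateway name
  rules:
    - matches:
        - headers:
            - name: x-model-variant    # hypothetical header
              value: canary
      backendRefs:
        - name: llm-v2
          port: 8000
    - backendRefs:                     # default rule: 90/10 weighted split
        - name: llm-v1
          port: 8000
          weight: 90
        - name: llm-v2
          port: 8000
          weight: 10
```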
  schedulingOrchestration:
    - id: gang_scheduling
      description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.). To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/gang-scheduling.md"
      notes: "KAI Scheduler is deployed with operator, scheduler, admission controller, pod-grouper, and queue-controller components. PodGroup CRD (scheduling.run.ai) is registered. Gang scheduling is verified by deploying a PodGroup with minMember=2 and two GPU pods, demonstrating all-or-nothing atomic scheduling."
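The minMember=2 test in these notes corresponds to a PodGroup roughly like the following sketch; the apiVersion shown is an assumption and should be checked against the CRD that the KAI Scheduler actually installs:

```yaml
# Hypothetical sketch: all-or-nothing scheduling for two GPU pods.
apiVersion: scheduling.run.ai/v2alpha2   # assumed version; verify against the installed PodGroup CRD
kind: PodGroup
metadata:
  name: gpu-gang
spec:
  minMember: 2   # schedule both member pods atomically, or neither
```

The two GPU pods then reference this group by whatever label or annotation convention the scheduler expects, so that neither starts until resources for both are available.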
    - id: cluster_autoscaling
      description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/cluster-autoscaling.md"
      notes: "Demonstrated on EKS with a GPU Auto Scaling Group (p5.48xlarge, 8x H100 per node). The ASG is tagged for Cluster Autoscaler discovery (k8s.io/cluster-autoscaler/enabled, k8s.io/cluster-autoscaler/<cluster>=owned) and supports scaling from min=1 to max=2 GPU nodes based on pending pod demand."
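Tag-based node-group discovery of this kind is typically wired into the Cluster Autoscaler through its auto-discovery flag; a hedged sketch (the exact deployment layout is an assumption, and the <cluster> placeholder is left elided as in the notes):

```yaml
# Hypothetical fragment of a Cluster Autoscaler container spec using ASG tag discovery.
args:
  - --cloud-provider=aws
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster>
```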
    - id: pod_autoscaling
      description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/pod-autoscaling.md"
      notes: "Prometheus adapter exposes GPU custom metrics (gpu_utilization, gpu_memory_used, gpu_power_usage) via the Kubernetes custom metrics API. HPA is configured to target gpu_utilization at 50% threshold. Under GPU stress testing (CUDA N-Body Simulation), HPA successfully scales replicas from 1 to 2 pods when utilization exceeds the target, and scales back down when GPU load is removed."
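The HPA configuration described in these notes might look roughly like this sketch (object names are assumed; the metric name and thresholds match the notes):

```yaml
# Hypothetical sketch: scale on the gpu_utilization custom metric at a 50% target.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-workload-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gpu-workload         # assumed Deployment name
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: gpu_utilization
        target:
          type: AverageValue
          averageValue: "50"   # scale out when average per-pod utilization exceeds 50
```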
  observability:
    - id: accelerator_metrics
      description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/accelerator-metrics.md"
      notes: "DCGM Exporter runs on GPU nodes exposing metrics at :9400/metrics in Prometheus format. Per-GPU metrics include utilization, memory usage, temperature (26-31 °C), and power draw (66-115 W). Metrics include pod/namespace/container labels for per-workload attribution. Prometheus actively scrapes DCGM metrics via ServiceMonitor."
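Scraping via ServiceMonitor, as described in the notes, can be sketched as follows (the selector labels and port name are assumptions):

```yaml
# Hypothetical sketch: have Prometheus scrape DCGM Exporter's :9400/metrics endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter   # assumed Service label
  endpoints:
    - port: metrics        # assumed named port backing :9400
      path: /metrics
```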
    - id: ai_service_metrics
      description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/accelerator-metrics.md"
      notes: "Prometheus and Grafana are deployed as the monitoring stack. Prometheus discovers and scrapes workloads exposing metrics in Prometheus exposition format via ServiceMonitors. The prometheus-adapter bridges these metrics into the Kubernetes custom metrics API for consumption by HPA and other controllers."
  security:
    - id: secure_accelerator_access
      description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/secure-accelerator-access.md"
      notes: "GPU Operator manages all GPU lifecycle components (driver, device-plugin, DCGM, toolkit, validator, MIG manager). 8x H100 GPUs are individually advertised via ResourceSlices with DRA. Pod volumes contain only kube-api-access projected tokens, with no hostPath mounts to /dev/nvidia devices. Device isolation is verified: a test pod requesting 1 GPU sees only the single allocated device."
  operator:
    - id: robust_controller
      description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
      level: MUST
      status: "Implemented"
      evidence:
        - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/robust-operator.md"
      notes: "NVIDIA Dynamo operator is deployed with 6 CRDs (DynamoGraphDeployment, DynamoComponentDeployment, DynamoGraphDeploymentRequest, DynamoGraphDeploymentScalingAdapter, DynamoModel, DynamoWorkerMetadata). Validating webhooks are active and verified via rejection test (invalid CR correctly denied). A DynamoGraphDeployment custom resource is reconciled with frontend and GPU-enabled worker pods running successfully."
Lines changed: 24 additions & 0 deletions
# Kubernetes Platforms Powered by NVIDIA AI Cluster Runtime (AICR)

Kubernetes platforms powered by [NVIDIA AI Cluster Runtime (AICR)](https://github.com/NVIDIA/aicr) are CNCF AI Conformant. AICR generates validated, GPU-accelerated Kubernetes configurations that satisfy all CNCF AI Conformance requirements for accelerator management, scheduling, observability, security, and inference networking.

## Conformance Submission

- [PRODUCT.yaml](PRODUCT.yaml)

## Evidence

Evidence was collected on a Kubernetes v1.34 cluster with NVIDIA H100 80GB HBM3 GPUs using the AICR recipe `h100-eks-ubuntu-inference-dynamo`.

| # | Requirement | Feature | Result | Evidence |
|---|-------------|---------|--------|----------|
| 1 | `dra_support` | Dynamic Resource Allocation | PASS | [dra-support.md](../evidence/dra-support.md) |
| 2 | `gang_scheduling` | Gang Scheduling (KAI Scheduler) | PASS | [gang-scheduling.md](../evidence/gang-scheduling.md) |
| 3 | `secure_accelerator_access` | Secure Accelerator Access | PASS | [secure-accelerator-access.md](../evidence/secure-accelerator-access.md) |
| 4 | `accelerator_metrics` / `ai_service_metrics` | Accelerator & AI Service Metrics | PASS | [accelerator-metrics.md](../evidence/accelerator-metrics.md) |
| 5 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](../evidence/inference-gateway.md) |
| 6 | `robust_controller` | Robust AI Operator (Dynamo) | PASS | [robust-operator.md](../evidence/robust-operator.md) |
| 7 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU Metrics) | PASS | [pod-autoscaling.md](../evidence/pod-autoscaling.md) |
| 8 | `cluster_autoscaling` | Cluster Autoscaling (EKS ASG) | PASS | [cluster-autoscaling.md](../evidence/cluster-autoscaling.md) |

All 9 conformance requirement IDs across 8 evidence files are **Implemented** (`accelerator_metrics` and `ai_service_metrics` share a single evidence file).
