Commit 227a992

fix(conformance): wrap PRODUCT.yaml lines for yamllint (#207)
1 parent d3a0ad3

1 file changed: +111 -18 lines

docs/conformance/cncf/submission/PRODUCT.yaml

@@ -7,7 +7,10 @@ metadata:
   repoUrl: "https://github.com/NVIDIA/aicr"
   documentationUrl: "https://github.com/NVIDIA/aicr/blob/main/README.md"
   productLogoUrl: "https://www.nvidia.com/favicon.ico"
-  description: "Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are CNCF AI Conformant. AICR generates validated, GPU-accelerated Kubernetes configurations that satisfy all CNCF AI Conformance requirements."
+  description: >-
+    Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are CNCF AI
+    Conformant. AICR generates validated, GPU-accelerated Kubernetes
+    configurations that satisfy all CNCF AI Conformance requirements.
   contactEmailAddress: "[email protected]"
 
 spec:
@@ -18,65 +21,155 @@ spec:
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/dra-support.md"
-      notes: "DRA API (resource.k8s.io/v1) is enabled with DeviceClass, ResourceClaim, ResourceClaimTemplate, and ResourceSlice resources available. The NVIDIA DRA driver runs as controller and kubelet-plugin pods, advertising individual H100 GPU devices via ResourceSlices with unique UUIDs, PCI bus IDs, CUDA compute capability, and memory capacity. GPU allocation to pods is mediated through ResourceClaims."
+      notes: >-
+        DRA API (resource.k8s.io/v1) is enabled with DeviceClass, ResourceClaim,
+        ResourceClaimTemplate, and ResourceSlice resources available. The NVIDIA
+        DRA driver runs as controller and kubelet-plugin pods, advertising
+        individual H100 GPU devices via ResourceSlices with unique UUIDs, PCI
+        bus IDs, CUDA compute capability, and memory capacity. GPU allocation to
+        pods is mediated through ResourceClaims.
   networking:
     - id: ai_inference
-      description: "Support the Kubernetes Gateway API with an implementation for advanced traffic management for inference services, which enables capabilities like weighted traffic splitting, header-based routing (for OpenAI protocol headers), and optional integration with service meshes."
+      description: >-
+        Support the Kubernetes Gateway API with an implementation for advanced
+        traffic management for inference services, which enables capabilities
+        like weighted traffic splitting, header-based routing (for OpenAI
+        protocol headers), and optional integration with service meshes.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/inference-gateway.md"
-      notes: "kgateway controller is deployed with full Gateway API CRD support (GatewayClass, Gateway, HTTPRoute, GRPCRoute, ReferenceGrant). Inference extension CRDs (InferencePool, InferenceModelRewrite, InferenceObjective) are registered. An active inference gateway is verified with GatewayClass Accepted=True and Gateway Programmed=True conditions."
+      notes: >-
+        kgateway controller is deployed with full Gateway API CRD support
+        (GatewayClass, Gateway, HTTPRoute, GRPCRoute, ReferenceGrant). Inference
+        extension CRDs (InferencePool, InferenceModelRewrite,
+        InferenceObjective) are registered. An active inference gateway is
+        verified with GatewayClass Accepted=True and Gateway Programmed=True
+        conditions.
   schedulingOrchestration:
     - id: gang_scheduling
-      description: "The platform must allow for the installation and successful operation of at least one gang scheduling solution that ensures all-or-nothing scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To be conformant, the vendor must demonstrate that their platform can successfully run at least one such solution."
+      description: >-
+        The platform must allow for the installation and successful operation of
+        at least one gang scheduling solution that ensures all-or-nothing
+        scheduling for distributed AI workloads (e.g. Kueue, Volcano, etc.) To
+        be conformant, the vendor must demonstrate that their platform can
+        successfully run at least one such solution.
       level: MUST
       status: "Implemented"
      evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/gang-scheduling.md"
-      notes: "KAI Scheduler is deployed with operator, scheduler, admission controller, pod-grouper, and queue-controller components. PodGroup CRD (scheduling.run.ai) is registered. Gang scheduling is verified by deploying a PodGroup with minMember=2 and two GPU pods, demonstrating all-or-nothing atomic scheduling."
+      notes: >-
+        KAI Scheduler is deployed with operator, scheduler, admission
+        controller, pod-grouper, and queue-controller components. PodGroup CRD
+        (scheduling.run.ai) is registered. Gang scheduling is verified by
+        deploying a PodGroup with minMember=2 and two GPU pods, demonstrating
+        all-or-nothing atomic scheduling.
     - id: cluster_autoscaling
-      description: "If the platform provides a cluster autoscaler or an equivalent mechanism, it must be able to scale up/down node groups containing specific accelerator types based on pending pods requesting those accelerators."
+      description: >-
+        If the platform provides a cluster autoscaler or an equivalent
+        mechanism, it must be able to scale up/down node groups containing
+        specific accelerator types based on pending pods requesting those
+        accelerators.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/cluster-autoscaling.md"
-      notes: "Demonstrated on EKS with a GPU Auto Scaling Group (p5.48xlarge, 8x H100 per node). The ASG is tagged for Cluster Autoscaler discovery (k8s.io/cluster-autoscaler/enabled, k8s.io/cluster-autoscaler/<cluster>=owned) and supports scaling from min=1 to max=2 GPU nodes based on pending pod demand."
+      notes: >-
+        Demonstrated on EKS with a GPU Auto Scaling Group (p5.48xlarge, 8x H100
+        per node). The ASG is tagged for Cluster Autoscaler discovery
+        (k8s.io/cluster-autoscaler/enabled,
+        k8s.io/cluster-autoscaler/<cluster>=owned) and supports scaling from
+        min=1 to max=2 GPU nodes based on pending pod demand.
     - id: pod_autoscaling
-      description: "If the platform supports the HorizontalPodAutoscaler, it must function correctly for pods utilizing accelerators. This includes the ability to scale these Pods based on custom metrics relevant to AI/ML workloads."
+      description: >-
+        If the platform supports the HorizontalPodAutoscaler, it must function
+        correctly for pods utilizing accelerators. This includes the ability to
+        scale these Pods based on custom metrics relevant to AI/ML workloads.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/pod-autoscaling.md"
-      notes: "Prometheus adapter exposes GPU custom metrics (gpu_utilization, gpu_memory_used, gpu_power_usage) via the Kubernetes custom metrics API. HPA is configured to target gpu_utilization at 50% threshold. Under GPU stress testing (CUDA N-Body Simulation), HPA successfully scales replicas from 1 to 2 pods when utilization exceeds the target, and scales back down when GPU load is removed."
+      notes: >-
+        Prometheus adapter exposes GPU custom metrics (gpu_utilization,
+        gpu_memory_used, gpu_power_usage) via the Kubernetes custom metrics API.
+        HPA is configured to target gpu_utilization at 50% threshold. Under GPU
+        stress testing (CUDA N-Body Simulation), HPA successfully scales
+        replicas from 1 to 2 pods when utilization exceeds the target, and
+        scales back down when GPU load is removed.
   observability:
     - id: accelerator_metrics
-      description: "For supported accelerator types, the platform must allow for the installation and successful operation of at least one accelerator metrics solution that exposes fine-grained performance metrics via a standardized, machine-readable metrics endpoint. This must include a core set of metrics for per-accelerator utilization and memory usage. Additionally, other relevant metrics such as temperature, power draw, and interconnect bandwidth should be exposed if the underlying hardware or virtualization layer makes them available. The list of metrics should align with emerging standards, such as OpenTelemetry metrics, to ensure interoperability. The platform may provide a managed solution, but this is not required for conformance."
+      description: >-
+        For supported accelerator types, the platform must allow for the
+        installation and successful operation of at least one accelerator
+        metrics solution that exposes fine-grained performance metrics via a
+        standardized, machine-readable metrics endpoint. This must include a
+        core set of metrics for per-accelerator utilization and memory usage.
+        Additionally, other relevant metrics such as temperature, power draw,
+        and interconnect bandwidth should be exposed if the underlying hardware
+        or virtualization layer makes them available. The list of metrics should
+        align with emerging standards, such as OpenTelemetry metrics, to ensure
+        interoperability. The platform may provide a managed solution, but this
+        is not required for conformance.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/accelerator-metrics.md"
-      notes: "DCGM Exporter runs on GPU nodes exposing metrics at :9400/metrics in Prometheus format. Per-GPU metrics include utilization, memory usage, temperature (26-31C), and power draw (66-115W). Metrics include pod/namespace/container labels for per-workload attribution. Prometheus actively scrapes DCGM metrics via ServiceMonitor."
+      notes: >-
+        DCGM Exporter runs on GPU nodes exposing metrics at :9400/metrics in
+        Prometheus format. Per-GPU metrics include utilization, memory usage,
+        temperature (26-31C), and power draw (66-115W). Metrics include
+        pod/namespace/container labels for per-workload attribution. Prometheus
+        actively scrapes DCGM metrics via ServiceMonitor.
    - id: ai_service_metrics
-      description: "Provide a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (e.g. Prometheus exposition format). This ensures easy integration for collecting key metrics from common AI frameworks and servers."
+      description: >-
+        Provide a monitoring system capable of discovering and collecting
+        metrics from workloads that expose them in a standard format (e.g.
+        Prometheus exposition format). This ensures easy integration for
+        collecting key metrics from common AI frameworks and servers.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/accelerator-metrics.md"
-      notes: "Prometheus and Grafana are deployed as the monitoring stack. Prometheus discovers and scrapes workloads exposing metrics in Prometheus exposition format via ServiceMonitors. The prometheus-adapter bridges these metrics into the Kubernetes custom metrics API for consumption by HPA and other controllers."
+      notes: >-
+        Prometheus and Grafana are deployed as the monitoring stack. Prometheus
+        discovers and scrapes workloads exposing metrics in Prometheus
+        exposition format via ServiceMonitors. The prometheus-adapter bridges
+        these metrics into the Kubernetes custom metrics API for consumption by
+        HPA and other controllers.
   security:
     - id: secure_accelerator_access
-      description: "Ensure that access to accelerators from within containers is properly isolated and mediated by the Kubernetes resource management framework (device plugin or DRA) and container runtime, preventing unauthorized access or interference between workloads."
+      description: >-
+        Ensure that access to accelerators from within containers is properly
+        isolated and mediated by the Kubernetes resource management framework
+        (device plugin or DRA) and container runtime, preventing unauthorized
+        access or interference between workloads.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/secure-accelerator-access.md"
-      notes: "GPU Operator manages all GPU lifecycle components (driver, device-plugin, DCGM, toolkit, validator, MIG manager). 8x H100 GPUs are individually advertised via ResourceSlices with DRA. Pod volumes contain only kube-api-access projected tokens — no hostPath mounts to /dev/nvidia devices. Device isolation is verified: a test pod requesting 1 GPU sees only the single allocated device."
+      notes: >-
+        GPU Operator manages all GPU lifecycle components (driver, device-plugin,
+        DCGM, toolkit, validator, MIG manager). 8x H100 GPUs are individually
+        advertised via ResourceSlices with DRA. Pod volumes contain only
+        kube-api-access projected tokens — no hostPath mounts to /dev/nvidia
+        devices. Device isolation is verified: a test pod requesting 1 GPU sees
+        only the single allocated device.
   operator:
     - id: robust_controller
-      description: "The platform must prove that at least one complex AI operator with a CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This includes verifying that the operator's pods run correctly, its webhooks are operational, and its custom resources can be reconciled."
+      description: >-
+        The platform must prove that at least one complex AI operator with a
+        CRD (e.g., Ray, Kubeflow) can be installed and functions reliably. This
+        includes verifying that the operator's pods run correctly, its webhooks
+        are operational, and its custom resources can be reconciled.
       level: MUST
       status: "Implemented"
       evidence:
         - "https://github.com/NVIDIA/aicr/blob/main/docs/conformance/cncf/evidence/robust-operator.md"
-      notes: "NVIDIA Dynamo operator is deployed with 6 CRDs (DynamoGraphDeployment, DynamoComponentDeployment, DynamoGraphDeploymentRequest, DynamoGraphDeploymentScalingAdapter, DynamoModel, DynamoWorkerMetadata). Validating webhooks are active and verified via rejection test (invalid CR correctly denied). A DynamoGraphDeployment custom resource is reconciled with frontend and GPU-enabled worker pods running successfully."
+      notes: >-
+        NVIDIA Dynamo operator is deployed with 6 CRDs (DynamoGraphDeployment,
+        DynamoComponentDeployment, DynamoGraphDeploymentRequest,
+        DynamoGraphDeploymentScalingAdapter, DynamoModel, DynamoWorkerMetadata).
+        Validating webhooks are active and verified via rejection test (invalid
+        CR correctly denied). A DynamoGraphDeployment custom resource is
+        reconciled with frontend and GPU-enabled worker pods running
+        successfully.
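The rewrap is content-preserving because of how YAML's `>-` folded block scalar works: a parser strips each continuation line's indentation, joins the lines with single spaces, and the `-` chomping indicator drops the trailing newline, so the folded `description` and `notes` values parse to the same strings as the original one-liners. A minimal sketch of that folding rule, assuming the simple case used in this file (no blank lines or more-indented lines inside the block; the `fold_block_scalar` helper is hypothetical, for illustration only):

```python
def fold_block_scalar(lines):
    """Rejoin wrapped lines the way a YAML parser folds a '>-' block scalar
    in the simple case: strip indentation, join with single spaces, and omit
    the trailing newline (the '-' chomping indicator)."""
    return " ".join(line.strip() for line in lines)


# The wrapped lines from the new description in this diff...
wrapped = [
    "Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are CNCF AI",
    "Conformant. AICR generates validated, GPU-accelerated Kubernetes",
    "configurations that satisfy all CNCF AI Conformance requirements.",
]

# ...fold back to exactly the original single-line quoted value.
original = (
    "Kubernetes platforms powered by NVIDIA AI Cluster Runtime (AICR) are "
    "CNCF AI Conformant. AICR generates validated, GPU-accelerated Kubernetes "
    "configurations that satisfy all CNCF AI Conformance requirements."
)

assert fold_block_scalar(wrapped) == original
```

Note that this sketch covers only the folding behavior relied on here; the full YAML spec also folds blank lines into newlines and keeps more-indented lines literal, which is why none of the wrapped lines in the diff vary their indentation.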
