Commit 3cf5f36

Merge branch 'main' into refactor/rename-predeployment-to-readiness
2 parents 2034e9c + 4ddf3b8 commit 3cf5f36

13 files changed, +2954 -55 lines changed

.github/workflows/gpu-h100-smoke-test.yaml

Lines changed: 3 additions & 2 deletions
@@ -29,6 +29,7 @@ on:
       - '.github/actions/load-versions/**'
       - 'tests/manifests/**'
       - 'tests/chainsaw/ai-conformance/**'
+      - 'docs/conformance/cncf/**'
       - 'recipes/components/dynamo-platform/**'
       - 'recipes/overlays/kind.yaml'
       - 'recipes/overlays/kind-inference.yaml'
@@ -193,7 +194,7 @@ jobs:
       - name: Deploy DRA GPU test
         run: |
           kubectl --context="kind-${KIND_CLUSTER_NAME}" apply \
-            -f tests/manifests/dra-gpu-test.yaml
+            -f docs/conformance/cncf/manifests/dra-gpu-test.yaml

           echo "Waiting for DRA GPU test pod to complete..."
           if kubectl --context="kind-${KIND_CLUSTER_NAME}" -n dra-test \
@@ -216,7 +217,7 @@ jobs:
         if: always()
         run: |
           kubectl --context="kind-${KIND_CLUSTER_NAME}" delete \
-            -f tests/manifests/dra-gpu-test.yaml --ignore-not-found 2>/dev/null || true
+            -f docs/conformance/cncf/manifests/dra-gpu-test.yaml --ignore-not-found 2>/dev/null || true

       # --- Evidence collection ---
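
Both the deploy and the cleanup step now depend on the relocated manifest path, so a stale path would only surface when `kubectl` fails. A minimal sketch of a guard that could run before the deploy step, assuming it is invoked from the repository root. The `require_manifest` helper is hypothetical and not part of this workflow:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the workflow): fail fast when a manifest
# path does not point at an existing regular file.
require_manifest() {
  local path="$1"
  if [[ ! -f "$path" ]]; then
    echo "manifest not found: $path" >&2
    return 1
  fi
  echo "ok: $path"
}
```

A step could call `require_manifest docs/conformance/cncf/manifests/dra-gpu-test.yaml` before `kubectl apply`, so a wrong path is reported directly rather than as a `kubectl` error.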

docs/conformance/cncf/README.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+# CNCF AI Conformance Evidence
+
+## Overview
+
+This directory contains evidence for [CNCF Kubernetes AI Conformance](https://github.com/cncf/k8s-ai-conformance)
+certification. The evidence demonstrates that a cluster configured with a specific
+recipe meets the Must-have requirements for Kubernetes v1.34.
+
+> **Note:** It is the **cluster configured by a recipe** that is conformant, not the
+> tool itself. The recipe determines which components are deployed and how they are
+> configured. Different recipes may produce clusters with different conformance profiles.
+
+**Recipe used:** `h100-eks-ubuntu-inference-dynamo`
+**Cluster:** EKS with p5.48xlarge (8x NVIDIA H100 80GB HBM3)
+**Kubernetes:** v1.34
+
+## Directory Structure
+
+```
+docs/conformance/cncf/
+├── README.md
+├── collect-evidence.sh
+├── manifests/
+│   ├── dra-gpu-test.yaml
+│   └── gang-scheduling-test.yaml
+└── evidence/
+    ├── index.md
+    ├── dra-support.md
+    ├── gang-scheduling.md
+    ├── secure-accelerator-access.md
+    ├── accelerator-metrics.md
+    ├── inference-gateway.md
+    └── robust-operator.md
+```
+
+## Usage
+
+```bash
+# Collect all evidence
+./docs/conformance/cncf/collect-evidence.sh all
+
+# Collect evidence for a single feature
+./docs/conformance/cncf/collect-evidence.sh dra
+./docs/conformance/cncf/collect-evidence.sh gang
+./docs/conformance/cncf/collect-evidence.sh secure
+./docs/conformance/cncf/collect-evidence.sh metrics
+./docs/conformance/cncf/collect-evidence.sh gateway
+./docs/conformance/cncf/collect-evidence.sh operator
+```
+
+## Evidence
+
+See [evidence/index.md](evidence/index.md) for a summary of all collected evidence and results.
+
+## Feature Areas
+
+| # | Feature | Requirement | Evidence File |
+|---|---------|-------------|---------------|
+| 1 | DRA Support | `dra_support` | [evidence/dra-support.md](evidence/dra-support.md) |
+| 2 | Gang Scheduling | `gang_scheduling` | [evidence/gang-scheduling.md](evidence/gang-scheduling.md) |
+| 3 | Secure Accelerator Access | `secure_accelerator_access` | [evidence/secure-accelerator-access.md](evidence/secure-accelerator-access.md) |
+| 4 | Accelerator & AI Service Metrics | `accelerator_metrics`, `ai_service_metrics` | [evidence/accelerator-metrics.md](evidence/accelerator-metrics.md) |
+| 5 | Inference API Gateway | `ai_inference` | [evidence/inference-gateway.md](evidence/inference-gateway.md) |
+| 6 | Robust AI Operator | `robust_controller` | [evidence/robust-operator.md](evidence/robust-operator.md) |
+
+## TODO
+
+- [ ] **Cluster Autoscaling** (`cluster_autoscaling`, MUST) — Demonstrate Karpenter or cluster autoscaler scaling GPU node groups based on pending pod requests
+- [ ] **Pod Autoscaling** (`pod_autoscaling`, MUST) — Demonstrate HPA scaling pods based on custom GPU metrics (e.g., `gpu_utilization` from prometheus-adapter)
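
The README's Usage section lists one subcommand per feature area, plus `all`. The contents of `collect-evidence.sh` are not shown in this view, so the following is only a sketch of how such a dispatcher might map subcommands to evidence files. The function names `evidence_file` and `expand_features` are hypothetical, and the subcommand-to-file pairing is inferred from the matching names in the Usage and Feature Areas sections rather than taken from the script:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical lookup: subcommand -> evidence file, inferred from the
# README's Usage and Feature Areas sections. This is an assumption, not
# the actual collect-evidence.sh logic.
evidence_file() {
  case "$1" in
    dra)      echo "evidence/dra-support.md" ;;
    gang)     echo "evidence/gang-scheduling.md" ;;
    secure)   echo "evidence/secure-accelerator-access.md" ;;
    metrics)  echo "evidence/accelerator-metrics.md" ;;
    gateway)  echo "evidence/inference-gateway.md" ;;
    operator) echo "evidence/robust-operator.md" ;;
    *)
      echo "unknown feature: $1" >&2
      return 1
      ;;
  esac
}

# Expand "all" into every feature, in the order of the Feature Areas table;
# any other argument is passed through unchanged.
expand_features() {
  if [[ "$1" == "all" ]]; then
    echo "dra gang secure metrics gateway operator"
  else
    echo "$1"
  fi
}
```

With these two helpers, `for f in $(expand_features all); do evidence_file "$f"; done` would list the six evidence files, and an unrecognized subcommand would fail with a non-zero status instead of silently doing nothing.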
