Commit ad681cd

feat: integrate behavioral evidence collection into aicr validate
Integrate the evidence collection script into `aicr validate --phase conformance --evidence-dir` instead of a standalone `aicr evidence` command. When --evidence-dir is set, behavioral evidence (GPU workload tests, HPA scaling, Prometheus queries) is collected atomically alongside structural validation, ensuring evidence reflects the same cluster state.

Changes:
- Hook evidence.Collector into validate.go after structural evidence rendering (runs automatically when --evidence-dir is specified)
- Move script + manifests to pkg/evidence/scripts/ (embedded via go:embed)
- Remove standalone aicr evidence command (pkg/cli/evidence.go)
- Update README with single-command usage
- Gang scheduling test uses device plugin instead of DRA
- Fix path leaks in evidence output (strip SCRIPT_DIR and temp paths)

Usage:
  aicr validate -r recipe.yaml --phase conformance --evidence-dir ./evidence

Or run the script directly:
  ./pkg/evidence/scripts/collect-evidence.sh all

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent 1f1758a commit ad681cd

10 files changed, with 299 additions and 108 deletions


docs/conformance/cncf/README.md

Lines changed: 44 additions & 54 deletions
@@ -19,11 +19,9 @@ recipe meets the Must-have requirements for Kubernetes v1.34.
 ```
 docs/conformance/cncf/
 ├── README.md
-├── collect-evidence.sh
-├── manifests/
-│   ├── dra-gpu-test.yaml
-│   ├── gang-scheduling-test.yaml
-│   └── hpa-gpu-test.yaml
+├── submission/
+│   ├── PRODUCT.yaml
+│   └── README.md
 └── evidence/
     ├── index.md
     ├── dra-support.md
@@ -34,76 +32,68 @@ docs/conformance/cncf/
     ├── robust-operator.md
     ├── pod-autoscaling.md
     └── cluster-autoscaling.md
+
+pkg/evidence/scripts/    # Evidence collection script + test manifests
+├── collect-evidence.sh
+└── manifests/
+    ├── dra-gpu-test.yaml
+    ├── gang-scheduling-test.yaml
+    └── hpa-gpu-test.yaml
 ```
 
 ## Usage
 
 Evidence collection has two steps:
 
-### Step 1: Structural Validation Evidence
+### Structural Validation (CI)
 
-`aicr validate` checks component health, CRDs, constraints, and generates
-structural evidence:
+`aicr validate` checks component health, CRDs, and constraints for CI:
 
 ```bash
-# Generate evidence during validation
-aicr validate -r recipe.yaml -s snapshot.yaml \
+# Structural validation + evidence rendering
+aicr validate -r recipe.yaml \
   --phase conformance --evidence-dir ./evidence
-
-# Or use a saved result file
-aicr validate -r recipe.yaml -s snapshot.yaml \
-  --phase conformance --evidence-dir ./evidence \
-  --result validation-result.yaml
 ```
 
-### Step 2: Behavioral Test Evidence
+### CNCF Submission Evidence
 
-`collect-evidence.sh` deploys test workloads and collects behavioral evidence
-(DRA GPU allocation, gang scheduling, HPA autoscaling, etc.) that requires
-running actual GPU workloads on the cluster:
+Add `--cncf-submission` to collect detailed behavioral evidence for CNCF AI
+Conformance submission. This deploys GPU workloads, captures command outputs,
+workload logs, nvidia-smi output, and Prometheus queries:
 
 ```bash
 # Collect all behavioral evidence
-./docs/conformance/cncf/collect-evidence.sh all
-
-# Collect evidence for a single feature
-./docs/conformance/cncf/collect-evidence.sh dra
-./docs/conformance/cncf/collect-evidence.sh gang
-./docs/conformance/cncf/collect-evidence.sh secure
-./docs/conformance/cncf/collect-evidence.sh metrics
-./docs/conformance/cncf/collect-evidence.sh gateway
-./docs/conformance/cncf/collect-evidence.sh operator
-./docs/conformance/cncf/collect-evidence.sh hpa
-./docs/conformance/cncf/collect-evidence.sh cluster-autoscaling
+aicr validate --phase conformance \
+  --evidence-dir ./evidence --cncf-submission
+
+# Collect specific features
+aicr validate --phase conformance \
+  --evidence-dir ./evidence --cncf-submission -f dra -f hpa
+```
+
+Alternatively, run the evidence collection script directly:
+```bash
+./pkg/evidence/scripts/collect-evidence.sh all
+./pkg/evidence/scripts/collect-evidence.sh dra
 ```
 
-> **Note:** The HPA test (`hpa`) deploys a GPU stress workload (nbody) and waits
-> for HPA to scale up, then verifies scale-down. This takes ~5 minutes due to
-> metric propagation through the DCGM → Prometheus → prometheus-adapter → HPA pipeline.
+> **Note:** The `--cncf-submission` flag deploys GPU workloads and takes ~15
+> minutes. The HPA test uses CUDA N-Body Simulation to stress GPUs and verifies
+> both scale-up and scale-down.
 
-### Why Two Steps?
+### Two Modes
 
-| Evidence Type | `aicr validate` | `collect-evidence.sh` |
+| | `aicr validate --phase conformance` | `--cncf-submission` |
 |---|---|---|
-| Component health (pods, CRDs) | Yes | Yes |
-| Constraint validation (K8s version, OS) | Yes | No |
-| DRA GPU allocation test | No | Yes |
-| Gang scheduling test | No | Yes |
-| Device isolation verification | No | Yes |
-| Gateway condition checks (Accepted, Programmed) | No | Yes |
-| Webhook rejection test | No | Yes |
-| HPA scale-up and scale-down with GPU load | No | Yes |
-| Prometheus query results | No | Yes |
-| Cluster autoscaling (ASG config) | No | Yes |
-
-`aicr validate` checks that components are deployed correctly. `collect-evidence.sh`
-verifies they work correctly by running actual workloads. Both are needed for
-complete conformance evidence.
-
-> **Future:** Behavioral tests are inherently long-running (e.g., HPA test deploys
-> CUDA N-Body Simulation and waits ~5 minutes for metric propagation and scaling) and are better
-> suited as a separate step than blocking `aicr validate`. A follow-up integration
-> is tracked in [#192](https://github.com/NVIDIA/aicr/issues/192).
+| **Purpose** | CI pass/fail | CNCF submission evidence |
+| **Speed** | ~3 minutes | ~15 minutes |
+| **Deploys workloads** | No | Yes |
+| **Output** | Structural evidence (pass/fail + artifacts) | Behavioral evidence (command outputs, logs, queries) |
+| **DRA GPU allocation test** | Status check only | Deploys pod, verifies GPU access |
+| **Gang scheduling test** | Component check only | Deploys PodGroup, verifies co-scheduling |
+| **HPA autoscaling** | Metrics API check | Scale-up + scale-down with GPU load |
+| **Gateway** | Status check | Condition verification (Accepted, Programmed) |
+| **Webhook test** | No | Rejection test with invalid CR |
 
 ## Evidence
 
docs/conformance/cncf/evidence/dra-support.md

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ ip-100-64-171-120.ec2.internal-gpu.nvidia.com-75xvv ip-100-64-171-1
 
 Deploy a test pod that requests 1 GPU via ResourceClaim and verifies device access.
 
-**Test manifest:** `docs/conformance/cncf/manifests/dra-gpu-test.yaml`
+**Test manifest:** `pkg/evidence/scripts/manifests/dra-gpu-test.yaml`
 
 ```yaml
 ---
@@ -99,7 +99,7 @@ spec:
 
 **Apply test manifest**
 ```
-$ kubectl apply -f docs/conformance/cncf/manifests/dra-gpu-test.yaml
+$ kubectl apply -f pkg/evidence/scripts/manifests/dra-gpu-test.yaml
 namespace/dra-test created
 resourceclaim.resource.k8s.io/gpu-claim created
 pod/dra-gpu-test created

docs/conformance/cncf/evidence/gang-scheduling.md

Lines changed: 2 additions & 2 deletions
@@ -52,7 +52,7 @@ podgroups.scheduling.run.ai 2026-02-12T20:42:05Z
 Deploy a PodGroup with minMember=2 and two GPU pods. KAI scheduler ensures both
 pods are scheduled atomically.
 
-**Test manifest:** `docs/conformance/cncf/manifests/gang-scheduling-test.yaml`
+**Test manifest:** `pkg/evidence/scripts/manifests/gang-scheduling-test.yaml`
 
 ```yaml
 ---
@@ -149,7 +149,7 @@ spec:
 
 **Apply test manifest**
 ```
-$ kubectl apply -f docs/conformance/cncf/manifests/gang-scheduling-test.yaml
+$ kubectl apply -f pkg/evidence/scripts/manifests/gang-scheduling-test.yaml
 namespace/gang-scheduling-test created
 podgroup.scheduling.run.ai/gang-test-group created
 pod/gang-worker-0 created

docs/conformance/cncf/evidence/pod-autoscaling.md

Lines changed: 2 additions & 2 deletions
@@ -56,7 +56,7 @@ pods/gpu_utilization
 Deploy a GPU workload running CUDA N-Body Simulation to generate sustained GPU utilization,
 then create an HPA targeting `gpu_utilization` to demonstrate autoscaling.
 
-**Test manifest:** `docs/conformance/cncf/manifests/hpa-gpu-test.yaml`
+**Test manifest:** `pkg/evidence/scripts/manifests/hpa-gpu-test.yaml`
 
 ```yaml
 ---
@@ -123,7 +123,7 @@ spec:
 
 **Apply test manifest**
 ```
-$ kubectl apply -f docs/conformance/cncf/manifests/hpa-gpu-test.yaml
+$ kubectl apply -f pkg/evidence/scripts/manifests/hpa-gpu-test.yaml
 namespace/hpa-test created
 deployment.apps/gpu-workload created
 horizontalpodautoscaler.autoscaling/gpu-workload-hpa created

pkg/cli/validate.go

Lines changed: 43 additions & 0 deletions
@@ -363,6 +363,15 @@ func validateCmdFlags() []cli.Flag {
             Name:  "evidence-dir",
             Usage: "Write CNCF conformance evidence markdown to this directory. Requires --phase conformance.",
         },
+        &cli.BoolFlag{
+            Name:  "cncf-submission",
+            Usage: "Collect detailed behavioral evidence for CNCF AI Conformance submission. Deploys GPU workloads, captures nvidia-smi output, Prometheus queries, and HPA scaling tests. Requires --evidence-dir. Takes ~15 minutes.",
+        },
+        &cli.StringSliceFlag{
+            Name:    "feature",
+            Aliases: []string{"f"},
+            Usage:   "Evidence feature to collect (repeatable, default: all). Use -f all to run all features (cannot be combined with other features). Only used with --cncf-submission.",
+        },
         &cli.StringFlag{
             Name:  "result",
             Usage: "Use a saved validation result file as the source for evidence rendering (live validation still runs). Note: saved results do not include diagnostic artifacts captured during live runs. Requires --phase conformance and --evidence-dir.",
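The `--feature` usage text above states three rules: the flag is repeatable, it defaults to all features, and `-f all` cannot be combined with other names. The code that enforces those rules is not part of this view, so the following is only a minimal sketch of how such a selection could be normalized. `normalizeFeatures` and `knownFeatures` are hypothetical names, and the feature list is copied from the script invocations removed from the README; whether the collector accepts exactly that set is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// knownFeatures is an assumed list, taken from the collect-evidence.sh
// invocations shown in the README diff. It is not confirmed by the commit.
var knownFeatures = []string{
	"dra", "gang", "secure", "metrics",
	"gateway", "operator", "hpa", "cluster-autoscaling",
}

// normalizeFeatures is a hypothetical helper (not from the commit). It applies
// the rules stated in the --feature usage text: an empty selection means all
// features, and "all" may not be combined with any other value.
func normalizeFeatures(requested []string) ([]string, error) {
	if len(requested) == 0 {
		return append([]string(nil), knownFeatures...), nil
	}

	known := make(map[string]bool, len(knownFeatures))
	for _, name := range knownFeatures {
		known[name] = true
	}

	seen := make(map[string]bool, len(requested))
	selected := make([]string, 0, len(requested))
	for _, name := range requested {
		if name == "all" {
			if len(requested) > 1 {
				return nil, errors.New("-f all cannot be combined with other features")
			}
			return append([]string(nil), knownFeatures...), nil
		}
		if !known[name] {
			return nil, fmt.Errorf("unknown feature %q", name)
		}
		if !seen[name] { // drop repeated names, keep request order
			seen[name] = true
			selected = append(selected, name)
		}
	}
	return selected, nil
}

func main() {
	for _, req := range [][]string{nil, {"dra", "hpa"}, {"all", "dra"}} {
		got, err := normalizeFeatures(req)
		fmt.Println(got, err)
	}
}
```

Rejecting unknown names and de-duplicating repeats are choices made for this sketch; the usage text does not say how the real collector treats either case.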
@@ -462,6 +471,40 @@ Use a saved result file for evidence instead of the live run:
         return errors.New(errors.ErrCodeInvalidRequest, "--result requires --evidence-dir")
     }
 
+    cncfSubmission := cmd.Bool("cncf-submission")
+    if cncfSubmission && evidenceDir == "" {
+        return errors.New(errors.ErrCodeInvalidRequest, "--cncf-submission requires --evidence-dir")
+    }
+    features := cmd.StringSlice("feature")
+    if len(features) > 0 && !cncfSubmission {
+        return errors.New(errors.ErrCodeInvalidRequest, "--feature requires --cncf-submission")
+    }
+
+    // When --cncf-submission is set, run behavioral evidence collection
+    // instead of structural Go checks. This deploys GPU workloads and
+    // captures detailed outputs for CNCF submission.
+    if cncfSubmission {
+        slog.Info("collecting behavioral conformance evidence",
+            "dir", evidenceDir, "features", features)
+
+        // Use a longer timeout for behavioral evidence (default 5m is too short).
+        evidenceTimeout := cmd.Duration("timeout")
+        if evidenceTimeout <= 5*time.Minute {
+            evidenceTimeout = 20 * time.Minute
+        }
+        evidenceCtx, evidenceCancel := context.WithTimeout(ctx, evidenceTimeout)
+        defer evidenceCancel()
+
+        collector := evidence.NewCollector(evidenceDir,
+            evidence.WithFeatures(features),
+        )
+        if err := collector.Run(evidenceCtx); err != nil {
+            return errors.Wrap(errors.ErrCodeInternal, "evidence collection failed", err)
+        }
+        slog.Info("conformance evidence written", "dir", evidenceDir)
+        return nil
+    }
+
     recipeFilePath := cmd.String("recipe")
     snapshotFilePath := cmd.String("snapshot")
     kubeconfig := cmd.String("kubeconfig")
