Commit fd34b4c

feat: integrate behavioral evidence collection into aicr validate
Integrate the evidence collection script into `aicr validate --phase conformance --evidence-dir` instead of a standalone `aicr evidence` command. When --evidence-dir is set, behavioral evidence (GPU workload tests, HPA scaling, Prometheus queries) is collected atomically alongside structural validation, ensuring evidence reflects the same cluster state.

Changes:
- Hook evidence.Collector into validate.go after structural evidence rendering; it runs automatically when --evidence-dir is specified
- Move script + manifests to pkg/evidence/scripts/ (embedded via go:embed)
- Remove standalone aicr evidence command (pkg/cli/evidence.go)
- Update README with single-command usage
- Gang scheduling test uses device plugin instead of DRA
- Fix path leaks in evidence output (strip SCRIPT_DIR and temp paths)

Usage:
  aicr validate -r recipe.yaml --phase conformance --evidence-dir ./evidence

Or run the script directly:
  ./pkg/evidence/scripts/collect-evidence.sh all

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent 1f1758a commit fd34b4c

File tree

10 files changed, +258 -77 lines changed


docs/conformance/cncf/README.md

Lines changed: 18 additions & 22 deletions
@@ -19,11 +19,9 @@ recipe meets the Must-have requirements for Kubernetes v1.34.
 ```
 docs/conformance/cncf/
 ├── README.md
-├── collect-evidence.sh
-├── manifests/
-│   ├── dra-gpu-test.yaml
-│   ├── gang-scheduling-test.yaml
-│   └── hpa-gpu-test.yaml
+├── submission/
+│   ├── PRODUCT.yaml
+│   └── README.md
 └── evidence/
     ├── index.md
     ├── dra-support.md
@@ -34,6 +32,13 @@ docs/conformance/cncf/
     ├── robust-operator.md
     ├── pod-autoscaling.md
     └── cluster-autoscaling.md
+
+pkg/evidence/scripts/ # Evidence collection script + test manifests
+├── collect-evidence.sh
+└── manifests/
+    ├── dra-gpu-test.yaml
+    ├── gang-scheduling-test.yaml
+    └── hpa-gpu-test.yaml
 ```
 
 ## Usage
@@ -56,25 +61,16 @@ aicr validate -r recipe.yaml -s snapshot.yaml \
   --result validation-result.yaml
 ```
 
-### Step 2: Behavioral Test Evidence
-
-`collect-evidence.sh` deploys test workloads and collects behavioral evidence
-(DRA GPU allocation, gang scheduling, HPA autoscaling, etc.) that requires
-running actual GPU workloads on the cluster:
+When `--evidence-dir` is specified, both structural validation evidence and
+behavioral test evidence (DRA GPU allocation, gang scheduling, HPA autoscaling,
+etc.) are collected atomically in a single command. Behavioral tests deploy GPU
+workloads on the cluster and capture detailed command outputs, workload logs,
+and Prometheus queries.
 
+Alternatively, run the evidence collection script directly:
 ```bash
-# Collect all behavioral evidence
-./docs/conformance/cncf/collect-evidence.sh all
-
-# Collect evidence for a single feature
-./docs/conformance/cncf/collect-evidence.sh dra
-./docs/conformance/cncf/collect-evidence.sh gang
-./docs/conformance/cncf/collect-evidence.sh secure
-./docs/conformance/cncf/collect-evidence.sh metrics
-./docs/conformance/cncf/collect-evidence.sh gateway
-./docs/conformance/cncf/collect-evidence.sh operator
-./docs/conformance/cncf/collect-evidence.sh hpa
-./docs/conformance/cncf/collect-evidence.sh cluster-autoscaling
+./pkg/evidence/scripts/collect-evidence.sh all
+./pkg/evidence/scripts/collect-evidence.sh dra
 ```
 
 > **Note:** The HPA test (`hpa`) deploys a GPU stress workload (nbody) and waits

docs/conformance/cncf/evidence/dra-support.md

Lines changed: 2 additions & 2 deletions
@@ -47,7 +47,7 @@ ip-100-64-171-120.ec2.internal-gpu.nvidia.com-75xvv ip-100-64-171-1
 
 Deploy a test pod that requests 1 GPU via ResourceClaim and verifies device access.
 
-**Test manifest:** `docs/conformance/cncf/manifests/dra-gpu-test.yaml`
+**Test manifest:** `pkg/evidence/scripts/manifests/dra-gpu-test.yaml`
 
 ```yaml
 ---
@@ -99,7 +99,7 @@ spec:
 
 **Apply test manifest**
 ```
-$ kubectl apply -f docs/conformance/cncf/manifests/dra-gpu-test.yaml
+$ kubectl apply -f pkg/evidence/scripts/manifests/dra-gpu-test.yaml
 namespace/dra-test created
 resourceclaim.resource.k8s.io/gpu-claim created
 pod/dra-gpu-test created

docs/conformance/cncf/evidence/gang-scheduling.md

Lines changed: 2 additions & 2 deletions
@@ -52,7 +52,7 @@ podgroups.scheduling.run.ai 2026-02-12T20:42:05Z
 Deploy a PodGroup with minMember=2 and two GPU pods. KAI scheduler ensures both
 pods are scheduled atomically.
 
-**Test manifest:** `docs/conformance/cncf/manifests/gang-scheduling-test.yaml`
+**Test manifest:** `pkg/evidence/scripts/manifests/gang-scheduling-test.yaml`
 
 ```yaml
 ---
@@ -149,7 +149,7 @@ spec:
 
 **Apply test manifest**
 ```
-$ kubectl apply -f docs/conformance/cncf/manifests/gang-scheduling-test.yaml
+$ kubectl apply -f pkg/evidence/scripts/manifests/gang-scheduling-test.yaml
 namespace/gang-scheduling-test created
 podgroup.scheduling.run.ai/gang-test-group created
 pod/gang-worker-0 created

docs/conformance/cncf/evidence/pod-autoscaling.md

Lines changed: 2 additions & 2 deletions
@@ -56,7 +56,7 @@ pods/gpu_utilization
 Deploy a GPU workload running CUDA N-Body Simulation to generate sustained GPU utilization,
 then create an HPA targeting `gpu_utilization` to demonstrate autoscaling.
 
-**Test manifest:** `docs/conformance/cncf/manifests/hpa-gpu-test.yaml`
+**Test manifest:** `pkg/evidence/scripts/manifests/hpa-gpu-test.yaml`
 
 ```yaml
 ---
@@ -123,7 +123,7 @@ spec:
 
 **Apply test manifest**
 ```
-$ kubectl apply -f docs/conformance/cncf/manifests/hpa-gpu-test.yaml
+$ kubectl apply -f pkg/evidence/scripts/manifests/hpa-gpu-test.yaml
 namespace/hpa-test created
 deployment.apps/gpu-workload created
 horizontalpodautoscaler.autoscaling/gpu-workload-hpa created

pkg/cli/validate.go

Lines changed: 28 additions & 1 deletion
@@ -361,7 +361,12 @@ func validateCmdFlags() []cli.Flag {
 		},
 		&cli.StringFlag{
 			Name:  "evidence-dir",
-			Usage: "Write CNCF conformance evidence markdown to this directory. Requires --phase conformance.",
+			Usage: "Collect CNCF conformance evidence to this directory. When set, runs behavioral evidence collection (GPU workload tests, HPA scaling, Prometheus queries) instead of structural Go checks. Requires --phase conformance.",
+		},
+		&cli.StringSliceFlag{
+			Name:    "feature",
+			Aliases: []string{"f"},
+			Usage:   "Evidence feature to collect (repeatable, default: all). Use -f all to run all features (cannot be combined with other features). Only used with --evidence-dir.",
 		},
 		&cli.StringFlag{
 			Name:  "result",
@@ -462,6 +467,28 @@ Use a saved result file for evidence instead of the live run:
 				return errors.New(errors.ErrCodeInvalidRequest, "--result requires --evidence-dir")
 			}
 
+			// When --evidence-dir is set, run behavioral evidence collection
+			// instead of structural Go checks. This deploys GPU workloads and
+			// captures detailed outputs for CNCF submission.
+			if evidenceDir != "" {
+				features := cmd.StringSlice("feature")
+				slog.Info("collecting behavioral conformance evidence",
+					"dir", evidenceDir, "features", features)
+
+				evidenceTimeout := cmd.Duration("timeout")
+				evidenceCtx, evidenceCancel := context.WithTimeout(ctx, evidenceTimeout)
+				defer evidenceCancel()
+
+				collector := evidence.NewCollector(evidenceDir,
+					evidence.WithFeatures(features),
+				)
+				if err := collector.Run(evidenceCtx); err != nil {
+					return errors.Wrap(errors.ErrCodeInternal, "evidence collection failed", err)
+				}
+				slog.Info("conformance evidence written", "dir", evidenceDir)
+				return nil
+			}
+
 			recipeFilePath := cmd.String("recipe")
 			snapshotFilePath := cmd.String("snapshot")
 			kubeconfig := cmd.String("kubeconfig")
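
The `--feature` usage string in this hunk says `-f all` cannot be combined with other features, while the hunk itself hands the slice to the collector unchecked. A minimal sketch of how that rule could be enforced before collection starts, assuming a hypothetical helper `checkFeatures` (not part of this commit) and a local copy of the `ValidFeatures` list from `pkg/evidence/collector.go`:

```go
package main

import (
	"fmt"
	"slices"
)

// validFeatures is a local copy of ValidFeatures from pkg/evidence/collector.go.
var validFeatures = []string{
	"dra", "gang", "secure", "metrics",
	"gateway", "operator", "hpa", "cluster-autoscaling",
}

// checkFeatures is a hypothetical helper, not part of the commit. It rejects
// unknown feature names and rejects "all" when it appears with other features.
// An empty list is accepted because the collector treats it as "all".
func checkFeatures(features []string) error {
	for _, f := range features {
		if f == "all" {
			if len(features) > 1 {
				return fmt.Errorf(`"all" cannot be combined with other features: %v`, features)
			}
			continue
		}
		if !slices.Contains(validFeatures, f) {
			return fmt.Errorf("unknown feature %q (valid: %v)", f, validFeatures)
		}
	}
	return nil
}

func main() {
	inputs := [][]string{nil, {"all"}, {"dra", "hpa"}, {"all", "dra"}, {"dar"}}
	for _, in := range inputs {
		fmt.Printf("%v -> %v\n", in, checkFeatures(in))
	}
}
```

Such a check would probably sit next to the existing `--result requires --evidence-dir` guard, so a mistyped feature name fails fast instead of after GPU workloads have been deployed.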

pkg/evidence/collector.go

Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
+// Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+//
+// Licensed under the Apache License, Version 2.0 (the "License");
+// you may not use this file except in compliance with the License.
+// You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing, software
+// distributed under the License is distributed on an "AS IS" BASIS,
+// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+// See the License for the specific language governing permissions and
+// limitations under the License.
+
+package evidence
+
+import (
+	"context"
+	"embed"
+	"io/fs"
+	"log/slog"
+	"os"
+	"os/exec"
+	"path/filepath"
+
+	"github.com/NVIDIA/aicr/pkg/errors"
+)
+
+//go:embed scripts/collect-evidence.sh
+var collectScript []byte
+
+//go:embed scripts/manifests
+var manifestsFS embed.FS
+
+// ValidFeatures lists all supported evidence collection features.
+var ValidFeatures = []string{
+	"dra",
+	"gang",
+	"secure",
+	"metrics",
+	"gateway",
+	"operator",
+	"hpa",
+	"cluster-autoscaling",
+}
+
+// FeatureDescriptions maps feature names to human-readable descriptions.
+var FeatureDescriptions = map[string]string{
+	"dra":                 "DRA GPU allocation test",
+	"gang":                "Gang scheduling co-scheduling test",
+	"secure":              "Secure accelerator access verification",
+	"metrics":             "Accelerator & AI service metrics",
+	"gateway":             "Inference API gateway conditions",
+	"operator":            "Robust AI operator + webhook test",
+	"hpa":                 "HPA pod autoscaling (scale-up + scale-down)",
+	"cluster-autoscaling": "Cluster autoscaling (ASG configuration)",
+}
+
+// CollectorOption configures the Collector.
+type CollectorOption func(*Collector)
+
+// Collector orchestrates behavioral evidence collection by invoking the
+// embedded collect-evidence.sh script against a live Kubernetes cluster.
+type Collector struct {
+	outputDir string
+	features  []string
+	noCleanup bool
+}
+
+// NewCollector creates a new evidence Collector.
+func NewCollector(outputDir string, opts ...CollectorOption) *Collector {
+	c := &Collector{
+		outputDir: outputDir,
+	}
+	for _, opt := range opts {
+		opt(c)
+	}
+	return c
+}
+
+// WithFeatures sets which features to collect evidence for.
+// If empty, all features are collected.
+func WithFeatures(features []string) CollectorOption {
+	return func(c *Collector) {
+		c.features = features
+	}
+}
+
+// WithNoCleanup skips test namespace cleanup after collection.
+func WithNoCleanup(noCleanup bool) CollectorOption {
+	return func(c *Collector) {
+		c.noCleanup = noCleanup
+	}
+}
+
+// Run executes evidence collection for the configured features.
+func (c *Collector) Run(ctx context.Context) error {
+	// Write embedded script and manifests to temp directory.
+	tmpDir, err := os.MkdirTemp("", "aicr-evidence-")
+	if err != nil {
+		return errors.Wrap(errors.ErrCodeInternal, "failed to create temp directory", err)
+	}
+	defer os.RemoveAll(tmpDir)
+
+	scriptPath := filepath.Join(tmpDir, "collect-evidence.sh")
+	if err := os.WriteFile(scriptPath, collectScript, 0o700); err != nil { //nolint:gosec // script needs execute permission
+		return errors.Wrap(errors.ErrCodeInternal, "failed to write evidence script", err)
+	}
+
+	manifestDir := filepath.Join(tmpDir, "manifests")
+	if err := writeEmbeddedManifests(manifestDir); err != nil {
+		return errors.Wrap(errors.ErrCodeInternal, "failed to write manifests", err)
+	}
+
+	// Create output directory.
+	if err := os.MkdirAll(c.outputDir, 0o755); err != nil {
+		return errors.Wrap(errors.ErrCodeInternal, "failed to create output directory", err)
+	}
+
+	// Determine sections to run. "all" or empty means run everything.
+	sections := c.features
+	if len(sections) == 0 {
+		sections = []string{"all"}
+	}
+	for _, s := range sections {
+		if s == "all" {
+			sections = []string{"all"}
+			break
+		}
+	}
+
+	// Run each feature.
+	var lastErr error
+	for _, section := range sections {
+		slog.Info("collecting evidence", "feature", section)
+		if err := c.runSection(ctx, scriptPath, tmpDir, section); err != nil {
+			slog.Warn("evidence collection failed for feature",
+				"feature", section, "error", err)
+			lastErr = err
+			// Continue with remaining features.
+		}
+	}
+
+	if lastErr != nil {
+		return errors.Wrap(errors.ErrCodeInternal,
+			"one or more evidence sections failed", lastErr)
+	}
+	return nil
+}
+
+// runSection executes the evidence script for a single section.
+func (c *Collector) runSection(ctx context.Context, scriptPath, scriptDir, section string) error {
+	cmd := exec.CommandContext(ctx, "bash", scriptPath, section)
+	cmd.Dir = scriptDir
+	cmd.Env = append(os.Environ(),
+		"EVIDENCE_DIR="+c.outputDir,
+		"SCRIPT_DIR="+scriptDir,
+	)
+	if c.noCleanup {
+		cmd.Env = append(cmd.Env, "NO_CLEANUP=true")
+	}
+	cmd.Stdout = os.Stdout
+	cmd.Stderr = os.Stderr
+	return cmd.Run()
+}
+
+// writeEmbeddedManifests extracts the embedded manifests to the target directory.
+func writeEmbeddedManifests(targetDir string) error {
+	return fs.WalkDir(manifestsFS, "scripts/manifests", func(path string, d fs.DirEntry, err error) error {
+		if err != nil {
+			return err
+		}
+
+		// Compute relative path from "scripts/manifests" prefix.
+		relPath, _ := filepath.Rel("scripts/manifests", path)
+		targetPath := filepath.Join(targetDir, relPath)
+
+		if d.IsDir() {
+			return os.MkdirAll(targetPath, 0o755)
+		}
+
+		data, err := manifestsFS.ReadFile(path)
+		if err != nil {
+			return err
+		}
+		return os.WriteFile(targetPath, data, 0o600)
+	})
+}
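
`Collector.Run` keeps going after a section fails but returns only the last error, so earlier failures appear only in the log. A minimal sketch of an alternative that reports every failure, using the standard library's `errors.Join` (Go 1.20+) rather than the project's `pkg/errors` wrapper. `normalizeSections` and `runSections` are hypothetical names introduced for illustration, and the `fake` runner stands in for `runSection`:

```go
package main

import (
	"errors"
	"fmt"
)

// normalizeSections mirrors the selection logic in Collector.Run: an empty
// list, or any list containing "all", collapses to a single "all" run.
func normalizeSections(features []string) []string {
	if len(features) == 0 {
		return []string{"all"}
	}
	for _, f := range features {
		if f == "all" {
			return []string{"all"}
		}
	}
	return features
}

// runSections is a hypothetical variant of the loop in Collector.Run. Like the
// original it continues after a failure, but it records every failure instead
// of only the last one.
func runSections(sections []string, run func(section string) error) error {
	var errs []error
	for _, s := range sections {
		if err := run(s); err != nil {
			errs = append(errs, fmt.Errorf("feature %s: %w", s, err))
		}
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	// Stand-in for runSection: pretend that "gang" and "hpa" fail.
	fake := func(section string) error {
		if section == "gang" || section == "hpa" {
			return errors.New("script exited non-zero")
		}
		return nil
	}
	sections := normalizeSections([]string{"dra", "gang", "hpa"})
	if err := runSections(sections, fake); err != nil {
		fmt.Println(err)
	}
}
```

The joined error could still be passed through the project's `errors.Wrap` at the end of `Run`, assuming that wrapper accepts any `error` as its cause, which the existing calls suggest.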
