
Commit a96eb04

docs(validator): add development and extension guides for validation system
Add three new documents to match component documentation coverage:
- docs/contributor/validator.md: upstream check development guide (contract, quick start, Context API, testing patterns)
- docs/integrator/validator-extension.md: external extension via --data (custom validators, overrides, bash example, language-agnostic contract)
- recipes/validators/README.md: catalog schema and validator reference

Update existing docs with cross-references:
- docs/contributor/validations.md: disambiguation note (component vs V2)
- docs/contributor/README.md: link to validator guide
- docs/integrator/README.md: link to extension guide

Closes #307
1 parent a911a55 commit a96eb04


6 files changed: +653 -0 lines changed


docs/contributor/README.md

Lines changed: 4 additions & 0 deletions
@@ -51,6 +51,10 @@ This directory contains architecture documentation for the AI Cluster Runtime (A
5151
- **[API Server Architecture](api-server.md)**: HTTP REST API for recipe generation and bundle creation
5252
- Endpoints: `GET /v1/recipe` (query mode only), `POST /v1/bundle` (bundle generation)
5353
- Does not support snapshot capture or validation (use CLI or agent)
54+
- **[Validator Development Guide](validator.md)**: Container-per-validator engine for `aicr validate`
55+
- Container contract (exit codes, I/O channels, mounted data)
56+
- Quick start for adding upstream Go checks
57+
- Catalog schema reference and testing patterns
5458
- **[Component Validation System](validations.md)**: Component-driven validation framework
5559
- Automatic validation execution during bundle generation
5660
- Condition-based validation (intent, service, accelerator, etc.)

docs/contributor/validations.md

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,8 @@
22

33
Learn how to define and use component validations in AICR.
44

5+
> **Note:** This document covers **component validations** — condition-based checks that run during bundle generation (e.g., missing config, incompatible settings). For the **container-per-validator engine** used by `aicr validate`, see the [Validator Development Guide](validator.md) and [Validator Extension Guide](../integrator/validator-extension.md).
6+
57
## Overview
68

79
The component validation system allows components to register validation checks that run automatically during bundle generation. Validations can check for missing configuration, incompatible settings, or other conditions that might cause deployment issues.

docs/contributor/validator.md

Lines changed: 369 additions & 0 deletions
@@ -0,0 +1,369 @@
1+
# Validator Development Guide
2+
3+
Learn how to add new validation checks to AICR.
4+
5+
## Overview
6+
7+
AICR uses a container-per-validator model. Each validation check runs as an isolated Kubernetes Job with access to the cluster, a snapshot, and the recipe. Validators are organized into three phases:
8+
9+
| Phase | Purpose | Example |
|-------|---------|---------|
| `deployment` | Verify components are installed and healthy | GPU operator pods running, Helm values match recipe |
| `performance` | Verify system meets performance thresholds | NCCL bandwidth, GPU utilization |
| `conformance` | Verify workload-specific requirements | DRA support, gang scheduling, autoscaling |
14+
15+
**Architecture:**
16+
17+
- **Declarative Catalog**: Validators are defined in `recipes/validators/catalog.yaml`
18+
- **Container Contract**: Exit code 0 = pass, 1 = fail, 2 = skip
19+
- **Evidence via stdout**: Check output printed to stdout is captured as CTRF evidence
20+
- **Debug via stderr**: Structured logs go to stderr and are streamed to the user
21+
- **CTRF Reports**: Results are aggregated into [Common Test Report Format](https://ctrf.io/) JSON
22+
23+
## Quick Start
24+
25+
Adding a new check to an existing validator container requires three steps.
26+
27+
### Step 1: Implement the Check Function
28+
29+
Create a new file in the appropriate phase directory (e.g., `validators/deployment/`):
30+
31+
```go
package main

import (
	"fmt"
	"log/slog"

	"github.com/NVIDIA/aicr/pkg/errors"
	"github.com/NVIDIA/aicr/validators"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func checkMyComponent(ctx *validators.Context) error {
	slog.Info("checking my-component health")

	pods, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(
		ctx.Ctx,
		metav1.ListOptions{LabelSelector: "app=my-component"},
	)
	if err != nil {
		return errors.Wrap(errors.ErrCodeInternal, "failed to list pods", err)
	}

	if len(pods.Items) == 0 {
		return errors.New(errors.ErrCodeNotFound, "no my-component pods found")
	}

	// Evidence to stdout (captured in CTRF report)
	fmt.Printf("Found %d my-component pod(s)\n", len(pods.Items))
	for _, pod := range pods.Items {
		fmt.Printf(" %s: %s\n", pod.Name, pod.Status.Phase)
	}

	return nil
}
```
67+
68+
### Step 2: Register in `main.go`
69+
70+
Add the check function to the dispatch map in `validators/deployment/main.go`:
71+
72+
```go
func main() {
	validators.Run(map[string]validators.CheckFunc{
		"operator-health":    checkOperatorHealth,
		"expected-resources": checkExpectedResources,
		// Add your check here:
		"my-component": checkMyComponent,
	})
}
```
82+
83+
### Step 3: Add Catalog Entry
84+
85+
Add an entry to `recipes/validators/catalog.yaml`:
86+
87+
```yaml
validators:
  # ... existing entries ...

  - name: my-component
    phase: deployment
    description: "Verify my-component pods are running and healthy"
    image: ghcr.io/nvidia/aicr-validators/deployment:latest
    timeout: 2m
    args: ["my-component"]
    env: []
```
99+
100+
The `args` field must match the key used in the `validators.Run()` dispatch map.
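
A mismatch between a catalog entry's `args` and the dispatch map is easy to introduce with a typo. The following is a minimal sketch of a consistency check, assuming simplified stand-in shapes for both sides: `unregisteredEntries`, `dispatchKeys`, and `catalogArgs` are hypothetical names for illustration and are not part of the AICR codebase. Treating an entry with empty `args` as a mismatch is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// unregisteredEntries returns the catalog entry names whose first arg has no
// matching key in the dispatch map. Both parameters are hypothetical,
// simplified stand-ins for the real catalog and the validators.Run() map.
func unregisteredEntries(dispatchKeys map[string]bool, catalogArgs map[string][]string) []string {
	var missing []string
	for name, args := range catalogArgs {
		// Assumption: an entry without args cannot select a check.
		if len(args) == 0 || !dispatchKeys[args[0]] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing) // map iteration order is random; sort for stable output
	return missing
}

func main() {
	dispatchKeys := map[string]bool{"operator-health": true, "my-component": true}
	catalogArgs := map[string][]string{
		"operator-health": {"operator-health"},
		"my-component":    {"my-componnet"}, // deliberate typo in args
	}
	fmt.Println(unregisteredEntries(dispatchKeys, catalogArgs)) // prints [my-component]
}
```

A check along these lines could run in a unit test so that a typo fails at build time rather than when the Job starts.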
101+
102+
## Container Contract
103+
104+
Every validator container must follow this contract:
105+
106+
### Exit Codes
107+
108+
| Code | Meaning | CTRF Status |
|------|---------|-------------|
| `0` | Check passed | `passed` |
| `1` | Check failed | `failed` |
| `2` | Check skipped (not applicable) | `skipped` |
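
The mapping in the table can be sketched as a small function. This is an illustration only: `ctrfStatus` is a hypothetical name, and reporting any code other than 0, 1, or 2 as `failed` is an assumption, since the table does not say how other exit codes are handled.

```go
package main

import "fmt"

// ctrfStatus maps a validator container exit code to a CTRF status string.
// Codes 0, 1, and 2 follow the contract table. Mapping every other code
// (for example 137 after an OOM kill) to "failed" is an assumption.
func ctrfStatus(exitCode int) string {
	switch exitCode {
	case 0:
		return "passed"
	case 2:
		return "skipped"
	default:
		return "failed"
	}
}

func main() {
	for _, code := range []int{0, 1, 2, 137} {
		fmt.Printf("exit %d -> %s\n", code, ctrfStatus(code))
	}
}
```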
113+
114+
### I/O Channels
115+
116+
| Channel | Purpose | Captured By |
|---------|---------|-------------|
| **stdout** | Evidence output (human-readable check results) | CTRF report `message` field |
| **stderr** | Debug/progress logs (`slog` output) | Streamed live to user terminal |
| `/dev/termination-log` | Failure reason (max 4096 bytes) | CTRF report on failure |
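
Validators built on `validators.Run()` get termination-log handling from the runner. A validator written without it has to respect the 4096-byte limit itself. The sketch below shows one way to do that, assuming the message is UTF-8 text; `truncateReason` and `writeTerminationLog` are hypothetical helpers, not AICR functions, and the example writes to a temp file so it runs outside a container.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"unicode/utf8"
)

const maxTerminationLogBytes = 4096 // limit taken from the I/O channels table

// truncateReason caps msg at limit bytes without splitting a UTF-8 rune.
func truncateReason(msg string, limit int) string {
	if len(msg) <= limit {
		return msg
	}
	cut := limit
	// Step back until the byte at the cut point starts a new rune.
	for cut > 0 && !utf8.RuneStart(msg[cut]) {
		cut--
	}
	return msg[:cut]
}

// writeTerminationLog writes the (possibly truncated) failure reason to path.
func writeTerminationLog(path, reason string) error {
	return os.WriteFile(path, []byte(truncateReason(reason, maxTerminationLogBytes)), 0o644)
}

func main() {
	// Inside a container the path would be /dev/termination-log.
	path := filepath.Join(os.TempDir(), "termination-log-example")
	if err := writeTerminationLog(path, strings.Repeat("x", 5000)); err != nil {
		fmt.Fprintln(os.Stderr, "write failed:", err)
		return
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Println(len(data)) // prints 4096
}
```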
121+
122+
### Mounted Data
123+
124+
The validator engine mounts snapshot and recipe data as ConfigMaps:
125+
126+
| Path | Content | Environment Override |
|------|---------|---------------------|
| `/data/snapshot/snapshot.yaml` | Cluster snapshot | `AICR_SNAPSHOT_PATH` |
| `/data/recipe/recipe.yaml` | Recipe with constraints | `AICR_RECIPE_PATH` |
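
A validator that does not use `LoadContext()` (for example, one written for the extension path) can resolve these paths itself. The following is a sketch under stated assumptions: `resolvePath` is a hypothetical helper, and treating an empty variable as unset is an assumption rather than documented behavior. The lookup function is passed in so the logic can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// Default mount paths from the table above.
const (
	defaultSnapshotPath = "/data/snapshot/snapshot.yaml"
	defaultRecipePath   = "/data/recipe/recipe.yaml"
)

// resolvePath returns the value of envVar when it is set and non-empty,
// otherwise defaultPath. lookup has the same signature as os.LookupEnv.
func resolvePath(lookup func(string) (string, bool), envVar, defaultPath string) string {
	if v, ok := lookup(envVar); ok && v != "" {
		return v
	}
	return defaultPath
}

func main() {
	fmt.Println("snapshot:", resolvePath(os.LookupEnv, "AICR_SNAPSHOT_PATH", defaultSnapshotPath))
	fmt.Println("recipe:  ", resolvePath(os.LookupEnv, "AICR_RECIPE_PATH", defaultRecipePath))
}
```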
130+
131+
### Environment Variables
132+
133+
| Variable | Description |
|----------|-------------|
| `AICR_NAMESPACE` | Validation namespace (fallback if ServiceAccount namespace unavailable) |
| `AICR_SNAPSHOT_PATH` | Override snapshot mount path |
| `AICR_RECIPE_PATH` | Override recipe mount path |
| `AICR_VALIDATOR_IMAGE_REGISTRY` | Override image registry prefix (set by user) |
139+
140+
## Context API
141+
142+
The `validators.Context` struct provides all dependencies a check needs:
143+
144+
```go
type Context struct {
	Ctx           context.Context       // Parent context with timeout
	Cancel        context.CancelFunc    // Release resources (caller must defer)
	Clientset     kubernetes.Interface  // Typed K8s client
	RESTConfig    *rest.Config          // For exec, port-forward, dynamic client
	DynamicClient dynamic.Interface     // For CRD access
	Snapshot      *snapshotter.Snapshot // Captured cluster state
	Recipe        *recipe.RecipeResult  // Recipe with validation config
	Namespace     string                // Validation namespace
}
```
156+
157+
`LoadContext()` builds this from the container environment: reads mounted ConfigMaps, creates in-cluster K8s clients, and sets a timeout from `defaults.CheckExecutionTimeout`.
158+
159+
### Helper Methods
160+
161+
**`ctx.Timeout(d)`** — Create a child context with a specific timeout:
162+
163+
```go
subCtx, cancel := ctx.Timeout(30 * time.Second)
defer cancel()
pods, err := ctx.Clientset.CoreV1().Pods(ns).List(subCtx, opts)
```
168+
169+
### Runner Utilities
170+
171+
**`validators.Run(checks)`** — Main entry point for validator containers. Handles context loading, check dispatch by `os.Args[1]`, exit codes, and termination log writing.
172+
173+
**`validators.Skip(reason)`** — Return from a `CheckFunc` to indicate the check is not applicable. The runner exits with code 2:
174+
175+
```go
func checkFeatureX(ctx *validators.Context) error {
	if ctx.Recipe.Validation == nil {
		return validators.Skip("no validation section in recipe")
	}
	// ... actual check logic ...
	return nil
}
```
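
To make the runner's behavior concrete, here is a simplified sketch of dispatching on the first argument and translating the returned error into an exit code. It is not the actual `validators.Run()` implementation: `checkFunc`, `errSkipped`, `skip`, `exitCodeFor`, and `demoChecks` are hypothetical names, the sentinel-error approach is an assumption about how a skip could be detected, and returning 1 for a missing or unknown check name is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// checkFunc is a stand-in for validators.CheckFunc without the Context argument.
type checkFunc func() error

// errSkipped is a sentinel that marks a check as not applicable.
var errSkipped = errors.New("check skipped")

// skip wraps the sentinel with a reason, similar in spirit to validators.Skip.
func skip(reason string) error {
	return fmt.Errorf("%w: %s", errSkipped, reason)
}

// exitCodeFor selects a check by args[0], runs it, and maps the result to the
// contract's exit codes: 0 = pass, 1 = fail, 2 = skip.
func exitCodeFor(checks map[string]checkFunc, args []string) int {
	if len(args) == 0 {
		return 1 // assumption: no check name counts as a failure
	}
	check, ok := checks[args[0]]
	if !ok {
		return 1 // assumption: unknown check name counts as a failure
	}
	err := check()
	switch {
	case err == nil:
		return 0
	case errors.Is(err, errSkipped):
		return 2
	default:
		return 1
	}
}

// demoChecks is a toy dispatch map used only for this illustration.
var demoChecks = map[string]checkFunc{
	"passes": func() error { return nil },
	"fails":  func() error { return errors.New("pods not ready") },
	"skips":  func() error { return skip("feature not enabled") },
}

func main() {
	for _, name := range []string{"passes", "fails", "skips", "unknown"} {
		fmt.Printf("%s -> exit %d\n", name, exitCodeFor(demoChecks, []string{name}))
	}
}
```

A real entry point would pass `os.Args[1:]` and call `os.Exit` with the result; the loop in `main` only exists to show all four outcomes in one run.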
184+
185+
## Catalog Entry Schema
186+
187+
Each entry in `recipes/validators/catalog.yaml`:
188+
189+
```yaml
- name: operator-health          # Unique identifier, used in Job names
  phase: deployment              # deployment | performance | conformance
  description: "Human-readable"  # Shown in CTRF report
  image: ghcr.io/.../img:latest  # OCI image reference
  timeout: 2m                    # Job activeDeadlineSeconds
  args: ["operator-health"]      # Container args (check name)
  env:                           # Optional environment variables
    - name: MY_VAR
      value: "my-value"
  resources:                     # Optional resource requests (omit for defaults)
    cpu: "100m"
    memory: "128Mi"
```
203+
204+
**Image tag resolution** (applied by `catalog.Load`):
205+
206+
1. `:latest` tags are replaced with the CLI version (e.g., `:v0.9.5`) for release builds
207+
2. Explicit version tags (e.g., `:v1.2.3`) are never modified
208+
3. `AICR_VALIDATOR_IMAGE_REGISTRY` overrides the registry prefix
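
The three rules can be illustrated with a small function. This is a sketch, not the logic in `catalog.Load`: `resolveImage` is a hypothetical name, an empty `cliVersion` stands in for a non-release build, and the "registry prefix" is assumed to be everything before the final path segment of the image reference. The real implementation may define the prefix differently.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveImage applies the documented rules to a single image reference.
// Assumptions: cliVersion == "" means a non-release build, and the registry
// prefix is everything before the last "/" in the reference.
func resolveImage(image, cliVersion, registryOverride string) string {
	// Rule 3: swap the registry prefix when an override is provided.
	if registryOverride != "" {
		if i := strings.LastIndex(image, "/"); i >= 0 {
			image = strings.TrimSuffix(registryOverride, "/") + image[i:]
		}
	}
	// Rules 1 and 2: only ":latest" is rewritten; explicit tags are kept.
	if cliVersion != "" && strings.HasSuffix(image, ":latest") {
		image = strings.TrimSuffix(image, ":latest") + ":" + cliVersion
	}
	return image
}

func main() {
	img := "ghcr.io/nvidia/aicr-validators/deployment:latest"
	fmt.Println(resolveImage(img, "v0.9.5", ""))
	fmt.Println(resolveImage("example.com/team/check:v1.2.3", "v0.9.5", ""))
	fmt.Println(resolveImage(img, "v0.9.5", "registry.local/mirror"))
}
```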
209+
210+
## Code Walkthrough
211+
212+
The `operator_health.go` check demonstrates the standard pattern:
213+
214+
```go
// validators/deployment/operator_health.go
// Excerpt: imports and the gpuOperatorNamespace/gpuOperatorLabel constants are
// omitted. corev1 refers to k8s.io/api/core/v1.

func checkOperatorHealth(ctx *validators.Context) error {
	// 1. Use slog for debug output (goes to stderr, streamed to user)
	slog.Info("listing pods", "namespace", gpuOperatorNamespace)

	// 2. Use ctx.Clientset for K8s API calls
	pods, err := ctx.Clientset.CoreV1().Pods(gpuOperatorNamespace).List(
		ctx.Ctx,
		metav1.ListOptions{LabelSelector: gpuOperatorLabel},
	)
	if err != nil {
		// 3. Return wrapped errors for failures
		return errors.Wrap(errors.ErrCodeInternal, "failed to list pods", err)
	}

	// 4. Print evidence to stdout (captured in CTRF report)
	fmt.Printf("Found %d gpu-operator pod(s):\n", len(pods.Items))
	runningCount := 0 // number of pods in the Running phase
	for _, pod := range pods.Items {
		fmt.Printf(" %s: %s\n", pod.Name, pod.Status.Phase)
		if pod.Status.Phase == corev1.PodRunning {
			runningCount++
		}
	}

	// 5. Return nil for pass, non-nil error for fail
	if runningCount == 0 {
		return errors.New(errors.ErrCodeInternal, "no pods in Running state")
	}
	return nil
}
```
244+
245+
**Key patterns:**
246+
247+
- `slog.*` → stderr → streamed live to user
248+
- `fmt.Printf` → stdout → captured as CTRF evidence
249+
- `return nil` → exit 0 → passed
250+
- `return errors.*` → exit 1 → failed (message written to termination log)
251+
- `return validators.Skip(reason)` → exit 2 → skipped
252+
253+
## Directory Layout
254+
255+
```
validators/
├── context.go               # Shared Context type and LoadContext()
├── runner.go                # Run() entry point, exit code handling
├── deployment/              # Deployment phase validators
│   ├── main.go              # Check dispatch map
│   ├── Dockerfile           # Container image build
│   ├── operator_health.go   # Individual check implementation
│   ├── expected_resources.go
│   └── ...
├── performance/             # Performance phase validators
│   ├── main.go
│   ├── Dockerfile
│   └── ...
├── conformance/             # Conformance phase validators
│   ├── main.go
│   ├── Dockerfile
│   └── ...
└── chainsaw/                # Chainsaw test runner utilities
    └── ...
```
276+
277+
Each phase directory produces one container image. Multiple checks are compiled into a single binary and selected via the first argument.
278+
279+
## Testing
280+
281+
### Unit Tests
282+
283+
Use fake K8s clients for isolated testing:
284+
285+
```go
package main

import (
	"context"
	"testing"

	"github.com/NVIDIA/aicr/validators"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/fake"
)

func TestCheckMyComponent(t *testing.T) {
	tests := []struct {
		name    string
		pods    []corev1.Pod
		wantErr bool
	}{
		{
			name: "healthy pods",
			pods: []corev1.Pod{
				{
					ObjectMeta: metav1.ObjectMeta{
						Name: "my-pod",
						// Must match the namespace the check lists pods in;
						// the fake clientset filters by namespace.
						Namespace: "my-namespace",
						Labels:    map[string]string{"app": "my-component"},
					},
					Status: corev1.PodStatus{Phase: corev1.PodRunning},
				},
			},
			wantErr: false,
		},
		{
			name:    "no pods found",
			pods:    []corev1.Pod{},
			wantErr: true,
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Seed the fake clientset with the pods for this case.
			objects := make([]runtime.Object, len(tt.pods))
			for i := range tt.pods {
				objects[i] = &tt.pods[i]
			}
			ctx := &validators.Context{
				Ctx:       context.TODO(),
				Clientset: fake.NewClientset(objects...),
				Namespace: "test",
			}
			err := checkMyComponent(ctx)
			if (err != nil) != tt.wantErr {
				t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
			}
		})
	}
}
```
331+
332+
### Local Testing with Docker
333+
334+
Build and run a validator locally against mounted data:
335+
336+
```shell
# Build the validator image
docker build -f validators/deployment/Dockerfile -t my-validator .

# Run with mounted snapshot and recipe
# (absolute host paths: older Docker releases reject relative bind mounts)
docker run --rm \
  -v "$(pwd)/snapshot.yaml:/data/snapshot/snapshot.yaml" \
  -v "$(pwd)/recipe.yaml:/data/recipe/recipe.yaml" \
  my-validator my-component

# Check exit code
echo $? # 0=pass, 1=fail, 2=skip
```
349+
350+
Note: K8s API calls will fail locally unless you mount a kubeconfig. For checks that only read snapshot/recipe data, this works without cluster access.
351+
352+
## Checklist
353+
354+
When adding a new upstream check:
355+
356+
1. Create `validators/{phase}/my_check.go` implementing `CheckFunc`
357+
2. Register in `validators/{phase}/main.go` dispatch map
358+
3. Add catalog entry in `recipes/validators/catalog.yaml`
359+
4. Add the check name to the recipe's `validation.{phase}.checks[]` (or omit to run all)
360+
5. Write table-driven unit tests with fake K8s clients
361+
6. Test locally with `docker run` and mounted data
362+
7. Run `make test` with race detector
363+
364+
## See Also
365+
366+
- [Validator Extension Guide](../integrator/validator-extension.md) — External validators via `--data`
367+
- [Validator Catalog Reference](../../recipes/validators/README.md) — Catalog schema and entries
368+
- [Validator V2 ADR](../design/002-validatorv2-adr.md) — Architecture decision record
369+
- [CLI Reference](../user/cli-reference.md#aicr-validate) — Validate command flags

docs/integrator/README.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ This section is for integrators who:
1919
| [Kubernetes Deployment](kubernetes-deployment.md) | Self-hosted API server deployment with Kubernetes manifests |
2020
| [EKS Dynamo Networking](eks-dynamo-networking.md) | Security group prerequisites for Dynamo overlays on EKS |
2121
| [Recipe Development](recipe-development.md) | Creating and modifying recipe metadata for custom environments |
22+
| [Validator Extension](validator-extension.md) | Adding custom validators and overriding embedded ones via `--data` |
2223

2324
## Quick Start
2425
