Commit bf7d9f2

docs: remove --phase readiness, replace ValidationResult with CTRF output
1 parent 8312960 commit bf7d9f2

6 files changed: +234 -254 lines changed

docs/README.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ NVIDIA AI Cluster Runtime (AICR) is a suite of tooling designed to automate the
 | **Component** | A deployable software package (e.g., GPU Operator, Network Operator, cert-manager). Components have versions, Helm sources, and configuration values. |
 | **ComponentRef** | A reference to a component in a recipe, including version, source repository, values file, and dependency references. |
 | **Constraint** | A validation rule in a recipe specifying required system conditions (e.g., `K8s.server.version >= 1.31`, `OS.release.ID == ubuntu`). Constraints can have severity (error/warning), remediation guidance, and units. |
-| **Validation Phase** | A stage of validation in the deployment lifecycle: readiness (infrastructure), deployment (components), performance (system), conformance (workloads). |
+| **Validation Phase** | A stage of validation in the deployment lifecycle: deployment (components), performance (system), conformance (workloads). Readiness constraints are evaluated implicitly before any phase. |
 | **ValidationConfig** | Configuration in a recipe defining phase-specific checks, constraints, expected resources, and node selection for validation. |
 | **Measurement** | A captured data point from the system organized by type (K8s, OS, GPU, SystemD), subtype, and key-value readings. |
 | **Specificity** | A score indicating how specific a recipe's criteria is (number of non-"any" fields). More specific recipes are applied later during merge. |
@@ -179,7 +179,7 @@ aicr recipe --snapshot snapshot.yaml --intent training --platform kubeflow
 ### Validate Configuration

 ```shell
-# Validate readiness phase (default)
+# Validate recipe against snapshot (readiness constraints run implicitly)
 aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

 # Validate all phases

docs/integrator/data-flow.md

Lines changed: 33 additions & 24 deletions
@@ -401,30 +401,39 @@ aicr validate \

 ### Validation Output

-```yaml
-apiVersion: aicr.nvidia.com/v1alpha1
-kind: ValidationResult
-metadata:
-  created: "2025-01-15T10:30:00Z"
-summary:
-  total: 5
-  passed: 4
-  failed: 1
-  skipped: 0
-results:
-  - constraint: "K8s.server.version>=1.28"
-    status: passed
-    expected: ">=1.28"
-    actual: "1.33.5"
-  - constraint: "OS.release.ID==ubuntu"
-    status: passed
-    expected: "ubuntu"
-    actual: "ubuntu"
-  - constraint: "GPU.driver.version>=570.00"
-    status: failed
-    expected: ">=570.00"
-    actual: "560.28.03"
-    message: "version 560.28.03 does not satisfy >=570.00"
+Results are output in [CTRF](https://ctrf.io/) (Common Test Report Format) JSON:
+
+```json
+{
+  "reportFormat": "CTRF",
+  "specVersion": "0.0.1",
+  "timestamp": "2026-03-10T20:10:44Z",
+  "generatedBy": "aicr",
+  "results": {
+    "tool": { "name": "aicr", "version": "v0.10.3-next" },
+    "summary": {
+      "tests": 16, "passed": 13, "failed": 0, "skipped": 3,
+      "pending": 0, "other": 0,
+      "start": 1773173400872, "stop": 1773173799002
+    },
+    "tests": [
+      {
+        "name": "operator-health",
+        "status": "passed",
+        "duration": 0,
+        "suite": ["deployment"],
+        "stdout": ["Found 1 gpu-operator pod(s)", "Running: 1/1"]
+      },
+      {
+        "name": "nccl-all-reduce-bw",
+        "status": "passed",
+        "duration": 234000,
+        "suite": ["performance"],
+        "stdout": ["NCCL All Reduce bandwidth: 488.37 GB/s", "Constraint: >= 100 → true"]
+      }
+    ]
+  }
+}
 ```

 ### CI/CD Integration
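
A minimal sketch of how a consumer might read the CTRF report introduced in this hunk: it tallies test statuses per `suite` entry, which in the sample corresponds to the validation phase. The function name `summarize_by_suite` and the trimmed `SAMPLE` dictionary are illustrative assumptions, not part of AICR; only the field names (`results`, `tests`, `suite`, `status`) come from the sample output.

```python
from collections import Counter, defaultdict

# Trimmed, hypothetical copy of the sample report (stdout and summary omitted).
SAMPLE = {
    "reportFormat": "CTRF",
    "results": {
        "tests": [
            {"name": "operator-health", "status": "passed", "suite": ["deployment"]},
            {"name": "nccl-all-reduce-bw", "status": "passed", "suite": ["performance"]},
        ]
    },
}


def summarize_by_suite(report):
    """Count test statuses per suite (validation phase) in a CTRF-style report."""
    counts = defaultdict(Counter)
    for test in report["results"]["tests"]:
        # "suite" is a list in the sample; use a placeholder label if it is absent.
        for suite in test.get("suite") or ["unknown"]:
            counts[suite][test["status"]] += 1
    return {suite: dict(tally) for suite, tally in counts.items()}


if __name__ == "__main__":
    for suite, tally in summarize_by_suite(SAMPLE).items():
        print(suite, tally)
```

In practice the dictionary would presumably come from `json.load` on the file written via `--output`, rather than from an inline literal.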

docs/integrator/recipe-development.md

Lines changed: 1 addition & 1 deletion
@@ -275,7 +275,7 @@ validation:
 checks: [nccl-bandwidth-test]
 ```

-**Phases:** `readiness`, `deployment`, `performance`, `conformance`
+**Phases:** `deployment`, `performance`, `conformance` (readiness constraints are evaluated implicitly)

 ### Testing

docs/user/cli-reference.md

Lines changed: 84 additions & 103 deletions
@@ -432,7 +432,7 @@ aicr validate [flags]
 |------|-------|------|-------------|
 | `--recipe` | `-r` | string | Path/URI to recipe file containing constraints (required) |
 | `--snapshot` | `-s` | string | Path/URI to snapshot file containing measurements (required) |
-| `--phase` | | string | Validation phase to run: readiness (default), deployment, performance, conformance, all |
+| `--phase` | | string | Validation phase to run: deployment, performance, conformance, all (default: all) |
 | `--fail-on-error` | | bool | Exit with non-zero status if any constraint fails (default: true) |
 | `--output` | `-o` | string | Output destination (file or stdout, default: stdout) |
 | `--format` | `-t` | string | Output format: json, yaml, table (default: yaml) |
@@ -449,12 +449,13 @@ Validation can be run in different phases to validate different aspects of the d

 | Phase | Description | When to Run |
 |-------|-------------|-------------|
-| `readiness` | Evaluates constraints inline against snapshot (K8s version, OS, kernel) — no checks or Jobs | Before deploying any components |
 | `deployment` | Validates component deployment health and expected resources | After deploying components |
 | `performance` | Validates system performance and network fabric health | After components are running |
 | `conformance` | Validates workload-specific requirements and conformance | Before running production workloads |
 | `all` | Runs all phases sequentially with dependency logic | Complete end-to-end validation |

+> **Note:** Readiness constraints (K8s version, OS, kernel) are always evaluated implicitly before any phase runs. If readiness fails, validation stops before deploying any Jobs.
+
 **Phase Dependencies:**
 - Phases run sequentially when using `--phase all`
 - If a phase fails, subsequent phases are skipped
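
The readiness note and the phase-dependency rules in this hunk can be sketched as a small sequential loop. This is an illustration only, assuming that phases which never run are reported as `skipped`; the function `plan_phases` and its inputs are hypothetical and do not reflect AICR's actual implementation.

```python
from typing import Dict

# Phase order as listed in the table; readiness is no longer a phase of its own.
PHASES = ["deployment", "performance", "conformance"]


def plan_phases(readiness_ok: bool, outcomes: Dict[str, bool]) -> Dict[str, str]:
    """Return a status per phase, given hypothetical pass/fail outcomes.

    Readiness is checked first; once readiness or any phase fails,
    every later phase is marked as skipped (an assumption for this sketch).
    """
    statuses = {}
    blocked = not readiness_ok
    for phase in PHASES:
        if blocked:
            statuses[phase] = "skipped"
            continue
        passed = outcomes.get(phase, True)
        statuses[phase] = "passed" if passed else "failed"
        if not passed:
            blocked = True  # subsequent phases are skipped
    return statuses


if __name__ == "__main__":
    print(plan_phases(True, {"performance": False}))
```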
@@ -486,7 +487,7 @@ Constraints use fully qualified measurement paths: `{Type}.{Subtype}.{Key}`
 **Examples:**

 ```shell
-# Validate snapshot against recipe (default: readiness phase)
+# Validate snapshot against recipe (readiness constraints run implicitly)
 aicr validate --recipe recipe.yaml --snapshot snapshot.yaml

 # Validate specific phase
@@ -510,14 +511,7 @@ aicr validate \
 aicr validate \
   --recipe recipe.yaml \
   --snapshot cm://gpu-operator/aicr-snapshot \
-  --output validation-results.yaml
-
-# Validate readiness phase before installing components
-aicr validate \
-  --recipe recipe.yaml \
-  --snapshot snapshot.yaml \
-  --phase readiness \
-  --fail-on-error
+  --output validation-results.json

 # Validate deployment phase after components are installed
 aicr validate \
@@ -544,101 +538,88 @@ aicr validate \
   --kubeconfig ~/.kube/prod-cluster
 ```

-**Output Structure (Readiness Phase):**
-```yaml
-apiVersion: aicr.nvidia.com/v1alpha1
-kind: ValidationResult
-metadata:
-  timestamp: "2025-12-31T10:30:00Z"
-  version: v0.14.0
-  recipeSource: recipe.yaml
-  snapshotSource: cm://gpu-operator/aicr-snapshot
-summary:
-  passed: 5
-  failed: 0
-  skipped: 0
-  total: 5
-  status: pass
-  duration: 20.5µs
-phases:
-  readiness:
-    status: pass
-    constraints:
-      - name: K8s.server.version
-        expected: '>= 1.30'
-        actual: v1.30.14-eks-3025e55
-        status: passed
-      - name: OS.release.ID
-        expected: ubuntu
-        actual: ubuntu
-        status: passed
-    duration: 20.5µs
-```
-
-**Output Structure (All Phases):**
-```yaml
-apiVersion: aicr.nvidia.com/v1alpha1
-kind: ValidationResult
-metadata:
-  timestamp: "2025-12-31T10:30:00Z"
-  version: v0.14.0
-  recipeSource: recipe.yaml
-  snapshotSource: snapshot.yaml
-summary:
-  passed: 3
-  failed: 0
-  skipped: 1
-  total: 4
-  status: pass
-  duration: 58.4µs
-phases:
-  readiness:
-    status: pass
-    constraints:
-      - name: K8s.server.version
-        expected: '>= 1.32.4'
-        actual: v1.35.0
-        status: passed
-      - name: OS.release.ID
-        expected: ubuntu
-        actual: ubuntu
-        status: passed
-    duration: 20.7µs
-  deployment:
-    status: pass
-    checks:
-      - name: gpu-operator.version
-        status: pass
-      - name: expected-resources
-        status: pass
-    duration: 1.2µs
-  performance:
-    status: pass
-    checks:
-      - name: nccl-bandwidth-test
-        status: pass
-      - name: fabric-health-check
-        status: pass
-    duration: 1.2µs
-  conformance:
-    status: skipped
-    reason: conformance phase not configured in recipe
-    duration: 0.8µs
-```
-
-**Validation Statuses:**
+**Output Structure ([CTRF](https://ctrf.io/) JSON):**
+
+Results are output in CTRF (Common Test Report Format) — an industry-standard schema for test reporting.
+
+```json
+{
+  "reportFormat": "CTRF",
+  "specVersion": "0.0.1",
+  "timestamp": "2026-03-10T20:10:44Z",
+  "generatedBy": "aicr",
+  "results": {
+    "tool": {
+      "name": "aicr",
+      "version": "v0.10.3-next"
+    },
+    "summary": {
+      "tests": 16,
+      "passed": 13,
+      "failed": 0,
+      "skipped": 3,
+      "pending": 0,
+      "other": 0,
+      "start": 1773173400872,
+      "stop": 1773173799002
+    },
+    "tests": [
+      {
+        "name": "operator-health",
+        "status": "passed",
+        "duration": 0,
+        "suite": ["deployment"],
+        "stdout": ["Found 1 gpu-operator pod(s)", "Running: 1/1"]
+      },
+      {
+        "name": "expected-resources",
+        "status": "passed",
+        "duration": 0,
+        "suite": ["deployment"],
+        "stdout": ["All expected resources are healthy"]
+      },
+      {
+        "name": "nccl-all-reduce-bw",
+        "status": "passed",
+        "duration": 234000,
+        "suite": ["performance"],
+        "stdout": ["NCCL All Reduce bandwidth: 488.37 GB/s", "Constraint: >= 100 → true"]
+      },
+      {
+        "name": "dra-support",
+        "status": "passed",
+        "duration": 8000,
+        "suite": ["conformance"],
+        "stdout": ["DRA GPU allocation successful"]
+      },
+      {
+        "name": "cluster-autoscaling",
+        "status": "skipped",
+        "duration": 0,
+        "suite": ["conformance"],
+        "stdout": ["SKIP reason=\"Karpenter not found\""]
+      }
+    ]
+  }
+}
+```
+
+> **Note:** The `tests` array above is truncated for brevity. A full validation run produces one entry per check across all phases. Each entry includes `stdout` with detailed diagnostic output.
+
+**Test Statuses:**
 | Status | Description |
 |--------|-------------|
-| `passed` | Constraint satisfied |
-| `failed` | Constraint not satisfied |
-| `skipped` | Constraint could not be evaluated (missing data, invalid path) |
+| `passed` | Check or constraint passed |
+| `failed` | Check or constraint failed |
+| `skipped` | Check could not be evaluated (missing data, no-cluster mode) |
+| `other` | Unexpected outcome (crash, OOM, timeout) |

-**Summary Status:**
-| Status | Description |
-|--------|-------------|
-| `pass` | All constraints passed |
-| `fail` | One or more constraints failed |
-| `partial` | Some constraints skipped, none failed |
+**Exit Codes:**
+| Code | Description |
+|------|-------------|
+| `0` | All checks passed |
+| `1` | One or more checks failed (when `--fail-on-error` is set) |
+| `2` | Invalid input (bad flags, missing recipe/snapshot) |

 ---
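
For CI scripts that branch on the exit codes listed in this hunk, a wrapper along the following lines may help. It is a sketch, assuming `aicr` is on `PATH`; `describe_exit` and `run_validate` are hypothetical helpers, and only the `--recipe`, `--snapshot`, and `--phase` flags are taken from the CLI reference.

```python
import subprocess

# Meanings copied from the Exit Codes table; anything else is treated as unexpected.
EXIT_MEANINGS = {
    0: "all checks passed",
    1: "one or more checks failed",
    2: "invalid input",
}


def describe_exit(code: int) -> str:
    """Translate an `aicr validate` exit code into a short message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_validate(recipe: str, snapshot: str, phase: str = "all") -> int:
    """Run `aicr validate` and return its exit code (requires aicr on PATH)."""
    cmd = ["aicr", "validate", "--recipe", recipe,
           "--snapshot", snapshot, "--phase", phase]
    # check=False: a non-zero exit is an expected outcome, not an exception.
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Only the mapping is exercised here; run_validate needs a real aicr binary.
    for code in (0, 1, 2):
        print(code, describe_exit(code))
```

Note that, per the table, exit code `1` is only produced when `--fail-on-error` is set, so a pipeline that disables it would need to inspect the report itself.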

site/docs/integrator/data-flow.md

Lines changed: 33 additions & 24 deletions
@@ -405,30 +405,39 @@ aicr validate \

 ### Validation Output

-```yaml
-apiVersion: aicr.nvidia.com/v1alpha1
-kind: ValidationResult
-metadata:
-  created: "2025-01-15T10:30:00Z"
-summary:
-  total: 5
-  passed: 4
-  failed: 1
-  skipped: 0
-results:
-  - constraint: "K8s.server.version>=1.28"
-    status: passed
-    expected: ">=1.28"
-    actual: "1.33.5"
-  - constraint: "OS.release.ID==ubuntu"
-    status: passed
-    expected: "ubuntu"
-    actual: "ubuntu"
-  - constraint: "GPU.driver.version>=570.00"
-    status: failed
-    expected: ">=570.00"
-    actual: "560.28.03"
-    message: "version 560.28.03 does not satisfy >=570.00"
+Results are output in [CTRF](https://ctrf.io/) (Common Test Report Format) JSON:
+
+```json
+{
+  "reportFormat": "CTRF",
+  "specVersion": "0.0.1",
+  "timestamp": "2026-03-10T20:10:44Z",
+  "generatedBy": "aicr",
+  "results": {
+    "tool": { "name": "aicr", "version": "v0.10.3-next" },
+    "summary": {
+      "tests": 16, "passed": 13, "failed": 0, "skipped": 3,
+      "pending": 0, "other": 0,
+      "start": 1773173400872, "stop": 1773173799002
+    },
+    "tests": [
+      {
+        "name": "operator-health",
+        "status": "passed",
+        "duration": 0,
+        "suite": ["deployment"],
+        "stdout": ["Found 1 gpu-operator pod(s)", "Running: 1/1"]
+      },
+      {
+        "name": "nccl-all-reduce-bw",
+        "status": "passed",
+        "duration": 234000,
+        "suite": ["performance"],
+        "stdout": ["NCCL All Reduce bandwidth: 488.37 GB/s", "Constraint: >= 100 → true"]
+      }
+    ]
+  }
+}
 ```

 ### CI/CD Integration
