feat(validation): container-per-validator execution engine#290
Conversation
I do like this approach: @xdu31 @iamkhaledh I also think this allows for easy migration to the new model without any changes to the existing Go-based tests. Thoughts?
yuanchen8911 left a comment: take a look at the comments.
@lalitadithya @mchmarny @iamkhaledh Feedback on V2 Architecture

Thanks for the thorough ADR and implementation. The problems identified (monolithic failure domain, tight coupling, fragile IPC) are real. I want to share some observations from the V1 isolation work (#299), which addresses several of the same problems within the existing architecture, and raise some architectural concerns about V2's approach.

V1 Isolation Already Solves the Core Problem

#299 adds a three-tier execution model to V1:

    validation:
      deployment:
        checks:
          - expected-resources            # Tier 1: shared Job (batched)
          - name: expected-resources      # Tier 2: isolated Job (own pod)
            isolated: true
            timeout: 3m
            constraints:
              - name: Deployment.gpu-operator.version
                value: ">= v24.6.0"
        validators:
          - name: cluster-dns-check       # Tier 3: external BYO container
            image: registry.example.com/cluster-dns-check:v1
            timeout: 2m

Tier 1 (Shared) batches non-isolated checks into one Job; Tier 2 runs an isolated check in its own Job/pod; Tier 3 runs an external BYO container. This has been tested end-to-end on a Kind cluster with all three tiers running simultaneously.

Catalog is Disconnected from Recipe

V2's catalog alone drives execution:

    validators := catalog.ForPhase(string(phase))
    for _, entry := range validators {
        deployer.DeployJob(ctx) // runs every catalog entry
    }

The recipe's declared checks are not consulted. This breaks AICR's core principle: the recipe is the single source of truth for reproducible validation. The workflow is Snapshot -> Recipe -> Validate -> Bundle. The recipe must capture what to validate and what thresholds to enforce, so that anyone with the same recipe + snapshot gets the same validation result. With V2, two teams running the same recipe with different catalogs can get different validation results. V1 keeps validation reproducible — everything needed to reproduce the result is declared in the recipe.

Constraint Protocol is Missing

V2 has no concept of constraints. A validator is a container that exits 0 or 1 — there's no way to pass a constraint like ">= v24.6.0" into it. To implement threshold checks, every validator must reimplement its own constraint parsing. V1 provides all of this as shared infrastructure: the check registration framework includes constraint pattern matching.

"Any Language" Shifts Maintenance Burden to AICR

V2 frames "write validators in any language" as an advantage. In practice, this means AICR becomes responsible for N language toolchains in CI, N dependency trees to scan for CVEs, N test frameworks with different quality standards, N linters with different code-review expertise requirements, and no shared scaffolding to enforce unit test coverage.

V1 enforces quality through its registration framework — you can't register a check without a corresponding test function. The scaffolding script generates the check, test, and registration together. V2's catalog is just a pointer to an image. Nothing enforces that the image has tests, handles edge cases correctly, or follows the exit-code protocol. A broken validator is only discovered at runtime. V1's external validators tier already covers the BYO container use case.

"Declarative Catalog" is Moving Layers, Not Simplifying

The ADR positions the catalog as simpler than V1's Go registration. In practice, adding a validator in V2 requires writing the check code, building and publishing a container image, and adding a catalog entry that ships embedded in the CLI. The catalog is cosmetically declarative but functionally compiled into the same binary.
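On the constraint protocol point above: a constraint value like ">= v24.6.0" can be handled by a small amount of shared parsing code. A minimal sketch, assuming hypothetical function names (not the actual #299 implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConstraint splits an expression such as ">= v24.6.0" into its
// comparison operator and target value.
func parseConstraint(expr string) (op, val string) {
	for _, candidate := range []string{">=", "<=", "==", ">", "<"} {
		if strings.HasPrefix(expr, candidate) {
			return candidate, strings.TrimSpace(expr[len(candidate):])
		}
	}
	return "==", strings.TrimSpace(expr)
}

// compareVersions compares dotted versions numerically, segment by
// segment, ignoring a leading "v". Returns -1, 0, or 1.
func compareVersions(a, b string) int {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x != y {
			if x < y {
				return -1
			}
			return 1
		}
	}
	return 0
}

// satisfies reports whether an actual version meets the constraint.
func satisfies(actual, constraint string) bool {
	op, want := parseConstraint(constraint)
	c := compareVersions(actual, want)
	switch op {
	case ">=":
		return c >= 0
	case "<=":
		return c <= 0
	case ">":
		return c > 0
	case "<":
		return c < 0
	default:
		return c == 0
	}
}

func main() {
	fmt.Println(satisfies("v24.9.1", ">= v24.6.0")) // true
	fmt.Println(satisfies("v24.3.0", ">= v24.6.0")) // false
}
```

The point being made is that this logic belongs in one shared package, not copied into every validator container.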
Per-Validator Images Are an Operational Burden

V1 compiles all checks into one validator image. With N validators, V2 means N images to build and publish in CI, N base images to patch for CVEs, N image tags to keep in sync with the embedded catalog, N image pulls per validation run, and N images to mirror in air-gapped environments. As checks grow (10 deployment + 5 performance + 3 conformance), the maintenance cost scales linearly.

Duplicate Infrastructure

V2 reimplements Job deployment, RBAC management, watch-based completion waiting, pod log capture, and cleanup alongside the equivalent V1 machinery.

Other Gaps
Suggestion

The V1 isolation branch addresses the fault isolation and BYO container problems within the existing architecture, without breaking recipe reproducibility or multiplying the image maintenance burden. The remaining V2 advantages (termination log, resource limits, three-layer timeouts) are incremental improvements that can be added to V1. I'd suggest we consider whether V1 + isolation covers the requirements before introducing a parallel validation engine.
Now, let's reason about this first in terms of first principles:
Now, more specifically WRT @xdu31 comments:
Recommendation: I'd like to see us align on #290's architectural direction first, then incorporate the useful elements from PR #299 (cascading isolation flag, batched execution for speed) as follow-up optimizations. Overall, it's a +1 from me on the approach proposed in this PR (#290) with some incremental improvements mentioned above.
Resolved review threads (outdated):
tests/chainsaw/cli/cuj1-training/assert-validate-readiness.yaml
tests/uat/aws/tests/cuj1-training/assert-validate-readiness.yaml
tests/uat/aws/tests/cuj2-inference/assert-validate-readiness.yaml
parseThreshold() split on space and took the first token, which for a constraint value like ">= 100" yielded ">=" — not a valid float. Strip comparison operator characters (>=, <=, etc.) before extracting the numeric value. Handles "450", "450 GB/s", ">= 400", ">= 100 GB/s". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
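The fixed parsing described in this commit might look like the following sketch (the function name comes from the commit message; the body here is illustrative, not the actual diff):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseThreshold extracts the numeric value from a constraint string.
// Comparison operator characters are stripped first, so inputs like
// ">= 100" no longer yield ">=" as the candidate token.
// Handles "450", "450 GB/s", ">= 400", ">= 100 GB/s".
func parseThreshold(value string) (float64, error) {
	// Drop leading comparison operator characters (>=, <=, >, <, =).
	cleaned := strings.TrimLeft(value, "><= ")
	// The numeric value is the first remaining token; any unit suffix
	// such as "GB/s" is ignored.
	fields := strings.Fields(cleaned)
	if len(fields) == 0 {
		return 0, fmt.Errorf("no numeric value in %q", value)
	}
	return strconv.ParseFloat(fields[0], 64)
}

func main() {
	for _, v := range []string{"450", "450 GB/s", ">= 400", ">= 100 GB/s"} {
		n, err := parseThreshold(v)
		fmt.Println(v, "->", n, err)
	}
}
```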
Both launcher and worker pods in the NCCL TrainingRuntime were missing tolerations, causing FailedScheduling on clusters with node taints (e.g., dedicated=worker-workload:NoSchedule). Add operator: Exists toleration to both replicatedJobs so pods can schedule on any node regardless of taints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
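The fix described above can be sketched as an allow-all toleration on each replicatedJob's pod template (surrounding structure abbreviated; a toleration with only `operator: Exists` matches any taint):

```yaml
replicatedJobs:
  - name: launcher
    template:
      spec:
        template:
          spec:
            # Tolerate any taint so pods can schedule on tainted nodes,
            # e.g. dedicated=worker-workload:NoSchedule.
            tolerations:
              - operator: Exists
```

The same stanza is repeated for the worker replicatedJob.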
Agent deployer assumed the target namespace already existed. On a brand new cluster the ServiceAccount creation failed with "namespaces not found". Add ensureNamespace step (create-or-ignore, matching validator pattern).
mchmarny left a comment: LGTM with the latest changes.
Large number of comments on this PR, so I'm going to summarize.
Addressed:
Not addressed, and why:
Unless anyone has additional concerns, I recommend we merge this PR.

@lalitadithya @mchmarny @xdu31 I like the tradeoffs we made in this PR. Having gone through all the comments/responses, I'd support landing this as it will help us and end users in the long term. +1 to merge this PR.
1. Validator unit tests excluded from CI
Files: the Makefile explicitly excludes the validator packages from the test run:

    go list ./... | grep -v -e /tests/chainsaw/ -e /validators

Impact: coverage numbers don't reflect validator check logic. Regressions in check implementations won't be caught.
Fix: add the validators packages back to the test list.

2. Documentation dropped — no user/developer guidance
Files: the previous force push included 5 well-written docs (~990 lines total) that were dropped. What remains is only the design ADR.
Impact: no one can build a custom check, contribute an upstream check, or understand the container contract without reading source code.
Fix: re-add the 5 dropped docs. The content existed and was high quality.

Medium Concerns
3. cluster-admin binding persists
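For finding 1, the Makefile change might look like this sketch (variable and target names are hypothetical, not the project's actual Makefile):

```make
# Before: validator packages filtered out of the test run
# TEST_PKGS := $(shell go list ./... | grep -v -e /tests/chainsaw/ -e /validators)

# After: keep only the chainsaw e2e exclusion
TEST_PKGS := $(shell go list ./... | grep -v -e /tests/chainsaw/)

test:
	go test -race -cover $(TEST_PKGS)
```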
Response to @xdu31's comments:
These are actual validations that are not easy to cover in unit tests, but they are covered in E2E tests.
Ack, this is a gap we will have to close in subsequent PRs.
Ack, we can add a warning in subsequent PRs.
Same as 2.
PR #290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR #214 during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
- Robust operator: require healthy workload pods for PASS verdict
- DRA evidence: show allocation details to avoid pending state confusion
- Gateway CRDs: use name-grep instead of unreliable label selector
- Cluster autoscaling: align narrative with configuration-level evidence

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Also fixes YAML indentation in tests/uat/aws/config.yaml.

Signed-off-by: yuanchen97@gmail.com
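The port-forward retry pattern mentioned in the DCGM fix reduces to a small retry helper; a sketch (names hypothetical, not the actual script code):

```shell
# retry N CMD...: run CMD until it succeeds, up to N attempts,
# sleeping one second between tries.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# In the evidence script this would wrap the metrics probe behind the
# port-forward, e.g.: retry 5 curl -sf http://localhost:9400/metrics
retry 3 true && echo "retry: ok"
```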
Closes #291
Closes #303
Closes #295
Closes #140
Summary
Replace the Go testing.T-based validation pipeline with a container-per-validator execution engine. Each validator is a standalone OCI container run as a Kubernetes Job.

Validator Contract
- Exit codes: 0 = pass, 1 = fail, 2 = skip (not applicable)
- Result message reported via /dev/termination-log

Key Design Decisions
- Validator catalog embedded under recipes/

What Changed
- pkg/validator/ — engine packages: pkg/validator/catalog/, pkg/validator/ctrf/, pkg/validator/job/
- validators/deployment/, validators/conformance/, validators/performance/, validators/helper/, validators/chainsaw/
- pkg/cli/validate.go — --namespace flag, snapshot agent + validator deployment
- pkg/snapshotter/agent.go, pkg/constraints/
- recipes/validators/catalog.yaml (embedded via recipes.FS)
- docs/design/002-validatorv2-adr.md
- .github/workflows/on-tag.yaml, .github/actions/aicr-build/, tests/e2e/run.sh

Why
The v1 validator used go test -c compiled binaries inside K8s Jobs, parsing results from go test -json output with custom string markers. This was fragile, and results used a custom ValidationResult type with no tooling interop.

Validation First Principles
Validators must:
Test Plan
- make qualify passes locally
- -race across all packages