
feat(validation): container-per-validator execution engine#290

Merged
mchmarny merged 10 commits into main from feat/validator-v2
Mar 9, 2026

Conversation

@lalitadithya
Collaborator

@lalitadithya lalitadithya commented Mar 5, 2026

Closes #291
Closes #303
Closes #295
Closes #140

Summary

Replace the Go testing.T-based validation pipeline with a container-per-validator execution engine. Each validator is a standalone OCI container run as a Kubernetes Job.

Validator Contract

  • Exit codes: 0=pass, 1=fail, 2=skip (not applicable)
  • Evidence via stdout (captured in CTRF report)
  • Debug logs via stderr
  • Error context via /dev/termination-log

Key Design Decisions

Decision Rationale
Exit code protocol No custom IPC, no log parsing
One Job per validator Fault isolation; partial results always available
Scoped ClusterRole Purpose-built RBAC with minimum required permissions
Watch API for completion Reliable event delivery, no informer complexity
CTRF reporting Industry-standard format with existing tooling
Catalog in recipes/ Same embed pattern as registry and overlays
Self-selecting validators Exit 2 (skip) when infrastructure unavailable
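The exit-code protocol maps naturally onto CTRF test statuses. The sketch below shows that mapping and a minimal CTRF-shaped report in Go; the field names follow the published CTRF schema (ctrf.io), but the types here are illustrative, not aicr's actual ctrf package:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal CTRF-shaped types (illustrative; see ctrf.io for the schema).
type ctrfTool struct {
	Name string `json:"name"`
}

type ctrfTest struct {
	Name     string `json:"name"`
	Status   string `json:"status"`   // passed | failed | skipped | ...
	Duration int64  `json:"duration"` // milliseconds
}

type ctrfSummary struct {
	Tests   int   `json:"tests"`
	Passed  int   `json:"passed"`
	Failed  int   `json:"failed"`
	Pending int   `json:"pending"`
	Skipped int   `json:"skipped"`
	Other   int   `json:"other"`
	Start   int64 `json:"start"`
	Stop    int64 `json:"stop"`
}

type ctrfResults struct {
	Tool    ctrfTool    `json:"tool"`
	Summary ctrfSummary `json:"summary"`
	Tests   []ctrfTest  `json:"tests"`
}

type ctrfReport struct {
	Results ctrfResults `json:"results"`
}

// statusFromExitCode maps the contract's exit codes onto CTRF statuses:
// 0=passed, 2=skipped (not applicable), anything else=failed.
func statusFromExitCode(code int) string {
	switch code {
	case 0:
		return "passed"
	case 2:
		return "skipped"
	default:
		return "failed"
	}
}

func main() {
	r := ctrfReport{}
	r.Results.Tool.Name = "aicr"
	r.Results.Tests = []ctrfTest{
		{Name: "operator-health", Status: statusFromExitCode(0), Duration: 1200},
	}
	r.Results.Summary = ctrfSummary{Tests: 1, Passed: 1}
	out, _ := json.Marshal(r)
	fmt.Println(string(out))
}
```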

What Changed

Area Changes
pkg/validator/ Orchestrator, catalog loader, CTRF builder, Job deployer, RBAC, result extraction
pkg/validator/catalog/ Catalog types, phase filtering, image registry override
pkg/validator/ctrf/ CTRF types, builder, ConfigMap writer/reader
pkg/validator/job/ SSA-based Job deployer, scoped ClusterRole, Watch-based completion
validators/deployment/ operator-health, expected-resources, helm-values, gpu-operator-version, check-nvidia-smi
validators/conformance/ DRA, gang-scheduling, autoscaling, metrics, gateway, security checks
validators/performance/ NCCL bandwidth, trainer lifecycle
validators/helper/ Shared pod, GPU, resource utilities
validators/chainsaw/ Chainsaw assertion runner
pkg/cli/validate.go Unified --namespace flag, snapshot agent + validator deployment
pkg/snapshotter/agent.go Concurrent log streaming (eliminates 2-min delay)
pkg/constraints/ Extracted constraint evaluation from validator package
recipes/validators/catalog.yaml Declarative validator catalog (embedded via recipes.FS)
docs/design/002-validatorv2-adr.md Architecture decision record
.github/workflows/on-tag.yaml Multi-arch validator image build, manifest, attestation
.github/actions/aicr-build/ GPU test validator image build with testdata
tests/e2e/run.sh Optimized E2E tests (~11 min, down from 30 min timeout)

Why

The v1 validator used go test -c compiled binaries inside K8s Jobs, parsing results from go test -json output with custom string markers. This was:

  • Fragile — log corruption breaks result parsing
  • Tightly coupled — adding a check requires Go code (binary) + image rebuild
  • Monolithic — all checks share one pod; OOM kills everything
  • Non-standard — custom ValidationResult type with no tooling interop

Validation First Principles

Validators must:

  • Be independent of each other (own failure domain)
  • Be executable independent of AICR repo or binary
  • Use Kubernetes-native execution (OCI container as unit of delivery)
  • Have consistent, machine-readable inputs/outputs
  • Be implementable in any language/framework
  • Support versioning independent of binary version
  • Be incrementally adoptable (new validators added without Go code changes)

Test Plan

  • make qualify passes locally
  • Unit tests with -race across all packages
  • E2E tests pass in CI (~11 min)
  • CLI E2E (chainsaw) tests pass
  • GPU smoke test passes
  • On-tag release pipeline builds multi-arch validator images
  • Validator images published to GHCR (public)
  • Local cluster validation tested (Kind + real EKS)
  • GPU conformance/training/inference tests (pending testdata fix in released images)


@mchmarny
Member

mchmarny commented Mar 5, 2026

I do like this approach:

  • Self-contained/Modular - moving validation logic into separate container images makes each component responsible for its own tests. This decouples the validation lifecycle from the core CLI, allowing developers to test individual "validators" in isolation without needing to understand the entire codebase.
  • Language and Tooling Agnostic - since the validator's logic is "locked" inside a container, the system is no longer strictly tied to Go tests. People can use the best tool for the job (e.g., shell scripts, Python, or existing conformance suites) as long as the container follows the contract of providing "evidence" via stdout and status via exit codes.
  • Scalability/Clear Reporting - using K8s Jobs to run validations allows for parallel execution and standardized status reporting. This "generic grammar" makes it easy for the main runner to determine whether a result is a failure or a warning, and to version the API inputs/outputs as the requirements evolve over time.

@xdu31 @iamkhaledh I also think this allows for easy migration to the new model without any changes to the existing Go-based tests. Thoughts?

@mchmarny mchmarny assigned mchmarny and lalitadithya and unassigned mchmarny Mar 5, 2026
@mchmarny mchmarny added this to the M1 - Repo Opening milestone Mar 5, 2026
@yuanchen8911
Contributor

Review findings (ordered by severity):

  1. High: --phase is ignored for validator v2, so v2 always runs all phases.

    • pkg/cli/validate.go:689-692, pkg/cli/validate.go:470
    • parseValidationPhases() is computed, but v2 path always calls ValidateAll(...). This breaks documented behavior and can run much more work than requested (CLI default phase is readiness).
  2. High: v2 multi-phase output overwrites the same output file per phase.

    • pkg/cli/validate.go:477-497, pkg/serializer/writer.go:113
    • Inside the phase loop, a new file writer is created each time; NewFileWriterOrStdout uses os.Create, so only the last phase report remains on disk.
  3. Medium: cleanup=false is not fully honored in v2 for ConfigMaps.

    • pkg/validatorv2/options.go:45-46, pkg/validatorv2/validator.go:99-104, pkg/validatorv2/validator.go:178-183
    • Option/docs say cleanup controls Jobs, ConfigMaps, and RBAC, but ConfigMaps are always deferred for cleanup regardless of v.Cleanup.
  4. Medium: ClusterRoleBinding is global/static and not namespace-safe across runs.

    • pkg/validatorv2/job/rbac.go:36, pkg/validatorv2/job/rbac.go:122, pkg/validatorv2/job/rbac.go:134
    • ClusterRoleBindingName is fixed, and AlreadyExists is ignored. If a run later uses a different namespace, subject namespace may remain stale, causing permission failures for that namespace's ServiceAccount.

Open question:

  • Is v2 intentionally “always all phases”? If yes, CLI/docs should explicitly say --phase is v1-only (currently it appears generic).


@yuanchen8911 yuanchen8911 left a comment


take a look at the comments.

@xdu31
Contributor

xdu31 commented Mar 6, 2026

@lalitadithya @mchmarny @iamkhaledh

Feedback on V2 Architecture

Thanks for the thorough ADR and implementation. The problems identified (monolithic failure domain, tight coupling, fragile IPC) are real. I want to share some observations from the V1 isolation work (#299) that addresses several of the same problems within the existing architecture, and raise some architectural concerns about V2's approach.

V1 Isolation Already Solves the Core Problem

#299 adds a three-tier execution model to V1:

validation:
  deployment:
    checks:
      - expected-resources                     # Tier 1: shared Job (batched)
      - name: expected-resources               # Tier 2: isolated Job (own pod)
        isolated: true
        timeout: 3m
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"
    validators:
      - name: cluster-dns-check               # Tier 3: external BYO container
        image: registry.example.com/cluster-dns-check:v1
        timeout: 2m

Tier 1 (Shared): Non-isolated checks batched into one Job with a combined -run pattern — fast, fewer Jobs.
Tier 2 (Isolated): Each check/constraint gets its own Job — fault isolation without changing the image model.
Tier 3 (External): BYO OCI containers with exit-code protocol (0=pass, non-zero=fail) — same contract as V2.

The isolated flag cascades: individual check > phase > top-level > default (false). Recipe authors control granularity per check — some shared for speed, some isolated for safety.

This has been tested end-to-end on a Kind cluster with all three tiers running simultaneously. Demo at demos/isolation/.

Catalog is Disconnected from Recipe

V2's catalog is //go:embed'd into the binary and runs all validators for a phase unconditionally:

validators := catalog.ForPhase(string(phase))
for _, entry := range validators {
    deployer.DeployJob(ctx, entry)  // runs every catalog entry
}

The recipe's validation.deployment.checks and constraints fields have no effect in V2. The catalog owns the execution plan.

This breaks AICR's core principle: the recipe is the single source of truth for reproducible validation. The workflow is Snapshot -> Recipe -> Validate -> Bundle. The recipe must capture what to validate and what thresholds to enforce so that anyone with the same recipe + snapshot gets the same validation result.

With V2, two teams running the same recipe with different aicr binary versions get different results because the catalog changed between releases. The recipe doesn't record what actually ran — the catalog does, but it's invisible to the user. There's no mechanism for the recipe to select which validators to run, skip, or parameterize.

V1 keeps validation reproducible — everything needed to reproduce the result is declared in the recipe.

Constraint Protocol is Missing

V2 has no concept of constraints. A validator is a container that exits 0 or 1 — there's no way to pass name: X, value: Y from the recipe.

To implement Deployment.gpu-operator.version >= v24.6.0 in V2, each validator would need to independently parse the raw recipe YAML from the mounted ConfigMap, locate the constraint, implement version comparison, and look up values from the snapshot. Every validator reimplements this logic.

V1 provides all of this as shared infrastructure: the check registration framework includes constraint pattern matching (Deployment.*), version comparison operators (==, !=, >=, <=, >, <, ~=), and structured access to snapshot data.
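For illustration, the operator-plus-version evaluation that shared infrastructure would provide could be sketched like this. Helper names are hypothetical, and this omits the pattern matching and ~= operator mentioned above; it is not the actual aicr implementation:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares dotted versions like "v24.6.0" numerically,
// segment by segment. Missing segments are treated as zero.
// Returns -1, 0, or 1.
func compareVersions(a, b string) int {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var ai, bi int
		if i < len(as) {
			ai, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bi, _ = strconv.Atoi(bs[i])
		}
		if ai != bi {
			if ai < bi {
				return -1
			}
			return 1
		}
	}
	return 0
}

// evalConstraint evaluates a constraint value like ">= v24.6.0"
// against an observed version.
func evalConstraint(constraint, observed string) bool {
	parts := strings.SplitN(strings.TrimSpace(constraint), " ", 2)
	if len(parts) != 2 {
		return false
	}
	op, want := parts[0], strings.TrimSpace(parts[1])
	c := compareVersions(observed, want)
	switch op {
	case "==":
		return c == 0
	case "!=":
		return c != 0
	case ">=":
		return c >= 0
	case "<=":
		return c <= 0
	case ">":
		return c > 0
	case "<":
		return c < 0
	}
	return false
}

func main() {
	fmt.Println(evalConstraint(">= v24.6.0", "v24.9.0")) // true
}
```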

"Any Language" Shifts Maintenance Burden to AICR

V2 frames "write validators in any language" as an advantage. In practice, this means AICR becomes responsible for N language toolchains in CI, N dependency trees to scan for CVEs, N test frameworks with different quality standards, N linters with different code review expertise requirements, and no shared scaffolding to enforce unit test coverage.

V1 enforces quality through its registration framework — you can't register a check without a corresponding test function. The scaffolding script generates the check, test, and registration together. make test with -race catches regressions across all checks automatically.

V2's catalog is just a pointer to an image. Nothing enforces that the image has tests, handles edge cases correctly, or follows the exit-code protocol. A broken validator is only discovered at runtime.

V1's external validators: field already provides the "any language" escape hatch for users who truly need it — but the maintenance burden stays with the user, not AICR.

"Declarative Catalog" is Moving Layers, Not Simplifying

The ADR positions the catalog as simpler than V1's Go registration. In practice, adding a validator in V2 requires: write code (the example operator-health is Go using pkg/k8s/client), write a Dockerfile, build and publish the image to a registry, add an entry to catalog.yaml, and rebuild the aicr binary (since the catalog is //go:embed). That's more steps than V1, not fewer.

The catalog is cosmetically declarative but functionally compiled into the same binary.

Per-Validator Images Are an Operational Burden

V1 compiles all checks into one validator image (aicr-validator). V2 requires a separate OCI image per validator.

With N validators, that means N images to build and publish in CI, N base images to patch for CVEs, N image tags to keep in sync with the embedded catalog, N image pulls per validation run, and N images to mirror in air-gapped environments. As checks grow (10 deployment + 5 performance + 3 conformance), the maintenance cost scales linearly.

The example operator-health validator is Go code with the same dependencies as V1 checks — just wrapped in a separate Dockerfile and CI pipeline. It's the same work plus operational overhead.

Duplicate Infrastructure

V2 reimplements Job deployment, RBAC management, watch-based completion waiting, pod log capture, and cleanup in pkg/validatorv2/job/ rather than reusing pkg/validator/agent/ and pkg/k8s/pod/. This creates two parallel codepaths for the same Kubernetes operations.

Other Gaps

  • No shared mode. 10 checks = 10 sequential Jobs with scheduling and image pull overhead each. V1 can batch them in one Job.
  • No resume. V1 has determineStartPhase() for resuming from a failed phase. V2 starts from scratch.
  • cluster-admin RBAC. V2 grants every validator pod full cluster admin. This violates the principle of least privilege — a validator that only needs to list Deployments in one namespace should not have permission to delete anything cluster-wide. V1 uses a scoped custom ClusterRole with minimum required permissions. Yes, scoped RBAC requires manual updates when a new check needs additional permissions, but that's a feature — it forces explicit review of what access each check actually needs.
  • No E2E tests. V2's test plan shows [ ] E2E with real validator container image (post-merge).
  • Hardcoded resource limits. 1 CPU, 1Gi for every validator, not configurable per check.

Suggestion

The V1 isolation branch addresses the fault isolation and BYO container problems within the existing architecture, without breaking recipe reproducibility or multiplying the image maintenance burden. The remaining V2 advantages (termination log, resource limits, three-layer timeouts) are incremental improvements that can be added to V1.

I'd suggest we consider whether V1 + isolation covers the requirements before introducing a parallel validation engine.

@mchmarny
Member

mchmarny commented Mar 6, 2026

NOTE: There appears to be another PR (#299) with significant overlaps with this one. I put a block on that PR as we need to resolve the architectural direction here first, before merging any code.

Now, let's reason about this first in terms of first principles:

  1. Decoupling validation logic from the CLI - This may seem unnatural at first, but it is exactly the right approach IMO. Keeping all checks compiled into the aicr binary means that an update to a single check still requires a full CLI release. This approach decouples the validation lifecycle from the core tool, so component teams can iterate on their validators independently.
  2. Standardized, interoperable reporting - The ADR in this PR introduces CTRF (Common Test Report Format), an industry-standard schema with existing tooling, which was one of the original motivations for rethinking the validation architecture. This does not prevent us from encoding ValidationResult-type content into the result, and it also adds language and tooling agnosticism. Mixing some tests in the AICR repo in Go and some in external images seems like an escape hatch, not a first-class model. The approach proposed in this PR uses a consistent model where every validator is a container with a simple contract (exit code + stdout evidence). This is a fundamental difference: it enables teams to use whatever tool is best for their job without touching the AICR codebase.

Now, more specifically WRT @xdu31 comments:

  • "Recipe reproducibility is broken because the catalog is embedded." — The current v1 model has the same property: checks are compiled into the binary, so different aicr versions already produce different validation results. The catalog makes this explicit and versionable rather than implicit (see 1 above). Recipe-level validator selection is an implementation detail that can be added to #290, not an architectural flaw.
  • "Per-validator images are an operational burden." — This is the standard cloud-native model. Every Kubernetes operator, every Helm chart test hook, every Tekton task follows this pattern. The alternative (monolithic binary with everything compiled in) is what we're trying to move away from. Component teams already build and maintain their own container images — asking them to also own their validation image is a natural extension, not a new burden on AICR. Also, there is no hard 1 check per image requirement.
  • "'Any language' shifts maintenance to AICR." — This inverts the actual ownership model (see 2 above). The whole point is that component teams own their validators end-to-end. AICR defines the contract (exit codes, stdout evidence); teams choose their implementation. AICR does not become responsible for N toolchains — the teams that own the components do, same as they already do for their operators and controllers.
  • "V2 reimplements Job deployment infrastructure." — This is expected for a new package that deliberately avoids coupling to the existing v1 internals. Shared utilities from pkg/k8s/pod can be factored out during review; it's not an argument against the architecture.
  • "cluster-admin RBAC violates least privilege." — Valid feedback, and addressable within #290's architecture. It doesn't change the architectural direction.
  • "V1 has determineStartPhase() for resuming from a failed phase. V2 starts from scratch." — That logic is encoded in the binary; refactoring to the image-based approach does not prevent this. It's up to us to determine the atomic scope of the checks encoded into each image.

Recommendation:

I'd like to see us align on #290's architectural direction first, then incorporate the useful elements from PR #299 (cascading isolation flag, batched execution for speed) as follow-up optimizations.

Overall, it's a +1 from me on the approach proposed in this PR (#290), with some incremental improvements mentioned above.

@mchmarny mchmarny force-pushed the feat/validator-v2 branch from 5564408 to c7d85c3 Compare March 6, 2026 17:02
parseThreshold() split on space and took the first token, which for
a constraint value like ">= 100" yielded ">=" — not a valid float.

Strip comparison operator characters (>=, <=, etc.) before extracting
the numeric value. Handles "450", "450 GB/s", ">= 400", ">= 100 GB/s".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
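A sketch of the fix this commit describes (illustrative, not the exact aicr code): strip leading comparison-operator characters, then take the first token that parses as a number:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseThreshold extracts the numeric threshold from a constraint
// value. Comparison-operator characters (>=, <=, etc.) are stripped
// first, so values like "450", "450 GB/s", ">= 400", and ">= 100 GB/s"
// all yield their numeric part.
func parseThreshold(s string) (float64, bool) {
	cleaned := strings.TrimLeft(strings.TrimSpace(s), "><=!~ ")
	for _, tok := range strings.Fields(cleaned) {
		if v, err := strconv.ParseFloat(tok, 64); err == nil {
			return v, true
		}
	}
	return 0, false
}

func main() {
	for _, s := range []string{"450", "450 GB/s", ">= 400", ">= 100 GB/s"} {
		v, ok := parseThreshold(s)
		fmt.Printf("%q -> %v (%v)\n", s, v, ok)
	}
}
```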
@lalitadithya lalitadithya marked this pull request as ready for review March 7, 2026 20:26
@lalitadithya lalitadithya requested review from a team as code owners March 7, 2026 20:26
lalitadithya and others added 4 commits March 8, 2026 18:25
Both launcher and worker pods in the NCCL TrainingRuntime were missing
tolerations, causing FailedScheduling on clusters with node taints
(e.g., dedicated=worker-workload:NoSchedule).

Add operator: Exists toleration to both replicatedJobs so pods can
schedule on any node regardless of taints.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agent deployer assumed the target namespace already existed. On a brand
new cluster the ServiceAccount creation failed with "namespaces not found".
Add ensureNamespace step (create-or-ignore, matching validator pattern).

@mchmarny mchmarny left a comment


LGTM with the latest changes

@mchmarny
Member

mchmarny commented Mar 9, 2026

There are a large number of comments on this PR, so I'm going to summarize:

Addressed:

  • --phase ignored (@yuanchen8911) — Fixed. CLI calls ValidatePhases(ctx, phases, ...) and filterEntriesByRecipe() scopes execution to recipe-declared checks per phase. Nil recipe or empty checks returns nil (skip).
  • Multi-phase output overwrites file (@yuanchen8911) — Fixed. Phase results are merged into a single CTRF report before writing.
  • cleanup=false not honored for ConfigMaps (@yuanchen8911) — Fixed. Both RBAC and ConfigMap cleanup are gated on v.Cleanup (validator.go:138, 153).
  • ClusterRoleBinding namespace-safety (@yuanchen8911) — Fixed. Uses SSA (Apply with Force: true) which updates the subject namespace on each run (rbac.go:138-163).
  • Catalog runs everything by default (@xdu31) — Fixed. filterEntriesByRecipe() returns nil when recipe is nil, validation is nil, or checks are empty. Only explicitly declared checks run.
  • Dockerfile non-root user (CodeQL DS-0002) — Fixed. All three Dockerfiles have USER nonroot.
  • nvidia-smi-verify-pod.yaml :latest tag (CodeQL KSV-0013) — Fixed. Image is now ${IMAGE} template variable, not hardcoded.
  • ~5,400 lines of tests deleted (@xdu31) — Addressed. New tests across pkg/validator/ packages (catalog, ctrf, job, orchestrator) and coverage at 73.3% (above 70% threshold).
  • All resolved inline comments from @lalitadithya (SSA, generate name, informers, bring-back items, etc.) — All marked resolved.
  • Hardcoded resource limits (@xdu31) — The catalog supports per-validator resource overrides via the Resources field (catalog.go:77-85) with cpu and memory fields. The defaults exist but any catalog entry can override them.
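For clarity, the recipe-scoped filtering behavior described above can be sketched as follows. Types and names are hypothetical, not the actual aicr code; a nil recipe or an empty check list yields nil, i.e. skip:

```go
package main

import "fmt"

// Entry is a hypothetical stand-in for a catalog entry.
type Entry struct {
	Name  string
	Phase string
}

// filterEntriesByRecipe scopes execution to recipe-declared checks for
// a phase. A nil recipe (here: nil map) or an empty check list for the
// phase returns nil, meaning nothing runs.
func filterEntriesByRecipe(entries []Entry, phase string, recipeChecks map[string][]string) []Entry {
	if recipeChecks == nil {
		return nil // nil recipe: skip
	}
	declared := recipeChecks[phase]
	if len(declared) == 0 {
		return nil // no checks declared for this phase: skip
	}
	want := make(map[string]bool, len(declared))
	for _, name := range declared {
		want[name] = true
	}
	var out []Entry
	for _, e := range entries {
		if e.Phase == phase && want[e.Name] {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	entries := []Entry{
		{Name: "operator-health", Phase: "deployment"},
		{Name: "nccl-bandwidth", Phase: "performance"},
	}
	declared := map[string][]string{"deployment": {"operator-health"}}
	fmt.Println(filterEntriesByRecipe(entries, "deployment", declared))
}
```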

Not addressed, and why:

  • cluster-admin RBAC (@xdu31) — The branch uses cluster-admin with a detailed rationale in comments (rbac.go:39-68). TL;DR: the ServiceAccount is ephemeral (created/deleted per run), the caller must already have cluster-admin to create the binding, and validators need access to arbitrary CRDs unknown at compile time — making a scoped role impractical without constant maintenance.
  • External validator support removed (@xdu31) — Both the ExternalValidator and BYO image mechanisms are out of scope at this time. Still, architecturally, the new design does enable external validators via --data extension points.
  • Catalog embedded in binary — no independent lifecycle (@xdu31) — Just like with components, this is an implementation choice for the initial version. External validators can be enabled (see above)
  • Constraint protocol for per-phase constraints (@xdu31) — There is no hard requirement in the new approach on how phases/checks map to container images; it could be 1:1 or N:1. Since validations, just like components, are inherited, each recipe can override as needed — and, just like components, overrides have to be validated.
  • No shared/batched execution mode (@xdu31) — ACK, out of scope for this PR but could be addressed in a follow-up optimization, should we need it.

Unless anyone has additional concerns, I recommend we merge this PR.

@dims
Collaborator

dims commented Mar 9, 2026

@lalitadithya @mchmarny @xdu31 I like the tradeoffs we made in this PR. Having gone through all the comments/responses, I'd support landing this as it will help us and end users in the long term.

+1 to merge this PR.

@yuanchen8911 yuanchen8911 self-requested a review March 9, 2026 14:36

@yuanchen8911 yuanchen8911 left a comment


/lgtm

@xdu31
Contributor

xdu31 commented Mar 9, 2026

1. Validator unit tests excluded from CI

Files: Makefile:167, .github/actions/go-test/action.yml

The Makefile explicitly excludes validators/ from make test:

go list ./... | grep -v -e /tests/chainsaw/ -e /validators

The validators/ directory is part of the root Go module (no separate go.mod), but the ~1,600 lines of tests added in commit 72574753 are never run in CI:

  • validators/chainsaw/runner_test.go (225 lines)
  • validators/conformance/helpers_test.go (416 lines)
  • validators/context_test.go (101 lines)
  • validators/deployment/gpu_operator_version_test.go (118 lines)
  • validators/deployment/helm_values_test.go (228 lines)
  • validators/helper/gpu_test.go (146 lines)
  • validators/performance/nccl_test.go (309 lines)
  • validators/runner_test.go (71 lines)

Impact: Coverage numbers don't reflect validator check logic. Regressions in check implementations won't be caught.

Fix: Add make test-validators target or include validators/ in make test.


2. Documentation dropped — no user/developer guidance

Files: docs/validator/ (entire directory missing)

The previous force push included 5 well-written docs (~990 lines total) that were dropped:

Document Lines Purpose
docs/validator/contract.md 183 Container interface spec: exit codes 0/1/2, /dev/termination-log, stdout=evidence, env vars, volumes, RBAC permissions, timeouts
docs/validator/framework.md 208 How the engine works: phases, catalog, recipe integration, Job lifecycle, CTRF output format
docs/validator/custom-checks.md 295 BYO checks guide: Python/Bash/Go examples, Dockerfile patterns, local testing, versioning
docs/validator/upstream-checks.md 290 Contributing checks: CheckFunc signature, Context object, registration, catalog entry, project layout
docs/validator/README.md 13 Index page linking all docs

What remains: Only docs/design/002-validatorv2-adr.md (architecture decision) and docs/contributor/validations.md (bundler validations — unrelated to V2 engine).

Impact: No one can build a custom check, contribute an upstream check, or understand the container contract without reading source code.

Fix: Re-add the 5 dropped docs. The content existed and was high quality.


Medium Concerns

3. cluster-admin binding persists with --cleanup=false

File: pkg/validator/job/rbac.go (commit a4ea9479)

The validator ServiceAccount is bound to the built-in cluster-admin ClusterRole. The justification is reasonable (validators need to inspect arbitrary CRDs, the caller already has cluster-admin, the binding is ephemeral).

Problem: When --cleanup=false is used for debugging, the aicr-validator ClusterRoleBinding to cluster-admin persists indefinitely. No warning is emitted, no TTL is set, and the security implication is undocumented.

Fix: Warn when --cleanup=false that a cluster-admin binding will persist. Document manual cleanup steps.


4. No local developer workflow for validator images

Files: Makefile, .github/actions/e2e/action.yml

  • make image-validators exists and builds all 3 images
  • CI E2E action has an elaborate build pipeline (host-compiled binaries + COPY-only Dockerfiles for speed)
  • But there's no documented developer workflow for "I changed a check, how do I test it?"
  • make e2e-tilt (local dev) doesn't build/load validator images automatically
  • A developer would need to manually: build images, push to local registry, set AICR_VALIDATOR_IMAGE_REGISTRY

Fix: Document the local dev workflow. Consider a make dev-env-validators target.

@mchmarny
Member

mchmarny commented Mar 9, 2026

response to @xdu31 comments

1. Validator unit tests excluded from CI

These are actual validations that are not easy to exercise in unit tests, but they are covered by the E2E tests.

2. Documentation dropped — no user/developer guidance

Ack, this is a gap we will have to close in subsequent PRs.

3. cluster-admin binding persists with --cleanup=false

Ack, in a subsequent PR we can add a warning when --cleanup=false that a cluster-admin binding will persist, and document the manual cleanup steps.

4. No local developer workflow for validator images

Same as 2

@mchmarny mchmarny merged commit b741e99 into main Mar 9, 2026
85 of 89 checks passed
@mchmarny mchmarny deleted the feat/validator-v2 branch March 9, 2026 16:44
yuanchen8911 added a commit that referenced this pull request Mar 10, 2026
PR #290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR #214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Signed-off-by: yuanchen97@gmail.com
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Mar 10, 2026
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR NVIDIA#214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Signed-off-by: yuanchen97@gmail.com
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Mar 10, 2026
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR NVIDIA#214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
- Robust operator: require healthy workload pods for PASS verdict
- DRA evidence: show allocation details to avoid pending state confusion
- Gateway CRDs: use name-grep instead of unreliable label selector
- Cluster autoscaling: align narrative with configuration-level evidence
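The ELB redaction in the list above amounts to a substitution over the evidence text. A minimal sketch, assuming the common classic-ELB hostname shape (`<name>.<region>.elb.amazonaws.com`); the `redact_elb_hostnames` helper name is hypothetical:

```shell
#!/usr/bin/env bash
# Sketch of auto-redacting public ELB hostnames from evidence output,
# as described above. Helper name and regex are assumptions; the real
# script may cover more hostname variants.
set -euo pipefail

redact_elb_hostnames() {
  # Replace classic-ELB hostnames (<name>.<region>.elb.amazonaws.com)
  # read from stdin with a redaction marker.
  sed -E 's/[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.elb\.amazonaws\.com/<REDACTED-ELB>/g'
}
```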

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention
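The NO_CLEANUP semantics noted in the bug fixes above (pre-run cleanup always runs, post-run cleanup honors the flag) can be sketched as follows; function names and messages are illustrative, not the actual script's:

```shell
#!/usr/bin/env bash
# Sketch of the NO_CLEANUP semantics described above. The function
# names (cleanup, run_collection) and messages are hypothetical.
set -euo pipefail

cleanup() {
  echo "cleaning up evidence namespace"
}

run_collection() {
  echo "collecting evidence"
}

# Pre-run: always start from a clean slate, regardless of NO_CLEANUP.
cleanup

run_collection

# Post-run: keep artifacts around for debugging when NO_CLEANUP is set.
if [ "${NO_CLEANUP:-}" != "true" ]; then
  cleanup
fi
```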

Also fixes YAML indentation in tests/uat/aws/config.yaml.

Signed-off-by: yuanchen97@gmail.com