feat(validation): container-per-validator execution engine#290
Conversation
I do like this approach: @xdu31 @iamkhaledh I also think this allows for easy migration to the new model without any changes to the existing Go-based tests. Thoughts?
yuanchen8911 left a comment: take a look at the comments.
@lalitadithya @mchmarny @iamkhaledh Feedback on V2 Architecture

Thanks for the thorough ADR and implementation. The problems identified (monolithic failure domain, tight coupling, fragile IPC) are real. I want to share some observations from the V1 isolation work (#299), which addresses several of the same problems within the existing architecture, and raise some architectural concerns about V2's approach.

V1 Isolation Already Solves the Core Problem

#299 adds a three-tier execution model to V1:

    validation:
      deployment:
        checks:
          - expected-resources            # Tier 1: shared Job (batched)
          - name: expected-resources      # Tier 2: isolated Job (own pod)
            isolated: true
            timeout: 3m
            constraints:
              - name: Deployment.gpu-operator.version
                value: ">= v24.6.0"
        validators:
          - name: cluster-dns-check       # Tier 3: external BYO container
            image: registry.example.com/cluster-dns-check:v1
            timeout: 2m

Tier 1 (Shared) batches non-isolated checks into one Job; Tier 2 runs an isolated check in its own Job/pod; Tier 3 runs an external BYO container. This has been tested end-to-end on a Kind cluster with all three tiers running simultaneously.

Catalog is Disconnected from Recipe

V2's catalog alone drives execution:

    validators := catalog.ForPhase(string(phase))
    for _, entry := range validators {
        deployer.DeployJob(ctx) // runs every catalog entry
    }

The recipe's declared checks are not consulted. This breaks AICR's core principle: the recipe is the single source of truth for reproducible validation. The workflow is Snapshot -> Recipe -> Validate -> Bundle. The recipe must capture what to validate and what thresholds to enforce, so that anyone with the same recipe + snapshot gets the same validation result. With V2, two teams running the same recipe with different catalogs can get different validation results. V1 keeps validation reproducible — everything needed to reproduce the result is declared in the recipe.

Constraint Protocol is Missing

V2 has no concept of constraints. A validator is a container that exits 0 or 1 — there's no way to pass a constraint like ">= v24.6.0" into it. To implement threshold checks, every validator must reimplement its own constraint parsing. V1 provides all of this as shared infrastructure: the check registration framework includes constraint pattern matching.

"Any Language" Shifts Maintenance Burden to AICR

V2 frames "write validators in any language" as an advantage. In practice, this means AICR becomes responsible for N language toolchains in CI, N dependency trees to scan for CVEs, N test frameworks with different quality standards, N linters with different code-review expertise requirements, and no shared scaffolding to enforce unit test coverage.

V1 enforces quality through its registration framework — you can't register a check without a corresponding test function. The scaffolding script generates the check, test, and registration together. V2's catalog is just a pointer to an image. Nothing enforces that the image has tests, handles edge cases correctly, or follows the exit-code protocol. A broken validator is only discovered at runtime. V1's external validators tier already covers the BYO container use case.

"Declarative Catalog" is Moving Layers, Not Simplifying

The ADR positions the catalog as simpler than V1's Go registration. In practice, adding a validator in V2 requires writing the check code, building and publishing a container image, and adding a catalog entry that ships embedded in the CLI. The catalog is cosmetically declarative but functionally compiled into the same binary.
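On the constraint protocol point above: a constraint value like ">= v24.6.0" can be handled by a small amount of shared parsing code. A minimal sketch, assuming hypothetical function names (not the actual #299 implementation):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConstraint splits an expression such as ">= v24.6.0" into its
// comparison operator and target value.
func parseConstraint(expr string) (op, val string) {
	for _, candidate := range []string{">=", "<=", "==", ">", "<"} {
		if strings.HasPrefix(expr, candidate) {
			return candidate, strings.TrimSpace(expr[len(candidate):])
		}
	}
	return "==", strings.TrimSpace(expr)
}

// compareVersions compares dotted versions numerically, segment by
// segment, ignoring a leading "v". Returns -1, 0, or 1.
func compareVersions(a, b string) int {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		var x, y int
		if i < len(as) {
			x, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			y, _ = strconv.Atoi(bs[i])
		}
		if x != y {
			if x < y {
				return -1
			}
			return 1
		}
	}
	return 0
}

// satisfies reports whether an actual version meets the constraint.
func satisfies(actual, constraint string) bool {
	op, want := parseConstraint(constraint)
	c := compareVersions(actual, want)
	switch op {
	case ">=":
		return c >= 0
	case "<=":
		return c <= 0
	case ">":
		return c > 0
	case "<":
		return c < 0
	default:
		return c == 0
	}
}

func main() {
	fmt.Println(satisfies("v24.9.1", ">= v24.6.0")) // true
	fmt.Println(satisfies("v24.3.0", ">= v24.6.0")) // false
}
```

The point being made is that this logic belongs in one shared package, not copied into every validator container.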
Per-Validator Images Are an Operational Burden

V1 compiles all checks into one validator image. With N validators, V2 means N images to build and publish in CI, N base images to patch for CVEs, N image tags to keep in sync with the embedded catalog, N image pulls per validation run, and N images to mirror in air-gapped environments. As checks grow (10 deployment + 5 performance + 3 conformance), the maintenance cost scales linearly.

Duplicate Infrastructure

V2 reimplements Job deployment, RBAC management, watch-based completion waiting, pod log capture, and cleanup alongside the equivalent V1 machinery.

Other Gaps
Suggestion

The V1 isolation branch addresses the fault isolation and BYO container problems within the existing architecture, without breaking recipe reproducibility or multiplying the image maintenance burden. The remaining V2 advantages (termination log, resource limits, three-layer timeouts) are incremental improvements that can be added to V1. I'd suggest we consider whether V1 + isolation covers the requirements before introducing a parallel validation engine.
Now, let's reason about this first in terms of first principles:
Now, more specifically WRT @xdu31 comments:
Recommendation: I'd like to see us align on #290's architectural direction first, then incorporate the useful elements from PR #299 (cascading isolation flag, batched execution for speed) as follow-up optimizations. Overall, it's a +1 from me on the approach proposed in this PR (#290) with some incremental improvements mentioned above.
Resolved review threads (outdated):
tests/chainsaw/cli/cuj1-training/assert-validate-readiness.yaml
tests/uat/aws/tests/cuj1-training/assert-validate-readiness.yaml
tests/uat/aws/tests/cuj2-inference/assert-validate-readiness.yaml
parseThreshold() split on space and took the first token, which for a constraint value like ">= 100" yielded ">=" — not a valid float. Strip comparison operator characters (>=, <=, etc.) before extracting the numeric value. Handles "450", "450 GB/s", ">= 400", ">= 100 GB/s". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
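The fixed parsing described in this commit might look like the following sketch (the function name comes from the commit message; the body here is illustrative, not the actual diff):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseThreshold extracts the numeric value from a constraint string.
// Comparison operator characters are stripped first, so inputs like
// ">= 100" no longer yield ">=" as the candidate token.
// Handles "450", "450 GB/s", ">= 400", ">= 100 GB/s".
func parseThreshold(value string) (float64, error) {
	// Drop leading comparison operator characters (>=, <=, >, <, =).
	cleaned := strings.TrimLeft(value, "><= ")
	// The numeric value is the first remaining token; any unit suffix
	// such as "GB/s" is ignored.
	fields := strings.Fields(cleaned)
	if len(fields) == 0 {
		return 0, fmt.Errorf("no numeric value in %q", value)
	}
	return strconv.ParseFloat(fields[0], 64)
}

func main() {
	for _, v := range []string{"450", "450 GB/s", ">= 400", ">= 100 GB/s"} {
		n, err := parseThreshold(v)
		fmt.Println(v, "->", n, err)
	}
}
```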
Both launcher and worker pods in the NCCL TrainingRuntime were missing tolerations, causing FailedScheduling on clusters with node taints (e.g., dedicated=worker-workload:NoSchedule). Add operator: Exists toleration to both replicatedJobs so pods can schedule on any node regardless of taints. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
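The fix described above can be sketched as an allow-all toleration on each replicatedJob's pod template (surrounding structure abbreviated; a toleration with only `operator: Exists` matches any taint):

```yaml
replicatedJobs:
  - name: launcher
    template:
      spec:
        template:
          spec:
            # Tolerate any taint so pods can schedule on tainted nodes,
            # e.g. dedicated=worker-workload:NoSchedule.
            tolerations:
              - operator: Exists
```

The same stanza is repeated for the worker replicatedJob.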
Agent deployer assumed the target namespace already existed. On a brand new cluster the ServiceAccount creation failed with "namespaces not found". Add ensureNamespace step (create-or-ignore, matching validator pattern).
mchmarny left a comment: LGTM with the latest changes.
Large number of comments on this PR, so I'm going to summarize.
Addressed:
Not addressed, and why:
Unless anyone has additional concerns, I recommend we merge this PR.

@lalitadithya @mchmarny @xdu31 I like the tradeoffs we made in this PR. Having gone through all the comments/responses, I'd support landing this as it will help us and end users in the long term. +1 to merge this PR.
1. Validator unit tests excluded from CI
Files: the Makefile explicitly excludes the validator packages from the test run:

    go list ./... | grep -v -e /tests/chainsaw/ -e /validators

Impact: coverage numbers don't reflect validator check logic. Regressions in check implementations won't be caught.
Fix: add the validators packages back to the test list.

2. Documentation dropped — no user/developer guidance
Files: the previous force push included 5 well-written docs (~990 lines total) that were dropped. What remains is only the design ADR.
Impact: no one can build a custom check, contribute an upstream check, or understand the container contract without reading source code.
Fix: re-add the 5 dropped docs. The content existed and was high quality.

Medium Concerns
3. cluster-admin binding persists
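For finding 1, the Makefile change might look like this sketch (variable and target names are hypothetical, not the project's actual Makefile):

```make
# Before: validator packages filtered out of the test run
# TEST_PKGS := $(shell go list ./... | grep -v -e /tests/chainsaw/ -e /validators)

# After: keep only the chainsaw e2e exclusion
TEST_PKGS := $(shell go list ./... | grep -v -e /tests/chainsaw/)

test:
	go test -race -cover $(TEST_PKGS)
```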
Response to @xdu31's comments:
These are actual validations that are not easy to cover in unit tests, but they are covered in E2E tests.
Ack, this is a gap we will have to close in subsequent PRs.
Ack, we can add a warning in subsequent PRs.
Same as 2.
PR #290 (container-per-validator execution engine) inadvertently removed the --cncf-submission behavioral evidence collection added in PR #214 during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
- Robust operator: require healthy workload pods for PASS verdict
- DRA evidence: show allocation details to avoid pending state confusion
- Gateway CRDs: use name-grep instead of unreliable label selector
- Cluster autoscaling: align narrative with configuration-level evidence

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Also fixes YAML indentation in tests/uat/aws/config.yaml.

Signed-off-by: yuanchen97@gmail.com
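The port-forward retry pattern mentioned in the DCGM fix reduces to a small retry helper; a sketch (names hypothetical, not the actual script code):

```shell
# retry N CMD...: run CMD until it succeeds, up to N attempts,
# sleeping one second between tries.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# In the evidence script this would wrap the metrics probe behind the
# port-forward, e.g.: retry 5 curl -sf http://localhost:9400/metrics
retry 3 true && echo "retry: ok"
```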
Closes #291
Closes #303
Closes #295
Closes #140
Summary
Replace the Go testing.T-based validation pipeline with a container-per-validator execution engine. Each validator is a standalone OCI container run as a Kubernetes Job.

Validator Contract
- Exit codes: 0 = pass, 1 = fail, 2 = skip (not applicable)
- Result message reported via /dev/termination-log

Key Design Decisions
- Validator catalog embedded under recipes/

What Changed
- pkg/validator/ — engine packages: pkg/validator/catalog/, pkg/validator/ctrf/, pkg/validator/job/
- validators/deployment/, validators/conformance/, validators/performance/, validators/helper/, validators/chainsaw/
- pkg/cli/validate.go — --namespace flag, snapshot agent + validator deployment
- pkg/snapshotter/agent.go, pkg/constraints/
- recipes/validators/catalog.yaml (embedded via recipes.FS)
- docs/design/002-validatorv2-adr.md
- .github/workflows/on-tag.yaml, .github/actions/aicr-build/, tests/e2e/run.sh

Why
The v1 validator used go test -c compiled binaries inside K8s Jobs, parsing results from go test -json output with custom string markers. This was fragile, and results used a custom ValidationResult type with no tooling interop.

Validation First Principles
Validators must:
Test Plan
- make qualify passes locally
- -race across all packages