fix(evidence): restore --cncf-submission behavioral evidence collection #322

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/restore-cncf-submission
Mar 10, 2026

@yuanchen8911
Contributor

Summary

Restore the --cncf-submission behavioral evidence collection feature that was inadvertently removed by PR #290 (container-per-validator execution engine), and fix several pre-existing bugs in the evidence collection script.

Motivation / Context

PR #290 refactored the validation engine and deleted the --cncf-submission flag, the --feature flag, the runCNCFSubmission() function, and pkg/evidence/collector.go, all of which were originally added in PR #214. The docs still reference these flags, but the implementation is gone.

Fixes: N/A
Related: #290, #214

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: pkg/evidence

Implementation Notes

Restored files (taken from the latest state before PR #290, not from the original PR #214 version):

  • pkg/evidence/collector.go — behavioral evidence collector (shell script orchestrator)
  • pkg/evidence/collector_test.go — unit tests for feature resolution, script sections, collector options
  • pkg/evidence/scripts/collect-evidence.sh — 1237-line evidence collection script

Bug fixes in the script:

| Fix | Issue | Solution |
|-----|-------|----------|
| DCGM metrics empty | kubectl run curl pod raced against DNS; DCGM container too minimal for kubectl exec | Port-forward to DCGM service with retry loop |
| DCGM result false FAIL | Stale dcgm_pod variable reference after refactor to dcgm_svc | Fixed variable name |
| ASG details empty | Custom ASGs lack eks:nodegroup-name tag; multi-line None broke string comparison | Strip whitespace + instance ID fallback via describe-auto-scaling-instances |
| ELB hostname exposed | Public endpoint in evidence docs | Post-processing sed redaction |
| NO_CLEANUP broken | Skipped both pre-run and post-run cleanup | cleanup_ns takes pre/post phase; pre-run always cleans stale resources |
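
To make the NO_CLEANUP and ASG fixes above concrete, here is a minimal Go sketch of the two rules. The real fixes live in the shell script collect-evidence.sh, so this is only a restatement of the logic, not the script's code. The names `Phase`, `shouldCleanup`, and `asgNames` are hypothetical. It also assumes the "multi-line None" came from a text-format CLI query that prints `None` for each missing value.

```go
package main

import (
	"fmt"
	"strings"
)

// Phase is a hypothetical type naming when cleanup is requested.
type Phase string

const (
	PhasePre  Phase = "pre"  // before evidence collection starts
	PhasePost Phase = "post" // after evidence collection finishes
)

// shouldCleanup restates the rule from the table: pre-run cleanup always
// removes stale resources, while post-run cleanup is skipped only when
// NO_CLEANUP is set.
func shouldCleanup(phase Phase, noCleanup bool) bool {
	if phase == PhasePre {
		return true
	}
	return !noCleanup
}

// asgNames (hypothetical) normalizes raw lookup output: it splits on any
// whitespace, including newlines, and drops "None" placeholders, so a
// multi-line "None\nNone" result is treated as empty instead of being
// compared as a single string against "None".
func asgNames(raw string) []string {
	var names []string
	for _, field := range strings.Fields(raw) {
		if field != "None" {
			names = append(names, field)
		}
	}
	return names
}

func main() {
	fmt.Println(shouldCleanup(PhasePre, true))  // true: stale resources are always cleaned
	fmt.Println(shouldCleanup(PhasePost, true)) // false: resources are kept for inspection
	fmt.Println(len(asgNames("None\nNone\n")))  // 0: the caller should fall back to instance IDs
}
```

An empty result from `asgNames` is the point at which the instance ID fallback from the table would take over.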

CLI additions:

  • --cncf-submission flag triggers behavioral evidence collection (bypasses normal validation)
  • --feature/-f flag for selective feature collection
  • --kubeconfig propagated to evidence script via KUBECONFIG env var
  • Flag validation: --cncf-submission requires --evidence-dir, --feature requires --cncf-submission
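
The flag dependencies and the KUBECONFIG propagation listed above can be sketched as follows. This is an illustration under assumptions, not the code in pkg/cli: the `options` struct, `validateFlags`, `scriptEnv`, the error wording, and the sample feature name `dra` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// options is a hypothetical container for the parsed CLI flags.
type options struct {
	CNCFSubmission bool     // --cncf-submission
	EvidenceDir    string   // --evidence-dir
	Features       []string // --feature / -f (repeatable)
	Kubeconfig     string   // --kubeconfig
}

// validateFlags enforces the two dependencies from the list above:
// --cncf-submission requires --evidence-dir, and --feature requires
// --cncf-submission.
func validateFlags(o options) error {
	if o.CNCFSubmission && o.EvidenceDir == "" {
		return errors.New("--cncf-submission requires --evidence-dir")
	}
	if len(o.Features) > 0 && !o.CNCFSubmission {
		return errors.New("--feature requires --cncf-submission")
	}
	return nil
}

// scriptEnv returns the environment for the evidence script. When a
// kubeconfig path was given, it is appended as KUBECONFIG; os/exec uses
// the last value for a duplicated key, so this overrides any inherited one.
func scriptEnv(base []string, kubeconfig string) []string {
	env := append([]string{}, base...) // copy so the caller's slice is untouched
	if kubeconfig != "" {
		env = append(env, "KUBECONFIG="+kubeconfig)
	}
	return env
}

func main() {
	fmt.Println(validateFlags(options{CNCFSubmission: true}))
	fmt.Println(validateFlags(options{Features: []string{"dra"}}))
	fmt.Println(validateFlags(options{CNCFSubmission: true, EvidenceDir: "./evidence"}))
	fmt.Println(scriptEnv([]string{"PATH=/usr/bin"}, "/tmp/kubeconfig"))
}
```

Checking the dependencies before any work starts lets a bad flag combination fail fast, before the evidence script touches the cluster.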

Testing

go test ./pkg/evidence/ ./pkg/cli/ -race -count=1
# ok  github.com/NVIDIA/aicr/pkg/evidence  1.628s
# ok  github.com/NVIDIA/aicr/pkg/cli       1.902s

Verified end-to-end on EKS cluster with H100 GPUs — all 8 evidence features collected successfully with all bug fixes confirmed.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — restores previously existing functionality

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 10, 2026 01:49
@yuanchen8911 yuanchen8911 added the bug and area/tests labels Mar 10, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/restore-cncf-submission branch 7 times, most recently from 41acf49 to a5be232 March 10, 2026 02:16
@yuanchen8911 yuanchen8911 requested a review from mchmarny March 10, 2026 02:17
@yuanchen8911 yuanchen8911 force-pushed the fix/restore-cncf-submission branch 3 times, most recently from ece1c0e to 1f2caab March 10, 2026 03:50
PR NVIDIA#290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR NVIDIA#214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
- Robust operator: require healthy workload pods for PASS verdict
- DRA evidence: show allocation details to avoid pending state confusion
- Gateway CRDs: use name-grep instead of unreliable label selector
- Cluster autoscaling: align narrative with configuration-level evidence

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Also fixes YAML indentation in tests/uat/aws/config.yaml.

Signed-off-by: yuanchen97@gmail.com
@mchmarny mchmarny merged commit 5fcebe3 into NVIDIA:main Mar 10, 2026
16 checks passed