
fix(evidence): restore --cncf-submission behavioral evidence collection #321

Closed
yuanchen8911 wants to merge 1 commit into main from
fix/restore-cncf-submission


@yuanchen8911
Contributor

Summary

Restore the --cncf-submission behavioral evidence collection feature that was inadvertently removed by PR #290 (container-per-validator execution engine), and fix several pre-existing bugs in the evidence collection script.

Motivation / Context

PR #290 refactored the validation engine and deleted the --cncf-submission flag, the --feature flag, the runCNCFSubmission() function, and pkg/evidence/collector.go, all of which were originally added in PR #214. The docs still reference these flags, but the implementation is gone.

Fixes: N/A
Related: #290, #214

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: pkg/evidence

Implementation Notes

Restored files (from the latest pre-PR #290 state, not the original PR #214):

  • pkg/evidence/collector.go — behavioral evidence collector (shell script orchestrator)
  • pkg/evidence/collector_test.go — unit tests for feature resolution, script sections, collector options
  • pkg/evidence/scripts/collect-evidence.sh — 1237-line evidence collection script

Bug fixes in the script:

| Fix | Issue | Solution |
|---|---|---|
| DCGM metrics empty | kubectl run curl pod raced against DNS; DCGM container too minimal for kubectl exec | Port-forward to DCGM service with retry loop |
| DCGM result false FAIL | Stale dcgm_pod variable reference after refactor to dcgm_svc | Fixed variable name |
| ASG details empty | Custom ASGs lack eks:nodegroup-name tag; multi-line None broke string comparison | Strip whitespace + instance ID fallback via describe-auto-scaling-instances |
| ELB hostname exposed | Public endpoint in evidence docs | Post-processing sed redaction |
| NO_CLEANUP broken | Skipped both pre-run and post-run cleanup | cleanup_ns takes pre/post phase; pre-run always cleans stale resources |
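
The DCGM port-forward fix listed above could look roughly like the sketch below. It is not the code from collect-evidence.sh: the helper names `retry` and `fetch_dcgm_metrics`, the local port, the attempt count, and the delay are all assumptions made for illustration, and the remote port 9400 assumes the exporter's default. The idea is to start `kubectl port-forward` in the background and poll the metrics endpoint until it answers, instead of launching a short-lived curl pod that can race against DNS.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, ports, and timings are assumptions,
# not taken from collect-evidence.sh.

# retry <attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts are exhausted.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# fetch_dcgm_metrics <namespace> <service> <output-file>
# Port-forwards to the service and polls /metrics until it responds.
fetch_dcgm_metrics() {
  local ns=$1 svc=$2 out=$3
  local local_port=19400 remote_port=9400 rc=0   # hypothetical ports

  kubectl -n "$ns" port-forward "svc/$svc" "${local_port}:${remote_port}" \
    >/dev/null 2>&1 &
  local pf_pid=$!

  # The tunnel needs a moment to come up, so the first attempts may fail.
  retry 10 2 curl -sf -o "$out" "http://127.0.0.1:${local_port}/metrics" || rc=$?

  # Always tear the port-forward down, whether or not the fetch succeeded.
  kill "$pf_pid" 2>/dev/null
  wait "$pf_pid" 2>/dev/null
  return "$rc"
}
```

A caller would invoke something like `fetch_dcgm_metrics <namespace> <service> dcgm-metrics.txt` and treat a non-zero exit status as "metrics unavailable" rather than as an empty result.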

CLI additions:

  • --cncf-submission flag triggers behavioral evidence collection (bypasses normal validation)
  • --feature/-f flag for selective feature collection
  • --kubeconfig propagated to evidence script via KUBECONFIG env var
  • Flag validation: --cncf-submission requires --evidence-dir, --feature requires --cncf-submission

Testing

go test ./pkg/evidence/ ./pkg/cli/ -race -count=1
# ok  github.com/NVIDIA/aicr/pkg/evidence  1.628s
# ok  github.com/NVIDIA/aicr/pkg/cli       1.902s

Verified end-to-end on EKS cluster with H100 GPUs — all 8 evidence features collected successfully with all bug fixes confirmed.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — restores previously existing functionality

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

PR #290 (container-per-validator execution engine) inadvertently removed
the --cncf-submission behavioral evidence collection added in PR #214
during the validation refactor. This restores it on top of the new engine.

Restored:
- pkg/evidence/collector.go — behavioral evidence collector
- pkg/evidence/collector_test.go — unit tests
- pkg/evidence/scripts/collect-evidence.sh — evidence collection script

Bug fixes in the script:
- DCGM metrics: port-forward with retry loop instead of flaky kubectl run
- DCGM result: fixed stale variable reference causing false FAIL verdict
- ASG lookup: instance ID fallback when EKS nodegroup tags are absent
- ELB redaction: auto-redact public ELB hostnames from evidence output
- NO_CLEANUP: pre-run cleanup always runs, post-run respects the flag
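
The pre/post split in the NO_CLEANUP bullet can be sketched as follows. Only the name `cleanup_ns` and its pre/post phase argument come from this PR; the function body, the `delete_test_resources` helper, and the choice to treat any non-empty NO_CLEANUP value as "set" are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: cleanup_ns is named in the PR, but this body and the
# delete_test_resources helper are hypothetical.

# Hypothetical helper: removes whatever a previous run left behind.
delete_test_resources() {
  kubectl delete namespace "$1" --ignore-not-found --wait=true
}

# cleanup_ns <namespace> <pre|post>
#   pre : always clean, so stale resources never leak into a new run
#   post: skip when NO_CLEANUP is set, so results can be inspected later
cleanup_ns() {
  local ns=$1 phase=$2
  case "$phase" in
    pre)
      ;;
    post)
      if [ -n "${NO_CLEANUP:-}" ]; then
        echo "NO_CLEANUP set; leaving namespace $ns in place"
        return 0
      fi
      ;;
    *)
      echo "cleanup_ns: unknown phase '$phase'" >&2
      return 2
      ;;
  esac
  delete_test_resources "$ns"
}
```

With this shape, NO_CLEANUP only affects what happens after a run, so a rerun still starts from a clean namespace.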

CLI additions:
- --cncf-submission flag to trigger behavioral evidence collection
- --feature/-f flag for selective feature collection
- --kubeconfig propagated to evidence script via KUBECONFIG env
- Flag validation tests for regression prevention

Signed-off-by: yuanchen97@gmail.com
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 10, 2026 01:39
@yuanchen8911 yuanchen8911 added the bug and area/tests labels Mar 10, 2026
@yuanchen8911 yuanchen8911 deleted the fix/restore-cncf-submission branch March 10, 2026 01:41
@github-actions

Coverage Report ✅

| Metric | Value |
|---|---|
| Coverage | 73.3% |
| Threshold | 70% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-73.3%25-green)

No Go source files changed in this PR.


Labels

area/cli, bug, size/XL
