feat(evidence): add artifact capture for conformance evidence#201

Merged
dims merged 2 commits into NVIDIA:main from dims:worktree-conformance-artifacts
Feb 24, 2026

Collaborator

@dims dims commented Feb 24, 2026

Summary

  • Add diagnostic artifact capture mechanism so conformance checks record rich evidence (deployment status, metrics samples, test results) during execution, flowing through the pipeline into evidence markdown
  • Artifacts are ephemeral (json:"-") and transported via base64-encoded ARTIFACT: lines in test output, decoded in phases.go, rendered as labeled code blocks in evidence templates
  • All 9 submission requirement checks now record diagnostic artifacts covering deployment status, metrics presence, behavioral test results, and more

Design

Check → ctx.Artifacts.Record(label, data)
  → Cancel() emits t.Logf("ARTIFACT:<base64>")
    → phases.go extracts ARTIFACT: lines, populates CheckResult.Artifacts
      → evidence renderer outputs #### Label + fenced code block
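The encode and decode ends of this flow can be sketched as follows. This is a minimal illustration, not the PR's code: the `Artifact` fields, the JSON payload inside the base64, and the names `EncodeLine` and `DecodeLine` are all assumptions made for the example. Only the `ARTIFACT:` prefix and the base64 encoding come from the description above.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// Artifact is a hypothetical shape for one piece of diagnostic evidence.
type Artifact struct {
	Label string `json:"label"`
	Data  string `json:"data"`
}

const artifactPrefix = "ARTIFACT:"

// EncodeLine serializes an artifact to JSON (an assumption of this sketch)
// and wraps it as a single ARTIFACT:<base64> line, so multi-line data
// survives line-oriented test output.
func EncodeLine(a Artifact) (string, error) {
	raw, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return artifactPrefix + base64.StdEncoding.EncodeToString(raw), nil
}

// DecodeLine reverses EncodeLine and reports an error for malformed input.
func DecodeLine(line string) (Artifact, error) {
	var a Artifact
	if !strings.HasPrefix(line, artifactPrefix) {
		return a, fmt.Errorf("missing %q prefix", artifactPrefix)
	}
	raw, err := base64.StdEncoding.DecodeString(strings.TrimPrefix(line, artifactPrefix))
	if err != nil {
		return a, err
	}
	err = json.Unmarshal(raw, &a)
	return a, err
}

func main() {
	line, err := EncodeLine(Artifact{Label: "Deployment status", Data: "replicas: 2/2\nready: true"})
	if err != nil {
		panic(err)
	}
	got, err := DecodeLine(line)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n%s\n", got.Label, got.Data)
}
```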

Key constraints:

  • Artifact type lives in checks/ (leaf package, no import cycles)
  • Per-artifact: max 8KB data, max 20 per check, each base64 line under bufio.Scanner 64KB limit
  • Artifacts are ephemeral (json:"-" yaml:"-") — never persisted in saved results
  • Cancel() nil-guards both r.ctx and r.ctx.Artifacts
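A collector that enforces these caps might look roughly like the sketch below. The 8KB and 20-per-check limits are taken from the list above. The type name `Collector`, the boolean return from `Record`, and the nil-receiver guard are assumptions for illustration and may not match the actual `checks/artifact.go`.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	maxArtifactBytes = 8 * 1024 // per-artifact data cap from the PR description
	maxArtifacts     = 20       // per-check count cap from the PR description
)

// Artifact is a hypothetical labeled piece of evidence. The struct tags keep
// it out of serialized results, as the description says artifacts are ephemeral.
type Artifact struct {
	Label string `json:"-" yaml:"-"`
	Data  string `json:"-" yaml:"-"`
}

// Collector accumulates artifacts for one check and is safe for concurrent use.
type Collector struct {
	mu    sync.Mutex
	items []Artifact
}

// Record stores an artifact, truncating oversized data and refusing entries
// past the count cap. It reports whether the artifact was stored.
// Note: byte truncation may split a multi-byte UTF-8 character.
func (c *Collector) Record(label, data string) bool {
	if c == nil {
		return false // tolerate a missing collector instead of panicking
	}
	if len(data) > maxArtifactBytes {
		data = data[:maxArtifactBytes]
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.items) >= maxArtifacts {
		return false
	}
	c.items = append(c.items, Artifact{Label: label, Data: data})
	return true
}

// Drain returns everything recorded so far and resets the collector.
func (c *Collector) Drain() []Artifact {
	if c == nil {
		return nil
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	out := c.items
	c.items = nil
	return out
}

func main() {
	c := &Collector{}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Record(fmt.Sprintf("item-%d", i), "x")
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.Drain())) // prints 20: the count cap holds under concurrency
}
```

Keeping each artifact at 8KB keeps its base64 form (about 4/3 the size) well below the 64KB default line limit of `bufio.Scanner`.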

Files Changed

| Area | Files | Change |
| --- | --- | --- |
| Infrastructure | checks/artifact.go, checks/registry.go, checks/runner.go, result.go, phases.go | Artifact type, collector, transport, pipeline extraction |
| Evidence | evidence/types.go, evidence/renderer.go, evidence/templates.go | Pass artifacts through to markdown rendering |
| Static checks (4) | dra_support_check.go, accelerator_metrics_check.go, ai_service_metrics_check.go, inference_gateway_check.go | Record deployment/metrics/CRD evidence |
| Behavioral checks (5) | robust_controller_check.go, secure_access_check.go, gang_scheduling_check.go, pod_autoscaling_check.go, cluster_autoscaling_check.go | Record behavioral test results |
| Tests | artifact_test.go, runner_test.go, renderer_test.go, dra_support_check_unit_test.go | Round-trip, cap enforcement, thread safety, Cancel() emit, renderer with/without artifacts |

Test plan

  • make test passes with race detector (73.9% coverage)
  • Artifact encode/decode round-trip test
  • Cap enforcement (count limit, data truncation)
  • Thread safety test (concurrent Record)
  • Cancel() with nil ctx/Artifacts doesn't panic
  • Cancel() emits artifacts via t.Logf
  • Evidence renderer with artifacts: labeled code blocks appear
  • Evidence renderer without artifacts: identical to current output
  • All existing conformance tests pass unchanged
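The two renderer cases in this plan can be illustrated with a small sketch. `RenderArtifacts` and the `Artifact` struct are hypothetical names, and the real templates in `evidence/templates.go` may be structured differently. The point is only that an empty artifact list contributes nothing, so existing output stays byte-identical.

```go
package main

import (
	"fmt"
	"strings"
)

// Artifact is a hypothetical labeled piece of evidence.
type Artifact struct {
	Label string
	Data  string
}

// fence is built at runtime so this example contains no literal code fence.
var fence = strings.Repeat("`", 3)

// RenderArtifacts emits one "#### Label" heading plus a fenced block per
// artifact. With no artifacts it returns the empty string.
// Assumption: artifact data does not itself contain a code fence.
func RenderArtifacts(artifacts []Artifact) string {
	var b strings.Builder
	for _, a := range artifacts {
		fmt.Fprintf(&b, "#### %s\n\n%s\n%s\n%s\n\n",
			a.Label, fence, strings.TrimRight(a.Data, "\n"), fence)
	}
	return b.String()
}

func main() {
	fmt.Print(RenderArtifacts([]Artifact{
		{Label: "DRA controller", Data: "replicas: 1/1\n"},
	}))
}
```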

@dims dims requested a review from a team as a code owner February 24, 2026 04:52
@dims dims force-pushed the worktree-conformance-artifacts branch from b56d2dc to 52262f0 Compare February 24, 2026 13:05
mchmarny

This comment was marked as resolved.

@dims dims force-pushed the worktree-conformance-artifacts branch from 52262f0 to 10db91c Compare February 24, 2026 13:21
Member

@mchmarny mchmarny left a comment


Clean architecture, good test coverage, and the ARTIFACT: transport mirrors the existing CONSTRAINT_RESULT: pattern well. Four items to address — see inline comments.

@dims dims force-pushed the worktree-conformance-artifacts branch from 10db91c to f100479 Compare February 24, 2026 13:24
@dims dims requested a review from a team as a code owner February 24, 2026 13:24
@mchmarny mchmarny dismissed their stale review February 24, 2026 13:24

Superseded by review with inline comments

@dims dims force-pushed the worktree-conformance-artifacts branch 4 times, most recently from dd75d61 to c6df7b5 Compare February 24, 2026 14:54
@dims dims force-pushed the worktree-conformance-artifacts branch from c6df7b5 to 2c2ae52 Compare February 24, 2026 15:25
Add an artifact capture mechanism so conformance checks record rich
diagnostic evidence during execution, flowing it through the pipeline
into evidence markdown. Single command, rich output.

Infrastructure:
- Artifact type, ArtifactCollector with thread-safe Record()/Drain(),
  base64 encode/decode, 8KB per-artifact / 20 per-check caps
- Pipeline: runner.go Cancel() emits via t.Logf → phases.go extracts
  using Contains+SplitN (handles t.Logf source prefixes) → evidence
  renderer emits labeled code blocks in markdown
- Artifacts are ephemeral (json:"-") — never persisted in saved results
- Failed artifact decodes log a warning and preserve the line in Reason

Conformance checks instrumented (9 checks):
- dra_support_check: controller, kubelet plugin, ResourceSlices
- accelerator_metrics_check: DCGM metrics sample, required metrics
- ai_service_metrics_check: Prometheus query, custom metrics API
- inference_gateway_check: GatewayClass, Gateway, CRDs, data plane
- robust_controller_check: Dynamo operator, webhook, rejection test
- secure_access_check: DRA test pod, access patterns, isolation test
- gang_scheduling_check: KAI scheduler, GPU availability, gang results
- pod_autoscaling_check: custom/external metrics API, HPA test
- cluster_autoscaling_check: Karpenter, NodePools, autoscaling test

Testing:
- Artifact encode/decode round-trip, cap enforcement, thread safety
- extractArtifacts() with realistic source-prefixed t.Logf lines
- Evidence renderer with/without artifacts
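The Contains+SplitN extraction described in this commit message could be sketched as below. `ExtractArtifacts` here is a hypothetical stand-in that returns raw decoded payloads, and the sample file name and line number in `main` are made up. It shows why a prefix match is not enough: `t.Logf` prepends a source location, so the marker appears mid-line.

```go
package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"strings"
)

const marker = "ARTIFACT:"

// ExtractArtifacts separates decoded artifact payloads from ordinary output.
// Lines that carry the marker but fail to decode stay in rest, so they can
// still surface in the failure reason. Scanner errors (for example a line
// over the default 64KB limit) are ignored in this sketch.
func ExtractArtifacts(output string) (payloads []string, rest []string) {
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := sc.Text()
		if !strings.Contains(line, marker) {
			rest = append(rest, line)
			continue
		}
		// SplitN with n=2 keeps everything after the first marker intact.
		encoded := strings.TrimSpace(strings.SplitN(line, marker, 2)[1])
		raw, err := base64.StdEncoding.DecodeString(encoded)
		if err != nil {
			rest = append(rest, line) // preserve the undecodable line
			continue
		}
		payloads = append(payloads, string(raw))
	}
	return payloads, rest
}

func main() {
	good := base64.StdEncoding.EncodeToString([]byte("replicas: 2/2"))
	output := "=== RUN   TestCheck\n" +
		"    runner.go:88: " + marker + good + "\n" +
		"    runner.go:88: " + marker + "%%%not-base64\n" +
		"--- PASS: TestCheck (0.01s)\n"
	payloads, rest := ExtractArtifacts(output)
	fmt.Println("artifacts:", payloads)
	fmt.Println("other lines:", len(rest))
}
```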
@dims dims force-pushed the worktree-conformance-artifacts branch from 2c2ae52 to 1d15151 Compare February 24, 2026 15:52
…checks

LoadValidationContext() used DiagnosticTimeout (2 minutes) as the parent
context for all conformance checks. Behavioral checks like DRA secure
access need time for pod creation, CUDA image pull, GPU allocation, and
isolation verification — 2 minutes was insufficient, causing consistent
TIMEOUT failures.

Add CheckExecutionTimeout (10 minutes) for the check execution context,
bounded below ValidateConformanceTimeout (15 minutes).
@mchmarny
Member

Thanks for resolving the previous comments; most of this looks good now.
A few of the test artifacts still appear to be hardcoded congratulatory strings rather than observed state:

  • cluster_autoscaling_check.go:127 — “HPA: scaling intent detected\nKarpenter: new node(s) provisioned” — these are assertions restated as evidence, not captured state
  • robust_controller_check.go:141 — “Result: PASS — webhook rejected invalid DynamoGraphDeployment”
  • pod_autoscaling_check.go:152 — “Scale-up: PASS — HPA computed desiredReplicas > currentReplicas”
  • secure_access_check.go:124 — “Result: PASS — pod without DRA claims cannot see GPU devices”

The DRA support check does it right now: it captures actual replica counts, image versions, and ResourceSlice counts.

It may be better to capture the actual HPA .status.desiredReplicas/.status.currentReplicas values, the actual pod scheduling timestamps, or the actual Karpenter node provisioning events rather than static pass/fail strings. Just an idea.
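As a rough illustration of the difference, an artifact built from observed values might look like the sketch below. `HPAObservation` is a hypothetical struct holding values a check would read from the HPA's `.status`, and the example name and numbers in `main` are made up. Fetching the values from the cluster is out of scope here.

```go
package main

import "fmt"

// HPAObservation holds values read from a HorizontalPodAutoscaler's status.
// It is a hypothetical container for this sketch, not a Kubernetes API type.
type HPAObservation struct {
	Name            string
	CurrentReplicas int32
	DesiredReplicas int32
}

// Evidence renders the observed state, leaving the pass/fail judgment to the
// check itself instead of restating it as a fixed string.
func (o HPAObservation) Evidence() string {
	return fmt.Sprintf("HPA: %s\ncurrentReplicas: %d\ndesiredReplicas: %d\nscaleUpIntent: %t",
		o.Name, o.CurrentReplicas, o.DesiredReplicas,
		o.DesiredReplicas > o.CurrentReplicas)
}

func main() {
	obs := HPAObservation{Name: "example-hpa", CurrentReplicas: 1, DesiredReplicas: 3}
	fmt.Println(obs.Evidence())
}
```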

@dims
Collaborator Author

dims commented Feb 24, 2026

It may be better to capture the actual HPA .status.desiredReplicas/.status.currentReplicas values, the actual pod scheduling timestamps, or the actual Karpenter node provisioning events rather than static pass/fail strings. Just an idea.

will iterate for sure! it's not done done yet :)

@dims dims merged commit 766d7c1 into NVIDIA:main Feb 24, 2026
30 of 31 checks passed