feat(evidence): add artifact capture for conformance evidence #201
dims merged 2 commits into NVIDIA:main
Conversation
Force-pushed b56d2dc to 52262f0
Force-pushed 52262f0 to 10db91c
mchmarny left a comment:
Clean architecture, good test coverage, and the ARTIFACT: transport mirrors the existing CONSTRAINT_RESULT: pattern well. Four items to address — see inline comments.
Force-pushed 10db91c to f100479
Superseded by review with inline comments
Force-pushed dd75d61 to c6df7b5
Force-pushed c6df7b5 to 2c2ae52
Add an artifact capture mechanism so conformance checks record rich diagnostic evidence during execution, flowing it through the pipeline into evidence markdown. Single command, rich output.

Infrastructure:
- Artifact type, ArtifactCollector with thread-safe Record()/Drain(), base64 encode/decode, 8KB per-artifact / 20 per-check caps
- Pipeline: runner.go Cancel() emits via t.Logf → phases.go extracts using Contains+SplitN (handles t.Logf source prefixes) → evidence renderer emits labeled code blocks in markdown
- Artifacts are ephemeral (json:"-") — never persisted in saved results
- Failed artifact decodes log a warning and preserve the line in Reason

Conformance checks instrumented (9 checks):
- dra_support_check: controller, kubelet plugin, ResourceSlices
- accelerator_metrics_check: DCGM metrics sample, required metrics
- ai_service_metrics_check: Prometheus query, custom metrics API
- inference_gateway_check: GatewayClass, Gateway, CRDs, data plane
- robust_controller_check: Dynamo operator, webhook, rejection test
- secure_access_check: DRA test pod, access patterns, isolation test
- gang_scheduling_check: KAI scheduler, GPU availability, gang results
- pod_autoscaling_check: custom/external metrics API, HPA test
- cluster_autoscaling_check: Karpenter, NodePools, autoscaling test

Testing:
- Artifact encode/decode round-trip, cap enforcement, thread safety
- extractArtifacts() with realistic source-prefixed t.Logf lines
- Evidence renderer with/without artifacts
Force-pushed 2c2ae52 to 1d15151
…checks LoadValidationContext() used DiagnosticTimeout (2 minutes) as the parent context for all conformance checks. Behavioral checks like DRA secure access need time for pod creation, CUDA image pull, GPU allocation, and isolation verification — 2 minutes was insufficient, causing consistent TIMEOUT failures. Add CheckExecutionTimeout (10 minutes) for the check execution context, bounded below ValidateConformanceTimeout (15 minutes).
Thanks for resolving the previous comments; most of this looks good now.
The DRA support check now does it right: it captures actual replica counts, image versions, and ResourceSlice counts. May be better to capture actual HPA …
will iterate for sure! it's not done done yet :)
Summary
Artifacts are ephemeral (json:"-") and transported via base64-encoded ARTIFACT: lines in test output, decoded in phases.go, rendered as labeled code blocks in evidence templates.

Design
Key constraints:
- Artifact type lives in checks/ (leaf package, no import cycles)
- Artifacts are ephemeral (json:"-" yaml:"-") — never persisted in saved results
- Cancel() nil-guards both r.ctx and r.ctx.Artifacts

Files Changed
- checks/artifact.go, checks/registry.go, checks/runner.go, result.go, phases.go
- evidence/types.go, evidence/renderer.go, evidence/templates.go
- dra_support_check.go, accelerator_metrics_check.go, ai_service_metrics_check.go, inference_gateway_check.go
- robust_controller_check.go, secure_access_check.go, gang_scheduling_check.go, pod_autoscaling_check.go, cluster_autoscaling_check.go
- artifact_test.go, runner_test.go, renderer_test.go, dra_support_check_unit_test.go

Test plan
- make test passes with race detector (73.9% coverage)