
feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images#463

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/split-service-metrics-evidence
Mar 25, 2026

Conversation

@yuanchen8911 (Contributor) commented Mar 25, 2026

Summary

Split accelerator_metrics/ai_service_metrics evidence collection into separate paths with auto-detection of inference (Dynamo) vs training (PyTorch) workloads. Fix imagePullPolicy regression from #438 that caused GPU CI tests to timeout.

Motivation / Context

Evidence split: The CNCF AI Conformance ai_service_metrics requirement asks for "discovering and collecting metrics from workloads exposing Prometheus format." Previously both requirements shared a single evidence file. This split provides dedicated evidence showing real AI workload metrics from both inference and training platforms.

imagePullPolicy fix: PR #438 changed imagePullPolicy to Always for :latest tagged images. On nvkind CI, validators use ko.local:* images (locally built, side-loaded into kind). With PullAlways, the kubelet attempts to pull from the non-existent ko.local registry on every validator pod, timing out after ~5 minutes before falling back to the cached image. This turned a 2-minute conformance run into 50+ minutes, exceeding the CI job timeout. All GPU CI tests have been failing/cancelled since March 19 (#438 merge date).

Fixes: N/A
Related: #460, #438, cncf/k8s-ai-conformance#79

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

imagePullPolicy fix (pkg/validator/job/deployer.go)

Local images (ko.local, kind.local, localhost) now always use IfNotPresent. Remote images with :latest tag still use Always. This restores the pre-#438 behavior for CI while preserving the dev-build freshness guarantee for remote registries.
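The decision logic can be sketched as follows. This is a minimal illustration of the rule described above, not the actual deployer.go code; the function name and prefix list are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// pullPolicyFor sketches the fix: images from local registries are never
// re-pulled (they are side-loaded into kind, so there is no registry to
// reach), while remote :latest images keep the Always policy from #438.
func pullPolicyFor(image string) string {
	for _, local := range []string{"ko.local/", "kind.local/", "localhost/", "localhost:"} {
		if strings.HasPrefix(image, local) {
			return "IfNotPresent" // cached image; avoids the ~5 min pull timeout
		}
	}
	if strings.HasSuffix(image, ":latest") || !strings.Contains(image, ":") {
		return "Always" // preserve dev-build freshness for remote registries
	}
	return "IfNotPresent"
}

func main() {
	fmt.Println(pullPolicyFor("ko.local/validator:latest"))  // IfNotPresent
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr:latest")) // Always
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr:v1.2.0")) // IfNotPresent
}
```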

Before (all GPU CI tests timing out since March 19):

  • GPU Conformance: 60m timeout → cancelled
  • Each validator: ~5 min (image pull timeout) × 8 validators = 40+ min

After:

  • GPU Conformance: 17m (pass)
  • GPU Training: 17m (pass)
  • Each validator: seconds (cached image, no pull attempt)

Evidence collection (pkg/evidence)

Auto-detection of workload type:

  • Dynamo workload running → PodMonitor path (worker :9090/metrics + frontend :8000/metrics)
  • No Dynamo → standalone PyTorch training path (:8080/metrics via ServiceMonitor)
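The branch between the two paths amounts to the following sketch (the function name is illustrative; the real collector inspects the cluster for DynamoGraphDeployment resources rather than taking a list):

```go
package main

import "fmt"

// workloadType mirrors the auto-detection above: any Dynamo deployment
// present selects the inference PodMonitor path; otherwise the collector
// falls back to the standalone PyTorch training path.
func workloadType(dynamoDeployments []string) string {
	if len(dynamoDeployments) > 0 {
		return "inference" // worker :9090/metrics + frontend :8000/metrics
	}
	return "training" // :8080/metrics via ServiceMonitor
}

func main() {
	fmt.Println(workloadType([]string{"vllm-agg"})) // inference
	fmt.Println(workloadType(nil))                  // training
}
```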

Embedded manifests (pkg/evidence/scripts/manifests/):

  • dynamo-vllm-agg.yaml — DynamoGraphDeployment with DRA ResourceClaim + KAI queue
  • trainer-pytorch-test.yaml — Standalone PyTorch pod exposing training_step_total, training_loss, training_throughput_samples_per_sec, training_gpu_memory_* metrics

Prometheus query scoping:

  • Inference: targets matched by job prefix dynamo-system/dynamo-
  • Training: targets matched by exact job=pytorch-training-metrics, metrics queried with {job="pytorch-training-metrics"} label
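The two scoping rules can be sketched as below; the helper names are illustrative and not the actual pkg/evidence API:

```go
package main

import (
	"fmt"
	"strings"
)

// isInferenceTarget matches Prometheus targets by job prefix, as the
// inference path does.
func isInferenceTarget(job string) bool {
	return strings.HasPrefix(job, "dynamo-system/dynamo-")
}

// trainingQuery scopes a metric with the exact job label selector used by
// the training path.
func trainingQuery(metric string) string {
	return fmt.Sprintf("%s{job=%q}", metric, "pytorch-training-metrics")
}

func main() {
	fmt.Println(isInferenceTarget("dynamo-system/dynamo-worker")) // true
	fmt.Println(trainingQuery("training_step_total"))
	// training_step_total{job="pytorch-training-metrics"}
}
```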

E2E tested on:

  • aicr-cuj2 (inference): Dynamo vLLM → 10 requests → non-zero dynamo_component_* + dynamo_frontend_* metrics in Prometheus
  • aicr-cuj1 (training): PyTorch training → 100 steps → training_step_total=100, training_loss=1.33, training_throughput=549K samples/s in Prometheus

Testing

# Unit tests
go test -v ./pkg/evidence/... -race     # All pass
go test -v ./pkg/validator/job/... -run TestImagePullPolicy -race  # All pass (7 cases)

# E2E — inference cluster (aicr-cuj2)
aicr validate --recipe h100-inference-recipe.yaml \
  --phase conformance --cncf-submission -f service-metrics \
  --evidence-dir ./evidence
# PASS

# E2E — training cluster (aicr-cuj1)
aicr validate --recipe h100-training-recipe.yaml \
  --phase conformance --cncf-submission -f service-metrics \
  --evidence-dir ./evidence
# PASS

# GPU CI (nvkind)
# GPU Conformance: PASS (17m, was 60m timeout)
# GPU Training: PASS (17m)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: The metrics alias now maps to accelerator-metrics (previously the combined metrics script section). Use -f service-metrics to collect AI service metrics evidence.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 requested a review from a team as a code owner, and from dims and mchmarny, March 25, 2026
@yuanchen8911 changed the title from "feat(evidence): split ai_service_metrics with Dynamo and Kubeflow Trainer support" to "feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images", Mar 25, 2026
feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images

Split accelerator_metrics/ai_service_metrics evidence into separate paths
with auto-detection of inference (Dynamo) vs training (PyTorch) workloads.

Fix imagePullPolicy regression from NVIDIA#438: local images (ko.local, kind.local,
localhost) now use IfNotPresent instead of Always, preventing 5-minute pull
timeout per validator on nvkind CI clusters.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 merged commit 6137c0b into NVIDIA:main Mar 25, 2026
26 checks passed
