
feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images#463

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/split-service-metrics-evidence
Mar 25, 2026

Conversation

@yuanchen8911 (Contributor) commented Mar 25, 2026

Summary

Split accelerator_metrics/ai_service_metrics evidence collection into separate paths with auto-detection of inference (Dynamo) vs training (PyTorch) workloads. Fix imagePullPolicy regression from #438 that caused GPU CI tests to timeout.

Motivation / Context

Evidence split: The CNCF AI Conformance ai_service_metrics requirement asks for "discovering and collecting metrics from workloads exposing Prometheus format." Previously both requirements shared a single evidence file. This split provides dedicated evidence showing real AI workload metrics from both inference and training platforms.

imagePullPolicy fix: PR #438 changed imagePullPolicy to Always for :latest tagged images. On nvkind CI, validators use ko.local:* images (locally built, side-loaded into kind). With PullAlways, the kubelet attempts to pull from the non-existent ko.local registry on every validator pod, timing out after ~5 minutes before falling back to the cached image. This turned a 2-minute conformance run into 50+ minutes, exceeding the CI job timeout. All GPU CI tests have been failing/cancelled since March 19 (#438 merge date).

Fixes: N/A
Related: #460, #438, cncf/k8s-ai-conformance#79

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

imagePullPolicy fix (pkg/validator/job/deployer.go)

Local images (ko.local, kind.local, localhost) now always use IfNotPresent. Remote images with :latest tag still use Always. This restores the pre-#438 behavior for CI while preserving the dev-build freshness guarantee for remote registries.
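The decision logic can be sketched as follows. This is a minimal illustration of the rule described above, not the actual deployer.go code; the function name and prefix list are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// pullPolicyFor sketches the fix: images from local registries are never
// re-pulled (they are side-loaded into kind, so there is no registry to
// reach), while remote :latest images keep the Always policy from #438.
func pullPolicyFor(image string) string {
	for _, local := range []string{"ko.local/", "kind.local/", "localhost/", "localhost:"} {
		if strings.HasPrefix(image, local) {
			return "IfNotPresent" // cached image; avoids the ~5 min pull timeout
		}
	}
	if strings.HasSuffix(image, ":latest") || !strings.Contains(image, ":") {
		return "Always" // preserve dev-build freshness for remote registries
	}
	return "IfNotPresent"
}

func main() {
	fmt.Println(pullPolicyFor("ko.local/validator:latest"))  // IfNotPresent
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr:latest")) // Always
	fmt.Println(pullPolicyFor("ghcr.io/nvidia/aicr:v1.2.0")) // IfNotPresent
}
```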

Before (all GPU CI tests timing out since March 19):

  • GPU Conformance: 60m timeout → cancelled
  • Each validator: ~5 min (image pull timeout) × 8 validators = 40+ min

After:

  • GPU Conformance: 17m (pass)
  • GPU Training: 17m (pass)
  • Each validator: seconds (cached image, no pull attempt)

Evidence collection (pkg/evidence)

Auto-detection of workload type:

  • Dynamo workload running → PodMonitor path (worker :9090/metrics + frontend :8000/metrics)
  • No Dynamo → standalone PyTorch training path (:8080/metrics via ServiceMonitor)
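The branch between the two paths amounts to the following sketch (the function name is illustrative; the real collector inspects the cluster for DynamoGraphDeployment resources rather than taking a list):

```go
package main

import "fmt"

// workloadType mirrors the auto-detection above: any Dynamo deployment
// present selects the inference PodMonitor path; otherwise the collector
// falls back to the standalone PyTorch training path.
func workloadType(dynamoDeployments []string) string {
	if len(dynamoDeployments) > 0 {
		return "inference" // worker :9090/metrics + frontend :8000/metrics
	}
	return "training" // :8080/metrics via ServiceMonitor
}

func main() {
	fmt.Println(workloadType([]string{"vllm-agg"})) // inference
	fmt.Println(workloadType(nil))                  // training
}
```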

Embedded manifests (pkg/evidence/scripts/manifests/):

  • dynamo-vllm-agg.yaml — DynamoGraphDeployment with DRA ResourceClaim + KAI queue
  • trainer-pytorch-test.yaml — Standalone PyTorch pod exposing training_step_total, training_loss, training_throughput_samples_per_sec, training_gpu_memory_* metrics

Prometheus query scoping:

  • Inference: targets matched by job prefix dynamo-system/dynamo-
  • Training: targets matched by exact job=pytorch-training-metrics, metrics queried with {job="pytorch-training-metrics"} label
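The two scoping rules can be sketched as below; the helper names are illustrative and not the actual pkg/evidence API:

```go
package main

import (
	"fmt"
	"strings"
)

// isInferenceTarget matches Prometheus targets by job prefix, as the
// inference path does.
func isInferenceTarget(job string) bool {
	return strings.HasPrefix(job, "dynamo-system/dynamo-")
}

// trainingQuery scopes a metric with the exact job label selector used by
// the training path.
func trainingQuery(metric string) string {
	return fmt.Sprintf("%s{job=%q}", metric, "pytorch-training-metrics")
}

func main() {
	fmt.Println(isInferenceTarget("dynamo-system/dynamo-worker")) // true
	fmt.Println(trainingQuery("training_step_total"))
	// training_step_total{job="pytorch-training-metrics"}
}
```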

E2E tested on:

  • aicr-cuj2 (inference): Dynamo vLLM → 10 requests → non-zero dynamo_component_* + dynamo_frontend_* metrics in Prometheus
  • aicr-cuj1 (training): PyTorch training → 100 steps → training_step_total=100, training_loss=1.33, training_throughput=549K samples/s in Prometheus

Testing

# Unit tests
go test -v ./pkg/evidence/... -race     # All pass
go test -v ./pkg/validator/job/... -run TestImagePullPolicy -race  # All pass (7 cases)

# E2E — inference cluster (aicr-cuj2)
aicr validate --recipe h100-inference-recipe.yaml \
  --phase conformance --cncf-submission -f service-metrics \
  --evidence-dir ./evidence
# PASS

# E2E — training cluster (aicr-cuj1)
aicr validate --recipe h100-training-recipe.yaml \
  --phase conformance --cncf-submission -f service-metrics \
  --evidence-dir ./evidence
# PASS

# GPU CI (nvkind)
# GPU Conformance: PASS (17m, was 60m timeout)
# GPU Training: PASS (17m)

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: The metrics alias now maps to accelerator-metrics (previously the combined metrics script section). Use -f service-metrics to collect AI service metrics evidence.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 requested a review from a team as a code owner, and from dims and mchmarny, March 25, 2026
@yuanchen8911 changed the title from "feat(evidence): split ai_service_metrics with Dynamo and Kubeflow Trainer support" to "feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images", Mar 25, 2026
feat(evidence): split ai_service_metrics and fix imagePullPolicy for local images

Split accelerator_metrics/ai_service_metrics evidence into separate paths
with auto-detection of inference (Dynamo) vs training (PyTorch) workloads.

Fix imagePullPolicy regression from NVIDIA#438: local images (ko.local, kind.local,
localhost) now use IfNotPresent instead of Always, preventing 5-minute pull
timeout per validator on nvkind CI clusters.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 merged commit 6137c0b into NVIDIA:main Mar 25, 2026
26 checks passed
