
docs(conformance): add ai_service_metrics evidence for CNCF submission#460

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:docs/ai-service-metrics-evidence
Mar 24, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add dedicated evidence for the ai_service_metrics MUST requirement, splitting it from the shared accelerator-metrics.md file.

Motivation / Context

The CNCF AI Conformance ai_service_metrics requirement asks for "discovering and collecting metrics from workloads that expose them in a standard format (Prometheus exposition format)." Previously both accelerator_metrics and ai_service_metrics pointed to the same accelerator-metrics.md evidence file which only covered DCGM hardware metrics. A reviewer could reasonably object that infrastructure-level GPU metrics are not the same as workload-level service metrics.
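For context, the "standard format" the requirement names is the Prometheus text exposition format. A minimal illustrative sample of what a conforming workload exposes at its `/metrics` endpoint (metric names and values here are hypothetical, not taken from the cluster):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
# HELP request_duration_seconds Request latency histogram.
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 24054
request_duration_seconds_bucket{le="+Inf"} 144320
request_duration_seconds_sum 53423.2
request_duration_seconds_count 144320
```

Infrastructure-level DCGM metrics and workload-level service metrics both use this format, which is why the two requirements were previously conflated; the distinction is in who exposes them, not how.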

Fixes: N/A
Related: CNCF AI Conformance submission

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

New ai-service-metrics.md evidence collected from the aicr-cuj2 EKS cluster, showing:

  • Dynamo operator ServiceMonitor configuration (automatic target discovery)
  • Prometheus actively scraping the Dynamo operator's /metrics endpoint
  • 199 workload-level metrics including dynamo_operator_reconcile_duration_seconds_* and controller_runtime_reconcile_total per CRD controller
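As a sketch of the kind of discovery configuration the evidence captures, a `ServiceMonitor` tells the Prometheus Operator which Services to scrape (names, namespaces, and labels below are illustrative assumptions, not copied from the `aicr-cuj2` cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dynamo-operator          # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dynamo-operator   # matches the operator's Service labels
  namespaceSelector:
    matchNames:
      - dynamo-system            # hypothetical operator namespace
  endpoints:
    - port: metrics              # named Service port exposing /metrics
      path: /metrics
      interval: 30s
```

Once applied, Prometheus discovers the matching endpoints automatically; no per-target scrape config is needed, which is the "automatic target discovery" the evidence demonstrates.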

Updated index.md, submission/README.md, and docs/conformance/cncf/index.md to split the combined accelerator_metrics/ai_service_metrics row into separate entries.

Testing

# Documentation-only change, no code affected

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 24, 2026 13:27
@yuanchen8911 yuanchen8911 added the documentation and area/docs labels Mar 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the docs/ai-service-metrics-evidence branch 2 times, most recently from 7d728b7 to 14003ce March 24, 2026 14:09
@github-actions github-actions bot added size/L and removed size/M labels Mar 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the docs/ai-service-metrics-evidence branch 2 times, most recently from 22a81bd to 0132161 March 24, 2026 14:15
…location

Add dedicated evidence for the ai_service_metrics MUST requirement,
showing Prometheus ServiceMonitor discovery and scraping of a vLLM
inference workload's Prometheus-format metrics endpoint. Evidence includes
real inference traffic: 10 requests, 500 generation tokens, TTFT and
inter-token latency metrics collected from Prometheus.

Update vllm-agg.yaml to use DRA ResourceClaims instead of device-plugin
GPU requests, fixing deployment on DRA-only clusters with KAI scheduler.
Add vllm-metrics-test.yaml for standalone vLLM metrics evidence collection.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
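The device-plugin-to-DRA change described in the commit can be sketched roughly as follows. The API version, claim names, and device class are assumptions based on the upstream DRA API and the NVIDIA DRA driver; the actual `vllm-agg.yaml` may differ:

```yaml
# Before (device plugin): the container requested a GPU via extended resources.
#   resources:
#     limits:
#       nvidia.com/gpu: "1"
#
# After (DRA): the pod references a ResourceClaim instead, which DRA-only
# clusters with the KAI scheduler can satisfy.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: vllm-gpu                 # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com   # NVIDIA DRA driver's device class
---
apiVersion: v1
kind: Pod
metadata:
  name: vllm-metrics-test
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: vllm-gpu
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.18.0
      resources:
        claims:
          - name: gpu            # binds the container to the claim above
```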
@yuanchen8911 yuanchen8911 force-pushed the docs/ai-service-metrics-evidence branch from 0132161 to e9b3b7c March 24, 2026 14:18
@yuanchen8911
Contributor Author

Trivy Findings

The initial push had 8 Trivy findings on demos/workloads/inference/vllm-metrics-test.yaml. 6 have been resolved by hardening the pod spec:

  • ✅ Pod/container default security context → added explicit securityContext
  • ✅ Missing allowPrivilegeEscalation: false → added
  • ✅ Missing seccompProfile → added RuntimeDefault
  • ✅ Missing resource requests/limits → added CPU/memory requests and limits
  • ✅ Floating :latest tag → pinned to v0.18.0
  • ✅ readOnlyRootFilesystem not set → set to true with emptyDir mounts for /root/.cache and /tmp
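Taken together, the hardening above corresponds to a pod spec along these lines (a sketch, not the exact manifest; the CPU/memory values are placeholders):

```yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault       # replaces the unconfined default
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.18.0   # pinned, no floating :latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
      resources:
        requests:
          cpu: "2"               # placeholder values
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 16Gi
      volumeMounts:
        - name: cache            # writable paths vLLM needs despite the
          mountPath: /root/.cache  # read-only root filesystem
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: cache
      emptyDir: {}
    - name: tmp
      emptyDir: {}
```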

2 remaining (justified):

  1. runAsNonRoot: false — Tested runAsUser: 1000 on the cluster; vLLM/PyTorch fails with KeyError: 'getpwuid(): uid not found: 1000' because getpass.getuser() requires a /etc/passwd entry. The upstream vllm/vllm-openai image does not define a non-root user.

  2. Untrusted registry: vllm/vllm-openai is the official vLLM image on Docker Hub. There is no NVIDIA-hosted equivalent on nvcr.io (the Dynamo vLLM runtime wraps vLLM differently and does not expose a Prometheus /metrics endpoint). This manifest is used for evidence collection, not production deployment.
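The `getpwuid` failure in item 1 is easy to reproduce outside the cluster. This hedged sketch uses the `pwd` module directly, which is the fallback `getpass.getuser()` takes when the usual environment variables are unset (as in a container); uid 999999 stands in for any uid with no `/etc/passwd` entry, like uid 1000 in the vLLM image:

```python
import pwd


def login_name(uid: int) -> str:
    """Mirror getpass.getuser()'s fallback: look the uid up in /etc/passwd."""
    return pwd.getpwuid(uid).pw_name


try:
    # No passwd entry exists for this uid on virtually all systems, so the
    # lookup raises KeyError -- the same failure vLLM hits under runAsUser: 1000.
    login_name(999999)
except KeyError as exc:
    print(f"lookup failed: {exc}")
```

This is why the fix must come from the image (defining a non-root user in `/etc/passwd`) rather than from the pod spec alone.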

@yuanchen8911 yuanchen8911 merged commit cfcbc60 into NVIDIA:main Mar 24, 2026
22 of 23 checks passed

Labels

area/docs, documentation, size/L


3 participants