
docs(conformance): add ai_service_metrics evidence for CNCF submission#460

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:docs/ai-service-metrics-evidence
Mar 24, 2026

Conversation

@yuanchen8911
Contributor

Summary

Add dedicated evidence for the ai_service_metrics MUST requirement, splitting it from the shared accelerator-metrics.md file.

Motivation / Context

The CNCF AI Conformance ai_service_metrics requirement asks for "discovering and collecting metrics from workloads that expose them in a standard format (Prometheus exposition format)." Previously both accelerator_metrics and ai_service_metrics pointed to the same accelerator-metrics.md evidence file which only covered DCGM hardware metrics. A reviewer could reasonably object that infrastructure-level GPU metrics are not the same as workload-level service metrics.
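For context, the "standard format" the requirement names is the Prometheus text exposition format. A minimal illustrative sample of what a conforming workload exposes at its `/metrics` endpoint (metric names and values here are hypothetical, not taken from the cluster):

```text
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027
# HELP request_duration_seconds Request latency histogram.
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{le="0.1"} 24054
request_duration_seconds_bucket{le="+Inf"} 144320
request_duration_seconds_sum 53423.2
request_duration_seconds_count 144320
```

Infrastructure-level DCGM metrics and workload-level service metrics both use this format, which is why the two requirements were previously conflated; the distinction is in who exposes them, not how.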

Fixes: N/A
Related: CNCF AI Conformance submission

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

New ai-service-metrics.md evidence collected from the aicr-cuj2 EKS cluster, showing:

  • Dynamo operator ServiceMonitor configuration (automatic target discovery)
  • Prometheus actively scraping the Dynamo operator's /metrics endpoint
  • 199 workload-level metrics including dynamo_operator_reconcile_duration_seconds_* and controller_runtime_reconcile_total per CRD controller
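As a sketch of the kind of discovery configuration the evidence captures, a `ServiceMonitor` tells the Prometheus Operator which Services to scrape (names, namespaces, and labels below are illustrative assumptions, not copied from the `aicr-cuj2` cluster):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dynamo-operator          # hypothetical name
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dynamo-operator   # matches the operator's Service labels
  namespaceSelector:
    matchNames:
      - dynamo-system            # hypothetical operator namespace
  endpoints:
    - port: metrics              # named Service port exposing /metrics
      path: /metrics
      interval: 30s
```

Once applied, Prometheus discovers the matching endpoints automatically; no per-target scrape config is needed, which is the "automatic target discovery" the evidence demonstrates.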

Updated index.md, submission/README.md, and docs/conformance/cncf/index.md to split the combined accelerator_metrics/ai_service_metrics row into separate entries.

Testing

# Documentation-only change, no code affected

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 24, 2026 13:27
@yuanchen8911 yuanchen8911 added the documentation and area/docs labels Mar 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the docs/ai-service-metrics-evidence branch 2 times, most recently from 7d728b7 to 14003ce March 24, 2026 14:09
@github-actions github-actions bot added size/L and removed size/M labels Mar 24, 2026
@yuanchen8911 yuanchen8911 force-pushed the docs/ai-service-metrics-evidence branch 2 times, most recently from 22a81bd to 0132161 March 24, 2026 14:15
…location

Add dedicated evidence for the ai_service_metrics MUST requirement,
showing Prometheus ServiceMonitor discovery and scraping of a vLLM
inference workload's Prometheus-format metrics endpoint. Evidence includes
real inference traffic: 10 requests, 500 generation tokens, TTFT and
inter-token latency metrics collected from Prometheus.

Update vllm-agg.yaml to use DRA ResourceClaims instead of device-plugin
GPU requests, fixing deployment on DRA-only clusters with KAI scheduler.
Add vllm-metrics-test.yaml for standalone vLLM metrics evidence collection.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
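The device-plugin-to-DRA change described in the commit can be sketched roughly as follows. The API version, claim names, and device class are assumptions based on the upstream DRA API and the NVIDIA DRA driver; the actual `vllm-agg.yaml` may differ:

```yaml
# Before (device plugin): the container requested a GPU via extended resources.
#   resources:
#     limits:
#       nvidia.com/gpu: "1"
#
# After (DRA): the pod references a ResourceClaim instead, which DRA-only
# clusters with the KAI scheduler can satisfy.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: vllm-gpu                 # hypothetical name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: gpu.nvidia.com   # NVIDIA DRA driver's device class
---
apiVersion: v1
kind: Pod
metadata:
  name: vllm-metrics-test
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: vllm-gpu
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.18.0
      resources:
        claims:
          - name: gpu            # binds the container to the claim above
```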
@yuanchen8911 yuanchen8911 force-pushed the docs/ai-service-metrics-evidence branch from 0132161 to e9b3b7c March 24, 2026 14:18
@yuanchen8911
Contributor Author

Trivy Findings

The initial push had 8 Trivy findings on demos/workloads/inference/vllm-metrics-test.yaml. 6 have been resolved by hardening the pod spec:

  • ✅ Pod/container default security context → added explicit securityContext
  • ✅ Missing allowPrivilegeEscalation: false → added
  • ✅ Missing seccompProfile → added RuntimeDefault
  • ✅ Missing resource requests/limits → added CPU/memory requests and limits
  • ✅ Floating :latest tag → pinned to v0.18.0
  • ✅ readOnlyRootFilesystem not set → set to true with emptyDir mounts for /root/.cache and /tmp
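Taken together, the hardening above corresponds to a pod spec along these lines (a sketch, not the exact manifest; the CPU/memory values are placeholders):

```yaml
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault       # replaces the unconfined default
  containers:
    - name: vllm
      image: vllm/vllm-openai:v0.18.0   # pinned, no floating :latest
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
      resources:
        requests:
          cpu: "2"               # placeholder values
          memory: 8Gi
        limits:
          cpu: "4"
          memory: 16Gi
      volumeMounts:
        - name: cache            # writable paths vLLM needs despite the
          mountPath: /root/.cache  # read-only root filesystem
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: cache
      emptyDir: {}
    - name: tmp
      emptyDir: {}
```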

2 remaining (justified):

  1. runAsNonRoot: false — Tested runAsUser: 1000 on the cluster; vLLM/PyTorch fails with KeyError: 'getpwuid(): uid not found: 1000' because getpass.getuser() requires a /etc/passwd entry. The upstream vllm/vllm-openai image does not define a non-root user.

  2. Untrusted registry: vllm/vllm-openai is the official vLLM image on Docker Hub. There is no NVIDIA-hosted equivalent on nvcr.io (the Dynamo vLLM runtime wraps vLLM differently and does not expose a Prometheus /metrics endpoint). This manifest is used for evidence collection, not production deployment.
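The `getpwuid` failure in item 1 is easy to reproduce outside the cluster. This hedged sketch uses the `pwd` module directly, which is the fallback `getpass.getuser()` takes when the usual environment variables are unset (as in a container); uid 999999 stands in for any uid with no `/etc/passwd` entry, like uid 1000 in the vLLM image:

```python
import pwd


def login_name(uid: int) -> str:
    """Mirror getpass.getuser()'s fallback: look the uid up in /etc/passwd."""
    return pwd.getpwuid(uid).pw_name


try:
    # No passwd entry exists for this uid on virtually all systems, so the
    # lookup raises KeyError -- the same failure vLLM hits under runAsUser: 1000.
    login_name(999999)
except KeyError as exc:
    print(f"lookup failed: {exc}")
```

This is why the fix must come from the image (defining a non-root user in `/etc/passwd`) rather than from the pod spec alone.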

@yuanchen8911 yuanchen8911 merged commit cfcbc60 into NVIDIA:main Mar 24, 2026
22 of 23 checks passed

Labels

area/docs, documentation, size/L


3 participants