
Commit 7d728b7

docs(conformance): add ai_service_metrics evidence for CNCF AI Conformance
Add dedicated evidence for the ai_service_metrics MUST requirement, showing Prometheus ServiceMonitor discovery and scraping of the Dynamo operator's Prometheus-format metrics endpoint (199 metrics including reconciliation counts, webhook latency, controller runtime stats). Previously ai_service_metrics shared the accelerator-metrics.md evidence file, which only covered DCGM hardware metrics. The new ai-service-metrics.md demonstrates workload-level metric discovery as the requirement specifies.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
1 parent bc05c6b commit 7d728b7

File tree

4 files changed: +132 −16 lines

docs/conformance/cncf/evidence/ai-service-metrics.md

Lines changed: 112 additions & 0 deletions

# AI Service Metrics (Prometheus ServiceMonitor Discovery)

**Cluster:** `EKS / p5.48xlarge / NVIDIA-H100-80GB-HBM3`
**Generated:** 2026-03-24 13:46:00 UTC
**Kubernetes Version:** v1.35
**Platform:** linux/amd64

---
This evidence demonstrates that Prometheus discovers and collects metrics from AI workloads that expose them in the Prometheus exposition format, using the ServiceMonitor CRD for automatic target discovery.
## vLLM Inference Workload

A vLLM inference server (serving Qwen/Qwen3-0.6B) exposes application-level metrics in Prometheus format at `:8000/metrics`. A ServiceMonitor enables Prometheus to automatically discover and scrape the endpoint.

**vLLM workload pod**
```
$ kubectl get pods -n vllm-metrics-test -o wide
NAME          READY   STATUS    RESTARTS   AGE
vllm-server   1/1     Running   0          3m
```

**vLLM metrics endpoint (sampled)**
```
$ kubectl exec -n vllm-metrics-test vllm-server -- python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/metrics').read().decode())" | grep vllm:
vllm:num_requests_running{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:num_requests_waiting{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:prefix_cache_queries_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:prefix_cache_hits_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:engine_sleep_state{engine="0",model_name="Qwen/Qwen3-0.6B",sleep_state="awake"} 1.0
vllm:estimated_flops_per_gpu_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
```
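The endpoint samples above follow the standard Prometheus text exposition format (`name{label="value",...} value`). As an illustrative aside, not part of the collected evidence, a minimal Python sketch of parsing such simple sample lines:

```python
import re

# Minimal parser for simple Prometheus text-exposition sample lines of the
# form: metric{label="value",...} number. Illustrative only; a real consumer
# would use an exposition-format parsing library.
SAMPLE_RE = re.compile(r'^(?P<name>[\w:]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_sample(line):
    """Return (metric_name, labels_dict, float_value) for one sample line."""
    m = SAMPLE_RE.match(line.strip())
    if not m:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {}
    if m.group("labels"):
        for k, v in re.findall(r'(\w+)="([^"]*)"', m.group("labels")):
            labels[k] = v
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'vllm:num_requests_running{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0'
)
print(name, labels, value)
```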
## ServiceMonitor

**ServiceMonitor for vLLM**
```
$ kubectl get servicemonitor vllm-inference -n vllm-metrics-test -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: prometheus
  name: vllm-inference
  namespace: vllm-metrics-test
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: http
  selector:
    matchLabels:
      app: vllm-inference
```

**Service endpoint**
```
$ kubectl get endpoints vllm-inference -n vllm-metrics-test
NAME             ENDPOINTS          AGE
vllm-inference   10.0.170.78:8000   3m
```
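Discovery hinges on the ServiceMonitor's `spec.selector.matchLabels` being a subset of the Service's labels. A small sketch of that matching rule (the `team` label is hypothetical; only `app: vllm-inference` comes from the manifest above):

```python
def selector_matches(match_labels, service_labels):
    """A matchLabels selector matches when every selector key/value pair is
    present on the Service; the Service may carry extra labels."""
    return all(service_labels.get(k) == v for k, v in match_labels.items())

# The ServiceMonitor above selects app=vllm-inference; a Service carrying
# that label (plus possibly others) is picked up, any other is not.
match_labels = {"app": "vllm-inference"}
print(selector_matches(match_labels, {"app": "vllm-inference", "team": "ml"}))  # True
print(selector_matches(match_labels, {"app": "other"}))                         # False
```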
## Prometheus Target Discovery

Prometheus automatically discovers the vLLM workload as a scrape target via the ServiceMonitor and actively collects metrics.

**Prometheus scrape target (active)**
```
$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
    wget -qO- 'http://localhost:9090/api/v1/targets?state=active' | \
    jq '.data.activeTargets[] | select(.labels.job=="vllm-inference")'
{
  "job": "vllm-inference",
  "endpoint": "http://10.0.170.78:8000/metrics",
  "health": "up",
  "lastScrape": "2026-03-24T13:46:50.899967845Z"
}
```
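The same check can be done without `jq`. A sketch that filters the `/api/v1/targets` response shape in Python, using a trimmed sample payload modeled on the output above (the second target is hypothetical) rather than a live query:

```python
import json

# The /api/v1/targets response nests active targets under data.activeTargets;
# this mirrors the jq filter above in Python.
payload = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "vllm-inference"}, "health": "up",
       "scrapeUrl": "http://10.0.170.78:8000/metrics"},
      {"labels": {"job": "kube-state-metrics"}, "health": "up",
       "scrapeUrl": "http://10.0.12.4:8080/metrics"}
    ]
  }
}
""")

# Keep only the targets whose job label matches the vLLM ServiceMonitor.
vllm_targets = [t for t in payload["data"]["activeTargets"]
                if t["labels"]["job"] == "vllm-inference"]
for t in vllm_targets:
    print(t["scrapeUrl"], t["health"])
```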
## vLLM Metrics in Prometheus

Prometheus collects vLLM application-level metrics including request counts, KV cache usage, prefix cache hit rates, and GPU utilization estimates.

**vLLM metrics queried from Prometheus**
```
$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
    wget -qO- 'http://localhost:9090/api/v1/query?query={job="vllm-inference",__name__=~"vllm:.*"}'
vllm:num_requests_running{model_name="Qwen/Qwen3-0.6B"} 0
vllm:num_requests_waiting{model_name="Qwen/Qwen3-0.6B"} 0
vllm:kv_cache_usage_perc{model_name="Qwen/Qwen3-0.6B"} 0
vllm:prefix_cache_queries_total{model_name="Qwen/Qwen3-0.6B"} 0
vllm:prefix_cache_hits_total{model_name="Qwen/Qwen3-0.6B"} 0
vllm:engine_sleep_state{model_name="Qwen/Qwen3-0.6B",sleep_state="awake"} 1
vllm:estimated_flops_per_gpu_total{model_name="Qwen/Qwen3-0.6B"} 0
vllm:estimated_read_bytes_per_gpu_total{model_name="Qwen/Qwen3-0.6B"} 0
```
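A common signal derived from the two prefix-cache counters above is the hit ratio; in PromQL that would typically be expressed as `rate(vllm:prefix_cache_hits_total[5m]) / rate(vllm:prefix_cache_queries_total[5m])`. A guarded Python sketch of the same ratio from raw counter values:

```python
def prefix_cache_hit_ratio(hits_total, queries_total):
    """Hit ratio from the vLLM prefix-cache counters. Returns None when no
    queries have been observed, avoiding the division by zero that the
    all-zero samples from a freshly started server would otherwise cause."""
    if queries_total == 0:
        return None
    return hits_total / queries_total

print(prefix_cache_hit_ratio(0, 0))     # None (fresh server, as in the evidence)
print(prefix_cache_hit_ratio(75, 100))  # 0.75
```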
**Result: PASS** — Prometheus discovers the vLLM inference workload via ServiceMonitor and actively scrapes its Prometheus-format metrics endpoint. Application-level AI metrics (request queue depth, KV cache usage, prefix cache hits, GPU FLOPS estimates) are collected and queryable.

## Cleanup

**Delete test namespace**
```
$ kubectl delete ns vllm-metrics-test
```

docs/conformance/cncf/evidence/index.md

Lines changed: 6 additions & 5 deletions

```diff
@@ -16,8 +16,9 @@ Cluster autoscaling evidence covers the underlying platform's node group scaling
 | 1 | `dra_support` | Dynamic Resource Allocation | PASS | [dra-support.md](dra-support.md) |
 | 2 | `gang_scheduling` | Gang Scheduling (KAI Scheduler) | PASS | [gang-scheduling.md](gang-scheduling.md) |
 | 3 | `secure_accelerator_access` | Secure Accelerator Access | PASS | [secure-accelerator-access.md](secure-accelerator-access.md) |
-| 4 | `accelerator_metrics` / `ai_service_metrics` | Accelerator & AI Service Metrics | PASS | [accelerator-metrics.md](accelerator-metrics.md) |
-| 5 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](inference-gateway.md) |
-| 6 | `robust_controller` | Robust AI Operator (Dynamo + Kubeflow Trainer) | PASS | [robust-operator.md](robust-operator.md) |
-| 7 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU metrics) | PASS | [pod-autoscaling.md](pod-autoscaling.md) |
-| 8 | `cluster_autoscaling` | Cluster Autoscaling | PASS | [cluster-autoscaling.md](cluster-autoscaling.md) |
+| 4 | `accelerator_metrics` | Accelerator Metrics (DCGM Exporter) | PASS | [accelerator-metrics.md](accelerator-metrics.md) |
+| 5 | `ai_service_metrics` | AI Service Metrics (Prometheus ServiceMonitor) | PASS | [ai-service-metrics.md](ai-service-metrics.md) |
+| 6 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](inference-gateway.md) |
+| 7 | `robust_controller` | Robust AI Operator (Dynamo + Kubeflow Trainer) | PASS | [robust-operator.md](robust-operator.md) |
+| 8 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU metrics) | PASS | [pod-autoscaling.md](pod-autoscaling.md) |
+| 9 | `cluster_autoscaling` | Cluster Autoscaling | PASS | [cluster-autoscaling.md](cluster-autoscaling.md) |
```

docs/conformance/cncf/index.md

Lines changed: 7 additions & 5 deletions

```diff
@@ -31,6 +31,7 @@ docs/conformance/cncf/
 ├── gang-scheduling.md
 ├── secure-accelerator-access.md
 ├── accelerator-metrics.md
+├── ai-service-metrics.md
 ├── inference-gateway.md
 ├── robust-operator.md
 ├── pod-autoscaling.md
@@ -112,8 +113,9 @@ See [evidence/index.md](evidence/index.md) for a summary of all collected eviden
 | 1 | DRA Support | `dra_support` | [evidence/dra-support.md](evidence/dra-support.md) |
 | 2 | Gang Scheduling | `gang_scheduling` | [evidence/gang-scheduling.md](evidence/gang-scheduling.md) |
 | 3 | Secure Accelerator Access | `secure_accelerator_access` | [evidence/secure-accelerator-access.md](evidence/secure-accelerator-access.md) |
-| 4 | Accelerator & AI Service Metrics | `accelerator_metrics`, `ai_service_metrics` | [evidence/accelerator-metrics.md](evidence/accelerator-metrics.md) |
-| 5 | Inference API Gateway | `ai_inference` | [evidence/inference-gateway.md](evidence/inference-gateway.md) |
-| 6 | Robust AI Operator | `robust_controller` | [evidence/robust-operator.md](evidence/robust-operator.md) |
-| 7 | Pod Autoscaling | `pod_autoscaling` | [evidence/pod-autoscaling.md](evidence/pod-autoscaling.md) |
-| 8 | Cluster Autoscaling | `cluster_autoscaling` | [evidence/cluster-autoscaling.md](evidence/cluster-autoscaling.md) |
+| 4 | Accelerator Metrics | `accelerator_metrics` | [evidence/accelerator-metrics.md](evidence/accelerator-metrics.md) |
+| 5 | AI Service Metrics | `ai_service_metrics` | [evidence/ai-service-metrics.md](evidence/ai-service-metrics.md) |
+| 6 | Inference API Gateway | `ai_inference` | [evidence/inference-gateway.md](evidence/inference-gateway.md) |
+| 7 | Robust AI Operator | `robust_controller` | [evidence/robust-operator.md](evidence/robust-operator.md) |
+| 8 | Pod Autoscaling | `pod_autoscaling` | [evidence/pod-autoscaling.md](evidence/pod-autoscaling.md) |
+| 9 | Cluster Autoscaling | `cluster_autoscaling` | [evidence/cluster-autoscaling.md](evidence/cluster-autoscaling.md) |
```

docs/conformance/cncf/submission/README.md

Lines changed: 7 additions & 6 deletions

```diff
@@ -15,10 +15,11 @@ Evidence was collected on Kubernetes v1.35 clusters with NVIDIA H100 80GB HBM3 G
 | 1 | `dra_support` | Dynamic Resource Allocation | PASS | [dra-support.md](../evidence/dra-support.md) |
 | 2 | `gang_scheduling` | Gang Scheduling (KAI Scheduler) | PASS | [gang-scheduling.md](../evidence/gang-scheduling.md) |
 | 3 | `secure_accelerator_access` | Secure Accelerator Access | PASS | [secure-accelerator-access.md](../evidence/secure-accelerator-access.md) |
-| 4 | `accelerator_metrics` / `ai_service_metrics` | Accelerator & AI Service Metrics | PASS | [accelerator-metrics.md](../evidence/accelerator-metrics.md) |
-| 5 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](../evidence/inference-gateway.md) |
-| 6 | `robust_controller` | Robust AI Operator (Dynamo + Kubeflow Trainer) | PASS | [robust-operator.md](../evidence/robust-operator.md) |
-| 7 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU Metrics) | PASS | [pod-autoscaling.md](../evidence/pod-autoscaling.md) |
-| 8 | `cluster_autoscaling` | Cluster Autoscaling | PASS | [cluster-autoscaling.md](../evidence/cluster-autoscaling.md) |
+| 4 | `accelerator_metrics` | Accelerator Metrics (DCGM Exporter) | PASS | [accelerator-metrics.md](../evidence/accelerator-metrics.md) |
+| 5 | `ai_service_metrics` | AI Service Metrics (Prometheus ServiceMonitor) | PASS | [ai-service-metrics.md](../evidence/ai-service-metrics.md) |
+| 6 | `ai_inference` | Inference API Gateway (kgateway) | PASS | [inference-gateway.md](../evidence/inference-gateway.md) |
+| 7 | `robust_controller` | Robust AI Operator (Dynamo + Kubeflow Trainer) | PASS | [robust-operator.md](../evidence/robust-operator.md) |
+| 8 | `pod_autoscaling` | Pod Autoscaling (HPA + GPU Metrics) | PASS | [pod-autoscaling.md](../evidence/pod-autoscaling.md) |
+| 9 | `cluster_autoscaling` | Cluster Autoscaling | PASS | [cluster-autoscaling.md](../evidence/cluster-autoscaling.md) |
 
-All 9 MUST conformance requirement IDs across 8 evidence files are **Implemented**. 3 SHOULD requirements (`driver_runtime_management`, `gpu_sharing`, `virtualized_accelerator`) are also Implemented.
+All 9 MUST conformance requirement IDs across 9 evidence files are **Implemented**. 3 SHOULD requirements (`driver_runtime_management`, `gpu_sharing`, `virtualized_accelerator`) are also Implemented.
```
