# AI Service Metrics (Prometheus ServiceMonitor Discovery)

**Cluster:** `EKS / p5.48xlarge / NVIDIA-H100-80GB-HBM3`
**Generated:** 2026-03-24 13:46:00 UTC
**Kubernetes Version:** v1.35
**Platform:** linux/amd64

---

This report demonstrates that Prometheus discovers and collects metrics from AI
workloads that expose them in the Prometheus exposition format, using the
ServiceMonitor CRD for automatic target discovery.

## vLLM Inference Workload

A vLLM inference server (serving Qwen/Qwen3-0.6B) exposes application-level
metrics in Prometheus format at `:8000/metrics`. A ServiceMonitor enables
Prometheus to automatically discover and scrape the endpoint.
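
For reference, such a workload can be stood up with a manifest along these lines. This is a hedged sketch, not the manifest used for the run above; the image tag, GPU request, and the `app: vllm-inference` label key are assumptions (the label is inferred from the ServiceMonitor selector shown later).

```
apiVersion: v1
kind: Pod
metadata:
  name: vllm-server
  namespace: vllm-metrics-test
  labels:
    app: vllm-inference              # assumed label; matched by the Service / ServiceMonitor
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest   # assumed image tag
    args: ["--model", "Qwen/Qwen3-0.6B"]
    ports:
    - name: http                     # OpenAI-compatible API and /metrics on the same port
      containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1            # assumed single-GPU request
```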

**vLLM workload pod**
```
$ kubectl get pods -n vllm-metrics-test -o wide
NAME          READY   STATUS    RESTARTS   AGE
vllm-server   1/1     Running   0          3m
```

**vLLM metrics endpoint (sampled)**
```
$ kubectl exec -n vllm-metrics-test vllm-server -- python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/metrics').read().decode())" | grep vllm:
vllm:num_requests_running{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:num_requests_waiting{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:prefix_cache_queries_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:prefix_cache_hits_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
vllm:engine_sleep_state{engine="0",model_name="Qwen/Qwen3-0.6B",sleep_state="awake"} 1.0
vllm:estimated_flops_per_gpu_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
```
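
The same endpoint can also be sampled from a workstation instead of exec'ing into the pod; a minimal sketch, assuming `curl` is available locally (pod name and port taken from the output above):

```
# Forward the vLLM HTTP port to localhost, then scrape /metrics directly
$ kubectl port-forward -n vllm-metrics-test pod/vllm-server 8000:8000 &
$ curl -s http://localhost:8000/metrics | grep '^vllm:' | head
```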

## ServiceMonitor

**ServiceMonitor for vLLM**
```
$ kubectl get servicemonitor vllm-inference -n vllm-metrics-test -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: prometheus
  name: vllm-inference
  namespace: vllm-metrics-test
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: http
  selector:
    matchLabels:
      app: vllm-inference
```
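
The ServiceMonitor's `selector.matchLabels` and `port: http` refer to labels and a named port on the backing Service, not on the pod. A compatible Service would look roughly like the sketch below (an assumed shape; it was not dumped from the cluster):

```
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
  namespace: vllm-metrics-test
  labels:
    app: vllm-inference        # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: vllm-inference        # assumed pod label selecting vllm-server
  ports:
  - name: http                 # named port referenced by the ServiceMonitor endpoint
    port: 8000
    targetPort: 8000
```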

**Service endpoint**
```
$ kubectl get endpoints vllm-inference -n vllm-metrics-test
NAME             ENDPOINTS           AGE
vllm-inference   10.0.170.78:8000    3m
```

## Prometheus Target Discovery

Prometheus automatically discovers the vLLM workload as a scrape target via
the ServiceMonitor and actively collects metrics.
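
With a kube-prometheus-stack style deployment, the operator typically only selects ServiceMonitors whose labels match the Prometheus resource's `serviceMonitorSelector`, which is the likely reason the ServiceMonitor above carries the `release: prometheus` label. The effective selector can be inspected with a command along these lines (a sketch; the commented value is an expectation, not captured output):

```
$ kubectl get prometheus -n monitoring \
    -o jsonpath='{.items[0].spec.serviceMonitorSelector}'
# expected to show a matchLabels selector such as release: prometheus
```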

**Prometheus scrape target (active)**
```
$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/targets?state=active' | \
  jq '.data.activeTargets[] | select(.labels.job=="vllm-inference")
      | {job: .labels.job, endpoint: .scrapeUrl, health: .health, lastScrape: .lastScrape}'
{
  "job": "vllm-inference",
  "endpoint": "http://10.0.170.78:8000/metrics",
  "health": "up",
  "lastScrape": "2026-03-24T13:46:50.899967845Z"
}
```
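
Scrape health can also be checked through the synthetic `up` series that Prometheus records for every target; a quick sketch using the same in-pod `wget` approach as above:

```
$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query=up{job="vllm-inference"}'
# a healthy target reports a value of 1 for this instant query
```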

## vLLM Metrics in Prometheus

Prometheus collects vLLM application-level metrics including running and
waiting request counts, KV cache usage, prefix cache hit rates, and estimated
per-GPU FLOPS and read bytes.

**vLLM metrics queried from Prometheus (summarized)**
```
$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/query?query={job="vllm-inference",__name__=~"vllm:.*"}'
vllm:num_requests_running{model_name="Qwen/Qwen3-0.6B"} 0
vllm:num_requests_waiting{model_name="Qwen/Qwen3-0.6B"} 0
vllm:kv_cache_usage_perc{model_name="Qwen/Qwen3-0.6B"} 0
vllm:prefix_cache_queries_total{model_name="Qwen/Qwen3-0.6B"} 0
vllm:prefix_cache_hits_total{model_name="Qwen/Qwen3-0.6B"} 0
vllm:engine_sleep_state{model_name="Qwen/Qwen3-0.6B",sleep_state="awake"} 1
vllm:estimated_flops_per_gpu_total{model_name="Qwen/Qwen3-0.6B"} 0
vllm:estimated_read_bytes_per_gpu_total{model_name="Qwen/Qwen3-0.6B"} 0
```
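
Once these series are in Prometheus, derived expressions can be built on top of them in PromQL; the queries below are illustrative examples (hypothetical dashboard/alert expressions, not part of the test run):

```
# Prefix cache hit ratio over the last 5 minutes
rate(vllm:prefix_cache_hits_total[5m])
  / rate(vllm:prefix_cache_queries_total[5m])

# Total in-flight work (running + waiting requests) per model
sum by (model_name) (vllm:num_requests_running + vllm:num_requests_waiting)
```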

**Result: PASS.** Prometheus discovers the vLLM inference workload via ServiceMonitor and actively scrapes its Prometheus-format metrics endpoint. Application-level AI metrics (request queue depth, KV cache usage, prefix cache hits, GPU FLOPS estimates) are collected and queryable.

## Cleanup

**Delete test namespace**
```
$ kubectl delete ns vllm-metrics-test
```
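
Because the ServiceMonitor and Service are namespaced objects inside `vllm-metrics-test`, deleting the namespace removes them as well, and the scrape target should drop out of Prometheus within a scrape interval or two. A sketch of how that can be confirmed by re-running the targets query:

```
$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
  wget -qO- 'http://localhost:9090/api/v1/targets?state=active' | \
  jq '[.data.activeTargets[] | select(.labels.job=="vllm-inference")] | length'
# expected to print 0 once the target has been dropped
```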