Commit 5411437

feat(evidence): split ai_service_metrics into Dynamo PodMonitor-based collection
Split the combined accelerator_metrics/ai_service_metrics evidence into separate collection paths:

- accelerator-metrics: DCGM Exporter hardware GPU metrics (unchanged)
- ai-service-metrics: Dynamo inference workload metrics via PodMonitor

The Dynamo operator auto-creates PodMonitors for worker/frontend pods, which Prometheus discovers and scrapes. The worker runtime exposes both Dynamo-specific metrics (dynamo_component_requests_total, request duration, bytes) and embedded vLLM metrics on port 9090.

Evidence now uses a running DynamoGraphDeployment (vllm-agg) instead of a standalone vLLM pod, demonstrating the full AICR inference platform metrics pipeline.

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
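The worker's port-9090 endpoint speaks the plain Prometheus text exposition format, so its samples can be picked apart with nothing but string handling. A minimal Go sketch of that (the sample payload and the `counterValue` helper are illustrative, not code from this repo):

```go
package main

import (
	"fmt"
	"strings"
)

// Sample payload mimicking the worker's /metrics output on port 9090
// (values illustrative, not captured from a live cluster).
const payload = `# TYPE dynamo_component_requests_total counter
dynamo_component_requests_total{dynamo_component="backend",dynamo_endpoint="generate"} 10
dynamo_component_uptime_seconds 223.250`

// counterValue returns the value of the first sample whose metric name matches.
func counterValue(exposition, name string) (string, bool) {
	for _, line := range strings.Split(exposition, "\n") {
		if strings.HasPrefix(line, "#") { // skip HELP/TYPE comment lines
			continue
		}
		// Sample lines are "<name>{labels} <value>" or "<name> <value>".
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		metric := fields[0]
		if cut := strings.IndexByte(metric, '{'); cut >= 0 {
			metric = metric[:cut] // drop the label set
		}
		if metric == name {
			return fields[1], true
		}
	}
	return "", false
}

func main() {
	v, ok := counterValue(payload, "dynamo_component_requests_total")
	fmt.Println(v, ok)
}
```

This handles only the simple counter/gauge lines shown in the evidence; a real scraper would use a full exposition-format parser.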
1 parent cfcbc60 commit 5411437

File tree: 7 files changed (+835, −116 lines)

Lines changed: 159 additions & 71 deletions
@@ -1,118 +1,206 @@
-# AI Service Metrics (Prometheus ServiceMonitor Discovery)
+# AI Service Metrics (Prometheus Discovery)
 
-**Cluster:** `EKS / p5.48xlarge / NVIDIA-H100-80GB-HBM3`
-**Generated:** 2026-03-24 14:06:00 UTC
 **Kubernetes Version:** v1.35
 **Platform:** linux/amd64
+**Validated on:** EKS / p5.48xlarge / NVIDIA H100 80GB HBM3
 
 ---
 
 Demonstrates that Prometheus discovers and collects metrics from AI workloads
-that expose them in Prometheus exposition format, using the ServiceMonitor CRD
-for automatic target discovery.
+that expose them in Prometheus exposition format, using PodMonitor and
+ServiceMonitor CRDs for automatic target discovery across both inference and
+training workloads.
 
-## vLLM Inference Workload
+## Inference: Dynamo Platform (PodMonitor)
 
-A vLLM inference server (serving Qwen/Qwen3-0.6B on GPU via DRA ResourceClaim)
-exposes application-level metrics in Prometheus format at `:8000/metrics`.
-A ServiceMonitor enables Prometheus to automatically discover and scrape the endpoint.
+**Cluster:** `aicr-cuj2` (EKS, inference)
+**Generated:** 2026-03-25 10:18:30 UTC
 
-**vLLM workload pod**
+The Dynamo operator auto-creates PodMonitors for worker and frontend pods.
+The Dynamo vLLM runtime exposes both Dynamo-specific and embedded vLLM metrics
+on port 9090 (`system` port) in Prometheus format.
+
+### Dynamo Workload Pods
+
+**Dynamo workload pods**
 ```
-$ kubectl get pods -n vllm-metrics-test -o wide
-NAME          READY   STATUS    RESTARTS   AGE
-vllm-server   1/1     Running   0          5m
+$ kubectl get pods -n dynamo-workload -o wide
+NAME                                READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
+vllm-agg-0-frontend-qqrff           1/1     Running   0          3m29s   10.0.159.241   ip-10-0-184-187.ec2.internal   <none>           <none>
+vllm-agg-0-vllmdecodeworker-95ths   1/1     Running   0          3m29s   10.0.214.229   ip-10-0-180-136.ec2.internal   <none>           <none>
 ```
 
-**vLLM metrics endpoint (sampled after 10 inference requests)**
+### Worker Metrics Endpoint
+
+**Worker metrics (sampled after 10 inference requests)**
 ```
-$ kubectl exec -n vllm-metrics-test vllm-server -- python3 -c "..." | grep vllm:
-vllm:request_success_total{engine="0",finished_reason="length",model_name="Qwen/Qwen3-0.6B"} 10.0
-vllm:prompt_tokens_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 80.0
-vllm:generation_tokens_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 500.0
-vllm:time_to_first_token_seconds_count{engine="0",model_name="Qwen/Qwen3-0.6B"} 10.0
-vllm:time_to_first_token_seconds_sum{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.205
-vllm:inter_token_latency_seconds_count{engine="0",model_name="Qwen/Qwen3-0.6B"} 490.0
-vllm:inter_token_latency_seconds_sum{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.864
-vllm:e2e_request_latency_seconds_count{engine="0",model_name="Qwen/Qwen3-0.6B"} 10.0
-vllm:kv_cache_usage_perc{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
-vllm:prefix_cache_queries_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 80.0
-vllm:num_requests_running{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
-vllm:num_requests_waiting{engine="0",model_name="Qwen/Qwen3-0.6B"} 0.0
+dynamo_component_request_bytes_total{dynamo_component="backend",dynamo_endpoint="generate",model="Qwen/Qwen3-0.6B"} 11230
+dynamo_component_request_duration_seconds_sum{dynamo_component="backend",dynamo_endpoint="generate",model="Qwen/Qwen3-0.6B"} 0.984
+dynamo_component_request_duration_seconds_count{dynamo_component="backend",dynamo_endpoint="generate",model="Qwen/Qwen3-0.6B"} 10
+dynamo_component_requests_total{dynamo_component="backend",dynamo_endpoint="generate",model="Qwen/Qwen3-0.6B"} 10
+dynamo_component_response_bytes_total{dynamo_component="backend",dynamo_endpoint="generate",model="Qwen/Qwen3-0.6B"} 31826
+dynamo_component_uptime_seconds 223.250
+vllm:engine_sleep_state{engine="0",model_name="Qwen/Qwen3-0.6B",sleep_state="awake"} 1.0
+vllm:prefix_cache_queries_total{engine="0",model_name="Qwen/Qwen3-0.6B"} 50.0
 ```
 
-## ServiceMonitor
+### PodMonitors (Auto-Created by Dynamo Operator)
 
-**ServiceMonitor for vLLM**
+**Dynamo PodMonitors**
+```
+$ kubectl get podmonitors -n dynamo-system
+NAME              AGE
+dynamo-frontend   11d
+dynamo-planner    11d
+dynamo-worker     11d
 ```
-$ kubectl get servicemonitor vllm-inference -n vllm-metrics-test -o yaml
+
+**Worker PodMonitor spec**
+```
+$ kubectl get podmonitor dynamo-worker -n dynamo-system -o yaml
 apiVersion: monitoring.coreos.com/v1
-kind: ServiceMonitor
+kind: PodMonitor
 metadata:
-  labels:
-    release: prometheus
-  name: vllm-inference
-  namespace: vllm-metrics-test
+  name: dynamo-worker
+  namespace: dynamo-system
 spec:
-  endpoints:
-  - interval: 15s
+  namespaceSelector:
+    any: true
+  podMetricsEndpoints:
+  - interval: 5s
     path: /metrics
-    port: http
+    port: system
   selector:
     matchLabels:
-      app: vllm-inference
+      nvidia.com/dynamo-component-type: worker
+      nvidia.com/metrics-enabled: "true"
 ```
 
-**Service endpoint**
+### Prometheus Target Discovery
+
+**Prometheus scrape targets (active)**
 ```
-$ kubectl get endpoints vllm-inference -n vllm-metrics-test
-NAME             ENDPOINTS          AGE
-vllm-inference   10.0.170.78:8000   5m
+{
+  "job": "dynamo-system/dynamo-frontend",
+  "endpoint": "http://10.0.159.241:8000/metrics",
+  "health": "up",
+  "lastScrape": "2026-03-25T10:19:21.101766071Z"
+}
+{
+  "job": "dynamo-system/dynamo-worker",
+  "endpoint": "http://10.0.214.229:9090/metrics",
+  "health": "up",
+  "lastScrape": "2026-03-25T10:19:22.70334816Z"
+}
+```
+
+### Dynamo Metrics in Prometheus
+
+**Dynamo metrics queried from Prometheus (after 10 inference requests)**
+```
+dynamo_component_requests_total{endpoint="generate"} = 10
+dynamo_component_request_bytes_total{endpoint="generate"} = 11230
+dynamo_component_response_bytes_total{endpoint="generate"} = 31826
+dynamo_component_request_duration_seconds_count{endpoint="generate"} = 10
+dynamo_component_request_duration_seconds_sum{endpoint="generate"} = 0.984
+dynamo_component_uptime_seconds = 223.250
+dynamo_frontend_input_sequence_tokens_sum = 50
+dynamo_frontend_input_sequence_tokens_count = 10
+dynamo_frontend_inter_token_latency_seconds_sum = 0.866
+dynamo_frontend_inter_token_latency_seconds_count = 490
+dynamo_frontend_model_context_length = 40960
+dynamo_frontend_model_total_kv_blocks = 37710
 ```
 
-## Prometheus Target Discovery
+**Result: PASS** — Prometheus discovers Dynamo inference workloads (frontend + worker) via operator-managed PodMonitors and actively scrapes their Prometheus-format metrics endpoints. Application-level AI inference metrics (request count, request duration, inter-token latency, token throughput, KV cache utilization) are collected and queryable.
+
+---
+
+## Training: Kubeflow Trainer (ServiceMonitor)
+
+**Cluster:** `aicr-cuj1` (EKS, training)
+**Generated:** 2026-03-25 10:38:58 UTC
 
-Prometheus automatically discovers the vLLM workload as a scrape target via
-the ServiceMonitor and actively collects metrics.
+The Kubeflow Trainer controller-manager exposes training-specific metrics
+(TrainJob reconciliation, webhook latency) on port 8443 (HTTPS) in Prometheus
+format, discovered via ServiceMonitor.
+
+### Kubeflow Trainer Components
+
+**Kubeflow Trainer deployments**
+```
+$ kubectl get deploy -n kubeflow
+NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
+jobset-controller                     1/1     1            1           12d
+kubeflow-trainer-controller-manager   1/1     1            1           12d
+```
+
+### ServiceMonitor
+
+**Kubeflow Trainer ServiceMonitor**
+```
+$ kubectl get servicemonitor kubeflow-trainer -n kubeflow -o yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  labels:
+    release: kube-prometheus-stack
+  name: kubeflow-trainer
+  namespace: kubeflow
+spec:
+  endpoints:
+  - interval: 15s
+    path: /metrics
+    port: metrics
+    scheme: https
+    tlsConfig:
+      insecureSkipVerify: true
+  selector:
+    matchLabels:
+      app.kubernetes.io/component: manager
+      app.kubernetes.io/name: kubeflow-trainer
+```
+
+### Prometheus Target Discovery
 
 **Prometheus scrape target (active)**
 ```
-$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
-  wget -qO- 'http://localhost:9090/api/v1/targets?state=active' | \
-  jq '.data.activeTargets[] | select(.labels.job=="vllm-inference")'
 {
-  "job": "vllm-inference",
-  "endpoint": "http://10.0.170.78:8000/metrics",
+  "job": "kubeflow-trainer-controller-manager",
+  "endpoint": "https://10.0.7.127:8443/metrics",
   "health": "up",
-  "lastScrape": "2026-03-24T14:06:50.899967845Z"
+  "lastScrape": "2026-03-25T10:39:07.735479672Z"
 }
 ```
 
-## vLLM Metrics in Prometheus
-
-Prometheus collects vLLM application-level inference metrics including request
-throughput, token counts, latency distributions, and KV cache utilization.
+### Kubeflow Trainer Metrics in Prometheus
 
-**vLLM metrics queried from Prometheus (after 10 inference requests)**
+**Kubeflow Trainer metrics queried from Prometheus**
 ```
-$ kubectl exec -n monitoring prometheus-kube-prometheus-prometheus-0 -- \
-  wget -qO- 'http://localhost:9090/api/v1/query?query={job="vllm-inference",__name__=~"vllm:.*"}'
-vllm:request_success_total{model_name="Qwen/Qwen3-0.6B"} 10
-vllm:prompt_tokens_total{model_name="Qwen/Qwen3-0.6B"} 80
-vllm:generation_tokens_total{model_name="Qwen/Qwen3-0.6B"} 500
-vllm:time_to_first_token_seconds_count{model_name="Qwen/Qwen3-0.6B"} 10
-vllm:time_to_first_token_seconds_sum{model_name="Qwen/Qwen3-0.6B"} 0.205
-vllm:inter_token_latency_seconds_count{model_name="Qwen/Qwen3-0.6B"} 490
-vllm:inter_token_latency_seconds_sum{model_name="Qwen/Qwen3-0.6B"} 0.864
-vllm:prefix_cache_queries_total{model_name="Qwen/Qwen3-0.6B"} 80
-vllm:iteration_tokens_total_sum{model_name="Qwen/Qwen3-0.6B"} 580
+controller_runtime_max_concurrent_reconciles{controller="trainjob_controller"} = 1
+controller_runtime_reconcile_total{controller="trainjob_controller"} = 112
+controller_runtime_reconcile_errors_total{controller="trainjob_controller"} = 7
+controller_runtime_reconcile_time_seconds_sum{controller="trainjob_controller"} = 0.458
+controller_runtime_reconcile_time_seconds_count{controller="trainjob_controller"} = 112
+controller_runtime_webhook_latency_seconds_sum = 0.001
+controller_runtime_webhook_latency_seconds_count = 2
+controller_runtime_webhook_requests_total = 2
 ```
 
-**Result: PASS** — Prometheus discovers the vLLM inference workload via ServiceMonitor and actively scrapes its Prometheus-format metrics endpoint. Application-level AI inference metrics (request success count, prompt/generation token throughput, time-to-first-token latency, inter-token latency, KV cache usage, prefix cache queries) are collected and queryable in Prometheus.
+**Result: PASS** — Prometheus discovers the Kubeflow Trainer controller via ServiceMonitor and actively scrapes its Prometheus-format metrics endpoint. Training-specific metrics (TrainJob reconciliation, webhook latency) are collected and queryable.
+
+---
+
+## Summary
+
+| Workload | Discovery | Metrics Port | Metrics Type | Result |
+|----------|-----------|-------------|--------------|--------|
+| **Dynamo vLLM** (inference) | PodMonitor (auto-created) | 9090 (HTTP) | `dynamo_component_*`, `dynamo_frontend_*`, `vllm:*` | **PASS** |
+| **Kubeflow Trainer** (training) | ServiceMonitor | 8443 (HTTPS) | `controller_runtime_*{controller="trainjob_controller"}` | **PASS** |
 
 ## Cleanup
 
-**Delete test namespace**
+**Delete inference workload**
 ```
-$ kubectl delete ns vllm-metrics-test
+$ kubectl delete ns dynamo-workload
 ```

pkg/evidence/collector.go

Lines changed: 13 additions & 9 deletions
@@ -39,6 +39,7 @@ var ValidFeatures = []string{
 	"gang-scheduling",
 	"secure-access",
 	"accelerator-metrics",
+	"ai-service-metrics",
 	"inference-gateway",
 	"robust-operator",
 	"pod-autoscaling",
@@ -50,7 +51,8 @@ var featureToScript = map[string]string{
 	"dra-support":         "dra",
 	"gang-scheduling":     "gang",
 	"secure-access":       "secure",
-	"accelerator-metrics": "metrics",
+	"accelerator-metrics": "accelerator-metrics",
+	"ai-service-metrics":  "service-metrics",
 	"inference-gateway":   "gateway",
 	"robust-operator":     "operator",
 	"pod-autoscaling":     "hpa",
@@ -59,13 +61,14 @@ var featureToScript = map[string]string{
 
 // featureAliases maps short names to canonical feature names for convenience.
 var featureAliases = map[string]string{
-	"dra":      "dra-support",
-	"gang":     "gang-scheduling",
-	"secure":   "secure-access",
-	"metrics":  "accelerator-metrics",
-	"gateway":  "inference-gateway",
-	"operator": "robust-operator",
-	"hpa":      "pod-autoscaling",
+	"dra":             "dra-support",
+	"gang":            "gang-scheduling",
+	"secure":          "secure-access",
+	"metrics":         "accelerator-metrics",
+	"service-metrics": "ai-service-metrics",
+	"gateway":         "inference-gateway",
+	"operator":        "robust-operator",
+	"hpa":             "pod-autoscaling",
 }
 
 // ResolveFeature returns the canonical feature name, resolving aliases.
@@ -103,7 +106,8 @@ var FeatureDescriptions = map[string]string{
 	"dra-support":         "DRA GPU allocation test",
 	"gang-scheduling":     "Gang scheduling co-scheduling test",
 	"secure-access":       "Secure accelerator access verification",
-	"accelerator-metrics": "Accelerator & AI service metrics",
+	"accelerator-metrics": "Accelerator metrics (DCGM exporter)",
+	"ai-service-metrics":  "AI service metrics (Prometheus ServiceMonitor discovery)",
 	"inference-gateway":   "Inference API gateway conditions",
 	"robust-operator":     "Robust AI operator + webhook test",
 	"pod-autoscaling":     "HPA pod autoscaling (scale-up + scale-down)",
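The alias and script maps in this diff imply the lookup chain a user-facing name goes through: alias → canonical feature → script section. A standalone sketch of that chain (maps trimmed to the two entries this commit touches; `resolveFeature` is a simplified stand-in for the package's exported `ResolveFeature`, whose real body may differ):

```go
package main

import "fmt"

// Trimmed copies of the maps from pkg/evidence/collector.go in this commit.
var featureAliases = map[string]string{
	"metrics":         "accelerator-metrics",
	"service-metrics": "ai-service-metrics",
}

var featureToScript = map[string]string{
	"accelerator-metrics": "accelerator-metrics",
	"ai-service-metrics":  "service-metrics",
}

// resolveFeature returns the canonical feature name, resolving aliases.
func resolveFeature(name string) string {
	if canonical, ok := featureAliases[name]; ok {
		return canonical
	}
	return name
}

func main() {
	for _, n := range []string{"service-metrics", "ai-service-metrics"} {
		c := resolveFeature(n)
		fmt.Println(n, "->", c, "script:", featureToScript[c])
	}
}
```

Note the deliberate asymmetry after the split: the `service-metrics` alias resolves to the `ai-service-metrics` feature, whose script section is named `service-metrics` again, while `accelerator-metrics` now maps to a script section of the same name.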

pkg/evidence/collector_test.go

Lines changed: 3 additions & 1 deletion
@@ -31,6 +31,7 @@ func TestResolveFeature(t *testing.T) {
 		{"alias gang", "gang", "gang-scheduling"},
 		{"alias secure", "secure", "secure-access"},
 		{"alias metrics", "metrics", "accelerator-metrics"},
+		{"alias service-metrics", "service-metrics", "ai-service-metrics"},
 		{"alias gateway", "gateway", "inference-gateway"},
 		{"alias operator", "operator", "robust-operator"},
 		{"alias hpa", "hpa", "pod-autoscaling"},
@@ -56,7 +57,8 @@ func TestScriptSection(t *testing.T) {
 		{"dra-support", "dra-support", "dra"},
 		{"gang-scheduling", "gang-scheduling", "gang"},
 		{"secure-access", "secure-access", "secure"},
-		{"accelerator-metrics", "accelerator-metrics", "metrics"},
+		{"accelerator-metrics", "accelerator-metrics", "accelerator-metrics"},
+		{"ai-service-metrics", "ai-service-metrics", "service-metrics"},
 		{"inference-gateway", "inference-gateway", "gateway"},
 		{"robust-operator", "robust-operator", "operator"},
 		{"pod-autoscaling", "pod-autoscaling", "hpa"},

pkg/evidence/requirements.go

Lines changed: 5 additions & 5 deletions
@@ -47,15 +47,15 @@ var requirements = map[string]requirementMeta{
 	},
 	"accelerator-metrics": {
 		RequirementID: "accelerator_metrics",
-		Title:         "Accelerator & AI Service Metrics",
+		Title:         "Accelerator Metrics (DCGM Exporter)",
 		Description:   "Demonstrates that the DCGM exporter exposes per-GPU metrics (utilization, memory, temperature, power) in Prometheus format.",
 		File:          "accelerator-metrics.md",
 	},
 	"ai-service-metrics": {
-		RequirementID: "accelerator_metrics",
-		Title:         "Accelerator & AI Service Metrics",
-		Description:   "Demonstrates that GPU metrics flow through Prometheus and are available via the Kubernetes custom metrics API for HPA scaling.",
-		File:          "accelerator-metrics.md",
+		RequirementID: "ai_service_metrics",
+		Title:         "AI Service Metrics (Prometheus ServiceMonitor Discovery)",
+		Description:   "Demonstrates that Prometheus discovers and collects metrics from AI workloads exposing Prometheus exposition format via ServiceMonitors.",
+		File:          "ai-service-metrics.md",
 	},
 	"inference-gateway": {
 		RequirementID: "ai_inference",
