This document describes the metrics collection feature, which captures system and application metrics during benchmark runs.
The metrics collection system automatically gathers performance and resource utilization metrics from deployed pods during benchmark execution. These metrics are integrated into the benchmark report and can be visualized as time series graphs.
```
Harness Pod (in-cluster)
 |
 |-- collect_metrics.sh start        # continuous Prometheus scraping (every 15s)
 |     |-- collect_replica_status()      # one-time: Deployment/StatefulSet replica counts
 |     |-- collect_pod_startup_times()   # one-time: pod creation-to-Ready durations
 |     \-- collect_metrics_snapshot()    # repeated: curl /metrics from each vLLM pod
 |
 |-- collect_metrics.sh stop         # sends SIGTERM to collector process
 |-- collect_metrics.sh process      # delegates to process_metrics.py for aggregation
 |
 \-- Results directory:
       metrics/
         raw/                        # per-pod, per-snapshot .log files
         processed/
           metrics_summary.json      # aggregated statistics per pod per metric
           replica_status.json       # desired vs ready vs available replicas
           pod_startup_times.json    # creation-to-Ready time per pod per node
         graphs/                     # time-series PNG plots (generated post-run)
```
- Collection (`collect_metrics.sh`): Discovers vLLM pods via label selectors and scrapes Prometheus `/metrics` endpoints every 15s via direct HTTP to pod IPs. Also collects one-time infrastructure snapshots (replica counts, startup times).
- Processing (`process_metrics.py`): Parses raw `.log` files and aggregates per-pod statistics (mean, stddev, min, max, p25/p50/p75/p90/p95/p99). Computes ratio metrics (e.g., prefix cache hit rate).
- Visualization (`visualize_metrics.py`): Generates time-series PNG graphs from raw metric files for each tracked metric.
- Report Integration (`metrics_processor.py`): Loads processed summaries and feeds them into the benchmark report under `results.observability`.
Metrics collection can be enabled via the CLI (`llmdbenchmark run --monitoring`) or via the scenario config. The CLI flag sets `metricsScrapeEnabled: true` at runtime, overriding the scenario default.
| Environment Variable | Default | Description |
|---|---|---|
| `LLMDBENCH_VLLM_COMMON_METRICS_SCRAPE_ENABLED` | `false` | Enable/disable metrics collection. Set automatically when `--monitoring` is passed. |
| `METRICS_COLLECTION_INTERVAL` | `15` | Seconds between collection snapshots |
| `LLMDBENCH_VLLM_COMMON_METRICS_PORT` | `8200` | Prometheus metrics port (modelservice) |
| `LLMDBENCH_VLLM_COMMON_INFERENCE_PORT` | `8000` | Fallback port (standalone vLLM) |
| `LLMDBENCH_VLLM_MONITORING_METRICS_PATH` | `/metrics` | Prometheus endpoint path |
| `METRICS_CURL_TIMEOUT` | `30` | Max seconds per curl request |
| `LLMDBENCH_METRICS_POD_PATTERN` | `decode` | Fallback pod name pattern for discovery |
Pods are discovered using label selectors in priority order:
1. `llm-d.ai/inferenceServing=true` (modelservice deployments)
2. `stood-up-via=standalone` (standalone deployments)
3. Pod name pattern matching (fallback)

Only pods with `status.phase=Running` are scraped.
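The discovery order above can be sketched as pure selection logic. This is illustrative only: the real script queries the cluster with label selectors, and the dict shape (`name`, `phase`, `labels`) is an assumption made to keep the sketch self-contained.

```python
def discover_pods(pods: list[dict], name_pattern: str = "decode") -> list[dict]:
    """Pick scrape targets from a pod list, trying selectors in priority order.

    Each pod is a dict with "name", "phase", and "labels" keys (illustrative shape).
    """
    running = [p for p in pods if p.get("phase") == "Running"]  # only Running pods are scraped
    selectors = [
        lambda p: p["labels"].get("llm-d.ai/inferenceServing") == "true",  # modelservice
        lambda p: p["labels"].get("stood-up-via") == "standalone",         # standalone
        lambda p: name_pattern in p["name"],                               # fallback pattern
    ]
    for match in selectors:
        found = [p for p in running if match(p)]
        if found:
            return found
    return []
```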
- vLLM Pod Prometheus Metrics - Cache, queue, memory, preemption, and NIXL KV transfer metrics scraped from vLLM pods
- EPP Prometheus Metrics - Pool-level gauges, scheduler histograms, token distributions, P/D decision counters scraped from EPP pods
- DCGM GPU Metrics - GPU utilization, framebuffer usage, and power consumption (requires DCGM exporter deployed on the cluster)
- Container Metrics (cAdvisor) - Memory, CPU, and network usage from kubelet
- Metrics Processing - Per-pod and cluster-wide aggregation with statistics (mean, stddev, min, max, p25/p50/p75/p90/p95/p99)
- Replica Status - Desired vs ready vs available replica counts per Deployment/StatefulSet, grouped by model and role
- Pod Startup Times - Creation-to-Ready duration per pod, per node, grouped by model and role
- EPP Log-Derived Metrics - Dispatch latency, endpoint scores, request distribution, plugin latencies, saturation config from EPP pod logs
- Time-Series Visualization - Static PNG graphs generated after benchmark completion
- Benchmark Report Integration - All metrics surfaced in the `results.observability` section
- RBAC Setup - Automatic ServiceAccount creation with required permissions
- Metrics Storage - Raw and processed metrics saved to results directory
- Real-time Visualization - Live metric streaming during benchmark execution
  - Currently generates static graphs after benchmark completion
- Custom Metric Queries - User-defined Prometheus queries
  - Currently collects a predefined set of metrics
The metrics collection system scrapes Prometheus endpoints from vLLM pods and EPP pods, collects Kubernetes infrastructure snapshots, and extracts scheduling metrics from EPP logs. All metrics are aggregated into per-pod statistics (mean, stddev, min, max, p25/p50/p75/p90/p95/p99).
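The per-pod aggregation step can be sketched with the standard library. This is a sketch under assumptions: the statistic names match the list above, but the percentile convention (nearest-rank shown here) is illustrative and may differ from what `process_metrics.py` actually uses.

```python
import statistics

def aggregate(samples: list[float]) -> dict:
    """Summarize one metric's snapshots for one pod into the reported statistics."""
    s = sorted(samples)

    def pct(p: float) -> float:
        # nearest-rank percentile on the sorted samples (one common convention)
        idx = min(len(s) - 1, max(0, round(p / 100 * (len(s) - 1))))
        return s[idx]

    return {
        "mean": statistics.mean(s),
        "stddev": statistics.pstdev(s),
        "min": s[0],
        "max": s[-1],
        **{f"p{p}": pct(p) for p in (25, 50, 75, 90, 95, 99)},
    }
```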
Scraped from each vLLM pod (default port 8200 for modelservice, 8000 for standalone):
- `vllm:kv_cache_usage_perc` - KV cache utilization (%)
- `vllm:gpu_cache_usage_perc` - GPU cache utilization (%)
- `vllm:cpu_cache_usage_perc` - CPU cache utilization (%)
- `vllm:prefix_cache_hits_total` - Prefix cache hits (tokens)
- `vllm:prefix_cache_queries_total` - Prefix cache queries (tokens)
- `vllm:external_prefix_cache_hits_total` - Cross-instance external cache hits (tokens)
- `vllm:external_prefix_cache_queries_total` - Cross-instance external cache queries (tokens)
- `vllm:prefix_cache_hit_rate` - Computed: `hits / queries` (%)
- `vllm:external_prefix_cache_hit_rate` - Computed: `external_hits / external_queries` (%)
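The two computed hit-rate metrics follow directly from the hit and query counters. A minimal sketch (the zero-query guard is an assumption about how division by zero is handled):

```python
def hit_rate(hits: float, queries: float) -> float:
    """Prefix cache hit rate as a percentage; 0.0 when no queries were made."""
    return 100.0 * hits / queries if queries > 0 else 0.0
```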
- `vllm:num_requests_running` - Requests currently in execution batches (count)
- `vllm:num_requests_waiting` - Requests waiting to be processed (count)
- `vllm:num_requests_swapped` - Requests swapped to CPU (count)
- `vllm:num_preemptions_total` - Cumulative request preemptions (count)
- `vllm:gpu_memory_usage_bytes` - GPU memory usage (bytes)
- `vllm:cpu_memory_usage_bytes` - CPU memory usage (bytes)
- `vllm:nixl_xfer_time_seconds_sum` - Cumulative NIXL transfer time (seconds)
- `vllm:nixl_xfer_time_seconds_count` - Number of NIXL transfers (count)
- `vllm:nixl_bytes_transferred_sum` - Cumulative bytes transferred via NIXL (bytes)
- `vllm:nixl_bytes_transferred_count` - Number of NIXL byte transfers (count)
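The raw snapshots are Prometheus text exposition, so parsing the simple (unlabeled) series amounts to splitting `name value` lines. This is a minimal sketch, not the actual `process_metrics.py` parser, which would also need to handle labeled and histogram series:

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse 'name value' lines, skipping comments and labeled series."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # skip labeled series in this sketch
        try:
            out[parts[0]] = float(parts[1])
        except ValueError:
            continue  # skip unparseable values (e.g. NaN handling omitted)
    return out
```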
Scraped from cluster monitoring infrastructure when available (requires DCGM exporter or kubelet cAdvisor):
- `DCGM_FI_DEV_FB_USED` - GPU framebuffer memory used (bytes)
- `DCGM_FI_DEV_GPU_UTIL` - GPU utilization (%)
- `DCGM_FI_DEV_POWER_USAGE` - GPU power consumption (watts)
- `container_memory_usage_bytes` - Container memory usage (converted to GB in output)
- `container_memory_working_set_bytes` - Container working set memory (converted to GB)
- `container_network_receive_bytes_total` - Network bytes received (converted to MB)
- `container_network_transmit_bytes_total` - Network bytes transmitted (converted to MB)
- `container_cpu_usage_seconds_total` - Container CPU time (seconds)
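The GB/MB conversions noted above are simple scalings. Whether the pipeline uses binary (1024-based) or decimal (1000-based) units is not stated; the binary convention is assumed in this sketch:

```python
def bytes_to_gb(n: float) -> float:
    """Bytes to gibibytes (1024-based; decimal GB would divide by 1e9)."""
    return n / (1024 ** 3)

def bytes_to_mb(n: float) -> float:
    """Bytes to mebibytes (1024-based; decimal MB would divide by 1e6)."""
    return n / (1024 ** 2)
```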
Scraped from EPP pods (default port 9090, bearer token auth):
- `inference_pool_average_kv_cache_utilization` - Pool-wide KV cache utilization (%)
- `inference_pool_average_queue_size` - Average request queue depth (count)
- `inference_pool_average_running_requests` - Average running requests (count)
- `inference_pool_ready_pods` - Ready pod count (count)
- `inference_extension_scheduler_e2e_duration_seconds` - End-to-end scheduling latency (bucket/sum/count)
- `inference_extension_plugin_duration_seconds` - Per-plugin processing time (bucket/sum/count)
- `inference_extension_request_duration_seconds` - Total request duration (bucket/sum/count)
- `inference_extension_request_ttft_duration_seconds` - Time to first token (bucket/sum/count)
- `inference_extension_input_tokens` - Input token count distribution (bucket)
- `inference_extension_output_tokens` - Output token count distribution (bucket)
- `inference_extension_normalized_time_per_output_token` - NTPOT distribution (bucket)
- `inference_extension_prefix_indexer_hit_ratio` - Prefix indexer hit ratio (bucket)
- `inference_extension_prefix_indexer_size` - Prefix indexer size (count)
- `llm_d_inference_scheduler_pd_decision_total` - P/D routing decisions (count)
- `llm_d_inference_scheduler_disagg_decision_total` - Disaggregation decisions (count)
Collected as one-time snapshots at the start of metrics collection:
Per Deployment/StatefulSet whose pod template carries the `llm-d.ai/inferenceServing=true` label:
- `desired_replicas` - `spec.replicas`
- `ready_replicas` - `status.readyReplicas`
- `available_replicas` - `status.availableReplicas`
- `updated_replicas` - `status.updatedReplicas`
- `model` - From `llm-d.ai/model` label
- `role` - From `llm-d.ai/role` label (decode/prefill/both)
Per Running vLLM pod:
- `startup_seconds` - `conditions[Ready].lastTransitionTime - metadata.creationTimestamp`
- `node` - `spec.nodeName`
- `model` - From `llm-d.ai/model` label
- `role` - From `llm-d.ai/role` label
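The `startup_seconds` computation above is a difference of two Kubernetes timestamps. A minimal sketch, assuming the usual RFC 3339 format (e.g. `2024-01-01T00:00:30Z`) that Kubernetes reports:

```python
from datetime import datetime

def startup_seconds(created: str, ready: str) -> float:
    """Seconds from pod creation to the Ready condition's lastTransitionTime."""
    def parse(ts: str) -> datetime:
        # fromisoformat() rejects the RFC 3339 'Z' suffix before Python 3.11
        return datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return (parse(ready) - parse(created)).total_seconds()
```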
Extracted from EPP pod structured JSON logs by `process_epp_logs.py`:

- Dispatch latency - End-to-end request routing latency (`assembled_time` to `picker_complete_time`), with mean/stddev/min/max/p50/p95/p99
- Endpoint scores - Per-endpoint scoring statistics
- Request distribution - Request count per picked endpoint
- Plugin latencies - Per filter and scorer plugin processing times (seconds)
- Saturation config - `queueDepthThreshold`, `kvCacheUtilThreshold`, `metricsStalenessThreshold` from EPP startup logs
- Error tracking - Error message counts by type
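Extracting the dispatch latency from structured logs can be sketched as follows. The field names `assembled_time` and `picker_complete_time` come from the description above; everything else about the log schema (numeric epoch timestamps, one JSON object per line) is an assumption for illustration:

```python
import json

def dispatch_latencies(log_lines: list[str]) -> list[float]:
    """Collect per-request routing latency (seconds) from JSON log lines."""
    latencies = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. plain-text startup output)
        if "assembled_time" in entry and "picker_complete_time" in entry:
            latencies.append(entry["picker_complete_time"] - entry["assembled_time"])
    return latencies
```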
| File | Purpose |
|---|---|
| `workload/harnesses/collect_metrics.sh` | Main collection driver (pod discovery, scraping, infrastructure snapshots) |
| `workload/harnesses/process_metrics.py` | Raw metric aggregation into `metrics_summary.json` |
| `workload/harnesses/process_epp_logs.py` | EPP log parsing into `epp_metrics_summary.json` |
| `llmdbenchmark/analysis/visualize_metrics.py` | Time-series PNG graph generation |
| `llmdbenchmark/analysis/benchmark_report/metrics_processor.py` | Benchmark report integration |