Please join SIG-Observability to contribute to monitoring and observability topics within llm-d.
- If running on Google Kubernetes Engine (GKE):
  - Refer to the Google Cloud Managed Prometheus documentation for general guidance on collecting metrics.
  - Enable automatic application monitoring, which will automatically collect vLLM metrics.
  - GKE provides an out-of-the-box inference gateway dashboard.
- If running on OpenShift, User Workload Monitoring provides an accessible Prometheus Stack for scraping metrics. See the OpenShift documentation to enable this feature.
- In other Kubernetes environments, Prometheus custom resources must be available in the cluster. To install a simple Prometheus and Grafana stack, refer to prometheus-grafana-stack.md.
All llm-d guides have monitoring enabled by default, supporting multiple monitoring stacks depending on the environment. We provide out-of-the-box monitoring configurations for scraping Endpoint Picker (EPP) metrics and vLLM metrics.
See the vLLM Metrics and EPP Metrics sections below for how to further configure or disable monitoring.
vLLM metrics collection is enabled by default with:

```yaml
# In your ms-*/values.yaml files
decode:
  monitoring:
    podmonitor:
      enabled: true
prefill:
  monitoring:
    podmonitor:
      enabled: true
```

Upon installation, view prefill and/or decode podmonitors with:

```bash
kubectl get podmonitors -n my-llm-d-namespace
```

The vLLM metrics from prefill and decode pods will be visible in the Prometheus and/or Grafana user interface.
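As a quick sanity check that scraping is working, queries like the following can be run in the Prometheus or Grafana UI. These assume the standard `vllm:`-prefixed metric names exposed by vLLM; verify the exact names against your vLLM version's `/metrics` endpoint.

```promql
# Requests currently being served, broken down by pod
sum by (pod) (vllm:num_requests_running)

# Generated-token throughput across all pods over the last 5 minutes
sum(rate(vllm:generation_tokens_total[5m]))
```

If these return no data, check that the podmonitors exist and that your Prometheus instance is configured to select them.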
EPP provides additional metrics for request routing, scheduling latency, and plugin performance. EPP metrics collection is enabled by default with:
- For self-installed Prometheus:

  ```yaml
  # In your gaie-*/values.yaml files
  inferenceExtension:
    monitoring:
      prometheus:
        enabled: true
  ```

  Upon installation, view EPP servicemonitors with:

  ```bash
  kubectl get servicemonitors -n my-llm-d-namespace
  ```

- For GKE managed Prometheus:

  ```yaml
  # In your gaie-*/values.yaml files
  inferenceExtension:
    monitoring:
      gke:
        enabled: true
  ```
EPP metrics include request rates, error rates, scheduling latency, and plugin processing times, providing insights into the inference routing and scheduling performance.
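As a rough sketch of how these EPP metrics can be queried, the examples below assume metric names published by the Gateway API Inference Extension (`inference_model_request_total` and `inference_model_request_error_total`); confirm the exact names against your EPP version's `/metrics` endpoint before relying on them.

```promql
# Per-model request rate over the last 5 minutes (assumed metric name)
sum by (model_name) (rate(inference_model_request_total[5m]))

# Fraction of requests that errored (assumed metric names)
sum(rate(inference_model_request_error_total[5m]))
  / sum(rate(inference_model_request_total[5m]))
```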
Grafana dashboard raw JSON files can be imported manually into a Grafana UI. Here is a current list of community dashboards:
- llm-d dashboard
- vLLM metrics
- inference-gateway dashboard v1.0.1
- EPP metrics
- GKE managed inference gateway dashboard
For specific PromQL queries to monitor llm-d deployments, see:
- Example PromQL Queries - Ready-to-use queries for monitoring vLLM, EPP, and prefix caching metrics
To populate metrics (especially error metrics) for testing and monitoring validation:
- Load Generation Script - Sends both valid and malformed requests to generate metrics
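For illustration, a minimal sketch of such a load generator is shown below, assuming an OpenAI-compatible `/v1/completions` endpoint exposed by the gateway. The `BASE_URL` and `MODEL` values are placeholders for your deployment; malformed requests here simply omit the required `model` field, which should surface in the error metrics. The linked Load Generation Script is the authoritative version.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder: your gateway address
MODEL = "my-model"                  # placeholder: a model served by your deployment

def build_payload(valid: bool) -> dict:
    """Build an OpenAI-style completions payload; malformed ones omit `model`."""
    payload = {"prompt": "Hello", "max_tokens": 8}
    if valid:
        payload["model"] = MODEL
    return payload

def send(payload: dict) -> int:
    """POST a payload and return the HTTP status code (4xx/5xx feed error metrics)."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code

if __name__ == "__main__":
    # Alternate valid and malformed requests to populate both success and error metrics
    for i in range(10):
        print(send(build_payload(valid=(i % 2 == 0))))
```

Running this for a few minutes gives the dashboards and PromQL queries above something to display, including non-2xx responses.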