[Dashboard][Serve LLM] Add NIXL KV transfer metrics to Serve LLM Grafana dashboard #60819
kouroshHakha wants to merge 1 commit into ray-project:master
Conversation
Code Review
This pull request adds valuable new Grafana panels to the Serve LLM dashboard for monitoring NIXL KV cache transfers. The changes are well-described and the test plan is thorough. I've identified a few areas for improvement in the new panel definitions to enhance consistency and correctness. Specifically, I'm suggesting a change to the throughput calculation to align with Grafana best practices, and updates to the failure/expiration panels to improve observability by including model_name in the aggregation.
```python
id=41,
title="NIXL: Transfer Throughput",
description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
unit="GBs",
targets=[
    Target(
        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
        legend="Throughput - {{model_name}} - {{WorkerId}}",
    ),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(12, 64, 12, 8),
```
The current implementation for "NIXL: Transfer Throughput" has some inconsistencies: the expression calculates throughput in Gibibytes per second (GiB/s) using base-1024 division, while the description refers to GB/s (base-1000), and the unit GBs is non-standard in Grafana.
To align with Grafana best practices and improve clarity, I recommend removing the manual division from the expression and setting the unit to bytes/sec. Grafana will then automatically format the value with the appropriate SI prefix (e.g., KB/s, MB/s, GB/s), which is standard for data rates.
Suggested change:

```diff
 id=41,
 title="NIXL: Transfer Throughput",
-description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
-unit="GBs",
+description="NIXL KV cache transfer throughput (bytes transferred / transfer time).",
+unit="bytes/sec",
 targets=[
     Target(
-        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
+        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])',
         legend="Throughput - {{model_name}} - {{WorkerId}}",
     ),
 ],
 fill=1,
 linewidth=2,
 stack=False,
 grid_pos=GridPos(12, 64, 12, 8),
```
```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Failed Transfers - {{WorkerId}}",
```
For better observability, it would be helpful to see failed transfers broken down by model_name, especially when multiple models are served. The current query filters by model_name but aggregates failures across all selected models. Please consider adding model_name to the sum by clause and the legend.
Suggested change:

```diff
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="Failed Transfers - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="Failed Transfers - {{model_name}} - {{WorkerId}}",
```
```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="KV Expired - {{WorkerId}}",
```
Similar to the transfer failures panel, it would be beneficial to see expired requests per model_name for more granular monitoring. The current query aggregates these across all selected models. Please consider adding model_name to the sum by clause and the legend.
Suggested change:

```diff
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="KV Expired - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="KV Expired - {{model_name}} - {{WorkerId}}",
```
```python
unit="ms",
targets=[
    Target(
        expr='rate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n* 1000',
```
Missing aggregation in NIXL latency/throughput PromQL queries (Medium Severity)
The NIXL Transfer Latency (panel 40), Transfer Throughput (panel 41), and Avg Post Time (panel 43) panels divide rate() expressions without using sum by(model_name, WorkerId) aggregation. All other average calculations in this file (e.g., lines 55, 91, 163, 227) follow the pattern sum by(model_name, WorkerId) (rate(..._sum...)) / sum by(model_name, WorkerId) (rate(..._count...)). Without aggregation, if metrics have additional labels beyond model_name and WorkerId, Prometheus will perform element-wise division which may produce cluttered graphs or no data when label sets don't match exactly between numerator and denominator.
Additional Locations (2)
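Applied to the latency panel's expression, the aggregated form would look like the sketch below (the `{global_filters}` placeholder from the file is omitted here for readability):

```promql
sum by (model_name, WorkerId) (
  rate(ray_vllm_nixl_xfer_time_seconds_sum{model_name=~"$vllm_model_name", WorkerId=~"$workerid"}[$interval])
)
/
sum by (model_name, WorkerId) (
  rate(ray_vllm_nixl_xfer_time_seconds_count{model_name=~"$vllm_model_name", WorkerId=~"$workerid"}[$interval])
)
* 1000
```

Aggregating both sides to the same label set guarantees the numerator and denominator vectors match one-to-one, regardless of any extra labels on the underlying series.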
eicherseiji left a comment:
lgtm. Suggest replacing workerId with replicaId and including the model name in legends.
```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Failed Transfers - {{WorkerId}}",

expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="KV Expired - {{WorkerId}}",
```


Summary
New Panels
The new panels chart the following metrics:

- ray_vllm_nixl_xfer_time_seconds
- ray_vllm_nixl_bytes_transferred / ray_vllm_nixl_xfer_time_seconds
- ray_vllm_nixl_xfer_time_seconds_count
- ray_vllm_nixl_post_time_seconds
- ray_vllm_nixl_num_failed_transfers
- ray_vllm_nixl_num_kv_expired_reqs

These metrics are emitted by vLLM's NixlConnector and wrapped via RayPrometheusStatLogger -> RayKVConnectorPrometheus -> NixlPromMetrics. The failure/expiration panels only show data when errors occur (counters are lazily registered on first increment).

Screenshots
Test plan
Since the dashboard panels file is loaded at Ray startup and cannot be hot-reloaded on a running cluster, we used the following approach to validate the changes end-to-end:
1. Panel definition validation
Confirmed: 31 panels loaded (25 existing + 6 new NIXL), all IDs unique, GridPos layout correct.
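As an illustration, the uniqueness and layout checks in step 1 could look like the following sketch. The panel-loading path is not shown, the `grid_pos` tuple order (x, y, w, h) and the 24-column Grafana grid are assumptions, and the sample entries only mirror the GridPos values visible in this PR:

```python
# Sketch of the panel-definition checks: unique IDs and grid positions
# that fit Grafana's 24-column layout. `panels` stands in for the list
# loaded from the Serve LLM dashboard panels module.

def validate_panels(panels):
    ids = [p["id"] for p in panels]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate panel IDs")
    for p in panels:
        x, _, w, _ = p["grid_pos"]  # assumed (x, y, w, h) order
        if not (0 <= x and x + w <= 24):
            raise ValueError(f"panel {p['id']} exceeds the 24-column grid")
    return len(panels)

# Illustrative entries mirroring the new NIXL panels' positions.
panels = [
    {"id": 40, "grid_pos": (0, 64, 12, 8)},
    {"id": 41, "grid_pos": (12, 64, 12, 8)},
]
print(validate_panels(panels))  # 2
```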
2. Dashboard JSON generation
Since the installed `ray` package doesn't include the new panels yet, we generated the dashboard JSON by patching the module at import time.

3. Live Grafana validation

- Verified the panels against live metrics from a deployment exercising the NIXL transfer path (NixlConnector)
- Used the ClusterId template variable to scope metrics to the active cluster
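The import-time patching in step 2 can be sketched as follows. The module name, the `PANELS` attribute, and the panel dict shape are stand-ins, not Ray's actual API; in the real run the definitions live inside the installed `ray` package:

```python
import json
import sys
import types

# Stand-in for the dashboard panels module; the real Serve LLM dashboard
# module path inside `ray` is not reproduced here.
panels_mod = types.ModuleType("serve_llm_dashboard_panels")
panels_mod.PANELS = [{"id": 1, "title": "Existing panel"}]
sys.modules["serve_llm_dashboard_panels"] = panels_mod

# Patch at import time: extend the panel list before any code that renders
# the dashboard JSON imports and reads it.
import serve_llm_dashboard_panels as panels
panels.PANELS = panels.PANELS + [
    {"id": 41, "title": "NIXL: Transfer Throughput", "unit": "bytes/sec"},
]

# Render dashboard JSON from the patched definitions.
dashboard_json = json.dumps({"panels": panels.PANELS}, indent=2)
print(dashboard_json)
```

The key point is that the patch happens before the rendering code reads the module attribute, so no hot-reload of a running cluster is needed.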