[Dashboard][Serve LLM] Add NIXL KV transfer metrics to Serve LLM Grafana dashboard #60819
kouroshHakha wants to merge 1 commit into ray-project:master
Conversation
Code Review
This pull request adds valuable new Grafana panels to the Serve LLM dashboard for monitoring NIXL KV cache transfers. The changes are well-described and the test plan is thorough. I've identified a few areas for improvement in the new panel definitions to enhance consistency and correctness. Specifically, I'm suggesting a change to the throughput calculation to align with Grafana best practices, and updates to the failure/expiration panels to improve observability by including model_name in the aggregation.
```python
id=41,
title="NIXL: Transfer Throughput",
description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
unit="GBs",
targets=[
    Target(
        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
        legend="Throughput - {{model_name}} - {{WorkerId}}",
    ),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(12, 64, 12, 8),
```
The current implementation for "NIXL: Transfer Throughput" has some inconsistencies: the expression calculates throughput in Gibibytes per second (GiB/s) using base-1024 division, while the description refers to GB/s (base-1000), and the unit GBs is non-standard in Grafana.
To align with Grafana best practices and improve clarity, I recommend removing the manual division from the expression and setting the unit to bytes/sec. Grafana will then automatically format the value with the appropriate SI prefix (e.g., KB/s, MB/s, GB/s), which is standard for data rates.
Suggested change:

```diff
 id=41,
 title="NIXL: Transfer Throughput",
-description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
-unit="GBs",
+description="NIXL KV cache transfer throughput (bytes transferred / transfer time).",
+unit="bytes/sec",
 targets=[
     Target(
-        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
+        expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])',
         legend="Throughput - {{model_name}} - {{WorkerId}}",
     ),
 ],
 fill=1,
 linewidth=2,
 stack=False,
 grid_pos=GridPos(12, 64, 12, 8),
```
```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Failed Transfers - {{WorkerId}}",
```
For better observability, it would be helpful to see failed transfers broken down by model_name, especially when multiple models are served. The current query filters by model_name but aggregates failures across all selected models. Please consider adding model_name to the sum by clause and the legend.
Suggested change:

```diff
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="Failed Transfers - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="Failed Transfers - {{model_name}} - {{WorkerId}}",
```
```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="KV Expired - {{WorkerId}}",
```
Similar to the transfer failures panel, it would be beneficial to see expired requests per model_name for more granular monitoring. The current query aggregates these across all selected models. Please consider adding model_name to the sum by clause and the legend.
Suggested change:

```diff
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="KV Expired - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="KV Expired - {{model_name}} - {{WorkerId}}",
```
```python
unit="ms",
targets=[
    Target(
        expr='rate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n* 1000',
```
Missing aggregation in NIXL latency/throughput PromQL queries (Medium Severity)
The NIXL Transfer Latency (panel 40), Transfer Throughput (panel 41), and Avg Post Time (panel 43) panels divide rate() expressions without using sum by(model_name, WorkerId) aggregation. All other average calculations in this file (e.g., lines 55, 91, 163, 227) follow the pattern sum by(model_name, WorkerId) (rate(..._sum...)) / sum by(model_name, WorkerId) (rate(..._count...)). Without aggregation, if metrics have additional labels beyond model_name and WorkerId, Prometheus will perform element-wise division which may produce cluttered graphs or no data when label sets don't match exactly between numerator and denominator.
Additional Locations (2)
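Applied to the latency panel's expression, the aggregated form would look like the sketch below (the `{global_filters}` placeholder from the file is omitted here for readability):

```promql
sum by (model_name, WorkerId) (
  rate(ray_vllm_nixl_xfer_time_seconds_sum{model_name=~"$vllm_model_name", WorkerId=~"$workerid"}[$interval])
)
/
sum by (model_name, WorkerId) (
  rate(ray_vllm_nixl_xfer_time_seconds_count{model_name=~"$vllm_model_name", WorkerId=~"$workerid"}[$interval])
)
* 1000
```

Aggregating both sides to the same label set guarantees the numerator and denominator vectors match one-to-one, regardless of any extra labels on the underlying series.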
eicherseiji left a comment:
lgtm. Suggest replacing workerId with replicaId and including the model name in legends.
```python
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Failed Transfers - {{WorkerId}}",

expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="KV Expired - {{WorkerId}}",
```


Summary
New Panels
The new panels chart the following metrics:

- ray_vllm_nixl_xfer_time_seconds
- ray_vllm_nixl_bytes_transferred / ray_vllm_nixl_xfer_time_seconds
- ray_vllm_nixl_xfer_time_seconds_count
- ray_vllm_nixl_post_time_seconds
- ray_vllm_nixl_num_failed_transfers
- ray_vllm_nixl_num_kv_expired_reqs

These metrics are emitted by vLLM's NixlConnector and wrapped via RayPrometheusStatLogger -> RayKVConnectorPrometheus -> NixlPromMetrics. The failure/expiration panels only show data when errors occur (counters are lazily registered on first increment).

Screenshots
Test plan
Since the dashboard panels file is loaded at Ray startup and cannot be hot-reloaded on a running cluster, we used the following approach to validate the changes end-to-end:
1. Panel definition validation
Confirmed: 31 panels loaded (25 existing + 6 new NIXL), all IDs unique, GridPos layout correct.
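As an illustration, the uniqueness and layout checks in step 1 could look like the following sketch. The panel-loading path is not shown, the `grid_pos` tuple order (x, y, w, h) and the 24-column Grafana grid are assumptions, and the sample entries only mirror the GridPos values visible in this PR:

```python
# Sketch of the panel-definition checks: unique IDs and grid positions
# that fit Grafana's 24-column layout. `panels` stands in for the list
# loaded from the Serve LLM dashboard panels module.

def validate_panels(panels):
    ids = [p["id"] for p in panels]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate panel IDs")
    for p in panels:
        x, _, w, _ = p["grid_pos"]  # assumed (x, y, w, h) order
        if not (0 <= x and x + w <= 24):
            raise ValueError(f"panel {p['id']} exceeds the 24-column grid")
    return len(panels)

# Illustrative entries mirroring the new NIXL panels' positions.
panels = [
    {"id": 40, "grid_pos": (0, 64, 12, 8)},
    {"id": 41, "grid_pos": (12, 64, 12, 8)},
]
print(validate_panels(panels))  # 2
```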
2. Dashboard JSON generation
Since the installed `ray` package doesn't include the new panels yet, we generated the dashboard JSON by patching the module at import time.

3. Live Grafana validation

- Verified the panels against live metrics from a deployment exercising the NIXL transfer path (NixlConnector)
- Used the ClusterId template variable to scope metrics to the active cluster
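The import-time patching in step 2 can be sketched as follows. The module name, the `PANELS` attribute, and the panel dict shape are stand-ins, not Ray's actual API; in the real run the definitions live inside the installed `ray` package:

```python
import json
import sys
import types

# Stand-in for the dashboard panels module; the real Serve LLM dashboard
# module path inside `ray` is not reproduced here.
panels_mod = types.ModuleType("serve_llm_dashboard_panels")
panels_mod.PANELS = [{"id": 1, "title": "Existing panel"}]
sys.modules["serve_llm_dashboard_panels"] = panels_mod

# Patch at import time: extend the panel list before any code that renders
# the dashboard JSON imports and reads it.
import serve_llm_dashboard_panels as panels
panels.PANELS = panels.PANELS + [
    {"id": 41, "title": "NIXL: Transfer Throughput", "unit": "bytes/sec"},
]

# Render dashboard JSON from the patched definitions.
dashboard_json = json.dumps({"panels": panels.PANELS}, indent=2)
print(dashboard_json)
```

The key point is that the patch happens before the rendering code reads the module attribute, so no hot-reload of a running cluster is needed.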