@@ -329,6 +329,102 @@
stack=False,
grid_pos=GridPos(12, 56, 12, 8),
),
Panel(
id=40,
title="NIXL: Transfer Latency",
description="Average NIXL KV cache transfer latency in milliseconds.",
unit="ms",
targets=[
Target(
expr='rate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n* 1000',

Missing aggregation in NIXL latency/throughput PromQL queries

Medium Severity

The NIXL Transfer Latency (panel 40), Transfer Throughput (panel 41), and Avg Post Time (panel 43) panels divide rate() expressions without sum by(model_name, WorkerId) aggregation. All other average calculations in this file (e.g., lines 55, 91, 163, 227) follow the pattern sum by(model_name, WorkerId) (rate(..._sum...)) / sum by(model_name, WorkerId) (rate(..._count...)). Without aggregation, if the metrics carry labels beyond model_name and WorkerId, Prometheus divides series one-to-one on their full label sets. That can clutter the graph with one line per label combination, or return no data when the label sets of numerator and denominator do not match exactly.
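
A minimal sketch of the aggregated pattern described above, written as a plain string builder. The helper name `avg_ms_expr` and the `SELECTOR` constant are hypothetical and not part of this PR, and the sketch assumes the doubled braces in the panel expressions exist because the string is later rendered with `str.format(global_filters=...)`.

```python
# Hypothetical helper, not part of the PR: it only builds the PromQL template string.
# Assumption: the dashboard code later fills {global_filters} via str.format.
SELECTOR = '{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}'


def avg_ms_expr(metric: str) -> str:
    """Build sum(rate(<metric>_sum)) / sum(rate(<metric>_count)) * 1000 as a template."""
    by = "sum by (model_name, WorkerId)"
    numerator = f"{by} (rate({metric}_sum{SELECTOR}[$interval]))"
    denominator = f"{by} (rate({metric}_count{SELECTOR}[$interval]))"
    # Both sides are aggregated to the same label set, so the division matches cleanly.
    return f"{numerator}\n/\n{denominator}\n* 1000"


latency_expr = avg_ms_expr("ray_vllm_nixl_xfer_time_seconds")
post_time_expr = avg_ms_expr("ray_vllm_nixl_post_time_seconds")

# Rendering with a made-up filter shows the braces collapsing to a normal selector.
rendered = latency_expr.format(global_filters='SessionName="example"')
```

Because both operands end up with exactly the labels `model_name` and `WorkerId`, the result has one series per model and worker, which matches the existing legend format.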


legend="Avg Latency - {{model_name}} - {{WorkerId}}",
),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(0, 64, 12, 8),
),
Panel(
id=41,
title="NIXL: Transfer Throughput",
description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
unit="GBs",
targets=[
Target(
expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
legend="Throughput - {{model_name}} - {{WorkerId}}",
),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(12, 64, 12, 8),
Comment on lines +349 to +362

medium

The current implementation for "NIXL: Transfer Throughput" has some inconsistencies: the expression calculates throughput in Gibibytes per second (GiB/s) using base-1024 division, while the description refers to GB/s (base-1000), and the unit GBs is non-standard in Grafana.

To align with Grafana best practices and improve clarity, I recommend removing the manual division from the expression and setting the unit to bytes/sec. Grafana will then automatically format the value with the appropriate SI prefix (e.g., KB/s, MB/s, GB/s), which is standard for data rates.
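
The size of the base-1024 versus base-1000 mismatch can be checked with a little arithmetic. This is only an illustration; the 5 GiB/s rate is a made-up value, not a measurement from this PR.

```python
# Hypothetical transfer rate used only for illustration: exactly 5 GiB per second.
bytes_per_second = 5 * 1024**3

# What the "/ 1024 / 1024 / 1024" expression yields (gibibytes per second).
gib_per_s = bytes_per_second / 1024 / 1024 / 1024

# What a "GB/s" label implies (gigabytes per second, base 1000).
gb_per_s = bytes_per_second / 1000**3

# Ratio between the two conventions: 1024**3 / 1000**3, about a 7.4% difference.
ratio = gb_per_s / gib_per_s
```

The same rate reads as 5.0 under one convention and roughly 5.37 under the other, which is why the expression, description, and unit should agree.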

Suggested change
-id=41,
-title="NIXL: Transfer Throughput",
-description="NIXL KV cache transfer throughput in GB/s (bytes transferred / transfer time).",
-unit="GBs",
-targets=[
-Target(
-expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/ 1024 / 1024 / 1024',
-legend="Throughput - {{model_name}} - {{WorkerId}}",
-),
-],
-fill=1,
-linewidth=2,
-stack=False,
-grid_pos=GridPos(12, 64, 12, 8),
+id=41,
+title="NIXL: Transfer Throughput",
+description="NIXL KV cache transfer throughput (bytes transferred / transfer time).",
+unit="bytes/sec",
+targets=[
+Target(
+expr='rate(ray_vllm_nixl_bytes_transferred_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_xfer_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])',
+legend="Throughput - {{model_name}} - {{WorkerId}}",
+),
+],
+fill=1,
+linewidth=2,
+stack=False,
+grid_pos=GridPos(12, 64, 12, 8),

),
Panel(
id=42,
title="NIXL: Transfer Rate",
description="Number of NIXL KV cache transfers per second.",
unit="ops",
targets=[
Target(
expr='sum by (model_name, WorkerId) (rate(ray_vllm_nixl_xfer_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Transfers/s - {{model_name}} - {{WorkerId}}",
),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(0, 72, 12, 8),
),
Panel(
id=43,
title="NIXL: Avg Post Time",
description="Average time to post/initiate a NIXL transfer in milliseconds.",
unit="ms",
targets=[
Target(
expr='rate(ray_vllm_nixl_post_time_seconds_sum{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n/\nrate(ray_vllm_nixl_post_time_seconds_count{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval])\n* 1000',
legend="Avg Post Time - {{model_name}} - {{WorkerId}}",
),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(12, 72, 12, 8),
),
Panel(
id=44,
title="NIXL: KV Transfer Failures",
description="Number of failed NIXL KV cache transfers. Any non-zero value is concerning and indicates RDMA transfer errors.",
unit="short",
targets=[
Target(
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="Failed Transfers - {{WorkerId}}",
Comment on lines +403 to +404

medium

For better observability, it would be helpful to see failed transfers broken down by model_name, especially when multiple models are served. The current query filters by model_name but aggregates failures across all selected models. Please consider adding model_name to the sum by clause and the legend.
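
To illustrate what the extra label changes, here is a small stand-in for PromQL's `sum by (...)` over made-up samples. The label values, the third `replica` label, and the `sum_by` helper are all hypothetical; only the grouping behaviour is the point.

```python
from collections import defaultdict

# Made-up samples: (model_name, WorkerId, replica) -> failed transfers in the window.
samples = {
    ("model-a", "w1", "r0"): 3,
    ("model-b", "w1", "r0"): 0,
    ("model-a", "w2", "r0"): 1,
}


def sum_by(data, keep):
    """Mimic `sum by (...)`: group on the label positions in `keep` and add the values."""
    out = defaultdict(int)
    for labels, value in data.items():
        out[tuple(labels[i] for i in keep)] += value
    return dict(out)


# Grouping on WorkerId alone merges both models running on worker w1.
by_worker = sum_by(samples, keep=(1,))

# Grouping on (model_name, WorkerId) keeps one series per model and worker.
by_model_worker = sum_by(samples, keep=(0, 1))
```

With `WorkerId` alone, the three failures on `w1` cannot be attributed to a model; adding `model_name` shows they all belong to `model-a`.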

Suggested change
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="Failed Transfers - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_failed_transfers{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="Failed Transfers - {{model_name}} - {{WorkerId}}",


+1 for the legend

),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(0, 80, 12, 8),
),
Panel(
id=45,
title="NIXL: KV Expired Requests",
description="Number of requests whose KV blocks expired before decode consumed them. Spikes indicate prefill is outrunning decode or the timeout is too short.",
unit="short",
targets=[
Target(
expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
legend="KV Expired - {{WorkerId}}",
Comment on lines +419 to +420

medium

Similar to the transfer failures panel, it would be beneficial to see expired requests per model_name for more granular monitoring. The current query aggregates these across all selected models. Please consider adding model_name to the sum by clause and the legend.

Suggested change
-expr='sum by (WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
-legend="KV Expired - {{WorkerId}}",
+expr='sum by (model_name, WorkerId) (increase(ray_vllm_nixl_num_kv_expired_reqs{{model_name=~"$vllm_model_name", WorkerId=~"$workerid", {global_filters}}}[$interval]))',
+legend="KV Expired - {{model_name}} - {{WorkerId}}",


+1 for the legend

),
],
fill=1,
linewidth=2,
stack=False,
grid_pos=GridPos(12, 80, 12, 8),
),
Panel(
id=14,
title="Tokens Last 24 Hours",
@@ -347,7 +443,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(0, 64, 12, 8),
+grid_pos=GridPos(0, 88, 12, 8),
template=PanelTemplate.STAT,
),
Panel(
@@ -368,7 +464,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(12, 64, 12, 8),
+grid_pos=GridPos(12, 88, 12, 8),
template=PanelTemplate.STAT,
),
Panel(
@@ -385,7 +481,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(0, 72, 12, 8),
+grid_pos=GridPos(0, 96, 12, 8),
template=PanelTemplate.STAT,
),
Panel(
@@ -402,7 +498,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(12, 72, 12, 8),
+grid_pos=GridPos(12, 96, 12, 8),
template=PanelTemplate.PIE_CHART,
),
Panel(
@@ -419,7 +515,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(0, 80, 12, 8),
+grid_pos=GridPos(0, 104, 12, 8),
template=PanelTemplate.STAT,
),
Panel(
@@ -436,7 +532,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(12, 80, 12, 8),
+grid_pos=GridPos(12, 104, 12, 8),
template=PanelTemplate.STAT,
),
Panel(
@@ -453,7 +549,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(0, 88, 12, 8),
+grid_pos=GridPos(0, 112, 12, 8),
template=PanelTemplate.GAUGE,
),
Panel(
@@ -470,7 +566,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(12, 88, 12, 8),
+grid_pos=GridPos(12, 112, 12, 8),
template=PanelTemplate.GAUGE,
),
Panel(
@@ -491,7 +587,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(0, 96, 12, 8),
+grid_pos=GridPos(0, 120, 12, 8),
template=PanelTemplate.GAUGE,
),
Panel(
@@ -508,7 +604,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(12, 96, 12, 8),
+grid_pos=GridPos(12, 120, 12, 8),
template=PanelTemplate.GAUGE,
),
Panel(
@@ -529,7 +625,7 @@
fill=1,
linewidth=2,
stack=False,
-grid_pos=GridPos(12, 104, 12, 8),
+grid_pos=GridPos(12, 128, 12, 8),
template=PanelTemplate.GAUGE,
),
]