[P/D][Serve LLM] Fix NIXL side channel host to use Ray node IP for cross-node P/D disaggregation by kouroshHakha · Pull Request #60817 · ray-project/ray

kouroshHakha · 2026-02-07T00:52:53Z

Summary

Fix NIXL side channel host resolution to use Ray's node IP instead of vLLM's get_ip().

In Kubernetes deployments with hostNetwork: true, vLLM's get_ip() (which does socket.connect(("8.8.8.8", 80))) returns the host machine's external/public IP (e.g., 134.199.206.88), not the internal cluster IP that Ray uses (e.g., 10.128.0.50). These external IPs are typically NAT'd and not routable between nodes within the cluster.
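The address-selection behavior described above can be reproduced with the standard library alone. This is a minimal sketch, not vLLM's code: `outbound_ip` and `is_private_address` are hypothetical helper names, and the probe target mirrors the `8.8.8.8:80` mentioned in the description. Connecting a UDP socket sends no packets; it only asks the kernel which local address it would route from, which is why the result depends on the host's default route rather than on the cluster network.

```python
import ipaddress
import socket
from typing import Optional


def outbound_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> Optional[str]:
    """Return the local address the kernel would use to reach probe_host.

    Hypothetical helper illustrating the UDP-connect trick; returns None
    when no route to the probe address exists.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            # connect() on a UDP socket only selects a route; nothing is sent.
            sock.connect((probe_host, probe_port))
            return sock.getsockname()[0]
        except OSError:
            return None


def is_private_address(ip: str) -> bool:
    """True for addresses in private ranges such as 10.0.0.0/8."""
    return ipaddress.ip_address(ip).is_private


# The two example addresses from the description:
print(is_private_address("10.128.0.50"))     # True  (internal cluster IP)
print(is_private_address("134.199.206.88"))  # False (external/public IP)
```

On a hostNetwork pod, comparing `outbound_ip()` against the address reported by `ray list nodes` is one quick way to check whether a node is affected.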

This causes cross-node NIXL handshakes to fail in P/D disaggregated deployments where Prefill and Decode replicas run on different worker nodes:

1. The Prefill engine sets its side channel host to the external IP and embeds it in the engine_id.
2. The Decode engine tries to connect to that external IP for the ZMQ handshake, and the connection fails or times out.
3. The handshake never completes, so the remote engine's block size is never registered.
4. When get_finished() processes the failed request, it hits a KeyError in block_size_ratio_from_engine_id(), crashing the EngineCore with EngineDeadError.
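The failure sequence above can be modeled with a toy registry to show why an incomplete handshake surfaces later as a KeyError. This is only an illustrative sketch: `ToyDecodeWorker`, its methods, and the ratio computation are hypothetical and are not vLLM's NIXL connector code, which may differ in detail.

```python
from typing import Dict


class ToyDecodeWorker:
    """Hypothetical stand-in for a decode-side connector."""

    def __init__(self, local_block_size: int) -> None:
        self.local_block_size = local_block_size
        # Populated only when a handshake with a remote engine succeeds.
        self.remote_block_size: Dict[str, int] = {}

    def handshake(self, engine_id: str, reachable: bool, block_size: int) -> bool:
        """Register the remote block size if the peer could be reached."""
        if not reachable:
            # Models the connection to an unroutable external IP timing out.
            return False
        self.remote_block_size[engine_id] = block_size
        return True

    def block_size_ratio(self, engine_id: str) -> float:
        """Look up the remote block size; raises KeyError if never registered."""
        return self.local_block_size / self.remote_block_size[engine_id]


worker = ToyDecodeWorker(local_block_size=16)
worker.handshake("prefill-0", reachable=False, block_size=16)
try:
    worker.block_size_ratio("prefill-0")
except KeyError as exc:
    print(f"lookup failed for engine {exc}")
```

In this model the error appears at lookup time, far from the network failure that caused it, which matches the indirect crash described in step 4.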

The fix replaces get_ip() with ray.util.get_node_ip_address(), which returns the same internal IP shown by ray list nodes — ensuring the NIXL side channel is reachable across nodes.
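The resolution logic of the fix can be sketched without Ray or vLLM installed. This is a simplified illustration under stated assumptions: `resolve_side_channel_host` is a hypothetical function, the resolver is injected as a callable so the example runs standalone, and a plain membership check stands in for the `vllm_envs.is_set(...)` call used in the actual change.

```python
import os
from typing import Callable, MutableMapping

ENV_KEY = "VLLM_NIXL_SIDE_CHANNEL_HOST"


def resolve_side_channel_host(
    get_node_ip: Callable[[], str],
    environ: MutableMapping[str, str] = os.environ,
) -> str:
    """Set the side channel host only if the operator has not set it.

    get_node_ip is an injected resolver (hypothetical parameter); in a Ray
    deployment it would be ray.util.get_node_ip_address.
    """
    if ENV_KEY not in environ:
        environ[ENV_KEY] = get_node_ip()
    return environ[ENV_KEY]


# Unset: the resolver's answer is used.
env = {}
print(resolve_side_channel_host(lambda: "10.128.0.50", env))  # 10.128.0.50

# Already set: the explicit value is kept and the resolver is not consulted.
env = {ENV_KEY: "192.168.1.5"}
print(resolve_side_channel_host(lambda: "10.128.0.50", env))  # 192.168.1.5
```

Keeping the explicit environment variable as the higher-priority source means operators can still override the host on unusual network setups.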

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

gemini-code-assist

Code Review

This pull request provides a crucial fix for a bug affecting cross-node communication in disaggregated Prefill/Decode deployments on Kubernetes. By replacing vLLM's IP detection with Ray's, it ensures the correct internal node IP is used for the NIXL side channel, resolving handshake failures. The change is well-explained and directly addresses the issue. I have included one minor suggestion to improve import consistency.

gemini-code-assist · 2026-02-07T00:54:04Z

python/ray/llm/_internal/serve/engines/vllm/kv_transfer/nixl.py

    def _set_side_channel_host(self):
        from vllm import envs as vllm_envs
-        from vllm.utils.network_utils import get_ip

        if not vllm_envs.is_set("VLLM_NIXL_SIDE_CHANNEL_HOST"):
-            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_ip()
+            # Use Ray's node IP (internal/cluster IP) instead of vLLM's
+            # get_ip() which can return external/public IPs on hostNetwork
+            # pods, causing cross-node NIXL handshakes to fail.
+            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = ray.util.get_node_ip_address()


For consistency with the local import pattern used for vllm dependencies in this file, it's preferable to import get_node_ip_address within this method. This also allows for the removal of the top-level import ray, keeping the module's namespace cleaner as it's only used here.

Suggested change

    def _set_side_channel_host(self):
        from vllm import envs as vllm_envs
+        from ray.util import get_node_ip_address

        if not vllm_envs.is_set("VLLM_NIXL_SIDE_CHANNEL_HOST"):
            # Use Ray's node IP (internal/cluster IP) instead of vLLM's
            # get_ip() which can return external/public IPs on hostNetwork
            # pods, causing cross-node NIXL handshakes to fail.
-            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = ray.util.get_node_ip_address()
+            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_node_ip_address()

wip

7cca37c

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

kouroshHakha requested a review from a team as a code owner February 7, 2026 00:52

gemini-code-assist bot reviewed Feb 7, 2026

eicherseiji approved these changes Feb 7, 2026

kouroshHakha added the "go" label (add ONLY when ready to merge, run all tests) Feb 7, 2026

kouroshHakha wants to merge 1 commit into ray-project:master from kouroshHakha:kh/fix-ip
