[P/D][Serve LLM] Fix NIXL side channel host to use Ray node IP for cross-node P/D disaggregation #60817

Open
kouroshHakha wants to merge 1 commit into ray-project:master from kouroshHakha:kh/fix-ip

@kouroshHakha
Contributor

Summary

Fix NIXL side channel host resolution to use Ray's node IP instead of vLLM's get_ip().

In Kubernetes deployments with hostNetwork: true, vLLM's get_ip() (which does socket.connect(("8.8.8.8", 80))) returns the host machine's external/public IP (e.g., 134.199.206.88), not the internal cluster IP that Ray uses (e.g., 10.128.0.50). These external IPs are typically NAT'd and not routable between nodes within the cluster.
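The route-probe technique mentioned above can be sketched as follows. This is a minimal illustration, not vLLM's code: `outbound_source_ip` is a hypothetical name, and vLLM's actual `get_ip()` may handle additional cases. Connecting a UDP socket sends no packets; it only asks the kernel to choose a route, so the address reported afterwards is whichever interface carries traffic toward the probe address. On a `hostNetwork: true` pod that is the host's outward-facing interface.

```python
import socket
from typing import Optional


def outbound_source_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> Optional[str]:
    """Return the local address the kernel would use to reach probe_host.

    Hypothetical helper for illustration only. A UDP connect() transmits
    nothing; it just binds the socket to the route's source address, which
    getsockname() then reports.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No route to the probe address (for example, an isolated sandbox).
        return None
    finally:
        sock.close()
```

The result depends entirely on the routing table of the machine it runs on, which is why it can differ from the address Ray registered for the node.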

This causes cross-node NIXL handshakes to fail in P/D disaggregated deployments where Prefill and Decode replicas run on different worker nodes:

  1. Prefill engine sets its side channel host to the external IP and embeds it in the engine_id
  2. Decode engine tries to connect to that external IP for the ZMQ handshake — connection fails/times out
  3. The handshake never completes, so the remote engine's block size is never registered
  4. When get_finished() processes the failed request, it hits a KeyError in block_size_ratio_from_engine_id(), crashing the EngineCore with EngineDeadError
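The chain of steps above can be reproduced with a toy model. Everything here is an assumption made for illustration: `ToyConnector`, `handshake`, and `lookup_block_size` are hypothetical names and do not mirror vLLM's NIXL connector internals. The sketch only shows why a handshake that never completes surfaces later as a `KeyError` on lookup.

```python
class ToyConnector:
    """Toy stand-in for a decode-side connector; not vLLM's implementation."""

    def __init__(self, reachable_hosts):
        # Hosts this node can actually route to (hypothetical).
        self.reachable_hosts = set(reachable_hosts)
        # engine_id -> block size, populated only by a successful handshake.
        self.remote_block_size = {}

    def handshake(self, engine_id, host, block_size):
        # Step 2: the handshake fails if the advertised host is not routable.
        if host not in self.reachable_hosts:
            return False
        # Step 3: registration happens only when the handshake succeeds.
        self.remote_block_size[engine_id] = block_size
        return True

    def lookup_block_size(self, engine_id):
        # Step 4: an engine that never registered raises KeyError here.
        return self.remote_block_size[engine_id]


decode = ToyConnector(reachable_hosts={"10.128.0.50"})
ok = decode.handshake("prefill-0", "134.199.206.88", 16)  # external IP: unreachable
try:
    decode.lookup_block_size("prefill-0")
except KeyError as exc:
    print(f"handshake ok={ok}, lookup failed with KeyError: {exc}")
```

Advertising the internal address instead (`10.128.0.50` in this toy setup) lets the handshake register the block size, and the later lookup succeeds.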

The fix replaces get_ip() with ray.util.get_node_ip_address(), which returns the same internal IP shown by ray list nodes — ensuring the NIXL side channel is reachable across nodes.
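The precedence this fix relies on (an explicitly set `VLLM_NIXL_SIDE_CHANNEL_HOST` wins, otherwise the node IP is used) can be sketched without Ray or vLLM installed. This is a simplified sketch under stated assumptions: `set_side_channel_host` and its parameters are hypothetical, and a plain membership check stands in for the `vllm_envs.is_set(...)` call used in the PR.

```python
import os
from typing import Callable, MutableMapping

ENV_VAR = "VLLM_NIXL_SIDE_CHANNEL_HOST"


def set_side_channel_host(
    resolve_node_ip: Callable[[], str],
    environ: MutableMapping[str, str] = os.environ,
) -> str:
    """Set ENV_VAR from resolve_node_ip() unless the user already set it.

    Hypothetical helper: the resolver is injected so the logic can be
    exercised without a running Ray cluster.
    """
    if ENV_VAR not in environ:
        environ[ENV_VAR] = resolve_node_ip()
    return environ[ENV_VAR]


# Stand-in resolver returning the internal IP from the example above.
env = {}
print(set_side_channel_host(lambda: "10.128.0.50", env))
```

Inside a Ray worker, the resolver passed in would be `ray.util.get_node_ip_address`, which is the call the PR uses.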

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha requested a review from a team as a code owner February 7, 2026 00:52
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a crucial fix for a bug affecting cross-node communication in disaggregated Prefill/Decode deployments on Kubernetes. By replacing vLLM's IP detection with Ray's, it ensures the correct internal node IP is used for the NIXL side channel, resolving handshake failures. The change is well-explained and directly addresses the issue. I have included one minor suggestion to improve import consistency.

Comment on lines 24 to +31
     def _set_side_channel_host(self):
         from vllm import envs as vllm_envs
-        from vllm.utils.network_utils import get_ip

         if not vllm_envs.is_set("VLLM_NIXL_SIDE_CHANNEL_HOST"):
-            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_ip()
+            # Use Ray's node IP (internal/cluster IP) instead of vLLM's
+            # get_ip() which can return external/public IPs on hostNetwork
+            # pods, causing cross-node NIXL handshakes to fail.
+            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = ray.util.get_node_ip_address()
medium

For consistency with the local import pattern used for vllm dependencies in this file, it's preferable to import get_node_ip_address within this method. This also allows for the removal of the top-level import ray, keeping the module's namespace cleaner as it's only used here.

Suggested change
    def _set_side_channel_host(self):
        from vllm import envs as vllm_envs
        from ray.util import get_node_ip_address

        if not vllm_envs.is_set("VLLM_NIXL_SIDE_CHANNEL_HOST"):
            # Use Ray's node IP (internal/cluster IP) instead of vLLM's
            # get_ip() which can return external/public IPs on hostNetwork
            # pods, causing cross-node NIXL handshakes to fail.
            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_node_ip_address()

@kouroshHakha kouroshHakha added the go label (add ONLY when ready to merge, run all tests) Feb 7, 2026