[P/D][Serve LLM] Fix NIXL side channel host to use Ray node IP for cross-node P/D disaggregation #60817
kouroshHakha wants to merge 1 commit into ray-project:master
Code Review
This pull request provides a crucial fix for a bug affecting cross-node communication in disaggregated Prefill/Decode deployments on Kubernetes. By replacing vLLM's IP detection with Ray's, it ensures the correct internal node IP is used for the NIXL side channel, resolving handshake failures. The change is well-explained and directly addresses the issue. I have included one minor suggestion to improve import consistency.
     def _set_side_channel_host(self):
         from vllm import envs as vllm_envs
    -    from vllm.utils.network_utils import get_ip

         if not vllm_envs.is_set("VLLM_NIXL_SIDE_CHANNEL_HOST"):
    -        os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_ip()
    +        # Use Ray's node IP (internal/cluster IP) instead of vLLM's
    +        # get_ip() which can return external/public IPs on hostNetwork
    +        # pods, causing cross-node NIXL handshakes to fail.
    +        os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = ray.util.get_node_ip_address()
For consistency with the local import pattern used for vllm dependencies in this file, it's preferable to import get_node_ip_address within this method. This also allows for the removal of the top-level import ray, keeping the module's namespace cleaner as it's only used here.
    def _set_side_channel_host(self):
        from vllm import envs as vllm_envs
        from ray.util import get_node_ip_address

        if not vllm_envs.is_set("VLLM_NIXL_SIDE_CHANNEL_HOST"):
            # Use Ray's node IP (internal/cluster IP) instead of vLLM's
            # get_ip() which can return external/public IPs on hostNetwork
            # pods, causing cross-node NIXL handshakes to fail.
            os.environ["VLLM_NIXL_SIDE_CHANNEL_HOST"] = get_node_ip_address()
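
The guard in this method only fills in VLLM_NIXL_SIDE_CHANNEL_HOST when the operator has not already set it. A minimal sketch of that behavior follows, written so it runs without vLLM or Ray installed. The function name set_side_channel_host and the injected resolve_ip callable are hypothetical, and a plain os.environ membership check stands in for vllm_envs.is_set, which is an assumption about what that helper does.

```python
import os
from typing import Callable

ENV_VAR = "VLLM_NIXL_SIDE_CHANNEL_HOST"


def set_side_channel_host(resolve_ip: Callable[[], str]) -> str:
    """Set ENV_VAR from resolve_ip() only if it is not already present.

    resolve_ip is a hypothetical stand-in for the IP lookup; in the PR
    that role is played by ray.util.get_node_ip_address.
    """
    # An explicit value from the operator always wins over auto-detection.
    if ENV_VAR not in os.environ:
        os.environ[ENV_VAR] = resolve_ip()
    return os.environ[ENV_VAR]
```

Injecting the resolver keeps the override semantics easy to check in isolation: a second call with a different resolver leaves the first value in place.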
Summary
Fix NIXL side channel host resolution to use Ray's node IP instead of vLLM's get_ip().

In Kubernetes deployments with hostNetwork: true, vLLM's get_ip() (which does socket.connect(("8.8.8.8", 80))) returns the host machine's external/public IP (e.g., 134.199.206.88), not the internal cluster IP that Ray uses (e.g., 10.128.0.50). These external IPs are typically NAT'd and not routable between nodes within the cluster.

This causes cross-node NIXL handshakes to fail in P/D disaggregated deployments where Prefill and Decode replicas run on different worker nodes:

When get_finished() processes the failed request, it hits a KeyError in block_size_ratio_from_engine_id(), crashing the EngineCore with EngineDeadError.

The fix replaces get_ip() with ray.util.get_node_ip_address(), which returns the same internal IP shown by ray list nodes, ensuring the NIXL side channel is reachable across nodes.
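
To illustrate why a connect-based lookup can pick the wrong interface, here is a minimal sketch of the technique the summary attributes to get_ip(). It assumes a UDP socket, and the function name outbound_ip and its parameters are hypothetical; this is not vLLM's actual implementation. The lookup reports whichever local address the kernel's routing table selects for reaching the probe target.

```python
import socket


def outbound_ip(probe_host: str = "8.8.8.8", probe_port: int = 80) -> str:
    """Return the local address the OS would use to reach probe_host.

    Hypothetical helper: connect() on a UDP socket sends no packets; it
    only asks the kernel to choose a route and bind a source address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((probe_host, probe_port))
        # getsockname() exposes the source address chosen for that route.
        return sock.getsockname()[0]
```

With the default probe target, the result follows the default route, so on a hostNetwork pod it can be the host's external address rather than the cluster-internal one. ray.util.get_node_ip_address() avoids that mismatch by reporting the address Ray itself registered for the node.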