vllm: set VLLM_NIXL_SIDE_CHANNEL_HOST to node's routable IP#158

Open
esmeetu wants to merge 1 commit into NVIDIA:main from esmeetu:nixl-side-channel-host

esmeetu commented May 15, 2026

Summary

vLLM defaults VLLM_NIXL_SIDE_CHANNEL_HOST to 0.0.0.0 / localhost, which breaks the NIXL side-channel handshake when prefill and decode workers live on different nodes: the advertised host is unreachable from the peer node, so KV-transfer setup fails.

Set the per-process host to the node's routable IP (the same resolution build_worker_command already uses for the distributed-init leader) whenever a process has a nixl_port allocated.
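As a rough illustration of the behavior described above, here is a minimal sketch rather than the actual srtctl code. The `Process` dataclass, the `build_nixl_env` name, and the injected `resolve_ip` callable are hypothetical stand-ins (the real change lives in get_process_environment() and calls get_hostname_ip), and setting VLLM_NIXL_SIDE_CHANNEL_PORT alongside the host is an assumption about what is already stamped when a port is allocated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Process:
    # Hypothetical stand-in for srtctl's process object; only the two
    # attributes named in this PR are modeled.
    node: str
    nixl_port: Optional[int] = None


def build_nixl_env(process: Process, resolve_ip: Callable[[str], str]) -> Dict[str, str]:
    """Return NIXL side-channel env vars for one worker process.

    resolve_ip plays the role of get_hostname_ip: it maps a node name to
    an IP address that other nodes can reach.
    """
    env: Dict[str, str] = {}
    if process.nixl_port is not None:
        # Assumption: the port variable is already set in this branch.
        env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(process.nixl_port)
        # The behavior this PR adds: advertise a routable address
        # instead of relying on vLLM's default host.
        env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = resolve_ip(process.node)
    return env
```

A process without a nixl_port gets neither variable, so single-node or aggregated runs keep vLLM's defaults.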

Background

This fix originally landed in #11 ("Add Kimi-K2.5 vLLM recipes and fix NIXL side channel host"), but PR #11 was merged into the sa-submission-q2-2026 branch and never forward-ported to main. Multi-node vLLM disaggregated runs against main still hit the handshake failure today.

Changes

  • src/srtctl/backends/vllm.py — when process.nixl_port is not None, also stamp VLLM_NIXL_SIDE_CHANNEL_HOST = get_hostname_ip(process.node) in get_process_environment().
  • tests/test_configs.py — extend test_vllm_get_process_environment to assert the new env var (patches get_hostname_ip) and add the negative assertion to test_vllm_get_process_environment_none_ports.

make check is clean (718 passed, 2 skipped).
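The test changes listed above could look roughly like the following sketch. It is not the content of tests/test_configs.py: `VLLMBackendSketch` is a hypothetical, heavily reduced stand-in for the real backend, and the test names and the address 10.0.0.7 are invented for illustration. It only shows the pattern of patching the IP resolver and asserting both the positive and the negative case.

```python
from unittest import mock


class VLLMBackendSketch:
    """Hypothetical, reduced stand-in for the vLLM backend class."""

    @staticmethod
    def get_hostname_ip(node):
        # Placeholder resolver; tests replace it with a mock.
        raise RuntimeError("resolver should be patched in tests")

    def get_process_environment(self, node, nixl_port):
        env = {}
        if nixl_port is not None:
            env["VLLM_NIXL_SIDE_CHANNEL_HOST"] = self.get_hostname_ip(node)
        return env


def test_host_set_when_port_allocated():
    with mock.patch.object(
        VLLMBackendSketch, "get_hostname_ip", return_value="10.0.0.7"
    ) as resolver:
        env = VLLMBackendSketch().get_process_environment("node-a", 5600)
    assert env["VLLM_NIXL_SIDE_CHANNEL_HOST"] == "10.0.0.7"
    resolver.assert_called_once_with("node-a")


def test_host_absent_without_port():
    with mock.patch.object(VLLMBackendSketch, "get_hostname_ip") as resolver:
        env = VLLMBackendSketch().get_process_environment("node-a", None)
    # Negative assertion: no port means no host override and no lookup.
    assert "VLLM_NIXL_SIDE_CHANNEL_HOST" not in env
    resolver.assert_not_called()
```

Patching the resolver keeps the test independent of cluster DNS, which matters because the real lookup depends on the node it runs on.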

Test plan

  • make check passes
  • Multi-node vLLM disaggregated run on the cluster reaches WORKERS_READY (no NIXL handshake errors in worker logs)

vLLM defaults VLLM_NIXL_SIDE_CHANNEL_HOST to 0.0.0.0/localhost, which
breaks the NIXL side-channel handshake across multiple nodes — workers
advertise an unreachable address and KV transfer setup fails.

Stamp the per-process host to the node's routable IP (same resolution
already used for distributed-init leader IPs) whenever the process has a
nixl_port allocated.

Originally landed on the sa-submission-q2-2026 branch via NVIDIA#11 but never
forward-ported to main; cherry-picking the minimal one-line behavior
change here with an updated test.