feat(vllm): switch DP+EP launch to hybrid_lb (per-node process)#90

Draft
esmeetu wants to merge 1 commit into NVIDIA:main from esmeetu:yasong/vllm-hybrid-dp-lb

Conversation


@esmeetu esmeetu commented Apr 27, 2026

DP+EP mode previously launched one srun task per GPU with a restricted CUDA_VISIBLE_DEVICES, which makes vLLM auto-select external_lb (one pod per rank). Locally co-located ranks then each see only `cuda:0`, and CUDA Symmetric Memory rendezvous fails with:

    CUDASymmetricMemoryAllocator::rendezvous: detected allocations from
    overlapping devices from different ranks.

This blocks DSv4 MegaMoE (deep_gemm.get_symm_buffer_for_mega_moe), the SymmMemCommunicator all-reduce path, and any future shared-namespace fast paths on GB200/GB300 nodes that pack multiple DP ranks per node.
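The failure mode can be illustrated with plain Python (an illustrative sketch of CUDA_VISIBLE_DEVICES index remapping on a hypothetical 4-GPU node; no CUDA required, and none of this is dynamo or vLLM code):

```python
def visible_device_index(physical_gpu: int, mask: str) -> int:
    """Return the logical CUDA index a process sees for `physical_gpu`
    under a CUDA_VISIBLE_DEVICES `mask`, or -1 if it is hidden."""
    visible = [int(x) for x in mask.split(",") if x != ""]
    return visible.index(physical_gpu) if physical_gpu in visible else -1

# Old launch: one srun task per GPU, mask restricted to that one GPU.
# Every co-located rank therefore sees its own GPU as logical index 0,
# which is what the rendezvous reports as "overlapping devices".
per_gpu_masks = {rank: str(rank) for rank in range(4)}
old_indices = [visible_device_index(r, per_gpu_masks[r]) for r in range(4)]
# old_indices == [0, 0, 0, 0]

# hybrid_lb launch: one process per node with the full GPU set visible,
# so each local rank binds a distinct logical device.
new_indices = [visible_device_index(r, "0,1,2,3") for r in range(4)]
# new_indices == [0, 1, 2, 3]
```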

Switch DP+EP to hybrid_lb:

  • endpoints_to_processes(): one Process per node (full local GPU set) instead of one Process per GPU. Reserves local_dp_size kv-events ports per process so dynamo's per-rank ZMQ publishers do not collide.

  • build_worker_command(): drop --data-parallel-rank (which silently flips vLLM into external_lb and forces size_local=1, see vllm/engine/arg_utils.py:1702-1717). Pass instead:

    --data-parallel-hybrid-lb
    --data-parallel-size-local <local_dp>
    --data-parallel-start-rank <node_rank * local_dp>
    --data-parallel-address    <leader_ip>
    --data-parallel-rpc-port   <port>
    

    `--data-parallel-hybrid-lb` is passed explicitly because vLLM's auto-detect at arg_utils.py:1721 (`if self.data_parallel_start_rank and not headless`) uses Python truthiness, so the leader node (start_rank=0) silently falls out of hybrid_lb and rejects worker engines with "Remote engine N must use --headless unless in external or hybrid dp lb mode".
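The truthiness pitfall is easy to reproduce in isolation (a paraphrase of the quoted check, not vLLM's actual code path):

```python
def autodetect_hybrid_lb(data_parallel_start_rank, headless: bool) -> bool:
    # Paraphrase of the quoted check: start_rank=0 is falsy, so the
    # leader node never auto-detects hybrid_lb from start_rank alone.
    return bool(data_parallel_start_rank and not headless)

def detect_hybrid_lb_explicit(data_parallel_start_rank, headless: bool) -> bool:
    # What an explicit None-check (or passing --data-parallel-hybrid-lb
    # outright) buys us: rank 0 stays in hybrid_lb mode.
    return data_parallel_start_rank is not None and not headless

# Worker node (e.g. start_rank=8): both checks agree.
assert autodetect_hybrid_lb(8, headless=False)
assert detect_hybrid_lb_explicit(8, headless=False)

# Leader node (start_rank=0): truthiness silently drops hybrid_lb.
assert not autodetect_hybrid_lb(0, headless=False)
assert detect_hybrid_lb_explicit(0, headless=False)
```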

Side effect: worker_stage.py's `len(gpu_indices) < gpus_per_node` check is now False for DP processes, so `CUDA_VISIBLE_DEVICES` is no longer injected: all local GPUs share one CUDA namespace.
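A minimal sketch of that guard (the function name and signature here are hypothetical, not worker_stage.py's real API):

```python
def build_env(gpu_indices: list, gpus_per_node: int) -> dict:
    """Inject CUDA_VISIBLE_DEVICES only when the process owns a strict
    subset of the node's GPUs; a per-node DP process gets no mask and
    keeps one shared CUDA namespace."""
    env = {}
    if len(gpu_indices) < gpus_per_node:
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env

# Per-GPU process (old DP launch): mask restricts visibility.
# build_env([2], 4) -> {"CUDA_VISIBLE_DEVICES": "2"}
# Per-node DP process (hybrid_lb): the check is False, no mask injected.
# build_env([0, 1, 2, 3], 4) -> {}
```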

Single-node DP (size_local == data_parallel_size) automatically collapses to internal_lb inside vLLM (arg_utils.py:1735-1737), so the flag is harmless there.
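The single-node collapse can be expressed as a small decision function (a sketch of the behavior described above, not vLLM's implementation):

```python
def effective_dp_lb_mode(data_parallel_size: int, size_local: int,
                         hybrid_lb: bool) -> str:
    """Sketch: hybrid_lb with every DP rank on one node is
    indistinguishable from internal_lb, so it collapses (the behavior
    attributed above to arg_utils.py:1735-1737)."""
    if hybrid_lb and size_local == data_parallel_size:
        return "internal_lb"  # a single node owns all DP ranks
    return "hybrid_lb" if hybrid_lb else "internal_lb"

# Single-node DEP4: the flag is harmless, mode collapses to internal_lb.
assert effective_dp_lb_mode(4, 4, hybrid_lb=True) == "internal_lb"
# 4-node DEP16 (4 local ranks per node): stays hybrid_lb.
assert effective_dp_lb_mode(16, 4, hybrid_lb=True) == "hybrid_lb"
```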

Tests: rewrite test_dp_mode_creates_per_node_processes and test_dp_mode_command_includes_dp_flags to assert the new shape; extend test_tp_mode_command_includes_multinode_flags to reject the hybrid_lb flags.

Verified on GB300 1P1D DEP4 (single-node) and 1P10D DEP16 (4-node) recipes via dry-run.
