@jianjunzhong commented Nov 25, 2025

What does this PR do?

Refactor the vLLM co-located training-inference rollout from a single-process to a multi-process architecture. This separates training and inference into different processes, enabling better resource isolation and paving the way for future checkpoint-engine integration (roadmap #3624).

Key Changes:

  • Transform vLLMAsyncRollout into ServerAdapter - a client-side adapter that communicates with the inference executor
  • Replace ExternalZeroMQDistributedExecutor with vLLMMultiprocExecutor - a new multiproc executor that serves as the inference backend
  • Implement CUDA IPC-based weight updates via ZeroMQ for efficient parameter synchronization between training and inference processes

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

This refactoring maintains full backward compatibility with existing vLLM rollout APIs. No changes are required to user code.

Key API Components:

  1. ServerAdapter (replaces vLLMAsyncRollout):

    • Acts as a client-side adapter for communicating with the inference executor
    • Manages CUDA IPC-based weight updates
    • Provides the same interface as the previous vLLMAsyncRollout class (a minimal sketch of the adapter-to-executor RPC follows this list)
  2. vLLMMultiprocExecutor (replaces ExternalZeroMQDistributedExecutor):

    • Inherits from vLLM's MultiprocExecutor
    • Handles RPC communication with training workers
    • Manages inference worker processes
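
To make the client/executor split concrete, here is a minimal sketch of the adapter-to-executor RPC path. It is illustrative only: the class name ServerAdapterSketch, the (method, args, kwargs) message format, and the REQ/REP socket pattern are assumptions, not the exact verl implementation.

```python
# Sketch only: a client-side adapter that forwards method calls to an
# inference executor over ZeroMQ. Names and message format are assumptions.
import pickle

import zmq


class ServerAdapterSketch:
    def __init__(self):
        self._ctx = zmq.Context()
        self._socket = None

    def set_executor_zmq_address(self, address: str) -> None:
        # Connect a REQ socket to the executor's REP endpoint.
        self._socket = self._ctx.socket(zmq.REQ)
        self._socket.connect(address)

    def execute_method(self, method: str, *args, **kwargs):
        # Serialize the call, send it to the executor, and wait for the reply.
        self._socket.send(pickle.dumps((method, args, kwargs)))
        return pickle.loads(self._socket.recv())


# Usage: operations like resume/release/update_weights reduce to execute_method calls.
# adapter = ServerAdapterSketch()
# adapter.set_executor_zmq_address("tcp://127.0.0.1:5555")
# adapter.execute_method("wake_up")
```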

Design & Code Changes

Architecture Overview

  1. Before (Single-Process Architecture)
  • Single-Process Design

In the original AsyncActorRolloutRefWorker, the training engine and inference engine shared the same process. The vLLM inference engine directly received weight updates through parameter passing.

(figure: single-process design)

  • Communication Architecture

ExternalZeroMQDistributedExecutor acted as a client, sending instructions to all AsyncActorRolloutRefWorker inference engines via ZMQ to execute operations such as init_worker, load_model, init_device, and generate. Operations such as wake_up, sleep, and weight updates were executed directly in vLLMAsyncRollout without going through ExternalZeroMQDistributedExecutor.

(figure: single-process communication architecture)

  2. After (Multi-Process Architecture)
  • Multi-Process Design

vLLMAsyncRollout is transformed into ServerAdapter, which serves as a client for communicating with the executor. Weight updates are based on CUDA IPC: IPC handles are passed over ZeroMQ to the inference engine.

(figure: multi-process design)

  • Communication Architecture

The original ExternalZeroMQDistributedExecutor class is deprecated and replaced by a new vLLMMultiprocExecutor class that inherits from MultiprocExecutor. It acts as a server receiving operations from the training worker at local_rank=0; all inference-engine operations are broadcast uniformly to the inference workers through vLLMMultiprocExecutor's RPC broadcast MQ (a sketch of this server loop follows the figure below).

(figure: multi-process communication architecture)
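
For illustration, the executor-side counterpart could look roughly like the following. This is a sketch under assumptions: broadcast_to_workers is a hypothetical stand-in for the executor's RPC broadcast (in vLLM terms, something like collective_rpc), and the message format mirrors the client sketch above.

```python
# Sketch only: the executor side binds a ZeroMQ REP socket, receives
# (method, args, kwargs) messages from the training worker at local_rank=0,
# and fans each call out to all inference workers.
import pickle

import zmq


def serve_executor(zmq_address: str, broadcast_to_workers):
    ctx = zmq.Context()
    socket = ctx.socket(zmq.REP)
    socket.bind(zmq_address)
    while True:
        method, args, kwargs = pickle.loads(socket.recv())
        # broadcast_to_workers is a hypothetical callable standing in for the
        # executor's RPC broadcast to every inference worker process.
        results = broadcast_to_workers(method, args, kwargs)
        socket.send(pickle.dumps(results))
```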

Detailed Code Changes

  1. verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py: Changes to vLLMAsyncRollout → ServerAdapter
  • Removed: inference_engine attribute and related initialization (_init_worker, _load_model, _init_device)
  • Removed: ZMQ server functionality (address, get_zeromq_address(), _init_zeromq(), _loop_forever(), _execute_method())
  • Added: ZMQ client functionality (set_executor_zmq_address(), _init_zmq_client(), execute_method())
  • Modified: resume(), release(), update_weights() to send messages via executor
  • Added: CUDA IPC handle management (get_update_weights_zmq_handle(), set_update_weights_zmq_handles())
  2. verl/workers/rollout/vllm_rollout/vllm_multiproc_executor.py: New file - Core Components
  • vLLMWorkerProc: Extends vLLM's WorkerProc with custom initialization

    • Rewrites __init__ method to adapt to verl's initialization requirements:
      • Applies FP8 quantization patches if enabled via VERL_VLLM_FP8_QUANT_ENABLED
      • Applies vocabulary size monkey patch for logits computation
    • Rewrites make_worker_process static method (modified from vLLM's implementation)
    • Rewrites worker_main static method to run worker initialization and execution loops
    • Handles graceful shutdown, using death monitoring to detect when the parent process exits (see the sketch after the note below)
  • vLLMMultiprocExecutor: Extends vLLM's MultiprocExecutor

    • Inherits multiproc execution capabilities from vLLM
    • Adds ZMQ communication with training workers
    • Broadcasts RPC commands to all inference workers
    • Manages lifecycle of inference worker processes

Note: Once vLLM updates make_worker_process and worker_main to class methods of WorkerProc, these two overrides will be removed.
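
The parent-death monitoring mentioned above can be sketched as follows. This shows only the general pipe-based pattern and is not the exact vLLM or verl code; start_death_monitor and death_read_fd are hypothetical names.

```python
# Sketch only: the parent keeps the write end of a pipe; when the parent
# exits, the read end sees EOF and the worker can shut itself down instead
# of being orphaned.
import os
import threading


def start_death_monitor(death_read_fd: int, shutdown_event: threading.Event) -> None:
    def _watch():
        with os.fdopen(death_read_fd, "rb") as pipe:
            pipe.read()  # blocks until the parent closes its end (i.e. exits)
        shutdown_event.set()

    threading.Thread(target=_watch, daemon=True, name="parent-death-monitor").start()


# The worker main loop would then check `shutdown_event` between RPCs and
# exit gracefully once it is set.
```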

  3. verl/workers/fsdp_workers.py: Changes to AsyncActorRolloutRefWorker
  • Removed: get_zeromq_address() method (no longer needed)
  • Added: set_executor_zmq_address() - sets ZMQ address for executor communication
  • Added: set_update_weights_zmq_handles() - configures IPC handles for weight updates
  • Added: get_update_weights_zmq_handle() - retrieves handle for weight synchronization
  4. verl/workers/rollout/vllm_rollout/utils.py: New class - vLLMColocateWorkerExtension
  • Worker extension class for vLLM instances
  • Integrates via --worker_extension_cls parameter
  • Enables the CUDA IPC-based weight update mechanism (the sketch after this list illustrates the idea)
  • Based on vLLM PR #24295 implementation
  5. verl/workers/rollout/vllm_rollout/vllm_async_server.py: Changes to vLLMHttpServerBase.launch_server()
  • Modified to use vLLMMultiprocExecutor instead of ExternalZeroMQDistributedExecutor
  • Added --worker_extension_cls parameter to pass vLLMColocateWorkerExtension
  • Generates and sets VERL_VLLM_EXECUTOR_ZMQ_ADDRESS environment variable
  • Distributes executor ZMQ address to all training workers
  • Retrieves and configures update weights ZMQ handles
  • Removed: VERL_VLLM_ZMQ_ADDRESSES environment variable (no longer needed)
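
To summarize the weight-update path end to end, here is a minimal sketch of the CUDA IPC mechanism (based on the idea in vLLM PR #24295, but with hypothetical helper names send_weight/recv_weight and an illustrative ZeroMQ message format): the training process serializes a CUDA IPC handle for each tensor and ships it over ZeroMQ, and the inference process rebuilds the tensor from the handle (zero-copy on the same GPU) and copies it into the model.

```python
# Sketch only: CUDA IPC-based weight transfer between two co-located processes.
import pickle

import torch
import zmq
from torch.multiprocessing.reductions import reduce_tensor


def send_weight(socket: zmq.Socket, name: str, tensor: torch.Tensor) -> None:
    # reduce_tensor returns (rebuild_fn, args) describing a CUDA IPC handle;
    # the tuple is picklable and lets another local process map the same memory.
    handle = reduce_tensor(tensor)
    socket.send(pickle.dumps((name, handle)))
    socket.recv()  # wait for the inference side to acknowledge


def recv_weight(socket: zmq.Socket, model: torch.nn.Module) -> None:
    name, (rebuild_fn, args) = pickle.loads(socket.recv())
    shared = rebuild_fn(*args)  # maps the training process's GPU memory, no copy
    model.get_parameter(name).data.copy_(shared)  # load into the inference weights
    socket.send(b"ok")
```

Note that CUDA IPC handles are only valid between processes on the same node and the same device, which matches the co-located training-inference setup.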

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@jianjunzhong force-pushed the refactor/vllm_sep_proc branch from 51c8ad9 to 714a32f on November 27, 2025, 14:59
@jianjunzhong changed the title from "[BREAKING][worker, rollout, vllm] feat: implement vLLM co-located training-inference rollout with process separation" to "[WIP][BREAKING][worker, rollout, vllm] feat: implement vLLM co-located training-inference rollout with process separation" on Nov 28, 2025
@jianjunzhong force-pushed the refactor/vllm_sep_proc branch from ba4512b to ca088a2 on December 7, 2025, 14:44
@jianjunzhong force-pushed the refactor/vllm_sep_proc branch 3 times, most recently from ef46ad3 to 2d5b9f1 on December 11, 2025, 01:45
@jianjunzhong force-pushed the refactor/vllm_sep_proc branch from 796366c to 6e2e0aa on December 11, 2025, 07:06
@jianjunzhong marked this pull request as ready for review on December 16, 2025, 07:02
@jianjunzhong changed the title from "[WIP][BREAKING][worker, rollout, vllm] feat: implement vLLM co-located training-inference rollout with process separation" to "[BREAKING][worker, rollout, vllm] feat: implement vLLM colocated training-inference rollout with process separation" on Dec 16, 2025
@jianjunzhong marked this pull request as draft on December 16, 2025, 14:08
@jianjunzhong force-pushed the refactor/vllm_sep_proc branch from 96c6d23 to d427468 on December 23, 2025, 09:25
@jianjunzhong force-pushed the refactor/vllm_sep_proc branch from 9f9fa58 to eb6fb52 on December 24, 2025, 13:22