@joyang-nv commented Dec 25, 2025

What does this PR do?

TensorRT-LLM has recently added a Ray orchestrator and the essential features required for the RL workflow. This PR introduces TensorRT-LLM as a new rollout engine for VeRL.

VeRL currently supports several rollout modes:

  • Hybrid engine: The training and rollout engines share the same process group. VeRL uses the WorkerDict class to manage multiple workers within a single process group. Communication between training and rollout workers takes place within the same process, allowing them to share the Torch GPU memory pool.
  • Colocated: Different engines use the same set of GPUs but run in separate process groups. Currently, this mode is used only by the reward model.
  • Standalone: Rollout engines use completely independent GPU resources.
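
As a rough illustration of how the hybrid engine differs from the colocated mode at the Ray level (this is not VeRL's actual WorkerDict implementation; the class names, role factories, and fractional GPU shares below are purely illustrative):

import ray

# Hybrid engine (illustrative): one process hosts several roles, so they share
# a single CUDA context and the same Torch GPU memory pool.
@ray.remote(num_gpus=1)
class HybridWorker:
    def __init__(self):
        self.roles = {}

    def add_role(self, name, role_cls):
        # Training and rollout roles are constructed inside the same process.
        self.roles[name] = role_cls()

    def execute(self, name, method, *args, **kwargs):
        return getattr(self.roles[name], method)(*args, **kwargs)

# Colocated mode (illustrative): separate processes scheduled onto the same GPU
# via fractional GPU requests; they share hardware but not a memory pool.
@ray.remote(num_gpus=0.5)
class ColocatedWorker:
    def __init__(self, role_cls):
        self.role = role_cls()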

Unlike other rollout engines, TensorRT-LLM primarily targets the colocated mode. However, instead of relying purely on the standard colocated mode, we introduce a mixed design that combines aspects of the hybrid engine and colocated modes. The design goals are:

  • Clear resource separation through distinct process groups, offering maximum flexibility between training and rollout processes.
  • Hybrid workers that act as proxies to LLM servers.
  • Fully RESTful rollout API support through TRTLLMHttpServer (see the request sketch after this list).
  • A unified framework for both asynchronous and synchronous RL workflows.
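
To illustrate the RESTful rollout API goal, a request against the OpenAI-compatible endpoint exposed by TRTLLMHttpServer might look like the following. The host, port, and route here are assumptions for illustration only; the real address is assigned by the Ray cluster, and the actual server interface is documented in trtllm_async_rollout.md.

import requests

# Hypothetical address of a TRTLLMHttpServer replica.
BASE_URL = "http://127.0.0.1:8000"

payload = {
    "model": "Qwen2-7B",
    "messages": [{"role": "user", "content": "Solve: 12 * 13 = ?"}],
    "max_tokens": 256,
    "temperature": 1.0,
}

# Standard OpenAI-style chat completions route; whether TRTLLMHttpServer uses
# exactly this path is an assumption based on its "OpenAI Server wrapper" role.
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])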

This PR aims to keep the integration as minimally intrusive to VeRL's infrastructure as possible. Currently, it only invokes RolloutReplica.init_hybrid_colocated() when both the hybrid engine is enabled and the rollout engine is set to TensorRT-LLM.
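
In pseudocode, that dispatch amounts to a single conditional. Only RolloutReplica.init_hybrid_colocated() is named in this PR; the config keys and the fallback path below are illustrative, not VeRL's exact code.

def init_rollout_replica(replica, config):
    # Only this combination takes the new hybrid/colocated path introduced here.
    if config.hybrid_engine and config.rollout.name == "trtllm":
        replica.init_hybrid_colocated()
    else:
        # Hypothetical fallback: other engines keep their existing init paths.
        replica.init_standalone()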

High Level Design

Please refer to workers/rollout/trtllm_rollout/trtllm_async_rollout.md for more details.

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'18px', 'edgeLabelBackground':'#eeeeee'}}}%%
flowchart TB
    space1[" "]
    style space1 fill:none,stroke:none
    
    subgraph VERL["<b>VERL Training Pipeline</b>"]
        subgraph Workers["<b>Training Workers</b>"]
            Actor["<b>Actor Worker</b>"]
            Critic["<b>Critic Worker</b>"]
            RefModel["<b>Ref Model Worker</b>"]
        end
        
        Actor -->|<b>Weight Updates<br/>IPC</b>| Rollout["<b>TensorRT-LLM Rollout</b>"]
        
        subgraph RayCluster["<b>Rollout Workers<br/>(Ray Cluster)</b>"]
            space2[" "]
            style space2 fill:none,stroke:none
            
            subgraph AsyncRollout["<b>TRTLLMAsyncRollout<br/>(per DP rank)</b>"]
                DPLeader["<b>• DP Leader coordination</b>"]
                IPCMgmt["<b>• IPC handle management</b>"]
                HTTPAdapter["<b>• HTTP adapter for server communication</b>"]
            end
            
            AsyncRollout -->|<b>HTTP/REST API</b>| HTTPServer
            
            subgraph HTTPServer["<b>TRTLLMHttpServer<br/>(Ray Actor per Replica)</b>"]
                OpenAI["<b>• OpenAI Server wrapper</b>"]
                EngMgmt["<b>• AsyncLLM engine management</b>"]
                MemMgmt["<b>• Memory management (resume/release)</b>"]
            end
            
            HTTPServer --> AsyncLLM
            
            subgraph AsyncLLM["<b>TensorRT-LLM<br/>AsyncLLM Engine</b>"]
                GPUWorkers["<b>• GPU workers (Tensor Parallel)</b>"]
                KVCache["<b>• KV Cache management</b>"]
                CUDAGraph["<b>• CUDA Graph optimization</b>"]
            end
        end
    end
    
    space1 ~~~ VERL
    
    style VERL fill:#e1f5ff
    style RayCluster fill:#fff4e6
    style AsyncRollout fill:#f3e5f5
    style HTTPServer fill:#e8f5e9
    style AsyncLLM fill:#fce4ec
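
To make the diagram concrete, a simplified sketch of the server-side Ray actor and its per-DP-rank proxy is shown below. The class and method names, resource requests, and return values are illustrative stand-ins for the real TRTLLMHttpServer and TRTLLMAsyncRollout described in trtllm_async_rollout.md.

import ray

@ray.remote(num_gpus=0)  # illustrative; the real actor's resource request may differ
class HttpServerSketch:
    """Stand-in for TRTLLMHttpServer: one Ray actor per rollout replica."""

    def __init__(self, model_path: str, tp_size: int):
        # In the real integration this wraps TensorRT-LLM's AsyncLLM engine and
        # an OpenAI-compatible HTTP server; here we only record the config.
        self.model_path = model_path
        self.tp_size = tp_size
        self.sleeping = False

    def release_memory(self):
        # Free KV cache / weights so the colocated training step can use the GPUs.
        self.sleeping = True

    def resume_memory(self):
        # Restore engine state before the next generation phase.
        self.sleeping = False

    def generate(self, prompt: str) -> str:
        assert not self.sleeping, "engine memory has been released"
        return f"<generated continuation of: {prompt!r}>"  # placeholder output

class AsyncRolloutProxy:
    """Stand-in for TRTLLMAsyncRollout: one per DP rank, forwarding work to the
    server (simplified here to direct Ray calls instead of HTTP)."""

    def __init__(self, server):
        self.server = server

    def rollout(self, prompts):
        return ray.get([self.server.generate.remote(p) for p in prompts])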

Experiment results:

Setup: single node with 8 × H100 GPUs in a Slurm environment.

  1. FSDP/GRPO: Qwen2-7B (TP1 × 8 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 1)
    • Convergence: (image)
    • Validation: (image)

  2. FSDP/GRPO: Qwen2-7B (TP4 × 2 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 4)
    • Convergence: (image)
    • Validation: (image)

  3. Megatron/GRPO: Qwen2-7B (TP1 × 8 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 1)
    • Convergence: (image)
    • Validation: (image)

  4. Megatron/GRPO: Qwen2-7B (TP2 × 2 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 4)
    • Convergence: (image)
    • Validation: (image)

Special notes for using VeRL with TensorRT-LLM:

  1. All RL-related APIs required by VeRL are available as of TensorRT-LLM 1.2.0rc6. To install VeRL with TensorRT-LLM, use the command pip install -e ".[trtllm]" --extra-index-url https://pypi.nvidia.com/.
  2. Verification of the integration was done primarily in a Slurm environment.
  3. The current design requires export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 and the following environment settings before launching the Ray cluster. These are already included in the example scripts and tests added here; we will work toward removing these dependencies to improve the user experience in the near future.
# Clean all slurm / MPI / PMIx env to avoid pmix mismatch error
for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
    unset "$v"
done
# Force UCX to use only eth0; otherwise, it will attempt to use all available devices and raise warnings if any issues occur.
export TRTLLM_UCX_INTERFACE=eth0

Outstanding issues for this PR:

  1. WIP on passing CI tests

Upcoming work (in separate PRs)

  1. Further performance optimization.
  2. Multi-node testing and functionality, to be delivered in the near future.
  3. This PR focuses on, and was validated with, Qwen model variants. We will work on validation and optimization for MoE models as the next step.


@CLAassistant commented Dec 25, 2025

CLA assistant check: all committers have signed the CLA.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces TensorRT-LLM as a new rollout engine, which is a significant feature addition. The implementation is comprehensive, including a new Dockerfile, example scripts, documentation, and the core logic for the TRTLLM-based rollout in a hybrid colocated mode. The design leverages Ray for orchestration and IPC for efficient weight updates.

My review has identified a critical bug in one of the new example scripts (recipe/dapo/test_dapo_7b_math_trtllm.sh) where a variable is used before definition, which would lead to incorrect checkpoint paths. I've also found a potential IndexError in the GPU device ID mapping logic in verl/workers/rollout/trtllm_rollout/trtllm_rollout.py that could cause a crash in certain environments.

Overall, the changes are well-structured and the addition of TensorRT-LLM is a valuable enhancement. Addressing the identified issues will improve the robustness of this new feature.

@joyang-nv changed the title from "[BREAKING][ray,rollout] feat: Adding tensorrt_llm as new rollout engine" to "[ray,rollout,trtllm] feat: Adding tensorrt_llm as new rollout engine" on Dec 25, 2025
joyang-nv and others added 3 commits December 29, 2025 23:43
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@joyang-nv marked this pull request as draft on December 30, 2025, 09:10
