@joyang-nv commented Dec 25, 2025

What does this PR do?

TensorRT-LLM has recently added a Ray orchestrator and the essential features required for the RL workflow. This PR introduces TensorRT-LLM as a new rollout engine for VeRL.

VeRL currently supports several rollout modes:

  • Hybrid engine: The training and rollout engines share the same process group. VeRL uses the WorkerDict class to manage multiple workers within a single process group. Communication between training and rollout workers takes place within the same process, allowing them to share the Torch GPU memory pool.
  • Colocated: Different engines use the same set of GPUs but run in separate process groups. Currently, this mode is used only by the reward model.
  • Standalone: Rollout engines use completely independent GPU resources.
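
As a rough illustration of how the hybrid engine differs from the colocated mode at the Ray level (this is not VeRL's actual WorkerDict implementation; the class names, role factories, and fractional GPU shares below are purely illustrative):

import ray

# Hybrid engine (illustrative): one process hosts several roles, so they share
# a single CUDA context and the same Torch GPU memory pool.
@ray.remote(num_gpus=1)
class HybridWorker:
    def __init__(self):
        self.roles = {}

    def add_role(self, name, role_cls):
        # Training and rollout roles are constructed inside the same process.
        self.roles[name] = role_cls()

    def execute(self, name, method, *args, **kwargs):
        return getattr(self.roles[name], method)(*args, **kwargs)

# Colocated mode (illustrative): separate processes scheduled onto the same GPU
# via fractional GPU requests; they share hardware but not a memory pool.
@ray.remote(num_gpus=0.5)
class ColocatedWorker:
    def __init__(self, role_cls):
        self.role = role_cls()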

Unlike other rollout engines, TensorRT-LLM primarily targets the colocated mode. However, instead of relying purely on the standard colocated mode, we introduce a mixed design that combines aspects of the hybrid engine and colocated modes. The design goals are:

  • Clear resource separation through distinct process groups, offering maximum flexibility between training and rollout processes.
  • Hybrid workers that act as proxies to LLM servers.
  • Fully RESTful rollout API support through TRTLLMHttpServer (see the request sketch after this list).
  • A unified framework for both asynchronous and synchronous RL workflows.
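
To illustrate the RESTful rollout API goal, a request against the OpenAI-compatible endpoint exposed by TRTLLMHttpServer might look like the following. The host, port, and route here are assumptions for illustration only; the real address is assigned by the Ray cluster, and the actual server interface is documented in trtllm_async_rollout.md.

import requests

# Hypothetical address of a TRTLLMHttpServer replica.
BASE_URL = "http://127.0.0.1:8000"

payload = {
    "model": "Qwen2-7B",
    "messages": [{"role": "user", "content": "Solve: 12 * 13 = ?"}],
    "max_tokens": 256,
    "temperature": 1.0,
}

# Standard OpenAI-style chat completions route; whether TRTLLMHttpServer uses
# exactly this path is an assumption based on its "OpenAI Server wrapper" role.
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])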

This PR aims to keep the integration as minimally intrusive to VeRL's infrastructure as possible. Currently, it only invokes RolloutReplica.init_hybrid_colocated() when both the hybrid engine is enabled and the rollout engine is set to TensorRT-LLM.
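
In pseudocode, that dispatch amounts to a single conditional. Only RolloutReplica.init_hybrid_colocated() is named in this PR; the config keys and the fallback path below are illustrative, not VeRL's exact code.

def init_rollout_replica(replica, config):
    # Only this combination takes the new hybrid/colocated path introduced here.
    if config.hybrid_engine and config.rollout.name == "trtllm":
        replica.init_hybrid_colocated()
    else:
        # Hypothetical fallback: other engines keep their existing init paths.
        replica.init_standalone()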

High Level Design

Please refer to workers/rollout/trtllm_rollout/trtllm_async_rollout.md for more details.

%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'18px', 'edgeLabelBackground':'#eeeeee'}}}%%
flowchart TB
    space1[" "]
    style space1 fill:none,stroke:none
    
    subgraph VERL["<b>VERL Training Pipeline</b>"]
        subgraph Workers["<b>Training Workers</b>"]
            Actor["<b>Actor Worker</b>"]
            Critic["<b>Critic Worker</b>"]
            RefModel["<b>Ref Model Worker</b>"]
        end
        
        Actor -->|<b>Weight Updates<br/>IPC</b>| Rollout["<b>TensorRT-LLM Rollout</b>"]
        
        subgraph RayCluster["<b>Rollout Workers<br/>(Ray Cluster)</b>"]
            space2[" "]
            style space2 fill:none,stroke:none
            
            subgraph AsyncRollout["<b>TRTLLMAsyncRollout<br/>(per DP rank)</b>"]
                DPLeader["<b>• DP Leader coordination</b>"]
                IPCMgmt["<b>• IPC handle management</b>"]
                HTTPAdapter["<b>• HTTP adapter for server communication</b>"]
            end
            
            AsyncRollout -->|<b>HTTP/REST API</b>| HTTPServer
            
            subgraph HTTPServer["<b>TRTLLMHttpServer<br/>(Ray Actor per Replica)</b>"]
                OpenAI["<b>• OpenAI Server wrapper</b>"]
                EngMgmt["<b>• AsyncLLM engine management</b>"]
                MemMgmt["<b>• Memory management (resume/release)</b>"]
            end
            
            HTTPServer --> AsyncLLM
            
            subgraph AsyncLLM["<b>TensorRT-LLM<br/>AsyncLLM Engine</b>"]
                GPUWorkers["<b>• GPU workers (Tensor Parallel)</b>"]
                KVCache["<b>• KV Cache management</b>"]
                CUDAGraph["<b>• CUDA Graph optimization</b>"]
            end
        end
    end
    
    space1 ~~~ VERL
    
    style VERL fill:#e1f5ff
    style RayCluster fill:#fff4e6
    style AsyncRollout fill:#f3e5f5
    style HTTPServer fill:#e8f5e9
    style AsyncLLM fill:#fce4ec
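
To make the diagram concrete, a simplified sketch of the server-side Ray actor and its per-DP-rank proxy is shown below. The class and method names, resource requests, and return values are illustrative stand-ins for the real TRTLLMHttpServer and TRTLLMAsyncRollout described in trtllm_async_rollout.md.

import ray

@ray.remote(num_gpus=0)  # illustrative; the real actor's resource request may differ
class HttpServerSketch:
    """Stand-in for TRTLLMHttpServer: one Ray actor per rollout replica."""

    def __init__(self, model_path: str, tp_size: int):
        # In the real integration this wraps TensorRT-LLM's AsyncLLM engine and
        # an OpenAI-compatible HTTP server; here we only record the config.
        self.model_path = model_path
        self.tp_size = tp_size
        self.sleeping = False

    def release_memory(self):
        # Free KV cache / weights so the colocated training step can use the GPUs.
        self.sleeping = True

    def resume_memory(self):
        # Restore engine state before the next generation phase.
        self.sleeping = False

    def generate(self, prompt: str) -> str:
        assert not self.sleeping, "engine memory has been released"
        return f"<generated continuation of: {prompt!r}>"  # placeholder output

class AsyncRolloutProxy:
    """Stand-in for TRTLLMAsyncRollout: one per DP rank, forwarding work to the
    server (simplified here to direct Ray calls instead of HTTP)."""

    def __init__(self, server):
        self.server = server

    def rollout(self, prompts):
        return ray.get([self.server.generate.remote(p) for p in prompts])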

Experiment results:

Setup: single node with 8 × H100 GPUs in a Slurm environment.

  1. FSDP/GRPO: Qwen2-7B (TP1 × 8 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 1)
    • Convergence: (image)
    • Validation: (image)

  2. FSDP/GRPO: Qwen2-7B (TP4 × 2 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 4)
    • Convergence: (image)
    • Validation: (image)

  3. Megatron/GRPO: Qwen2-7B (TP1 × 8 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 1)
    • Convergence: (image)
    • Validation: (image)

  4. Megatron/GRPO: Qwen2-7B (TP2 × 2 on 8 GPUs; launch command: bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 4)
    • Convergence: (image)
    • Validation: (image)

Special notes for using VeRL with TensorRT-LLM:

  1. All RL-related APIs required by VeRL are available as of TensorRT-LLM 1.2.0rc6. To install VeRL with TensorRT-LLM, use the command pip install -e ".[trtllm]" --extra-index-url https://pypi.nvidia.com/.
  2. Verification of the integration was done primarily in a Slurm environment.
  3. The current design requires export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 and the following environment settings before launching the Ray cluster. These are already included in the example scripts and tests added here; we will work toward removing these dependencies to improve the user experience in the near future.
# Clean all slurm / MPI / PMIx env to avoid pmix mismatch error
for v in $(env | awk -F= '/^(PMI|PMIX|MPI|OMPI|SLURM)_/{print $1}'); do
    unset "$v"
done
# Force UCX to use only eth0; otherwise, it will attempt to use all available devices and raise warnings if any issues occur.
export TRTLLM_UCX_INTERFACE=eth0

Outstanding issues for this PR:

  1. WIP on passing CI tests

Upcoming work (in separate PRs)

  1. Further performance optimization.
  2. Multi-node testing and functionality, to be delivered in the near future.
  3. This PR focuses on, and was validated with, Qwen model variants. We will work on validation and optimization for MoE models as the next step.


@CLAassistant commented Dec 25, 2025

CLA assistant check: all committers have signed the CLA.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces TensorRT-LLM as a new rollout engine, which is a significant feature addition. The implementation is comprehensive, including a new Dockerfile, example scripts, documentation, and the core logic for the TRTLLM-based rollout in a hybrid colocated mode. The design leverages Ray for orchestration and IPC for efficient weight updates.

My review has identified a critical bug in one of the new example scripts (recipe/dapo/test_dapo_7b_math_trtllm.sh) where a variable is used before definition, which would lead to incorrect checkpoint paths. I've also found a potential IndexError in the GPU device ID mapping logic in verl/workers/rollout/trtllm_rollout/trtllm_rollout.py that could cause a crash in certain environments.

Overall, the changes are well-structured and the addition of TensorRT-LLM is a valuable enhancement. Addressing the identified issues will improve the robustness of this new feature.

@joyang-nv changed the title from "[BREAKING][ray,rollout] feat: Adding tensorrt_llm as new rollout engine" to "[ray,rollout,trtllm] feat: Adding tensorrt_llm as new rollout engine" on Dec 25, 2025
joyang-nv and others added 3 commits December 29, 2025 23:43
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@joyang-nv marked this pull request as draft on December 30, 2025, 09:10
