[ray,rollout,trtllm] feat: Adding tensorrt_llm as new rollout engine #4665
+1,823
−20
What does this PR do?
TensorRT-LLM has recently added a Ray orchestrator and the essential features required for the RL workflow. This PR introduces TensorRT-LLM as a new rollout engine for VeRL.
VeRL currently supports several rollout modes. In the hybrid engine mode, a `WorkerDict` class manages multiple workers within a single process group; communication between training and rollout workers takes place within the same process, allowing them to share the Torch GPU memory pool.

Unlike other rollout engines, TensorRT-LLM primarily targets the colocated mode. However, instead of relying purely on the standard colocated mode, we introduced a mixed design combining aspects of the hybrid engine and the colocated mode, in which each rollout replica serves generation through a dedicated Ray actor, `TRTLLMHttpServer`.
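To make the split concrete, here is a minimal, hypothetical sketch of the two roles: `TRTLLMHttpServer` as a per-replica Ray actor that owns the TensorRT-LLM AsyncLLM engine behind an OpenAI-style HTTP API, and `TRTLLMAsyncRollout` as the per-DP-rank adapter that reaches it over HTTP. Class bodies, method names, the port, the endpoint, and the payload are illustrative assumptions, not the PR's actual interfaces (those live under `workers/rollout/trtllm_rollout/`).

```python
# Illustrative sketch only. The real classes live under
# workers/rollout/trtllm_rollout/ in this PR; the method names, port,
# endpoint, and payload below are assumptions, not the actual API.
import ray
import requests


@ray.remote(num_gpus=1)
class TRTLLMHttpServerSketch:
    """Stand-in for TRTLLMHttpServer: one Ray actor per rollout replica that
    would own the TensorRT-LLM AsyncLLM engine and expose an OpenAI-style
    HTTP endpoint."""

    def __init__(self, model_path: str, tp_size: int, port: int = 8000):
        self.model_path = model_path
        self.tp_size = tp_size
        self.port = port
        # The real actor would start tensorrt_llm's AsyncLLM engine and an
        # OpenAI-compatible server bound to `self.port` here.

    def address(self) -> str:
        return f"http://{ray.util.get_node_ip_address()}:{self.port}"

    def release_memory(self) -> None:
        """Free KV cache / weights so the colocated trainer can use the GPUs."""

    def resume_memory(self) -> None:
        """Re-allocate KV cache before the next generation phase."""


class TRTLLMAsyncRolloutSketch:
    """Stand-in for TRTLLMAsyncRollout: one instance per DP rank that forwards
    generation requests to its replica's HTTP server instead of owning an
    engine itself."""

    def __init__(self, server_address: str):
        self.server_address = server_address

    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        resp = requests.post(
            f"{self.server_address}/v1/completions",
            json={"prompt": prompt, "max_tokens": max_tokens},
            timeout=600,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
```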
This PR aims to make the integration as minimally intrusive as possible to VeRL's infrastructure. Currently, it only invokes `RolloutReplica.init_hybrid_colocated()` when both the hybrid engine is enabled and the rollout engine is set to TensorRT-LLM.
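As a rough illustration of how narrow that integration point is, the selection boils down to a single conditional. The sketch below uses assumed config field names (`hybrid_engine`, `rollout.name`) and a placeholder fallback; only `RolloutReplica.init_hybrid_colocated()` comes from the PR itself.

```python
# Hypothetical sketch of the single integration point described above. The
# config field names (`hybrid_engine`, `rollout.name`) and the fallback call
# are assumptions; only RolloutReplica.init_hybrid_colocated() is named by
# the PR itself.
def init_rollout(config, rollout_replica):
    if config.hybrid_engine and config.rollout.name == "trtllm":
        # New path added by this PR: mixed hybrid/colocated initialization
        # for the TensorRT-LLM rollout engine.
        return rollout_replica.init_hybrid_colocated()
    # Every other rollout engine keeps its existing initialization path
    # (placeholder name, not the real method).
    return rollout_replica.init_default()
```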
High-Level Design
Please refer to `workers/rollout/trtllm_rollout/trtllm_async_rollout.md` for more details.
```mermaid
%%{init: {'theme':'base', 'themeVariables': { 'fontSize':'18px', 'edgeLabelBackground':'#eeeeee'}}}%%
flowchart TB
    space1[" "]
    style space1 fill:none,stroke:none
    subgraph VERL["<b>VERL Training Pipeline</b>"]
        subgraph Workers["<b>Training Workers</b>"]
            Actor["<b>Actor Worker</b>"]
            Critic["<b>Critic Worker</b>"]
            RefModel["<b>Ref Model Worker</b>"]
        end
        Actor -->|<b>Weight Updates<br/>IPC</b>| Rollout["<b>TensorRT-LLM Rollout</b>"]
        subgraph RayCluster["<b>Rollout Workers<br/>(Ray Cluster)</b>"]
            space2[" "]
            style space2 fill:none,stroke:none
            subgraph AsyncRollout["<b>TRTLLMAsyncRollout<br/>(per DP rank)</b>"]
                DPLeader["<b>• DP Leader coordination</b>"]
                IPCMgmt["<b>• IPC handle management</b>"]
                HTTPAdapter["<b>• HTTP adapter for server communication</b>"]
            end
            AsyncRollout -->|<b>HTTP/REST API</b>| HTTPServer
            subgraph HTTPServer["<b>TRTLLMHttpServer<br/>(Ray Actor per Replica)</b>"]
                OpenAI["<b>• OpenAI Server wrapper</b>"]
                EngMgmt["<b>• AsyncLLM engine management</b>"]
                MemMgmt["<b>• Memory management (resume/release)</b>"]
            end
            HTTPServer --> AsyncLLM
            subgraph AsyncLLM["<b>TensorRT-LLM<br/>AsyncLLM Engine</b>"]
                GPUWorkers["<b>• GPU workers (Tensor Parallel)</b>"]
                KVCache["<b>• KV Cache management</b>"]
                CUDAGraph["<b>• CUDA Graph optimization</b>"]
            end
        end
    end
    space1 ~~~ VERL
    style VERL fill:#e1f5ff
    style RayCluster fill:#fff4e6
    style AsyncRollout fill:#f3e5f5
    style HTTPServer fill:#e8f5e9
    style AsyncLLM fill:#fce4ec
```
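The "Weight Updates IPC" edge above is the path where the actor worker hands updated parameters to the colocated rollout engine without a host-memory round trip. Below is a minimal, hypothetical illustration of CUDA IPC sharing using PyTorch's generic reduction utilities; the function names are made up and this is not the PR's actual weight-update protocol.

```python
# Hypothetical illustration of the "Weight Updates IPC" edge in the diagram
# above: sharing a CUDA tensor between colocated processes on the same node
# without staging through host memory. This uses PyTorch's generic CUDA IPC
# reduction and is NOT the PR's actual weight-update protocol.
import torch
from torch.multiprocessing.reductions import reduce_tensor


def export_ipc_handle(weight: torch.Tensor):
    """Trainer side: serialize a CUDA tensor into a picklable IPC handle."""
    assert weight.is_cuda
    rebuild_fn, rebuild_args = reduce_tensor(weight)  # CUDA IPC under the hood
    return rebuild_fn, rebuild_args


def import_ipc_handle(handle) -> torch.Tensor:
    """Rollout side: map the shared tensor into this process, zero-copy."""
    rebuild_fn, rebuild_args = handle
    return rebuild_fn(*rebuild_args)
```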
Experiment results:

Setup: a single node with 8 × H100 GPUs, running in a Slurm environment.
FSDP/GRPO: Qwen2-7B (TP1 * 8 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 1`)

Convergence:

Validation:

FSDP/GRPO: Qwen2-7B (TP4 * 2 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_trtllm.sh 4`)

Convergence:

Validation:

Megatron/GRPO: Qwen2-7B (TP1 * 8 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 1`)

Convergence:

Validation:

Megatron/GRPO: Qwen2-7B (TP2 * 2 on 8 GPUs, launch command `bash examples/grpo_trainer/run_qwen2-7b_math_megatron_trtllm.sh 4`)

Convergence:

Validation:

Special notes for using VeRL with TensorRT-LLM:
- Install the TensorRT-LLM dependencies with `pip install -e ".[trtllm]" --extra-index-url https://pypi.nvidia.com/`.
- Set `export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` and the other required environment settings before launching the Ray cluster; see the sketch below. These are already included in the example scripts and tests added here, and we will work toward removing such dependencies to improve the user experience in the near future.
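For reference, a hypothetical launcher-side snippet; it only applies when the current process is the one that starts the (local) Ray cluster, which is why the example scripts simply `export` the variable before launching.

```python
# Hypothetical sketch: the flag must be present in the environment that starts
# the Ray cluster. The documented path in this PR's scripts is to `export` it
# in the shell before `ray start`; setting it here only helps when this driver
# process is the one that launches the (local) cluster.
import os

import ray

os.environ["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"

ray.init()  # a local cluster started by this driver inherits the variable
```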
Outstanding issues for this PR:

Upcoming work (in separate PRs):
Important
Please check all the following items before requesting a review; otherwise the reviewer might deprioritize this PR.

- Run the pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- For CI, request a run in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)