[trainer] fix: model engine vlm multi_modal_inputs to NonTensorStack #4492
Conversation
Code Review
This pull request addresses an issue with handling `multi_modal_inputs` for VLM model engines by correctly transposing them into a `NonTensorStack`. It also includes a good refactoring in `verl/workers/engine_workers.py` that centralizes device placement logic by moving inference results to the CPU and removing redundant `.cpu()` calls. My review includes one suggestion in `verl/protocol.py` to improve the memory efficiency of key collection, which is an important consideration for large-scale models.
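For context, a minimal sketch of the `NonTensorStack` transposition involved, assuming tensordict's public API; the sample keys (`pixel_values`, `image_grid_thw`) are illustrative, not verl's exact schema:

```python
import torch
from tensordict import NonTensorData

# Hypothetical per-sample multi-modal inputs (the keys are illustrative,
# not verl's actual schema).
samples = [
    {"pixel_values": torch.randn(4, 3, 14, 14), "image_grid_thw": torch.tensor([[1, 2, 2]])},
    {"pixel_values": torch.randn(16, 3, 14, 14), "image_grid_thw": torch.tensor([[1, 4, 4]])},
]

# Stacking NonTensorData elements yields a NonTensorStack with a leading
# batch dimension, so the ragged per-sample dicts can live in a batched
# TensorDict and be indexed/chunked alongside ordinary tensors.
multi_modal_inputs = torch.stack([NonTensorData(s) for s in samples], dim=0)
print(type(multi_modal_inputs).__name__)  # NonTensorStack
```

This is the same `torch.stack`-over-`NonTensorData` pattern that the collation code quoted below relies on for non-tensor values.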
Force-pushed from 8358145 to 954033c.
/gemini review
Code Review
This pull request introduces several important fixes and refactorings to improve support for Vision Language Models (VLMs), particularly around handling `multi_modal_inputs` and variable-length sequences. The core change is to group multi-modal data under a single `multi_modal_inputs` key, which is a cleaner data structure. The PR correctly disables `pin_memory` in dataloaders to prevent crashes with `NestedTensor`s, and adds a necessary workaround for chunking `TensorDict`s containing 3D jagged tensors. My review identifies one critical issue in the updated data collation logic that could lead to a `KeyError` when processing mixed batches of VLM and text-only data. A code suggestion is provided to fix this.
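As a rough illustration of the `pin_memory` point, here is a generic `DataLoader` whose collate emits jagged `NestedTensor`s; the dataset and key names are made up for the example, not verl's config:

```python
import torch
from torch.utils.data import DataLoader

def collate(batch):
    # Variable-length samples -> one jagged NestedTensor per batch.
    return torch.nested.as_nested_tensor(
        [item["input_ids"] for item in batch], layout=torch.jagged
    )

dataset = [{"input_ids": torch.randint(0, 100, (n,))} for n in (5, 9, 3)]

# pin_memory=True would try to pin the collated output; jagged
# NestedTensors do not reliably support that path, so pinning stays off.
loader = DataLoader(dataset, batch_size=3, collate_fn=collate, pin_memory=False)
for nt in loader:
    print(nt.is_nested, [t.shape for t in nt.unbind()])
```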
```python
if isinstance(batch[0][key], torch.Tensor):
    tensors = [item[key] for item in batch]
    final_batch[key] = torch.nested.as_nested_tensor(tensors, layout=torch.jagged)
else:
    tensors = [NonTensorData(item.get(key)) for item in batch]
    final_batch[key] = torch.stack(tensors, dim=0)
```
The logic `isinstance(batch[0][key], torch.Tensor)` is not robust and can lead to a `KeyError`. The `tensor_keys` set is a union of keys from all samples in the batch. If a key (e.g., `multi_modal_inputs`) is present in some samples but not in `batch[0]`, accessing `batch[0][key]` will crash. This is likely to happen when a batch mixes vision-language and text-only data.

Checking for the key's existence in `batch[0]` before checking its type will prevent this crash.
Suggested change:

```diff
-if isinstance(batch[0][key], torch.Tensor):
+if key in batch[0] and isinstance(batch[0][key], torch.Tensor):
     tensors = [item[key] for item in batch]
     final_batch[key] = torch.nested.as_nested_tensor(tensors, layout=torch.jagged)
 else:
     tensors = [NonTensorData(item.get(key)) for item in batch]
     final_batch[key] = torch.stack(tensors, dim=0)
```
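To see why the guard matters, here is a self-contained sketch of the suggested collate running on a hypothetical mixed batch; the key names mirror the snippet above, and `NonTensorData` comes from tensordict:

```python
import torch
from tensordict import NonTensorData

# Hypothetical mixed batch: sample 0 is text-only, sample 1 carries
# multi_modal_inputs. Key names mirror the suggestion above.
batch = [
    {"input_ids": torch.tensor([1, 2, 3])},
    {"input_ids": torch.tensor([4, 5]),
     "multi_modal_inputs": {"pixel_values": torch.randn(3, 224, 224)}},
]
tensor_keys = set().union(*(item.keys() for item in batch))

final_batch = {}
for key in tensor_keys:
    # "multi_modal_inputs" is absent from batch[0], so the guard routes it
    # to the else branch, where item.get(key) returns None for the
    # text-only sample instead of raising KeyError.
    if key in batch[0] and isinstance(batch[0][key], torch.Tensor):
        tensors = [item[key] for item in batch]
        final_batch[key] = torch.nested.as_nested_tensor(tensors, layout=torch.jagged)
    else:
        final_batch[key] = torch.stack(
            [NonTensorData(item.get(key)) for item in batch], dim=0
        )
```

Without the guard, `multi_modal_inputs` would hit `batch[0][key]` and raise `KeyError` whenever the first sample in the batch happened to be text-only.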
What does this PR do?
Fixes the RL model engine for VLMs.
Comparison of Qwen/Qwen3-VL-30B-A3B-Instruct with FSDP vs. Megatron on geo3k:
