RL Training RuntimeError in vLLM rollout when videos have different frame lengths #2

@JoeYangRL

Hi, thanks for the great work on this project.
While reproducing RL training, we hit a RuntimeError during the vllm.generate() stage after a tool call.

Description

When the tool returns videos with different numbers of frames, the preprocessing step inside the Qwen2-VL video processor fails due to a torch.stack size mismatch.

It appears that group_videos_by_shape groups videos only by spatial shape (H, W), but then stacks tensors that may have different temporal lengths T. As a result, the error occurs whenever multiple videos with the same resolution but different frame counts are passed in the same batch.

Sample info

```
videos_count=2 videos=[{'i': 0, 'shape': [76, 3, 364, 644], 'dtype': 'torch.float32', 'device': 'cpu'}, {'i': 1, 'shape': [12, 3, 364, 644], 'dtype': 'torch.float32', 'device': 'cpu'}]
```
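For reference, the failure reduces to a two-tensor repro. The sketch below uses the shapes from the sample info above and assumes nothing beyond stock PyTorch:

```python
import torch

# Two videos with the same spatial resolution (H=364, W=644) but different
# temporal lengths (T=76 vs. T=12), matching the sample info above.
video_a = torch.zeros(76, 3, 364, 644)
video_b = torch.zeros(12, 3, 364, 644)

# Grouping by (H, W) alone puts both videos in the same bucket, so the
# subsequent stack fails exactly as in the error log below:
torch.stack([video_a, video_b], dim=0)
# RuntimeError: stack expects each tensor to be equal size, but got
# [76, 3, 364, 644] at entry 0 and [12, 3, 364, 644] at entry 1
```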

Error Log

```
  File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 152, in __call__
    videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_processing_utils.py", line 197, in __call__
    return self.preprocess(videos, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_processing_utils.py", line 278, in preprocess
    return self._preprocess(videos=videos, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/models/qwen2_vl/video_processing_qwen2_vl.py", line 133, in _preprocess
    grouped_videos, grouped_videos_index = group_videos_by_shape(videos)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_utils.py", line 704, in group_videos_by_shape
    grouped_videos = {shape: torch.stack(videos, dim=0) for shape, videos in grouped_videos.items()}
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_utils.py", line 704, in <dictcomp>
    grouped_videos = {shape: torch.stack(videos, dim=0) for shape, videos in grouped_videos.items()}
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [76, 3, 364, 644] at entry 0 and [12, 3, 364, 644] at entry 1
```
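Until group_videos_by_shape also keys on the temporal dimension, a possible stopgap on the caller side is to pad every video in the batch to a common frame count before it reaches the processor. This is only a sketch of one workaround (pad_videos_to_max_frames is a hypothetical helper, not part of transformers or verl), and note that it changes the model inputs by repeating trailing frames:

```python
import torch

def pad_videos_to_max_frames(videos):
    # Hypothetical workaround, not part of transformers or verl: repeat the
    # last frame of each shorter video so every (T, C, H, W) tensor in the
    # batch shares the same temporal length T before preprocessing.
    max_t = max(v.shape[0] for v in videos)
    padded = []
    for v in videos:
        if v.shape[0] < max_t:
            tail = v[-1:].expand(max_t - v.shape[0], *v.shape[1:])
            v = torch.cat([v, tail], dim=0)
        padded.append(v)
    return padded

videos = [torch.zeros(76, 3, 364, 644), torch.zeros(12, 3, 364, 644)]
videos = pad_videos_to_max_frames(videos)
assert all(v.shape == videos[0].shape for v in videos)  # now stackable
```

Alternatively, keying the grouping on the full (T, H, W) shape inside group_videos_by_shape would presumably avoid the stack mismatch without altering the inputs.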
