Hi, thanks for the great work on this project.
While reproducing RL training, we encountered a runtime error during the `vllm.generate()` stage after a tool call.
Description
When the tool returns videos with different numbers of frames, the preprocessing step inside the Qwen2-VL video processor fails with a `torch.stack` size mismatch.
It appears that `group_videos_by_shape` only groups videos by their spatial shape `(H, W)`, but then stacks tensors that may have different temporal lengths `T`.
This happens when multiple videos with the same resolution but different frame counts are passed in the same batch.
Sample info
```
videos_count=2 videos=[{'i': 0, 'shape': [76, 3, 364, 644], 'dtype': 'torch.float32', 'device': 'cpu'}, {'i': 1, 'shape': [12, 3, 364, 644], 'dtype': 'torch.float32', 'device': 'cpu'}]
```
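For reference, a minimal standalone reproduction of the failing stack, using the shapes from the sample info above (random tensors stand in for the real video frames):

```python
import torch

# Two videos with identical spatial shape (H, W) = (364, 644) but
# different temporal lengths T, matching the sample info above.
video_a = torch.rand(76, 3, 364, 644)  # T=76 frames
video_b = torch.rand(12, 3, 364, 644)  # T=12 frames

# Grouping by (H, W) alone places both videos in the same bucket,
# so the subsequent stack hits the mismatched first dimension:
torch.stack([video_a, video_b], dim=0)
# RuntimeError: stack expects each tensor to be equal size, but got
# [76, 3, 364, 644] at entry 0 and [12, 3, 364, 644] at entry 1
```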
Error Log
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 152, in __call__
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_processing_utils.py", line 197, in __call__
return self.preprocess(videos, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_processing_utils.py", line 278, in preprocess
return self._preprocess(videos=videos, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/models/qwen2_vl/video_processing_qwen2_vl.py", line 133, in _preprocess
grouped_videos, grouped_videos_index = group_videos_by_shape(videos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_utils.py", line 704, in group_videos_by_shape
grouped_videos = {shape: torch.stack(videos, dim=0) for shape, videos in grouped_videos.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_utils.py", line 704, in <dictcomp>
grouped_videos = {shape: torch.stack(videos, dim=0) for shape, videos in grouped_videos.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [76, 3, 364, 644] at entry 0 and [12, 3, 364, 644] at entry 1
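One possible direction for a fix (a minimal sketch only; `group_videos_by_full_shape` and its return format are hypothetical and not the transformers API): include the temporal dimension in the grouping key, so that each bucket only contains tensors that `torch.stack` can actually combine.

```python
import torch
from collections import defaultdict

def group_videos_by_full_shape(videos):
    """Group videos by their full (T, C, H, W) shape rather than by
    spatial shape alone, so torch.stack only ever combines tensors of
    identical size. Hypothetical helper for illustration only."""
    grouped = defaultdict(list)
    grouped_index = {}
    for i, video in enumerate(videos):
        key = tuple(video.shape)  # includes the temporal dimension T
        grouped_index[i] = (key, len(grouped[key]))  # where video i lands
        grouped[key].append(video)
    stacked = {key: torch.stack(vids, dim=0) for key, vids in grouped.items()}
    return stacked, grouped_index

# With the two sample videos above, this yields two separate groups
# instead of one group that fails to stack:
videos = [torch.rand(76, 3, 364, 644), torch.rand(12, 3, 364, 644)]
grouped, index = group_videos_by_full_shape(videos)
print({k: v.shape for k, v in grouped.items()})
# {(76, 3, 364, 644): torch.Size([1, 76, 3, 364, 644]),
#  (12, 3, 364, 644): torch.Size([1, 12, 3, 364, 644])}
```

Downstream code would of course still need to process each `(T, H, W)` group separately (or pad to a common length), but this avoids the crash inside `group_videos_by_shape`.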