Hi, thanks for the great work on this project.
While reproducing RL training, we encountered a runtime error during the `vllm.generate()` stage after a tool call.
Description
When the tool returns videos with different numbers of frames, the preprocessing step inside the Qwen2-VL video processor fails with a `torch.stack` size mismatch.
It appears that `group_videos_by_shape` only groups videos by their spatial shape `(H, W)`, but then stacks tensors that may have different temporal lengths `T`.
This happens when multiple videos with the same resolution but different frame counts are passed in the same batch.
Sample info
```
videos_count=2 videos=[{'i': 0, 'shape': [76, 3, 364, 644], 'dtype': 'torch.float32', 'device': 'cpu'}, {'i': 1, 'shape': [12, 3, 364, 644], 'dtype': 'torch.float32', 'device': 'cpu'}]
```
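For reference, a minimal standalone reproduction of the failing stack, using the shapes from the sample info above (random tensors stand in for the real video frames):

```python
import torch

# Two videos with identical spatial shape (H, W) = (364, 644) but
# different temporal lengths T, matching the sample info above.
video_a = torch.rand(76, 3, 364, 644)  # T=76 frames
video_b = torch.rand(12, 3, 364, 644)  # T=12 frames

# Grouping by (H, W) alone places both videos in the same bucket,
# so the subsequent stack hits the mismatched first dimension:
torch.stack([video_a, video_b], dim=0)
# RuntimeError: stack expects each tensor to be equal size, but got
# [76, 3, 364, 644] at entry 0 and [12, 3, 364, 644] at entry 1
```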
Error Log
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 152, in __call__
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_processing_utils.py", line 197, in __call__
return self.preprocess(videos, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_processing_utils.py", line 278, in preprocess
return self._preprocess(videos=videos, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/models/qwen2_vl/video_processing_qwen2_vl.py", line 133, in _preprocess
grouped_videos, grouped_videos_index = group_videos_by_shape(videos)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_utils.py", line 704, in group_videos_by_shape
grouped_videos = {shape: torch.stack(videos, dim=0) for shape, videos in grouped_videos.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/anaconda/anaconda3/envs/verl/lib/python3.11/site-packages/transformers/video_utils.py", line 704, in <dictcomp>
grouped_videos = {shape: torch.stack(videos, dim=0) for shape, videos in grouped_videos.items()}
^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [76, 3, 364, 644] at entry 0 and [12, 3, 364, 644] at entry 1
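One possible direction for a fix (a minimal sketch only; `group_videos_by_full_shape` and its return format are hypothetical and not the transformers API): include the temporal dimension in the grouping key, so that each bucket only contains tensors that `torch.stack` can actually combine.

```python
import torch
from collections import defaultdict

def group_videos_by_full_shape(videos):
    """Group videos by their full (T, C, H, W) shape rather than by
    spatial shape alone, so torch.stack only ever combines tensors of
    identical size. Hypothetical helper for illustration only."""
    grouped = defaultdict(list)
    grouped_index = {}
    for i, video in enumerate(videos):
        key = tuple(video.shape)  # includes the temporal dimension T
        grouped_index[i] = (key, len(grouped[key]))  # where video i lands
        grouped[key].append(video)
    stacked = {key: torch.stack(vids, dim=0) for key, vids in grouped.items()}
    return stacked, grouped_index

# With the two sample videos above, this yields two separate groups
# instead of one group that fails to stack:
videos = [torch.rand(76, 3, 364, 644), torch.rand(12, 3, 364, 644)]
grouped, index = group_videos_by_full_shape(videos)
print({k: v.shape for k, v in grouped.items()})
# {(76, 3, 364, 644): torch.Size([1, 76, 3, 364, 644]),
#  (12, 3, 364, 644): torch.Size([1, 12, 3, 364, 644])}
```

Downstream code would of course still need to process each `(T, H, W)` group separately (or pad to a common length), but this avoids the crash inside `group_videos_by_shape`.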