
Vision Utils Breaks on Video Only Samples #319

@schopra8

Description


If we use UnslothVisionDataCollator on a dataset of conversations whose samples contain only videos (no images) as multimodal inputs, the script fails in two places:

  1. Empty images list. If no images are provided as part of the message, images defaults to an empty list (not None). Downstream processors (like Qwen2.5-VL's image processor) do not know how to handle an empty list for a sample's images and throw an error.

images = []
videos = []
video_kwargs = {'fps': []}
for example in examples:
    messages = self._select_messages_or_raw(example)
    # Check if data format is correct for VLMs!
    if len(messages) != 0:
        messages = self._validate_and_normalize_first_message(messages)
    # Also fix the messages if assistant must only be 1 string!
    # Only affects Mistral V3 I think!
    if self.assistant_single_content:
        messages = self._collapse_assistant_content(messages)
    pass
    message = self.processor.apply_chat_template(
        messages,
        tokenize = False,
        add_generation_prompt = False,
    )
    texts.append(message)
    # Dataset with 2 columns messages / images
    image, video, video_kwarg = self._extract_images_videos_for_example(example, messages)
    image = self._resize_images_inplace(image)
    images.append(image)
    if len(video) > 0: # Works for list, tuple or tensor
        videos.append(video)
        if video_kwarg is None:
            video_kwarg = {"fps": []}
        video_kwargs['fps'].extend(video_kwarg['fps'])
pass
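
One possible workaround for the empty-list case, as a minimal sketch rather than the library's actual fix: before calling the processor, turn "no media in any sample" into an omitted argument instead of a list of empty lists. The helper names none_if_all_empty and build_processor_kwargs are hypothetical and do not exist in the collator; the sketch also assumes the processor accepts images and videos as optional keyword arguments.

```python
from typing import Any, Dict, Optional, Sequence


def none_if_all_empty(per_sample_media: Sequence[Any]) -> Optional[list]:
    """Return None when no sample contributed any media, else the list unchanged.

    Hypothetical helper: `per_sample_media` is one entry per example,
    where each entry is a (possibly empty) list, tuple, or tensor.
    """
    non_empty = [m for m in per_sample_media if m is not None and len(m) > 0]
    return list(per_sample_media) if non_empty else None


def build_processor_kwargs(images: Sequence[Any], videos: Sequence[Any]) -> Dict[str, list]:
    """Build keyword arguments for the processor, omitting absent modalities."""
    kwargs: Dict[str, list] = {}
    images_or_none = none_if_all_empty(images)
    videos_or_none = none_if_all_empty(videos)
    if images_or_none is not None:
        kwargs["images"] = images_or_none
    if videos_or_none is not None:
        kwargs["videos"] = videos_or_none
    return kwargs


# Video-only batch: the "images" key is dropped entirely.
print(build_processor_kwargs([[], []], [["clip_a"], ["clip_b"]]))
```

This only covers batches where every sample lacks images. A mixed batch (some samples with images, some without) still carries empty per-sample lists and may need separate handling depending on the processor.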

  2. _cast_pixel_values_dtype_inplace expects pixel_values. If there are no images, the batch has no pixel_values key, so _cast_pixel_values_dtype_inplace errors out when it indexes batch["pixel_values"].

def _cast_pixel_values_dtype_inplace(self, batch):
    # Pixtral accepts multiple images, so we have to cast it individually
    pixel_values = batch["pixel_values"]
    if type(pixel_values) is list:
        for j, pixel_value_j in enumerate(pixel_values):
            if type(pixel_value_j) is list:
                for k, pixel_value_k in enumerate(pixel_value_j):
                    pixel_value_j[k] = pixel_value_k.to(self.dtype)
            else:
                pixel_values[j] = pixel_value_j.to(self.dtype)
        pass
        batch["pixel_values"] = pixel_values
    else:
        batch["pixel_values"] = batch["pixel_values"].to(self.dtype)
    pass
    return batch
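
A guarded version of the cast could look roughly like the sketch below. It is written as a standalone function, not as the collator method, and it is not the library's implementation. It skips any key that is missing from the batch and, as an assumption, also casts pixel_values_videos, the key that Qwen2-VL style processors are expected to emit for video frames. Unlike the original, it builds new lists instead of mutating nested lists in place.

```python
def _cast_nested(value, dtype):
    """Cast a tensor, or a (nested) list/tuple of tensors, to `dtype`."""
    if isinstance(value, (list, tuple)):
        return [_cast_nested(v, dtype) for v in value]
    # Anything else is assumed to be tensor-like and expose `.to(dtype)`.
    return value.to(dtype)


def cast_pixel_values_dtype(
    batch,
    dtype,
    keys = ("pixel_values", "pixel_values_videos"),  # second key is an assumption
):
    """Cast whichever pixel-value entries are present; ignore missing ones."""
    for key in keys:
        if batch.get(key) is not None:
            batch[key] = _cast_nested(batch[key], dtype)
    return batch
```

With a video-only batch, pixel_values is simply skipped, so the lookup that fails in the original function never happens.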
