Open
Description
Using UnslothVisionDataCollator on a dataset whose conversations contain no images, only videos, as multimodal inputs fails in two places:
- Empty images list: If no images are provided as part of the message, `images` defaults to an empty list (not `None`). Downstream processors (like Qwen-VL-2.5's ImageProcessor) do not know how to handle an empty list for a sample's images and throw an error.
unsloth-zoo/unsloth_zoo/vision_utils.py
Lines 775 to 807 in ea85a26

    images = []
    videos = []
    video_kwargs = {'fps': []}
    for example in examples:
        messages = self._select_messages_or_raw(example)
        # Check if data format is correct for VLMs!
        if len(messages) != 0:
            messages = self._validate_and_normalize_first_message(messages)
        # Also fix the messages if assistant must only be 1 string!
        # Only affects Mistral V3 I think!
        if self.assistant_single_content:
            messages = self._collapse_assistant_content(messages)
        pass
        message = self.processor.apply_chat_template(
            messages,
            tokenize = False,
            add_generation_prompt = False,
        )
        texts.append(message)
        # Dataset with 2 columns messages / images
        image, video, video_kwarg = self._extract_images_videos_for_example(example, messages)
        image = self._resize_images_inplace(image)
        images.append(image)
        if len(video) > 0: # Works for list, tuple or tensor
            videos.append(video)
            if video_kwarg is None:
                video_kwarg = {"fps": []}
            video_kwargs['fps'].extend(video_kwarg['fps'])
    pass

- `_cast_pixel_values_dtype_inplace` expects `pixel_values`: If there are no images, there is no `pixel_values` value and the `_cast_pixel_values_dtype_inplace` function errors out.
unsloth-zoo/unsloth_zoo/vision_utils.py
Lines 950 to 965 in ea85a26

    def _cast_pixel_values_dtype_inplace(self, batch):
        # Pixtral accepts multiple images, so we have to cast it individually
        pixel_values = batch["pixel_values"]
        if type(pixel_values) is list:
            for j, pixel_value_j in enumerate(pixel_values):
                if type(pixel_value_j) is list:
                    for k, pixel_value_k in enumerate(pixel_value_j):
                        pixel_value_j[k] = pixel_value_k.to(self.dtype)
                else:
                    pixel_values[j] = pixel_value_j.to(self.dtype)
            pass
            batch["pixel_values"] = pixel_values
        else:
            batch["pixel_values"] = batch["pixel_values"].to(self.dtype)
        pass
        return batch

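
One possible way to guard both failure points is sketched below. This is not the library's fix. `images_or_none` and `cast_pixel_values_if_present` are hypothetical names introduced here for illustration, and the sketch assumes the processor accepts `images=None` for a batch without images, which would need to be checked against the processor in use. Batches that mix samples with and without images are passed through unchanged and are not handled.

```python
from typing import Any, Dict, List, Optional


def images_or_none(images: List[Any]) -> Optional[List[Any]]:
    """Return None when no sample in the batch carries an image.

    Hypothetical helper: assumes the processor treats images=None as
    "no image inputs" instead of failing on a list of empty lists.
    """
    if all(sample is None or len(sample) == 0 for sample in images):
        return None
    return images


def cast_pixel_values_if_present(batch: Dict[str, Any], dtype: Any) -> Dict[str, Any]:
    """Cast batch["pixel_values"] to dtype, skipping batches without it.

    Hypothetical standalone variant of _cast_pixel_values_dtype_inplace.
    Entries only need a .to(dtype) method, as torch tensors have.
    """
    pixel_values = batch.get("pixel_values")
    if pixel_values is None:
        # Video-only or text-only batch: nothing to cast.
        return batch
    if isinstance(pixel_values, list):
        for j, value_j in enumerate(pixel_values):
            if isinstance(value_j, list):
                # Nested list: cast each image tensor individually.
                for k, value_k in enumerate(value_j):
                    value_j[k] = value_k.to(dtype)
            else:
                pixel_values[j] = value_j.to(dtype)
    else:
        batch["pixel_values"] = pixel_values.to(dtype)
    return batch
```

In the collator, the first helper would be applied to `images` just before the processor call, and the second would replace the unconditional `batch["pixel_values"]` lookup.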