[Model] Add video input support for transformers modeling backend #30680
base: main
Conversation
[Model] Add video input support for transformers modeling backend

Key changes:
- Extended multimodal classes (`MultiModalProcessingInfo`, `MultiModalProcessor`, `MultiModalMixin`) to handle video-specific logic, including token calculation, dummy data generation, and embedding.
- Corrected the frame size extraction for video frames in `vllm/multimodal/parse.py`.
- Updated documentation to reflect video support.
- Fixed a potential OOM issue in the dummy batch generator for multimodal models.

Signed-off-by: chenkui.shen <[email protected]>
Documentation preview: https://vllm--30680.org.readthedocs.build/en/30680/
Code Review
This pull request introduces video input support for the transformers modeling backend. The changes are comprehensive, touching multimodal processing classes, documentation, and fixing a bug in video frame size extraction. A key contribution is a fix for an OOM issue during dummy batch generation for profiling. While the fix prevents crashes, I've identified a critical issue with the approach and suggested a more robust solution to ensure accurate memory profiling and prevent potential OOMs in production.
```diff
         dummy_modality = mm_budget.get_modality_with_max_tokens()
-        return self._get_mm_dummy_batch(dummy_modality, num_seqs)
+
+        # TBD:
+        # The mm_dummy_batch below is only retrieved when
+        # supports_multimodal_raw_input_only is True.
+        # Currently, only the transform modeling backend and terratorch have
+        # supports_multimodal_raw_input_only as True.
+        # When testing the transform modeling backend, it was found that
+        # if num_seqs (usually the default 256) is passed in here,
+        # an OOM error occurs.
+        # It needs to be confirmed what value should be passed in here,
+        # for now it is fixed to 1.
+        return self._get_mm_dummy_batch(dummy_modality, 1)
```
Hardcoding the dummy batch size to 1 for profiling is a pragmatic fix to avoid OOM errors, but it can lead to inaccurate memory profiling. This might cause the scheduler to underestimate memory usage, potentially leading to OOM errors in production with larger batches.
A more robust approach would be to calculate a reasonable batch size based on the model's configuration. This provides a more realistic batch size for profiling, reducing the risk of production OOMs while still preventing OOMs during profiling.
```python
        dummy_modality = mm_budget.get_modality_with_max_tokens()

        max_tokens_per_item = mm_budget.max_tokens_by_modality.get(dummy_modality)
        if max_tokens_per_item and max_tokens_per_item > 0:
            # Heuristic to derive a reasonable batch size for profiling.
            # Using max_num_seqs can cause OOM for vision models.
            # Hardcoding to 1 can lead to inaccurate profiling.
            num_items = self.scheduler_config.max_num_batched_tokens // max_tokens_per_item
            # Also respect the per-prompt limit and max sequences.
            max_items_for_modality = mm_budget.max_items_per_batch_by_modality[dummy_modality]
            num_items = min(num_items, max_items_for_modality)
            # Ensure at least 1 item.
            num_items = max(num_items, 1)
        else:
            # Fallback for safety, though this path should ideally not be taken.
            num_items = 1

        return self._get_mm_dummy_batch(dummy_modality, num_items)
```
hmellor left a comment
Awesome PR! Thank you for using and contributing to the Transformers modelling backend!
I've left a few comments and looped in @zucchini-nlp. I want to make sure that the standards we're defining here for video model interfaces align with how Transformers would like to standardise all the video models in the Transformers library.
(Commenting on the same `# TBD` comment and `return self._get_mm_dummy_batch(dummy_modality, 1)` hunk shown above.)
I was recently working in this area and noticed this too. It causes 256 x 100MP images (the max image size defined in vllm/model_executor/models/transformers/multimodal.py) to be materialised on the GPU.
@ywang96 what do you think should be the correct behaviour here?
I think here the profiling is actually correct for models that have is_multimodal_raw_input_only_model, since the maximum number of videos possible during inference is indeed based on max_num_seqs (and can actually be more than that if the model accepts more than 1 video per prompt).
So I prefer patching this on the model side. If removing supports_multimodal_raw_input_only=True is not feasible, then maybe you can override the mm_counts in get_dummy_mm_data to use the same value as for regular MM models.
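For illustration, a minimal sketch of what that clamping could look like, assuming a hypothetical `allowed_limits` mapping of per-prompt limits (this is not the actual vLLM API):

```python
# Illustrative only: clamp the dummy multimodal item counts used for memory
# profiling so a raw-input-only model does not materialise `max_num_seqs`
# full-size videos at once. `allowed_limits` is a hypothetical stand-in for
# the per-prompt limits a regular multimodal model would use.
def get_dummy_mm_counts(
    requested_counts: dict[str, int],
    allowed_limits: dict[str, int],
) -> dict[str, int]:
    """Clamp each modality's dummy count to its per-prompt limit (min 1)."""
    return {
        modality: max(1, min(count, allowed_limits.get(modality, count)))
        for modality, count in requested_counts.items()
    }


# Example: profiling asked for 256 videos, but the per-prompt limit is 1.
print(get_dummy_mm_counts({"video": 256}, {"video": 1}))  # -> {'video': 1}
```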
I forget the exact reason we chose to use supports_multimodal_raw_input_only=True. IIRC it was because it reduced the amount of necessary monkey patching: the processor already exists on the Transformers side, so it's needlessly complicated to monkey patch the processing on the vLLM side.
@DarkLight1337 do you know more about the key differences between supports_multimodal_raw_input_only being True/False?
> since the maximum number of videos possible during inference is indeed based on `max_num_seqs`

Why is this not also the case when `is_multimodal_raw_input_only_model=False`?
```diff
             return ImageSize(*image.size)
         if isinstance(image, (np.ndarray, torch.Tensor)):
-            _, h, w = image.shape
+            w, h, _ = image.shape
```
I doubt that this was wrong for all models already in vLLM, why does this need changing here?
```python
        processor = self.info.get_hf_processor()
        if "gemma3" in processor.__class__.__name__.lower():
            image_token = processor.boi_token
            video_token = ""
```
| video_token = "" |
```diff
         else:
             image_token = getattr(processor, "image_token", "")
-        return image_token * num_images
+            video_token = getattr(processor, "video_token", "")
```
Given that Gemma3 will have no video_token this should be fine, right?
```suggestion
        video_token = getattr(processor, "video_token", "")
```
| "video": self._get_dummy_videos( | ||
| width=target_width, | ||
| height=target_height, | ||
| num_frames=target_num_frames, | ||
| num_videos=num_videos, | ||
| ), |
We should also provide video overrides, right?
| kwargs.pop("token_type_ids", None) # used only in `forward` | ||
|
|
||
| if pixel_values is not None: | ||
| num_image_patches = kwargs.pop("num_image_patches") |
I think this was outside the if block so that it was always popped from kwargs regardless of whether we used it. The same should probably be done for num_video_patches.
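A minimal sketch of the suggested pattern, with a hypothetical helper name and plain dicts standing in for the real kwargs:

```python
# Sketch of the suggested pattern: always pop the bookkeeping keys, then only
# use them when the corresponding pixel inputs are present. The helper name
# is hypothetical; the real code does this inline.
def _strip_patch_counts(kwargs: dict):
    num_image_patches = kwargs.pop("num_image_patches", None)
    num_video_patches = kwargs.pop("num_video_patches", None)
    return num_image_patches, num_video_patches


kwargs = {"pixel_values_videos": object(), "num_video_patches": [4, 4]}
num_image_patches, num_video_patches = _strip_patch_counts(kwargs)
# Neither key leaks into the HF forward call, regardless of modality.
assert "num_image_patches" not in kwargs and "num_video_patches" not in kwargs
```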
```python
            if isinstance(vision_embeddings, tuple):
                # For qwen3 vl, The deepstack visual features are also returned
                vision_embeddings = vision_embeddings[0]
```
This might not be intentional. @zucchini-nlp should the return value of Qwen3 VL have been changed to this?
We never supported qwen3-vl in vLLM because of this 😿 The model has to propagate deepstack visual features down to the LM's forward, which we could theoretically overcome.
We are currently standardizing get_image_features on the Transformers side, and I think the best would be to always ask for a dict output here (i.e. we get all the outputs from the vision encoder) and pass it over to the LM, letting it handle encoder outputs however it pleases.
Do you mean in vLLM with the Transformers modelling backend? vLLM does support this model natively.
Yeah a standardisation on the Transformers side would be great.
Yes, with the backend the model was not supported when released
```python
            multimodal_embeddings += tuple(vision_embeddings)

        if video_embeds is not None:
            multimodal_embeddings += tuple(video_embeds)

        if pixel_values_videos is not None:
            num_video_patches = kwargs.pop("num_video_patches")
            vision_embeddings = self.model.get_video_features(
                pixel_values_videos, **kwargs
            )

            if isinstance(vision_embeddings, tuple):
                # For qwen3 vl, The deepstack visual features are also returned
                vision_embeddings = vision_embeddings[0]
            if isinstance(vision_embeddings, torch.Tensor):
                if vision_embeddings.ndim == 2:
                    vision_embeddings = vision_embeddings.unsqueeze(0)

                # Embeddings have to be 2D tensors of length `num_images`
                # but transformers returns concat tensors if each patch
                # is of different size. We split it back to make vLLM happy
                vision_embeddings = torch.split(
                    vision_embeddings, num_video_patches.flatten().tolist()
                )
                vision_embeddings = [
                    embed.flatten(start_dim=0, end_dim=-2)
                    for embed in vision_embeddings
                ]
            multimodal_embeddings += tuple(vision_embeddings)
```
Similar comment to above, this is very similar to the image code. Could they be deduplicated?
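One possible shape for that deduplication, sketched under the assumption that the image and video paths differ only in which patch-count tensor they split by (the helper name is made up):

```python
# Made-up helper showing one way the image and video branches could share the
# tuple unwrapping, 2D promotion and per-item splitting.
import torch


def _split_vision_features(features, num_patches: torch.Tensor):
    if isinstance(features, tuple):
        # e.g. deepstack-style extras returned alongside the main features
        features = features[0]
    if features.ndim == 2:
        features = features.unsqueeze(0)
    chunks = torch.split(features, num_patches.flatten().tolist())
    # Flatten each item back to a 2D (tokens, hidden) tensor for vLLM.
    return tuple(chunk.flatten(start_dim=0, end_dim=-2) for chunk in chunks)


# Usage with made-up shapes: two videos with 3 and 5 patches, 16 tokens per
# patch, hidden size 32.
feats = torch.randn(8, 16, 32)
print([t.shape for t in _split_vision_features(feats, torch.tensor([3, 5]))])
# -> [torch.Size([48, 32]), torch.Size([80, 32])]
```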
```python
        image_grid_thw = kwargs.get("image_grid_thw", [])
        video_grid_thw = kwargs.get("video_grid_thw", [])
```
Now that we're not trying to create empty tensors with these:

```suggestion
        image_grid_thw = kwargs.get("image_grid_thw", None)
        video_grid_thw = kwargs.get("video_grid_thw", None)
```
```python
        image_grid_thw = torch.stack(image_grid_thw) if image_grid_thw else None
        video_grid_thw = torch.stack(video_grid_thw) if video_grid_thw else None
```
@zucchini-nlp was it important that vLLM passed empty grids rather than None to get_rope_index?
should be None if the modality is not present
To clarify, it should be None if:
- the modality is not present in the individual request? (i.e. an image model but no image has been passed)
- the modality is not present in the model? (i.e. a model which doesn't support video)
```diff
         if split_sizes:
             chunked_mm_positions = torch.split(mm_positions, split_sizes)
-            mm_tokens = torch.tensor(prompt_ids)[mm_token_type_ids[0].bool()]
+            mm_tokens = torch.tensor(prompt_ids)[mm_token_type_ids[0] == 1]
```
Avoid computing mm_token_type_ids == 1 twice
Also is there somewhere in Transformers where the token type ID is defined by modality? I don't like this magic number
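A rough sketch of both points, computing the mask once and naming the value; whether 1 is a guaranteed multimodal token-type ID across Transformers processors is exactly the open question here, so the constant is an assumption:

```python
# Sketch only: build the multimodal-token mask once and reuse it, and give the
# magic value a name. Whether 1 is a documented Transformers convention is the
# open question above, so MM_TOKEN_TYPE_ID is an assumption.
import torch

MM_TOKEN_TYPE_ID = 1  # assumed, not a documented Transformers constant

prompt_ids = [101, 7, 7, 102]
mm_token_type_ids = torch.tensor([[0, 1, 1, 0]])

mm_mask = mm_token_type_ids[0] == MM_TOKEN_TYPE_ID   # computed once
mm_positions = mm_mask.nonzero(as_tuple=True)[0]     # reused here...
mm_tokens = torch.tensor(prompt_ids)[mm_mask]        # ...and here
print(mm_positions.tolist(), mm_tokens.tolist())     # [1, 2] [7, 7]
```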
zucchini-nlp left a comment
Left a few comments from my experience trying to make videos compatible with the backend. IMO this PR has to wait until the necessary changes are made on the Transformers side for video LLMs. I am quite sure we can't support all of them, but we could try to support 70%.
There aren't many models with explicit video support, around 10 in total I think.
| "image": self.get_max_image_tokens(), | ||
| "video": self.get_max_video_tokens(seq_len), |
I remember that vLLM chooses only one modality when profiling, whichever has more tokens. So I think we will need to safe-get mm_tokens["num_video_tokens"] and otherwise set it to 0, because not all models support videos.
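A hedged sketch of that safe-get; the key names follow the diff, but the overall dict layout is an assumption:

```python
# Sketch of the safe-get: default the video entry to 0 so models without video
# support don't raise a KeyError during profiling. The key names mirror this
# diff; the dict layout is otherwise assumed.
def max_tokens_by_modality(mm_tokens: dict) -> dict:
    return {
        "image": mm_tokens.get("num_image_tokens", 0),
        "video": mm_tokens.get("num_video_tokens", 0),
    }


print(max_tokens_by_modality({"num_image_tokens": 256}))
# -> {'image': 256, 'video': 0}  (video simply contributes nothing to profiling)
```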
Qwen2-VL would be a good example of this with get_num_frames_with_most_features. But it gets its mm_counts from the limits in get_supported_mm_limits unless the user specified otherwise.
Ideally there would be a way to set get_supported_mm_limits()["video"] to zero when the model doesn't support video.
Yeah, that would be great. I can even say how we infer whether a model supports videos from the model class 😄 It has a class attribute, model.input_modalities.
Oh ok, are these class attributes or instance attributes?
Class attributes, available from v5 on.
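For illustration only, a sketch of gating the advertised video limit on that attribute; the surrounding function is a stand-in for wherever vLLM builds get_supported_mm_limits, and the default tuple for older Transformers versions is an assumption:

```python
# Illustrative gate on the class attribute mentioned above: only advertise
# video support when the model class declares it. The function is a stand-in
# for wherever vLLM builds its supported-modality limits; the ("image",)
# default for pre-v5 Transformers is an assumption.
def get_supported_mm_limits(model_cls) -> dict:
    input_modalities = getattr(model_cls, "input_modalities", ("image",))
    limits = {"image": None}  # None == no explicit per-prompt limit
    if "video" in input_modalities:
        limits["video"] = None
    return limits


class FakeVideoModel:  # hypothetical model class for the example
    input_modalities = ("image", "video")


print(get_supported_mm_limits(FakeVideoModel))  # {'image': None, 'video': None}
```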
(Commenting on the same `video_token = getattr(processor, "video_token", "")` hunk shown above.)
Yeah, this is the part which might fail for some models. There are models that treat videos as a sequence of images and thus don't have a specific video token 🥲 It drove me crazy trying to make them work.
```python
        video_grid_thw = hf_inputs.get("video_grid_thw", torch.empty((0, 3)))
        video_grid_sizes = video_grid_thw.prod(-1)
```
Not super generalizable: only qwen-like models have a THW tensor. Ideally we need to use num_video_patches, as in the image modality.
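A rough sketch of that fallback; the key names come from this diff, but treating num_video_patches as per-video patch counts available alongside the processed inputs is an assumption:

```python
# Rough sketch of the fallback: only trust `video_grid_thw` when it exists and
# otherwise fall back to per-video patch counts. Treating `num_video_patches`
# as available next to the processed inputs is an assumption.
import torch


def get_video_grid_sizes(hf_inputs: dict) -> torch.Tensor:
    if "video_grid_thw" in hf_inputs:  # Qwen-like models
        return hf_inputs["video_grid_thw"].prod(-1)
    # Generic path: one patch count per video, reported by the processor.
    return torch.as_tensor(hf_inputs.get("num_video_patches", []))


print(get_video_grid_sizes({"video_grid_thw": torch.tensor([[4, 12, 12]])}))  # tensor([576])
print(get_video_grid_sizes({"num_video_patches": [576]}))                     # tensor([576])
```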
```diff
         mm_tokens_per_modality = hf_processor._get_num_multimodal_tokens(
-            image_sizes=image_sizes, **mm_processor_kwargs
+            image_sizes=image_sizes, video_sizes=video_sizes, **mm_processor_kwargs
```
I don't think all processors expect video_sizes, though it will be swallowed by kwargs and shouldn't raise issues.
```python
            ranges = [
                PlaceholderRange(
                    offset=positions[0].item(),
                    length=positions.shape[0],
                    is_embed=(mm_tokens == hf_processor.video_token_id).bool(),
                )
```
The hard part here is models which add timestamps between each video frame; the timestamps are encoded differently and have no special tokens, so inferring the ranges is not as easy as with images.
```python
            mm_placeholders["video"] = ranges

        processed_data["num_video_patches"] = torch.tensor(
            mm_tokens_per_modality["num_video_patches"]
```
mm_tokens_per_modality isn't guaranteed to contain num_video_patches if the model doesn't support the video modality. Same for the num_video_tokens key.
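A small sketch of the guard being asked for (key names follow the diff; the rest of mm_tokens_per_modality's shape is assumed):

```python
# Sketch of the guard: only attach video bookkeeping when the processor
# actually reported it (key names follow the diff; the rest is assumed).
import torch

processed_data = {}
mm_tokens_per_modality = {"num_image_patches": [1, 1]}  # an image-only model

if "num_video_patches" in mm_tokens_per_modality:
    processed_data["num_video_patches"] = torch.tensor(
        mm_tokens_per_modality["num_video_patches"]
    )

print(processed_data)  # {} -- no spurious video entries for image-only models
```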
```python
            vision_embeddings = self.model.get_image_features(pixel_values, **kwargs)

            if isinstance(vision_embeddings, tuple):
                # For qwen3 vl, The deepstack visual features are also returned
```
Right, though qwen3-vl will give bad performance without deepstack visual features. Currently vLLM doesn't support models like Qwen3-VL and Ovis2 on purpose.
```python
                # Embeddings have to be 2D tensors of length `num_images`
                # but transformers returns concat tensors if each patch
                # is of different size. We split it back to make vLLM happy
                vision_embeddings = torch.split(
                    vision_embeddings, num_video_patches.flatten().tolist()
                )
                vision_embeddings = [
```
I am not 100% sure, though for the video modality this might not be needed. Videos mostly don't have several patches per frame.
I am a developer from Cybercore. We are developing a multimodal model named Leum and plan to deploy it using the vLLM transformers modeling backend. We noticed that the current implementation does not support video input, which is a necessary feature for our model. This pull request introduces the required changes to enable video input processing.
Key changes:
- Extended multimodal classes (`MultiModalProcessingInfo`, `MultiModalProcessor`, `MultiModalMixin`) to handle video-specific logic, including token calculation, dummy data generation, and embedding.
- Corrected the frame size extraction for video frames in `vllm/multimodal/parse.py`.
- Updated documentation to reflect video support.
- Fixed a potential OOM issue in the dummy batch generator for multimodal models.

Thank you for considering our contribution!