
Conversation

@Isotr0py (Member) commented on Dec 15, 2025

Purpose

Refactor the multimodal encoder MultiHeadAttention into a dedicated MMEncoderAttention layer (see the review summary below for details).

Test Plan

pytest -s -v tests/kernels/attention/test_attention.py
pytest -s -v tests/kernels/attention/test_mha_attn.py

Test Result

Tests should pass.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the llama (Related to Llama models), v1, and tpu (Related to Google TPUs) labels on Dec 15, 2025
Signed-off-by: Isotr0py <[email protected]>

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the MultiHeadAttention class used for multimodal encoders into a new MMEncoderAttention class, moving its definition to vllm/attention/layers/mm_encoder_attention.py and removing it from vllm/attention/layer.py. All usages and imports of MultiHeadAttention across the affected model implementations (e.g., AIMV2, BLIP, CLIP, GLM4V, Idefics2, InternViT, MLlama4, MoLMo, SigLIP, Step3-VL, Whisper) and their test files have been updated to use MMEncoderAttention. The MMEncoderAttention class now integrates the Flash Attention backend selection logic directly and drops the redundant reshape_qkv_to_3d method.

However, a review comment points out a critical issue in torch_sdpa_wrapper in vllm/attention/ops/vit_attn_wrappers.py: torch.split is applied on the sequence-length dimension (dim=1), which assumes packed tensors, while the inputs are batched. This causes a dimension mismatch and will lead to errors; the reviewer suggests splitting along the batch dimension (dim=0) or using an alternative approach for handling batched inputs with SDPA in variable-length attention.
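For orientation, here is a minimal sketch of what the model-side update described above could look like. The import path follows the new module named in the summary, while VisionAttentionBlock and the MMEncoderAttention constructor arguments are illustrative assumptions, not the exact code from this PR.

```python
import torch
import torch.nn as nn

# Previously: from vllm.attention.layer import MultiHeadAttention
from vllm.attention.layers.mm_encoder_attention import MMEncoderAttention


class VisionAttentionBlock(nn.Module):
    """Hypothetical vision-encoder block illustrating the rename."""

    def __init__(self, num_heads: int, head_dim: int) -> None:
        super().__init__()
        # Drop-in style replacement: only the class name and import path change;
        # the argument list here is an assumption, not the PR's exact signature.
        self.attn = MMEncoderAttention(num_heads, head_dim, scale=head_dim**-0.5)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        return self.attn(q, k, v)
```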

Comment on lines 103 to 112
q_chunks = torch.split(q, lens, dim=1)
k_chunks = torch.split(k, lens, dim=1)
v_chunks = torch.split(v, lens, dim=1)

batch_size, q_len, _, _ = q.shape
if cu_seqlens is None:
    cu_seqlens = torch.arange(
        0, (batch_size + 1) * q_len, step=q_len, dtype=torch.int32, device=q.device
    )
for q_i, k_i, v_i in zip(q_chunks, k_chunks, v_chunks):

critical

The logic in torch_sdpa_wrapper for handling batched inputs appears to be incorrect. The torch.split on dim=1 (lines 104-106) assumes that the input tensors q, k, v are packed (i.e., shape (1, total_tokens, ...)), but they are passed as batched tensors of shape (batch_size, seq_len, ...). This will cause torch.split to fail because sum(lens) will not match q.shape[1].

Additionally, the new block at lines 109-112 for when cu_seqlens is None is also flawed. It computes lens from a cu_seqlens that is constructed for a uniform batch, so torch.split will still fail for the same reason.

To fix this, you should probably split along the batch dimension (dim=0) or use a different approach to handle batched inputs with SDPA for varlen attention.
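
For concreteness, here is a minimal, self-contained sketch of the suggested direction (splitting along the batch dimension instead of dim=1) using plain PyTorch SDPA. This is not the actual torch_sdpa_wrapper implementation; the helper name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F


def sdpa_over_batch(
    q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, scale: float | None = None
) -> torch.Tensor:
    """Hypothetical helper: run SDPA per sample by splitting on the batch dim.

    q, k, v are batched tensors of shape (batch_size, seq_len, num_heads, head_dim).
    """
    outputs = []
    for q_i, k_i, v_i in zip(
        torch.split(q, 1, dim=0),  # split along the batch dimension (dim=0)
        torch.split(k, 1, dim=0),
        torch.split(v, 1, dim=0),
    ):
        # F.scaled_dot_product_attention expects (batch, heads, seq, head_dim).
        out_i = F.scaled_dot_product_attention(
            q_i.transpose(1, 2),
            k_i.transpose(1, 2),
            v_i.transpose(1, 2),
            scale=scale,
        )
        outputs.append(out_i.transpose(1, 2))
    return torch.cat(outputs, dim=0)
```

For a uniform-length batch like this, a single SDPA call over the whole batch would give the same result; the per-sample loop only becomes necessary once sequence lengths differ (e.g., when driven by cu_seqlens).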

Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: Isotr0py <[email protected]>