
Add support for Qwen3-VL model via Torchax path#1974

Open
muskansh-google wants to merge 23 commits into vllm-project:main from muskansh-google:pr-1952

Conversation

@muskansh-google

Description

This PR enables Qwen3-VL inference on uLLM using the Torchax framework. By default, vLLM treats a model's vision encoder as part of the input-preparation phase rather than the core compiled model. That limitation makes JIT-compiling the vision encoder extremely difficult, so for the time being this PR executes the vision tower eagerly on TPUs.

Why this change is being made

Currently, uLLM does not support the Qwen3-VL model. Using Torchax as the backend lets us leverage the upstream vLLM model implementation. The PyTorch-based Qwen3-VL uses dynamic operations and in-place state mutations (specifically for Deepstack features) that conflict with JAX's requirement for statically traceable graphs. This PR provides monkey-patches and utility wrappers to bridge the gap between vLLM's PyTorch implementation and JAX execution.

Solved Problems & Relevance

1. tpu_inference/models/vllm/vllm_model_wrapper.py

  • ViT Attention Optimization: Registered a custom function for torch_sdpa through Torchax to utilize sharded_flash_attention, significantly improving vision encoder performance on TPUs.
  • Deepstack Stateless Patching: Overrode _set_deepstack_input_embeds and _get_deepstack_input_embeds to use a stateless dictionary cache (_deepstack_tensors) instead of in-place model mutations.
  • JIT Side-Channel (State Passing): Packs intermediate Deepstack embeddings into inputs_embeds to pass them safely through JIT boundaries.
  • Dynamic Argument Wrapping: Added wrap_embed_multimodal_func and wrap_embed_input_ids_func to handle dynamic kwargs and shape conversions (mm_embeds had to be passed as a list of tensors for Qwen3-VL).
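The Deepstack stateless patching described above can be sketched as follows. Names and the layer-index keying are illustrative, mirroring the PR description rather than the exact implementation in `vllm_model_wrapper.py`:

```python
class DeepstackCacheMixin:
    """Illustrative sketch: intermediate Deepstack embeddings are kept in a
    plain dict (_deepstack_tensors) that can be cleared between forward
    passes, instead of mutating model buffers in place (which JAX tracing
    cannot capture)."""

    def __init__(self):
        self._deepstack_tensors = {}

    def _set_deepstack_input_embeds(self, layer_idx, embeds):
        # Store by layer index; no in-place mutation of model state.
        self._deepstack_tensors[layer_idx] = embeds

    def _get_deepstack_input_embeds(self, layer_idx):
        return self._deepstack_tensors.get(layer_idx)

    def _reset_deepstack(self):
        # Drop everything so a new forward pass starts clean.
        self._deepstack_tensors = {}


model = DeepstackCacheMixin()
model._set_deepstack_input_embeds(0, [1.0, 2.0])
assert model._get_deepstack_input_embeds(0) == [1.0, 2.0]
model._reset_deepstack()
assert model._get_deepstack_input_embeds(0) is None
```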

Shortcomings and Future Improvements

  • The inputs_embeds side-channel is a workaround for JAX/JIT signature limitations.
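The side-channel can be sketched as a pack/unpack pair: extra per-layer rows are appended to the embeddings before the JIT boundary and split off again inside the compiled function. Plain lists stand in for tensors here; shapes and names are illustrative:

```python
def pack(inputs_embeds, deepstack_embeds):
    # Append the Deepstack rows after the real embeddings and record how
    # many were added, so unpack() can split at exactly the right point.
    return inputs_embeds + deepstack_embeds, len(deepstack_embeds)

def unpack(packed, n_extra):
    # Split the packed embeddings back into (core, side-channel) parts.
    if n_extra == 0:
        return packed, []
    return packed[:-n_extra], packed[-n_extra:]

embeds = [[0.1, 0.2], [0.3, 0.4]]
extra = [[9.0, 9.0]]
packed, n = pack(embeds, extra)
core, side = unpack(packed, n)
assert core == embeds and side == extra
```

Because only `inputs_embeds` crosses the JIT boundary, the function signature stays fixed even when Deepstack features are present.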

Tests

The changes were verified using the examples/multi_modal_inference.py script on a TPU v6e VM.

Reproduction Commands

Standard single image inference:

python3 -m examples.multi_modal_inference \
  --model Qwen/Qwen3-VL-8B-Instruct \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill False

yuyanpeng-google and others added 9 commits March 23, 2026 15:15
1. add vllm wrapper for multimodal.
2. modify interface of embed_input_ids with related jax model.
3. modify gather_mm_embeddings to get is_mm_embed for the new interface.
4. register function for torch.sdpa through torchax to use flash attention

Signed-off-by: Yuyan Peng <yuyanpeng@google.com>
from vllm.config import VllmConfig, set_current_vllm_config
from torchax.ops.ops_registry import (register_torch_dispatch_op,
register_torch_function_op)
from vllm.config import VllmConfig, set_current_vllm_config, set_current_vllm_config

There are two `set_current_vllm_config` imports here.


model_example_map = {
"qwen2_5_vl": run_qwen2_5_vl,
"Qwen/Qwen2.5-VL-3B-Instruct": run_qwen_vl,

I think specifying Qwen2.5-VL-3B-Instruct might not be general enough in case we want to test Qwen2.5-VL-7B. Could you make this more general?

model_key = args.model


req_data = model_example_map[model_key](questions, modality, args)

Could you add a check that model_key is in the map and raise an error if not?
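A minimal sketch of that check, reusing the names from the quoted snippet (the placeholder runner and the choice of `ValueError` are assumptions):

```python
model_example_map = {
    # Placeholder runner standing in for run_qwen2_5_vl.
    "qwen2_5_vl": lambda questions, modality, args: None,
}

def get_runner(model_key):
    # Fail fast with a clear message instead of an opaque KeyError.
    if model_key not in model_example_map:
        raise ValueError(
            f"Unsupported model {model_key!r}; expected one of "
            f"{sorted(model_example_map)}")
    return model_example_map[model_key]
```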

else:
seg_ids = None

from tpu_inference.layers.common.attention_interface import \

This is already imported at the top of this file.

vllm_model._deepstack_tensors = {}

if isinstance(deepstack_input_embeds, dict):
vllm_model._deepstack_tensors.update(deepstack_input_embeds)

One concern here: do we need to reset _deepstack_tensors to {} when deepstack_input_embeds is None? Otherwise it may retain values from the previous forward pass.
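One way to address this, sketched with hypothetical names (a `SimpleNamespace` stands in for the wrapped model):

```python
from types import SimpleNamespace

def update_deepstack_cache(vllm_model, deepstack_input_embeds):
    # Reset the cache when no new embeds arrive, so stale tensors from the
    # previous forward pass are never reused.
    if deepstack_input_embeds is None:
        vllm_model._deepstack_tensors = {}
    elif isinstance(deepstack_input_embeds, dict):
        vllm_model._deepstack_tensors.update(deepstack_input_embeds)

m = SimpleNamespace(_deepstack_tensors={0: "stale"})
update_deepstack_cache(m, None)
assert m._deepstack_tensors == {}
update_deepstack_cache(m, {1: "fresh"})
assert m._deepstack_tensors == {1: "fresh"}
```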

)
# Qwen3-VL uses a different method signature and takes in mm_features as an argument.
import inspect
takes_mm_features = "mm_features" in inspect.signature(get_mrope_input_positions_fn).parameters

Could you use supports_kw here instead of inspect.signature?
Also, I think this can be moved outside the for loop: https://github.com/muskansh-google/tpu-inference/blob/dd49ffcca0d74b577ea88512bd45742ee04a67ea/tpu_inference/runner/persistent_batch_manager.py#L132.
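For context, the quoted check just asks whether the callable accepts an `mm_features` keyword; vLLM's `supports_kw` helper covers the same idea. A standalone `inspect`-based illustration (helper name is made up here):

```python
import inspect

def takes_kw(fn, name):
    # True if `fn` accepts `name` as an explicit parameter or via **kwargs.
    params = inspect.signature(fn).parameters
    if name in params:
        return True
    return any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in params.values())

def old_style(input_tokens):
    return input_tokens

def new_style(input_tokens, mm_features=None):
    return input_tokens

assert not takes_kw(old_style, "mm_features")
assert takes_kw(new_style, "mm_features")
```

Since the target function's signature does not change per request, the result can indeed be computed once outside the loop.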
