
[Model] Support Gemma 3 Vision#3429

Open
gnguralnick wants to merge 12 commits into mlc-ai:main from gnguralnick:gemma3v

Conversation

@gnguralnick

This PR adds support for the Gemma 3 Vision-Language Model architecture (gemma3_v), tested with gemma-3-4b-it.

Architecture overview:

  • SigLIP vision encoder (new siglip_vision.py, distinct from the existing CLIP encoder — no CLS token, standard GELU, post-layernorm only)
  • Multimodal projector (RMSNorm + Linear) mapping vision hidden states to the text model's hidden dimension
  • 4x4 average pooling reduces 4096 SigLIP patches to 256 visual tokens
  • Image embeddings are pre-divided by sqrt(hidden_size) to compensate for Gemma's embedding scaling convention
  • Fixed 896x896 image resolution (no dynamic tiling)
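The pooling and scaling steps above can be sketched in NumPy (shapes follow the numbers in this PR: a 64x64 = 4096 patch grid pooled 4x4 down to 16x16 = 256 tokens; the function name and the standalone-array interface are illustrative, not the actual relax implementation, and in the real model the projector maps to the text hidden size before this scaling applies):

```python
import numpy as np

def pool_and_scale(patch_embeds: np.ndarray, text_hidden_size: int) -> np.ndarray:
    """Illustrative: (4096, dim) SigLIP patches -> (256, dim) visual tokens."""
    n, dim = patch_embeds.shape
    side = int(np.sqrt(n))  # 64 patches per side for an 896x896 input
    grid = patch_embeds.reshape(side, side, dim)
    # 4x4 average pooling over the spatial grid: 64x64 -> 16x16
    pooled = grid.reshape(side // 4, 4, side // 4, 4, dim).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, dim)  # (256, dim)
    # Gemma multiplies text embeddings by sqrt(hidden_size) at lookup time,
    # so image embeddings are pre-divided to cancel that scaling.
    return tokens / np.sqrt(text_hidden_size)
```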

Notable fixes included in this branch:

  • Attention kernel workaround for non-standard head dimensions: The TIR _attention_sequence_prefill kernel produces incorrect results on Metal/WebGPU when head_dim is not a multiple of 16 (SigLIP uses head_dim=72). This PR adds a naive matmul+softmax fallback path in op/attention.py when head_dim % 16 != 0, bypassing the TIR kernel entirely. It also fixes a hardcoded target="cuda" that caused the attention kernel to always compile for CUDA regardless of the actual target.
  • fp16 overflow clamping: Gemma 3 was designed for bfloat16; its RMSNorm weights can exceed the float16 representable range. Added clamping after residual additions in decoder layers when running in float16 mode (ref: HuggingFace #39972).
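As a rough NumPy stand-in for the two fixes above (the real code operates on relax tensors in op/attention.py and the Gemma 3 decoder layers; these free functions are illustrative sketches, not the PR's actual implementation):

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504

def naive_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Plain matmul+softmax attention: the shape of the fallback path taken
    when head_dim % 16 != 0 (e.g. SigLIP's head_dim=72)."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def clamp_fp16(x: np.ndarray) -> np.ndarray:
    """Clamp after residual additions so float16 activations stay finite."""
    return np.clip(x, -FP16_MAX, FP16_MAX)
```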

Other changes:

  • Dynamic image embed size computation in serve/data.py (replaces hardcoded 576, also fixes phi3_v embed size calculation)
  • TokenData handling in engine_utils.py for BOI/EOI token injection
  • gemma3_vision_instruction conversation template
  • C++ vlm_utils.cc dispatch for gemma3_v resize/pad/crop
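For the gemma3_v case, the embed-size arithmetic follows directly from the numbers above (896x896 input, a 64x64 patch grid implied by 4096 patches, 4x4 pooling). A hedged sketch, with the function name chosen here for illustration and the dynamic phi3_v path omitted:

```python
def compute_gemma3v_embed_size(image_size: int = 896, patch_size: int = 14,
                               pool: int = 4) -> int:
    """Visual token count for gemma3_v: patches per side, then 4x4 pooling.
    patch_size=14 is implied by 896x896 -> 4096 patches (64 per side)."""
    patches_per_side = image_size // patch_size  # 896 // 14 = 64
    return (patches_per_side // pool) ** 2       # (64 // 4) ** 2 = 256
```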

@gemini-code-assist

Summary of Changes

Hello @gnguralnick, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the model capabilities by integrating full support for the Gemma 3 Vision-Language Model. It introduces a new SigLIP vision encoder, a multimodal projector, and necessary image processing logic, enabling the system to handle visual inputs alongside text. The changes also include critical bug fixes and optimizations for attention mechanisms and floating-point precision, ensuring robust performance for this new multimodal architecture.

Highlights

  • Gemma 3 Vision-Language Model Support: Introduced comprehensive support for the Gemma 3 Vision-Language Model (gemma3_v), enabling multimodal capabilities within the system.
  • SigLIP Vision Encoder Integration: Implemented a new SigLIP vision encoder architecture, featuring distinct pooling, normalization, and a multimodal projector to map vision hidden states to the text model's hidden dimension.
  • Attention Kernel Workaround: Added a workaround in the attention kernel to provide a naive matmul+softmax fallback for non-standard head dimensions (not multiples of 16), addressing incorrect results on Metal/WebGPU targets.
  • FP16 Overflow Clamping: Implemented float16 overflow clamping after residual additions in Gemma 3 decoder layers to prevent NaN propagation due to large RMSNorm weights when running in float16 mode.
  • Dynamic Image Embed Size Calculation: Updated the image embedding size computation to be dynamic for phi3_v and fixed for gemma3_v, ensuring accurate token counts for visual inputs.
  • BOI/EOI Token Handling: Integrated TokenData handling for Begin-of-Image (BOI) and End-of-Image (EOI) token injection in conversation protocols for Gemma 3 Vision.


Changelog
  • cpp/support/vlm_utils.cc
    • Added gemma3_v specific logic for CalculateResizeShape, CalculatePadShape, and CalculateCropShape to handle 896x896 fixed image resolution.
  • python/mlc_llm/conversation_template/gemma.py
    • Registered a new gemma3_vision_instruction conversation template.
  • python/mlc_llm/interface/gen_config.py
    • Included gemma3_vision_instruction in the list of supported conversation templates.
  • python/mlc_llm/model/gemma3/gemma3_model.py
    • Introduced a _clamp_fp16 static method to clamp float16 tensor values.
    • Applied _clamp_fp16 after self-attention and MLP residual additions in Gemma3DecoderLayer to prevent overflow.
  • python/mlc_llm/model/gemma3/gemma3v_loader.py
    • Added a new file to define parameter mapping for the Gemma3VForCausalLM model from HuggingFace.
  • python/mlc_llm/model/gemma3/gemma3v_model.py
    • Added a new file implementing the Gemma3VForCausalLM model, including Gemma3VConfig, Gemma3MultiModalProjector, and image embedding logic with SigLIP vision tower.
  • python/mlc_llm/model/gemma3/gemma3v_quantization.py
    • Added a new file to define group and no-quantization methods for the Gemma3VForCausalLM model.
  • python/mlc_llm/model/model.py
    • Imported gemma3v_loader, gemma3v_model, and gemma3v_quantization.
    • Registered the gemma3_v model with its configuration, source, and quantization methods.
  • python/mlc_llm/model/vision/__init__.py
    • Imported SigLIPVisionConfig and SigLIPVisionModel from the new siglip_vision module.
  • python/mlc_llm/model/vision/image_processing.py
    • Added a normalize_siglip method for image normalization specific to SigLIP values (mean=0.5, std=0.5).
  • python/mlc_llm/model/vision/siglip_vision.py
    • Added a new file implementing the SigLIP vision encoder components: SigLIPVisionConfig, SigLIPVisionEmbeddings, SigLIPMLP, SigLIPAttention, SigLIPEncoderLayer, SigLIPEncoder, SigLIPVisionTransformer, and SigLIPVisionModel.
  • python/mlc_llm/op/attention.py
    • Imported math module.
    • Modified the attention function to include a fallback path using naive matmul+softmax for head_dim values not divisible by 16.
    • Removed hardcoded target="cuda" and used _extern.get_store().target for attention kernel compilation.
  • python/mlc_llm/protocol/conversation_protocol.py
    • Modified as_prompt to conditionally wrap ImageData with TokenData for BOI/EOI tokens when model_type is gemma3_v.
  • python/mlc_llm/serve/data.py
    • Refactored image embed size calculation into a new static method _compute_embed_size.
    • Updated _compute_embed_size to dynamically calculate embed size for phi3_v and set a fixed size of 256 for gemma3_v.
  • python/mlc_llm/serve/engine_utils.py
    • Added handling for data.TokenData instances in process_prompts to append their token_ids.
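The BOI/EOI plumbing described in the last few entries can be sketched roughly as follows (the class and function names mirror the description above, and the token ids are placeholders, not Gemma 3's real vocabulary ids):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TokenData:
    token_ids: List[int]

@dataclass
class ImageData:
    embed_size: int

BOI_TOKEN_ID = 1001  # placeholder, not Gemma 3's actual id
EOI_TOKEN_ID = 1002  # placeholder, not Gemma 3's actual id

def wrap_image(item, model_type: str):
    """Mimics as_prompt: wrap ImageData with BOI/EOI TokenData for gemma3_v."""
    if model_type == "gemma3_v" and isinstance(item, ImageData):
        return [TokenData([BOI_TOKEN_ID]), item, TokenData([EOI_TOKEN_ID])]
    return [item]

def process_prompt(items, model_type: str) -> List[Union[int, ImageData]]:
    """Mimics process_prompts: flatten parts, appending TokenData token ids."""
    out: List[Union[int, ImageData]] = []
    for item in items:
        for part in wrap_image(item, model_type):
            if isinstance(part, TokenData):
                out.extend(part.token_ids)
            else:
                out.append(part)
    return out
```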


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Gemma 3 Vision-Language Model, including its architecture, parameter loading, and quantization configurations. It also provides crucial workarounds for an attention kernel issue with non-standard head dimensions and for fp16 overflow issues specific to Gemma 3. The changes are well-structured and include necessary updates to conversation templates, data processing, and C++ utilities. My feedback focuses on improving maintainability and code clarity by replacing magic numbers with constants and utilizing standard library functions for custom-implemented logic.

@gnguralnick
Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Gemma 3 Vision-Language Model, which is a significant addition. The changes are comprehensive, covering the model architecture, parameter loading, image processing, and necessary adjustments to the conversation and serving pipelines. The inclusion of workarounds for attention kernel issues and float16 overflow demonstrates a thorough approach. My review focuses on improving code maintainability by addressing magic numbers and code duplication. Overall, this is a solid contribution.

@MasterJH5574
Member

Thank you so much @gnguralnick! Will find a time to review this PR.

@babusid
Contributor

babusid commented Mar 3, 2026

@gnguralnick

Hey, sorry about the delay on this one. Thanks for the PR!
Quick question for you - is this commit (dab83fa) the last one actually related to this branch's changes / features?

In order to make this PR more digestible for review, would you be able to remove the ~15 or so commits that were merged in on top from main, and rebase this branch on top of main instead? That way, the PR won't include all those changes. It would also remove all the conflicts with the quantization infrastructure change that got merged this morning.

If you're pressed for time, I'm happy to do the rebase for you, just don't want to delete any of your work by mistake.

Gabriel Guralnick added 11 commits March 5, 2026 16:48

Commit: …ix type hint
- Use self.config.vision_config.image_size instead of hardcoded 896 in gemma3v image_preprocess
- Refactor normalize/normalize_siglip into shared _normalize_impl to reduce duplication in ImageProcessor
- Fix SigLIPEncoder.forward return type hint (Tensor -> Tuple[Tensor, ...])

Commit: …igLIP
- Move trailing \n into else branch so gemma3_v doesn't get extra newline after EOI
- Remove unused image_token_index from Gemma3VConfig
- Fix misleading comment on mm_soft_emb_norm (Gemma +1 is fused during weight loading)
- Remove redundant gemma3_vision_instruction template (identical to gemma3_instruction)
- Simplify SigLIP encoder: remove tuple wrapping, state accumulation, unused logger

Commit: Adapts to the quantization refactoring on main that replaced per-model quantization files with make_quantization_functions.
@babusid
Contributor

babusid commented Mar 6, 2026

Hey @gnguralnick thanks for handling the rebase. Definitely a lot more digestible now. We might have to wait on reviewing this until #3443 lands. It's another simple refactor PR, but since it affects loader logic, it may affect this branch / gemma3V as well. Once it lands, just rebase off main again, and we should be able to review this easier.

@gnguralnick
Author

sounds good!

- test_gemma3v.py: model registration, TVM IR export with VLM
  composition (vision_tower + language_model + projector), config
  validation
- test_siglip_vision.py: SigLIP vision encoder export with expected
  parameter components
- test_attention_fallback.py: correctness test for naive matmul+softmax
  fallback when head_dim % 16 != 0 (e.g. SigLIP head_dim=72), compared
  against numpy reference