
[Model] Support Gemma 3 Vision#3429

Open
gnguralnick wants to merge 12 commits into mlc-ai:main from gnguralnick:gemma3v

Conversation

@gnguralnick

This PR adds support for the Gemma 3 Vision-Language Model architecture (gemma3_v), tested with gemma-3-4b-it.

Architecture overview:

  • SigLIP vision encoder (new siglip_vision.py, distinct from the existing CLIP encoder — no CLS token, standard GELU, post-layernorm only)
  • Multimodal projector (RMSNorm + Linear) mapping vision hidden states to the text model's hidden dimension
  • 4x4 average pooling reduces 4096 SigLIP patches to 256 visual tokens
  • Image embeddings are pre-divided by sqrt(hidden_size) to compensate for Gemma's embedding scaling convention
  • Fixed 896x896 image resolution (no dynamic tiling)
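The pooling and scaling steps above can be sketched in NumPy (shapes follow the numbers in this PR: a 64x64 = 4096 patch grid pooled 4x4 down to 16x16 = 256 tokens; the function name and the standalone-array interface are illustrative, not the actual relax implementation, and in the real model the projector maps to the text hidden size before this scaling applies):

```python
import numpy as np

def pool_and_scale(patch_embeds: np.ndarray, text_hidden_size: int) -> np.ndarray:
    """Illustrative: (4096, dim) SigLIP patches -> (256, dim) visual tokens."""
    n, dim = patch_embeds.shape
    side = int(np.sqrt(n))  # 64 patches per side for an 896x896 input
    grid = patch_embeds.reshape(side, side, dim)
    # 4x4 average pooling over the spatial grid: 64x64 -> 16x16
    pooled = grid.reshape(side // 4, 4, side // 4, 4, dim).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, dim)  # (256, dim)
    # Gemma multiplies text embeddings by sqrt(hidden_size) at lookup time,
    # so image embeddings are pre-divided to cancel that scaling.
    return tokens / np.sqrt(text_hidden_size)
```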

Notable fixes included in this branch:

  • Attention kernel workaround for non-standard head dimensions: The TIR _attention_sequence_prefill kernel produces incorrect results on Metal/WebGPU when head_dim is not a multiple of 16 (SigLIP uses head_dim=72). This PR adds a naive matmul+softmax fallback path in op/attention.py when head_dim % 16 != 0, bypassing the TIR kernel entirely. It also fixes a hardcoded target="cuda" that caused the attention kernel to always compile for CUDA regardless of the actual target.
  • fp16 overflow clamping: Gemma 3 was designed for bfloat16; its RMSNorm weights can exceed the float16 representable range. Added clamping after residual additions in decoder layers when running in float16 mode (ref: HuggingFace #39972).
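As a rough NumPy stand-in for the two fixes above (the real code operates on relax tensors in op/attention.py and the Gemma 3 decoder layers; these free functions are illustrative sketches, not the PR's actual implementation):

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504

def naive_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Plain matmul+softmax attention: the shape of the fallback path taken
    when head_dim % 16 != 0 (e.g. SigLIP's head_dim=72)."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def clamp_fp16(x: np.ndarray) -> np.ndarray:
    """Clamp after residual additions so float16 activations stay finite."""
    return np.clip(x, -FP16_MAX, FP16_MAX)
```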

Other changes:

  • Dynamic image embed size computation in serve/data.py (replaces hardcoded 576, also fixes phi3_v embed size calculation)
  • TokenData handling in engine_utils.py for BOI/EOI token injection
  • gemma3_vision_instruction conversation template
  • C++ vlm_utils.cc dispatch for gemma3_v resize/pad/crop
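For the gemma3_v case, the embed-size arithmetic follows directly from the numbers above (896x896 input, a 64x64 patch grid implied by 4096 patches, 4x4 pooling). A hedged sketch, with the function name chosen here for illustration and the dynamic phi3_v path omitted:

```python
def compute_gemma3v_embed_size(image_size: int = 896, patch_size: int = 14,
                               pool: int = 4) -> int:
    """Visual token count for gemma3_v: patches per side, then 4x4 pooling.
    patch_size=14 is implied by 896x896 -> 4096 patches (64 per side)."""
    patches_per_side = image_size // patch_size  # 896 // 14 = 64
    return (patches_per_side // pool) ** 2       # (64 // 4) ** 2 = 256
```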

@gemini-code-assist

Summary of Changes

Hello @gnguralnick, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly extends the model capabilities by integrating full support for the Gemma 3 Vision-Language Model. It introduces a new SigLIP vision encoder, a multimodal projector, and necessary image processing logic, enabling the system to handle visual inputs alongside text. The changes also include critical bug fixes and optimizations for attention mechanisms and floating-point precision, ensuring robust performance for this new multimodal architecture.

Highlights

  • Gemma 3 Vision-Language Model Support: Introduced comprehensive support for the Gemma 3 Vision-Language Model (gemma3_v), enabling multimodal capabilities within the system.
  • SigLIP Vision Encoder Integration: Implemented a new SigLIP vision encoder architecture, featuring distinct pooling, normalization, and a multimodal projector to map vision hidden states to the text model's hidden dimension.
  • Attention Kernel Workaround: Added a workaround in the attention kernel to provide a naive matmul+softmax fallback for non-standard head dimensions (not multiples of 16), addressing incorrect results on Metal/WebGPU targets.
  • FP16 Overflow Clamping: Implemented float16 overflow clamping after residual additions in Gemma 3 decoder layers to prevent NaN propagation due to large RMSNorm weights when running in float16 mode.
  • Dynamic Image Embed Size Calculation: Updated the image embedding size computation to be dynamic for phi3_v and fixed for gemma3_v, ensuring accurate token counts for visual inputs.
  • BOI/EOI Token Handling: Integrated TokenData handling for Begin-of-Image (BOI) and End-of-Image (EOI) token injection in conversation protocols for Gemma 3 Vision.


Changelog
  • cpp/support/vlm_utils.cc
    • Added gemma3_v specific logic for CalculateResizeShape, CalculatePadShape, and CalculateCropShape to handle 896x896 fixed image resolution.
  • python/mlc_llm/conversation_template/gemma.py
    • Registered a new gemma3_vision_instruction conversation template.
  • python/mlc_llm/interface/gen_config.py
    • Included gemma3_vision_instruction in the list of supported conversation templates.
  • python/mlc_llm/model/gemma3/gemma3_model.py
    • Introduced a _clamp_fp16 static method to clamp float16 tensor values.
    • Applied _clamp_fp16 after self-attention and MLP residual additions in Gemma3DecoderLayer to prevent overflow.
  • python/mlc_llm/model/gemma3/gemma3v_loader.py
    • Added a new file to define parameter mapping for the Gemma3VForCausalLM model from HuggingFace.
  • python/mlc_llm/model/gemma3/gemma3v_model.py
    • Added a new file implementing the Gemma3VForCausalLM model, including Gemma3VConfig, Gemma3MultiModalProjector, and image embedding logic with SigLIP vision tower.
  • python/mlc_llm/model/gemma3/gemma3v_quantization.py
    • Added a new file to define group and no-quantization methods for the Gemma3VForCausalLM model.
  • python/mlc_llm/model/model.py
    • Imported gemma3v_loader, gemma3v_model, and gemma3v_quantization.
    • Registered the gemma3_v model with its configuration, source, and quantization methods.
  • python/mlc_llm/model/vision/__init__.py
    • Imported SigLIPVisionConfig and SigLIPVisionModel from the new siglip_vision module.
  • python/mlc_llm/model/vision/image_processing.py
    • Added a normalize_siglip method for image normalization specific to SigLIP values (mean=0.5, std=0.5).
  • python/mlc_llm/model/vision/siglip_vision.py
    • Added a new file implementing the SigLIP vision encoder components: SigLIPVisionConfig, SigLIPVisionEmbeddings, SigLIPMLP, SigLIPAttention, SigLIPEncoderLayer, SigLIPEncoder, SigLIPVisionTransformer, and SigLIPVisionModel.
  • python/mlc_llm/op/attention.py
    • Imported math module.
    • Modified the attention function to include a fallback path using naive matmul+softmax for head_dim values not divisible by 16.
    • Removed hardcoded target="cuda" and used _extern.get_store().target for attention kernel compilation.
  • python/mlc_llm/protocol/conversation_protocol.py
    • Modified as_prompt to conditionally wrap ImageData with TokenData for BOI/EOI tokens when model_type is gemma3_v.
  • python/mlc_llm/serve/data.py
    • Refactored image embed size calculation into a new static method _compute_embed_size.
    • Updated _compute_embed_size to dynamically calculate embed size for phi3_v and set a fixed size of 256 for gemma3_v.
  • python/mlc_llm/serve/engine_utils.py
    • Added handling for data.TokenData instances in process_prompts to append their token_ids.
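The BOI/EOI plumbing described in the last few entries can be sketched roughly as follows (the class and function names mirror the description above, and the token ids are placeholders, not Gemma 3's real vocabulary ids):

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TokenData:
    token_ids: List[int]

@dataclass
class ImageData:
    embed_size: int

BOI_TOKEN_ID = 1001  # placeholder, not Gemma 3's actual id
EOI_TOKEN_ID = 1002  # placeholder, not Gemma 3's actual id

def wrap_image(item, model_type: str):
    """Mimics as_prompt: wrap ImageData with BOI/EOI TokenData for gemma3_v."""
    if model_type == "gemma3_v" and isinstance(item, ImageData):
        return [TokenData([BOI_TOKEN_ID]), item, TokenData([EOI_TOKEN_ID])]
    return [item]

def process_prompt(items, model_type: str) -> List[Union[int, ImageData]]:
    """Mimics process_prompts: flatten parts, appending TokenData token ids."""
    out: List[Union[int, ImageData]] = []
    for item in items:
        for part in wrap_image(item, model_type):
            if isinstance(part, TokenData):
                out.extend(part.token_ids)
            else:
                out.append(part)
    return out
```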


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Gemma 3 Vision-Language Model, including its architecture, parameter loading, and quantization configurations. It also provides crucial workarounds for an attention kernel issue with non-standard head dimensions and for fp16 overflow issues specific to Gemma 3. The changes are well-structured and include necessary updates to conversation templates, data processing, and C++ utilities. My feedback focuses on improving maintainability and code clarity by replacing magic numbers with constants and utilizing standard library functions for custom-implemented logic.

@gnguralnick
Author

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for the Gemma 3 Vision-Language Model, which is a significant addition. The changes are comprehensive, covering the model architecture, parameter loading, image processing, and necessary adjustments to the conversation and serving pipelines. The inclusion of workarounds for attention kernel issues and float16 overflow demonstrates a thorough approach. My review focuses on improving code maintainability by addressing magic numbers and code duplication. Overall, this is a solid contribution.

@MasterJH5574
Member

Thank you so much @gnguralnick! Will find a time to review this PR.

@babusid
Contributor

babusid commented Mar 3, 2026

@gnguralnick

Hey, sorry about the delay on this one. Thanks for the PR!
Quick question for you - is this commit (dab83fa) the last one actually related to this branch's changes / features?

In order to make this PR more digestible for review, would you be able to remove the ~15 or so commits that were merged in on top from main, and rebase this branch on top of main instead? That way, the PR won't include all those changes. It would also remove all the conflicts with the quantization infrastructure change that got merged this morning.

If you're pressed for time, I'm happy to do the rebase for you, just don't want to delete any of your work by mistake.

Gabriel Guralnick added 11 commits March 5, 2026 16:48

Commit: …ix type hint
- Use self.config.vision_config.image_size instead of hardcoded 896 in gemma3v image_preprocess
- Refactor normalize/normalize_siglip into shared _normalize_impl to reduce duplication in ImageProcessor
- Fix SigLIPEncoder.forward return type hint (Tensor -> Tuple[Tensor, ...])

Commit: …igLIP
- Move trailing \n into else branch so gemma3_v doesn't get extra newline after EOI
- Remove unused image_token_index from Gemma3VConfig
- Fix misleading comment on mm_soft_emb_norm (Gemma +1 is fused during weight loading)
- Remove redundant gemma3_vision_instruction template (identical to gemma3_instruction)
- Simplify SigLIP encoder: remove tuple wrapping, state accumulation, unused logger

Commit: Adapts to the quantization refactoring on main that replaced per-model quantization files with make_quantization_functions.
@babusid
Contributor

babusid commented Mar 6, 2026

Hey @gnguralnick thanks for handling the rebase. Definitely a lot more digestible now. We might have to wait on reviewing this until #3443 lands. It's another simple refactor PR, but since it affects loader logic, it may affect this branch / gemma3V as well. Once it lands, just rebase off main again, and we should be able to review this easier.

@gnguralnick
Author

sounds good!

- test_gemma3v.py: model registration, TVM IR export with VLM
  composition (vision_tower + language_model + projector), config
  validation
- test_siglip_vision.py: SigLIP vision encoder export with expected
  parameter components
- test_attention_fallback.py: correctness test for naive matmul+softmax
  fallback when head_dim % 16 != 0 (e.g. SigLIP head_dim=72), compared
  against numpy reference