Conversation
Summary of Changes

Hello @gnguralnick, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request extends the model capabilities by integrating full support for the Gemma 3 Vision-Language Model. It introduces a new SigLIP vision encoder, a multimodal projector, and the necessary image processing logic, enabling the system to handle visual inputs alongside text. The changes also include bug fixes and optimizations for attention mechanisms and floating-point precision for this new multimodal architecture.
Code Review
This pull request introduces support for the Gemma 3 Vision-Language Model, including its architecture, parameter loading, and quantization configurations. It also provides crucial workarounds for an attention kernel issue with non-standard head dimensions and for fp16 overflow issues specific to Gemma 3. The changes are well-structured and include necessary updates to conversation templates, data processing, and C++ utilities. My feedback focuses on improving maintainability and code clarity by replacing magic numbers with constants and utilizing standard library functions for custom-implemented logic.
/gemini review
Code Review
This pull request introduces support for the Gemma 3 Vision-Language Model, which is a significant addition. The changes are comprehensive, covering the model architecture, parameter loading, image processing, and necessary adjustments to the conversation and serving pipelines. The inclusion of workarounds for attention kernel issues and float16 overflow demonstrates a thorough approach. My review focuses on improving code maintainability by addressing magic numbers and code duplication. Overall, this is a solid contribution.
Thank you so much @gnguralnick! I'll find time to review this PR.
Hey, sorry about the delay on this one. Thanks for the PR! In order to make this PR more digestible for review, would you be able to remove the ~15 or so commits that were merged in on top from main, and rebase this branch on top of main instead? That way, the PR won't include all those changes. It would also remove all the conflicts with the quantization infrastructure change that got merged this morning. If you're pressed for time, I'm happy to do the rebase for you, just don't want to delete any of your work by mistake.
…ix type hint

- Use `self.config.vision_config.image_size` instead of hardcoded 896 in gemma3v `image_preprocess`
- Refactor `normalize`/`normalize_siglip` into shared `_normalize_impl` to reduce duplication in `ImageProcessor`
- Fix `SigLIPEncoder.forward` return type hint (`Tensor` -> `Tuple[Tensor, ...]`)
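The commit above folds `normalize` and `normalize_siglip` into a shared `_normalize_impl`. A minimal sketch of that kind of refactor, assuming HWC float images and channel-wise statistics (the constants and function names here are illustrative, not the PR's actual code — though the CLIP ImageNet statistics and SigLIP's [-1, 1] rescaling are standard):

```python
import numpy as np

def _normalize_impl(image: np.ndarray, mean, std) -> np.ndarray:
    """Channel-wise normalize an HWC float image: (x - mean) / std."""
    mean = np.asarray(mean, dtype=image.dtype)
    std = np.asarray(std, dtype=image.dtype)
    return (image - mean) / std

def normalize(image: np.ndarray) -> np.ndarray:
    # CLIP-style ImageNet statistics
    return _normalize_impl(
        image,
        [0.48145466, 0.4578275, 0.40821073],
        [0.26862954, 0.26130258, 0.27577711],
    )

def normalize_siglip(image: np.ndarray) -> np.ndarray:
    # SigLIP rescales [0, 1] inputs to [-1, 1]
    return _normalize_impl(image, [0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

With the shared helper, a mid-gray image maps to all zeros under the SigLIP variant: `normalize_siglip(np.full((2, 2, 3), 0.5, np.float32))` returns an all-zero array.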
…igLIP

- Move trailing `\n` into else branch so gemma3_v doesn't get an extra newline after EOI
- Remove unused `image_token_index` from `Gemma3VConfig`
- Fix misleading comment on `mm_soft_emb_norm` (Gemma's +1 is fused during weight loading)
- Remove redundant `gemma3_vision_instruction` template (identical to `gemma3_instruction`)
- Simplify SigLIP encoder: remove tuple wrapping, state accumulation, unused logger
Adapts to the quantization refactoring on main that replaced per-model quantization files with `make_quantization_functions`.
Hey @gnguralnick thanks for handling the rebase. Definitely a lot more digestible now. We might have to wait on reviewing this until #3443 lands. It's another simple refactor PR, but since it affects loader logic, it may affect this branch / gemma3V as well. Once it lands, just rebase off main again, and we should be able to review this easier.
sounds good!
- `test_gemma3v.py`: model registration, TVM IR export with VLM composition (vision_tower + language_model + projector), config validation
- `test_siglip_vision.py`: SigLIP vision encoder export with expected parameter components
- `test_attention_fallback.py`: correctness test for the naive matmul+softmax fallback when `head_dim % 16 != 0` (e.g. SigLIP `head_dim=72`), compared against a numpy reference
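The fallback that `test_attention_fallback.py` checks is plain matmul+softmax attention compared against a numpy reference. A minimal sketch of such a reference (function name and shapes are illustrative, not the PR's actual helpers):

```python
import numpy as np

def naive_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Naive scaled dot-product attention; works for any head_dim,
    including head_dim=72, which is not a multiple of 16."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (batch, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (batch, seq, head_dim)

# SigLIP-like shape: head_dim=72
rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8, 72))
k = rng.standard_normal((1, 8, 72))
v = rng.standard_normal((1, 8, 72))
out = naive_attention(q, k, v)
assert out.shape == (1, 8, 72)
```

A fused TIR kernel can then be checked against this reference elementwise within a float tolerance, which is the kind of comparison the test describes.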
This PR adds support for the Gemma 3 Vision-Language Model architecture (`gemma3_v`), tested with `gemma-3-4b-it`.

Architecture overview:
- SigLIP vision encoder (`siglip_vision.py`, distinct from the existing CLIP encoder: no CLS token, standard GELU, post-layernorm only)
- Multimodal projector that scales by `sqrt(hidden_size)` to compensate for Gemma's embedding scaling convention

Notable fixes included in this branch:
- The `_attention_sequence_prefill` kernel produces incorrect results on Metal/WebGPU when `head_dim` is not a multiple of 16 (SigLIP uses `head_dim=72`). This PR adds a naive matmul+softmax fallback path in `op/attention.py` when `d % 16 != 0`, bypassing the TIR kernel entirely.
- Fixes a hardcoded `target="cuda"` that caused the attention kernel to always compile for CUDA regardless of the actual target.

Other changes:
- Image embed size derived from the model config in `serve/data.py` (replaces a hardcoded 576; also fixes the phi3_v embed size calculation)
- `TokenData` handling in `engine_utils.py` for BOI/EOI token injection
- `gemma3_vision_instruction` conversation template
- `vlm_utils.cc` dispatch for gemma3_v resize/pad/crop
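The hardcoded 576 that `serve/data.py` previously used is the patch count of a ViT-style encoder, so it can be derived from the config instead. A sketch of that arithmetic (the function name is hypothetical; the patch sizes below are the standard ones for CLIP ViT-L/14 at 336px and Gemma 3's SigLIP at 896px, and 4096 is the raw patch count before any token pooling the projector may apply):

```python
def image_embed_size(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a ViT-style encoder: (image_size / patch_size) ** 2."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    return (image_size // patch_size) ** 2

# CLIP ViT-L/14 @ 336px -> the previously hardcoded 576
assert image_embed_size(336, 14) == 576
# Gemma 3's SigLIP @ 896px with patch size 14 -> 4096 raw patches
assert image_embed_size(896, 14) == 4096
```

Computing this from `vision_config` rather than a literal is what lets one code path serve both the CLIP-based models and gemma3_v.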