Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request introduces the Qwen3-VL model, a multimodal large language model.
Code Review
This pull request introduces the implementation of the Qwen3-VL model, including the vision and text components, parameter loading from Hugging Face checkpoints, and associated tests. The implementation is comprehensive and follows the repository's JAX-native, NNX-first philosophy. The code is well structured into separate modules for the model, parameters, and vision encoder. The tests are also thorough, covering both LoRA parameter merging and round-trip weight loading for the vision components.
My review identified a few critical issues in the model logic, particularly in the deepstack feature injection, and a reference to an undefined MoELayer. I also found some gaps in the test coverage and parameter mappings that should be addressed to improve correctness and maintainability. Overall, this is a great contribution; with these fixes it will be a solid implementation.
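For reviewers unfamiliar with the deepstack pattern flagged above, here is a minimal schematic sketch of the injection idea: the vision encoder keeps features from a few intermediate layers, and each of the first few decoder layers adds one of those feature maps at the image-token positions. All names and shapes below are hypothetical illustrations, not this PR's actual API:

```python
import numpy as np


def inject_deepstack(hidden, vision_feats, image_mask, layer_idx):
    """Hypothetical deepstack injection: decoder layer `layer_idx` adds the
    matching intermediate vision feature map at image-token positions.
    Layers deeper than len(vision_feats) are left untouched."""
    if layer_idx >= len(vision_feats):
        return hidden
    out = hidden.copy()
    out[image_mask] = out[image_mask] + vision_feats[layer_idx]
    return out


# Toy demo: sequence length 6, hidden dim 4, image tokens at positions 1 and 2.
hidden = np.zeros((6, 4))
image_mask = np.array([False, True, True, False, False, False])
vision_feats = [np.ones((2, 4)), 2 * np.ones((2, 4))]  # one map per early layer

h = inject_deepstack(hidden, vision_feats, image_mask, layer_idx=0)
h = inject_deepstack(h, vision_feats, image_mask, layer_idx=1)
# Image positions accumulate 1 + 2 = 3 per element; other positions stay 0.
```

A review check worth making against this pattern: the injection must only touch image-token positions, and each intermediate feature map must be consumed by exactly one decoder layer.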
This PR implements Qwen3-VL and (partially) resolves #1063.
What's included:
What's not included yet:
grid_thw (spatial information about image and video inputs) that requires additional handling. Thus sampling is out of scope for now.

Correctness check
Since we don't have sampling yet, I checked layerwise matching with transformers. The script is here. The output:
The discrepancy between activations comes from the difference between the XLA and cuDNN matmul implementations (a discrepancy in the last digit of bfloat16) and is amplified by the MLP layers. Despite the seemingly large difference in the output logits, I checked that the top-1 token matches between the JAX and PyTorch implementations.
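The top-1 check described above can be illustrated with a toy sketch. Synthetic logits stand in for the real JAX and PyTorch model outputs here; the actual comparison script pulls activations from both models:

```python
import numpy as np

rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(1, 32))                      # stand-in for PyTorch logits
jax_logits = ref_logits + 1e-3 * rng.normal(size=(1, 32))  # small numeric drift

# The raw elementwise gap is nonzero and can look alarming in bfloat16 terms...
gap = float(np.max(np.abs(ref_logits - jax_logits)))

# ...but the top-1 token, which is what greedy decoding uses, still agrees.
same_top1 = np.argmax(ref_logits, axis=-1) == np.argmax(jax_logits, axis=-1)
```

This is why a top-1 (or top-k) agreement check is a more meaningful correctness signal than a raw logit tolerance when the two stacks use different matmul kernels.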
Reference
Vision encoder is based on:
Text decoder is mostly a copy of tunix/models/qwen3 with RoPE -> mRoPE and deepstack integration.

Colab Notebook
I'm not used to Colab notebooks, but here's a gist showing the usage with image features:
https://gist.github.com/ridcl/9adb25ecf5a843c3cfae1a9285cf4473
Checklist