
Add Qwen2 VL #2604

Open
samudraneel05 wants to merge 12 commits into keras-team:master from
samudraneel05:qwen2-vl

Conversation

@samudraneel05

Description of the change

Added Qwen2-VL, with notebooks documenting output matching and numerics verification on the 2B-parameter version. Open for review and feedback!

Reference

Fixes #2323
Hugging Face Link: Link

Colab Notebook

Overall check with numerics: here
Tokenizer comparison with Hugging Face: here
Preprocessor comparison: here

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@gemini-code-assist
Contributor

Summary of Changes

Hello @samudraneel05, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Keras Hub's capabilities by integrating the Qwen2-VL multimodal model. It provides a complete framework for handling both visual and textual inputs, allowing for advanced vision-language tasks. The changes encompass the core model architecture, data preprocessing, and tools for converting existing models, making it easier for users to leverage this powerful new model.

Highlights

  • New Model Integration: Introduced the Qwen2-VL multimodal model, including its backbone, causal language model, preprocessor, image converter, and tokenizer, enabling vision-language capabilities within Keras Hub.
  • Vision-Language Processing: Implemented a comprehensive vision processing pipeline, featuring smart image resizing, normalization, patch embedding via a 3D Vision Encoder with Rotary Position Embeddings (RoPE), and a PatchMerger to integrate vision features into the text sequence.
  • Hugging Face Compatibility: Added a dedicated conversion script and updated the preset loader to facilitate seamless conversion of Qwen2-VL model weights and tokenizers from Hugging Face, ensuring interoperability.
  • Robust Testing and Verification: Included extensive unit tests for all new components, covering backbone functionality, vision input handling, image preprocessing, and tokenizer accuracy, along with a checkpoint conversion verification script.
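The PatchMerger mentioned in the highlights can be illustrated with a minimal NumPy sketch (the function name, dimensions, and the random projection are illustrative stand-ins, not the PR's actual Keras implementation): it concatenates merge_size² adjacent patch features and projects them to the language model's hidden size, so the vision encoder emits merge_size²-fold fewer tokens into the text sequence.

```python
import numpy as np

def patch_merger(patch_features, merge_size=2, hidden_dim=8):
    """Toy PatchMerger sketch: concatenate merge_size**2 adjacent patch
    features, then project to the text hidden size. Shapes and the
    grouping-by-row-order are illustrative only."""
    num_patches, vision_dim = patch_features.shape
    group = merge_size ** 2
    # Group adjacent patches and flatten their features together.
    merged = patch_features.reshape(num_patches // group, group * vision_dim)
    # Stand-in for the learned MLP projection to the LM hidden size.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((group * vision_dim, hidden_dim)) * 0.02
    return merged @ w

feats = np.ones((16, 4))  # 16 patches, vision feature dim 4
out = patch_merger(feats)
print(out.shape)          # (4, 8): 4x fewer tokens, hidden_dim wide
```

The token-count reduction is what lets a high-resolution image fit into the LM's context without an excessively long sequence.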


Changelog
  • keras_hub/api/layers/__init__.py
    • Imported Qwen2VLImageConverter to expose it in the API.
  • keras_hub/api/models/__init__.py
    • Imported Qwen2VLBackbone, Qwen2VLCausalLM, Qwen2VLCausalLMPreprocessor, and Qwen2VLTokenizer to make them accessible via the models API.
  • keras_hub/api/tokenizers/__init__.py
    • Imported Qwen2VLTokenizer to expose it in the tokenizers API.
  • keras_hub/src/models/qwen2_vl/__init__.py
    • Initialized the Qwen2-VL model components and registered presets for the Qwen2VLBackbone.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone.py
    • Implemented the Qwen2VLBackbone class, which combines a 3D Vision Encoder with a Qwen2 causal language model decoder and handles vision token replacement.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone_test.py
    • Added unit tests for the Qwen2VLBackbone, covering basic functionality, vision input handling, and model saving.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_causal_lm.py
    • Implemented the Qwen2VLCausalLM for end-to-end causal vision-language modeling, including generate() and call_with_cache() methods for autoregressive inference.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_causal_lm_preprocessor.py
    • Implemented the Qwen2VLCausalLMPreprocessor to handle tokenization, image preprocessing, and assembly of inputs for the Qwen2-VL model.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_causal_lm_preprocessor_test.py
    • Added unit tests for the Qwen2VLCausalLMPreprocessor, covering text-only and image-inclusive preprocessing scenarios.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_image_converter.py
    • Implemented the Qwen2VLImageConverter for smart-resizing, normalizing, and converting images/video frames into flat patch tensors for the vision encoder.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_image_converter_test.py
    • Added unit tests for the Qwen2VLImageConverter, including smart-resize logic, aspect ratio handling, and image normalization.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_presets.py
    • Added a placeholder file for Qwen2-VL preset configurations.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_tokenizer.py
    • Implemented the Qwen2VLTokenizer based on QwenTokenizer, including logic to resolve vision-related special token IDs.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder.py
    • Implemented the Qwen2VLVisionEncoder, a 3D ViT with Rotary Position Embeddings (RoPE) and a PatchMerger for processing vision inputs.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder_test.py
    • Added unit tests for the Qwen2VLVisionEncoder, verifying patch merging, configuration roundtrip, and rotary embeddings.
  • keras_hub/src/utils/transformers/convert_qwen2_vl.py
    • Added a new conversion script to port Hugging Face Qwen2-VL weights and tokenizer configurations to Keras Hub format.
  • keras_hub/src/utils/transformers/preset_loader.py
    • Updated the PresetLoader to include the new convert_qwen2_vl module, enabling loading of Qwen2-VL presets.
  • tools/checkpoint_conversion/convert_qwen2_vl_checkpoints.py
    • Added a script to convert and verify Hugging Face Qwen2-VL checkpoints to Keras Hub format, including tokenizer, preprocessor, and backbone output verification.
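For reviewers unfamiliar with the smart-resize step the image converter changelog mentions, here is a hedged sketch of the rule as used in the reference Qwen2-VL processor (the default factor and pixel bounds are assumptions and may differ from this PR): each side is rounded to a multiple of patch_size × merge_size, and the total pixel count is kept within a [min_pixels, max_pixels] window while roughly preserving aspect ratio.

```python
import math

def smart_resize(height, width, factor=28, min_pixels=56 * 56,
                 max_pixels=14 * 14 * 4 * 1280):
    """Sketch of smart-resize: round each side to a multiple of `factor`
    (patch size x spatial merge size), then rescale if the pixel count
    falls outside [min_pixels, max_pixels]. Defaults are assumed from
    the reference processor, not taken from this PR."""
    h = round(height / factor) * factor
    w = round(width / factor) * factor
    if h * w > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w

h, w = smart_resize(1080, 1920)
print(h, w)  # both sides divisible by 28, h * w within the pixel budget
```

Rounding to multiples of the patch-times-merge factor guarantees the resized image tiles exactly into the vision encoder's patch grid.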
Activity
  • The pull request introduces the Qwen2-VL model, with the author, samudraneel05, providing detailed documentation and verification notebooks for output matching and numerics.
  • The author has confirmed that all necessary unit tests are added, existing code remains functional across all backends (TensorFlow, JAX, PyTorch), and Keras Hub's model and API design guidelines have been followed.
  • The inclusion of multiple Colab notebooks for numerics verification, tokenizer comparison, and preprocessor comparison demonstrates a thorough approach to validating the new model's implementation.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Qwen2-VL multimodal model, including its backbone, causal LM task, preprocessor, image converter, and tokenizer. The implementation is comprehensive and well-structured, with thorough testing and a detailed checkpoint conversion script. The code largely adheres to the repository's style guide, particularly in its backend-agnostic implementation and modular design. I have two main suggestions for improvement: one is to populate the presets file to enable from_preset() functionality and testing, and the other is to refactor some duplicated code for scattering vision embeddings to improve maintainability. Overall, this is a high-quality contribution.

Comment on lines +201 to +219
# Scatter vision features into image placeholder positions.
if img_embeddings is not None:
    image_mask = ops.equal(
        token_ids,
        ops.cast(self.backbone.image_token_id, token_ids.dtype),
    )
    batch_size = ops.shape(x)[0]
    seq_len = ops.shape(x)[1]
    x_flat = ops.reshape(x, (-1, self.backbone.hidden_dim))
    mask_flat = ops.reshape(image_mask, (-1,))
    vision_indices = ops.where(mask_flat)
    if isinstance(vision_indices, (list, tuple)):
        vision_indices = vision_indices[0]
    vision_indices = ops.reshape(vision_indices, (-1, 1))
    vision_indices = ops.cast(vision_indices, "int32")
    x_flat = ops.scatter_update(x_flat, vision_indices, img_embeddings)
    x = ops.reshape(
        x_flat, (batch_size, seq_len, self.backbone.hidden_dim)
    )
Contributor


medium

The logic for scattering vision features into the text embeddings is duplicated between this method (call_with_cache) and Qwen2VLBackbone.call. To improve maintainability and adhere to the principles of modularity and reusability, consider refactoring this logic into a helper method within the Qwen2VLBackbone class. This helper could take the text embeddings, token IDs, and vision features as input and return the updated text embeddings. Both Qwen2VLBackbone.call and Qwen2VLCausalLM.call_with_cache could then call this shared method.

References
  1. The style guide emphasizes modularity and reusability. Refactoring duplicated code into a shared helper method aligns with these key principles. (link)
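A minimal sketch of the shared helper the reviewer suggests, using NumPy in place of keras.ops for illustration (the helper name and signature are hypothetical, not code from the PR):

```python
import numpy as np

def inject_vision_embeddings(x, token_ids, img_embeddings, image_token_id):
    """Hypothetical shared helper: scatter vision features into the
    positions of `token_ids` holding the image placeholder token,
    mirroring the logic duplicated between Qwen2VLBackbone.call and
    Qwen2VLCausalLM.call_with_cache."""
    batch_size, seq_len, hidden_dim = x.shape
    x_flat = x.reshape(-1, hidden_dim).copy()
    mask_flat = (token_ids == image_token_id).reshape(-1)
    # NumPy equivalent of ops.scatter_update on the flat sequence:
    # one vision feature row per placeholder position, in order.
    x_flat[np.where(mask_flat)[0]] = img_embeddings
    return x_flat.reshape(batch_size, seq_len, hidden_dim)

x = np.zeros((1, 4, 2))
token_ids = np.array([[5, 99, 99, 7]])  # 99 = image placeholder id
img = np.ones((2, 2))                   # one row per image token
out = inject_vision_embeddings(x, token_ids, img, image_token_id=99)
print(out[0, 1])  # [1. 1.] — placeholder slots now hold vision features
```

With such a helper, both call sites collapse to a single line, and any future fix (e.g. dtype handling of the indices) lands in one place.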

@sachinprasadhs
Collaborator

My bad, I got confused by the GitHub handle names in the comment where I mentioned Qwen2-VL.
Since the original issue assignee created PR #2599 before this one, can you please close this PR and let him finish his, as he is the original assignee?
Since you already have one PR open for the Omni model, you can focus on that completely.

Sorry again for the confusion and inconvenience.

@samudraneel05
Author

I've reached out to the original issue assignee to see if we can do a best-of-both-worlds model addition. I'll be closing this PR.

The Omni model PR is, and has been, ready for review for a while!


