[VLM] Support Qwen3-VL model#3253
Conversation
Pull request overview
This PR adds support for the Qwen3-VL vision-language model to the GenAI VLM pipeline, enabling stateful inference for this model variant.
Changes:
- Adds Qwen3-VL as a new model type with position embedding interpolation
- Introduces additional language model inputs (deepstack_visual_embeds, visual_pos_masks) for Qwen3-VL
- Extends JSON parameter reading to support nested keys with dot notation
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vlm_config.hpp | Adds QWEN3_VL enum and configuration fields for position embeddings |
| vlm_config.cpp | Registers qwen3_vl model type and reads new config parameters |
| vision_encoder.cpp | Adds factory logic to instantiate Qwen3-VL vision encoder |
| qwen3_vl/classes.hpp | Defines Qwen3-VL specific encoder and embedder classes with position interpolation |
| qwen3_vl/classes.cpp | Implements position interpolation, spatial merging, and visual masking for Qwen3-VL |
| qwen2vl/classes.hpp | Makes get_rotary_pos_emb virtual and adds merge_text_and_video_image_embeddings utility |
| qwen2vl/classes.cpp | Adds const qualifier to get_rotary_pos_emb |
| processor_config.cpp | Handles Qwen3-VL's alternative config format (shortest_edge/longest_edge) |
| pipeline.cpp | Passes extra language model inputs during generation and includes Qwen3-VL in SDPA check |
| inputs_embedder.hpp | Adds get_lm_extra_inputs interface method |
| inputs_embedder.cpp | Implements factory logic for Qwen3-VL embedder |
| lm_encoding.hpp | Adds lm_extra_inputs parameter to function signature |
| lm_encoding.cpp | Sets extra inputs on language model inference requests |
| json_utils.hpp | Adds support for nested JSON keys with dot notation |
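The dot-notation nested-key lookup added in `json_utils.hpp` can be sketched as follows. This is not the PR's implementation (which operates on the project's JSON types); it is a standard-library illustration of the traversal idea, with a hypothetical `Object` tree standing in for parsed JSON.

```cpp
#include <any>
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical JSON-like tree: values are either std::string leaves or nested Objects.
using Object = std::map<std::string, std::any>;

// Resolve a dotted key such as "vision_config.spatial_merge_size" against the tree.
std::optional<std::string> read_nested(const Object& root, const std::string& dotted_key) {
    // Split the key on '.' into path segments.
    std::vector<std::string> parts;
    std::stringstream ss(dotted_key);
    std::string part;
    while (std::getline(ss, part, '.')) parts.push_back(part);

    const Object* current = &root;
    for (size_t i = 0; i < parts.size(); ++i) {
        auto it = current->find(parts[i]);
        if (it == current->end()) return std::nullopt;  // missing segment
        if (i + 1 == parts.size()) {
            // Last segment: expect a leaf value.
            if (auto* leaf = std::any_cast<std::string>(&it->second)) return *leaf;
            return std::nullopt;
        }
        // Intermediate segment: expect a nested object.
        auto* obj = std::any_cast<Object>(&it->second);
        if (!obj) return std::nullopt;
        current = obj;
    }
    return std::nullopt;
}
```

The same pattern lets config readers address fields like nested vision-encoder parameters with a single flat key string instead of manual step-by-step lookups.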
WWB results (image inputs):
@yatarkan Could you please also run WWB with video inputs? Here is the instruction: https://github.com/openvinotoolkit/openvino.genai/tree/master/tools/who_what_benchmark#compare-visual-language-models-with-video-inputs-vlms
…oat, fix review comments
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 1 comment.
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.
995fa02
Description
This PR enables the Qwen3-VL model in the GenAI VLM pipeline.
Supports SDPA and PA backends in the VLM pipeline and the Continuous Batching pipeline (both `generate()` and `add_request()` APIs).
Depends on the latest Optimum Intel and `transformers>=4.57.0` for model exporting.
CVS-175825
Resolves #2998
Checklist: