[OpenVINO] Support Qwen3-Omni MoE with full Talker speech stack #1700
sgonorov wants to merge 1 commit into huggingface:main
Conversation
Force-pushed from 36e63c3 to a618ab9
@echarlaix, @popovaan, please take a look at this PR.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
model_type = config.model_type
if model_type in ["phi4mm", "phi4_multimodal"]:
    ...  # unrelated branch, elided in the diff
if model_type == "qwen3_omni_moe":
    task = "image-text-to-text"
```
let us have "any-to-any" task please
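For reference, a sketch of the requested change at the same dispatch site (assumes an `any-to-any` alias is registered for the model):

```python
if model_type == "qwen3_omni_moe":
    task = "any-to-any"
```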
rkazants left a comment:
Please provide an any-to-any sample in the PR description.
Force-pushed from a98964a to f2fc0ef
Force-pushed from f2fc0ef to 9a84572
rkazants left a comment:
Please do not provide a code snippet with OpenVINO GenAI. Provide an any-to-any case using the optimum-intel API.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds OpenVINO support for qwen3_omni_moe, including export/load plumbing for the Thinker + Talker speech stack and a new OVModelForOmni wrapper for multimodal generation with optional audio output.
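A hedged glance at the new surface described above (the class names come from this PR; the model directory is a placeholder):

```python
from optimum.intel import OVModelForOmni, OVModelForVisualCausalLM

# Full Thinker + Talker stack, including optional speech output
omni = OVModelForOmni.from_pretrained("path/to/exported-qwen3-omni-moe")

# Thinker-only use cases keep working through the existing VLM class
thinker = OVModelForVisualCausalLM.from_pretrained("path/to/exported-qwen3-omni-moe")
```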
Changes:
- Registers Qwen3-Omni MoE across exporter/task routing, model loading, quantization, and pipeline dispatch.
- Implements new OpenVINO model parts and export patchers for Talker, code predictor, code2wav, audio encoder, and Qwen3-Omni-specific vision/language handling.
- Expands tests with a tiny local fixture, CLI/export/quantization coverage, and integration checks for `OVModelForOmni`.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `tests/openvino/utils_tests.py` | Adds expected INT8 node counts for `qwen3_omni_moe`. |
| `tests/openvino/test_seq2seq.py` | Extends multimodal integration tests and adds `OVModelForOmni` coverage. |
| `tests/openvino/test_quantization.py` | Adds quantization/weight-compression cases for `qwen3_omni_moe`. |
| `tests/openvino/test_exporters_cli.py` | Adds CLI export coverage and custom loader handling for `qwen3_omni_moe`. |
| `tests/openvino/test_export.py` | Adds direct export tests and expected part assertions for Qwen3-Omni MoE. |
| `tests/openvino/test_decoder.py` | Excludes Qwen3-Omni internal text parts from decoder untested-architecture checks. |
| `tests/openvino/test_audit_fixes.py` | Adds focused regression tests for dispatch, talker guards, and `OVModelForOmni`. |
| `tests/openvino/models/tiny_qwen3_omni_moe.py` | Introduces a tiny synthetic Qwen3-Omni MoE fixture generator. |
| `tests/openvino/conftest.py` | Registers the tiny Qwen3-Omni MoE fixture for session-wide tests. |
| `optimum/intel/pipelines/accelerator_utils.py` | Routes supported omni tasks to `OVModelForOmni`. |
| `optimum/intel/openvino/modeling_visual_language.py` | Implements Qwen3-Omni MoE runtime wrappers, audio generation flow, and `OVModelForOmni`. |
| `optimum/intel/openvino/__init__.py` | Exposes `OVModelForOmni` from the OpenVINO package. |
| `optimum/intel/__init__.py` | Re-exports `OVModelForOmni` and applies a Transformers compatibility patch. |
| `optimum/exporters/openvino/utils.py` | Treats `qwen3_omni_moe` as multimodal and copies preprocessor config during export. |
| `optimum/exporters/openvino/model_patcher.py` | Adds Qwen3-Omni MoE export patchers for vision/audio/talker/code2wav paths. |
| `optimum/exporters/openvino/model_configs.py` | Registers exporter configs and task aliases for Qwen3-Omni MoE submodels. |
| `optimum/exporters/openvino/convert.py` | Adds runtime-option handling, tokenizer behavior changes, and auxiliary weight saving. |
| `optimum/exporters/openvino/__main__.py` | Adjusts quantize CLI task redirection for `qwen3_omni_moe`. |
```python
if model_type in ("qwen3_omni_moe", "qwen3_omni", "qwen2_vl"):
    try:
        import shutil

        source_preprocessor = Path(model_name_or_path) / "preprocessor_config.json"
        dest_preprocessor = Path(output) / "preprocessor_config.json"

        if source_preprocessor.exists():
            shutil.copy2(source_preprocessor, dest_preprocessor)
            logger.info("Copied preprocessor_config.json from source model")
```
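Context: the copied `preprocessor_config.json` is what allows the processor to be restored from the export output afterwards; a hedged check, with the output directory name as a placeholder:

```python
from transformers import AutoProcessor

# Succeeds only because preprocessor_config.json was copied into the export output.
processor = AutoProcessor.from_pretrained("Qwen3-Omni-30B-A3B-Instruct", trust_remote_code=True)
```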
```python
if isinstance(audio, (list, tuple)) and len(audio) == 1:
    audio = audio[0]
if isinstance(audio, tuple):
    audio = audio[0]

conversation = [{"role": "user", "content": [{"type": "text", "text": text}]}]
if image is not None:
    conversation[0]["content"].insert(0, {"type": "image"})
if audio is not None:
    conversation[0]["content"].insert(0, {"type": "audio"})

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=text_prompt, audio=audio, return_tensors="pt")
```
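The tuple-unwrapping at the top exists because common audio loaders return `(waveform, sampling_rate)` pairs; a hypothetical call that hits that path:

```python
import librosa

# librosa.load returns a (waveform, sampling_rate) tuple
audio = librosa.load("question.wav", sr=16000)
# both `audio` and `[audio]` are normalized to the bare waveform by the checks above
```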
```diff
 if (
     task is not None
-    and (task.startswith("text-generation") or task == "image-text-to-text")
+    and (task.startswith("text-generation") or any(t in task for t in _VLM_LANGUAGE_MODEL_TASKS))
```
```python
padded_feature = padded_feature.unsqueeze(1)
padded_embed = torch.nn.functional.gelu(self.conv2d1(padded_feature))
padded_embed = torch.nn.functional.gelu(self.conv2d2(padded_embed))
padded_embed = torch.nn.functional.gelu(self.conv2d3(padded_embed))
b, c, f, t = padded_embed.size()
padded_embed = self.conv_out(padded_embed.permute(0, 3, 1, 2).contiguous().view(b, t, c * f))

positional_embedding = (
    self.positional_embedding.positional_embedding[: padded_embed.shape[1], :]
    .unsqueeze(0)
    .to(padded_embed.dtype)
)
padded_embed = padded_embed + positional_embedding

# Flatten rather than boolean-index: the latter bakes a data-dependent shape that OV can't trace.
# Encoder layers run with eager attention during export, so cu_seqlens don't affect the output.
b, t, d = padded_embed.shape
hidden_states = padded_embed.reshape(b * t, d)

for encoder_layer in self.layers:
    layer_outputs = encoder_layer(hidden_states, cu_seqlens)
    hidden_states = layer_outputs[0]

hidden_states = self.ln_post(hidden_states)
hidden_states = self.proj1(hidden_states)
hidden_states = self.act(hidden_states)
hidden_states = self.proj2(hidden_states)

hidden_states = hidden_states.reshape(b, t, -1)
hidden_states = hidden_states * padded_mask_after_cnn.to(hidden_states.dtype).unsqueeze(-1)
return hidden_states
```
```python
if self.code_predictor is not None and num_code_groups > 1:
    self.code_predictor.reset()

    # HF: inputs_embeds=torch.cat((past_hidden, last_id_hidden), dim=1)
    cp_prefill = torch.cat([hidden_states[:, -1:, :], first_code_embed], dim=1)
    cp_logits, cp_hidden = self.code_predictor(
        inputs_embeds=cp_prefill,
        generation_steps=0,
    )

    for cp_step in range(num_code_groups - 1):
        cp_next_logits = cp_logits[:, -1, :]
        cp_probs = torch.nn.functional.softmax(cp_next_logits, dim=-1)
        cp_token = torch.multinomial(cp_probs, num_samples=1).squeeze(-1)
        step_codes.append(cp_token.item())

        cp_embed = self._embed_cp_token(cp_token.unsqueeze(0), cp_step)
        codec_hiddens.append(cp_embed)
```
```python
else:
    logger.warning(
        "code_predictor_codec_embedding.npy not found — "
        "CodePredictor will use degraded fallback for token embedding"
    )
```
| "<|vision_bos|>", | ||
| "<|vision_eos|>", | ||
| "<|AUDIO|>", | ||
| "<|IMAGE|>", | ||
| "<|VIDEO|>", |
| "{% else %}" | ||
| "{% for content in message['content'] %}" | ||
| "{% if content['type'] == 'image' %}" | ||
| "{{ '<|vision_start|><|image_pad|><|vision_end|>' }}" |
```python
# Patch Transformers 5.0 Qwen3OmniMoeTalkerCodePredictorConfig bug
# Bug: __init__ references self.use_sliding_window and self.max_window_layers before they're set
if is_transformers_version(">=", "5.0"):
    try:
        from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import (
            Qwen3OmniMoeTalkerCodePredictorConfig,
        )

        _original_code_predictor_init = Qwen3OmniMoeTalkerCodePredictorConfig.__init__

        def _patched_code_predictor_init(self, *args, use_sliding_window=False, max_window_layers=28, **kwargs):
            # Set these attributes before calling original __init__ which references them
            self.use_sliding_window = use_sliding_window
            self.max_window_layers = max_window_layers
            _original_code_predictor_init(
                self, *args, use_sliding_window=use_sliding_window, max_window_layers=max_window_layers, **kwargs
            )

        Qwen3OmniMoeTalkerCodePredictorConfig.__init__ = _patched_code_predictor_init
```
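A minimal, Transformers-independent reproduction of the bug class this patch works around:

```python
class Broken:
    def __init__(self, use_sliding_window=False):
        # Reading the attribute before it is assigned raises AttributeError;
        # the patch above assigns it up front, then delegates to the original __init__.
        if self.use_sliding_window:
            pass
        self.use_sliding_window = use_sliding_window

try:
    Broken()
except AttributeError as err:
    print(err)  # 'Broken' object has no attribute 'use_sliding_window'
```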
```python
sample_rate = 16000  # declared rate must match the rate used to synthesize the tone
t = np.linspace(0, 5.0, int(5.0 * sample_rate), endpoint=False)
audio_data = 0.5 * np.sin(2 * np.pi * 220 * t)  # 5 seconds of a 220 Hz sine
return (audio_data, sample_rate)
```
## What does this PR do?

Extends the Qwen3-Omni dense work in #1640 to the MoE variant and adds the full Talker / speech-generation path. Exports `Qwen/Qwen3-Omni-30B-A3B-Instruct`-class models as 10 sub-models.

A new public class `OVModelForOmni` (exposed from `optimum.intel`) wraps the Thinker (`_OVQwen3OmniMoeForCausalLM`) and adds Talker/code-predictor/code2wav orchestration behind a `GenerationMixin`-compatible interface. Standard text/image/audio/image+audio input combinations still work through `OVModelForVisualCausalLM` for Thinker-only use cases.

## Usage
Conversion:

```bash
optimum-cli export openvino -m Qwen/Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-30B-A3B-Instruct --trust-remote-code
```

Inference (text + speech output):
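A minimal sketch (`OVModelForOmni` and its `GenerationMixin`-compatible `generate` come from this PR; the `(token_ids, waveform)` return convention and the 24 kHz output rate are assumptions borrowed from the Transformers Omni models and may differ here):

```python
import soundfile as sf
from transformers import AutoProcessor
from optimum.intel import OVModelForOmni

model_dir = "Qwen3-Omni-30B-A3B-Instruct"  # output of the export command above
processor = AutoProcessor.from_pretrained(model_dir, trust_remote_code=True)
model = OVModelForOmni.from_pretrained(model_dir)

conversation = [{"role": "user", "content": [{"type": "text", "text": "Briefly introduce yourself."}]}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, return_tensors="pt")

# Assumption: generate() returns (token_ids, waveform) when speech output is enabled.
text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
if audio is not None:
    sf.write("reply.wav", audio.reshape(-1).cpu().numpy(), samplerate=24000)  # rate assumed
```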
All-to-all example:
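A hedged any-to-any sketch (image + audio in, text + speech out); the file paths are placeholders and the audio-return convention is assumed as above:

```python
import librosa
from PIL import Image
from transformers import AutoProcessor
from optimum.intel import OVModelForOmni

model_dir = "Qwen3-Omni-30B-A3B-Instruct"
processor = AutoProcessor.from_pretrained(model_dir, trust_remote_code=True)
model = OVModelForOmni.from_pretrained(model_dir)

image = Image.open("scene.jpg")  # placeholder input
audio, sr = librosa.load("question.wav", sr=16000)  # placeholder input

conversation = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "audio"},
        {"type": "text", "text": "Describe the image and answer the spoken question."},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, audio=audio, return_tensors="pt")

text_ids, waveform = model.generate(**inputs)  # waveform return assumed, as above
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```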
## Notes

- The base Transformers auto-mapping for `qwen3_omni_moe` is `MODEL_FOR_TEXT_TO_WAVEFORM`; this PR adds the ASR and image-text-to-text task aliases via `TasksManager._CUSTOM_CLASSES` so `optimum-cli` can route correctly.
- `int8` sub-model expected-node counts for all 10 components are added to `_ARCHITECTURES_TO_EXPECTED_INT8` for quantization regression tests.
- A tiny model fixture (`tests/openvino/models/tiny_qwen3_omni_moe.py`) builds a locally-synthesized Qwen3-Omni MoE + Talker model and is registered in `conftest.py` so CI does not need to pull the full 30B checkpoint.

## Before submitting