
[OpenVINO] Support Qwen3-Omni MoE with full Talker speech stack #1700

Open
sgonorov wants to merge 1 commit into huggingface:main from sgonorov:qwen-3-omni-moe-main

Conversation

@sgonorov sgonorov commented Apr 24, 2026

What does this PR do?

Extends the Qwen3-Omni dense work in #1640 to the MoE variant and adds the full Talker / speech-generation path. Exports Qwen/Qwen3-Omni-30B-A3B-Instruct-class models as 10 sub-models.

A new public class OVModelForOmni (exposed from optimum.intel) wraps the Thinker (_OVQwen3OmniMoeForCausalLM) and adds Talker/code-predictor/code2wav orchestration behind a GenerationMixin-compatible interface. Standard text/image/audio/image+audio input combinations still work through OVModelForVisualCausalLM for Thinker-only use cases.
Usage

Conversion:

optimum-cli export openvino -m Qwen/Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-30B-A3B-Instruct --trust-remote-code

Inference (text + speech output):

from optimum.intel import OVModelForOmni
from transformers import AutoProcessor

model_path = "./Qwen3-Omni-30B-A3B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = OVModelForOmni.from_pretrained(model_path)

# Thinker-only (text out)
inputs = processor(text="Describe Intel OpenVINO in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
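
The snippet above covers the Thinker-only text path; for the speech output promised in the heading, a hedged sketch follows. The return_audio and speaker keyword arguments mirror the upstream transformers Qwen-Omni generate() API and are assumptions about OVModelForOmni's interface, not confirmed by this PR:

# Hedged sketch: Talker speech path. `return_audio` and `speaker` follow the
# upstream transformers Qwen-Omni generate() kwargs and are assumed to carry
# over to OVModelForOmni; check the PR diff for the actual signature.
import soundfile as sf

text_ids, audio = model.generate(**inputs, max_new_tokens=64, return_audio=True, speaker="Ethan")
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)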

All-to-all example:

import openvino_genai as ov_genai
import openvino as ov
import numpy as np
import soundfile as sf
from PIL import Image

# Load Qwen3-Omni model
pipe = ov_genai.VLMPipeline("Qwen3-Omni-7B-INT4", "CPU")

# Prepare multimodal inputs
image = np.array(Image.open("vacation_photo.jpg"))
image_tensor = ov.Tensor(np.expand_dims(image, 0))

audio_data, _ = sf.read("question.wav", dtype="float32")  # "Where was this taken?"
audio_tensor = ov.Tensor(audio_data)

# Configure multimodal output (text + speech)
config = ov_genai.GenerationConfig()
config.return_audio = True
config.speaker = "f245"  # Female voice
config.max_new_tokens = 256

# Generate: (text + image + audio) → (text + speech)
result = pipe.generate(
    "Listen to the question and answer based on what you see.",
    images=[image_tensor],
    audios=[audio_tensor],
    generation_config=config,
)

print(result.texts[0])
# "Based on the palm trees and architecture, this photo was taken in Miami Beach."

sf.write("answer.wav", result.speech_outputs[0].data, 24000)
# Saves spoken response as audio (24 kHz WAV)

Notes

  • Qwen3-Omni MoE is registered in transformers only under MODEL_FOR_TEXT_TO_WAVEFORM; this PR adds the ASR and image-text-to-text task aliases via TasksManager._CUSTOM_CLASSES so that optimum-cli can route the model correctly (see the sketch after this list).
  • int8 sub-model expected-node counts for all 10 components are added to _ARCHITECTURES_TO_EXPECTED_INT8 for quantization regression tests.
  • Tiny test fixture (tests/openvino/models/tiny_qwen3_omni_moe.py) builds a locally-synthesized Qwen3-Omni MoE + Talker model and is registered in conftest.py so CI does not need to pull the full 30B checkpoint.
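
For context on the first note, a minimal sketch of what such a task-alias registration looks like, assuming the upstream TasksManager._CUSTOM_CLASSES layout of (framework, model_type, task) keys; the exact task keys and target class used by this PR are assumptions:

# Hedged sketch of the task-alias registration described above; the entries
# are illustrative, not copied from the PR.
from optimum.exporters.tasks import TasksManager

TasksManager._CUSTOM_CLASSES.update(
    {
        ("pt", "qwen3_omni_moe", "automatic-speech-recognition"): (
            "transformers",
            "Qwen3OmniMoeForConditionalGeneration",
        ),
        ("pt", "qwen3_omni_moe", "image-text-to-text"): (
            "transformers",
            "Qwen3OmniMoeForConditionalGeneration",
        ),
    }
)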

Before submitting

  • This PR fixes a typo or improves the docs.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@sgonorov sgonorov force-pushed the qwen-3-omni-moe-main branch from 36e63c3 to a618ab9 on April 27, 2026 at 09:02
@rkazants rkazants requested review from echarlaix and popovaan April 28, 2026 12:20
@rkazants (Collaborator)

@echarlaix, @popovaan, please take a look at this PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread: optimum/exporters/openvino/__main__.py (Outdated)

model_type = config.model_type
if model_type in ["phi4mm", "phi4_multimodal"]:
if model_type == "qwen3_omni_moe":
    task = "image-text-to-text"

Collaborator:

Let us have an "any-to-any" task, please.
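
For reference, the requested change might look like this hedged sketch against the hunk above (the "any-to-any" task string is taken from the reviewer's comment, not from merged code):

# Hedged sketch of the reviewer's request, not the PR's current code.
if model_type == "qwen3_omni_moe":
    task = "any-to-any"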

@rkazants rkazants left a comment (Collaborator)

Please provide an any-to-any sample in the PR description.

@sgonorov sgonorov force-pushed the qwen-3-omni-moe-main branch 2 times, most recently from a98964a to f2fc0ef on May 3, 2026 at 22:39

@rkazants rkazants left a comment (Collaborator)

Please do not provide a code snippet using OV GenAI.
Provide the any-to-any case using the optimum-intel API instead.

Copilot AI left a comment (Contributor)

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds OpenVINO support for qwen3_omni_moe, including export/load plumbing for the Thinker + Talker speech stack and a new OVModelForOmni wrapper for multimodal generation with optional audio output.

Changes:

  • Registers Qwen3-Omni MoE across exporter/task routing, model loading, quantization, and pipeline dispatch.
  • Implements new OpenVINO model parts and export patchers for Talker, code predictor, code2wav, audio encoder, and Qwen3-Omni-specific vision/language handling.
  • Expands tests with a tiny local fixture, CLI/export/quantization coverage, and integration checks for OVModelForOmni.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 10 comments.

Summary per file:

  • tests/openvino/utils_tests.py: Adds expected INT8 node counts for qwen3_omni_moe.
  • tests/openvino/test_seq2seq.py: Extends multimodal integration tests and adds OVModelForOmni coverage.
  • tests/openvino/test_quantization.py: Adds quantization/weight-compression cases for qwen3_omni_moe.
  • tests/openvino/test_exporters_cli.py: Adds CLI export coverage and custom loader handling for qwen3_omni_moe.
  • tests/openvino/test_export.py: Adds direct export tests and expected part assertions for Qwen3-Omni MoE.
  • tests/openvino/test_decoder.py: Excludes Qwen3-Omni internal text parts from decoder untested-architecture checks.
  • tests/openvino/test_audit_fixes.py: Adds focused regression tests for dispatch, talker guards, and OVModelForOmni.
  • tests/openvino/models/tiny_qwen3_omni_moe.py: Introduces a tiny synthetic Qwen3-Omni MoE fixture generator.
  • tests/openvino/conftest.py: Registers the tiny Qwen3-Omni MoE fixture for session-wide tests.
  • optimum/intel/pipelines/accelerator_utils.py: Routes supported omni tasks to OVModelForOmni.
  • optimum/intel/openvino/modeling_visual_language.py: Implements Qwen3-Omni MoE runtime wrappers, audio generation flow, and OVModelForOmni.
  • optimum/intel/openvino/__init__.py: Exposes OVModelForOmni from the OpenVINO package.
  • optimum/intel/__init__.py: Re-exports OVModelForOmni and applies a Transformers compatibility patch.
  • optimum/exporters/openvino/utils.py: Treats qwen3_omni_moe as multimodal and copies preprocessor config during export.
  • optimum/exporters/openvino/model_patcher.py: Adds Qwen3-Omni MoE export patchers for vision/audio/talker/code2wav paths.
  • optimum/exporters/openvino/model_configs.py: Registers exporter configs and task aliases for Qwen3-Omni MoE submodels.
  • optimum/exporters/openvino/convert.py: Adds runtime-option handling, tokenizer behavior changes, and auxiliary weight saving.
  • optimum/exporters/openvino/__main__.py: Adjusts quantize CLI task redirection for qwen3_omni_moe.


Comment on lines +450 to +458

if model_type in ("qwen3_omni_moe", "qwen3_omni", "qwen2_vl"):
    try:
        import shutil

        source_preprocessor = Path(model_name_or_path) / "preprocessor_config.json"
        dest_preprocessor = Path(output) / "preprocessor_config.json"

        if source_preprocessor.exists():
            shutil.copy2(source_preprocessor, dest_preprocessor)
            logger.info("Copied preprocessor_config.json from source model")
Comment on lines +4387 to +4399

if isinstance(audio, (list, tuple)) and len(audio) == 1:
    audio = audio[0]
if isinstance(audio, tuple):
    audio = audio[0]

conversation = [{"role": "user", "content": [{"type": "text", "text": text}]}]
if image is not None:
    conversation[0]["content"].insert(0, {"type": "image"})
if audio is not None:
    conversation[0]["content"].insert(0, {"type": "audio"})

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=text_prompt, audio=audio, return_tensors="pt")

if (
    task is not None
    and (task.startswith("text-generation") or task == "image-text-to-text")
    and (task.startswith("text-generation") or any(t in task for t in _VLM_LANGUAGE_MODEL_TASKS))
Comment on lines +4647 to +4678

padded_feature = padded_feature.unsqueeze(1)
padded_embed = torch.nn.functional.gelu(self.conv2d1(padded_feature))
padded_embed = torch.nn.functional.gelu(self.conv2d2(padded_embed))
padded_embed = torch.nn.functional.gelu(self.conv2d3(padded_embed))
b, c, f, t = padded_embed.size()
padded_embed = self.conv_out(padded_embed.permute(0, 3, 1, 2).contiguous().view(b, t, c * f))

positional_embedding = (
    self.positional_embedding.positional_embedding[: padded_embed.shape[1], :]
    .unsqueeze(0)
    .to(padded_embed.dtype)
)
padded_embed = padded_embed + positional_embedding

# Flatten rather than boolean-index: the latter bakes a data-dependent shape that OV can't trace.
# Encoder layers run with eager attention during export, so cu_seqlens don't affect the output.
b, t, d = padded_embed.shape
hidden_states = padded_embed.reshape(b * t, d)

for encoder_layer in self.layers:
    layer_outputs = encoder_layer(hidden_states, cu_seqlens)
    hidden_states = layer_outputs[0]

hidden_states = self.ln_post(hidden_states)
hidden_states = self.proj1(hidden_states)
hidden_states = self.act(hidden_states)
hidden_states = self.proj2(hidden_states)

hidden_states = hidden_states.reshape(b, t, -1)
hidden_states = hidden_states * padded_mask_after_cnn.to(hidden_states.dtype).unsqueeze(-1)
return hidden_states
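
The flatten-rather-than-boolean-index comment above carries the key reasoning of this hunk; a standalone illustration (not from the PR) of why boolean indexing breaks tracing:

# Standalone illustration, not PR code: boolean indexing yields a
# data-dependent output shape, so a traced graph bakes in whatever length the
# example mask happened to produce; reshape keeps shapes static.
import torch

def boolean_index(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    return x[mask]  # output length == mask.sum() -> depends on data values

def flatten(x: torch.Tensor) -> torch.Tensor:
    b, t, d = x.shape
    return x.reshape(b * t, d)  # shape is a pure function of input dims

x = torch.randn(1, 4, 8)
print(boolean_index(x[0], torch.tensor([True, False, True, False])).shape)  # torch.Size([2, 8])
print(flatten(x).shape)  # torch.Size([4, 8])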

Comment on lines +4678 to +4695

if self.code_predictor is not None and num_code_groups > 1:
    self.code_predictor.reset()

    # HF: inputs_embeds=torch.cat((past_hidden, last_id_hidden), dim=1)
    cp_prefill = torch.cat([hidden_states[:, -1:, :], first_code_embed], dim=1)
    cp_logits, cp_hidden = self.code_predictor(
        inputs_embeds=cp_prefill,
        generation_steps=0,
    )

    for cp_step in range(num_code_groups - 1):
        cp_next_logits = cp_logits[:, -1, :]
        cp_probs = torch.nn.functional.softmax(cp_next_logits, dim=-1)
        cp_token = torch.multinomial(cp_probs, num_samples=1).squeeze(-1)
        step_codes.append(cp_token.item())

        cp_embed = self._embed_cp_token(cp_token.unsqueeze(0), cp_step)
        codec_hiddens.append(cp_embed)
else:
    logger.warning(
        "code_predictor_codec_embedding.npy not found — "
        "CodePredictor will use degraded fallback for token embedding"
Comment on lines +35 to +39

"<|vision_bos|>",
"<|vision_eos|>",
"<|AUDIO|>",
"<|IMAGE|>",
"<|VIDEO|>",

"{% else %}"
"{% for content in message['content'] %}"
"{% if content['type'] == 'image' %}"
"{{ '<|vision_start|><|image_pad|><|vision_end|>' }}"
Comment thread: optimum/intel/__init__.py

Comment on lines +28 to +44

# Patch Transformers 5.0 Qwen3OmniMoeTalkerCodePredictorConfig bug
# Bug: __init__ references self.use_sliding_window and self.max_window_layers before they're set
if is_transformers_version(">=", "5.0"):
    try:
        from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import (
            Qwen3OmniMoeTalkerCodePredictorConfig,
        )

        _original_code_predictor_init = Qwen3OmniMoeTalkerCodePredictorConfig.__init__

        def _patched_code_predictor_init(self, *args, use_sliding_window=False, max_window_layers=28, **kwargs):
            # Set these attributes before calling the original __init__, which references them
            self.use_sliding_window = use_sliding_window
            self.max_window_layers = max_window_layers
            _original_code_predictor_init(
                self, *args, use_sliding_window=use_sliding_window, max_window_layers=max_window_layers, **kwargs
            )

        Qwen3OmniMoeTalkerCodePredictorConfig.__init__ = _patched_code_predictor_init
Comment on lines +1148 to +1150

t = np.linspace(0, 5.0, int(5.0 * 22050), endpoint=False)
audio_data = 0.5 * np.sin(2 * np.pi * 220 * t)
return (audio_data, 16000)
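
The hunk above synthesizes samples at 22050 Hz but reports a 16000 Hz rate; a consistent version follows, assuming 16 kHz is the intended fixture rate (the assumption is the editor's, not the PR's):

# Hedged fix sketch: one consistent sample rate for the synthetic fixture.
import numpy as np

SAMPLE_RATE = 16000  # assumed intended rate; the original mixes 22050 and 16000

def tiny_audio_fixture(duration_s: float = 5.0, freq_hz: float = 220.0):
    # duration_s seconds of a freq_hz sine tone at SAMPLE_RATE
    t = np.linspace(0, duration_s, int(duration_s * SAMPLE_RATE), endpoint=False)
    audio_data = 0.5 * np.sin(2 * np.pi * freq_hz * t)
    return audio_data, SAMPLE_RATE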