
[OpenVINO] Support Qwen3-Omni MoE with full Talker speech stack #1700

Open
sgonorov wants to merge 1 commit into huggingface:main from sgonorov:qwen-3-omni-moe-main

Conversation

@sgonorov sgonorov commented Apr 24, 2026

What does this PR do?

Extends the Qwen3-Omni dense work in #1640 to the MoE variant and adds the full Talker / speech-generation path. Exports Qwen/Qwen3-Omni-30B-A3B-Instruct-class models as 10 sub-models.

A new public class OVModelForOmni (exposed from optimum.intel) wraps the Thinker (_OVQwen3OmniMoeForCausalLM) and adds Talker/code-predictor/code2wav orchestration behind a GenerationMixin-compatible interface. Standard text/image/audio/image+audio input combinations still work through OVModelForVisualCausalLM for Thinker-only use cases.
Usage

Conversion:

optimum-cli export openvino -m Qwen/Qwen3-Omni-30B-A3B-Instruct Qwen3-Omni-30B-A3B-Instruct --trust-remote-code

Inference (text + speech output):

from optimum.intel import OVModelForOmni
from transformers import AutoProcessor

model_path = "./Qwen3-Omni-30B-A3B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = OVModelForOmni.from_pretrained(model_path)

# Thinker-only (text out)
inputs = processor(text="Describe Intel OpenVINO in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
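
The snippet above covers the Thinker-only text path; for the speech output promised in the heading, a hedged sketch follows. The return_audio and speaker keyword arguments mirror the upstream transformers Qwen-Omni generate() API and are assumptions about OVModelForOmni's interface, not confirmed by this PR:

# Hedged sketch: Talker speech path. `return_audio` and `speaker` follow the
# upstream transformers Qwen-Omni generate() kwargs and are assumed to carry
# over to OVModelForOmni; check the PR diff for the actual signature.
import soundfile as sf

text_ids, audio = model.generate(**inputs, max_new_tokens=64, return_audio=True, speaker="Ethan")
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)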

All-to-all example:

import openvino_genai as ov_genai
import openvino as ov
import numpy as np
import soundfile as sf
from PIL import Image

# Load Qwen3-Omni model
pipe = ov_genai.VLMPipeline("Qwen3-Omni-7B-INT4", "CPU")

# Prepare multimodal inputs
image = np.array(Image.open("vacation_photo.jpg"))
image_tensor = ov.Tensor(np.expand_dims(image, 0))

audio_data, _ = sf.read("question.wav", dtype="float32")  # "Where was this taken?"
audio_tensor = ov.Tensor(audio_data)

# Configure multimodal output (text + speech)
config = ov_genai.GenerationConfig()
config.return_audio = True
config.speaker = "f245"  # Female voice
config.max_new_tokens = 256

# Generate: (text + image + audio) → (text + speech)
result = pipe.generate(
    "Listen to the question and answer based on what you see.",
    images=[image_tensor],
    audios=[audio_tensor],
    generation_config=config,
)

print(result.texts[0])
# "Based on the palm trees and architecture, this photo was taken in Miami Beach."

sf.write("answer.wav", result.speech_outputs[0].data, 24000)
# Saves spoken response as audio (24 kHz WAV)

Notes

  • Qwen3-Omni MoE is registered in transformers only under MODEL_FOR_TEXT_TO_WAVEFORM; this PR adds the ASR and image-text-to-text task aliases via TasksManager._CUSTOM_CLASSES so that optimum-cli can route the model correctly (see the sketch after this list).
  • int8 sub-model expected-node counts for all 10 components are added to _ARCHITECTURES_TO_EXPECTED_INT8 for quantization regression tests.
  • Tiny test fixture (tests/openvino/models/tiny_qwen3_omni_moe.py) builds a locally-synthesized Qwen3-Omni MoE + Talker model and is registered in conftest.py so CI does not need to pull the full 30B checkpoint.
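
For context on the first note, a minimal sketch of what such a task-alias registration looks like, assuming the upstream TasksManager._CUSTOM_CLASSES layout of (framework, model_type, task) keys; the exact task keys and target class used by this PR are assumptions:

# Hedged sketch of the task-alias registration described above; the entries
# are illustrative, not copied from the PR.
from optimum.exporters.tasks import TasksManager

TasksManager._CUSTOM_CLASSES.update(
    {
        ("pt", "qwen3_omni_moe", "automatic-speech-recognition"): (
            "transformers",
            "Qwen3OmniMoeForConditionalGeneration",
        ),
        ("pt", "qwen3_omni_moe", "image-text-to-text"): (
            "transformers",
            "Qwen3OmniMoeForConditionalGeneration",
        ),
    }
)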

Before submitting

  • This PR fixes a typo or improves the docs.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@sgonorov sgonorov force-pushed the qwen-3-omni-moe-main branch from 36e63c3 to a618ab9 on April 27, 2026 at 09:02
@rkazants rkazants requested review from echarlaix and popovaan April 28, 2026 12:20
@rkazants (Collaborator)

@echarlaix, @popovaan, please take a look at this PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment thread: optimum/exporters/openvino/__main__.py (Outdated)

model_type = config.model_type
if model_type in ["phi4mm", "phi4_multimodal"]:
if model_type == "qwen3_omni_moe":
    task = "image-text-to-text"

Collaborator:

Let us have an "any-to-any" task, please.
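
For reference, the requested change might look like this hedged sketch against the hunk above (the "any-to-any" task string is taken from the reviewer's comment, not from merged code):

# Hedged sketch of the reviewer's request, not the PR's current code.
if model_type == "qwen3_omni_moe":
    task = "any-to-any"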

@rkazants rkazants left a comment (Collaborator)

Please provide an any-to-any sample in the PR description.

@sgonorov sgonorov force-pushed the qwen-3-omni-moe-main branch 2 times, most recently from a98964a to f2fc0ef on May 3, 2026 at 22:39

@rkazants rkazants left a comment (Collaborator)

Please do not provide a code snippet using OV GenAI.
Provide the any-to-any case using the optimum-intel API instead.

Copilot AI left a comment (Contributor)

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds OpenVINO support for qwen3_omni_moe, including export/load plumbing for the Thinker + Talker speech stack and a new OVModelForOmni wrapper for multimodal generation with optional audio output.

Changes:

  • Registers Qwen3-Omni MoE across exporter/task routing, model loading, quantization, and pipeline dispatch.
  • Implements new OpenVINO model parts and export patchers for Talker, code predictor, code2wav, audio encoder, and Qwen3-Omni-specific vision/language handling.
  • Expands tests with a tiny local fixture, CLI/export/quantization coverage, and integration checks for OVModelForOmni.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 10 comments.

Summary per file:

  • tests/openvino/utils_tests.py: Adds expected INT8 node counts for qwen3_omni_moe.
  • tests/openvino/test_seq2seq.py: Extends multimodal integration tests and adds OVModelForOmni coverage.
  • tests/openvino/test_quantization.py: Adds quantization/weight-compression cases for qwen3_omni_moe.
  • tests/openvino/test_exporters_cli.py: Adds CLI export coverage and custom loader handling for qwen3_omni_moe.
  • tests/openvino/test_export.py: Adds direct export tests and expected part assertions for Qwen3-Omni MoE.
  • tests/openvino/test_decoder.py: Excludes Qwen3-Omni internal text parts from decoder untested-architecture checks.
  • tests/openvino/test_audit_fixes.py: Adds focused regression tests for dispatch, talker guards, and OVModelForOmni.
  • tests/openvino/models/tiny_qwen3_omni_moe.py: Introduces a tiny synthetic Qwen3-Omni MoE fixture generator.
  • tests/openvino/conftest.py: Registers the tiny Qwen3-Omni MoE fixture for session-wide tests.
  • optimum/intel/pipelines/accelerator_utils.py: Routes supported omni tasks to OVModelForOmni.
  • optimum/intel/openvino/modeling_visual_language.py: Implements Qwen3-Omni MoE runtime wrappers, audio generation flow, and OVModelForOmni.
  • optimum/intel/openvino/__init__.py: Exposes OVModelForOmni from the OpenVINO package.
  • optimum/intel/__init__.py: Re-exports OVModelForOmni and applies a Transformers compatibility patch.
  • optimum/exporters/openvino/utils.py: Treats qwen3_omni_moe as multimodal and copies preprocessor config during export.
  • optimum/exporters/openvino/model_patcher.py: Adds Qwen3-Omni MoE export patchers for vision/audio/talker/code2wav paths.
  • optimum/exporters/openvino/model_configs.py: Registers exporter configs and task aliases for Qwen3-Omni MoE submodels.
  • optimum/exporters/openvino/convert.py: Adds runtime-option handling, tokenizer behavior changes, and auxiliary weight saving.
  • optimum/exporters/openvino/__main__.py: Adjusts quantize CLI task redirection for qwen3_omni_moe.


Comment on lines +450 to +458

if model_type in ("qwen3_omni_moe", "qwen3_omni", "qwen2_vl"):
    try:
        import shutil

        source_preprocessor = Path(model_name_or_path) / "preprocessor_config.json"
        dest_preprocessor = Path(output) / "preprocessor_config.json"

        if source_preprocessor.exists():
            shutil.copy2(source_preprocessor, dest_preprocessor)
            logger.info("Copied preprocessor_config.json from source model")
Comment on lines +4387 to +4399

if isinstance(audio, (list, tuple)) and len(audio) == 1:
    audio = audio[0]
if isinstance(audio, tuple):
    audio = audio[0]

conversation = [{"role": "user", "content": [{"type": "text", "text": text}]}]
if image is not None:
    conversation[0]["content"].insert(0, {"type": "image"})
if audio is not None:
    conversation[0]["content"].insert(0, {"type": "audio"})

text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=text_prompt, audio=audio, return_tensors="pt")

if (
    task is not None
    and (task.startswith("text-generation") or task == "image-text-to-text")
    and (task.startswith("text-generation") or any(t in task for t in _VLM_LANGUAGE_MODEL_TASKS))
Comment on lines +4647 to +4678

padded_feature = padded_feature.unsqueeze(1)
padded_embed = torch.nn.functional.gelu(self.conv2d1(padded_feature))
padded_embed = torch.nn.functional.gelu(self.conv2d2(padded_embed))
padded_embed = torch.nn.functional.gelu(self.conv2d3(padded_embed))
b, c, f, t = padded_embed.size()
padded_embed = self.conv_out(padded_embed.permute(0, 3, 1, 2).contiguous().view(b, t, c * f))

positional_embedding = (
    self.positional_embedding.positional_embedding[: padded_embed.shape[1], :]
    .unsqueeze(0)
    .to(padded_embed.dtype)
)
padded_embed = padded_embed + positional_embedding

# Flatten rather than boolean-index: the latter bakes a data-dependent shape that OV can't trace.
# Encoder layers run with eager attention during export, so cu_seqlens don't affect the output.
b, t, d = padded_embed.shape
hidden_states = padded_embed.reshape(b * t, d)

for encoder_layer in self.layers:
    layer_outputs = encoder_layer(hidden_states, cu_seqlens)
    hidden_states = layer_outputs[0]

hidden_states = self.ln_post(hidden_states)
hidden_states = self.proj1(hidden_states)
hidden_states = self.act(hidden_states)
hidden_states = self.proj2(hidden_states)

hidden_states = hidden_states.reshape(b, t, -1)
hidden_states = hidden_states * padded_mask_after_cnn.to(hidden_states.dtype).unsqueeze(-1)
return hidden_states
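
The flatten-rather-than-boolean-index comment above carries the key reasoning of this hunk; a standalone illustration (not from the PR) of why boolean indexing breaks tracing:

# Standalone illustration, not PR code: boolean indexing yields a
# data-dependent output shape, so a traced graph bakes in whatever length the
# example mask happened to produce; reshape keeps shapes static.
import torch

def boolean_index(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    return x[mask]  # output length == mask.sum() -> depends on data values

def flatten(x: torch.Tensor) -> torch.Tensor:
    b, t, d = x.shape
    return x.reshape(b * t, d)  # shape is a pure function of input dims

x = torch.randn(1, 4, 8)
print(boolean_index(x[0], torch.tensor([True, False, True, False])).shape)  # torch.Size([2, 8])
print(flatten(x).shape)  # torch.Size([4, 8])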

Comment on lines +4678 to +4695

if self.code_predictor is not None and num_code_groups > 1:
    self.code_predictor.reset()

    # HF: inputs_embeds=torch.cat((past_hidden, last_id_hidden), dim=1)
    cp_prefill = torch.cat([hidden_states[:, -1:, :], first_code_embed], dim=1)
    cp_logits, cp_hidden = self.code_predictor(
        inputs_embeds=cp_prefill,
        generation_steps=0,
    )

    for cp_step in range(num_code_groups - 1):
        cp_next_logits = cp_logits[:, -1, :]
        cp_probs = torch.nn.functional.softmax(cp_next_logits, dim=-1)
        cp_token = torch.multinomial(cp_probs, num_samples=1).squeeze(-1)
        step_codes.append(cp_token.item())

        cp_embed = self._embed_cp_token(cp_token.unsqueeze(0), cp_step)
        codec_hiddens.append(cp_embed)
else:
    logger.warning(
        "code_predictor_codec_embedding.npy not found — "
        "CodePredictor will use degraded fallback for token embedding"
Comment on lines +35 to +39

"<|vision_bos|>",
"<|vision_eos|>",
"<|AUDIO|>",
"<|IMAGE|>",
"<|VIDEO|>",

"{% else %}"
"{% for content in message['content'] %}"
"{% if content['type'] == 'image' %}"
"{{ '<|vision_start|><|image_pad|><|vision_end|>' }}"
Comment thread: optimum/intel/__init__.py

Comment on lines +28 to +44

# Patch Transformers 5.0 Qwen3OmniMoeTalkerCodePredictorConfig bug
# Bug: __init__ references self.use_sliding_window and self.max_window_layers before they're set
if is_transformers_version(">=", "5.0"):
    try:
        from transformers.models.qwen3_omni_moe.configuration_qwen3_omni_moe import (
            Qwen3OmniMoeTalkerCodePredictorConfig,
        )

        _original_code_predictor_init = Qwen3OmniMoeTalkerCodePredictorConfig.__init__

        def _patched_code_predictor_init(self, *args, use_sliding_window=False, max_window_layers=28, **kwargs):
            # Set these attributes before calling the original __init__, which references them
            self.use_sliding_window = use_sliding_window
            self.max_window_layers = max_window_layers
            _original_code_predictor_init(
                self, *args, use_sliding_window=use_sliding_window, max_window_layers=max_window_layers, **kwargs
            )

        Qwen3OmniMoeTalkerCodePredictorConfig.__init__ = _patched_code_predictor_init
Comment on lines +1148 to +1150

t = np.linspace(0, 5.0, int(5.0 * 22050), endpoint=False)
audio_data = 0.5 * np.sin(2 * np.pi * 220 * t)
return (audio_data, 16000)
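
The hunk above synthesizes samples at 22050 Hz but reports a 16000 Hz rate; a consistent version follows, assuming 16 kHz is the intended fixture rate (the assumption is the editor's, not the PR's):

# Hedged fix sketch: one consistent sample rate for the synthetic fixture.
import numpy as np

SAMPLE_RATE = 16000  # assumed intended rate; the original mixes 22050 and 16000

def tiny_audio_fixture(duration_s: float = 5.0, freq_hz: float = 220.0):
    # duration_s seconds of a freq_hz sine tone at SAMPLE_RATE
    t = np.linspace(0, duration_s, int(duration_s * SAMPLE_RATE), endpoint=False)
    audio_data = 0.5 * np.sin(2 * np.pi * freq_hz * t)
    return audio_data, SAMPLE_RATE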