QWEN-3-Omni conversion and chat demo #3443
sgonorov wants to merge 7 commits into openvinotoolkit:master
Conversation
force-pushed from d1ef5db to 3e7c752
force-pushed from 16cedfd to e2f4104
Pull request overview
Adds a new Qwen3-Omni(-MoE) OpenVINO IR conversion toolchain and a separate PyTorch-based CLI chat/demo utility under tools/qwen3/.
Changes:
- Introduces a multi-part converter for Qwen3-Omni-MOE (thinker/talker/audio/vision/code2wav) with optional weight compression.
- Adds OpenVINO stateful/KV-cache patching utilities and cache adapter classes for export/tracing.
- Adds a CLI chat app + demo flow for Qwen3-Omni with basic media command parsing and optional audio output.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| tools/qwen3/qwen3_omni_moe/utils.py | TorchScript cache cleanup helper used after conversions. |
| tools/qwen3/qwen3_omni_moe/traceable_cache.py | Cache adapter intended to be trace-friendly for export. |
| tools/qwen3/qwen3_omni_moe/stateful_utils.py | OpenVINO graph patching to fuse cache reorder + make KV-cache stateful. |
| tools/qwen3/qwen3_omni_moe/ov_model_utils.py | IR naming + (optional) weight compression and saving. |
| tools/qwen3/qwen3_omni_moe/flat_cache.py | Flat KV-cache adapter used by language wrappers. |
| tools/qwen3/qwen3_omni_moe/export_utils.py | Torch Export → OpenVINO conversion helper. |
| tools/qwen3/qwen3_omni_moe/convert_vision_encoder.py | Converts vision patcher and vision merger submodels. |
| tools/qwen3/qwen3_omni_moe/convert_thinker_language.py | Exports thinker language model (logits/hidden + KV-cache). |
| tools/qwen3/qwen3_omni_moe/convert_thinker_embedding.py | Converts thinker token embedding module. |
| tools/qwen3/qwen3_omni_moe/convert_talker.py | Converts talker embedding + talker language model. |
| tools/qwen3/qwen3_omni_moe/convert_code_predictor.py | Converts talker code predictor submodel. |
| tools/qwen3/qwen3_omni_moe/convert_code2wav.py | Converts the code-to-waveform vocoder. |
| tools/qwen3/qwen3_omni_moe/convert_audio_encoder.py | Converts the audio encoder with a forward wrapper. |
| tools/qwen3/qwen3_omni_moe/convert.py | Main CLI entrypoint orchestrating conversion and config saving. |
| tools/qwen3/qwen3_omni_moe/constants.py | Central names/constants for inputs/outputs and output filenames. |
| `tools/qwen3/qwen3_omni_moe/__init__.py` | Exposes converter entrypoint. |
| tools/qwen3/qwen3_omni_moe/README.md | Converter usage docs and output layout. |
| tools/qwen3/qwen3_chat/model.py | Loads the Qwen3-Omni PyTorch model + processor. |
| tools/qwen3/qwen3_chat/io_utils.py | Parses media commands and saves generated audio to WAV. |
| tools/qwen3/qwen3_chat/generate.py | Chat history setup + generation helper. |
| tools/qwen3/qwen3_chat/demo.py | Text-only smoke-test demo mode. |
| tools/qwen3/qwen3_chat/chat.py | Interactive CLI chat loop with commands and optional audio saving. |
| tools/qwen3/qwen3_chat/main.py | Module entrypoint (python -m tools.qwen3.qwen3_chat). |
| `tools/qwen3/qwen3_chat/__init__.py` | Package marker. |
| tools/qwen3/qwen3_chat/README.md | CLI chat usage docs and examples. |
```python
gen_kwargs: dict[str, Any] = {
    "streamer": transformers.TextStreamer(
        processor,
        skip_prompt=True,
        skip_special_tokens=False,
        clean_up_tokenization_spaces=False,
    ),
    "thinker_do_sample": False,
}
```
TextStreamer is configured with skip_special_tokens=False, but the final batch_decode() uses skip_special_tokens=True. This can cause the streamed text shown to the user to include special tokens that won’t appear in the final decoded string. Consider setting skip_special_tokens=True for the streamer (or otherwise keeping streaming and final decoding consistent).
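One way to keep the two decode paths in sync is to define the decode settings once and reuse them in both places. A minimal sketch (names here are illustrative; the real calls are `transformers.TextStreamer(processor, ...)` and `processor.batch_decode(ids, ...)`):

```python
# Shared decode settings so the streamed text and the final
# batch_decode() output cannot drift apart (sketch, not the PR's code).
DECODE_KWARGS = {
    "skip_special_tokens": True,
    "clean_up_tokenization_spaces": False,
}

def streamer_kwargs() -> dict:
    # kwargs that would be forwarded to transformers.TextStreamer(...)
    return {"skip_prompt": True, **DECODE_KWARGS}

def final_decode_kwargs() -> dict:
    # kwargs that would be forwarded to processor.batch_decode(...)
    return dict(DECODE_KWARGS)
```

With a single source of truth, changing the special-token policy later only requires touching one dict.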
```python
with torch.no_grad():
    text, _ = generate_response(
        model,
        processor,
        history,
        enable_audio=False,
        speaker=None,
    )

print(f"You: {prompt}")
print(f"Qwen: {text}\n")
```
generate_response() always streams tokens to stdout via TextStreamer, and then run_demo() prints the fully decoded text again (print(f"Qwen: {text}")). This leads to duplicated output in demo mode. Consider adding a flag to disable streaming for demo, or avoid printing the decoded text when streaming is enabled.
```python
cu_chunk_lens += [window_aftercnn] * (cnn_len // window_aftercnn)
remainder = cnn_len % window_aftercnn
```
In _forward_wrap_audio_encoder, cnn_len is a 0-d tensor (iterating over aftercnn_lens), so expressions like [window_aftercnn] * (cnn_len // window_aftercnn) will raise TypeError: can't multiply sequence by non-int of type 'Tensor'. Convert cnn_len to a Python int (e.g., cnn_len_int = int(cnn_len.item())) before using it for list repetition / modulo.
Suggested change:

```diff
-cu_chunk_lens += [window_aftercnn] * (cnn_len // window_aftercnn)
-remainder = cnn_len % window_aftercnn
+cnn_len_int = int(cnn_len.item())
+cu_chunk_lens += [window_aftercnn] * (cnn_len_int // window_aftercnn)
+remainder = cnn_len_int % window_aftercnn
```
```python
original_diff = torch.diff
torch.diff = _diff_via_slice
try:
    ov_model = ov.convert_model(
        code2wav,
        example_input={
            "codes": torch.ones([1, num_quantizers, 4], dtype=torch.long),
        },
    )
finally:
    torch.diff = original_diff
```
This monkey-patches torch.diff globally during conversion. Even though it is restored in finally, it can still have surprising side effects if other code runs concurrently in the same process. Prefer calling a local replacement function directly (or wrapping only the model code that needs it) instead of reassigning torch.diff at module scope.
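If the global patch cannot be avoided, a context manager at least makes the patched scope explicit and guarantees restoration. A minimal sketch (demonstrated here with a stand-in namespace instead of the real `torch` module):

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporarily_patched(obj, name, replacement):
    """Swap obj.<name> for replacement inside the with-block only."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)

# Stand-in for the torch module; in convert_code2wav.py the call would be
#   with temporarily_patched(torch, "diff", _diff_via_slice):
#       ov_model = ov.convert_model(code2wav, example_input=...)
fake_torch = SimpleNamespace(diff=lambda x: "original")
with temporarily_patched(fake_torch, "diff", lambda x: "patched"):
    inside = fake_torch.diff(None)
after = fake_torch.diff(None)
```

Note this still mutates shared state for the duration of the block, so the reviewer's point about concurrent code in the same process stands; calling the replacement function directly remains the safer option.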
```python
    key_value_output_names: list[str],
    batch_dim: int,
) -> None:
    from openvino._offline_transformations import apply_make_stateful_transformation
```
apply_make_stateful_transformation is imported from openvino._offline_transformations, which is a private module. This can break across OpenVINO versions; consider using a public transformation API if available, or at least guarding the import and providing a clear error message about the required OpenVINO version.
Suggested change:

```diff
-from openvino._offline_transformations import apply_make_stateful_transformation
+try:
+    from openvino._offline_transformations import apply_make_stateful_transformation
+except ImportError as exc:
+    raise RuntimeError(
+        "Stateful model transformation requires 'apply_make_stateful_transformation' from "
+        "openvino._offline_transformations, which is not available in the installed "
+        f"OpenVINO distribution (version: {ov.__version__}). "
+        "Install an OpenVINO version that provides offline transformations, or disable "
+        "stateful patching for this model."
+    ) from exc
```
```python
from pathlib import Path
from typing import Any
import argparse
```
These newly added Python files are missing the standard repository license header (Copyright ... + SPDX-License-Identifier: Apache-2.0) that appears in other tools/ modules (e.g., tools/cacheviz/cacheviz.py). Please add the header to this file (and the other new files in this PR) to keep licensing consistent.
```python
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    ckpt,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```
_load_model() always loads the source model with torch_dtype=torch.float16 regardless of --weight_format. As a result, selecting fp32 cannot produce a true FP32 pipeline (and may also affect export accuracy). Consider plumbing weight_format into _load_model() (e.g., use torch.float32 when weight_format == "fp32").
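One hedged way to plumb the flag through is a small lookup from `--weight_format` to the load dtype. The mapping below is an illustration, not the PR's code; in particular, loading in fp16 for the `int8` path (and compressing the IR weights afterwards) is an assumption:

```python
# Hypothetical mapping from the --weight_format CLI flag to the torch dtype
# name used when loading the source checkpoint.
DTYPE_BY_WEIGHT_FORMAT = {
    "fp32": "float32",
    "fp16": "float16",
    "int8": "float16",  # assumption: load fp16, compress IR weights later
}

def resolve_torch_dtype(weight_format: str) -> str:
    try:
        return DTYPE_BY_WEIGHT_FORMAT[weight_format]
    except KeyError:
        raise ValueError(f"unsupported --weight_format: {weight_format!r}")
```

`_load_model()` could then pass `getattr(torch, resolve_torch_dtype(args.weight_format))` as `torch_dtype`, so `fp32` actually yields an FP32 pipeline.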
```python
if suffix_input_names:
    input_names.extend(suffix_input_names)
for inp, name in zip(ov_model.inputs, input_names):
    inp.get_tensor().set_names({name})
for out, name in zip(ov_model.outputs, output_names):
    out.get_tensor().set_names({name})
```
set_ov_model_names() uses zip(ov_model.inputs, input_names) / zip(ov_model.outputs, output_names), which silently ignores extra ports or extra names if the counts diverge (easy to hit when the export graph changes). Add an explicit length check/assert before renaming so mismatches fail fast and are easier to debug.
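A fail-fast pairing helper is one way to do this (on Python 3.10+, `zip(ports, names, strict=True)` achieves the same effect). A sketch with illustrative names:

```python
def checked_pairs(ports, names, kind="input"):
    """Like zip(), but raises instead of silently dropping extras."""
    if len(ports) != len(names):
        raise ValueError(
            f"{kind} mismatch: model exposes {len(ports)} ports "
            f"but {len(names)} names were supplied"
        )
    return list(zip(ports, names))
```

The rename loops would then read `for inp, name in checked_pairs(ov_model.inputs, input_names, "input"): ...`, so a changed export graph surfaces immediately instead of producing misnamed tensors.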
```python
import openvino as ov
from openvino.frontend.pytorch.patch_model import __make_16bit_traceable
from transformers import Qwen3OmniForConditionalGeneration
```
__make_16bit_traceable is a private OpenVINO frontend symbol. Depending on private APIs makes this conversion script brittle across OpenVINO versions; consider using a public API (or centralizing this behind a helper with an explicit version check).
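One way to centralize this is a small resolver that imports the private symbol and fails with an actionable message when an OpenVINO build no longer provides it. A sketch (helper name is illustrative; tested below with a stdlib symbol):

```python
import importlib

def require_symbol(module_name: str, symbol: str, hint: str):
    """Resolve a (possibly private) symbol, raising a clear error if absent."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, symbol)
    except (ImportError, AttributeError) as exc:
        raise RuntimeError(
            f"{module_name}.{symbol} is unavailable in this environment: {hint}"
        ) from exc
```

Usage in the converter could look like `make_16bit_traceable = require_symbol("openvino.frontend.pytorch.patch_model", "__make_16bit_traceable", "this private helper may have moved in your OpenVINO version")`, keeping the version-sensitivity in one place.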
```python
def fuse_cache_reorder(
    ov_model: ov.Model,
    not_kv_inputs: list[str],
    key_value_input_names: list[str],
    gather_dim: int,
) -> None:
    if model_has_input_output_name(ov_model, BEAM_IDX_NAME):
        raise ValueError("Model already has fused cache")
    input_batch = ov_model.input(INPUTS_EMBEDS).get_partial_shape()[0]
    beam_idx = opset13.parameter(name=BEAM_IDX_NAME, dtype=ov.Type.i32, shape=ov.PartialShape([input_batch]))
    beam_idx.output(0).get_tensor().add_names({BEAM_IDX_NAME})
    ov_model.add_parameters([beam_idx])
    not_kv_inputs.append(ov_model.inputs[-1])
```
fuse_cache_reorder() annotates not_kv_inputs as list[str], but patch_stateful() passes a list of OpenVINO input ports (ov_model.inputs). This type mismatch is misleading and makes static checking harder; adjust the annotation (or remove the parameter entirely since it isn’t used for anything meaningful).
force-pushed from e2f4104 to d535b89