QWEN-3-Omni conversion and chat demo #3443

Draft
sgonorov wants to merge 7 commits into openvinotoolkit:master from sgonorov:qwen3-omni-support

Conversation


@sgonorov sgonorov commented Mar 5, 2026

QWEN-3-Omni conversion and chat demo

@sgonorov sgonorov requested a review from Wovchena March 5, 2026 00:19
@sgonorov sgonorov self-assigned this Mar 5, 2026
@sgonorov sgonorov force-pushed the qwen3-omni-support branch from d1ef5db to 3e7c752 Compare March 9, 2026 09:25
Copilot AI review requested due to automatic review settings March 11, 2026 17:50
@sgonorov sgonorov force-pushed the qwen3-omni-support branch from 16cedfd to e2f4104 Compare March 11, 2026 17:50

Copilot AI left a comment


Pull request overview

Adds a new Qwen3-Omni(-MoE) OpenVINO IR conversion toolchain and a separate PyTorch-based CLI chat/demo utility under tools/qwen3/.

Changes:

  • Introduces a multi-part converter for Qwen3-Omni-MoE (thinker/talker/audio/vision/code2wav) with optional weight compression.
  • Adds OpenVINO stateful/KV-cache patching utilities and cache adapter classes for export/tracing.
  • Adds a CLI chat app + demo flow for Qwen3-Omni with basic media command parsing and optional audio output.

Reviewed changes

Copilot reviewed 24 out of 25 changed files in this pull request and generated 26 comments.

File Description
tools/qwen3/qwen3_omni_moe/utils.py TorchScript cache cleanup helper used after conversions.
tools/qwen3/qwen3_omni_moe/traceable_cache.py Cache adapter intended to be trace-friendly for export.
tools/qwen3/qwen3_omni_moe/stateful_utils.py OpenVINO graph patching to fuse cache reorder + make KV-cache stateful.
tools/qwen3/qwen3_omni_moe/ov_model_utils.py IR naming + (optional) weight compression and saving.
tools/qwen3/qwen3_omni_moe/flat_cache.py Flat KV-cache adapter used by language wrappers.
tools/qwen3/qwen3_omni_moe/export_utils.py Torch Export → OpenVINO conversion helper.
tools/qwen3/qwen3_omni_moe/convert_vision_encoder.py Converts vision patcher and vision merger submodels.
tools/qwen3/qwen3_omni_moe/convert_thinker_language.py Exports thinker language model (logits/hidden + KV-cache).
tools/qwen3/qwen3_omni_moe/convert_thinker_embedding.py Converts thinker token embedding module.
tools/qwen3/qwen3_omni_moe/convert_talker.py Converts talker embedding + talker language model.
tools/qwen3/qwen3_omni_moe/convert_code_predictor.py Converts talker code predictor submodel.
tools/qwen3/qwen3_omni_moe/convert_code2wav.py Converts the code-to-waveform vocoder.
tools/qwen3/qwen3_omni_moe/convert_audio_encoder.py Converts the audio encoder with a forward wrapper.
tools/qwen3/qwen3_omni_moe/convert.py Main CLI entrypoint orchestrating conversion and config saving.
tools/qwen3/qwen3_omni_moe/constants.py Central names/constants for inputs/outputs and output filenames.
tools/qwen3/qwen3_omni_moe/__init__.py Exposes converter entrypoint.
tools/qwen3/qwen3_omni_moe/README.md Converter usage docs and output layout.
tools/qwen3/qwen3_chat/model.py Loads the Qwen3-Omni PyTorch model + processor.
tools/qwen3/qwen3_chat/io_utils.py Parses media commands and saves generated audio to WAV.
tools/qwen3/qwen3_chat/generate.py Chat history setup + generation helper.
tools/qwen3/qwen3_chat/demo.py Text-only smoke-test demo mode.
tools/qwen3/qwen3_chat/chat.py Interactive CLI chat loop with commands and optional audio saving.
tools/qwen3/qwen3_chat/__main__.py Module entrypoint (python -m tools.qwen3.qwen3_chat).
tools/qwen3/qwen3_chat/__init__.py Package marker.
tools/qwen3/qwen3_chat/README.md CLI chat usage docs and examples.

Comment on lines +37 to +45
gen_kwargs: dict[str, Any] = {
    "streamer": transformers.TextStreamer(
        processor,
        skip_prompt=True,
        skip_special_tokens=False,
        clean_up_tokenization_spaces=False,
    ),
    "thinker_do_sample": False,
}

Copilot AI Mar 11, 2026


TextStreamer is configured with skip_special_tokens=False, but the final batch_decode() uses skip_special_tokens=True. This can cause the streamed text shown to the user to include special tokens that won’t appear in the final decoded string. Consider setting skip_special_tokens=True for the streamer (or otherwise keeping streaming and final decoding consistent).
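One way to keep the two paths consistent is to share a single set of decode kwargs between the streamer and the final decode. A minimal sketch, assuming the `processor` and `output_ids` names from the surrounding diff:

```python
# Shared decode settings so streamed text matches the final batch_decode() output.
DECODE_KWARGS = {
    "skip_special_tokens": True,
    "clean_up_tokenization_spaces": False,
}
# streamer = transformers.TextStreamer(processor, skip_prompt=True, **DECODE_KWARGS)
# text = processor.batch_decode(output_ids, **DECODE_KWARGS)[0]
```

With one dictionary, changing the decode policy later cannot desynchronize the streamed and final text.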

Comment on lines +23 to +33
with torch.no_grad():
    text, _ = generate_response(
        model,
        processor,
        history,
        enable_audio=False,
        speaker=None,
    )

print(f"You: {prompt}")
print(f"Qwen: {text}\n")

Copilot AI Mar 11, 2026


generate_response() always streams tokens to stdout via TextStreamer, and then run_demo() prints the fully decoded text again (print(f"Qwen: {text}")). This leads to duplicated output in demo mode. Consider adding a flag to disable streaming for demo, or avoid printing the decoded text when streaming is enabled.
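One minimal fix is to print the decoded text only when streaming was disabled, for example via a small helper (hypothetical name; a sketch, not the PR's implementation):

```python
def print_reply(text: str, streamed: bool) -> None:
    """Print the decoded reply only if it was not already streamed to stdout."""
    if not streamed:
        print(f"Qwen: {text}\n")
```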

Comment on lines +47 to +48
cu_chunk_lens += [window_aftercnn] * (cnn_len // window_aftercnn)
remainder = cnn_len % window_aftercnn

Copilot AI Mar 11, 2026


In _forward_wrap_audio_encoder, cnn_len is a 0-d tensor (iterating over aftercnn_lens), so expressions like [window_aftercnn] * (cnn_len // window_aftercnn) will raise TypeError: can't multiply sequence by non-int of type 'Tensor'. Convert cnn_len to a Python int (e.g., cnn_len_int = int(cnn_len.item())) before using it for list repetition / modulo.

Suggested change
cu_chunk_lens += [window_aftercnn] * (cnn_len // window_aftercnn)
remainder = cnn_len % window_aftercnn
cnn_len_int = int(cnn_len.item())
cu_chunk_lens += [window_aftercnn] * (cnn_len_int // window_aftercnn)
remainder = cnn_len_int % window_aftercnn
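Once the lengths are plain Python ints, the chunking logic amounts to the following pure-Python sketch (helper name hypothetical):

```python
def chunk_lengths(cnn_len: int, window: int) -> list[int]:
    """Split a total length into full windows plus an optional trailing remainder."""
    lens = [window] * (cnn_len // window)
    remainder = cnn_len % window
    if remainder:
        lens.append(remainder)
    return lens
```

For example, a length of 10 with window 3 yields three full chunks and a remainder of 1.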

Comment on lines +39 to +49
original_diff = torch.diff
torch.diff = _diff_via_slice
try:
    ov_model = ov.convert_model(
        code2wav,
        example_input={
            "codes": torch.ones([1, num_quantizers, 4], dtype=torch.long),
        },
    )
finally:
    torch.diff = original_diff

Copilot AI Mar 11, 2026


This monkey-patches torch.diff globally during conversion. Even though it is restored in finally, it can still have surprising side effects if other code runs concurrently in the same process. Prefer calling a local replacement function directly (or wrapping only the model code that needs it) instead of reassigning torch.diff at module scope.
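If the patch cannot be pushed into the model code itself, it can at least be scoped with a reusable context manager instead of ad-hoc try/finally (generic sketch; this is still a process-global patch, so the concurrency caveat above stands):

```python
from contextlib import contextmanager

@contextmanager
def temporarily_patched(obj, name, replacement):
    """Swap obj.<name> for `replacement` inside the with-block, restoring it afterwards."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)

# Usage, assuming torch and _diff_via_slice from the surrounding code:
# with temporarily_patched(torch, "diff", _diff_via_slice):
#     ov_model = ov.convert_model(code2wav, example_input=...)
```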

    key_value_output_names: list[str],
    batch_dim: int,
) -> None:
    from openvino._offline_transformations import apply_make_stateful_transformation

Copilot AI Mar 11, 2026


apply_make_stateful_transformation is imported from openvino._offline_transformations, which is a private module. This can break across OpenVINO versions; consider using a public transformation API if available, or at least guarding the import and providing a clear error message about the required OpenVINO version.

Suggested change
from openvino._offline_transformations import apply_make_stateful_transformation
try:
    from openvino._offline_transformations import apply_make_stateful_transformation
except ImportError as exc:
    raise RuntimeError(
        "Stateful model transformation requires 'apply_make_stateful_transformation' from "
        "openvino._offline_transformations, which is not available in the installed "
        f"OpenVINO distribution (version: {ov.__version__}). "
        "Install an OpenVINO version that provides offline transformations, or disable "
        "stateful patching for this model."
    ) from exc

Comment on lines +1 to +3
from pathlib import Path
from typing import Any
import argparse

Copilot AI Mar 11, 2026


These newly added Python files are missing the standard repository license header (Copyright ... + SPDX-License-Identifier: Apache-2.0) that appears in other tools/ modules (e.g., tools/cacheviz/cacheviz.py). Please add the header to this file (and the other new files in this PR) to keep licensing consistent.

Comment on lines +55 to +60
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    ckpt,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

Copilot AI Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_load_model() always loads the source model with torch_dtype=torch.float16 regardless of --weight_format. As a result, selecting fp32 cannot produce a true FP32 pipeline (and may also affect export accuracy). Consider plumbing weight_format into _load_model() (e.g., use torch.float32 when weight_format == "fp32").
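Plumbing could be as simple as a small mapping from the CLI weight format to the load dtype. A sketch, assuming the format names accepted by `--weight_format` (the exact set is an assumption, not taken from the PR):

```python
# Map the converter's --weight_format choice to the dtype used when loading the
# source checkpoint; compressed formats still load in fp16 before compression.
_LOAD_DTYPE = {
    "fp32": "float32",
    "fp16": "float16",
    "int8": "float16",
    "int4": "float16",
}

def load_dtype_for(weight_format: str) -> str:
    """Return the torch dtype name, to be used as torch_dtype=getattr(torch, name)."""
    try:
        return _LOAD_DTYPE[weight_format]
    except KeyError:
        raise ValueError(f"Unsupported weight format: {weight_format!r}") from None
```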

Comment on lines +33 to +38
if suffix_input_names:
    input_names.extend(suffix_input_names)
for inp, name in zip(ov_model.inputs, input_names):
    inp.get_tensor().set_names({name})
for out, name in zip(ov_model.outputs, output_names):
    out.get_tensor().set_names({name})

Copilot AI Mar 11, 2026


set_ov_model_names() uses zip(ov_model.inputs, input_names) / zip(ov_model.outputs, output_names), which silently ignores extra ports or extra names if the counts diverge (easy to hit when the export graph changes). Add an explicit length check/assert before renaming so mismatches fail fast and are easier to debug.

Comment on lines +6 to +9
import openvino as ov
from openvino.frontend.pytorch.patch_model import __make_16bit_traceable
from transformers import Qwen3OmniForConditionalGeneration


Copilot AI Mar 11, 2026


__make_16bit_traceable is a private OpenVINO frontend symbol. Depending on private APIs makes this conversion script brittle across OpenVINO versions; consider using a public API (or centralizing this behind a helper with an explicit version check).
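A generic guard that centralizes private imports behind one helper with a clear failure message might look like this (sketch; not OpenVINO-specific, helper name hypothetical):

```python
import importlib

def import_private(module: str, name: str, hint: str):
    """Import a symbol, raising a clear error with an actionable hint if it is absent."""
    try:
        mod = importlib.import_module(module)
        return getattr(mod, name)
    except (ImportError, AttributeError) as exc:
        raise RuntimeError(f"{module}.{name} is unavailable: {hint}") from exc
```

Each conversion script would then fail with one consistent message when a private symbol moves between OpenVINO releases, instead of a bare ImportError.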

Comment on lines +12 to +24
def fuse_cache_reorder(
    ov_model: ov.Model,
    not_kv_inputs: list[str],
    key_value_input_names: list[str],
    gather_dim: int,
) -> None:
    if model_has_input_output_name(ov_model, BEAM_IDX_NAME):
        raise ValueError("Model already has fused cache")
    input_batch = ov_model.input(INPUTS_EMBEDS).get_partial_shape()[0]
    beam_idx = opset13.parameter(name=BEAM_IDX_NAME, dtype=ov.Type.i32, shape=ov.PartialShape([input_batch]))
    beam_idx.output(0).get_tensor().add_names({BEAM_IDX_NAME})
    ov_model.add_parameters([beam_idx])
    not_kv_inputs.append(ov_model.inputs[-1])

Copilot AI Mar 11, 2026


fuse_cache_reorder() annotates not_kv_inputs as list[str], but patch_stateful() passes a list of OpenVINO input ports (ov_model.inputs). This type mismatch is misleading and makes static checking harder; adjust the annotation (or remove the parameter entirely since it isn’t used for anything meaningful).

@sgonorov sgonorov force-pushed the qwen3-omni-support branch from e2f4104 to d535b89 Compare March 12, 2026 14:00
