QWEN-3-Omni conversion and chat demo #3443
sgonorov wants to merge 7 commits into openvinotoolkit:master
Conversation
force-pushed from d1ef5db to 3e7c752
force-pushed from 16cedfd to e2f4104
Pull request overview
Adds a new Qwen3-Omni(-MoE) OpenVINO IR conversion toolchain and a separate PyTorch-based CLI chat/demo utility under tools/qwen3/.
Changes:
- Introduces a multi-part converter for Qwen3-Omni-MOE (thinker/talker/audio/vision/code2wav) with optional weight compression.
- Adds OpenVINO stateful/KV-cache patching utilities and cache adapter classes for export/tracing.
- Adds a CLI chat app + demo flow for Qwen3-Omni with basic media command parsing and optional audio output.
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| tools/qwen3/qwen3_omni_moe/utils.py | TorchScript cache cleanup helper used after conversions. |
| tools/qwen3/qwen3_omni_moe/traceable_cache.py | Cache adapter intended to be trace-friendly for export. |
| tools/qwen3/qwen3_omni_moe/stateful_utils.py | OpenVINO graph patching to fuse cache reorder + make KV-cache stateful. |
| tools/qwen3/qwen3_omni_moe/ov_model_utils.py | IR naming + (optional) weight compression and saving. |
| tools/qwen3/qwen3_omni_moe/flat_cache.py | Flat KV-cache adapter used by language wrappers. |
| tools/qwen3/qwen3_omni_moe/export_utils.py | Torch Export → OpenVINO conversion helper. |
| tools/qwen3/qwen3_omni_moe/convert_vision_encoder.py | Converts vision patcher and vision merger submodels. |
| tools/qwen3/qwen3_omni_moe/convert_thinker_language.py | Exports thinker language model (logits/hidden + KV-cache). |
| tools/qwen3/qwen3_omni_moe/convert_thinker_embedding.py | Converts thinker token embedding module. |
| tools/qwen3/qwen3_omni_moe/convert_talker.py | Converts talker embedding + talker language model. |
| tools/qwen3/qwen3_omni_moe/convert_code_predictor.py | Converts talker code predictor submodel. |
| tools/qwen3/qwen3_omni_moe/convert_code2wav.py | Converts the code-to-waveform vocoder. |
| tools/qwen3/qwen3_omni_moe/convert_audio_encoder.py | Converts the audio encoder with a forward wrapper. |
| tools/qwen3/qwen3_omni_moe/convert.py | Main CLI entrypoint orchestrating conversion and config saving. |
| tools/qwen3/qwen3_omni_moe/constants.py | Central names/constants for inputs/outputs and output filenames. |
| `tools/qwen3/qwen3_omni_moe/__init__.py` | Exposes converter entrypoint. |
| tools/qwen3/qwen3_omni_moe/README.md | Converter usage docs and output layout. |
| tools/qwen3/qwen3_chat/model.py | Loads the Qwen3-Omni PyTorch model + processor. |
| tools/qwen3/qwen3_chat/io_utils.py | Parses media commands and saves generated audio to WAV. |
| tools/qwen3/qwen3_chat/generate.py | Chat history setup + generation helper. |
| tools/qwen3/qwen3_chat/demo.py | Text-only smoke-test demo mode. |
| tools/qwen3/qwen3_chat/chat.py | Interactive CLI chat loop with commands and optional audio saving. |
| tools/qwen3/qwen3_chat/main.py | Module entrypoint (python -m tools.qwen3.qwen3_chat). |
| `tools/qwen3/qwen3_chat/__init__.py` | Package marker. |
| tools/qwen3/qwen3_chat/README.md | CLI chat usage docs and examples. |
```python
gen_kwargs: dict[str, Any] = {
    "streamer": transformers.TextStreamer(
        processor,
        skip_prompt=True,
        skip_special_tokens=False,
        clean_up_tokenization_spaces=False,
    ),
    "thinker_do_sample": False,
}
```
TextStreamer is configured with skip_special_tokens=False, but the final batch_decode() uses skip_special_tokens=True. This can cause the streamed text shown to the user to include special tokens that won’t appear in the final decoded string. Consider setting skip_special_tokens=True for the streamer (or otherwise keeping streaming and final decoding consistent).
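One way to keep the two decode paths in sync is to define the decode settings once and reuse them in both places. A minimal sketch (names here are illustrative; the real calls are `transformers.TextStreamer(processor, ...)` and `processor.batch_decode(ids, ...)`):

```python
# Shared decode settings so the streamed text and the final
# batch_decode() output cannot drift apart (sketch, not the PR's code).
DECODE_KWARGS = {
    "skip_special_tokens": True,
    "clean_up_tokenization_spaces": False,
}

def streamer_kwargs() -> dict:
    # kwargs that would be forwarded to transformers.TextStreamer(...)
    return {"skip_prompt": True, **DECODE_KWARGS}

def final_decode_kwargs() -> dict:
    # kwargs that would be forwarded to processor.batch_decode(...)
    return dict(DECODE_KWARGS)
```

With a single source of truth, changing the special-token policy later only requires touching one dict.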
```python
with torch.no_grad():
    text, _ = generate_response(
        model,
        processor,
        history,
        enable_audio=False,
        speaker=None,
    )

print(f"You: {prompt}")
print(f"Qwen: {text}\n")
```
generate_response() always streams tokens to stdout via TextStreamer, and then run_demo() prints the fully decoded text again (print(f"Qwen: {text}")). This leads to duplicated output in demo mode. Consider adding a flag to disable streaming for demo, or avoid printing the decoded text when streaming is enabled.
```python
cu_chunk_lens += [window_aftercnn] * (cnn_len // window_aftercnn)
remainder = cnn_len % window_aftercnn
```
In _forward_wrap_audio_encoder, cnn_len is a 0-d tensor (iterating over aftercnn_lens), so expressions like [window_aftercnn] * (cnn_len // window_aftercnn) will raise TypeError: can't multiply sequence by non-int of type 'Tensor'. Convert cnn_len to a Python int (e.g., cnn_len_int = int(cnn_len.item())) before using it for list repetition / modulo.
Suggested change:

```diff
-cu_chunk_lens += [window_aftercnn] * (cnn_len // window_aftercnn)
-remainder = cnn_len % window_aftercnn
+cnn_len_int = int(cnn_len.item())
+cu_chunk_lens += [window_aftercnn] * (cnn_len_int // window_aftercnn)
+remainder = cnn_len_int % window_aftercnn
```
```python
original_diff = torch.diff
torch.diff = _diff_via_slice
try:
    ov_model = ov.convert_model(
        code2wav,
        example_input={
            "codes": torch.ones([1, num_quantizers, 4], dtype=torch.long),
        },
    )
finally:
    torch.diff = original_diff
```
This monkey-patches torch.diff globally during conversion. Even though it is restored in finally, it can still have surprising side effects if other code runs concurrently in the same process. Prefer calling a local replacement function directly (or wrapping only the model code that needs it) instead of reassigning torch.diff at module scope.
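If the global patch cannot be avoided, a context manager at least makes the patched scope explicit and guarantees restoration. A minimal sketch (demonstrated here with a stand-in namespace instead of the real `torch` module):

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def temporarily_patched(obj, name, replacement):
    """Swap obj.<name> for replacement inside the with-block only."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)

# Stand-in for the torch module; in convert_code2wav.py the call would be
#   with temporarily_patched(torch, "diff", _diff_via_slice):
#       ov_model = ov.convert_model(code2wav, example_input=...)
fake_torch = SimpleNamespace(diff=lambda x: "original")
with temporarily_patched(fake_torch, "diff", lambda x: "patched"):
    inside = fake_torch.diff(None)
after = fake_torch.diff(None)
```

Note this still mutates shared state for the duration of the block, so the reviewer's point about concurrent code in the same process stands; calling the replacement function directly remains the safer option.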
```python
    key_value_output_names: list[str],
    batch_dim: int,
) -> None:
    from openvino._offline_transformations import apply_make_stateful_transformation
```
apply_make_stateful_transformation is imported from openvino._offline_transformations, which is a private module. This can break across OpenVINO versions; consider using a public transformation API if available, or at least guarding the import and providing a clear error message about the required OpenVINO version.
Suggested change:

```diff
-from openvino._offline_transformations import apply_make_stateful_transformation
+try:
+    from openvino._offline_transformations import apply_make_stateful_transformation
+except ImportError as exc:
+    raise RuntimeError(
+        "Stateful model transformation requires 'apply_make_stateful_transformation' from "
+        "openvino._offline_transformations, which is not available in the installed "
+        f"OpenVINO distribution (version: {ov.__version__}). "
+        "Install an OpenVINO version that provides offline transformations, or disable "
+        "stateful patching for this model."
+    ) from exc
```
```python
from pathlib import Path
from typing import Any
import argparse
```
These newly added Python files are missing the standard repository license header (Copyright ... + SPDX-License-Identifier: Apache-2.0) that appears in other tools/ modules (e.g., tools/cacheviz/cacheviz.py). Please add the header to this file (and the other new files in this PR) to keep licensing consistent.
```python
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    ckpt,
    config=config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
```
_load_model() always loads the source model with torch_dtype=torch.float16 regardless of --weight_format. As a result, selecting fp32 cannot produce a true FP32 pipeline (and may also affect export accuracy). Consider plumbing weight_format into _load_model() (e.g., use torch.float32 when weight_format == "fp32").
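One hedged way to plumb the flag through is a small lookup from `--weight_format` to the load dtype. The mapping below is an illustration, not the PR's code; in particular, loading in fp16 for the `int8` path (and compressing the IR weights afterwards) is an assumption:

```python
# Hypothetical mapping from the --weight_format CLI flag to the torch dtype
# name used when loading the source checkpoint.
DTYPE_BY_WEIGHT_FORMAT = {
    "fp32": "float32",
    "fp16": "float16",
    "int8": "float16",  # assumption: load fp16, compress IR weights later
}

def resolve_torch_dtype(weight_format: str) -> str:
    try:
        return DTYPE_BY_WEIGHT_FORMAT[weight_format]
    except KeyError:
        raise ValueError(f"unsupported --weight_format: {weight_format!r}")
```

`_load_model()` could then pass `getattr(torch, resolve_torch_dtype(args.weight_format))` as `torch_dtype`, so `fp32` actually yields an FP32 pipeline.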
```python
if suffix_input_names:
    input_names.extend(suffix_input_names)
for inp, name in zip(ov_model.inputs, input_names):
    inp.get_tensor().set_names({name})
for out, name in zip(ov_model.outputs, output_names):
    out.get_tensor().set_names({name})
```
set_ov_model_names() uses zip(ov_model.inputs, input_names) / zip(ov_model.outputs, output_names), which silently ignores extra ports or extra names if the counts diverge (easy to hit when the export graph changes). Add an explicit length check/assert before renaming so mismatches fail fast and are easier to debug.
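A fail-fast pairing helper is one way to do this (on Python 3.10+, `zip(ports, names, strict=True)` achieves the same effect). A sketch with illustrative names:

```python
def checked_pairs(ports, names, kind="input"):
    """Like zip(), but raises instead of silently dropping extras."""
    if len(ports) != len(names):
        raise ValueError(
            f"{kind} mismatch: model exposes {len(ports)} ports "
            f"but {len(names)} names were supplied"
        )
    return list(zip(ports, names))
```

The rename loops would then read `for inp, name in checked_pairs(ov_model.inputs, input_names, "input"): ...`, so a changed export graph surfaces immediately instead of producing misnamed tensors.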
```python
import openvino as ov
from openvino.frontend.pytorch.patch_model import __make_16bit_traceable
from transformers import Qwen3OmniForConditionalGeneration
```
__make_16bit_traceable is a private OpenVINO frontend symbol. Depending on private APIs makes this conversion script brittle across OpenVINO versions; consider using a public API (or centralizing this behind a helper with an explicit version check).
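One way to centralize this is a small resolver that imports the private symbol and fails with an actionable message when an OpenVINO build no longer provides it. A sketch (helper name is illustrative; tested below with a stdlib symbol):

```python
import importlib

def require_symbol(module_name: str, symbol: str, hint: str):
    """Resolve a (possibly private) symbol, raising a clear error if absent."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, symbol)
    except (ImportError, AttributeError) as exc:
        raise RuntimeError(
            f"{module_name}.{symbol} is unavailable in this environment: {hint}"
        ) from exc
```

Usage in the converter could look like `make_16bit_traceable = require_symbol("openvino.frontend.pytorch.patch_model", "__make_16bit_traceable", "this private helper may have moved in your OpenVINO version")`, keeping the version-sensitivity in one place.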
```python
def fuse_cache_reorder(
    ov_model: ov.Model,
    not_kv_inputs: list[str],
    key_value_input_names: list[str],
    gather_dim: int,
) -> None:
    if model_has_input_output_name(ov_model, BEAM_IDX_NAME):
        raise ValueError("Model already has fused cache")
    input_batch = ov_model.input(INPUTS_EMBEDS).get_partial_shape()[0]
    beam_idx = opset13.parameter(name=BEAM_IDX_NAME, dtype=ov.Type.i32, shape=ov.PartialShape([input_batch]))
    beam_idx.output(0).get_tensor().add_names({BEAM_IDX_NAME})
    ov_model.add_parameters([beam_idx])
    not_kv_inputs.append(ov_model.inputs[-1])
```
fuse_cache_reorder() annotates not_kv_inputs as list[str], but patch_stateful() passes a list of OpenVINO input ports (ov_model.inputs). This type mismatch is misleading and makes static checking harder; adjust the annotation (or remove the parameter entirely since it isn’t used for anything meaningful).
force-pushed from e2f4104 to d535b89