# OpenVINO GenAI C++ Modeling Usage Guide

This directory contains the C++ modeling API for running inference for supported
models such as Qwen3-Omni with OpenVINO, including text generation, vision
understanding, audio understanding, and text-to-speech (TTS) synthesis.

```
modeling/
├── models/      # Supported model implementations
├── layers/      # Reusable ov::Model building blocks (attention, RMSNorm, etc.)
├── ops/         # Custom OpenVINO operations
├── weights/     # Weight loading and quantization utilities
├── samples/     # Sample executables
│   └── tools/   # Dev-only Python utilities (see tools/README.md)
```

<details>
<summary>Prerequisites</summary>

- **Model weights**: HuggingFace model checkpoint directory
  containing `model-*.safetensors`, `config.json`, `tokenizer.json`, and
  `preprocessor_config.json`.
- **OpenVINO**: Source-built OpenVINO (2026.1.0+).
### Environment Setup (Windows)

```bat
set OV_DIR=<path\to\openvino>
set GENAI_DIR=<path\to\openvino.genai>

REM OpenVINO runtime DLLs and openvino_genai DLL
set PATH=%OV_DIR%\bin\intel64\RelWithDebInfo;%GENAI_DIR%\build-master\openvino_genai;%PATH%

REM Source-built OpenVINO Python bindings + openvino_tokenizers Python package
set PYTHONPATH=%GENAI_DIR%\thirdparty\openvino_tokenizers\python;%OV_DIR%\bin\intel64\RelWithDebInfo\python;%PYTHONPATH%
set OPENVINO_LIB_PATHS=%OV_DIR%\bin\intel64\RelWithDebInfo
```

### Environment Setup (Linux)

```bash
export OV_DIR=<path/to/openvino>
export GENAI_DIR=<path/to/openvino.genai>

# OpenVINO runtime libraries and openvino_genai library
export LD_LIBRARY_PATH=$OV_DIR/bin/intel64/RelWithDebInfo:$GENAI_DIR/build-master/openvino_genai:$LD_LIBRARY_PATH

# Source-built OpenVINO Python bindings + openvino_tokenizers Python package
export PYTHONPATH=$GENAI_DIR/thirdparty/openvino_tokenizers/python:$OV_DIR/bin/intel64/RelWithDebInfo/python:$PYTHONPATH
export OPENVINO_LIB_PATHS=$OV_DIR/bin/intel64/RelWithDebInfo
```

</details>

<details>
<summary>Sample Executables</summary>

### Case 1: Image + Text → Text (`modeling_qwen3_omni`)

Loads the Qwen3-Omni text and vision models from safetensors, preprocesses an image,
and runs vision encoding and autoregressive text decoding.

**Windows:**
```bat
modeling_qwen3_omni.exe ^
    --model-dir path\to\model ^
    --image path\to\image.jpg ^
    --prompt "Describe this image in detail." ^
    --device CPU ^
    --precision fp32 ^
    --output-tokens 64
```

**Linux:**
```bash
./modeling_qwen3_omni \
    --model-dir path/to/model \
    --image path/to/image.jpg \
    --prompt "Describe this image in detail." \
    --device CPU \
    --precision fp32 \
    --output-tokens 64
```

**Required arguments:**

| Argument | Description |
|---|---|
| `--model-dir PATH` | HuggingFace model directory with safetensors and config files |
| `--image PATH` | Input image file (JPEG, PNG, etc.) |

</details>

<details>
<summary>Cases 2–5: Multimodal → Text + TTS</summary>

### Cases 2–5: Multimodal → Text + TTS (`modeling_qwen3_omni_tts_min`)

Supports image, audio, and video inputs with text-to-speech output. Uses positional arguments.

**Windows:**
```bat
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    <CASE_ID> ^
    "<TEXT_PROMPT>" ^
    output.wav ^
    [IMAGE_PATH] ^
    [AUDIO_PATH] ^
    [DEVICE] ^
    [MAX_NEW_TOKENS] ^
    [PRECISION] ^
    [VIDEO_FRAMES_DIR]
```

**Linux:**
```bash
./modeling_qwen3_omni_tts_min \
    path/to/model \
    <CASE_ID> \
    "<TEXT_PROMPT>" \
    output.wav \
    [IMAGE_PATH] \
    [AUDIO_PATH] \
    [DEVICE] \
    [MAX_NEW_TOKENS] \
    [PRECISION] \
    [VIDEO_FRAMES_DIR]
```

</details>

<details>
<summary>Test Cases & Examples</summary>

## Test Cases

### Example: Case 2 — Image Description with TTS

**Windows:**
```bat
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    2 ^
    "Describe this image and provide a speech response." ^
    case2_output.wav ^
    path\to\image.jpg ^
    none ^
    CPU ^
    32 ^
    fp32
```

**Linux:**
```bash
./modeling_qwen3_omni_tts_min \
    path/to/model \
    2 \
    "Describe this image and provide a speech response." \
    case2_output.wav \
    path/to/image.jpg \
    none \
    CPU \
    32 \
    fp32
```

### Example: Case 3 — Audio Understanding with TTS

**Windows:**
```bat
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    3 ^
    "What sound do you hear in the audio? Answer in one short sentence." ^
    case3_output.wav ^
    none ^
    path\to\audio.wav ^
    CPU ^
    32 ^
    fp32
```

**Linux:**
```bash
./modeling_qwen3_omni_tts_min \
    path/to/model \
    3 \
    "What sound do you hear in the audio? Answer in one short sentence." \
    case3_output.wav \
    none \
    path/to/audio.wav \
    CPU \
    32 \
    fp32
```

### Example: Case 5 — Full Multimodal (Image + Video + Audio + Text)

Requires pre-extracted video frames (use `extract_video_frames`):

**Windows:**
```bat
REM Step 1: Extract video frames
extract_video_frames.exe ^
    --video path\to\video.mp4 ^
    --output-dir frames_dir ^
    --max-frames 4

REM Step 2: Run Case 5
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    5 ^
    "Describe the scene in the image, video, and audio." ^
    case5_output.wav ^
    path\to\image.jpg ^
    path\to\audio.wav ^
    CPU ^
    32 ^
    fp32 ^
    frames_dir
```

**Linux:**
```bash
# Step 1: Extract video frames
./extract_video_frames \
    --video path/to/video.mp4 \
    --output-dir frames_dir \
    --max-frames 4

# Step 2: Run Case 5
./modeling_qwen3_omni_tts_min \
    path/to/model \
    5 \
    "Describe the scene in the image, video, and audio." \
    case5_output.wav \
    path/to/image.jpg \
    path/to/audio.wav \
    CPU \
    32 \
    fp32 \
    frames_dir
```

</details>

<details>
<summary>Precision Modes</summary>

## Precision Modes

Control inference precision and KV-cache compression via the `--precision` argument:

Aliases: `fp32_kv8` → `inf_fp32_kv_int8`, `fp16_kv8` → `inf_fp16_kv_int8`, etc.

</details>

<details>
<summary>Automated Case Comparison</summary>

## Automated Case Comparison (`tools/qwen3_omni_case_compare.py`)

Runs all cases across multiple devices and precision modes, generating a JSON report
with performance metrics and text outputs for comparison.

```bat
python tools/qwen3_omni_case_compare.py ^
    --model-dir path\to\model ^
    --image path\to\image.jpg ^
    --test-audio path\to\audio.wav ^
    --video path\to\video.mp4 ^
    --cpp-only
```

</details>

<details>
<summary>C++ Modeling API Overview</summary>

## C++ Modeling API Overview

```cpp
// ...
auto text_request = compiled_text.create_infer_request();
// ... feed input_ids, attention_mask, visual_embeds, position_ids ...
// ... decode loop with argmax sampling ...
```

</details>