Commit c9fad36

README: add collapsible sections, Linux examples, and generalize paths
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
1 parent: 63de1f1

File tree: 1 file changed, +138 −71 lines


src/cpp/src/modeling/README.md

Lines changed: 138 additions & 71 deletions
@@ -1,33 +1,23 @@
-# OpenVINO GenAI Modeling — Qwen3-Omni Usage Guide
+# OpenVINO GenAI C++ Modeling Usage Guide
 
-This directory contains the C++ modeling API for running **Qwen3-Omni-4B** inference
+This directory contains the C++ modeling API for running inference for supported models such as Qwen3-Omni
 with OpenVINO, including text generation, vision understanding, audio understanding,
 and text-to-speech (TTS) synthesis.
 
-├── models/                              # Model implementations
-
 ```
 modeling/
-├── models/qwen3_omni/                   # Qwen3-Omni model implementations
-│   ├── modeling_qwen3_omni.hpp          # Text model (thinker) builder
-│   ├── modeling_qwen3_omni_audio.hpp    # Audio (talker/TTS) model builder
-│   ├── processing_qwen3_omni_audio.hpp  # Audio preprocessing (WAV → mel spectrogram)
-│   ├── processing_qwen3_omni_vl.hpp     # Vision-Language processing
-│   ├── processing_qwen3_omni_vision.hpp # Vision preprocessing (image → pixel values)
-│   └── whisper_mel_spectrogram.hpp      # Whisper-style mel spectrogram extractor
+├── models/                              # Supported model implementations
 ├── layers/                              # Reusable ov::Model building blocks (attention, RMSNorm, etc.)
 ├── ops/                                 # Custom OpenVINO operations
 ├── weights/                             # Weight loading and quantization utilities
 ├── samples/                             # Sample executables
-│   ├── modeling_qwen3_omni.cpp          # Case 1: image+text → text
-│   ├── modeling_qwen3_omni_tts_min.cpp  # Cases 2–5: multimodal → text + TTS
-│   ├── extract_video_frames.cpp         # Video frame extraction tool
-│   └── tools/                           # Dev-only Python utilities (see tools/README.md)
+│   └── tools/                           # Dev-only Python utilities (see tools/README.md)
 ```
 
-## Prerequisites
+<details>
+<summary>Prerequisites</summary>
 
-- **Model weights**: HuggingFace Qwen3-Omni-4B-Instruct checkpoint directory
+- **Model weights**: HuggingFace model checkpoint directory
   containing `model-*.safetensors`, `config.json`, `tokenizer.json`, and
   `preprocessor_config.json`.
 - **OpenVINO**: Source-built OpenVINO (2026.1.0+).
@@ -39,8 +29,8 @@ modeling/
 ### Environment Setup (Windows)
 
 ```bat
-set OV_DIR=C:\work\ws_tmp\openvino.xzhan34
-set GENAI_DIR=C:\work\ws_tmp\openvino.genai.xzhan34
+set OV_DIR=<path\to\openvino>
+set GENAI_DIR=<path\to\openvino.genai>
 
 REM OpenVINO runtime DLLs and openvino_genai DLL
 set PATH=%OV_DIR%\bin\intel64\RelWithDebInfo;%GENAI_DIR%\build-master\openvino_genai;%PATH%
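Before running any sample, it can help to confirm that the variables the setup steps above define are actually set. A minimal Python sketch — the checker itself is hypothetical and not part of the repo; only the variable names come from the README:

```python
import os

# Variables the Windows/Linux setup sections are expected to define
# (names taken from the README; this helper is illustrative only).
REQUIRED_VARS = ["OV_DIR", "GENAI_DIR", "OPENVINO_LIB_PATHS"]

def missing_env_vars(env=os.environ):
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running `missing_env_vars()` before launching a sample surfaces an incomplete environment early, instead of failing later with a DLL/shared-library load error.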
@@ -50,48 +40,72 @@ set PYTHONPATH=%GENAI_DIR%\thirdparty\openvino_tokenizers\python;%OV_DIR%\bin\in
 set OPENVINO_LIB_PATHS=%OV_DIR%\bin\intel64\RelWithDebInfo
 ```
 
-## Sample Executables
+### Environment Setup (Linux)
+
+```bash
+export OV_DIR=<path/to/openvino>
+export GENAI_DIR=<path/to/openvino.genai>
+
+# OpenVINO runtime libraries and openvino_genai library
+export LD_LIBRARY_PATH=$OV_DIR/bin/intel64/RelWithDebInfo:$GENAI_DIR/build-master/openvino_genai:$LD_LIBRARY_PATH
+
+# Source-built OpenVINO Python bindings + openvino_tokenizers Python package
+export PYTHONPATH=$GENAI_DIR/thirdparty/openvino_tokenizers/python:$OV_DIR/bin/intel64/RelWithDebInfo/python:$PYTHONPATH
+export OPENVINO_LIB_PATHS=$OV_DIR/bin/intel64/RelWithDebInfo
+```
+
+</details>
+
+<details>
+<summary>Sample Executables</summary>
 
 ### Case 1: Image + Text → Text (`modeling_qwen3_omni`)
 
 Loads the Qwen3-Omni text and vision models from safetensors, preprocesses an image,
 runs vision encoding and autoregressive text decoding.
 
-```
+**Windows:**
+```bat
 modeling_qwen3_omni.exe ^
-  --model-dir D:\models\Qwen3-Omni-4B-Instruct-multilingual ^
+  --model-dir path\to\model ^
   --image path\to\image.jpg ^
   --prompt "Describe this image in detail." ^
   --device CPU ^
   --precision fp32 ^
   --output-tokens 64
 ```
 
+**Linux:**
+```bash
+./modeling_qwen3_omni \
+  --model-dir path/to/model \
+  --image path/to/image.jpg \
+  --prompt "Describe this image in detail." \
+  --device CPU \
+  --precision fp32 \
+  --output-tokens 64
+```
+
 **Required arguments:**
 
 | Argument | Description |
 |---|---|
 | `--model-dir PATH` | HuggingFace model directory with safetensors and config files |
 | `--image PATH` | Input image file (JPEG, PNG, etc.) |
 
-**Optional arguments:**
+</details>
 
-| Argument | Default | Description |
-|---|---|---|
-| `--prompt TEXT` | `"What can you see"` | User text prompt |
-| `--device NAME` | `CPU` | OpenVINO device (`CPU`, `GPU`, `GPU.1`, etc.) |
-| `--precision MODE` | `mixed` | Inference precision mode (see below) |
-| `--output-tokens N` | `64` | Maximum number of tokens to generate |
-| `--dump-dir PATH` | *(none)* | Directory to dump intermediate tensors for debugging |
-| `--dump-ir-dir PATH` | *(none)* | Directory to save compiled IR models |
+<details>
+<summary>Cases 2–5: Multimodal → Text + TTS</summary>
 
 ### Cases 2–5: Multimodal → Text + TTS (`modeling_qwen3_omni_tts_min`)
 
-Supports image, audio, video inputs with text-to-speech output. Uses positional arguments.
+Supports image, audio, and video inputs with text-to-speech output. Uses positional arguments.
 
-```
+**Windows:**
+```bat
 modeling_qwen3_omni_tts_min.exe ^
-  D:\models\Qwen3-Omni-4B-Instruct-multilingual ^
+  path\to\model ^
   <CASE_ID> ^
   "<TEXT_PROMPT>" ^
   output.wav ^
@@ -103,20 +117,25 @@ modeling_qwen3_omni_tts_min.exe ^
   [VIDEO_FRAMES_DIR]
 ```
 
-**Positional arguments (in order):**
+**Linux:**
+```bash
+./modeling_qwen3_omni_tts_min \
+  path/to/model \
+  <CASE_ID> \
+  "<TEXT_PROMPT>" \
+  output.wav \
+  [IMAGE_PATH] \
+  [AUDIO_PATH] \
+  [DEVICE] \
+  [MAX_NEW_TOKENS] \
+  [PRECISION] \
+  [VIDEO_FRAMES_DIR]
+```
 
-| # | Argument | Required | Description |
-|---|---|---|---|
-| 1 | `MODEL_DIR` | Yes | HuggingFace model directory |
-| 2 | `CASE_ID` | Yes | Test case identifier (2, 3, 4, or 5) |
-| 3 | `TEXT_PROMPT` | Yes | User text prompt |
-| 4 | `WAV_OUT` | Yes | Output WAV file path for synthesized speech |
-| 5 | `IMAGE_PATH` | No | Input image (use `none` to skip) |
-| 6 | `AUDIO_PATH` | No | Input audio WAV file (use `none` to skip) |
-| 7 | `DEVICE` | No | OpenVINO device (default: `CPU`) |
-| 8 | `MAX_NEW_TOKENS` | No | Max generation tokens (default: `64`) |
-| 9 | `PRECISION` | No | Precision mode (default: `fp32`) |
-| 10 | `VIDEO_FRAMES_DIR` | No | Directory of extracted video frames (use `none` to skip) |
+</details>
+
+<details>
+<summary>Test Cases & Examples</summary>
 
 ## Test Cases
 
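Because `modeling_qwen3_omni_tts_min` takes up to ten positional arguments with literal `none` placeholders, invocations are easy to get wrong by one slot. A hedged Python sketch of a wrapper that assembles the argv list — the helper is hypothetical; the argument order and the defaults (`CPU`, `64`, `fp32`) mirror the sample's documented behavior:

```python
def build_tts_argv(binary, model_dir, case_id, prompt, wav_out,
                   image=None, audio=None, device="CPU",
                   max_new_tokens=64, precision="fp32", frames_dir=None):
    """Assemble the positional argv list for the TTS sample.
    Optional slots use the literal placeholder "none", as the sample expects."""
    opt = lambda v: str(v) if v is not None else "none"
    return [binary, model_dir, str(case_id), prompt, wav_out,
            opt(image), opt(audio), device, str(max_new_tokens),
            precision, opt(frames_dir)]
```

The resulting list can be passed straight to `subprocess.run`, which keeps the `none` bookkeeping out of hand-written command lines.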
@@ -130,9 +149,10 @@ modeling_qwen3_omni_tts_min.exe ^
 
 ### Example: Case 2 — Image Description with TTS
 
+**Windows:**
 ```bat
 modeling_qwen3_omni_tts_min.exe ^
-  D:\models\Qwen3-Omni-4B-Instruct-multilingual ^
+  path\to\model ^
   2 ^
   "Describe this image and provide a speech response." ^
   case2_output.wav ^
@@ -143,11 +163,26 @@ modeling_qwen3_omni_tts_min.exe ^
   fp32
 ```
 
+**Linux:**
+```bash
+./modeling_qwen3_omni_tts_min \
+  path/to/model \
+  2 \
+  "Describe this image and provide a speech response." \
+  case2_output.wav \
+  path/to/image.jpg \
+  none \
+  CPU \
+  32 \
+  fp32
+```
+
 ### Example: Case 3 — Audio Understanding with TTS
 
+**Windows:**
 ```bat
 modeling_qwen3_omni_tts_min.exe ^
-  D:\models\Qwen3-Omni-4B-Instruct-multilingual ^
+  path\to\model ^
   3 ^
   "What sound do you hear in the audio? Answer in one short sentence." ^
   case3_output.wav ^
@@ -158,10 +193,25 @@ modeling_qwen3_omni_tts_min.exe ^
   fp32
 ```
 
+**Linux:**
+```bash
+./modeling_qwen3_omni_tts_min \
+  path/to/model \
+  3 \
+  "What sound do you hear in the audio? Answer in one short sentence." \
+  case3_output.wav \
+  none \
+  path/to/audio.wav \
+  CPU \
+  32 \
+  fp32
+```
+
 ### Example: Case 5 — Full Multimodal (Image + Video + Audio + Text)
 
-Requires pre-extracted video frames (use `extract_video_frames.exe`):
+Requires pre-extracted video frames (use `extract_video_frames`):
 
+**Windows:**
 ```bat
 REM Step 1: Extract video frames
 extract_video_frames.exe ^
@@ -171,7 +221,7 @@ extract_video_frames.exe ^
 
 REM Step 2: Run Case 5
 modeling_qwen3_omni_tts_min.exe ^
-  D:\models\Qwen3-Omni-4B-Instruct-multilingual ^
+  path\to\model ^
   5 ^
   "Describe the scene in the image, video, and audio." ^
   case5_output.wav ^
@@ -183,6 +233,33 @@ modeling_qwen3_omni_tts_min.exe ^
   frames_dir
 ```
 
+**Linux:**
+```bash
+# Step 1: Extract video frames
+./extract_video_frames \
+  --video path/to/video.mp4 \
+  --output-dir frames_dir \
+  --max-frames 4
+
+# Step 2: Run Case 5
+./modeling_qwen3_omni_tts_min \
+  path/to/model \
+  5 \
+  "Describe the scene in the image, video, and audio." \
+  case5_output.wav \
+  path/to/image.jpg \
+  path/to/audio.wav \
+  CPU \
+  32 \
+  fp32 \
+  frames_dir
+```
+
+</details>
+
+<details>
+<summary>Precision Modes</summary>
+
 ## Precision Modes
 
 Control inference precision and KV-cache compression via the `--precision` argument:
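The short precision aliases can be expanded mechanically; a sketch of the expansion rule inferred from the two aliases the README lists (`fp32_kv8` → `inf_fp32_kv_int8`, `fp16_kv8` → `inf_fp16_kv_int8`). The helper is illustrative, not part of the tooling:

```python
def expand_precision_alias(mode):
    """Expand short aliases like "fp32_kv8" to the full mode name.
    Rule inferred from the README's examples; other modes pass through."""
    if mode.endswith("_kv8"):
        base = mode[: -len("_kv8")]   # e.g. "fp32"
        return f"inf_{base}_kv_int8"
    return mode
```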
@@ -199,14 +276,19 @@ Control inference precision and KV-cache compression via the `--precision` argum
 
 Aliases: `fp32_kv8` → `inf_fp32_kv_int8`, `fp16_kv8` → `inf_fp16_kv_int8`, etc.
 
+</details>
+
+<details>
+<summary>Automated Case Comparison</summary>
+
 ## Automated Case Comparison (`tools/qwen3_omni_case_compare.py`)
 
 Runs all cases across multiple devices and precision modes, generating a JSON report
 with performance metrics and text outputs for comparison.
 
 ```bat
 python tools/qwen3_omni_case_compare.py ^
-  --model-dir D:\models\Qwen3-Omni-4B-Instruct-multilingual ^
+  --model-dir path\to\model ^
   --image path\to\image.jpg ^
   --test-audio path\to\audio.wav ^
   --video path\to\video.mp4 ^
@@ -224,27 +306,10 @@ python tools/qwen3_omni_case_compare.py ^
   --cpp-only
 ```
 
-**Key arguments:**
+</details>
 
-| Argument | Description |
-|---|---|
-| `--model-dir` | HuggingFace model directory |
-| `--image` | Default image for Cases 1–4 |
-| `--test-audio` | Default audio for Cases 3–4 |
-| `--video` | Video file for Case 5 (frames extracted automatically) |
-| `--case5-image` | Case 5 specific image (falls back to `--image`) |
-| `--case5-audio` | Case 5 specific audio (falls back to `--test-audio`) |
-| `--case5-prompt-file` | Text file containing the Case 5 prompt |
-| `--cpp-bin` | Path to `modeling_qwen3_omni` executable (Case 1) |
-| `--cpp-tts-bin` | Path to `modeling_qwen3_omni_tts_min` executable (Cases 2–5) |
-| `--out-json` | Output JSON report path |
-| `--devices` | Comma-separated devices: `CPU`, `GPU`, `GPU.1` |
-| `--precisions` | Comma-separated precision modes |
-| `--cases` | Run specific cases only (e.g., `--cases 1,5`) |
-| `--max-new-tokens` | Token generation limit per case |
-| `--max-video-frames` | Max video frames to extract for Case 5 |
-| `--timeout` | Per-case timeout in seconds (default: 600) |
-| `--cpp-only` | Skip Python reference inference, run C++ cases only |
+<details>
+<summary>C++ Modeling API Overview</summary>
 
 ## C++ Modeling API Overview
 
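The comparison tool sweeps every device × precision × case combination, so the number of runs is the product of the three list lengths. A small illustration of that sweep (the tool's actual internals are not shown in this README, so this is only a sketch):

```python
from itertools import product

def run_matrix(devices, precisions, cases):
    """Enumerate every (device, precision, case) combination a sweep
    over the three axes would execute."""
    return list(product(devices, precisions, cases))
```

For example, two devices, two precision modes, and two cases yield eight runs, which is why trimming `--cases` or `--precisions` shortens a comparison considerably.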
@@ -290,3 +355,5 @@ auto text_request = compiled_text.create_infer_request();
 // ... feed input_ids, attention_mask, visual_embeds, position_ids ...
 // ... decode loop with argmax sampling ...
 ```
+
+</details>
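The decode loop the API-overview snippet elides is a plain greedy (argmax) loop; sketched here in Python with a stubbed `step` standing in for one infer-request call returning logits — illustrative only, not the repo's C++ implementation:

```python
def greedy_decode(step, first_token, eos_id, max_new_tokens):
    """Greedy decoding: at each step take the argmax over the logits
    returned by `step(tokens)`, stopping at EOS or the token budget."""
    tokens = [first_token]
    for _ in range(max_new_tokens):
        logits = step(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens
```

Swapping the argmax for sampling from the softmaxed logits would give stochastic decoding; the loop structure stays the same.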
