# OpenVINO GenAI C++ Modeling Usage Guide

This directory contains the C++ modeling API for running inference for supported
models such as Qwen3-Omni with OpenVINO, including text generation, vision
understanding, audio understanding, and text-to-speech (TTS) synthesis.

```
modeling/
├── models/      # Supported model implementations
├── layers/      # Reusable ov::Model building blocks (attention, RMSNorm, etc.)
├── ops/         # Custom OpenVINO operations
├── weights/     # Weight loading and quantization utilities
├── samples/     # Sample executables
│   └── tools/   # Dev-only Python utilities (see tools/README.md)
```

<details>
<summary>Prerequisites</summary>

- **Model weights**: HuggingFace model checkpoint directory
  containing `model-*.safetensors`, `config.json`, `tokenizer.json`, and
  `preprocessor_config.json`.
- **OpenVINO**: Source-built OpenVINO (2026.1.0+).
### Environment Setup (Windows)

```bat
set OV_DIR=<path\to\openvino>
set GENAI_DIR=<path\to\openvino.genai>

REM OpenVINO runtime DLLs and openvino_genai DLL
set PATH=%OV_DIR%\bin\intel64\RelWithDebInfo;%GENAI_DIR%\build-master\openvino_genai;%PATH%

REM Source-built OpenVINO Python bindings + openvino_tokenizers Python package
set PYTHONPATH=%GENAI_DIR%\thirdparty\openvino_tokenizers\python;%OV_DIR%\bin\intel64\RelWithDebInfo\python;%PYTHONPATH%
set OPENVINO_LIB_PATHS=%OV_DIR%\bin\intel64\RelWithDebInfo
```

### Environment Setup (Linux)

```bash
export OV_DIR=<path/to/openvino>
export GENAI_DIR=<path/to/openvino.genai>

# OpenVINO runtime libraries and openvino_genai library
export LD_LIBRARY_PATH=$OV_DIR/bin/intel64/RelWithDebInfo:$GENAI_DIR/build-master/openvino_genai:$LD_LIBRARY_PATH

# Source-built OpenVINO Python bindings + openvino_tokenizers Python package
export PYTHONPATH=$GENAI_DIR/thirdparty/openvino_tokenizers/python:$OV_DIR/bin/intel64/RelWithDebInfo/python:$PYTHONPATH
export OPENVINO_LIB_PATHS=$OV_DIR/bin/intel64/RelWithDebInfo
```

</details>

<details>
<summary>Sample Executables</summary>

### Case 1: Image + Text → Text (`modeling_qwen3_omni`)

Loads the Qwen3-Omni text and vision models from safetensors, preprocesses an image,
and runs vision encoding and autoregressive text decoding.

**Windows:**
```bat
modeling_qwen3_omni.exe ^
    --model-dir path\to\model ^
    --image path\to\image.jpg ^
    --prompt "Describe this image in detail." ^
    --device CPU ^
    --precision fp32 ^
    --output-tokens 64
```

**Linux:**
```bash
./modeling_qwen3_omni \
    --model-dir path/to/model \
    --image path/to/image.jpg \
    --prompt "Describe this image in detail." \
    --device CPU \
    --precision fp32 \
    --output-tokens 64
```

**Required arguments:**

| Argument | Description |
|---|---|
| `--model-dir PATH` | HuggingFace model directory with safetensors and config files |
| `--image PATH` | Input image file (JPEG, PNG, etc.) |

</details>

<details>
<summary>Cases 2–5: Multimodal → Text + TTS</summary>

### Cases 2–5: Multimodal → Text + TTS (`modeling_qwen3_omni_tts_min`)

Supports image, audio, and video inputs with text-to-speech output. Uses positional arguments.

**Windows:**
```bat
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    <CASE_ID> ^
    "<TEXT_PROMPT>" ^
    output.wav ^
    [IMAGE_PATH] ^
    [AUDIO_PATH] ^
    [DEVICE] ^
    [MAX_NEW_TOKENS] ^
    [PRECISION] ^
    [VIDEO_FRAMES_DIR]
```

**Linux:**
```bash
./modeling_qwen3_omni_tts_min \
    path/to/model \
    <CASE_ID> \
    "<TEXT_PROMPT>" \
    output.wav \
    [IMAGE_PATH] \
    [AUDIO_PATH] \
    [DEVICE] \
    [MAX_NEW_TOKENS] \
    [PRECISION] \
    [VIDEO_FRAMES_DIR]
```

</details>

<details>
<summary>Test Cases & Examples</summary>

## Test Cases

### Example: Case 2 — Image Description with TTS

**Windows:**
```bat
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    2 ^
    "Describe this image and provide a speech response." ^
    case2_output.wav ^
    path\to\image.jpg ^
    none ^
    CPU ^
    32 ^
    fp32
```

**Linux:**
```bash
./modeling_qwen3_omni_tts_min \
    path/to/model \
    2 \
    "Describe this image and provide a speech response." \
    case2_output.wav \
    path/to/image.jpg \
    none \
    CPU \
    32 \
    fp32
```

### Example: Case 3 — Audio Understanding with TTS

**Windows:**
```bat
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    3 ^
    "What sound do you hear in the audio? Answer in one short sentence." ^
    case3_output.wav ^
    none ^
    path\to\audio.wav ^
    CPU ^
    32 ^
    fp32
```

**Linux:**
```bash
./modeling_qwen3_omni_tts_min \
    path/to/model \
    3 \
    "What sound do you hear in the audio? Answer in one short sentence." \
    case3_output.wav \
    none \
    path/to/audio.wav \
    CPU \
    32 \
    fp32
```

### Example: Case 5 — Full Multimodal (Image + Video + Audio + Text)

Requires pre-extracted video frames (use `extract_video_frames`):

**Windows:**
```bat
REM Step 1: Extract video frames
extract_video_frames.exe ^
    --video path\to\video.mp4 ^
    --output-dir frames_dir ^
    --max-frames 4

REM Step 2: Run Case 5
modeling_qwen3_omni_tts_min.exe ^
    path\to\model ^
    5 ^
    "Describe the scene in the image, video, and audio." ^
    case5_output.wav ^
    path\to\image.jpg ^
    path\to\audio.wav ^
    CPU ^
    32 ^
    fp32 ^
    frames_dir
```

**Linux:**
```bash
# Step 1: Extract video frames
./extract_video_frames \
    --video path/to/video.mp4 \
    --output-dir frames_dir \
    --max-frames 4

# Step 2: Run Case 5
./modeling_qwen3_omni_tts_min \
    path/to/model \
    5 \
    "Describe the scene in the image, video, and audio." \
    case5_output.wav \
    path/to/image.jpg \
    path/to/audio.wav \
    CPU \
    32 \
    fp32 \
    frames_dir
```

</details>

<details>
<summary>Precision Modes</summary>

## Precision Modes

Control inference precision and KV-cache compression via the `--precision` argument:

Aliases: `fp32_kv8` → `inf_fp32_kv_int8`, `fp16_kv8` → `inf_fp16_kv_int8`, etc.

</details>

<details>
<summary>Automated Case Comparison</summary>

## Automated Case Comparison (`tools/qwen3_omni_case_compare.py`)

Runs all cases across multiple devices and precision modes, generating a JSON report
with performance metrics and text outputs for comparison.

```bat
python tools/qwen3_omni_case_compare.py ^
    --model-dir path\to\model ^
    --image path\to\image.jpg ^
    --test-audio path\to\audio.wav ^
    --video path\to\video.mp4 ^
    --cpp-only
```

</details>

<details>
<summary>C++ Modeling API Overview</summary>

## C++ Modeling API Overview

```cpp
// ...
auto text_request = compiled_text.create_infer_request();
// ... feed input_ids, attention_mask, visual_embeds, position_ids ...
// ... decode loop with argmax sampling ...
```

</details>