
Optimize performance of llm inference module #36

Open
ZiniuLin wants to merge 1 commit into xipingyan:master_modular_genai from ZiniuLin:optimize_llm_module

Conversation

@ZiniuLin

Main Change:

  1. Optimize performance of the LLM inference module by removing unnecessary operations.

Test Result:

The module pipeline's average inference time is reduced from 489 ms to 485 ms.

module unit test result

Optimize performance of the LLM inference module by
removing unnecessary operations.

Signed-off-by: Ziniu Lin <ziniu.lin@intel.com>
xipingyan added a commit that referenced this pull request Mar 13, 2026
…129)

* move WeightParameter to separate source files

* consolidate source files and refine folder name

* add forward() method in each layer block class to replace operator() for consistency

* add code for Qwen3MLP and Qwen3DecoderLayer (including MLP and RMSNorm, no Attention block)

* add code to implement Qwen3Attention class

* improve qwen3 attention ULT tests coverage

* add causal mask support in Qwen3Attention class

* add more tensor manipulation interfaces

* improve interface abstraction of ops/tensor/shape objects

* refine ULT tests by moving ref/common functions to test_utils.cpp/test_utils.hpp

* remove unreasonable assumption of "hidden_size_ == num_heads_ * head_dim_"

* add append_kv_cache support and run qwen3-0.6b bf16 GGUF model correctly

* move rope sin/cos construction from decode layer loop to model level

* optimize causal mask calculation

* Add safetensors support for new modeling API

- Implement SafetensorsWeightSource (WeightSource interface)
- Implement SafetensorsWeightFinalizer (WeightFinalizer interface)
  - Converts bf16/f16 weights to f32 for compatibility
- Add create_model_with_modeling_api() in safetensors_modeling.cpp
- Use OV_GENAI_USE_MODELING_API env var to switch implementations
- Both building_blocks and modeling API paths produce correct output

* Use native ScaledDotProductAttention for SDPA optimization

- Replace manual matmul+softmax+matmul with ov::op::v13::ScaledDotProductAttention
- Add build_kv_causal_mask() for dynamic causal mask generation with KV cache
- Add less_equal() operator for mask comparison
- Throughput is on par with legacy building_blocks implementation

* Enable ENABLE_SAFETENSORS by default

* use modeling api to support SmolLM3-3B model

* Add MoE validation test checking that fusedGEMM3MoECompressed results match

The matching order is MOE->MOECompressed->MOE3GemmFusedCompressed

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* refactor code to use unified create_xxx_model functions

* Export fuse_moe_3gemm_compressed to devapi; add moe_layer_internal in test_moe_layer to directly call the new make_int4_weight_moe for reordering to the GPU MoE layout

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Restore original files for moe_3gemm_fused_compressed

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* feat(loaders): add core multi-format loader interfaces

Add foundational interfaces for unified model loading architecture:

Core interfaces:
- model_loader.hpp: IModelLoader abstract interface with ModelFormat enum
- model_config.hpp/cpp: Unified ModelConfig with from_gguf/from_hf_json
- loader_registry.hpp/cpp: LoaderRegistry singleton with auto-detection
- model_builder.hpp/cpp: ModelBuilder factory for architecture-specific models
- weight_name_mapper.hpp/cpp: Weight name normalization across formats
- loaders.hpp: Unified include header

Loader stubs (headers only, implementation in next commits):
- gguf/gguf_loader.hpp: GGUF format loader interface
- safetensors/safetensors_loader.hpp: HuggingFace safetensors loader

Note: OpenVINO IR support planned for future releases.

Part of multi-format loader architecture.

* feat(loaders): implement GGUFLoader with gguf_utils integration

Implement GGUF format loader that wraps existing gguf_utils code:

New files:
- loaders/gguf/gguf_loader.cpp: Full GGUFLoader implementation

Features:
- supports(): Checks for .gguf extension
- load_config(): Reads GGUF metadata, converts to ModelConfig
- create_weight_source(): Wraps gguf::GGUFWeightSource
- create_weight_finalizer(): Wraps gguf::GGUFWeightFinalizer with qtypes
- load_tokenizer(): Uses create_tokenizer_from_config()

gguf_utils updates:
- GGUFWeightSource now supports canonical weight names
- Uses WeightNameMapper for GGUF->canonical name conversion
- Maintains backward compatibility with direct GGUF names

Part of multi-format loader architecture.

* feat(loaders): implement SafetensorsLoader with safetensors_utils integration

Implement HuggingFace Safetensors format loader:

New files:
- loaders/safetensors/safetensors_loader.cpp: Full SafetensorsLoader implementation

Features:
- supports(): Checks for config.json + model.safetensors
- load_config(): Reads HFConfig from config.json, converts to ModelConfig
- create_weight_source(): Uses safetensors::SafetensorsWeightSource
- create_weight_finalizer(): Uses safetensors::SafetensorsWeightFinalizer
- load_tokenizer(): Returns nullptr (handled by Tokenizer class)

safetensors_utils updates:
- SafetensorsWeightSource now supports canonical weight names
- Uses WeightNameMapper for HF->canonical name conversion
- Added legacy constructor for backward compatibility
- Maintains compatibility with safetensors_modeling.cpp

Part of multi-format loader architecture.

* feat(loaders): integrate unified loader with model loading pipeline

Add unified model loading entry point with environment variable control:

Entry point (utils.cpp):
- Add use_unified_loader() helper (OV_GENAI_USE_UNIFIED_LOADER env var)
- When enabled, uses LoaderRegistry to detect format and build model
- Falls back to legacy path on error or for unsupported formats
- Preserves full backward compatibility

ModelBuilder integration:
- Add build() method that uses modeling API
- Register Qwen3 architecture builder

Deprecated APIs:
- Mark create_from_gguf() as [[deprecated]] with migration guide
- Mark create_from_safetensors() as [[deprecated]] with migration guide

CMakeLists updates:
- Add conditional compilation for loaders module
- Respect ENABLE_GGUF and ENABLE_SAFETENSORS flags

Part of multi-format loader architecture.

* refactor(modeling): add build_qwen3_model self-registration to qwen3_dense.cpp

Move model builder function to model file following vLLM pattern:

Changes:
- Add build_qwen3_model() function to qwen3_dense.cpp
- Add static self-registration at module initialization
- ModelBuilder::instance() registration happens automatically

Benefits:
- Each model file is self-contained
- Adding new models doesn't require modifying model_builder.cpp
- Follows vLLM's model loading pattern

Note: Qwen2 support pending validation (commented out for now).

Part of multi-format loader architecture.

* docs(loaders): add README with supported formats and TODO list

Add documentation for multi-format loader module:

Contents:
- Supported formats table (GGUF, Safetensors tested; OpenVINO IR TODO)
- Supported architectures (Qwen3 tested; Qwen2, LLaMA pending)
- Environment variables reference
- Usage example
- Testing configurations
- TODO list with priorities

This documents the current state and future work items.

* Support qwen3 moe (#18)

* load tokenizer skip whole gguf weights loading

* support qwen3 moe 2.4b

* feat(loaders): add SmolLM3 model builder to unified loader

* Fix unit test build issue and move test_moe_layer to ops_unit_test

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Support qwen3 moe in modeling api (#25)

* support qwen3 moe in modeling api

* add utils: append_kv_cache

* feat: implement C++ RTN quantization with building blocks

- Add rtn_quantize.hpp with INT4/INT8 symmetric quantization algorithm

- Add GGUF_TYPE_INFLIGHT_INT4_SYM and GGUF_TYPE_INFLIGHT_INT8_SYM types

- Implement make_inflight_int4_weights_sym() and make_inflight_int8_weights_sym()

- Integrate compressed weights into make_weights_subgraph() switch

* feat: add safetensors in-flight compression loading

- Add InFlightCompressionConfig struct for quantization settings
- Add InFlightCompressionConfig::to_gguf_type() for mode conversion
- Add create_from_safetensors_compressed() for compressed loading
- Simplify create_qtype_entries() by removing redundant parameter
- Integrate compression with environment variable activation

* perf: optimize RTN quantization with type dispatch outside loops

* Enable safetensor in-flight Q4_1 support

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add basic config/ops changes for youtu-llm

* add youtu-llm support with modeling api

* support chat template in greedy_causal_lm.exe as youtu-llm needs it to generate correct output

* add initial code for qwen3vl

* add qwen3vl vision encoder

* add qwen3vl text decoder

* add qwen3vl fusion injector and input planner

* support weights loading and model creation

* add a new qwen3vl test sample and finish E2E integration

* fix tensor naming

* print basic performance metrics

* refine qwen3vl source file organization

* use static link for modeling sample and refine sample name

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Free moe load mem (#33)

* fix rms bug

* Add modeling moe test (#36)

* add initial code for building zimage model

* fix ULT tests failures

* add sample test for z-image model

* fix missing tokenizer: diffusion HF models use a different tokenizer JSON path

* add debug dump code, force fp32 inference and fix bugs

* fix bugs to generate correct image

* disable debug dump by default

* add comments in sample test

* feat: Implement Zero-Copy safetensors loading

Key changes:
- MmapHolder class to manage mmap lifetime via RAII
- SafetensorsData now stores mmap references instead of copying data
- load_safetensors_file() maps files without memcpy
- SafetensorsWeightSource provides get_shared_buffer() API
- SafetensorsWeightFinalizer uses SharedBuffer for Constant creation

Memory improvements:
- Peak memory reduced from 16.3 GB to 8.6 GB (47% reduction)
- create_weight_source: 7.7 GB -> 17 MB
- ModelBuilder::build: 7.7 GB -> 12 MB
- Only compile_model copies weights to usm_host (+8.4 GB)

Overhead ratio improved from 2.12x to 1.12x

* feat: Add OV_GENAI_USE_ZERO_COPY env var to control zero-copy mode

* feat: Support zero-copy mode with legacy MODELING_API path

* fix: Correct is_zero_copy_mode() to check mmap info availability

* fix: zero-copy default depends on modeling API status

- When OV_GENAI_USE_MODELING_API is not set or 0, zero-copy defaults to disabled
  (building_blocks path doesn't support zero-copy)
- When OV_GENAI_USE_MODELING_API=1, zero-copy defaults to enabled
- Explicit OV_GENAI_USE_ZERO_COPY setting always takes precedence
- Updated README documentation

* update tests method

* Replace FullyConnected with MatMul for IR serialization compatibility

Use standard MatMul with transpose_b=true instead of internal FullyConnected op.
MatMul is serializable to IR, GPU will convert it back to FullyConnected at compile time.

* Add OV_GENAI_SAVE_OV_MODEL environment variable support

- Refactor environment variable checking into reusable is_env_truthy() helper
- Add should_save_ov_model_from_env() to check OV_GENAI_SAVE_OV_MODEL env var
- Model saving can now be triggered via parameter or environment variable

* add initial support for dflash pipeline

* Use internal RoPE op for optimal GPU performance

- Replace manual RoPE pattern construction with op::internal::RoPE
- This enables GPU plugin's optimized rope_opt kernel
- Previously, the element-wise pattern was not recognized by RoPEFusion
- Reduces OpenCL enqueues by ~30% and improves TTFT by ~24%

* fix the Qwen3VLTextAttention.MatchesReferenceNoRope test bug

* fix the Qwen3VLTextAttention.MatchesReferenceNoRope test bug

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* add perf metrics print

* add debug log and force draft model to use fp32

* add ULT test for draft model inference

* force fp32 for inference; it's slow, but it solves the draft model NaN issue, and draft_accepted=54

* Enable qwen3-moe model with in-flight Q4_1 compression for safetensors

refine test_moe_layer to support Q4_1 quantization weight generation
Add Qwen3-MOE-4x0.6B-2.4B-Writing-Thunder-V1.2 to model test
Refine auto_test to configure the environment per test with extra env vars

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Fix Qwen3-30b-a3b error (#43)

* fix qwen3-30b-a3b error

* feat(safetensors): use Tensor(tensor, holder) to keep mmap alive in get_tensor()

Key improvement:
- Use ov::Tensor(view_tensor, shared_ptr<void>) constructor to bind mmap
  holder lifetime to the tensor itself
- This eliminates the need for separate get_shared_buffer() path
- When Constant(tensor) creates SharedBuffer<Tensor>, it holds the tensor
  which now holds the mmap_holder, keeping mmap memory valid

Simplified SafetensorsWeightFinalizer:
- Removed special handling for SafetensorsWeightSource
- Now uses generic get_tensor() path for all weight sources
- The tensor returned by get_tensor() already holds mmap lifetime

* Clean up safetensors utils: remove unused code and simplify interfaces

* fix compiling errors

* Enable safetensor in-flight compression with modeling api

Implemented comprehensive custom weight selection for in-flight quantization
Refined MoE weight fusion to support zero-copy mode as well as in-flight compression
Refined the in-flight RTN quantization algorithm to support 3D tensors
Refined the modeling API to support returning multiple tensors from weight finalization
Added MoE subgraph and common dequant subgraph for in-flight compression
Added auto test cases for qwen3-2.4B and qwen3-30B with the modeling API path
Fixed the qwen3-VL model's no-text-output issue in autotest

Provides flexible control over which weights to quantize using multiple strategies:
- Pattern-based (wildcard matching)
- Layer-based (layer index ranges)
- Type-based (attention, mlp, embeddings, etc.)
- Explicit lists (include/exclude specific weights)
- Size-based (minimum/maximum weight size)

Selection priority (highest to lowest):
1. Size thresholds (applied first)
2. Explicit exclude list
3. Explicit include list
4. Exclude patterns
5. Include patterns
6. Layer range
7. Type-based flags

Environment variables:
  OV_GENAI_INFLIGHT_QUANT_MODE: Quantization mode (INT4_SYM, INT4_ASYM, INT8_SYM, INT8_ASYM)
  OV_GENAI_INFLIGHT_QUANT_GROUP_SIZE: Group size for quantization (default: 128)
  OV_GENAI_INFLIGHT_QUANT_INCLUDE: Comma-separated include patterns
  OV_GENAI_INFLIGHT_QUANT_EXCLUDE: Comma-separated exclude patterns
  OV_GENAI_INFLIGHT_QUANT_LAYER_RANGE: Layer range (e.g., "10-20")
  OV_GENAI_INFLIGHT_QUANT_WEIGHT_NAMES: Comma-separated explicit weight names
  OV_GENAI_INFLIGHT_QUANT_MIN_SIZE: Minimum weight size in bytes
  OV_GENAI_INFLIGHT_QUANT_MAX_SIZE: Maximum weight size in bytes

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Enable qwen3-VL in-flight q4_1 compression

only quantize the LLM backbone to preserve precision
add qwen3-dense and qwen3-vl-8B in-flight compression test cases into autotest

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Enable INT8 asymmetric in-flight quantization and channel-wise support

- Implement asymmetric INT8 quantization (RTN) in gguf_utils.
- Add support for channel-wise quantization across INT4/INT8 symmetric and asymmetric modes.
- Update modeling API (Qwen3 MoE) and ops to support group size configuration.
- Enhance safetensors weight finalizer to handle INT8 asymmetric dequantization and 3D weights.
- Add new test cases in auto_tests.py for channel-wise and INT8 asymmetric in-flight quantization.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Optimize RTN quantization (~2-4x faster model loading)

Significant performance optimization for in-flight weight compression (RTN)
by implementing AVX2 intrinsics and multi-threading.

Changes:
- Implement AVX2 intrinsics for `find_min_max` and quantization kernels (INT4/INT8).
- Parallelize quantization loops over output channels using `ov::parallel_for`.
- Refactor to use `InputType` templates, eliminating per-element branching.
- Fix logic error in `int8_sym` where scale was calculated but not stored.

Performance impact (Total Load + Run Time):
- Qwen3-4B (INT4 Asym): 46.43s -> 15.96s (~2.9x faster)
- Qwen3-4B (INT4 Sym): 43.92s -> 12.40s (~3.5x faster)
- Qwen3-4B (INT4 Channel-wise): 48.60s -> 12.24s (~4x faster)
- Qwen3-30B (INT4 Asym): 14m16s -> 6m19s (~2.2x faster)
- Qwen3-MOE (Modeling API): ~3x faster duration (30.93s -> 10.16s)

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add baseline for comparison

* Add NNCF-compatible in-flight quantization strategy

- Add backup_mode (INT8_ASYM default) for sensitive layers (embeddings, lm_head)
- Set backup_mode=primary_mode to quantize all layers with same mode
- Set backup_mode=NONE to skip quantizing sensitive layers
- INT8 modes use per-channel quantization (group_size=-1)
- INT4 modes use group-wise quantization (default group_size=128)
- Add verbose logging option via OV_GENAI_INFLIGHT_QUANT_VERBOSE
- Follows NNCF default behavior: embeddings and last layer use INT8_ASYM

* Update modeling_dflash.cpp v2, update target kv cache handling

* Optimize memory usage in Safetensors weight finalizer

- Delay creation of `ov::op::v0::Constant` until after determining if
  quantization is required.
- Prevents holding the original full-precision weight in an OpenVINO
  Node when it is slated for quantization, reducing peak memory usage during model loading.
- Reuse fetched tensor reference and shape information to avoid
  redundant `get_tensor` calls.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add new ops/configs/utils for wan2.1 t2v pipeline

* add wan2.1 transformer modeling code

* fix ULT failure

* add wan2.1 VAE modeling code

* fix ULT failures

* add wan umt5 text encoder modeling code

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix diffusion safetensor load failure

* fix frame corruption (need to use lower resolution to avoid OOM)

* use fp32 for timestep

* rename source files

* Refactor Qwen3 MoE weight loading for performance optimization

- Refactor `Qwen3MoE` architecture to manage expert weights individually via `std::vector<WeightParameter*>`,
eliminating the overhead of pre-fusing large MoE tensors.
- Update `SafetensorsWeightFinalizer` to detect and process individual 2D expert weights directly
- Remove obsolete in-memory MoE fusion logic to avoid copy tensor data in mmap mode

Performance Impact (Qwen3-30B-A3B-Instruct-2507, INT4-asym gs128):
- Throughput increased from ~5.6 t/s to ~26.2 t/s (~4.7x speedup).
- Latency (TTFT) reduced from ~3.39s to ~2.40s (~30% reduction).
- Total test time reduced from ~6m23s to ~3m49s.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add wan2.1 layered dit modeling code and sample test

* fix create model failures

* fix layered dit inference failure

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* free memory after model infer finished

* Qwen3-moe int4 performance alignment with openvino-IR

1. Enable router weight quantization with in-flight compression, aligned to openvino-IR default behavior
2. Fixed double-convert issue for router matmul in moe3gemm_fused_compressed that prevented the compressed FC from being used
3. Introduced compressed_type tracking in QuantizedWeight to distinguish different quantizations
4. Centralized logic for detecting sensitive layers to ensure consistent fallback quantization strategies
5. Added a dedicated 2D overload of quantize_q41 for gate_inp quantization, and make_dequant_subgraph to cover the int4 quantization scenario in unit tests
6. Add "all" option in autotest

Test result: MoE in-flight int4 performance improved by ~10%

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add deepseek ocr2 utils functions and tests

* Enable configurable inflight quantization for modeling API and samples

- Update samples to support CLI arguments for quantization modes (INT4/INT8, Sym/Asym) in the format [quant_mode, groupsize, backup_mode].
- Refactor `SafetensorsWeightFinalizer` and loaders to accept explicit `QuantizationConfig`, improving upon env-var only configuration.
- Add support for 1D tensor quantization.
- Add quantization statistics logging to `SafetensorsWeightFinalizer`.
- Add `quantization_utils.hpp` for shared CLI parsing logic.
- Qwen3-VL model can be quantized by separated configs for text and vision.
- refine auto_test to use cli args for quantization config.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* fix ULT failure

* add code for sam vit modeling

* add code for LLM as vision encoder modeling

* add code for projector and packager modeling

* add code for deepseek ocr2 text part modeling

* fix ULT tests failure (still has a small diff against reference values)

* add an E2E sample test app for deepseek-ocr2

* fix weights binding errors

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix the new tests bug

* fix moe->int4Router test bug!

* fix the Xe2 Z-Image viSA bug

* Refactor safetensors utils and add performance instrumentation

Refactor: Moved quantization_utils.hpp to safetensors_utils/ and updated inclusions in modeling_qwen3_vl.cpp and modeling_zimage.cpp.
Performance: Added detailed timing statistics (Fetch, Quantize, Graph construction) to SafetensorsWeightFinalizer and individual file loading times in SafetensorsLoader.
Cleanup: Wrapped verbose weight name logging in safetensors_modeling.cpp behind a debug flag to reduce console noise.
Refine Zimage model compiling: only keep the DiT model at fp32 due to an fp16 overflow issue

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* dump wan dit full model and layered models

* add python test for wan2.1 layered dit ir models

* force wan2.1 dit to use public opset to construct rope

* fix tensor conversion failure

* modeling: Fix KV cache Variable naming for NPUW compatibility

Change Variable naming from 'model.layers[N].self_attn.key_cache'
to the 'past_key_values.N.key' / 'present.N.key' format to match NPUW's
StatefulToStateless pass regex expectations.
* fix the zimage_dit_dummy_test error (seen on ARL, but BMG also fails with: parsing vISA inline assembly failed)

* modeling: Fix SDPA mask and scale for NPU compatibility

Add build_kv_causal_mask_with_attention() helper that properly handles
attention_mask integration with KV cache causal masking. This ensures
the StatefulToStateless pass can correctly process SDPA operations.

Key changes:
- Implement build_kv_causal_mask_with_attention() in llm ops
- Pass scale parameter explicitly to SDPA op instead of relying on
  default value. This is required for NPU decomposition pass to use
  the correct scale, especially for models like deepseek_sam_vit that
  use non-standard scale (1.0f instead of 1/sqrt(head_dim))
- Fix regex bug in kv_cache.cpp: \_d+ -> \d+

* modeling: Unify Qwen3 forward API with attention_mask

Simplify Qwen3 model API by removing duplicate forward overloads and
unifying on attention_mask parameter for NPU/NPUW compatibility.

Key changes:
- Remove causal_mask versions of forward() in favor of attention_mask
- Update Qwen3Attention, Qwen3DecoderLayer, Qwen3Model, Qwen3ForCausalLM
- Simplified forward() constructs all-ones attention_mask internally
- Update build_qwen3_model() to use attention_mask input parameter
- Update unit tests to provide attention_mask input

* Fix output name collision in WanDIT Layered Block Group and update test harnesses

Root Cause: The input parameter was also named "hidden_states". This collision caused the inference request to
return the unmodified input tensor when queried, effectively bypassing the block group computation.
Updated both C++ and Python runners to retrieve the new output name "hidden_states_out".
Added debugging helpers (generate_fixed_data, i64 support in tensor conversion) to the C++ sample.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* refine ZImageDit.test

* refactor: restructure models directory following HuggingFace transformers convention

- Group model files by model family into subdirectories:
  - qwen3/, qwen3_moe/, qwen3_vl/, deepseek_v2/, deepseek_ocr2/
  - smollm3/, wan/, zimage/, youtu/
- Rename files with modeling_ prefix for model files
- Rename files with processing_ prefix for utility files
- Update all #include paths accordingly
- Align with HuggingFace transformers organization principle:
  Each model variant (Dense/MoE/VL) has its own independent directory

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* feat(ops): add nn ops for Qwen3-TTS

- Activation functions: relu, sigmoid, tanh_activation
- 1D Convolution: conv1d and conv_transpose1d with optional bias
- Layout-aware bias reshape for conv1d (auto-detects NCL vs NLC from output shape)
- Batch normalization: batch_norm
- Pooling operations: adaptive_avg_pool1d, avg_pool1d, max_pool1d
- Part of ops dependency patch for Qwen3-TTS integration

* feat(ops): add math ops for Qwen3-TTS

- sin/cos: trigonometric operations for SnakeBeta activation function
- reduce_sum: supports both single axis and multiple axes reduction with keepdim option
- Part of ops dependency patch for Qwen3-TTS integration

* feat(ops): add tensor ops for Qwen3-TTS

- split: supports both equal number of splits and variable split sizes
- flip: axis-based tensor reversal operation
- Part of ops dependency patch for Qwen3-TTS integration

* test(ops): add unit tests for Qwen3-TTS ops

Add comprehensive unit tests for new ops in qwen3_tts_ops_test.cpp:

NN Ops (7 tests):
- relu: element-wise max(0, x)
- sigmoid: 1/(1+exp(-x)) activation
- tanh_activation: hyperbolic tangent
- conv1d (basic): 1D convolution without bias
- conv1d (with bias): 1D convolution with bias
- conv_transpose1d: transposed convolution shape verification
- batch_norm: batch normalization (inference mode)
- avg_pool1d: average pooling

Math Ops (4 tests):
- sin: element-wise sine
- cos: element-wise cosine
- reduce_sum: sum reduction along single axis
- reduce_sum: sum reduction along multiple axes

Tensor Ops (4 tests):
- split (equal): split into equal parts
- split (variable): split into variable-sized parts
- flip: reverse along axis (2D)
- flip: reverse along axis (3D)

All tests include reference implementations and use appropriate
tolerance levels (k_tol_exact, k_tol_transcendental, k_tol_default).

* feat(qwen3-tts): add common headers and base modules

Add the qwen3_tts module directory under models/ with modular file structure:

Common header (modeling_qwen3_tts.hpp):
- Qwen3TTSTalkerConfig: 28-layer mRoPE attention model config
- Qwen3TTSCodePredictorConfig: 5-layer predictor config
- SpeechDecoderConfig: RVQ + PreTransformer + ConvNeXt decoder config
- KV cache output structures for generation

Module-specific files (following existing project conventions):
- modeling_qwen3_tts_talker.hpp/cpp: Talker module declarations and factories
- modeling_qwen3_tts_code_predictor.hpp/cpp: CodePredictor module
- modeling_qwen3_tts_speech_decoder.hpp/cpp: SpeechDecoder module

All factory functions have placeholder implementations to be filled in
subsequent patches (Patch 3-5).

Patch 2 of the atomic Qwen3-TTS integration series.

* feat(qwen3-tts): implement Talker module with mRoPE and KV cache

- Add Qwen3TTSTextProjection for text embedding projection (896->2048)
- Add Qwen3TTSTalkerAttention with mRoPE and GQA support (16 Q heads, 8 KV)
- Add Qwen3TTSTalkerMLP with SwiGLU activation
- Add Qwen3TTSTalkerDecoderLayer and Qwen3TTSTalkerModel (28 layers)
- Add Qwen3TTSTalkerForConditionalGeneration with codec_head
- Implement factory functions for embedding, prefill, and decode models
- Support both forward_no_cache and forward_with_cache variants
- mRoPE config: mrope_section={24,20,20}, rope_theta=1e6

* test(qwen3-tts): add Talker module unit tests

- Add TextProjection reference test with SiLU activation
- Add SwiGLU MLP reference test
- Add Embedding model structure and reference tests
- Add Codec-only embedding model test
- Verify graph structure, input/output shapes, and numerical correctness

* feat(qwen3-tts): implement Code Predictor module

- Add Qwen3TTSCodePredictorAttention with standard RoPE and GQA (16Q/8KV)
- Add Qwen3TTSCodePredictorMLP with SwiGLU activation
- Add Qwen3TTSCodePredictorDecoderLayer (5-layer transformer)
- Add Qwen3TTSCodePredictorModel with hidden_size=1024
- Add Qwen3TTSCodePredictorForConditionalGeneration with 15 codec_embeddings and 15 lm_heads
- Implement factory functions:
  - create_qwen3_tts_code_predictor_model (full model)
  - create_qwen3_tts_code_predictor_ar_model (single step generation)
  - create_qwen3_tts_code_predictor_codec_embed_model (sum of 15 embeddings)
  - create_qwen3_tts_code_predictor_single_codec_embed_model (single layer embedding)

* test(qwen3-tts): add Code Predictor module unit tests

* feat(qwen3-tts): add Speech Decoder module implementation

Implement the 12Hz Speech Decoder that converts RVQ codec tokens to audio:
- RVQDequantizer: converts 16-layer RVQ codes to continuous embeddings
- PreTransformerAttention: sliding window attention with RoPE
- PreTransformerMLP: SwiGLU feedforward network with LayerScale
- PreTransformerDecoderLayer: transformer layer with LayerScale
- PreTransformer: 8-layer transformer for RVQ embedding processing
- SnakeBetaActivation: x + 1/beta * sin^2(x * alpha) activation
- ConvNeXtBlock: used in pre-decoder upsampling
- ResidualUnit: dilated causal conv with SnakeBeta activation
- DecoderBlock: transposed conv upsample + residual units
- SpeechDecoderModel: complete decoder pipeline

Audio generation pipeline:
1. RVQ dequantization (16 codebooks -> embeddings)
2. Pre-conv channel expansion (512 -> 1024)
3. Pre-transformer (8 layers, sliding window attention)
4. Pre-decoder upsample (2x2 with ConvNeXt)
5. Decoder blocks (8x5x4x3 = 480x upsample)
6. Final SnakeBeta + conv to audio

Total upsample: 4 * 480 = 1920x
Output: 24kHz audio from 12.5Hz codec tokens

* test(qwen3-tts): add Speech Decoder module unit tests

Add comprehensive unit tests for Speech Decoder components:
- SnakeBeta activation (x + 1/beta * sin^2(x * alpha))
- PreTransformerMLP (SwiGLU feedforward)
- RVQ Dequantizer (first codebook + multi-codebook sum)
- ConvNeXt block (depthwise + pointwise convs)
- Residual unit (dilated causal conv)
- Decoder block (transposed conv upsample)
- Audio length calculation (1920x upsample factor)

All 8 tests passing on GPU backend.

* Fixed the single-token validation issue in the target model.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* use the new models folder structure

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* tokenizer: enhance the detection of first time tokenizer conversion

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix ULT test failure

* run the first 4 layers inference

* run full model with all layers

* avoid triggering safetensors files loading twice in PA

* enable FP8 quantization to INT4 asym/sym and enable moe weight loader

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* add env variable OV_GENAI_QWEN3_NEXT_NUM_LAYERS (0~48) to control layer number in modeling

* Disable router quantization by default and add shared_expert_gate detection

- Change quantize_routers default from true to false (routers are small weights,
  quantization is usually not beneficial)
- Add shared_expert_gate pattern to router detection in QuantizationSelector
- This fixes MatMul dimension mismatch when shared_expert_gate gets incorrectly
  quantized with INT4 in Qwen3-Next models

* f16 inference precision set

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* add a qwen3-next ult test case for compile and run Qwen3NextLinearAttention

* change ULT test case to use int4 weights

* add a new Qwen3NextGatedDeltaNet ULT test

* Fix double model loading for safetensors and reject hybrid attention models for PA backend

1. pipeline.cpp:
   - Add is_hybrid_attention_model() to detect models with hybrid attention
     (e.g., qwen3_next which uses both linear attention and SDPA)
   - Extend can_try_auto_pa_backend() to check for safetensors files
   - Skip PA backend for hybrid attention models to avoid double loading

2. paged_attention_transformations.cpp:
   - Add runtime check for linear attention states (linear_states.*.conv/recurrent)
   - Throw clear error message when hybrid attention model is detected
   - Improve ASSERT message for key_cache/value_cache parameter validation

This fixes the issue where Qwen3-Next models caused double loading because
SDPAToPagedAttention pass cannot handle linear attention states that also
use beam_idx for reordering.

* Fix build error in Linux

* Fix build error in Linux

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* add initial modeling source files for qwen3.5 9b dense model

* add ULT tests for qwen3.5 9b dense modeling

* fix qwen3.5 Qwen3_5CacheRuntime ULT failures

* run all qwen3.5 ULT tests on GPU device

* update qwen3.5 modeling code based on the latest hf transformers updates

* update qwen3.5 sample test app to support dummy e2e test with random dummy weights

* use the config parameters from real qwen3.5 9b model

* use int4 quantization to run dummy model

* Fix qwen3.5 dense model vlm mode reshape crash in dequant subgraph when weight dim is not divisible by group_size

When in_features (e.g. 4304) is not evenly divisible by quantization
group_size (e.g. 128), the dequant subgraph incorrectly computed
group_size = in_features / num_groups via integer division (4304/34=126),
creating a 3D tensor with fewer elements than the original weight
(34*126=4284 != 4304) and causing a reshape failure.

Add a flat dequantization path using Gather-based scale/zero_point
expansion for non-divisible cases, keeping the existing 3D grouped
path for evenly-divisible cases.
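The failure mode can be reproduced numerically (numbers taken from the commit message above; the grouping arithmetic below is illustrative):

```python
import math

# Worked example of the reshape failure described above.
in_features = 4304
group_size = 128

num_groups = math.ceil(in_features / group_size)   # 4304 / 128 -> 34 groups
assert num_groups == 34

# Buggy path: integer division recomputes a smaller group size.
bad_group_size = in_features // num_groups         # 4304 // 34 = 126
assert bad_group_size == 126
assert num_groups * bad_group_size == 4284
assert num_groups * bad_group_size != in_features  # reshape fails: 4284 != 4304

# Evenly-divisible case, where the existing 3D grouped path is safe:
assert 4096 % group_size == 0
```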

* Add Qwen3NextGatedDeltaNet2 with fused LinearAttention op

Replace TensorIterator recurrent loop with ops::linear_attention
in a new Qwen3NextGatedDeltaNet2 class. All other logic (projections,
conv1d, gating, normalization, output) is unchanged from v1.

Add unit tests for graph construction, GPU compilation, stateful
prefill/decode, and numerical equivalence against v1.
Add benchmark test comparing GatedDeltaNet TensorIterator vs LinearAttention op

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add initial source files for qwen3.5 moe modeling

* improve qwen3.5 sample cmd line helper

* fix qwen3.5 moe ult test failure

* remove legacy cmd line options

* improve cmd line options of qwen3.5 sample test

* add performance data

* Optimize MoE XML size using postponed_constant + constant_fold

- Add postponed_constant rt_info to Concat nodes in MoE expert weight functions
- Use constant_fold() on Transpose nodes to produce pure Const for scales/zps
- Enables OpenVINO serializer to fold Concat(Const, Const, ...) into single merged Const

Results:
- Qwen3-MoE: XML -24.4%, Const nodes -42.6%
- Qwen3-Next: XML -96.3% (215MB->8MB), Const nodes -98.7% (298k->4k)

Modified files:
- modeling_qwen3_moe.cpp: 9 functions with postponed_constant
- modeling_qwen3_next.cpp: 9 functions with postponed_constant
- safetensors_weight_finalizer.cpp: constant_fold Transpose for scales/zps

* Use fused LinearAttention op in Qwen3_5 GatedDeltaNet by default

TensorIterator fallback via OV_GENAI_USE_LINEAR_ATTENTION_OP=0

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Add Qwen3_5 LinearAttention vs TensorIterator comparison tests and move q/k normalization into TensorIterator branch

Make use_linear_attention_op() non-static so env var can be toggled at runtime for testing
Move q/k L2 normalization + scaling into TensorIterator branch only; LinearAttention op path passes raw q/k (op handles normalization internally)
Add qwen3_5_linear_attention_op unit test with 5 tests:
1. graph structure verification
2. stateful variable registration
3. numerical equivalence
4. prefill+decode state carry-over
5. GPU performance benchmarking

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add more qwen3.5 attention ULT tests

* Disable vision encoder quantization by default

INT4 FullyConnectedCompressed has severe performance regression (~115x slower)
on small batch sizes typical of vision encoders (e.g., 64 tokens for 256x256 image).

Profiling results:
- INT4 vision encode: 41,838 ms
- FP32 vision encode: 360 ms

Vision encoder characteristics that make INT4 unsuitable:
- Small batch size (64 tokens vs thousands in LLM decode)
- Single-shot execution (no amortization across decode iterations)
- Small model size (233MB) where INT4 memory savings are minimal

* Fix quantization mode parsing: use exact match instead of substring

The previous substring-based parsing (find("sym")) incorrectly matched int4_asym as INT4_SYM, because "asym" contains "sym" as a substring. Changed to exact string comparison to correctly parse quantization modes.
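A minimal sketch of the bug and the fix (illustrative Python, not the actual C++ parser):

```python
def parse_buggy(mode: str) -> str:
    # Substring check: "int4_asym" contains "sym", so it matches first.
    if "sym" in mode:
        return "INT4_SYM"
    if "asym" in mode:
        return "INT4_ASYM"
    return "NONE"

def parse_fixed(mode: str) -> str:
    # Exact comparison disambiguates the two modes.
    if mode == "int4_sym":
        return "INT4_SYM"
    if mode == "int4_asym":
        return "INT4_ASYM"
    return "NONE"

assert parse_buggy("int4_asym") == "INT4_SYM"   # wrong classification
assert parse_fixed("int4_asym") == "INT4_ASYM"  # correct
```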

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Add tokenizer fallback for dummy model testing

- Catch tokenizer.encode() failures and fall back to dummy tokenization
- For VL mode, detect image_token_id mismatch and fall back to dummy tokenization
- This allows testing with dummy models that have incompatible tokenizer configs

* enable constant folding to prepare moe inputs

* Update moe3gemm_fused_compressed for shared expert

* remove fallback leftover

* Integrate to modeling test

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Update qwen3_next

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* feat(qwen3_next): replace GatedDeltaNet with GatedDeltaNet2 for optimized linear attention

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix(qwen3_next): swap beta/g arg order in ops::linear_attention to match kernel convention

The LinearAttention OCL kernel expects input[3]=g (decay gate) and
input[4]=beta (delta update gate), but ops::linear_attention() was
passing them in reversed order (beta at index 3, g at index 4).
This caused the state decay factor exp(beta) to grow unboundedly
and the delta update to be inverted, producing incorrect outputs.

Swap the args in the OutputVector to match the kernel layout.

* test(qwen3_next): add V1 vs V2 numerical equivalence E2E test

Add GatedDeltaNetV1vsV2.PrefillAndMultiStepDecodeMatchOnGPU test
that verifies GatedDeltaNet (TensorIterator-based V1) and
GatedDeltaNet2 (LinearAttention kernel-based V2) produce identical
outputs through prefill (seq_len=8) and 5 multi-step decode steps
with stateful recurrent state carry-over.

Uses f32 weights with tiny increments (1e-8f) to avoid overflow,
NaN-aware comparison, and tolerance threshold k_tol_linear_attn.
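A NaN-aware tolerance comparison of the kind described above can be sketched as follows (the k_tol_linear_attn value and exact matching semantics are assumptions, not taken from the test code):

```python
import math

def nan_aware_close(a: float, b: float, tol: float) -> bool:
    # NaN never compares equal with ==, so handle it explicitly:
    # a NaN element only matches another NaN element.
    if math.isnan(a) or math.isnan(b):
        return math.isnan(a) and math.isnan(b)
    return abs(a - b) <= tol

k_tol_linear_attn = 1e-3  # illustrative threshold
assert nan_aware_close(1.0, 1.0005, k_tol_linear_attn)
assert nan_aware_close(float("nan"), float("nan"), k_tol_linear_attn)
assert not nan_aware_close(float("nan"), 0.0, k_tol_linear_attn)
```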

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* test(qwen3_next): add V1 vs V2 numerical equivalence E2E test

* Update build_block for shared_expert fusion

* Fixed _shared_intermediate_size incorrect issue

* fix tensor name mismatch issue (model.layers.N -> model.layers[N])

* qwen3.5 moe 35B real model -- fix part1

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
(cherry picked from commit 5dceb43e0f54bae7c5f7721cc728125ffa48452d)

* enable qwen3.5 greedy_causal_llm text mode

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
(cherry picked from commit 89b88a90eeafe4e231a927c0e75b31461e73da5c)

* add --cache-model feature to avoid build model graph every time

* fix cached model read issue

* consolidate qwen3.5 attention tests

* refine tests name and add linear attn basic op and fused op result check

* refactor qwen3.5 moe ult tests

* add cpu ref to check output correctness

* fix qwen3.5 moe ult tests failures

* align qwen3.5 linear attention ult test configs with real model

* use compressed int4 weights and attention mask

* dump ov ir model files in qwen3.5 attention ult test

* optimize GDN model graph structure

* tests: cap mismatch logs and add match stats

* Unify Qwen3.5 quantization control via env vars

* safetensors: support shard names with embedded .safetensors suffix

* qwen3.5: allow explicit head_dim when hidden_size is not head-divisible

* Update qwen3_5 layers option and batch scripts

* fix(qwen3_5): disable vision INT4 quantization by default

GPU compile_model hangs with many INT4 dequant subgraphs in the vision
encoder. Disable vision quantization by default (matching Qwen3 VL behavior).

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* feat(qwen3_5): add --max-pixels CLI parameter

Add --max-pixels CLI parameter to limit vision input pixels.
Use --max-pixels N to override max_pixels from preprocessor_config.json.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* feat(qwen3_5): add OV_GENAI_VISION_DEVICE env var

Allow overriding the device for vision model compilation via the
OV_GENAI_VISION_DEVICE environment variable. Useful for running
vision on CPU while text on GPU, or vice versa.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* perf: cache pos_embed weight in vision IR to avoid reloading safetensors

When using --cache-model, the pos_embed weight needed for vision
preprocessing was forcing a full load of all 14 safetensors shards.

Instead of saving pos_embed to a separate binary file, embed it as
a Constant->Result node in the vision IR so it is serialized into the
vision .bin alongside other vision weights. On load, extract the
constant and remove the extra Result before compile_model.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* Align modeling_qwen3_5 prompt handling with chat templates

* Honor generation config stop tokens in modeling_qwen3_5

* fix: align qwen3.5 prefill position_ids semantics (upstream c281a2de89)

* test: cover qwen3.5 VL text-position offset (upstream a08936f6e4)

* fix: align qwen3.5 repeated-generate past_len tracking (upstream 3b8f0948af)

* fix: guard qwen3.5 position tracking in mixed-length batch decode (upstream 9d9b012dcf)

* fix: strengthen qwen3.5 input checks in forward path (upstream 42791a34fd)

* fix the deepseek_ocr2 result bug

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: implement OmniCode2WavMappedWeightSource for speech decoder

- Support 3 weight mapping strategies: key name translation, codebook sharing, synthetic identity weights

- Handle 652 total weights with complex mapping between safetensors keys and model parameter names

- Map all 16 layers codebook.embed to shared code_embedding.weight

- Generate synthetic identity/zero weights for pre_conv, pre_transformer projection layers

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add Python bridge for audio mel-spectrogram and vision preprocessing

- processing_qwen3_omni_audio.cpp: integrate Python bridge call to extract
  mel spectrogram features from audio WAV via subprocess
- processing_qwen3_omni_bridge.py: bridge script that loads audio, computes
  WhisperFeatureExtractor mel spectrogram, and writes binary tensor output
- processing_qwen3_omni_vision.cpp: add grid_thw accessor for mRoPE
  position computation in audio+image path

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: implement pure C++ WhisperFeatureExtractor mel spectrogram

Replace Python bridge dependency with native C++ audio pipeline:
- WAV file loading (16-bit PCM, mono, supports resampling to 16kHz)
- STFT with Hann window (400-sample frame, 160-sample hop)
- 128-band mel filterbank (loaded from whisper mel_filters.npz)
- Log-mel spectrogram with 30-second padding/truncation
- Zero-copy ov::Tensor output for direct OpenVINO inference
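The framing arithmetic for this front-end follows from the constants listed above (standard Whisper values; the shape derivation below is a sketch, not the C++ code):

```python
# Framing arithmetic for the C++ WhisperFeatureExtractor described above.
SAMPLE_RATE = 16_000   # Hz, after resampling
N_FFT = 400            # Hann window / frame length in samples (25 ms)
HOP_LENGTH = 160       # hop in samples (10 ms)
N_MELS = 128           # mel filterbank bands
CHUNK_SECONDS = 30     # pad/truncate target

n_samples = SAMPLE_RATE * CHUNK_SECONDS
n_frames = n_samples // HOP_LENGTH        # 3000 frames per 30 s chunk
assert n_frames == 3000

# Log-mel feature tensor shape fed to the audio encoder:
mel_shape = (N_MELS, n_frames)
assert mel_shape == (128, 3000)
```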

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: full TTS pipeline with C++ audio and Case 4 audio+image support

modeling_qwen3_omni_tts_min.cpp:
- Complete TTS pipeline: talker prefill/AR -> code predictor -> speech decoder
- Native C++ audio pipeline: WAV -> mel -> audio encoder (replaces Python bridge)
- Case 4 fix: enable vision model in audio+image path (was skipped when
  has_audio=true), pass actual image token count to build_audio_prompt(),
  use grid_thw for mRoPE position planning
- Memory optimization: release audio encoder resources after inference to
  avoid OOM when vision encoder loads next
- Streaming text embeddings during AR generation with deepstack fusion

modeling_qwen3_omni.cpp:
- Add tts_pad_token_id and tts_eos_token_id config usage for base sample

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix conflict

* fix merged duplicate code

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>

* qwen3-omni-4b: add precision mode support and fix TTS audio quality under int8

Add PrecisionMode infrastructure to the TTS binary:
- PrecisionMode enum (fp32, inf_fp16_kv_int8, inf_fp32_kv_fp32_w_int4_asym, etc.)
- compile_props_for_precision() / set_text_model_precision() helpers
- User-specified precision applies to text/vision/audio models via argv[9]
- Output DEVICE and PRECISION_MODE as stdout KV lines for tooling

Fix int8 TTS audio quality degradation:
- Root cause: run_min_tts() was applying user-specified precision
  (e.g. inference_precision=f16, kv_cache_precision=u8) to all TTS models
  (talker, code predictor, speech decoder). These small models are sensitive
  to reduced precision, producing degraded audio -- wrong length (370K vs
  276K samples), quieter (mean_abs 646 vs 1370), and 9x slower (427s vs 46s).
- Fix: TTS models now always compile with inference_precision=f32,
  independent of the precision mode used for the main text model.
  Note: int4 mode (inf_fp32_kv_fp32_w_int4_asym) was unaffected because
  its compile props already matched fp32.

Update case comparison tool (qwen3_omni_case_compare.py):
- Add --devices (comma-separated) and --precisions (comma-separated) flags
- Matrix loop order: precision (outer) -> device (middle) -> case (inner)
- Add --timeout (default 600s) with subprocess.TimeoutExpired handling
- Per-precision summary table and grand summary at end
- WAV filenames include device/precision to avoid collisions
- Detailed per-case output (prompt, C++ status, text, wav, perf one-liner)
- Fix Case 1 total_ms N/A by computing from preprocess+vision+ttft+decode

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: force speech decoder to CPU to fix GPU audio noise

The BigVGAN speech decoder produces noise-like audio when running on GPU.
The Intel GPU plugin downcasts operations to f16 despite inference_precision=f32
being set (it is only a hint). The SnakeBeta activation (x + 1/beta * sin^2(x *
alpha)) involves exp() and sin() operations that accumulate significant f16
rounding errors across the 4-stage upsampling chain (8x5x4x3 = 480x), causing
a systematic ~10x amplitude loss in the output waveform.

Evidence from matrix test (fp32 and int4 precision modes):
  Case 2: CPU max_abs=13293, GPU max_abs=1434 (ratio 9.27x)
  Case 3: CPU max_abs=12537, GPU max_abs=1253 (ratio 10.01x)
  Case 4: CPU max_abs=14139, GPU max_abs=1588 (ratio 8.90x)

Sample counts are identical (same codec tokens), confirming the issue is
isolated to the speech decoder, not the talker or code predictor.

Fix: compile the speech decoder model with device="CPU" unconditionally.
This is safe because the speech decoder runs only once at the end of TTS
(not in the autoregressive loop), so the performance impact is negligible.
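As a toy illustration (not a reproduction of the GPU issue), the f16 sensitivity of SnakeBeta can be shown with NumPy by comparing an f16 evaluation against an f32 reference; parameter values are arbitrary:

```python
import numpy as np

def snake_beta(x, alpha, beta):
    # SnakeBeta as quoted above: x + 1/beta * sin^2(x * alpha).
    return x + (1.0 / beta) * np.sin(x * alpha) ** 2

x32 = np.linspace(-3.0, 3.0, 1024, dtype=np.float32)
y32 = snake_beta(x32, np.float32(0.5), np.float32(0.01))

x16 = x32.astype(np.float16)
y16 = snake_beta(x16, np.float16(0.5), np.float16(0.01))

# The f16 path diverges from the f32 reference; in the real model such
# rounding compounds across the 480x upsampling chain.
err = float(np.max(np.abs(y32 - y16.astype(np.float32))))
assert err > 0.0
```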

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b(refactor): remove dead Python bridge code from TTS binary

- modeling_qwen3_omni_tts_min.cpp: remove unused flatten_json_values(),
  tensor_from_bridge_json(), find_python_executable() and nlohmann/json
  include (replaced by native C++ mel spectrogram pipeline)
- qwen3_omni_case_compare.py: remove QWEN3_OMNI_BRIDGE_DIR and
  PYTHON_EXECUTABLE env var setup (TTS binary no longer uses Python bridge)

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add ASR/TTS sub-component performance instrumentation

Add fine-grained timing instrumentation to the TTS pipeline in
modeling_qwen3_omni_tts_min.cpp to measure individual sub-component
performance across precisions (FP32, INT8, INT4) on CPU.

C++ changes (modeling_qwen3_omni_tts_min.cpp):
- Extend TtsRunResult with 6 new fields: model_compile_ms,
  talker_prefill_ms, talker_decode_ms, code_predictor_ms,
  speech_decoder_ms, codec_frames
- Add steady_clock timers around: model compile, talker prefill,
  per-frame talker decode and code predictor (accumulators),
  speech decoder inference
- Output 6 new KV lines: TTS_MODEL_COMPILE_MS, TTS_TALKER_PREFILL_MS,
  TTS_TALKER_DECODE_MS, TTS_CODE_PREDICTOR_MS, TTS_SPEECH_DECODER_MS,
  TTS_CODEC_FRAMES
- Force speech decoder to CPU to avoid GPU f16 downcast audio noise

Python changes:
- Update parse_kv_stdout in qwen3_omni_case_compare.py to capture
  the 6 new TTS sub-component fields
- Add standalone benchmark script (qwen3_omni_asr_tts_bench.py) that
  runs Cases 2/3/4 across FP32/INT8/INT4 on CPU, collects ASR
  (AUDIO_ENCODE_MS) and TTS sub-component timing, and produces
  summary tables + JSON output

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-vl: add video input preprocessing support

Add preprocess_video() to Qwen3VL and Qwen3Omni vision preprocessors for
handling multi-frame video inputs:

- Accept vector of u8 frames ([H,W,3] or [1,H,W,3])
- Smart resize all frames to a shared resolution using video min/max pixel limits
- Temporal padding to align with temporal_patch_size
- Patchify video frames with correct temporal-spatial ordering
- Build position embeddings and rotary cos/sin for video grid_thw
- Qwen3Omni delegates to Qwen3VL base implementation

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add video input support to TTS sample pipeline

Extend modeling_qwen3_omni_tts_min sample to support video inputs:

- Add build_video_prompt() for video-only prompt construction
- Add build_audio_prompt() video_tokens parameter for combined prompts
- Add load_video_frames() to load frames from a directory of images
- Add smart_nframes() for intelligent frame sampling (mirrors Python logic)
- Add linspace_indices() for uniform frame index sampling
- Wire video preprocessing through the vision encoder pipeline
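Uniform frame-index sampling of the kind linspace_indices() performs can be sketched as below (illustrative only; the C++ helper's rounding convention may differ):

```python
def linspace_indices(total_frames: int, nframes: int) -> list[int]:
    """Pick nframes indices spread uniformly over [0, total_frames - 1]."""
    if nframes <= 1:
        return [0]
    step = (total_frames - 1) / (nframes - 1)
    return [round(i * step) for i in range(nframes)]

idx = linspace_indices(total_frames=100, nframes=8)
assert len(idx) == 8
assert idx[0] == 0 and idx[-1] == 99  # endpoints are included
assert idx == sorted(idx)             # indices are monotonically increasing
```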

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add Case 5 (image+video+audio+text) to case compare tool

Add multi-modal Case 5 test scenario to qwen3_omni_case_compare.py:

- Add --video, --case5-audio, --case5-image, --case5-prompt-file CLI args
- Add --max-video-frames option to control memory usage (default: 8)
- Auto-extract video frames via extract_video_frames.py for C++ binary
- Support combined image + video + audio + text input conversation
- Track case5-specific resources in the output report JSON

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: Add video frame extraction tool for Qwen3-Omni C++ pipeline

Add extract_video_frames utility that extracts sampled video frames
from a video file and saves them as numbered PNG images. This is used
by the C++ modeling_qwen3_omni_tts_min binary which loads frames via
stb_image.

The frame sampling logic mirrors the Python Qwen3 Omni video pipeline
(qwen_omni_utils/vision_process.py) to ensure consistent frame selection
between Python and C++ inference paths.

- Add extract_video_frames.cpp using OpenCV VideoCapture for frame extraction
- Update CMakeLists.txt: add extract_video_frames target with proper OpenCV
  imported targets and post-build DLL copy step for Windows
- Update qwen3_omni_case_compare.py to use the C++ binary directly

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Update openvino.genai/src/cpp/src/modeling/samples/CMakeLists.txt

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>

* Update openvino.genai/src/cpp/src/modeling/samples/modeling_qwen3_omni_tts_min.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: xzhan34 <xiaolin.zhang@intel.com>

* Update openvino.genai/src/cpp/src/modeling/samples/extract_video_frames.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: xzhan34 <xiaolin.zhang@intel.com>

* ov-master-rebase: replace ov::element::undefined with ov::element::dynamic

OpenVINO 2026.1.0 removed ov::element::undefined from the element type
enum. The replacement is ov::element::dynamic, which serves the same
purpose as a sentinel value indicating an unspecified element type.

Update quantization_selector.hpp default parameter values and
quantization_selector.cpp comparison to use ov::element::dynamic.

Build error (MSVC C2039):
  quantization_selector.hpp(62): error C2039: 'undefined': is not a
  member of 'ov::element'
  quantization_selector.hpp(79): error C2039: 'undefined': is not a
  member of 'ov::element'
  quantization_selector.cpp(40): error C2039: 'undefined': is not a
  member of 'ov::element'

* ov-master-rebase: adapt to const-correct Tensor::data<T>() in OpenVINO 2026.1.0

In OpenVINO 2026.1.0, calling Tensor::data<T>() on a const ov::Tensor
now returns 'const T*' instead of 'T*'. This is a breaking API change
that enforces const-correctness.

Fix three call sites:
- sequence_group.hpp: change 'int64_t* position_ids_data' to
  'const int64_t*' when reading from a const tensor
- speculative_decoding_stateful.cpp: change 'float* logits_data' to
  'const float*' in the sample_token lambda
- lora/adapter.cpp: add const_cast<char*> when passing tensor data to
  safetensors_file_init(), which is a C API expecting void* but only
  reads the data

Build errors (MSVC C2440):
  sequence_group.hpp(240): error C2440: 'initializing': cannot convert
  from 'const int64_t *' to 'int64_t *'
  speculative_decoding_stateful.cpp(341): error C2440: 'initializing':
  cannot convert from 'const float *' to 'float *'
  adapter.cpp(105): error C2440: 'initializing': cannot convert from
  'const char *' to 'char *'

* ov-master-rebase: make custom allocator deallocate() noexcept for OV 2026.1.0

OpenVINO 2026.1.0 changed ov::Allocator's template constructor to use
SFINAE with 'has_noexcept_deallocate_v<T>', requiring that any custom
allocator's deallocate() method be marked noexcept. Without this, the
template constructor is excluded from overload resolution and the
compiler cannot convert custom allocator types to ov::Allocator.

Fix three allocator implementations:
- py_image_generation_pipelines.cpp: TorchTensorAllocator::deallocate
  made noexcept (body emptied since PyTorch manages its own memory)
- samples/cpp/image_generation/load_image.cpp: SharedImageAllocator
  deallocate made noexcept, wrap in explicit ov::Allocator() constructor
- samples/cpp/visual_language_chat/load_image.cpp: same fix as above

Build errors (MSVC C2440):
  py_image_generation_pipelines.cpp(250): error C2440: cannot convert
  from 'TorchTensorAllocator' to 'const ov::Allocator &'
  load_image.cpp(43): error C2440: '<function-style-cast>': cannot
  convert from 'SharedImageAllocator' to 'ov::Tensor'
  load_image.cpp(57): error C2440: '<function-style-cast>': cannot
  convert from 'SharedImageAllocator' to 'ov::Tensor'

* ov-master-rebase: add ENABLE_NEW_ARCH_OPS option to conditionally exclude new-arch code

Introduce ENABLE_NEW_ARCH_OPS cmake option (default ON) to allow
building openvino.genai against upstream OpenVINO master which lacks
new-arch custom operators (LinearAttention, MOE3GemmFusedCompressed,
FusedMLP, PlaceholderExtension).

When ENABLE_NEW_ARCH_OPS=OFF:
- cmake/features.cmake: define the new option
- src/cpp/CMakeLists.txt: exclude model sources (src/modeling/models/),
  gguf building blocks, gguf/safetensors modeling, and gguf loader from
  compilation; define ENABLE_NEW_ARCH_OPS preprocessor macro when ON
- CMakeLists.txt: skip ov_ops_tests subdirectory (tests require
  fused_mlp.hpp, moe_3gemm_fused_compressed.hpp which are new-arch only)
- src/cpp/src/modeling/CMakeLists.txt: skip modeling tests and modeling
  samples (they link against new-arch model object files)

Build errors addressed:
  error C1083: Cannot open include file: 'openvino/op/fused_mlp.hpp'
  error C1083: Cannot open include file: 'openvino/op/moe_3gemm_fused_compressed.hpp'
  error C1083: Cannot open include file: 'openvino/op/placeholder_extension.hpp'
  error LNK2019: unresolved external symbol (modeling model classes)
  error LNK1120: N unresolved externals (modeling samples)

* ov-master-rebase: guard new-arch header includes and calls for master OV compatibility

Add compile-time guards for source files that reference new-arch
OpenVINO headers not present in master OpenVINO 2026.1.0.

modeling/ops/ops.cpp:
- Use __has_include() to conditionally include linear_attn.hpp and
  moe_3gemm_fused_compressed.hpp
- When headers are available (new-arch build), use full implementations
- When headers are absent (master build), provide stub functions that
  throw descriptive errors at runtime

utils.cpp:
- Guard #include of gguf_modeling.hpp and safetensors_modeling.hpp with
  #ifdef ENABLE_NEW_ARCH_OPS preprocessor checks
- Guard calls to create_from_gguf() and create_from_safetensors() with
  ENABLE_NEW_ARCH_OPS, providing descriptive error messages when
  disabled

Build errors addressed:
  ops.cpp: error C1083: Cannot open include file: 'openvino/op/linear_attn.hpp'
  ops.cpp: error C1083: Cannot open include file: 'openvino/op/moe_3gemm_fused_compressed.hpp'
  utils.cpp: error C1083: Cannot open include file (gguf_modeling.hpp)
  error LNK2019: unresolved external symbols from gguf/safetensors code

* ov-master-rebase: enable modeling_qwen3_omni, modeling_qwen3_omni_tts_min, extract_video_frames targets in build-master mode (ENABLE_NEW_ARCH_OPS=OFF)

Problem:
When building with ENABLE_NEW_ARCH_OPS=OFF (build-master scenario against
master OpenVINO 2026.1.0), the three sample targets modeling_qwen3_omni,
modeling_qwen3_omni_tts_min, and extract_video_frames were not built because:
1. modeling/CMakeLists.txt guarded add_subdirectory(samples) with both
   ENABLE_SAMPLES AND ENABLE_NEW_ARCH_OPS
2. All model sources under modeling/models/*.cpp were blanket-excluded,
   causing 48 linker errors (unresolved externals) for qwen3_omni, qwen3_vl,
   and qwen3_tts symbols

Changes:
- src/cpp/src/modeling/CMakeLists.txt: Remove ENABLE_NEW_ARCH_OPS guard from
  samples subdirectory inclusion (now only requires ENABLE_SAMPLES)
- src/cpp/src/modeling/samples/CMakeLists.txt: Wrap the 8 non-essential
  targets (modeling_qwen3_vl, modeling_qwen3_5, modeling_deepseek_ocr2,
  modeling_zimage, modeling_wan_t2v, modeling_wan_layered_dit, modeling_dflash,
  modeling_qwen3_tts) in if(ENABLE_NEW_ARCH_OPS) guards; leave the 3 needed
  targets always built
- src/cpp/CMakeLists.txt: After the blanket removal of modeling/models/*.cpp,
  re-add qwen3_omni/*.cpp, qwen3_tts/*.cpp, and qwen3_vl/processing_qwen3_vl.cpp
  back into SOURCE_FILES so the object library provides the required symbols

Verified: build-master succeeds with 0 errors, all 3 exe targets produced.

* ov-master-rebase: vision model divide-by-zero and tokenizer DLL loading in build-master

Two runtime bugs fixed for build-master (ENABLE_NEW_ARCH_OPS=OFF):

1. Vision model INT4 quantization (divide-by-zero crash):
   The shared SafetensorsWeightFinalizer applied INT4_ASYM quantization to both
   text and vision models. Vision encoder weights must NOT be quantized - INT4
   weights cause STATUS_INTEGER_DIVIDE_BY_ZERO (0xC0000094) in the CPU plugin
   during vision inference. Fix: use a separate non-quantizing finalizer for
   create_qwen3_omni_vision_model.

2. Tokenizer DLL not found (core.cpp:193 exception):
   The tokenizers_dll_name CMake variable was only defined inside the
   if(ENABLE_NEW_ARCH_OPS) guard block. When ENABLE_NEW_ARCH_OPS=OFF, the
   post-build copy for the always-built targets (modeling_qwen3_omni,
   modeling_qwen3_omni_tts_min) used an empty variable, so openvino_tokenizers.dll
   was never copied next to the exe. Fix: define tokenizers_dll_name before the
   always-built targets section.

---------

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>
Signed-off-by: xzhan34 <xiaolin.zhang@intel.com>
Co-authored-by: Li, Liang A <liang.a.li@intel.com>
Co-authored-by: Chuansheng Liu <chuansheng.liu@intel.com>
Co-authored-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Co-authored-by: Yina Chen <yina.chen@intel.com>
Co-authored-by: liqianhao111 <qianhao.li@intel.com>
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
Co-authored-by: Ruonan Wang <rnwang@foxmail.com>
Co-authored-by: Wang Yang <yang4.wang@intel.com>
Co-authored-by: River.Li <river.li@intel.com>
Co-authored-by: chenhu-wang <chenhu.wang@intel.com>
Co-authored-by: xiping.yan <xiping.yan@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
xipingyan added a commit that referenced this pull request Mar 13, 2026
…129)

* move WeightParameter to separate source files

* consolidate source files and refine folder name

* add forward() method in each layer block class to replace operator() and keep consistency

* add code for Qwen3MLP and Qwen3DecoderLayer (including MLP and RMSNorm, no Attention block)

* add code to implement Qwen3Attention class

* improve qwen3 attention ULT tests coverage

* add causal mask support in Qwen3Attention class

* add more tensor manipulation interfaces

* improve interface abstraction of ops/tensor/shape objects

* refine ULT tests by moving ref/common functions to test_utils.cpp/test_utils.hpp

* remove unreasonable assumption of "hidden_size_ == num_heads_ * head_dim_"

* add append_kv_cache support and run qwen3-0.6b bf16 GGUF model correctly

* move rope sin/cos construction from decode layer loop to model level

* optimize causal mask calculation

* Add safetensors support for new modeling API

- Implement SafetensorsWeightSource (WeightSource interface)
- Implement SafetensorsWeightFinalizer (WeightFinalizer interface)
  - Converts bf16/f16 weights to f32 for compatibility
- Add create_model_with_modeling_api() in safetensors_modeling.cpp
- Use OV_GENAI_USE_MODELING_API env var to switch implementations
- Both building_blocks and modeling API paths produce correct output

* Use native ScaledDotProductAttention for SDPA optimization

- Replace manual matmul+softmax+matmul with ov::op::v13::ScaledDotProductAttention
- Add build_kv_causal_mask() for dynamic causal mask generation with KV cache
- Add less_equal() operator for mask comparison
- Throughput is on par with legacy building_blocks implementation

* Enable ENABLE_SAFETENSORS by default

* use modeling api to support SmolLM3-3B model

* Add MoE validation test for result checking fusedGEMM3MoECompressed matching

The matching order is MOE->MOECompressed->MOE3GemmFusedCompressed

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* refactor code to use unified create_xxx_model functions

* Export fuse_moe_3gemm_compressed to devapi. Add moe_layer_internal in test_moe_layer to directly call the new make_int4_weight_moe, which reorders weights for the GPU MoE layout

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Restore original files for moe_3gemm_fused_compressed

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* feat(loaders): add core multi-format loader interfaces

Add foundational interfaces for unified model loading architecture:

Core interfaces:
- model_loader.hpp: IModelLoader abstract interface with ModelFormat enum
- model_config.hpp/cpp: Unified ModelConfig with from_gguf/from_hf_json
- loader_registry.hpp/cpp: LoaderRegistry singleton with auto-detection
- model_builder.hpp/cpp: ModelBuilder factory for architecture-specific models
- weight_name_mapper.hpp/cpp: Weight name normalization across formats
- loaders.hpp: Unified include header

Loader stubs (headers only, implementation in next commits):
- gguf/gguf_loader.hpp: GGUF format loader interface
- safetensors/safetensors_loader.hpp: HuggingFace safetensors loader

Note: OpenVINO IR support planned for future releases.

Part of multi-format loader architecture.

* feat(loaders): implement GGUFLoader with gguf_utils integration

Implement GGUF format loader that wraps existing gguf_utils code:

New files:
- loaders/gguf/gguf_loader.cpp: Full GGUFLoader implementation

Features:
- supports(): Checks for .gguf extension
- load_config(): Reads GGUF metadata, converts to ModelConfig
- create_weight_source(): Wraps gguf::GGUFWeightSource
- create_weight_finalizer(): Wraps gguf::GGUFWeightFinalizer with qtypes
- load_tokenizer(): Uses create_tokenizer_from_config()

gguf_utils updates:
- GGUFWeightSource now supports canonical weight names
- Uses WeightNameMapper for GGUF->canonical name conversion
- Maintains backward compatibility with direct GGUF names

Part of multi-format loader architecture.

* feat(loaders): implement SafetensorsLoader with safetensors_utils integration

Implement HuggingFace Safetensors format loader:

New files:
- loaders/safetensors/safetensors_loader.cpp: Full SafetensorsLoader implementation

Features:
- supports(): Checks for config.json + model.safetensors
- load_config(): Reads HFConfig from config.json, converts to ModelConfig
- create_weight_source(): Uses safetensors::SafetensorsWeightSource
- create_weight_finalizer(): Uses safetensors::SafetensorsWeightFinalizer
- load_tokenizer(): Returns nullptr (handled by Tokenizer class)

safetensors_utils updates:
- SafetensorsWeightSource now supports canonical weight names
- Uses WeightNameMapper for HF->canonical name conversion
- Added legacy constructor for backward compatibility
- Maintains compatibility with safetensors_modeling.cpp

Part of multi-format loader architecture.

* feat(loaders): integrate unified loader with model loading pipeline

Add unified model loading entry point with environment variable control:

Entry point (utils.cpp):
- Add use_unified_loader() helper (OV_GENAI_USE_UNIFIED_LOADER env var)
- When enabled, uses LoaderRegistry to detect format and build model
- Falls back to legacy path on error or for unsupported formats
- Preserves full backward compatibility

ModelBuilder integration:
- Add build() method that uses modeling API
- Register Qwen3 architecture builder

Deprecated APIs:
- Mark create_from_gguf() as [[deprecated]] with migration guide
- Mark create_from_safetensors() as [[deprecated]] with migration guide

CMakeLists updates:
- Add conditional compilation for loaders module
- Respect ENABLE_GGUF and ENABLE_SAFETENSORS flags

Part of multi-format loader architecture.

* refactor(modeling): add build_qwen3_model self-registration to qwen3_dense.cpp

Move model builder function to model file following vLLM pattern:

Changes:
- Add build_qwen3_model() function to qwen3_dense.cpp
- Add static self-registration at module initialization
- ModelBuilder::instance() registration happens automatically

Benefits:
- Each model file is self-contained
- Adding new models doesn't require modifying model_builder.cpp
- Follows vLLM's model loading pattern

Note: Qwen2 support pending validation (commented out for now).

Part of multi-format loader architecture.

* docs(loaders): add README with supported formats and TODO list

Add documentation for multi-format loader module:

Contents:
- Supported formats table (GGUF, Safetensors tested; OpenVINO IR TODO)
- Supported architectures (Qwen3 tested; Qwen2, LLaMA pending)
- Environment variables reference
- Usage example
- Testing configurations
- TODO list with priorities

This documents the current state and future work items.

* Support qwen3 moe (#18)

* load tokenizer without loading all GGUF weights

* support qwen3 moe 2.4b

* feat(loaders): add SmolLM3 model builder to unified loader

* Fix unit test build issue and moving test_moe_layer to ops_unit_test

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Support qwen3 moe in modeling api (#25)

* support qwen3 moe in modeling api

* add utils: append_kv_cache

* feat: implement C++ RTN quantization with building blocks

- Add rtn_quantize.hpp with INT4/INT8 symmetric quantization algorithm

- Add GGUF_TYPE_INFLIGHT_INT4_SYM and GGUF_TYPE_INFLIGHT_INT8_SYM types

- Implement make_inflight_int4_weights_sym() and make_inflight_int8_weights_sym()

- Integrate compressed weights into make_weights_subgraph() switch

* feat: add safetensors in-flight compression loading

- Add InFlightCompressionConfig struct for quantization settings
- Add InFlightCompressionConfig::to_gguf_type() for mode conversion
- Add create_from_safetensors_compressed() for compressed loading
- Simplify create_qtype_entries() by removing redundant parameter
- Integrate compression with environment variable activation

* perf: optimize RTN quantization with type dispatch outside loops

* Enable safetensor in-flight Q4_1 support

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add basic config/ops changes for youtu-llm

* add youtu-llm support with modeling api

* support chat template in greedy_causal_lm.exe as youtu-llm needs it to generate correct output

* add initial code for qwen3vl

* add qwen3vl vision encoder

* add qwen3vl text decoder

* add qwen3vl fusion injector and input planner

* support weights loading and model creation

* add a new qwen3vl test sample and finish E2E integration

* fix tensor naming

* print basic performance metrics

* refine qwen3vl source files organization

* use static link for modeling sample and refine sample name

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Free moe load mem (#33)

* fixrmsbug

* Add modeling moe test (#36)

* add initial code for building zimage model

* fix ULT tests failures

* add sample test for z-image model

* fix missing tokenizer, since diffusion HF models use a different tokenizer json path

* add debug dump code, force fp32 inference and fix bugs

* fix bugs to generate correct image

* disable debug dump by default

* add comments in sample test

* feat: Implement Zero-Copy safetensors loading

Key changes:
- MmapHolder class to manage mmap lifetime via RAII
- SafetensorsData now stores mmap references instead of copying data
- load_safetensors_file() maps files without memcpy
- SafetensorsWeightSource provides get_shared_buffer() API
- SafetensorsWeightFinalizer uses SharedBuffer for Constant creation

Memory improvements:
- Peak memory reduced from 16.3 GB to 8.6 GB (47% reduction)
- create_weight_source: 7.7 GB -> 17 MB
- ModelBuilder::build: 7.7 GB -> 12 MB
- Only compile_model copies weights to usm_host (+8.4 GB)

Overhead ratio improved from 2.12x to 1.12x

* feat: Add OV_GENAI_USE_ZERO_COPY env var to control zero-copy mode

* feat: Support zero-copy mode with legacy MODELING_API path

* fix: Correct is_zero_copy_mode() to check mmap info availability

* fix: zero-copy default depends on modeling API status

- When OV_GENAI_USE_MODELING_API is not set or 0, zero-copy defaults to disabled
  (building_blocks path doesn't support zero-copy)
- When OV_GENAI_USE_MODELING_API=1, zero-copy defaults to enabled
- Explicit OV_GENAI_USE_ZERO_COPY setting always takes precedence
- Updated README documentation

* update tests method

* Replace FullyConnected with MatMul for IR serialization compatibility

Use standard MatMul with transpose_b=true instead of internal FullyConnected op.
MatMul is serializable to IR, GPU will convert it back to FullyConnected at compile time.

* Add OV_GENAI_SAVE_OV_MODEL environment variable support

- Refactor environment variable checking into reusable is_env_truthy() helper
- Add should_save_ov_model_from_env() to check OV_GENAI_SAVE_OV_MODEL env var
- Model saving can now be triggered via parameter or environment variable

* add initial support for dflash pipeline

* Use internal RoPE op for optimal GPU performance

- Replace manual RoPE pattern construction with op::internal::RoPE
- This enables GPU plugin's optimized rope_opt kernel
- Previously, the element-wise pattern was not recognized by RoPEFusion
- Reduces OpenCL enqueues by ~30% and improves TTFT by ~24%

* fix the Qwen3VLTextAttention.MatchesReferenceNoRope test bug

* fix the Qwen3VLTextAttention.MatchesReferenceNoRope test bug

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* add perf metrics print

* add debug log and force draft model to use fp32

* add ULT test for draft model inference

* force fp32 for inference. It's slow, but solves the draft model NaN issue, and draft_accepted=54

* Enable qwen3-moe model with in-flight Q4_1 compression for safetensor

refine test_moe_layer to support Q4_1 quantization weight generation
Add Qwen3-MOE-4x0.6B-2.4B-Writing-Thunder-V1.2 to model test
Refine auto_test to configure env per test with extra env vars

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Fix Qwen3-30b-a3b error (#43)

* fix qwen3-30b-a3b error

* feat(safetensors): use Tensor(tensor, holder) to keep mmap alive in get_tensor()

Key improvement:
- Use ov::Tensor(view_tensor, shared_ptr<void>) constructor to bind mmap
  holder lifetime to the tensor itself
- This eliminates the need for separate get_shared_buffer() path
- When Constant(tensor) creates SharedBuffer<Tensor>, it holds the tensor
  which now holds the mmap_holder, keeping mmap memory valid

Simplified SafetensorsWeightFinalizer:
- Removed special handling for SafetensorsWeightSource
- Now uses generic get_tensor() path for all weight sources
- The tensor returned by get_tensor() already holds mmap lifetime

* Clean up safetensors utils: remove unused code and simplify interfaces

* fix compiling errors

* Enable safetensor in-flight compression with modeling api

Implemented comprehensive custom weight selection for in-flight quantization
Refine MoE weight fusion to support zero-copy mode as well as the previous
in-flight compression path
Refine in-flight RTN quantization algorithm to support 3D tensors
Refine the modeling API to support returning multiple tensors from weight finalization
Add MoE subgraph and common dequant subgraph for in-flight compression
Add auto test cases qwen3-2.4B and qwen3-30B with the modeling API path
Fix qwen3-VL model no-text-output issue for autotest

Provides flexible control over which weights to quantize using multiple strategies:
- Pattern-based (wildcard matching)
- Layer-based (layer index ranges)
- Type-based (attention, mlp, embeddings, etc.)
- Explicit lists (include/exclude specific weights)
- Size-based (minimum/maximum weight size)

Selection priority (highest to lowest):
1. Size thresholds (applied first)
2. Explicit exclude list
3. Explicit include list
4. Exclude patterns
5. Include patterns
6. Layer range
7. Type-based flags

Environment variables:
- OV_GENAI_INFLIGHT_QUANT_MODE: quantization mode (INT4_SYM, INT4_ASYM, INT8_SYM, INT8_ASYM)
- OV_GENAI_INFLIGHT_QUANT_GROUP_SIZE: group size for quantization (default: 128)
- OV_GENAI_INFLIGHT_QUANT_INCLUDE: comma-separated include patterns
- OV_GENAI_INFLIGHT_QUANT_EXCLUDE: comma-separated exclude patterns
- OV_GENAI_INFLIGHT_QUANT_LAYER_RANGE: layer range (e.g., "10-20")
- OV_GENAI_INFLIGHT_QUANT_WEIGHT_NAMES: comma-separated explicit weight names
- OV_GENAI_INFLIGHT_QUANT_MIN_SIZE: minimum weight size in bytes
- OV_GENAI_INFLIGHT_QUANT_MAX_SIZE: maximum weight size in bytes

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Enable qwen3-VL in-flight q4_1 compression

Only do quantization for the LLM backbone to keep precision
Add qwen3-dense and qwen3-vl-8B in-flight compression test cases to autotest

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Enable INT8 asymmetric in-flight quantization and channel-wise support

- Implement asymmetric INT8 quantization (RTN) in gguf_utils.
- Add support for channel-wise quantization across INT4/INT8 symmetric and asymmetric modes.
- Update modeling API (Qwen3 MoE) and ops to support group size configuration.
- Enhance safetensors weight finalizer to handle INT8 asymmetric dequantization and 3D weights.
- Add new test cases in auto_tests.py for channel-wise and INT8 asymmetric in-flight quantization.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Optimize RTN quantization (~2-4x faster model loading)

Significant performance optimization for in-flight weight compression (RTN)
by implementing AVX2 intrinsics and multi-threading.

Changes:
- Implement AVX2 intrinsics for `find_min_max` and quantization kernels (INT4/INT8).
- Parallelize quantization loops over output channels using `ov::parallel_for`.
- Refactor to use `InputType` templates, eliminating per-element branching.
- Fix logic error in `int8_sym` where scale was calculated but not stored.

Performance impact (Total Load + Run Time):
- Qwen3-4B (INT4 Asym): 46.43s -> 15.96s (~2.9x faster)
- Qwen3-4B (INT4 Sym): 43.92s -> 12.40s (~3.5x faster)
- Qwen3-4B (INT4 Channel-wise): 48.60s -> 12.24s (~4x faster)
- Qwen3-30B (INT4 Asym): 14m16s -> 6m19s (~2.2x faster)
- Qwen3-MOE (Modeling API): ~3x faster duration (30.93s -> 10.16s)

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add baseline for comparison

* Add NNCF-compatible in-flight quantization strategy

- Add backup_mode (INT8_ASYM default) for sensitive layers (embeddings, lm_head)
- Set backup_mode=primary_mode to quantize all layers with same mode
- Set backup_mode=NONE to skip quantizing sensitive layers
- INT8 modes use per-channel quantization (group_size=-1)
- INT4 modes use group-wise quantization (default group_size=128)
- Add verbose logging option via OV_GENAI_INFLIGHT_QUANT_VERBOSE
- Follows NNCF default behavior: embeddings and last layer use INT8_ASYM

* Update modeling_dflash.cpp v2, update target kv cache handling

* Optimize memory usage in Safetensors weight finalizer

- Delay creation of `ov::op::v0::Constant` until after determining if
  quantization is required.
- Prevents holding the original full-precision weight in an OpenVINO
  Node when it is slated for quantization, reducing peak memory usage during model loading.
- Reuse fetched tensor reference and shape information to avoid
  redundant `get_tensor` calls.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add new ops/configs/utils for wan2.1 t2v pipeline

* add wan2.1 transformer modeling code

* fix ULT failure

* add wan2.1 VAE modeling code

* fix ULT failures

* add wan umt5 text encoder modeling code

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix diffusion safetensor load failure

* fix frame corruption (need to use lower resolution to avoid OOM)

* use fp32 for timestep

* rename source files

* Refactor Qwen3 MoE weight loading for performance optimization

- Refactor `Qwen3MoE` architecture to manage expert weights individually via `std::vector<WeightParameter*>`,
eliminating the overhead of pre-fusing large MoE tensors.
- Update `SafetensorsWeightFinalizer` to detect and process individual 2D expert weights directly
- Remove obsolete in-memory MoE fusion logic to avoid copying tensor data in mmap mode

Performance Impact (Qwen3-30B-A3B-Instruct-2507, INT4-asym gs128):
- Throughput increased from ~5.6 t/s to ~26.2 t/s (~4.7x speedup).
- Latency (TTFT) reduced from ~3.39s to ~2.40s (~30% reduction).
- Total test time reduced from ~6m23s to ~3m49s.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add wan2.1 layered dit modeling code and sample test

* fix create model failures

* fix layered dit inference failure

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* free memory after model infer finished

* Qwen3-moe int4 performance alignment with openvino-IR

1. Enable router weight quantization with in-flight compression, aligned to the openvino-IR default behavior
2. Fix double-convert issue for the router MatMul in moe3gemm_fused_compressed that caused the compressed FC not to be used
3. Introduce compressed_type tracking in QuantizedWeight to distinguish different quantizations
4. Centralize logic for detecting sensitive layers to ensure consistent fallback quantization strategies
5. Add a dedicated 2D overload of quantize_q41 for gate_inp quantization, and make_dequant_subgraph to cover the int4 quantization scenario in unit tests
6. Add "all" option in autotest

Test result: MoE in-flight int4 performance improved by ~10%

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add deepseek ocr2 utils functions and tests

* Enable configurable inflight quantization for modeling API and samples

- Update samples to support CLI arguments for quantization modes (INT4/INT8, Sym/Asym) as format [quant_mode, groupsize, backup_mode].
- Refactor `SafetensorsWeightFinalizer` and loaders to accept explicit `QuantizationConfig`, improving upon env-var only configuration.
- Add support for 1D tensor quantization.
- Add quantization statistics logging to `SafetensorsWeightFinalizer`.
- Add `quantization_utils.hpp` for shared CLI parsing logic.
- Qwen3-VL model can be quantized with separate configs for text and vision.
- refine auto_test to use cli args for quantization config.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* fix ULT failure

* add code for sam vit modeling

* add code for LLM as vision encoder modeling

* add code for projector and packager modeling

* add code for deepseek ocr2 text part modeling

* fix ULT tests failure (still has a small diff against reference values)

* add an E2E sample test app for deepseek-ocr2

* fix weights binding errors

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix the new tests bug

* fix moe->int4Router test bug!

* fix the Xe2 Z-Image viSA bug

* Refactor safetensors utils and add performance instrumentation

Refactor: Moved quantization_utils.hpp to safetensors_utils/ and updated inclusions in modeling_qwen3_vl.cpp and modeling_zimage.cpp.
Performance: Added detailed timing statistics (Fetch, Quantize, Graph construction) to SafetensorsWeightFinalizer and individual file loading times in SafetensorsLoader.
Cleanup: Wrapped verbose weight name logging in safetensors_modeling.cpp behind a debug flag to reduce console noise.
Refine Zimage model compiling: only keep the DiT model in fp32 due to fp16 overflow issue

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* dump wan dit full model and layered models

* add python test for wan2.1 layered dit ir models

* force wan2.1 dit to use public opset to construct rope

* fix tensor conversion failure

* modeling: Fix KV cache Variable naming for NPUW compatibility

Change Variable naming from 'model.layers[N].self_attn.key_cache'
to the 'past_key_values.N.key' / 'present.N.key' format to match the
regex expectations of NPUW's StatefulToStateless pass.

* fix the zimage_dit_dummy_test error(in ARL but bmg also have error: parsing vISA inline assembly failed)

* modeling: Fix SDPA mask and scale for NPU compatibility

Add build_kv_causal_mask_with_attention() helper that properly handles
attention_mask integration with KV cache causal masking. This ensures
the StatefulToStateless pass can correctly process SDPA operations.

Key changes:
- Implement build_kv_causal_mask_with_attention() in llm ops
- Pass scale parameter explicitly to SDPA op instead of relying on
  default value. This is required for NPU decomposition pass to use
  the correct scale, especially for models like deepseek_sam_vit that
  use non-standard scale (1.0f instead of 1/sqrt(head_dim))
- Fix regex bug in kv_cache.cpp: \_d+ -> \d+

* modeling: Unify Qwen3 forward API with attention_mask

Simplify Qwen3 model API by removing duplicate forward overloads and
unifying on attention_mask parameter for NPU/NPUW compatibility.

Key changes:
- Remove causal_mask versions of forward() in favor of attention_mask
- Update Qwen3Attention, Qwen3DecoderLayer, Qwen3Model, Qwen3ForCausalLM
- Simplified forward() constructs all-ones attention_mask internally
- Update build_qwen3_model() to use attention_mask input parameter
- Update unit tests to provide attention_mask input

* Fix output name collision in WanDIT Layered Block Group and update test harnesses

Root Cause: The input parameter was also named "hidden_states". This collision caused the inference request to
return the unmodified input tensor when queried, effectively bypassing the block group computation.
Updated both C++ and Python runners to retrieve the new output name "hidden_states_out".
Added debugging helpers (generate_fixed_data, i64 support in tensor conversion) to the C++ sample.

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* refine ZImageDit.test

* refactor: restructure models directory following HuggingFace transformers convention

- Group model files by model family into subdirectories:
  - qwen3/, qwen3_moe/, qwen3_vl/, deepseek_v2/, deepseek_ocr2/
  - smollm3/, wan/, zimage/, youtu/
- Rename files with modeling_ prefix for model files
- Rename files with processing_ prefix for utility files
- Update all #include paths accordingly
- Align with HuggingFace transformers organization principle:
  Each model variant (Dense/MoE/VL) has its own independent directory

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* feat(ops): add nn ops for Qwen3-TTS

- Activation functions: relu, sigmoid, tanh_activation
- 1D Convolution: conv1d and conv_transpose1d with optional bias
- Layout-aware bias reshape for conv1d (auto-detects NCL vs NLC from output shape)
- Batch normalization: batch_norm
- Pooling operations: adaptive_avg_pool1d, avg_pool1d, max_pool1d
- Part of ops dependency patch for Qwen3-TTS integration

* feat(ops): add math ops for Qwen3-TTS

- sin/cos: trigonometric operations for SnakeBeta activation function
- reduce_sum: supports both single axis and multiple axes reduction with keepdim option
- Part of ops dependency patch for Qwen3-TTS integration

* feat(ops): add tensor ops for Qwen3-TTS

- split: supports both equal number of splits and variable split sizes
- flip: axis-based tensor reversal operation
- Part of ops dependency patch for Qwen3-TTS integration

* test(ops): add unit tests for Qwen3-TTS ops

Add comprehensive unit tests for new ops in qwen3_tts_ops_test.cpp:

NN Ops (7 tests):
- relu: element-wise max(0, x)
- sigmoid: 1/(1+exp(-x)) activation
- tanh_activation: hyperbolic tangent
- conv1d (basic): 1D convolution without bias
- conv1d (with bias): 1D convolution with bias
- conv_transpose1d: transposed convolution shape verification
- batch_norm: batch normalization (inference mode)
- avg_pool1d: average pooling

Math Ops (4 tests):
- sin: element-wise sine
- cos: element-wise cosine
- reduce_sum: sum reduction along single axis
- reduce_sum: sum reduction along multiple axes

Tensor Ops (4 tests):
- split (equal): split into equal parts
- split (variable): split into variable-sized parts
- flip: reverse along axis (2D)
- flip: reverse along axis (3D)

All tests include reference implementations and use appropriate
tolerance levels (k_tol_exact, k_tol_transcendental, k_tol_default).

* feat(qwen3-tts): add common headers and base modules

Add the qwen3_tts module directory under models/ with modular file structure:

Common header (modeling_qwen3_tts.hpp):
- Qwen3TTSTalkerConfig: 28-layer mRoPE attention model config
- Qwen3TTSCodePredictorConfig: 5-layer predictor config
- SpeechDecoderConfig: RVQ + PreTransformer + ConvNeXt decoder config
- KV cache output structures for generation

Module-specific files (following existing project conventions):
- modeling_qwen3_tts_talker.hpp/cpp: Talker module declarations and factories
- modeling_qwen3_tts_code_predictor.hpp/cpp: CodePredictor module
- modeling_qwen3_tts_speech_decoder.hpp/cpp: SpeechDecoder module

All factory functions have placeholder implementations to be filled in
subsequent patches (Patch 3-5).

Patch 2 of the atomic Qwen3-TTS integration series.

* feat(qwen3-tts): implement Talker module with mRoPE and KV cache

- Add Qwen3TTSTextProjection for text embedding projection (896->2048)
- Add Qwen3TTSTalkerAttention with mRoPE and GQA support (16 Q heads, 8 KV)
- Add Qwen3TTSTalkerMLP with SwiGLU activation
- Add Qwen3TTSTalkerDecoderLayer and Qwen3TTSTalkerModel (28 layers)
- Add Qwen3TTSTalkerForConditionalGeneration with codec_head
- Implement factory functions for embedding, prefill, and decode models
- Support both forward_no_cache and forward_with_cache variants
- mRoPE config: mrope_section={24,20,20}, rope_theta=1e6

* test(qwen3-tts): add Talker module unit tests

- Add TextProjection reference test with SiLU activation
- Add SwiGLU MLP reference test
- Add Embedding model structure and reference tests
- Add Codec-only embedding model test
- Verify graph structure, input/output shapes, and numerical correctness

* feat(qwen3-tts): implement Code Predictor module

- Add Qwen3TTSCodePredictorAttention with standard RoPE and GQA (16Q/8KV)
- Add Qwen3TTSCodePredictorMLP with SwiGLU activation
- Add Qwen3TTSCodePredictorDecoderLayer (5-layer transformer)
- Add Qwen3TTSCodePredictorModel with hidden_size=1024
- Add Qwen3TTSCodePredictorForConditionalGeneration with 15 codec_embeddings and 15 lm_heads
- Implement factory functions:
  - create_qwen3_tts_code_predictor_model (full model)
  - create_qwen3_tts_code_predictor_ar_model (single step generation)
  - create_qwen3_tts_code_predictor_codec_embed_model (sum of 15 embeddings)
  - create_qwen3_tts_code_predictor_single_codec_embed_model (single layer embedding)

* test(qwen3-tts): add Code Predictor module unit tests

* feat(qwen3-tts): add Speech Decoder module implementation

Implement the 12Hz Speech Decoder that converts RVQ codec tokens to audio:
- RVQDequantizer: converts 16-layer RVQ codes to continuous embeddings
- PreTransformerAttention: sliding window attention with RoPE
- PreTransformerMLP: SwiGLU feedforward network with LayerScale
- PreTransformerDecoderLayer: transformer layer with LayerScale
- PreTransformer: 8-layer transformer for RVQ embedding processing
- SnakeBetaActivation: x + 1/beta * sin^2(x * alpha) activation
- ConvNeXtBlock: used in pre-decoder upsampling
- ResidualUnit: dilated causal conv with SnakeBeta activation
- DecoderBlock: transposed conv upsample + residual units
- SpeechDecoderModel: complete decoder pipeline

Audio generation pipeline:
1. RVQ dequantization (16 codebooks -> embeddings)
2. Pre-conv channel expansion (512 -> 1024)
3. Pre-transformer (8 layers, sliding window attention)
4. Pre-decoder upsample (2x2 with ConvNeXt)
5. Decoder blocks (8x5x4x3 = 480x upsample)
6. Final SnakeBeta + conv to audio

Total upsample: 4 * 480 = 1920x
Output: 24kHz audio from 12.5Hz codec tokens

* test(qwen3-tts): add Speech Decoder module unit tests

Add comprehensive unit tests for Speech Decoder components:
- SnakeBeta activation (x + 1/beta * sin^2(x * alpha))
- PreTransformerMLP (SwiGLU feedforward)
- RVQ Dequantizer (first codebook + multi-codebook sum)
- ConvNeXt block (depthwise + pointwise convs)
- Residual unit (dilated causal conv)
- Decoder block (transposed conv upsample)
- Audio length calculation (1920x upsample factor)

All 8 tests passing on GPU backend.

* Fixed the single-token validation issue in the target model.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* use the new models folder structure

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* tokenizer: enhance the detection of first time tokenizer conversion

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix ULT test failure

* run the first 4 layers inference

* run full model with all layers

* avoid triggering safetensors files loading twice in PA

* enable FP8 quantization to INT4 asym/sym and enable moe weight loader

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* add env variable OV_GENAI_QWEN3_NEXT_NUM_LAYERS (0~48) to control layer number in modeling

* Disable router quantization by default and add shared_expert_gate detection

- Change quantize_routers default from true to false (routers are small weights,
  quantization is usually not beneficial)
- Add shared_expert_gate pattern to router detection in QuantizationSelector
- This fixes MatMul dimension mismatch when shared_expert_gate gets incorrectly
  quantized with INT4 in Qwen3-Next models

* f16 inference precision set

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* add a qwen3-next ult test case for compile and run Qwen3NextLinearAttention

* change ULT test case to use int4 weights

* add a new Qwen3NextGatedDeltaNet ULT test

* Fix double model loading for safetensors and reject hybrid attention models for PA backend

1. pipeline.cpp:
   - Add is_hybrid_attention_model() to detect models with hybrid attention
     (e.g., qwen3_next which uses both linear attention and SDPA)
   - Extend can_try_auto_pa_backend() to check for safetensors files
   - Skip PA backend for hybrid attention models to avoid double loading

2. paged_attention_transformations.cpp:
   - Add runtime check for linear attention states (linear_states.*.conv/recurrent)
   - Throw clear error message when hybrid attention model is detected
   - Improve ASSERT message for key_cache/value_cache parameter validation

This fixes the issue where Qwen3-Next models caused double loading because
SDPAToPagedAttention pass cannot handle linear attention states that also
use beam_idx for reordering.

* Fix build error in Linux

* Fix build error in Linux

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* add initial modeling source files for qwen3.5 9b dense model

* add ULT tests for qwen3.5 9b dense modeling

* fix qwen3.5 Qwen3_5CacheRuntime ULT failures

* run all qwen3.5 ULT tests on GPU device

* update qwen3.5 modeling code based on the latest hf transformers updates

* update qwen3.5 sample test app to support dummy e2e test with random dummy weights

* use the config parameters from real qwen3.5 9b model

* use int4 quantization to run dummy model

* Fix qwen3.5 dense model vlm mode reshape crash in dequant subgraph when weight dim is not divisible by group_size

When in_features (e.g. 4304) is not evenly divisible by quantization
group_size (e.g. 128), the dequant subgraph incorrectly computed
group_size = in_features / num_groups via integer division (4304/34=126),
creating a 3D tensor with fewer elements than the original weight
(34*126=4284 != 4304) and causing a reshape failure.

Add a flat dequantization path using Gather-based scale/zero_point
expansion for non-divisible cases, keeping the existing 3D grouped
path for evenly-divisible cases.

* Add Qwen3NextGatedDeltaNet2 with fused LinearAttention op

Replace TensorIterator recurrent loop with ops::linear_attention
in a new Qwen3NextGatedDeltaNet2 class. All other logic (projections,
conv1d, gating, normalization, output) is unchanged from v1.

Add unit tests for graph construction, GPU compilation, stateful
prefill/decode, and numerical equivalence against v1.
Add benchmark test comparing GatedDeltaNet TensorIterator vs LinearAttention op

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add initial source files for qwen3.5 moe modeling

* improve qwen3.5 sample cmd line helper

* fix qwen3.5 moe ult test failure

* remove legacy cmd line options

* improve cmd line options of qwen3.5 sample test

* add performance data

* Optimize MoE XML size using postponed_constant + constant_fold

- Add postponed_constant rt_info to Concat nodes in MoE expert weight functions
- Use constant_fold() on Transpose nodes to produce pure Const for scales/zps
- Enables OpenVINO serializer to fold Concat(Const, Const, ...) into single merged Const

Results:
- Qwen3-MoE: XML -24.4%, Const nodes -42.6%
- Qwen3-Next: XML -96.3% (215MB->8MB), Const nodes -98.7% (298k->4k)

Modified files:
- modeling_qwen3_moe.cpp: 9 functions with postponed_constant
- modeling_qwen3_next.cpp: 9 functions with postponed_constant
- safetensors_weight_finalizer.cpp: constant_fold Transpose for scales/zps

* Use fused LinearAttention op in Qwen3_5 GatedDeltaNet by default

TensorIterator fallback via OV_GENAI_USE_LINEAR_ATTENTION_OP=0

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* Add Qwen3_5 LinearAttention vs TensorIterator comparison tests and move q/k normalization into TensorIterator branch

Make use_linear_attention_op() non-static so env var can be toggled at runtime for testing
Move q/k L2 normalization + scaling into TensorIterator branch only; LinearAttention op path passes raw q/k (op handles normalization internally)
Add qwen3_5_linear_attention_op unit test with 5 tests:
1. graph structure verification
2. stateful variable registration
3. numerical equivalence
4. prefill+decode state carry-over
5. GPU performance benchmarking

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>

* add more qwen3.5 attention ULT tests

* Disable vision encoder quantization by default

INT4 FullyConnectedCompressed has severe performance regression (~115x slower)
on small batch sizes typical of vision encoders (e.g., 64 tokens for 256x256 image).

Profiling results:
- INT4 vision encode: 41,838 ms
- FP32 vision encode: 360 ms

Vision encoder characteristics that make INT4 unsuitable:
- Small batch size (64 tokens vs thousands in LLM decode)
- Single-shot execution (no amortization across decode iterations)
- Small model size (233MB) where INT4 memory savings are minimal

* Fix quantization mode parsing: use exact match instead of substring

The previous substring-based parsing (find("sym")) incorrectly matched int4_asym as INT4_SYM because "asym" contains "sym" as a substring. Changed to exact string comparison to correctly parse quantization modes.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Add tokenizer fallback for dummy model testing

- Catch tokenizer.encode() failures and fall back to dummy tokenization
- For VL mode, detect image_token_id mismatch and fall back to dummy tokenization
- This allows testing with dummy models that have incompatible tokenizer configs

* enable constant folding to prepare moe inputs

* Update moe3gemm_fused_compressed for shared expert

* remove fallback leftover

* Integrate into modeling test

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Update qwen3_next

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* feat(qwen3_next): replace GatedDeltaNet with GatedDeltaNet2 for optimized linear attention

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix(qwen3_next): swap beta/g arg order in ops::linear_attention to match kernel convention

The LinearAttention OCL kernel expects input[3]=g (decay gate) and
input[4]=beta (delta update gate), but ops::linear_attention() was
passing them in reversed order (beta at index 3, g at index 4).
This caused the state decay factor exp(beta) to grow unboundedly
and the delta update to be inverted, producing incorrect outputs.

Swap the args in the OutputVector to match the kernel layout.

* test(qwen3_next): add V1 vs V2 numerical equivalence E2E test

Add GatedDeltaNetV1vsV2.PrefillAndMultiStepDecodeMatchOnGPU test
that verifies GatedDeltaNet (TensorIterator-based V1) and
GatedDeltaNet2 (LinearAttention kernel-based V2) produce identical
outputs through prefill (seq_len=8) and 5 multi-step decode steps
with stateful recurrent state carry-over.

Uses f32 weights with tiny increments (1e-8f) to avoid overflow,
NaN-aware comparison, and tolerance threshold k_tol_linear_attn.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* test(qwen3_next): add V1 vs V2 numerical equivalence E2E test

* Update build_block for shared_expert fusion

* Fix incorrect _shared_intermediate_size issue

* fix tensor name mismatch issue (model.layers.N -> model.layers[N])

* qwen3.5 moe 35B real model -- fix part1

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
(cherry picked from commit 5dceb43e0f54bae7c5f7721cc728125ffa48452d)

* enable qwen3.5 greedy_causal_llm text mode

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
(cherry picked from commit 89b88a90eeafe4e231a927c0e75b31461e73da5c)

* add --cache-model feature to avoid rebuilding the model graph on every run

* fix cached model read issue

* consolidate qwen3.5 attention tests

* refine tests name and add linear attn basic op and fused op result check

* refactor qwen3.5 moe ult tests

* add cpu ref to check output correctness

* fix qwen3.5 moe ult tests failures

* align qwen3.5 linear attention ult test configs with real model

* use compressed int4 weights and attention mask

* dump ov ir model files in qwen3.5 attention ult test

* optimize GDN model graph structure

* tests: cap mismatch logs and add match stats

* Unify Qwen3.5 quantization control via env vars

* safetensors: support shard names with embedded .safetensors suffix

* qwen3.5: allow explicit head_dim when hidden_size is not head-divisible

* Update qwen3_5 layers option and batch scripts

* fix(qwen3_5): disable vision INT4 quantization by default

GPU compile_model hangs with many INT4 dequant subgraphs in the vision
encoder. Disable vision quantization by default (matching Qwen3 VL behavior).

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* feat(qwen3_5): add --max-pixels CLI parameter

Add --max-pixels CLI parameter to limit vision input pixels.
Use --max-pixels N to override max_pixels from preprocessor_config.json.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* feat(qwen3_5): add OV_GENAI_VISION_DEVICE env var

Allow overriding the device for vision model compilation via the
OV_GENAI_VISION_DEVICE environment variable. Useful for running
vision on CPU while text on GPU, or vice versa.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* perf: cache pos_embed weight in vision IR to avoid reloading safetensors

When using --cache-model, the pos_embed weight needed for vision
preprocessing was forcing a full load of all 14 safetensors shards.

Instead of saving pos_embed to a separate binary file, embed it as
a Constant->Result node in the vision IR so it is serialized into the
vision .bin alongside other vision weights. On load, extract the
constant and remove the extra Result before compile_model.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>

* Align modeling_qwen3_5 prompt handling with chat templates

* Honor generation config stop tokens in modeling_qwen3_5

* fix: align qwen3.5 prefill position_ids semantics (upstream c281a2de89)

* test: cover qwen3.5 VL text-position offset (upstream a08936f6e4)

* fix: align qwen3.5 repeated-generate past_len tracking (upstream 3b8f0948af)

* fix: guard qwen3.5 position tracking in mixed-length batch decode (upstream 9d9b012dcf)

* fix: strengthen qwen3.5 input checks in forward path (upstream 42791a34fd)

* fix the deepseek_ocr2 result bug

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: implement OmniCode2WavMappedWeightSource for speech decoder

- Support 3 weight mapping strategies: key name translation, codebook sharing, synthetic identity weights

- Handle 652 total weights with complex mapping between safetensors keys and model parameter names

- Map all 16 layers' codebook.embed tensors to the shared code_embedding.weight

- Generate synthetic identity/zero weights for pre_conv, pre_transformer projection layers

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add Python bridge for audio mel-spectrogram and vision preprocessing

- processing_qwen3_omni_audio.cpp: integrate Python bridge call to extract
  mel spectrogram features from audio WAV via subprocess
- processing_qwen3_omni_bridge.py: bridge script that loads audio, computes
  WhisperFeatureExtractor mel spectrogram, and writes binary tensor output
- processing_qwen3_omni_vision.cpp: add grid_thw accessor for mRoPE
  position computation in audio+image path

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: implement pure C++ WhisperFeatureExtractor mel spectrogram

Replace Python bridge dependency with native C++ audio pipeline:
- WAV file loading (16-bit PCM, mono, supports resampling to 16kHz)
- STFT with Hann window (400-sample frame, 160-sample hop)
- 128-band mel filterbank (loaded from whisper mel_filters.npz)
- Log-mel spectrogram with 30-second padding/truncation
- Zero-copy ov::Tensor output for direct OpenVINO inference

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: full TTS pipeline with C++ audio and Case 4 audio+image support

modeling_qwen3_omni_tts_min.cpp:
- Complete TTS pipeline: talker prefill/AR -> code predictor -> speech decoder
- Native C++ audio pipeline: WAV -> mel -> audio encoder (replaces Python bridge)
- Case 4 fix: enable vision model in audio+image path (was skipped when
  has_audio=true), pass actual image token count to build_audio_prompt(),
  use grid_thw for mRoPE position planning
- Memory optimization: release audio encoder resources after inference to
  avoid OOM when vision encoder loads next
- Streaming text embeddings during AR generation with deepstack fusion

modeling_qwen3_omni.cpp:
- Add tts_pad_token_id and tts_eos_token_id config usage for base sample

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* fix conflict

* fix merged duplicated codes

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>

* qwen3-omni-4b: add precision mode support and fix TTS audio quality under int8

Add PrecisionMode infrastructure to the TTS binary:
- PrecisionMode enum (fp32, inf_fp16_kv_int8, inf_fp32_kv_fp32_w_int4_asym, etc.)
- compile_props_for_precision() / set_text_model_precision() helpers
- User-specified precision applies to text/vision/audio models via argv[9]
- Output DEVICE and PRECISION_MODE as stdout KV lines for tooling

Fix int8 TTS audio quality degradation:
- Root cause: run_min_tts() was applying user-specified precision
  (e.g. inference_precision=f16, kv_cache_precision=u8) to all TTS models
  (talker, code predictor, speech decoder). These small models are sensitive
  to reduced precision, producing degraded audio -- wrong length (370K vs
  276K samples), quieter (mean_abs 646 vs 1370), and 9x slower (427s vs 46s).
- Fix: TTS models now always compile with inference_precision=f32,
  independent of the precision mode used for the main text model.
  Note: int4 mode (inf_fp32_kv_fp32_w_int4_asym) was unaffected because
  its compile props already matched fp32.

Update case comparison tool (qwen3_omni_case_compare.py):
- Add --devices (comma-separated) and --precisions (comma-separated) flags
- Matrix loop order: precision (outer) -> device (middle) -> case (inner)
- Add --timeout (default 600s) with subprocess.TimeoutExpired handling
- Per-precision summary table and grand summary at end
- WAV filenames include device/precision to avoid collisions
- Detailed per-case output (prompt, C++ status, text, wav, perf one-liner)
- Fix Case 1 total_ms N/A by computing from preprocess+vision+ttft+decode

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: force speech decoder to CPU to fix GPU audio noise

The BigVGAN speech decoder produces noise-like audio when running on GPU.
The Intel GPU plugin downcasts operations to f16 despite inference_precision=f32
being set (it is only a hint). The SnakeBeta activation (x + 1/beta * sin^2(x *
alpha)) involves exp() and sin() operations that accumulate significant f16
rounding errors across the 4-stage upsampling chain (8x5x4x3 = 480x), causing
a systematic ~10x amplitude loss in the output waveform.

Evidence from matrix test (fp32 and int4 precision modes):
  Case 2: CPU max_abs=13293, GPU max_abs=1434 (ratio 9.27x)
  Case 3: CPU max_abs=12537, GPU max_abs=1253 (ratio 10.01x)
  Case 4: CPU max_abs=14139, GPU max_abs=1588 (ratio 8.90x)

Sample counts are identical (same codec tokens), confirming the issue is
isolated to the speech decoder, not the talker or code predictor.

Fix: compile the speech decoder model with device="CPU" unconditionally.
This is safe because the speech decoder runs only once at the end of TTS
(not in the autoregressive loop), so the performance impact is negligible.

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b(refactor): remove dead Python bridge code from TTS binary

- modeling_qwen3_omni_tts_min.cpp: remove unused flatten_json_values(),
  tensor_from_bridge_json(), find_python_executable() and nlohmann/json
  include (replaced by native C++ mel spectrogram pipeline)
- qwen3_omni_case_compare.py: remove QWEN3_OMNI_BRIDGE_DIR and
  PYTHON_EXECUTABLE env var setup (TTS binary no longer uses Python bridge)

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add ASR/TTS sub-component performance instrumentation

Add fine-grained timing instrumentation to the TTS pipeline in
modeling_qwen3_omni_tts_min.cpp to measure individual sub-component
performance across precisions (FP32, INT8, INT4) on CPU.

C++ changes (modeling_qwen3_omni_tts_min.cpp):
- Extend TtsRunResult with 6 new fields: model_compile_ms,
  talker_prefill_ms, talker_decode_ms, code_predictor_ms,
  speech_decoder_ms, codec_frames
- Add steady_clock timers around: model compile, talker prefill,
  per-frame talker decode and code predictor (accumulators),
  speech decoder inference
- Output 6 new KV lines: TTS_MODEL_COMPILE_MS, TTS_TALKER_PREFILL_MS,
  TTS_TALKER_DECODE_MS, TTS_CODE_PREDICTOR_MS, TTS_SPEECH_DECODER_MS,
  TTS_CODEC_FRAMES
- Force speech decoder to CPU to avoid GPU f16 downcast audio noise

Python changes:
- Update parse_kv_stdout in qwen3_omni_case_compare.py to capture
  the 6 new TTS sub-component fields
- Add standalone benchmark script (qwen3_omni_asr_tts_bench.py) that
  runs Cases 2/3/4 across FP32/INT8/INT4 on CPU, collects ASR
  (AUDIO_ENCODE_MS) and TTS sub-component timing, and produces
  summary tables + JSON output

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-vl: add video input preprocessing support

Add preprocess_video() to Qwen3VL and Qwen3Omni vision preprocessors for
handling multi-frame video inputs:

- Accept vector of u8 frames ([H,W,3] or [1,H,W,3])
- Smart resize all frames to a shared resolution using video min/max pixel limits
- Temporal padding to align with temporal_patch_size
- Patchify video frames with correct temporal-spatial ordering
- Build position embeddings and rotary cos/sin for video grid_thw
- Qwen3Omni delegates to Qwen3VL base implementation

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add video input support to TTS sample pipeline

Extend modeling_qwen3_omni_tts_min sample to support video inputs:

- Add build_video_prompt() for video-only prompt construction
- Add build_audio_prompt() video_tokens parameter for combined prompts
- Add load_video_frames() to load frames from a directory of images
- Add smart_nframes() for intelligent frame sampling (mirrors Python logic)
- Add linspace_indices() for uniform frame index sampling
- Wire video preprocessing through the vision encoder pipeline

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: add Case 5 (image+video+audio+text) to case compare tool

Add multi-modal Case 5 test scenario to qwen3_omni_case_compare.py:

- Add --video, --case5-audio, --case5-image, --case5-prompt-file CLI args
- Add --max-video-frames option to control memory usage (default: 8)
- Auto-extract video frames via extract_video_frames.py for C++ binary
- Support combined image + video + audio + text input conversation
- Track case5-specific resources in the output report JSON

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* qwen3-omni-4b: Add video frame extraction tool for Qwen3-Omni C++ pipeline

Add extract_video_frames utility that extracts sampled video frames
from a video file and saves them as numbered PNG images. This is used
by the C++ modeling_qwen3_omni_tts_min binary which loads frames via
stb_image.

The frame sampling logic mirrors the Python Qwen3 Omni video pipeline
(qwen_omni_utils/vision_process.py) to ensure consistent frame selection
between Python and C++ inference paths.

- Add extract_video_frames.cpp using OpenCV VideoCapture for frame extraction
- Update CMakeLists.txt: add extract_video_frames target with proper OpenCV
  imported targets and post-build DLL copy step for Windows
- Update qwen3_omni_case_compare.py to use the C++ binary directly

Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>

* Update openvino.genai/src/cpp/src/modeling/samples/CMakeLists.txt

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>

* Update openvino.genai/src/cpp/src/modeling/samples/modeling_qwen3_omni_tts_min.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: xzhan34 <xiaolin.zhang@intel.com>

* Update openvino.genai/src/cpp/src/modeling/samples/extract_video_frames.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: xzhan34 <xiaolin.zhang@intel.com>

* ov-master-rebase: replace ov::element::undefined with ov::element::dynamic

OpenVINO 2026.1.0 removed ov::element::undefined from the element type
enum. The replacement is ov::element::dynamic, which serves the same
purpose as a sentinel value indicating an unspecified element type.

Update quantization_selector.hpp default parameter values and
quantization_selector.cpp comparison to use ov::element::dynamic.

Build error (MSVC C2039):
  quantization_selector.hpp(62): error C2039: 'undefined': is not a
  member of 'ov::element'
  quantization_selector.hpp(79): error C2039: 'undefined': is not a
  member of 'ov::element'
  quantization_selector.cpp(40): error C2039: 'undefined': is not a
  member of 'ov::element'

* ov-master-rebase: adapt to const-correct Tensor::data<T>() in OpenVINO 2026.1.0

In OpenVINO 2026.1.0, calling Tensor::data<T>() on a const ov::Tensor
now returns 'const T*' instead of 'T*'. This is a breaking API change
that enforces const-correctness.

Fix three call sites:
- sequence_group.hpp: change 'int64_t* position_ids_data' to
  'const int64_t*' when reading from a const tensor
- speculative_decoding_stateful.cpp: change 'float* logits_data' to
  'const float*' in the sample_token lambda
- lora/adapter.cpp: add const_cast<char*> when passing tensor data to
  safetensors_file_init(), which is a C API expecting void* but only
  reads the data

Build errors (MSVC C2440):
  sequence_group.hpp(240): error C2440: 'initializing': cannot convert
  from 'const int64_t *' to 'int64_t *'
  speculative_decoding_stateful.cpp(341): error C2440: 'initializing':
  cannot convert from 'const float *' to 'float *'
  adapter.cpp(105): error C2440: 'initializing': cannot convert from
  'const char *' to 'char *'

* ov-master-rebase: make custom allocator deallocate() noexcept for OV 2026.1.0

OpenVINO 2026.1.0 changed ov::Allocator's template constructor to use
SFINAE with 'has_noexcept_deallocate_v<T>', requiring that any custom
allocator's deallocate() method be marked noexcept. Without this, the
template constructor is excluded from overload resolution and the
compiler cannot convert custom allocator types to ov::Allocator.

Fix three allocator implementations:
- py_image_generation_pipelines.cpp: TorchTensorAllocator::deallocate
  made noexcept (body emptied since PyTorch manages its own memory)
- samples/cpp/image_generation/load_image.cpp: SharedImageAllocator
  deallocate made noexcept, wrap in explicit ov::Allocator() constructor
- samples/cpp/visual_language_chat/load_image.cpp: same fix as above

Build errors (MSVC C2440):
  py_image_generation_pipelines.cpp(250): error C2440: cannot convert
  from 'TorchTensorAllocator' to 'const ov::Allocator &'
  load_image.cpp(43): error C2440: '<function-style-cast>': cannot
  convert from 'SharedImageAllocator' to 'ov::Tensor'
  load_image.cpp(57): error C2440: '<function-style-cast>': cannot
  convert from 'SharedImageAllocator' to 'ov::Tensor'

* ov-master-rebase: add ENABLE_NEW_ARCH_OPS option to conditionally exclude new-arch code

Introduce ENABLE_NEW_ARCH_OPS cmake option (default ON) to allow
building openvino.genai against upstream OpenVINO master which lacks
new-arch custom operators (LinearAttention, MOE3GemmFusedCompressed,
FusedMLP, PlaceholderExtension).

When ENABLE_NEW_ARCH_OPS=OFF:
- cmake/features.cmake: define the new option
- src/cpp/CMakeLists.txt: exclude model sources (src/modeling/models/),
  gguf building blocks, gguf/safetensors modeling, and gguf loader from
  compilation; define ENABLE_NEW_ARCH_OPS preprocessor macro when ON
- CMakeLists.txt: skip ov_ops_tests subdirectory (tests require
  fused_mlp.hpp, moe_3gemm_fused_compressed.hpp which are new-arch only)
- src/cpp/src/modeling/CMakeLists.txt: skip modeling tests and modeling
  samples (they link against new-arch model object files)

Build errors addressed:
  error C1083: Cannot open include file: 'openvino/op/fused_mlp.hpp'
  error C1083: Cannot open include file: 'openvino/op/moe_3gemm_fused_compressed.hpp'
  error C1083: Cannot open include file: 'openvino/op/placeholder_extension.hpp'
  error LNK2019: unresolved external symbol (modeling model classes)
  error LNK1120: N unresolved externals (modeling samples)

* ov-master-rebase: guard new-arch header includes and calls for master OV compatibility

Add compile-time guards for source files that reference new-arch
OpenVINO headers not present in master OpenVINO 2026.1.0.

modeling/ops/ops.cpp:
- Use __has_include() to conditionally include linear_attn.hpp and
  moe_3gemm_fused_compressed.hpp
- When headers are available (new-arch build), use full implementations
- When headers are absent (master build), provide stub functions that
  throw descriptive errors at runtime

utils.cpp:
- Guard #include of gguf_modeling.hpp and safetensors_modeling.hpp with
  #ifdef ENABLE_NEW_ARCH_OPS preprocessor checks
- Guard calls to create_from_gguf() and create_from_safetensors() with
  ENABLE_NEW_ARCH_OPS, providing descriptive error messages when
  disabled

Build errors addressed:
  ops.cpp: error C1083: Cannot open include file: 'openvino/op/linear_attn.hpp'
  ops.cpp: error C1083: Cannot open include file: 'openvino/op/moe_3gemm_fused_compressed.hpp'
  utils.cpp: error C1083: Cannot open include file (gguf_modeling.hpp)
  error LNK2019: unresolved external symbols from gguf/safetensors code

* ov-master-rebase: enable modeling_qwen3_omni, modeling_qwen3_omni_tts_min, extract_video_frames targets in build-master mode (ENABLE_NEW_ARCH_OPS=OFF)

Problem:
When building with ENABLE_NEW_ARCH_OPS=OFF (build-master scenario against
master OpenVINO 2026.1.0), the three sample targets modeling_qwen3_omni,
modeling_qwen3_omni_tts_min, and extract_video_frames were not built because:
1. modeling/CMakeLists.txt guarded add_subdirectory(samples) with both
   ENABLE_SAMPLES AND ENABLE_NEW_ARCH_OPS
2. All model sources under modeling/models/*.cpp were blanket-excluded,
   causing 48 linker errors (unresolved externals) for qwen3_omni, qwen3_vl,
   and qwen3_tts symbols

Changes:
- src/cpp/src/modeling/CMakeLists.txt: Remove ENABLE_NEW_ARCH_OPS guard from
  samples subdirectory inclusion (now only requires ENABLE_SAMPLES)
- src/cpp/src/modeling/samples/CMakeLists.txt: Wrap the 8 non-essential
  targets (modeling_qwen3_vl, modeling_qwen3_5, modeling_deepseek_ocr2,
  modeling_zimage, modeling_wan_t2v, modeling_wan_layered_dit, modeling_dflash,
  modeling_qwen3_tts) in if(ENABLE_NEW_ARCH_OPS) guards; leave the 3 needed
  targets always built
- src/cpp/CMakeLists.txt: After the blanket removal of modeling/models/*.cpp,
  re-add qwen3_omni/*.cpp, qwen3_tts/*.cpp, and qwen3_vl/processing_qwen3_vl.cpp
  back into SOURCE_FILES so the object library provides the required symbols

Verified: build-master succeeds with 0 errors, all 3 exe targets produced.

* ov-master-rebase: vision model divide-by-zero and tokenizer DLL loading in build-master

Two runtime bugs fixed for build-master (ENABLE_NEW_ARCH_OPS=OFF):

1. Vision model INT4 quantization (divide-by-zero crash):
   The shared SafetensorsWeightFinalizer applied INT4_ASYM quantization to both
   text and vision models. Vision encoder weights must NOT be quantized - INT4
   weights cause STATUS_INTEGER_DIVIDE_BY_ZERO (0xC0000094) in the CPU plugin
   during vision inference. Fix: use a separate non-quantizing finalizer for
   create_qwen3_omni_vision_model.

2. Tokenizer DLL not found (core.cpp:193 exception):
   The tokenizers_dll_name CMake variable was only defined inside the
   if(ENABLE_NEW_ARCH_OPS) guard block. When ENABLE_NEW_ARCH_OPS=OFF, the
   post-build copy for the always-built targets (modeling_qwen3_omni,
   modeling_qwen3_omni_tts_min) used an empty variable, so openvino_tokenizers.dll
   was never copied next to the exe. Fix: define tokenizers_dll_name before the
   always-built targets section.

---------

Signed-off-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Signed-off-by: Zhang, Xiaolin <xiaolin.zhang@intel.com>
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Xiping Yan <xiping.yan@intel.com>
Signed-off-by: xzhan34 <xiaolin.zhang@intel.com>
Co-authored-by: Li, Liang A <liang.a.li@intel.com>
Co-authored-by: Chuansheng Liu <chuansheng.liu@intel.com>
Co-authored-by: Mi, Yanfeng <yanfeng.mi@intel.com>
Co-authored-by: Yina Chen <yina.chen@intel.com>
Co-authored-by: liqianhao111 <qianhao.li@intel.com>
Co-authored-by: rnwang04 <ruonan1.wang@intel.com>
Co-authored-by: Ruonan Wang <rnwang@foxmail.com>
Co-authored-by: Wang Yang <yang4.wang@intel.com>
Co-authored-by: River.Li <river.li@intel.com>
Co-authored-by: chenhu-wang <chenhu.wang@intel.com>
Co-authored-by: xiping.yan <xiping.yan@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>