Description
OpenVINO Version
2026.0.0-20965-c6d6a13a886-releases/2026/0
Operating System
Windows 11
Device used for inference
NPU
Framework
None
Model used
Qwen/Qwen3-0.6B (28 layers, INT4 grouped quantization, exported to OV format for NPU)
Issue description
When constructing an `LLMPipeline` with a Qwen3-0.6B 28-layer INT4 model as the NPU draft device in OpenVINO GenAI heterogeneous speculative decoding (GPU target + NPU draft), the VPUX compiler aborts during model compilation with
LLVM ERROR: Failed to infer result type(s)
The `as_convolution` optimization pass produces a degenerate `tensor<1x0x1x1xf16>` input shape for `self_attn.v_proj` in layer 0, which is irreconcilable with the filter shape `tensor<1x8x1x1xf16>`. The resulting `IE.Convolution` node fails MLIR type inference, triggering SIGABRT. The process exits immediately; the failure cannot be caught via Python exception handling.
- Issue #28171 is the closest existing open issue (same error class, Windows LNL NPU, still unresolved).
- Tested precision: INT4 per-group (group_size=128). This is the distinguishing factor vs. channel-wise INT4, which is listed as supported.
- Suggestion: add an explicit validation guard in the `as_convolution` pass for 0-channel input tensors produced by `fc_decomposed` canonicalization of per-group INT4 weights.
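To make the suggested guard concrete, here is a minimal Python sketch of the shape check (this is illustrative only; the real fix would live in the VPUX compiler's C++ `as_convolution` pass, and the function name is hypothetical):

```python
# Sketch of the proposed validation guard (NOT actual VPUX compiler code).
# Before emitting an IE.Convolution, reject degenerate 0-channel inputs
# instead of letting MLIR type inference abort. Shapes are NCHW /
# [C_out, C_in, kH, kW].

def validate_as_convolution(input_shape, filter_shape):
    """Return an error string if the shapes cannot form a valid
    convolution, else None. Mirrors the diagnostic 'Channels count of
    input tensor shape and filter shape must be the same', plus an
    explicit guard for 0-channel tensors."""
    _, c_in, _, _ = input_shape
    _, c_filt, _, _ = filter_shape
    if c_in == 0 or c_filt == 0:
        return f"degenerate 0-channel tensor: input C={c_in}, filter C={c_filt}"
    if c_in != c_filt:
        return ("Channels count of input tensor shape and filter shape "
                f"must be the same: {c_in} != {c_filt}")
    return None

# The failing case from the log: tensor<1x0x1x1xf16> vs tensor<1x8x1x1xf16>
print(validate_as_convolution((1, 0, 1, 1), (1, 8, 1, 1)))
```

With such a guard, the pass could emit a recoverable compilation error (surfaced as a Python exception) instead of aborting the whole process.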
Environment

| Field | Value |
| --- | --- |
| Hardware | Intel Core Ultra 7 258V (Lunar Lake, 8 P-cores, 8 logical) |
| NPU | Intel AI Boost (integrated, Lunar Lake) |
| GPU | Intel Arc 140V (Xe2, 16 GB shared LPDDR5X) |
| Memory | 32 GB LPDDR5X-8533 unified |
| OS | Windows 11 Pro, Build 26200 |
| NPU Driver | 32.0.100.4514 (dated 2025-12-17) |
| GPU Driver | 32.0.101.6987 |
| OpenVINO | 2026.0.0-20965-c6d6a13a886-releases/2026/0 |
| OpenVINO GenAI | 2026.0.0.0-2820-dab5b993a38 |
| Python | 3.x, Windows venv |
| Model | Qwen/Qwen3-0.6B (28 layers, INT4 grouped quantization, exported to OV format for NPU) |
| Context | Heterogeneous speculative decoding: GPU target (Qwen3-14B INT4) + NPU draft |
Step-by-step reproduction
- Export Qwen3-0.6B to NPU format using optimum-intel or `ov.save_model`:
The model used was exported to `openvino-int4-npu/` format (INT4 grouped quantization, Qwen3-0.6B, 28 layers). The export itself completes without error.
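The exact export command was not captured; under the stated assumptions (optimum-intel CLI, INT4, group_size=128), it would resemble the following sketch (paths match the reproducer below; flags assume a recent optimum-intel):

```shell
# Hypothetical export command matching the reported configuration
# (INT4, per-group quantization with group_size=128):
optimum-cli export openvino \
  --model Qwen/Qwen3-0.6B \
  --weight-format int4 \
  --group-size 128 \
  models/qwen3-0.6b/openvino-int4-npu/
```

Since channel-wise INT4 is listed as supported for NPU, re-exporting with `--group-size -1` (channel-wise) may serve as a workaround until the compiler bug is fixed, though this has not been verified here.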
- Construct a heterogeneous speculative decoding pipeline:
```python
import openvino_genai as ov_genai
from openvino_genai import LLMPipeline, SchedulerConfig

TARGET_PATH = "models/qwen3-14b/openvino-int4-gpu/"
DRAFT_NPU_PATH = "models/qwen3-0.6b/openvino-int4-npu/"

scheduler = SchedulerConfig()
scheduler.cache_size = 3  # GB

# Aborts here: the VPUX compiler crashes while compiling the NPU draft model.
pipeline = LLMPipeline(
    TARGET_PATH,
    "GPU",
    draft_model=ov_genai.draft_model(DRAFT_NPU_PATH, "NPU"),
    scheduler_config=scheduler,
)
```
- Observed result:
The process aborts with the output below before the `LLMPipeline` constructor returns to Python. No Python exception is raised. Exit code: 1.
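Because the abort happens in native code, a caller that needs to survive it can only isolate the constructor in a child process. A minimal sketch of that pattern (the helper name is hypothetical; the snippet passed in would be the pipeline-construction code above):

```python
import subprocess
import sys

def run_isolated(code: str) -> tuple[bool, int]:
    """Execute a Python snippet in a child interpreter so that a native
    SIGABRT (e.g. from the VPUX compiler) cannot kill the caller.
    Returns (succeeded, returncode)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.returncode

# os.abort() stands in for the crashing LLMPipeline constructor:
ok, rc = run_isolated("import os; os.abort()")
print(ok, rc)  # the parent survives; ok is False, rc is nonzero
```

This is a caller-side mitigation only; it does not make the model compile, it just keeps the host process alive to report the failure.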
Relevant log output

```
[ERROR] 00:12:40.147 [vpux-compiler] Got Diagnostic at loc(fused<{name =
"__module.model.layers.0.self_attn.v_proj/ov_ext::linear/MatMul",
type = "MatMul"}>["__module.model.layers.0.self_attn.v_proj/ov_ext::linear/MatMul",
"fc_decomposed", "matmul_0", "as_convolution"]) :
Channels count of input tensor shape and filter shape must be the same: 0 != 8
loc(fused<{name = "__module.model.layers.0.self_attn.v_proj/ov_ext::linear/MatMul",
type = "MatMul"}>["__module.model.layers.0.self_attn.v_proj/ov_ext::linear/
MatMul", "fc_decomposed", "matmul_0", "as_convolution"]):
error: Channels count of input tensor shape and filter shape must be the same: 0 != 8
LLVM ERROR: Failed to infer result type(s):
"IE.Convolution"(...) {} : (tensor<1x0x1x1xf16>, tensor<1x8x1x1xf16>) -> ( ??? )
```

Issue submission checklist
- I'm reporting an issue. It's not a question.
- I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
- There is reproducer code and related data files such as images, videos, models, etc.