add onnx dequantize linear channel wise pattern for NPU #30265

bopeng1234 · 2025-04-22T02:21:56Z

Details:

When the DequantizeLinear op meets the NPU requirements (symmetric INT4, channel-wise quantization), performance improves substantially.

  // For NPU optimization
  // reference: https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/text_generation#npu-support
  //  1. The model must be exported with symmetric INT4 quantization.
  //  2. Quantized LLM MatMul ops should use channel-wise quantization for better performance (block_size = K).
  // With these constraints, the NPU vpux-plugin compiler can use a mixed-precision MatMul with an i4 weight input,
  // which greatly improves performance on NPU.
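The two constraints above (symmetric INT4, one quantization block spanning all K rows) can be illustrated with a small pure-Python sketch. It assumes a MatMul weight laid out as K rows by N output channels and holds INT4 values in ordinary Python ints. The helper names are hypothetical and are not part of OpenVINO or ONNX. Mapping the largest magnitude in each channel to 7 is one common choice of symmetric scale, not necessarily the one used by any particular exporter.

```python
def quantize_sym_int4_per_channel(w):
    """Symmetric INT4 quantization of a K x N MatMul weight (list of K rows),
    using one scale per output channel n, i.e. a single block of size K."""
    K, N = len(w), len(w[0])
    scales = []
    for n in range(N):
        max_abs = max(abs(w[k][n]) for k in range(K))
        # Map the largest magnitude to 7; fall back to 1.0 for an all-zero column.
        scales.append(max_abs / 7.0 if max_abs > 0 else 1.0)
    # Signed INT4 range is [-8, 7]; the zero point is implicitly 0 (symmetric).
    q = [[max(-8, min(7, round(w[k][n] / scales[n]))) for n in range(N)] for k in range(K)]
    return q, scales


def dequantize_sym_per_channel(q, scales):
    # With zero_point == 0, dequantization is an int -> float convert and a per-channel multiply.
    return [[q_kn * scales[n] for n, q_kn in enumerate(row)] for row in q]


# Usage: a 4 x 2 weight (K = 4 rows, N = 2 output channels).
w = [[0.70, -0.20],
     [-0.35, 0.40],
     [0.10, -0.10],
     [0.00, 0.05]]
q, scales = quantize_sym_int4_per_channel(w)
w_hat = dequantize_sym_per_channel(q, scales)
print(q, scales)
```

Because each channel has a single scale and no zero point, the reconstruction error per element is bounded by half that channel's scale, and the dequantization needs no subtraction.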

This PR optimizes the ops that the ONNX frontend produces when decomposing DequantizeLinear so that they match the NPU optimization pattern. Before this PR, the INT4 phi3 ONNX model running on NPUW had these latencies:
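For reference, the ONNX spec defines DequantizeLinear as y = (x - x_zero_point) * x_scale, where for per-axis quantization the scale and optional zero point are 1-D tensors broadcast along `axis`, and a missing zero point means 0. The sketch below compares that general form with a simplified Convert -> Multiply form for the symmetric case. It assumes (the PR text does not show the exact subgraph the frontend emits) that the NPU-friendly decomposition is one without the zero-point Subtract; both function names are hypothetical.

```python
def dequantize_linear_per_axis(x, scale, zero_point=None):
    """Reference per-axis DequantizeLinear on a 2-D input with axis=1, following the
    ONNX definition y = (x - zero_point) * scale; zero_point defaults to 0."""
    if zero_point is None:
        zero_point = [0] * len(scale)
    return [[(v - zero_point[n]) * scale[n] for n, v in enumerate(row)] for row in x]


def decompose_symmetric(x, scale):
    """Hypothetical simplified decomposition for the symmetric case:
    Convert(x) -> Multiply(scale), with no Subtract because zero_point is 0."""
    return [[float(v) * scale[n] for n, v in enumerate(row)] for row in x]


x = [[-8, 7], [3, -2]]   # INT4 values held as Python ints
scale = [0.5, 0.25]      # one scale per column (axis=1)
print(decompose_symmetric(x, scale))  # [[-4.0, 1.75], [1.5, -0.5]]
```

The two forms agree only when every zero point is 0, which is why the symmetric-quantization requirement matters for this pattern.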
1st token 8000ms, 2nd token 9040ms, 3rd+ token avg 4000ms

With this PR applied, the same model reaches:
1st token 3400ms, 2nd token 3400ms, 3rd+ token avg 250ms

This is a 16x speedup in steady-state (3rd+ token) latency on NPUW for the phi3 model.
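The per-stage speedups can be computed directly from the latencies reported above. The 16x figure is the steady-state (3rd+ token) ratio; the first two tokens improve by roughly 2.4x and 2.7x.

```python
# Latencies reported in this PR for the INT4 phi3 ONNX model on NPUW (milliseconds).
before = {"1st token": 8000, "2nd token": 9040, "3rd+ token (avg)": 4000}
after = {"1st token": 3400, "2nd token": 3400, "3rd+ token (avg)": 250}

speedup = {name: before[name] / after[name] for name in before}
for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
# 1st token: 2.35x
# 2nd token: 2.66x
# 3rd+ token (avg): 16.00x
```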

bopeng1234 · 2025-04-22T02:23:33Z

@gkrivor I created this new PR for NPU; it runs with an ONNX model (QDQ, DequantizeLinear op).

bopeng1234 requested review from a team as code owners April 22, 2025 02:21

github-actions bot added the labels category: ONNX FE (OpenVINO ONNX FrontEnd), category: NPU (OpenVINO NPU plugin), and category: NPUW (NPUW plugin) on Apr 22, 2025

add onnx dequantize linear channel wise pattern for NPU

e19047c

bopeng1234 force-pushed the dev_npu_onnx_qdq branch from 82ddecd to e19047c on April 22, 2025 02:22

sys-openvino-ci added the ExternalIntelPR External contributor from Intel label Apr 22, 2025

bopeng1234 mentioned this pull request Apr 22, 2025

[intel-npu] refine microsoft matmulnbits op, to fit NPU mix precision pattern #29671 (Closed)
