This repository demonstrates the optimization of the Qwen2.5-1.5B-Instruct model using post-training quantization (PTQ) techniques. The optimization process is divided into these workflows:
- Quark Quantization for AMD NPU
- PTQ + AOT for QNN NPU
  - This process extends the QDQ flow by compiling the model specifically for Qualcomm NPUs.
- Int4 Quantization for QNN GPU
- OpenVINO for Intel® CPU/GPU/NPU
  - This process uses OpenVINO-specific passes such as OpenVINOOptimumConversion, OpenVINOIoUpdate, and OpenVINOEncapsulation.
- Float downcasting for NVIDIA TRT for RTX GPU
- DML for general GPU
  - This process uses ModelBuilder.
Some Python packages require Visual Studio 2022 or the Visual Studio 2022 Build Tools to be installed with the C++ development tools.
This workflow produces an ONNX QDQ model that is agnostic to the target hardware and accelerator, making it suitable for general inference.
The model is optimized using weight-only quantization and activation quantization for efficient deployment. The process includes:
- Weight Rotation (QuaRot)
  - Reduces outliers from weights and hidden states to enhance quantization efficiency.
- 4-bit Per-Channel Symmetric Quantization (GPTQ)
  - Reduces transformer layer size while preserving accuracy.
- ONNX Graph Capture
  - Exports the model to ONNX for further optimization.
- 4-bit Block-wise Quantization
  - Applies weight-only quantization to the embedding layer and language modeling head.
- 16-bit Activation Quantization
  - Uses 16-bit activations to balance precision and efficiency.
The final output is a QDQ model with 4-bit weights and 16-bit activations. This model also leverages GroupQueryAttention (GQA) for efficient long-context processing and long-sequence generation.
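To make the weight-quantization steps above concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization, once with one scale per output channel and once with one scale per block. It uses plain round-to-nearest rather than GPTQ's error-compensating procedure. The function names, the example matrix, and the choice of a symmetric scheme for the block-wise case are assumptions for illustration, not the workflow's actual implementation.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit integer range


def quantize_symmetric(values: List[float]) -> Tuple[List[int], float]:
    """Quantize one group of weights that share a single scale (zero point is 0)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    quantized = [max(INT4_MIN, min(INT4_MAX, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate float weights."""
    return [q * scale for q in quantized]


def quantize_per_channel(weight: List[List[float]]) -> List[Tuple[List[int], float]]:
    """Per-channel: every row (output channel) gets its own scale."""
    return [quantize_symmetric(row) for row in weight]


def quantize_block_wise(row: List[float], block_size: int) -> List[Tuple[List[int], float]]:
    """Block-wise: every contiguous block of `block_size` values gets its own scale."""
    return [quantize_symmetric(row[i:i + block_size]) for i in range(0, len(row), block_size)]


# Hypothetical 2x4 weight matrix; the second row has a much larger range.
weight = [[0.7, -0.3, 0.1, 0.0], [14.0, -6.0, 2.0, 0.0]]
for quantized, scale in quantize_per_channel(weight):
    print(quantized, scale)
```

Giving each row or block its own scale keeps a large-magnitude group from forcing a coarse scale onto the others. The cost is one stored scale per group.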
NPUs require precompiled graphs, meaning the model must use static input shapes. However, text generation involves two distinct processing stages:
- Prefill (Prompt Processing): Processes multiple tokens simultaneously.
- Token Generation (Iteration): Processes one token at a time.
To support both efficiently, we create two model instances:
- Prefill model: Optimized for batch processing.
- Token generation model: Optimized for one-token-at-a-time inference.
When the quantization dataset sequence length is 1024, quantization needs about 20 GB of GPU memory, so adjust the length to fit your hardware. The default length from the IHV is 4096, which would need an A100 to convert.
This process extends the QDQ Model with 4-bit Weights & 16-bit Activations by compiling it specifically for Qualcomm NPUs using the QNN Execution Provider.
To maximize efficiency while supporting dynamic input handling:
- Embedding Layer & Language Model Head → Executed on CPU (handles dynamic input).
- Transformer Layers → Executed on NPU (requires static input shapes).
- Weight Sharing → Prefill & token generation models reuse weights to minimize memory usage.
⚠️ Note: GQA is an ONNX Runtime contrib operator and must be executed on the CPU. The model graph is partitioned into CPU (GQA nodes) and NPU (other nodes) for execution.
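The CPU/NPU split described in the note can be pictured as grouping consecutive nodes by the device they are allowed to run on. The sketch below only illustrates that idea: ONNX Runtime performs the real partitioning internally, and the node list, node names, and helper functions here are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

# Operators assumed to be unsupported on the NPU (per the note above).
CPU_ONLY_OPS = {"GroupQueryAttention"}


def assign_device(op_type: str) -> str:
    """Return the device a node with this operator type would run on."""
    return "CPU" if op_type in CPU_ONLY_OPS else "NPU"


def partition(nodes: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group consecutive (name, op_type) nodes that share a device into segments."""
    segments = []
    for device, group in groupby(nodes, key=lambda node: assign_device(node[1])):
        segments.append((device, [name for name, _ in group]))
    return segments


# Hypothetical slice of one transformer layer, in execution order.
layer = [
    ("input_norm", "LayerNormalization"),
    ("qkv_proj", "MatMul"),
    ("attention", "GroupQueryAttention"),
    ("out_proj", "MatMul"),
]
for device, names in partition(layer):
    print(device, names)
```

Each GQA node splits the surrounding NPU work into separate segments, so execution alternates between the two devices within every layer.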
Once optimized, the model is compiled for Qualcomm NPUs using ONNX Runtime QNNExecutionProvider. The steps include:
- Split the Quantized Model → Divide into three parts:
  - Embedding Layer
  - Transformer Layers
  - Language Model Head
- Set Static Input Shapes:
  - (1, 64) for prefill (batch size, sequence length).
  - (1, 1) for token generation.
- Compile using QNNExecutionProvider:
  - Leverages weight sharing across the prefill and token generation models.
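One way to read the static shapes above is that a prompt reaches the prefill model in fixed windows of 64 tokens, while generation proceeds one token at a time. A minimal sketch of that bookkeeping follows. The padding token id, the right-padding strategy, and the function name are assumptions for illustration; the source does not state how the runtime actually handles prompts shorter or longer than the window.

```python
from typing import List, Tuple

PREFILL_LEN = 64  # static prefill sequence length, from the (1, 64) shape
PAD_ID = 0        # hypothetical padding token id


def split_into_prefill_windows(
    prompt_ids: List[int],
    prefill_len: int = PREFILL_LEN,
    pad_id: int = PAD_ID,
) -> List[Tuple[List[int], int]]:
    """Split a prompt into fixed-size windows.

    Returns (window, valid_count) pairs. Every window has exactly
    `prefill_len` entries; the last one is right-padded with `pad_id`.
    """
    windows = []
    for start in range(0, len(prompt_ids), prefill_len):
        chunk = prompt_ids[start:start + prefill_len]
        valid_count = len(chunk)
        chunk = chunk + [pad_id] * (prefill_len - valid_count)
        windows.append((chunk, valid_count))
    return windows


# A hypothetical 70-token prompt needs two prefill windows: 64 + 6 valid tokens.
prompt = list(range(1, 71))
for window, valid_count in split_into_prefill_windows(prompt):
    print(len(window), valid_count)
```

Tracking the number of valid tokens per window lets the caller ignore the padded positions when it reads the outputs.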
This workflow is configured using the qnn_config.json file. It contains all of the quantization and compilation steps. It requires two separate Python environments described below.
- python=3.10
- CUDA=12.1
- cudnn=9.2.0
Quantization is resource-intensive and requires GPU acceleration. In an x64 Python environment with Olive installed, install the required packages:
# Install common dependencies
pip install -r requirements.txt
# Install ONNX Runtime GPU packages
pip install "onnxruntime-gpu>=1.21.0" "onnxruntime-genai-cuda>=0.6.0"
# AutoGPTQ: Install from source (stable package may be slow for weight packing)
# Disable CUDA extension build (not required)
# Linux
export BUILD_CUDA_EXT=0
# Windows
# set BUILD_CUDA_EXT=0
# Install AutoGPTQ from source
pip install --no-build-isolation git+https://github.com/PanQiWei/AutoGPTQ.git
⚠️ Only set up the environment and install the packages. Do not run the olive run command at this point.
Model compilation using QNN Execution Provider requires a Python environment with onnxruntime-qnn installed. In a separate Python environment with Olive installed, install the required packages:
# Install ONNX Runtime QNN
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
Replace /path/to/qnn/env/bin in qnn_config.json with the path to the directory containing your QNN environment's Python executable. This path can be found by running the following command in the environment:
# Linux
command -v python
# Windows
# where python
This command returns the path to the Python executable. Use the parent directory of the executable as the value for /path/to/qnn/env/bin in the config file.
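As an alternative to the shell commands, the directory can be printed from Python itself when the snippet is run inside the QNN environment. This is a small sketch; the helper name is hypothetical. It deliberately does not resolve symlinks, so that a virtual environment's own bin (or Scripts) directory is reported rather than the base interpreter's.

```python
import sys
from pathlib import Path


def python_bin_dir(executable: str = sys.executable) -> str:
    """Return the directory that contains the given Python executable."""
    return str(Path(executable).parent)


# Run inside the QNN environment; the printed directory is the value
# that replaces /path/to/qnn/env/bin in qnn_config.json.
print(python_bin_dir())
```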
Activate the Quantization Python Environment and run the workflow:
olive run --config qnn_config.json
Olive will run the AOT compilation step in the AOT Compilation Python Environment specified in the config file using a subprocess. All other steps will run in the Quantization Python Environment natively.
✅ Optimized model saved in: ./model
⚠️ If optimization fails because it runs out of memory, remove calibration_providers from the config file.
⚠️ If optimization fails during context binary generation (EPContextBinaryGenerator pass, usually exit status 3221225477), rerun the command. The process will resume from the last completed step.
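Because a rerun resumes from the last completed step, the retry can be automated. The wrapper below is a sketch of that idea using only the standard library; the function name and the attempt limit are hypothetical, and it retries on any non-zero exit code rather than only on the status mentioned above.

```python
import subprocess
import sys
from typing import Sequence


def run_with_retries(cmd: Sequence[str], max_attempts: int = 3) -> int:
    """Run `cmd`, rerunning it after a non-zero exit code.

    Returns the number of attempts used. Raises RuntimeError if the
    command still fails after `max_attempts` attempts.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"Attempt {attempt} failed with exit code {result.returncode}",
            file=sys.stderr,
        )
    raise RuntimeError(f"Command failed after {max_attempts} attempts: {list(cmd)}")


# Example call, to be made from the Quantization Python Environment:
# run_with_retries(["olive", "run", "--config", "qnn_config.json"])
```

Capping the number of attempts avoids looping forever on a failure that a rerun cannot fix, such as running out of memory.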
The optimized model can be used for inference using ONNX Runtime QNNExecutionProvider and ONNX Runtime GenAI. Inference must be run on a Windows Copilot+ PC with a Qualcomm NPU.
pip install -r https://raw.githubusercontent.com/microsoft/onnxruntime/refs/heads/main/requirements.txt
pip install -U --pre --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple onnxruntime-qnn --no-deps
pip install "onnxruntime-genai>=0.7.0rc2"
Execute the provided inference_sample.ipynb notebook.
These workflows perform quantization with Optimum Intel®. They run the following optimization pipeline:
- HuggingFace Model -> Quantized OpenVINO model -> Quantized encapsulated ONNX OpenVINO IR model