ONNX Quantization Framework built on top of ONNXScript and ONNX IR.
⚠️ This project is under active development.
Install directly from PyPI:
```bash
pip install onnx-quantize
```

Here's a minimal example to quantize an ONNX model:
```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, QActivationArgs, quantize
import onnx

# Load your model
model = onnx.load("your_model.onnx")

# Define quantization configuration
qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QInt8,
        symmetric=True,
        strategy="tensor",  # or "channel"
    ),
    input_activations=QActivationArgs(
        dtype=QuantType.QInt8,
        symmetric=False,
        is_static=False,  # Dynamic quantization
    ),
)

# Quantize the model
qmodel = quantize(model, qconfig)

# Save the quantized model
onnx.save(qmodel, "qmodel.onnx")
```

Key features:

- Static Quantization: Calibration-based quantization with activation statistics
- Dynamic Quantization: Runtime quantization for activations
- Weights-Only Quantization: Quantize only model weights, keeping activations in FP32
- Fine-grained Control: Separate configuration for weights, input activations, and output activations
Supports multiple quantization data types:
- INT4 / UINT4: 4-bit quantization
- INT8 / UINT8: 8-bit quantization (default)
Supported granularity strategies:

- Tensor-wise: Single scale/zero-point per tensor
- Per-channel: Separate scale/zero-point per output channel (see the sketch below)
- Group: Configurable group size for finer-grained quantization
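For example, per-channel weight quantization uses the same API as the minimal example above, with `strategy="channel"` (a minimal sketch; the file names are placeholders):

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, quantize
import onnx

model = onnx.load("model.onnx")

# One scale/zero-point per output channel of each quantized weight
qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QInt8,
        symmetric=True,
        strategy="channel",
    )
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_per_channel.onnx")
```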
Supported quantization algorithms:

- RTN (Round-To-Nearest): Default quantization method with MSE optimization support
- GPTQ: Advanced weight quantization with Hessian-based error correction (see the sketch below)
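A GPTQ configuration could look like the sketch below. The `algorithm` argument name is an assumption made for illustration only, not confirmed API; GPTQ needs representative inputs to estimate the Hessian, so calibration data is provided:

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, quantize
import onnx
import numpy as np

model = onnx.load("model.onnx")

# Representative inputs for Hessian estimation (shape is a placeholder)
calibration_data = np.random.randn(100, 224, 224, 3).astype(np.float32)

qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QUInt4,
        symmetric=True,
        strategy="group",
        group_size=128,
        algorithm="gptq",  # hypothetical argument name, for illustration only
    ),
    calibration_data=calibration_data,
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_gptq.onnx")
```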
Additional options:

- Symmetric/Asymmetric: Control zero-point usage
- Reduce Range: Use a reduced quantization range for better numerical stability on some hardware
- Clip Ratio: Percentile-based clipping for outlier handling
- MSE Optimization: Minimize mean squared error when computing quantization parameters
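A configuration combining several of these options might look like the following sketch; the `reduce_range`, `clip_ratio`, and `mse` argument names are assumptions for illustration, not confirmed API:

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, quantize
import onnx

model = onnx.load("model.onnx")

qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QInt8,
        symmetric=True,       # symmetric quantization (no zero-point offset)
        strategy="tensor",
        reduce_range=True,    # hypothetical: use a reduced quantization range
        clip_ratio=0.999,     # hypothetical: percentile clipping of outliers
        mse=True,             # hypothetical: MSE-optimized quantization parameters
    )
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_options.onnx")
```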
Currently supports quantization for:
- MatMul: Matrix multiplication operations
- Gemm: General matrix multiplication
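After quantization, you can inspect the resulting graph with the standard `onnx` API to see how MatMul/Gemm nodes were rewritten (a minimal sketch; the exact op types in the quantized graph depend on the chosen configuration):

```python
from collections import Counter
import onnx

qmodel = onnx.load("qmodel.onnx")

# Count op types to see which nodes were rewritten during quantization
op_counts = Counter(node.op_type for node in qmodel.graph.node)
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")
```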
Weights-only 4-bit quantization:

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, quantize
import onnx

model = onnx.load("model.onnx")

qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QUInt4,
        symmetric=True,
        strategy="tensor",
    )
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_w4.onnx")
```

Weight quantization with a group size of 128:

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, quantize
import onnx
model = onnx.load("model.onnx")

qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QInt8,
        symmetric=True,
        group_size=128,
        strategy="group",
    )
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_group.onnx")
```

Static quantization with calibration data:

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, QActivationArgs, quantize
import onnx
import numpy as np
model = onnx.load("model.onnx")

# Prepare calibration data
calibration_data = np.random.randn(100, 224, 224, 3).astype(np.float32)

qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QInt8,
        symmetric=True,
        strategy="tensor",
    ),
    input_activations=QActivationArgs(
        dtype=QuantType.QUInt8,
        symmetric=False,
        is_static=True,  # Static quantization
    ),
    calibration_data=calibration_data,
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_static.onnx")
```

Static quantization of weights, input activations, and output activations:

```python
from onnx_quantize import QConfig, QuantType, QWeightArgs, QActivationArgs, quantize
import onnx
import numpy as np
model = onnx.load("model.onnx")

# Prepare calibration data for static quantization
calibration_data = np.random.randn(100, 224, 224, 3).astype(np.float32)

qconfig = QConfig(
    weights=QWeightArgs(
        dtype=QuantType.QInt8,
        symmetric=True,
        strategy="tensor",
    ),
    input_activations=QActivationArgs(
        dtype=QuantType.QUInt8,
        symmetric=False,
        is_static=True,
    ),
    output_activations=QActivationArgs(
        dtype=QuantType.QUInt8,
        symmetric=False,
        is_static=True,
    ),
    calibration_data=calibration_data,
)

qmodel = quantize(model, qconfig)
onnx.save(qmodel, "model_full_quant.onnx")
```

The goal is to provide a flexible and extensible quantization framework using modern ONNX tooling (ONNXScript and ONNX IR), with capabilities comparable to Neural Compressor.
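A quantized model saved with any of the configurations above can be sanity-checked with ONNX Runtime. The snippet below is a sketch: it assumes `onnxruntime` is installed, the model has a single input, and the input shape shown is a placeholder for your model's actual shape:

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("qmodel.onnx", providers=["CPUExecutionProvider"])

# Feed a random input matching the model's expected shape (placeholder shape here)
input_name = sess.get_inputs()[0].name
dummy_input = np.random.randn(1, 224, 224, 3).astype(np.float32)

outputs = sess.run(None, {input_name: dummy_input})
print([output.shape for output in outputs])
```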
