Bug Fixes
- Fix ONNX 1.19 compatibility issues with CuPy during ONNX INT4 AWQ quantization. ONNX 1.19 uses ``ml_dtypes.int4`` instead of ``numpy.int8``, which caused CuPy failures.
New Features
- Add support for ONNX mixed-precision weight-only quantization using INT4 and INT8 precisions. Refer to the quantization example for GenAI LLMs.
- Add support for quantizing some diffusion models on Windows. Refer to the example script for details.
- Add Perplexity and KL-Divergence accuracy benchmarks.
New Features
- Model Optimizer for Windows now supports the NvTensorRtRtx execution provider.
New Features
- New LLMs such as DeepSeek are now supported with ONNX INT4 AWQ quantization on Windows. Refer to the Windows Support Matrix for details about supported features and models.
- Model Optimizer for Windows now supports ONNX INT8 and FP8 quantization (W8A8) of SAM2 and Whisper models. See the example scripts to get started with quantizing these models.
New Features
- This is the first official release of Model Optimizer for Windows.
- ONNX INT4 Quantization: :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` now supports ONNX INT4 quantization for DirectML and TensorRT* deployment. See :ref:`Support_Matrix` for details about supported features and models.
- LLM Quantization with Olive: Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer to the Olive example.
- DirectML Deployment Guide: Added a DirectML (DML) deployment guide. Refer to the :ref:`Onnxruntime_Deployment` deployment guide for details.
- MMLU Benchmark for Accuracy Evaluations: Introduced MMLU benchmarking for accuracy evaluation of ONNX models on DirectML (DML).
- Published quantized ONNX models collection: Quantized ONNX models are now available in the NVIDIA collections on Hugging Face.
* This version includes experimental features such as TensorRT deployment of ONNX INT4 models, PyTorch quantization, and sparsity. These are currently unverified on Windows.