Post-training Quantization:
Features:
- (PyTorch) Added support for 4-bit weight compression with the AWQ and Scale Estimation data-aware methods to reduce quality loss (see the sketch after this list).
- (PyTorch, Experimental) Introduced TorchFunctionMode support for MinMax, FastBiasCorrection, SmoothQuant, WeightCompression algorithms.
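A minimal sketch of the new data-aware 4-bit flow on a PyTorch model, assuming a Hugging Face causal LM; the model id, `group_size`, and `ratio` values are illustrative choices, while `mode`, `dataset`, `awq`, and `scale_estimation` are the relevant `nncf.compress_weights` parameters:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any PyTorch LLM follows the same pattern.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# AWQ and Scale Estimation are data-aware, so a small calibration set is required.
calibration_texts = ["NNCF compresses LLM weights to 4 bits."]
dataset = nncf.Dataset(
    calibration_texts,
    transform_func=lambda text: dict(tokenizer(text, return_tensors="pt")),
)

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=64,          # illustrative value
    ratio=0.8,              # illustrative value
    dataset=dataset,
    awq=True,               # activation-aware weight scaling
    scale_estimation=True,  # data-aware refinement of quantization scales
)
```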
Fixes:
- Fixed occasional failures of the weights compression algorithm on ARM CPUs.
- Fixed GPTQ failures with per-channel int4 weight compression.
- Fixed weight compression failures for models with fp8 weights.
- (PyTorch, Experimental) Fixed weights compression for float16/bfloat16 models.
- (PyTorch, Experimental) Fixed several memory leak issues: non-detached tensors, extracted modules, and graph building with gradients.
Improvements:
- Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. Overall compression time reduction in the mixed precision case is about 20-40%; peak memory reduction is about 20%.
- The NNCF hardware config has been extended with the `narrow_range` parameter, enabling more combinations of quantization configurations in the MinMax quantization algorithm (see the range sketch after this list).
- (TorchFX, Experimental) Added quantization support for TorchFX models exported with dynamic shapes.
- (TorchFX, Experimental) The constant folding step was removed from the `quantize_pt2e` function and the `transform_for_annotation` method of the `OpenVINOQuantizer` to align with the `torch.ao` quantization implementation (see the usage sketch after this list).
- Optimized the GPTQ algorithm to decrease memory and time consumption by 2.71x and 1.16x, respectively.
- Added general support for optimization of models with FP8 and NF4 weights.
- Disabled applying the overflow fix for non-8-bit quantization.
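For reference, `narrow_range` controls whether the lowest level of the signed integer grid is dropped so the grid is symmetric around zero. A conceptual range sketch (plain Python, not NNCF API):

```python
def int_range(num_bits: int, narrow_range: bool) -> tuple[int, int]:
    """Signed integer quantization bounds for the given bit width."""
    low = -(2 ** (num_bits - 1))
    high = 2 ** (num_bits - 1) - 1
    if narrow_range:
        low += 1  # drop the lowest level to make the grid symmetric around zero
    return low, high

print(int_range(8, narrow_range=False))  # (-128, 127)
print(int_range(8, narrow_range=True))   # (-127, 127)
```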
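And a usage sketch of the experimental TorchFX flow around `quantize_pt2e`; the import paths and the positional signature shown here are assumptions about the current `nncf.experimental.torch.fx` layout and may change:

```python
import torch
import torchvision.models as models
import nncf
# Experimental API; import paths are an assumption and may change.
from nncf.experimental.torch.fx import OpenVINOQuantizer, quantize_pt2e

model = models.resnet18(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)

# Capture the model with torch.export to obtain a TorchFX graph.
exported_model = torch.export.export(model, (example_input,)).module()

calibration_dataset = nncf.Dataset([example_input])
quantizer = OpenVINOQuantizer()

# As of this release, quantize_pt2e no longer folds constants internally;
# apply constant folding afterwards if your runtime expects it.
quantized_model = quantize_pt2e(exported_model, quantizer, calibration_dataset)
```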
Tutorials:
- Post-Training Optimization of Gemma3 Model
- Post-Training Optimization of GLM4-V Model
- Post-Training Optimization of Llasa Model
- Post-Training Optimization of YOLOv12 Model
- Post-Training Optimization of Phi-4-multimodal Model
- Post-Training Optimization of Qwen2.5VL Model
- Post-Training Optimization of DeepSeek-VL2 Model
- Post-Training Optimization of FLUX.1 Fill Model
- Post-Training Optimization of olmOCR Model
- Post-Training Optimization of SmolDocling Model
- Post-Training Optimization of SmolVLM2 Model
- Post-Training Optimization of GOT-OCR 2.0 Model
- Post-Training Optimization of LTX-Video Model
- Post-Training Optimization of OuteTTS Model
- Post-Training Optimization of SigLIP2 Model
- Post-Training Optimization of OpenCLIP Model
Compression-aware training:
Features:
- (PyTorch) Introduced a novel weight compression method to significantly improve the accuracy of Large Language Models (LLMs) with int4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The `nncf.compress_weights` API now includes a new `compression_format` option, `nncf.CompressionFormat.FQ_LORA`, for this QAT method; a sample compression pipeline with preview support is available here (see the first sketch after this list).
- (PyTorch) Changed the compression modules serialization API: `compressed_model.nncf.get_config` was replaced by `nncf.torch.get_config` (see the second sketch after this list). The documentation was updated to use the new API.
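A minimal sketch of starting the QAT + LoRA flow; only `compression_format=nncf.CompressionFormat.FQ_LORA` is taken from this release, while the model id, the `mode` choice, and the calibration data are illustrative assumptions, and the fine-tuning loop is elided:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-135M"  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = nncf.Dataset(
    ["calibration text"],
    transform_func=lambda text: dict(tokenizer(text, return_tensors="pt")),
)

# Insert fake-quantize operations plus absorbable LoRA adapters for int4 QAT.
model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # illustrative mode
    dataset=dataset,
    compression_format=nncf.CompressionFormat.FQ_LORA,
)

# ...then fine-tune `model` with a regular PyTorch training loop so the
# LoRA adapters recover accuracy lost to 4-bit quantization.
```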
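And a sketch of the renamed serialization API, continuing from the snippet above (`model`, `fresh_model`, and `example_input` stand in for objects from your pipeline); `nncf.torch.get_config` is from this release, while `nncf.torch.load_from_config` and its signature are my assumption for the restore side:

```python
import torch
import nncf.torch

# Persist the compression state next to the weights.
config = nncf.torch.get_config(model)  # replaces compressed_model.nncf.get_config()
torch.save({"model_state_dict": model.state_dict(), "nncf_config": config}, "ckpt.pt")

# Restore: rebuild the compression modules on a fresh instance of the same model.
ckpt = torch.load("ckpt.pt")
restored = nncf.torch.load_from_config(fresh_model, ckpt["nncf_config"], example_input)  # assumed API
restored.load_state_dict(ckpt["model_state_dict"])
```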
Requirements:
- Updated PyTorch (2.6.0) and Torchvision (0.21.0) versions.
- Updated Transformers (>=4.48.0) version.
- Updated NumPy (<2.3.0) version support.
- Updated NetworkX (<3.5.0) version support.
Acknowledgements
Thanks for contributions from the OpenVINO developer community:
@shumaari