
v2.16.0

@nikita-malininn released this 10 Apr 10:13

Post-training Quantization:

Features:

  • (PyTorch) Added support for 4-bit weight compression with the AWQ and Scale Estimation data-aware methods to reduce quality loss (see the sketch after this list).
  • (PyTorch, Experimental) Introduced TorchFunctionMode support for the MinMax, FastBiasCorrection, SmoothQuant, and WeightCompression algorithms.
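
A minimal sketch of the data-aware int4 flow referenced above, assuming a Hugging Face causal LM; the model name, calibration text, and hyperparameters (group_size, ratio) are illustrative placeholders, not values prescribed by this release:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model/tokenizer; any PyTorch LLM with linear layers works similarly.
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tiny calibration set; real pipelines use a representative text corpus.
calibration_texts = ["NNCF compresses LLM weights to int4 with minimal quality loss."]
dataset = nncf.Dataset(
    calibration_texts,
    lambda text: tokenizer(text, return_tensors="pt")["input_ids"],
)

compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=64,          # illustrative group size
    ratio=0.8,              # fraction of weights compressed to int4
    dataset=dataset,
    awq=True,               # data-aware Activation-aware Weight Quantization
    scale_estimation=True,  # data-aware scale refinement
)
```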

Fixes:

  • Fixed occasional failures of the weights compression algorithm on ARM CPUs.
  • Fixed GPTQ failures with per-channel int4 weight compression.
  • Fixed weight compression failures for models with fp8 weights.
  • (PyTorch, Experimental) Fixed weights compression for float16/bfloat16 models.
  • (PyTorch, Experimental) Fixed several memory leak issues: non-detached tensors, extracted modules & graph building with gradients.

Improvements:

  • Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. Overall compression time reduction in the mixed precision case is about 20-40%; peak memory reduction is about 20%.
  • The NNCF hardware config has been extended with the narrow_range parameter, enabling more combinations of quantization configurations in the MinMax quantization algorithm.
  • (TorchFX, Experimental) Added quantization support for TorchFX models exported with dynamic shapes (a sketch follows this list).
  • (TorchFX, Experimental) The constant folding step is removed from the quantize_pt2e function and the transform_for_annotation method of the OpenVINOQuantizer to align with the torch.ao quantization implementation.
  • Optimized GPTQ algorithm behavior to decrease memory & time consumption by 2.71x and 1.16x, respectively.
  • Added general support for optimization of models with FP8 and NF4 weights.
  • Disabled applying the overflow fix for non-8-bit quantization.
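
The dynamic-shape TorchFX flow noted above can be sketched roughly as follows; the import location of quantize_pt2e and OpenVINOQuantizer, the toy model, and the dynamic-shape spec are assumptions made for illustration, not an excerpt from the release:

```python
import torch
import nncf
# Assumed experimental import location for the TorchFX quantization entry points.
from nncf.experimental.torch.fx import OpenVINOQuantizer, quantize_pt2e


class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))


# Export with a dynamic batch dimension so the quantized model keeps it.
example_input = torch.randn(2, 8)
batch = torch.export.Dim("batch")
exported = torch.export.export(ToyModel(), (example_input,), dynamic_shapes={"x": {0: batch}})

# Post-training quantization of the exported GraphModule.
calibration_data = [torch.randn(2, 8) for _ in range(4)]
quantized = quantize_pt2e(
    exported.module(),
    OpenVINOQuantizer(),
    nncf.Dataset(calibration_data),
)
```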

Compression-aware training:

Features:

  • (PyTorch) Introduced a novel weight compression method that significantly improves the accuracy of Large Language Models (LLMs) with int4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The nncf.compress_weights API now includes a new compression_format option, nncf.CompressionFormat.FQ_LORA, for this QAT method; a sample compression pipeline with preview support is available in the NNCF repository (see the sketch after this list).
  • (PyTorch) Changed the compression modules serialization API: compressed_model.nncf.get_config was replaced with nncf.torch.get_config. The documentation was updated to use the new API.
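
A hedged sketch tying together the new FQ_LORA compression format and the renamed serialization API; the model, calibration data, mode, and fine-tuning loop are placeholders, and only compression_format, nncf.CompressionFormat.FQ_LORA, and nncf.torch.get_config come from this release:

```python
import torch
import nncf
import nncf.torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder LLM; any PyTorch causal LM can be substituted.
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

calibration_texts = ["Example calibration sentence for int4 QAT."]
dataset = nncf.Dataset(
    calibration_texts,
    lambda text: tokenizer(text, return_tensors="pt")["input_ids"],
)

# Insert fake-quantize operations plus absorbable LoRA adapters
# via the new compression format.
model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=dataset,
    compression_format=nncf.CompressionFormat.FQ_LORA,
)

# ... fine-tune `model` with a regular PyTorch training loop (the QAT step) ...

# Serialize the compression modules with the updated API.
config = nncf.torch.get_config(model)
torch.save(config, "nncf_compression_config.pt")
```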

Requirements:

  • Updated PyTorch (2.6.0) and Torchvision (0.21.0) versions.
  • Updated Transformers (>=4.48.0) version.
  • Updated NumPy (<2.3.0) version support.
  • Updated NetworkX (<3.5.0) version support.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@shumaari