Releases: openvinotoolkit/nncf

v2.16.0

10 Apr 10:13

Post-training Quantization:

Features:

  • (PyTorch) Added support for 4-bit weight compression with the data-aware AWQ and Scale Estimation methods to reduce quality loss; see the sketch after this list.
  • (PyTorch, Experimental) Introduced TorchFunctionMode support for the MinMax, FastBiasCorrection, SmoothQuant, and WeightCompression algorithms.
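
A minimal sketch of the new data-aware 4-bit flow on the PyTorch backend. The model name, calibration texts, and transform function are illustrative assumptions, not part of the release notes:

```python
import nncf
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumption: any HF causal LM
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A handful of calibration samples; real runs use a representative dataset.
texts = ["Example calibration sentence.", "Another short sample."]

def transform_fn(text):
    # Map a raw string to the keyword arguments of the model's forward call.
    return dict(tokenizer(text, return_tensors="pt"))

model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=64,                              # illustrative group size
    dataset=nncf.Dataset(texts, transform_fn),
    awq=True,                                   # data-aware AWQ
    scale_estimation=True,                      # data-aware Scale Estimation
)
```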

Fixes:

  • Fixed occasional failures of the weights compression algorithm on ARM CPUs.
  • Fixed GPTQ failures with per-channel INT4 weight compression.
  • Fixed weight compression failures for models with FP8 weights.
  • (PyTorch, Experimental) Fixed weights compression for float16/bfloat16 models.
  • (PyTorch, Experimental) Fixed several memory-leak issues: non-detached tensors, extracted modules, and graph building with gradients.

Improvements:

  • Reduced the run time and peak memory of the mixed precision assignment procedure during weight compression in the OpenVINO backend. Overall compression time reduction in the mixed precision case is about 20-40%; peak memory reduction is about 20%.
  • The NNCF hardware config has been extended with the narrow_range parameter, enabling more combinations of quantization configurations in the MinMax quantization algorithm.
  • (TorchFX, Experimental) Added quantization support for TorchFX models exported with dynamic shapes.
  • (TorchFX, Experimental) The constant folding step is removed from the quantize_pt2e function and the transform_for_annotation method of the OpenVINOQuantizer to align with the torch.ao quantization implementation.
  • Optimized GPTQ algorithm behavior to decrease memory & time consumption by 2.71x and 1.16x, respectively.
  • Added general support for optimization of models with FP8 and NF4 weights.
  • Disabled applying the overflow fix for non-8-bit quantization.

Compression-aware training:

Features:

  • (PyTorch) Introduced a novel weight compression method that significantly improves the accuracy of Large Language Models (LLMs) with INT4 weights. Leveraging Quantization-Aware Training (QAT) and absorbable LoRA adapters, this approach can achieve a 2x reduction in accuracy loss during compression compared to the best post-training weight compression technique in NNCF (Scale Estimation + AWQ + GPTQ). The nncf.compress_weights API now includes a new compression_format option, nncf.CompressionFormat.FQ_LORA, for this QAT method; a sample compression pipeline with preview support is available here, and a sketch follows this list.
  • (PyTorch) Changed the compression module serialization API: compressed_model.nncf.get_config was replaced by nncf.torch.get_config. The documentation was updated to use the new API.
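
A hedged sketch of the FQ_LORA entry point, reusing the model, texts, and transform_fn placeholders from the v2.16.0 post-training sketch above; the fine-tuning loop itself is omitted:

```python
import nncf
import nncf.torch

# Compress with FakeQuantize + absorbable LoRA adapters, ready for QAT.
model = nncf.compress_weights(
    model,  # PyTorch LLM from the earlier sketch
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=nncf.Dataset(texts, transform_fn),
    compression_format=nncf.CompressionFormat.FQ_LORA,
)

# Fine-tune `model` as usual so the adapters recover accuracy, then
# serialize the compression state with the renamed API:
config = nncf.torch.get_config(model)
```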

Requirements:

  • Updated PyTorch (2.6.0) and Torchvision (0.21.0) versions.
  • Updated Transformers (>=4.48.0) version.
  • Updated NumPy (<2.3.0) version support.
  • Updated NetworkX (<3.5.0) version support.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@shumaari

v2.15.0

06 Feb 10:08

Post-training Quantization:

Features:

  • (TensorFlow) The nncf.quantize() method is now the recommended API for Quantization-Aware Training. Please refer to the example for details on how to use the new approach.
  • (TensorFlow) Compression layer placement in the model can now be serialized and restored with the new API functions nncf.tensorflow.get_config() and nncf.tensorflow.load_from_config(); see the sketch after this list. Please see the documentation for the saving/loading of a quantized model for more details.
  • (OpenVINO) Added example with LLM quantization to FP8 precision.
  • (TorchFX, Experimental) Preview support for the new quantize_pt2e API has been introduced, enabling quantization of torch.fx.GraphModule models with the OpenVINOQuantizer and X86InductorQuantizer quantizers. The quantize_pt2e API utilizes the MinMax algorithm statistic collectors, as well as the SmoothQuant, BiasCorrection, and FastBiasCorrection post-training quantization algorithms.
  • Added unification of scales for ScaledDotProductAttention operation.
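
A hedged sketch of the new TensorFlow serialization functions; the model, fresh_model, and calibration_dataset placeholders are assumptions:

```python
import nncf
import nncf.tensorflow

# `model` is a tf.keras.Model and `calibration_dataset` an nncf.Dataset;
# both are assumed to exist for this sketch.
quantized_model = nncf.quantize(model, calibration_dataset)

# Serialize where the compression layers were placed ...
config = nncf.tensorflow.get_config(quantized_model)

# ... and restore that placement onto a freshly built copy of the model.
restored_model = nncf.tensorflow.load_from_config(fresh_model, config)
```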

Fixes:

  • (ONNX) Fixed sporadic accuracy issues with the BiasCorrection algorithm.
  • (ONNX) Fixed GroupConvolution operation weight quantization, which also improves performance for a number of models.
  • Fixed the AccuracyAwareQuantization algorithm to resolve issue #3118.
  • Fixed an issue with using NNCF when a backend framework installation is potentially corrupted.

Improvements:

  • (TorchFX, Experimental) Added YoloV11 support.
  • (OpenVINO) The performance of the FastBiasCorrection algorithm was improved.
  • Significantly faster data-free weight compression for OpenVINO models: INT4 compression is now up to 10x faster, while INT8 compression is up to 3x faster. The larger the model, the greater the time reduction.
  • AWQ weight compression is now up to 2x faster, improving overall runtime efficiency.
  • Peak memory usage during INT4 data-free weight compression in the OpenVINO backend is reduced by up to 50% for certain models.

Deprecations/Removals:

  • (TensorFlow) The nncf.tensorflow.create_compressed_model() method is now marked as deprecated. Please use the nncf.quantize() method for the quantization initialization.

Requirements:

  • Updated the minimum supported NumPy version (>=1.24.0).
  • Removed tqdm dependency.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@rk119
@devesh-2002

v2.14.1

19 Dec 12:28

Post-training Quantization:

Bugfixes:

  • (PyTorch) Fixed the get_torch_compile_wrapper function to match torch.compile.
  • (OpenVINO) Updated cache statistics functionality to utilize the safetensors approach.

v2.14.0

20 Nov 11:49

Post-training Quantization:

Features:

  • Introduced the optional backup_mode parameter in nncf.compress_weights() to specify the data type for embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are INT8_ASYM (the default), INT8_SYM, and NONE, which retains the original floating-point precision of the model weights. See the combined sketch after this list.
  • Added the quantizer_propagation_rule parameter, providing fine-grained control over quantizer propagation. This advanced option is designed to improve accuracy for models where quantizers with different granularity could be merged to per-tensor, potentially affecting model accuracy.
  • Introduced the nncf.data.generate_text_data API method that utilizes an LLM to generate data for further data-aware optimization. See the example for details.
  • (OpenVINO) Extended support of data-free and data-aware weight compression methods for nncf.compress_weights() with NF4 per-channel quantization, which makes compressed LLMs more accurate and faster on NPU.
  • (OpenVINO) Introduced a new option statistics_path to cache and reuse statistics for nncf.compress_weights(), reducing the time required to find optimal compression configurations. See the TinyLlama example for details.
  • (TorchFX, Experimental) Added support for quantization and weight compression of Torch FX models. The compressed models can be directly executed via torch.compile(compressed_model, backend="openvino") (see details here). Added INT8 quantization example. The list of supported features:
    • INT8 quantization with SmoothQuant, MinMax, FastBiasCorrection, and BiasCorrection algorithms via nncf.quantize().
    • Data-free INT8, INT4, and mixed-precision weights compression with nncf.compress_weights().
  • (PyTorch, Experimental) Added model tracing and execution pre-post hooks based on TorchFunctionMode.
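
A hedged sketch combining three of the features above (backup_mode, nncf.data.generate_text_data, and statistics_path). The model objects, the default generate_text_data arguments, and routing statistics_path through AdvancedCompressionParameters are assumptions based on these notes:

```python
import nncf
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

# Generate calibration texts with the LLM itself when no dataset is at hand;
# `model` and `tokenizer` are assumed HF objects, `transform_fn` as before
# (adapted to the target model's input format).
texts = nncf.data.generate_text_data(model, tokenizer)

compressed = nncf.compress_weights(
    ov_model,  # assumption: an OpenVINO IR of the same LLM
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    dataset=nncf.Dataset(texts, transform_fn),
    backup_mode=nncf.BackupMode.INT8_SYM,  # non-default backup precision
    advanced_parameters=AdvancedCompressionParameters(
        statistics_path="statistics_cache"  # cache and reuse statistics
    ),
)
```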

Fixes:

  • Resolved an issue with redundant quantizer insertion before elementwise operations, reducing noise introduced by quantization.
  • Fixed type mismatch issue for nncf.quantize_with_accuracy_control().
  • Fixed BiasCorrection algorithm for specific branching cases.
  • (OpenVINO) Fixed GPTQ weight compression method for Stable Diffusion models.
  • (OpenVINO) Fixed issue with the variational statistics processing for nncf.compress_weights().
  • (PyTorch, ONNX) Scaled dot product attention pattern quantization setup is aligned with OpenVINO.

Improvements:

  • Reduction in peak memory by 30-50% for data-aware nncf.compress_weights() with AWQ, Scale Estimation, LoRA and mixed-precision algorithms.
  • Reduction in compression time by 10-20% for nncf.compress_weights() with AWQ algorithm.
  • Aligned behavior for ignored subgraph between different networkx versions.
  • Extended ignored patterns with RoPE block for nncf.ModelType.TRANSFORMER scheme.
  • (OpenVINO) Extended the ignored scope for the nncf.ModelType.TRANSFORMER scheme with the GroupNorm metatype.
  • (ONNX) SE-block ignored pattern variant for torchvision mobilenet_v3 has been extended.

Known issues:

  • (ONNX) nncf.quantize() method can generate inaccurate INT8 results for MobileNet models with the BiasCorrection algorithm.

Deprecations/Removals:

  • Migrated from using setup.py to pyproject.toml for the build and package configuration. It is aligned with Python packaging standards as outlined in PEP 517 and PEP 518. The installation through setup.py does not work anymore. No impact on the installation from PyPI and Conda.
  • Removed support for Python 3.8.
  • (PyTorch) nncf.torch.create_compressed_model() function has been deprecated.

Requirements:

  • Updated ONNX (1.17.0) and ONNXRuntime (1.19.2) versions.
  • Updated PyTorch (2.5.1) and Torchvision (0.20.1) versions.
  • Updated NumPy (<2.2.0) version support.
  • Updated Ultralytics (8.3.22) version.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@rk119
@zina-cs

v2.13.0

19 Sep 10:24

Post-training Quantization:

Features:

  • (OpenVINO) Added support for combining GPTQ with the AWQ and Scale Estimation (SE) algorithms in nncf.compress_weights() for more accurate weight compression of LLMs. Thus, the following combinations with GPTQ are now supported: AWQ+GPTQ+SE, AWQ+GPTQ, GPTQ+SE, GPTQ. See the sketch after this list.
  • (OpenVINO) Added LoRA Correction Algorithm to further improve the accuracy of int4 compressed models on top of other algorithms - AWQ and Scale Estimation. It can be enabled via the optional lora_correction parameter of the nncf.compress_weights() API. The algorithm increases compression time and incurs a negligible model size overhead. Refer to accuracy/footprint trade-off for different int4 compression methods.
  • (PyTorch) Added implementation of the experimental Post-training Activation Pruning algorithm. Refer to Activation Sparsity for details.
  • Added a memory monitoring tool for logging the memory that a piece of Python code or a script allocates. Refer to NNCF tools for details.
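
A hedged sketch of the AWQ+GPTQ+SE combination; ov_model and calibration_dataset are assumed placeholders:

```python
import nncf

compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    dataset=calibration_dataset,
    awq=True,
    gptq=True,
    scale_estimation=True,  # the AWQ+GPTQ+SE combination from the notes
)
```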

Fixes:

  • (OpenVINO) Fixed the quantization of Convolution and LSTMSequence operations in cases where some inputs are part of a ShapeOf subgraph.
  • (OpenVINO) Fixed issue with the FakeConvert duplication for FP8.
  • Fixed a SmoothQuant algorithm issue in the case of incorrect shapes.
  • Fixed non-deterministic layer-wise scheduling.

Improvements:

  • (OpenVINO) Increased hardware-fused pattern coverage.
  • Improved progress bar logic during weights compression for more accurate remaining time estimation.
  • Extended the supported bitness range of the Scale Estimation algorithm in nncf.compress_weights().
  • Removed extra logging for the algorithm-generated ignored scope.

Compression-aware training:

Fixes:

  • (PyTorch) Fixed some scenarios of NNCF patching interfering with torch.compile.

Requirements:

  • Updated PyTorch (2.4.0) and Torchvision (0.19.0) versions.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@rk119

v2.12.0

31 Jul 12:28

Post-training Quantization:

Features:

  • (OpenVINO, PyTorch, ONNX) Excluded comparison operators from the quantization scope for nncf.ModelType.TRANSFORMER.
  • (OpenVINO, PyTorch) Changed the representation of symmetrically quantized weights from an unsigned integer with a fixed zero-point to a signed data type without a zero-point in the nncf.compress_weights() method.
  • (OpenVINO) Extended the pattern support of the AWQ algorithm as part of nncf.compress_weights(). This allows applying AWQ to a wider range of models.
  • (OpenVINO) Introduced the nncf.CompressWeightsMode.E2M1 mode option of nncf.compress_weights() for the new MXFP4 precision (Experimental); see the sketch after this list.
  • (OpenVINO) Added support for models with BF16 precision in the nncf.quantize() method.
  • (PyTorch) Added quantization support for torch.addmm.
  • (PyTorch) Added quantization support for torch.nn.functional.scaled_dot_product_attention.
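
A hedged sketch of the experimental E2M1 mode; ov_model is an assumed placeholder and the group size is illustrative:

```python
import nncf

compressed = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.E2M1,  # experimental MXFP4-style precision
    group_size=32,
)
```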

Fixes:

  • (OpenVINO, PyTorch, ONNX) Fixed Fast-/BiasCorrection algorithms with correct support of transposed MatMul layers.
  • (OpenVINO) Fixed nncf.IgnoredScope() functionality for models with If operation.
  • (OpenVINO) Fixed patterns with PReLU operations.
  • Fixed runtime error while importing NNCF without Matplotlib package.

Improvements:

  • Reduced the amount of memory required for applying nncf.compress_weights() to OpenVINO models.
  • Improved logging for a non-empty nncf.IgnoredScope().

Compression-aware training:

Fixes:

  • (PyTorch) Fixed an issue with wrapping for operators without a patched state.

Requirements:

  • Updated TensorFlow (2.15) version. This version requires Python 3.9-3.11.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@Lars-Codes

v2.11.0

17 Jun 11:02

Post-training Quantization:

Features:

  • (OpenVINO) Added Scale Estimation algorithm for 4-bit data-aware weights compression. The optional scale_estimation parameter was introduced to nncf.compress_weights() and can be used to minimize accuracy degradation of compressed models (note that this algorithm increases the compression time).
  • (OpenVINO) Added GPTQ algorithm for 8/4-bit data-aware weights compression, supporting INT8, INT4, and NF4 data types. The optional gptq parameter was introduced to nncf.compress_weights() to enable the GPTQ algorithm.
  • (OpenVINO) Added support for models with BF16 weights in the weights compression method, nncf.compress_weights().
  • (PyTorch) Added support for quantization and weight compression of the custom modules.

Fixes:

  • (OpenVINO) Fixed incorrect determination of nodes with bias in the Fast-/BiasCorrection and ChannelAlignment algorithms.
  • (OpenVINO, PyTorch) Fixed incorrect behaviour of nncf.compress_weights() when a compressed model is passed as input.
  • (OpenVINO, PyTorch) Fixed SmoothQuant algorithm to work with Split ports correctly.

Improvements:

  • (OpenVINO) Aligned resulting compression subgraphs for the nncf.compress_weights() in different FP precisions.
  • Aligned 8-bit scheme for NPU target device with the CPU.

Examples:

  • (OpenVINO, ONNX) Updated ignored scope for YOLOv8 examples utilizing a subgraphs approach.

Compression-aware training:

Features:

  • (PyTorch) The nncf.quantize method is now the recommended path for quantization initialization in Quantization-Aware Training.
  • (PyTorch) Compression module placement in the model can now be serialized and restored with the new API functions compressed_model.nncf.get_config() and nncf.torch.load_from_config(). Documentation for the saving/loading of a quantized model is available, and the Resnet18 example was updated to use the new API.
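
A hedged sketch of this serialization flow; model, fresh_model, and calibration_dataset are assumed placeholders (note that v2.16.0 later renamed the getter to nncf.torch.get_config):

```python
import torch
import nncf
import nncf.torch

quantized = nncf.quantize(model, calibration_dataset)  # QAT initialization

# Serialize the placement of the inserted compression modules.
config = quantized.nncf.get_config()
torch.save(config, "nncf_config.pt")

# Restore onto a freshly constructed model of the same architecture.
restored = nncf.torch.load_from_config(fresh_model, torch.load("nncf_config.pt"))
```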

Fixes:

  • (PyTorch) Fixed compatibility with torch.compile.

Improvements:

  • (PyTorch) Base parameters were extended for the EvolutionOptimizer (LeGR algorithm part).
  • (PyTorch) Improved wrapping for parameters which are not tensors.

Examples:

  • (PyTorch) Added an example for STFPM model from Anomalib.

Deprecations/Removals:

  • Removed extra dependencies to install backends from setup.py (like [torch], [tf], [onnx], and [openvino]).
  • Removed openvino-dev dependency.

Requirements:

  • Updated PyTorch (2.3.0) and Torchvision (0.18.0) versions.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@DaniAffCH
@UsingtcNower
@anzr299
@AdiKsOnDev
@Viditagarwal7479
@truhinnm

v2.10.0

25 Apr 12:01

Post-training Quantization:

Features:

  • Introduced subgraph-defining functionality for the nncf.IgnoredScope() option; see the sketch after this list.
  • Introduced limited support for batch sizes greater than 1. The MobilenetV2 PyTorch example was updated with batch support.
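
A hedged sketch of excluding a whole subgraph from quantization; the node names and model objects are placeholders:

```python
import nncf

ignored = nncf.IgnoredScope(
    subgraphs=[nncf.Subgraph(inputs=["node_a"], outputs=["node_b"])]
)
quantized = nncf.quantize(model, calibration_dataset, ignored_scope=ignored)
```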

Fixes:

  • Fixed issue with the nncf.OverflowFix parameter absence in some scenarios.
  • Aligned the list of correctable layers for the FastBiasCorrection algorithm between PyTorch, OpenVINO and ONNX backends.
  • Fixed issue with the nncf.QuantizationMode parameters combination.
  • Fixed MobilenetV2 (PyTorch, ONNX, OpenVINO) examples for the Windows platform.
  • (OpenVINO) Fixed Anomaly Classification example for the Windows platform.
  • (PyTorch) Fixed bias shift magnitude calculation for fused layers.
  • (OpenVINO) Fixed removing the ShapeOf graph which led to an error in the nncf.quantize_with_accuracy_control() method.

Improvements:

  • OverflowFix, AdvancedSmoothQuantParameters and AdvancedBiasCorrectionParameters were exposed into the nncf.* namespace.
  • (OpenVINO, PyTorch) Introduced scale compression to FP16 for weights in nncf.compress_weights() method, regardless of model weights precision.
  • (PyTorch) Modules that NNCF inserted were excluded from parameter tracing.
  • (OpenVINO) Extended the list of correctable layers for the BiasCorrection algorithm.
  • (ONNX) Aligned BiasCorrection algorithm behaviour with OpenVINO in specific cases.

Compression-aware training:

Features:

  • (PyTorch) The nncf.quantize method may now be used for quantization initialization in Quantization-Aware Training. Added a Resnet18-based example showing the transition from Post-Training Quantization to Quantization-Aware Training.
  • (PyTorch) Introduced extractors for the fused Convolution, Batch-/GroupNorm, and Linear functions.

Fixes:

  • (PyTorch) Fixed apply_args_defaults function issue.
  • (PyTorch) Fixed dtype handling for the compressed torch.nn.Parameter.
  • (PyTorch) Fixed is_shared parameter propagation.

Improvements:

  • (PyTorch) Updated command creation behaviour to reduce the number of adapters.
  • (PyTorch) Added an option to insert points for models wrapped with replace_modules=False.

Deprecations/Removals:

  • (PyTorch) Removed the binarization algorithm.
  • NNCF installation with backend extras via pip install nncf[&lt;backend&gt;] is now deprecated.

Requirements:

  • Updated PyTorch (2.2.1) and CUDA (12.1) versions.
  • Updated ONNX (1.16.0) and ONNXRuntime (1.17.1) versions.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@Candyzorua
@clinty
@UsingtcNower
@DaniAffCH

v2.9.0

06 Mar 11:39

Post-training Quantization:

Features:

  • (OpenVINO) Added a modified AWQ algorithm for 4-bit data-aware weight compression. This algorithm is applied only to MatMul->Multiply->MatMul patterns. The optional awq parameter has been added to nncf.compress_weights() and can be used to minimize accuracy degradation of compressed models (note that this option increases the compression time).
  • (ONNX) Introduced support for the ONNX backend in the nncf.quantize_with_accuracy_control() method. Users can now perform quantization with accuracy control for onnx.ModelProto, enhancing the accuracy of quantized models while minimizing the performance impact; see the sketch after this list.
  • (ONNX) Added an example based on the YOLOv8n-seg model for demonstrating the usage of quantization with accuracy control for the ONNX backend.
  • (PyTorch) Added the SmoothQuant algorithm for the PyTorch backend in nncf.quantize().
  • (OpenVINO) Added an example with the hyperparameters tuning for the TinyLLama model.
  • Introduced the nncf.AdvancedAccuracyRestorerParameters.
  • Introduced the subset_size option for the nncf.compress_weights().
  • Introduced TargetDevice.NPU as the replacement for TargetDevice.VPU.
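
A hedged sketch of accuracy-controlled quantization for ONNX; the model path, datasets, and metric are placeholders:

```python
import onnx
import nncf

model = onnx.load("model.onnx")  # placeholder path

def validation_fn(model, validation_data):
    # Placeholder: run inference and return a scalar accuracy metric.
    return 0.0

quantized = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,  # nncf.Dataset, assumed
    validation_dataset=validation_dataset,    # nncf.Dataset, assumed
    validation_fn=validation_fn,
    max_drop=0.01,  # tolerate at most a 0.01 absolute metric drop
)
```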

Fixes:

  • Fixed API Enums serialization/deserialization issue.
  • Fixed issue with required arguments for revert_operations_to_floating_point_precision method.

Improvements:

  • (ONNX) Aligned statistics collection with OpenVINO and PyTorch backends.
  • Extended nncf.compress_weights() with Convolution & Embeddings compression in order to reduce memory footprint.

Deprecations/Removals:

  • (OpenVINO) Removed outdated examples with nncf.quantize() for BERT and YOLOv5 models.
  • (OpenVINO) Removed outdated example with nncf.quantize_with_accuracy_control() for SSD MobileNetV1 FPN model.
  • (PyTorch) Deprecated the binarization algorithm.
  • Removed Post-training Optimization Tool as OpenVINO backend.
  • Removed Dockerfiles.
  • TargetDevice.VPU was replaced by TargetDevice.NPU.

Compression-aware training:

Fixes:

  • (PyTorch) Fixed issue with NNCFNetworkInterface.get_clean_shallow_copy missed arguments.

Acknowledgements

Thanks for contributions from the OpenVINO developer community:
@AishwaryaDekhane
@UsingtcNower
@Om-Doiphode

v2.8.1

09 Feb 09:45

Post-training Quantization:

Bugfixes:

  • (Common) Fixed issue with nncf.compress_weights() to avoid overflows on 32-bit Windows systems.
  • (Common) Fixed performance issue with nncf.compress_weights() on LLama models.
  • (Common) Fixed the nncf.quantize_with_accuracy_control pipeline when the tune_hyperparams=True option is enabled.
  • (OpenVINO) Fixed an issue with stateful LLM models and added state restoration after inference.
  • (PyTorch) Fixed an issue with nncf.compress_weights() for LLM models where is_floating_point was executed during tracing.