Commit 231ea70

Merge branch 'openvinotoolkit:develop' into develop

2 parents 7ad1586 + 2d23007

177 files changed: +36925 additions, -18432 deletions

.github/workflows/precommit.yml

Lines changed: 6 additions & 6 deletions

@@ -121,13 +121,13 @@ jobs:
         sudo apt-get --assume-yes install build-essential ninja-build libgl1-mesa-dev libglib2.0-0 wget make
     - name: Download CUDA
       run: |
-        wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
-        sudo sh cuda_12.1.1_530.30.02_linux.run --toolkit --silent
+        wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
+        sudo sh cuda_12.4.0_550.54.14_linux.run --toolkit --silent
     - name: Runner info
       continue-on-error: true
       run: |
-        export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
-        export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
+        export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}
+        export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
         nvidia-smi
         cat /proc/cpuinfo
         nvcc --version
@@ -147,8 +147,8 @@ jobs:
         python -c "import torch; print(torch.cuda.is_available())"
     - name: Run PyTorch precommit test scope
       run: |
-        export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
-        export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
+        export PATH=/usr/local/cuda-12.4/bin${PATH:+:${PATH}}
+        export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
         make test-torch-cuda
 
   tensorflow:
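To confirm locally that the toolkit bump took effect, the workflow's check can be mirrored with a short snippet; a minimal sketch, assuming a CUDA-enabled PyTorch build (e.g. a cu124 wheel) is installed:

```python
# Mirror of the workflow's "Runner info" sanity check.
# Assumes a CUDA-enabled PyTorch build is installed (e.g. a cu124 wheel).
import torch

print(torch.__version__)          # e.g. "2.5.1+cu124"
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True if driver and runtime are usable
```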

README.md

Lines changed: 3 additions & 3 deletions

@@ -439,12 +439,12 @@ conda install -c conda-forge nncf
 - Ubuntu\* 18.04 or later (64-bit)
 - Python\* 3.9 or later
 - Supported frameworks:
-  - PyTorch\* >=2.3, <2.5
+  - PyTorch\* >=2.4, <2.6
   - TensorFlow\* >=2.8.4, <=2.15.1
-  - ONNX\* ==1.16.0
+  - ONNX\* ==1.17.0
   - OpenVINO\* >=2022.3.0
 
-This repository is tested on Python* 3.10.14, PyTorch* 2.4.0 (NVidia CUDA\* Toolkit 12.1) and TensorFlow* 2.12.1 (NVidia CUDA\* Toolkit 11.8).
+This repository is tested on Python* 3.10.14, PyTorch* 2.5.0 (NVidia CUDA\* Toolkit 12.4) and TensorFlow* 2.12.1 (NVidia CUDA\* Toolkit 11.8).
 
 ## NNCF Compressed NNCF Model Zoo

constraints.txt

Lines changed: 3 additions & 3 deletions

@@ -2,11 +2,11 @@
 openvino==2024.4.0
 
 # Pytorch
-torch==2.4.0
-torchvision==0.19.0
+torch==2.5.1
+torchvision==0.20.1
 
 # ONNX
-onnx==1.16.2
+onnx==1.17.0
 onnxruntime==1.19.2
 
 # TensorFlow

docs/Installation.md

Lines changed: 1 addition & 1 deletion

@@ -43,7 +43,7 @@ as well as the supported versions of Python:
 
 | NNCF      | OpenVINO   | PyTorch  | ONNX     | TensorFlow | Python |
 |-----------|------------|----------|----------|------------|--------|
-| `develop` | `2024.4.0` | `2.4.0`  | `1.16.0` | `2.15.1`   | `3.10` |
+| `develop` | `2024.4.0` | `2.5.1`  | `1.17.0` | `2.15.1`   | `3.10` |
 | `2.13.0`  | `2024.4.0` | `2.4.0`  | `1.16.0` | `2.15.1`   | `3.8`* |
 | `2.12.0`  | `2024.3.0` | `2.3.0`  | `1.16.0` | `2.15.1`   | `3.8`* |
 | `2.11.0`  | `2024.2.0` | `2.3.0`  | `1.16.0` | `2.12.0`   | `3.8`  |
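To compare a local environment against this table, a small check with the standard library; a sketch assuming the packages were installed from PyPI under these distribution names (`tensorflow` may be absent):

```python
# Print installed versions of the frameworks from the compatibility table.
from importlib.metadata import PackageNotFoundError, version

for dist in ("nncf", "openvino", "torch", "onnx", "tensorflow"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```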

docs/usage/post_training_compression/weights_compression/Usage.md

Lines changed: 79 additions & 41 deletions

@@ -1,11 +1,24 @@
-## Weights Compression
 
-[OpenVINO](https://github.com/openvinotoolkit/openvino) is the preferred backend to run Weights Compression with. PyTorch and Torch FX are also supported.
+- [The algorithm description](#the-algorithm-description)
+- [Supported modes](#supported-modes)
+- [User guide](#user-guide)
+  - [Data-free methods](#data-free-methods)
+  - [Data-aware methods](#data-aware-methods)
+  - [Caching Statistics](#caching-statistics)
+- [Evaluation results](#evaluation-results)
+  - [Data-free Mixed-Precision on Lambada OpenAI dataset](#data-free-mixed-precision-on-lambada-openai-dataset)
+  - [Data-aware Mixed-Precision and AWQ methods on Wikitext dataset](#data-aware-mixed-precision-and-awq-methods-on-wikitext-dataset)
+  - [Scale Estimation and GPTQ methods on Lambada OpenAI dataset](#scale-estimation-and-gptq-methods-on-lambada-openai-dataset)
+  - [Accuracy/Footprint trade-off](#accuracyfootprint-trade-off)
+- [Limitations](#limitations)
+- [Additional resources](#additional-resources)
 
 ### The algorithm description
 
 The Weights Compression algorithm is aimed at compressing the weights of the models and can be used to optimize the model footprint and performance of large models where the size of weights is relatively larger than the size of activations, for example, Large Language Models (LLM). The algorithm compresses weights for Linear, Convolution and Embedding layers.
 
+[OpenVINO](https://github.com/openvinotoolkit/openvino) is the preferred backend to run Weights Compression with. PyTorch and Torch FX are also supported.
+
 ### Supported modes
 
 By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
@@ -16,6 +29,8 @@ Percent of the rest layers compressed to 4-bit can be configured by "ratio" para
 
 ### User guide
 
+#### Data-free methods
+
 - Compress weights asymmetrically to 8-bit integer data type.
 
 ```python
@@ -56,6 +71,8 @@ from nncf import compress_weights, CompressWeightsMode
 compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9) # model is openvino.Model object
 ```
 
+#### Data-aware methods
+
 - Accuracy of the 4-bit compressed models can be improved by using data-aware mixed-precision algorithm. It is capable to find outliers in the input activations and assign different quantization precision to minimize accuracy degradation.
 Below is the example how to compress 80% of layers to 4-bit integer with a default data-aware mixed precision algorithm.
 It requires just one extra parameter - a NNCF wrapper of the dataset. Refer to the [full example](https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino) of data-aware weight compression for more details. If dataset is not specified, data-free mixed precision algorithm works based on weights only.
@@ -80,56 +97,59 @@ nncf_dataset = nncf.Dataset(synthetic_data, transform_fn)
 - Accuracy of the 4-bit compressed models also can be improved by using AWQ, Scale Estimation, GPTQ or Lora Correction algorithms over data-based mixed-precision algorithm. These algorithms work by equalizing a subset of weights to minimize the difference between the original precision and the 4-bit precision.
 Unlike all others, the Lora Correction algorithm inserts an additional Linear layers for reducing quantization noise and further accuracy improvement. Inevitably, this approach introduces a memory and a runtime overheads, but they are negligible, since the inserted weight much smaller and can be quantized to 8-bit. The AWQ, Scale Estimation (SE) and Lora Correction (LC) algo can be used in any combination together: AWQ + SE, AWQ + LC, SE + LC, AWQ + SE + LC. The GPTQ algorithm can be combined with AWQ and Scale Estimation in any combination: AWQ + GPTQ, GPTQ + SE, AWQ + GPTQ + SE. Below are examples demonstrating how to enable the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms:
 
-  Prepare the calibration dataset for data-based algorithms:
+  <details>
+  <summary>Prepare the calibration dataset for data-based algorithms</summary>
 
-  ```python
-  from datasets import load_dataset
-  from functools import partial
-  from nncf import compress_weights, CompressWeightsMode, Dataset
-  from optimum.intel.openvino import OVModelForCausalLM
-  from transformers import AutoTokenizer
+  ```python
+  from datasets import load_dataset
+  from functools import partial
+  from nncf import compress_weights, CompressWeightsMode, Dataset
+  from optimum.intel.openvino import OVModelForCausalLM
+  from transformers import AutoTokenizer
 
-  def transform_func(item, tokenizer, input_shapes):
-      text = item['text']
-      tokens = tokenizer(text)
+  def transform_func(item, tokenizer, input_shapes):
+      text = item['text']
+      tokens = tokenizer(text)
 
-      res = {'input_ids': np.expand_dims(np.array(tokens['input_ids']), 0),
-             'attention_mask': np.expand_dims(np.array(tokens['attention_mask']), 0)}
+      res = {'input_ids': np.expand_dims(np.array(tokens['input_ids']), 0),
+             'attention_mask': np.expand_dims(np.array(tokens['attention_mask']), 0)}
 
-      if 'position_ids' in input_shapes:
-          position_ids = np.cumsum(res['attention_mask'], axis=1) - 1
-          position_ids[res['attention_mask'] == 0] = 1
-          res['position_ids'] = position_ids
+      if 'position_ids' in input_shapes:
+          position_ids = np.cumsum(res['attention_mask'], axis=1) - 1
+          position_ids[res['attention_mask'] == 0] = 1
+          res['position_ids'] = position_ids
 
-      for name, shape in input_shapes.items():
-          if name in res:
-              continue
-          res[name] = np.zeros(shape)
+      for name, shape in input_shapes.items():
+          if name in res:
+              continue
+          res[name] = np.zeros(shape)
 
-      return res
+      return res
 
-  def get_input_shapes(model, batch_size = 1):
-      inputs = {}
+  def get_input_shapes(model, batch_size = 1):
+      inputs = {}
 
-      for val in model.model.inputs:
-          name = val.any_name
-          shape = list(val.partial_shape.get_min_shape())
-          shape[0] = batch_size
-          inputs[name] = shape
+      for val in model.model.inputs:
+          name = val.any_name
+          shape = list(val.partial_shape.get_min_shape())
+          shape[0] = batch_size
+          inputs[name] = shape
 
-      return inputs
+      return inputs
 
-  # load your model and tokenizer
-  model = OVModelForCausalLM.from_pretrained(...)
-  tokenizer = AutoTokenizer.from_pretrained(...)
+  # load your model and tokenizer
+  model = OVModelForCausalLM.from_pretrained(...)
+  tokenizer = AutoTokenizer.from_pretrained(...)
 
-  # prepare dataset for compression
-  dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')
-  dataset = dataset.filter(lambda example: len(example["text"]) > 80)
-  input_shapes = get_input_shapes(model)
-  nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
-                                          input_shapes=input_shapes))
-  ```
+  # prepare dataset for compression
+  dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')
+  dataset = dataset.filter(lambda example: len(example["text"]) > 80)
+  input_shapes = get_input_shapes(model)
+  nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
+                                          input_shapes=input_shapes))
+  ```
+
+  </details>
 
 - How to compress 80% of layers to 4-bit integer with a default data-based mixed precision algorithm and AWQ with Scale Estimation. It requires to set `awq` to `True` and `scale_estimation` to `True` additionally to data-based mixed-precision algorithm.
 
@@ -180,6 +200,24 @@ from nncf import compress_weights, CompressWeightsMode
 compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
 ```
 
+#### Caching Statistics
+
+To optimize compression time and reuse statistics across multiple configurations, you can use the `statistics_path` option. This feature enables caching of calculated statistics, allowing them to be loaded from a specified path rather than recalculated for each configuration. This approach can significantly reduce compression time during repeated model compression iterations, making it ideal when searching for optimal compression parameters.
+
+To enable statistics caching, set the `statistics_path` parameter to your chosen path.
+
+```python
+from nncf.quantization.advanced_parameters import AdvancedCompressionParameters
+from nncf import compress_weights
+
+compressed_model = compress_weights(
+    model,
+    advanced_parameters=AdvancedCompressionParameters(statistics_path="statistics")
+)
+```
+
+When `statistics_path` is provided, the system first checks if the specified path exists. If it does, the statistics are loaded from this path. If the path does not exist, the statistics are computed and saved to this path for future use.
+
 ### Evaluation results
 
 #### Data-free Mixed-Precision on Lambada OpenAI dataset
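One practical consequence of the new `statistics_path` option documented above: a single statistics computation can serve a whole sweep of compression configurations. A minimal sketch under assumed names (`model` is an `openvino.Model`, `nncf_dataset` an `nncf.Dataset` prepared as in the guide; re-load the model per iteration if your backend modifies it in place):

```python
# Hypothetical ratio sweep reusing one on-disk statistics cache.
# Statistics are computed on the first call and loaded from "statistics"
# on every later call, so only the compression itself is repeated.
import nncf
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

for ratio in (0.6, 0.8, 1.0):
    compressed = nncf.compress_weights(
        model,  # assumed: an openvino.Model loaded elsewhere
        mode=nncf.CompressWeightsMode.INT4_SYM,
        ratio=ratio,
        dataset=nncf_dataset,  # assumed: calibration dataset from the guide
        advanced_parameters=AdvancedCompressionParameters(statistics_path="statistics"),
    )
    # evaluate `compressed` and keep the best accuracy/footprint trade-off
```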

examples/llm_compression/openvino/tiny_llama/main.py

Lines changed: 2 additions & 2 deletions

@@ -11,9 +11,9 @@
 import time
 from functools import partial
 
-import datasets
 import numpy as np
 import openvino as ov
+from datasets import load_dataset
 from optimum.intel.openvino import OVModelForCausalLM
 from transformers import AutoTokenizer
 
@@ -24,7 +24,7 @@ def main():
     MODEL_ID = "PY007/TinyLlama-1.1B-Chat-v0.3"
     OUTPUT_DIR = "tinyllama_compressed"
 
-    dataset = datasets.load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
+    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
 
     tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
     model = OVModelForCausalLM.from_pretrained(MODEL_ID, export=True, load_in_8bit=False, compile=False, stateful=False)
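For context, the example's loaded split feeds weight compression through an `nncf.Dataset`; a hedged sketch of that continuation (illustrative names, not the file's verbatim code):

```python
# Illustrative continuation: wrap the split and compress the OpenVINO model.
import nncf

def transform_fn(item):
    # Tokenize one record into numpy inputs (sketch; the real example also
    # builds position_ids and other inputs the exported model expects).
    return tokenizer(item["text"], return_tensors="np")

quantization_dataset = nncf.Dataset(dataset, transform_fn)
model.model = nncf.compress_weights(  # .model is the underlying openvino.Model
    model.model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    dataset=quantization_dataset,
)
```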

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
 transformers
 datasets==2.14.7
 optimum-intel[openvino]
-onnx<1.16.2
+onnx==1.17.0

examples/llm_compression/openvino/tiny_llama_find_hyperparams/main.py

Lines changed: 2 additions & 0 deletions

@@ -24,6 +24,7 @@
 
 import nncf
 from nncf.common.logging import nncf_logger
+from nncf.quantization.advanced_parameters import AdvancedCompressionParameters
 
 DataItem = TypeVar("DataItem")
 ModelInput = TypeVar("ModelInput")
@@ -63,6 +64,7 @@ def compress_model(
         group_size=group_size,
         awq=awq,
         sensitivity_metric=nncf.parameters.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
+        advanced_parameters=AdvancedCompressionParameters(statistics_path="statistics"),
     )
     return optimized_ov_model
 
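The point of this wiring shows up in the surrounding search loop: statistics are computed once on the first `compress_model` call and reloaded from `statistics` for every later candidate. A rough sketch with hypothetical names, since `compress_model`'s full signature is only partially visible in this diff:

```python
# Hypothetical search loop benefiting from the cached statistics.
for ratio in (0.6, 0.8):
    for group_size in (64, 128):
        candidate = compress_model(
            ov_model,      # assumed: the original openvino.Model
            nncf_dataset,  # assumed: calibration nncf.Dataset
            ratio=ratio,
            group_size=group_size,
            awq=True,
        )
        # evaluate `candidate` and track the best configuration (not shown)
```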

examples/llm_compression/openvino/tiny_llama_find_hyperparams/requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -4,5 +4,5 @@ numpy>=1.23.5
 openvino==2024.4
 optimum-intel[openvino]>=1.13.0
 transformers>=4.35.2
-onnx<1.16.2
+onnx==1.17.0
 numpy<2

Lines changed: 2 additions & 2 deletions

@@ -1,8 +1,8 @@
-torch==2.4.0
+torch==2.5.1
 datasets==3.0.1
 numpy>=1.23.5
 openvino==2024.4
 optimum-intel[openvino]>=1.13.0
 transformers>=4.35.2
-onnx==1.16.0
+onnx==1.17.0
 numpy<2
