## Weights Compression
The Weights Compression algorithm is aimed at compressing the weights of models and can be used to optimize the model footprint and performance of large models where the size of the weights is considerably larger than the size of the activations, for example, Large Language Models (LLMs). The algorithm compresses weights of Linear, Convolution and Embedding layers.
[OpenVINO](https://github.com/openvinotoolkit/openvino) is the preferred backend to run Weights Compression with. PyTorch and Torch FX are also supported.
### Supported modes
By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
4-bit modes such as "INT4_SYM" and "INT4_ASYM" are also supported; the percent of layers compressed to 4-bit can be configured by the "ratio" parameter, while the rest are kept in 8-bit.
### User guide
#### Data-free methods
- Compress weights asymmetrically to 8-bit integer data type.
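
A minimal sketch of this call (it assumes `model` already holds an `openvino.Model`; INT8_ASYM is the default mode, so no extra parameters are needed):

```python
from nncf import compress_weights

compressed_model = compress_weights(model)  # model is openvino.Model object
```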
- Compress weights asymmetrically to 4-bit integer data type with a group size of 64; 90% of the layers are compressed to 4-bit and the rest are kept in 8-bit (controlled by "ratio").

```python
from nncf import compress_weights, CompressWeightsMode

compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9) # model is openvino.Model object
```

#### Data-aware methods
- Accuracy of the 4-bit compressed models can be improved by using the data-aware mixed-precision algorithm. It is capable of finding outliers in the input activations and assigning different quantization precisions to minimize accuracy degradation.
Below is an example of how to compress 80% of the layers to 4-bit integer with the default data-aware mixed-precision algorithm.
It requires just one extra parameter - an NNCF wrapper of the dataset. Refer to the [full example](https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino) of data-aware weight compression for more details. If a dataset is not specified, the data-free mixed-precision algorithm works based on weights only.
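
A sketch of such a call, assuming `model` is an `openvino.Model` and `calibration_data` is an iterable of pre-processed model inputs (both names are illustrative; the 4-bit symmetric mode is used here only as an example):

```python
from nncf import compress_weights, CompressWeightsMode, Dataset

nncf_dataset = Dataset(calibration_data)  # wrap an iterable of model inputs into an NNCF dataset
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,             # 80% of the layers are compressed to 4-bit
    dataset=nncf_dataset,  # providing a dataset enables the data-aware mixed-precision algorithm
)
```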
- Accuracy of the 4-bit compressed models can also be improved by using the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms on top of the data-based mixed-precision algorithm. These algorithms work by equalizing a subset of weights to minimize the difference between the original precision and the 4-bit precision.
Unlike the others, the Lora Correction algorithm inserts additional Linear layers to reduce quantization noise and further improve accuracy. Inevitably, this approach introduces memory and runtime overheads, but they are negligible, since the inserted weights are much smaller and can be quantized to 8-bit. The AWQ, Scale Estimation (SE) and Lora Correction (LC) algorithms can be used in any combination together: AWQ + SE, AWQ + LC, SE + LC, AWQ + SE + LC. The GPTQ algorithm can be combined with AWQ and Scale Estimation in any combination: AWQ + GPTQ, GPTQ + SE, AWQ + GPTQ + SE. Below are examples demonstrating how to enable the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms:
<details>
<summary>Prepare the calibration dataset for data-based algorithms</summary>

```python
import numpy as np

from datasets import load_dataset
from functools import partial
from nncf import compress_weights, CompressWeightsMode, Dataset
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

def transform_func(item, tokenizer, input_shapes):
    text = item['text']
    tokens = tokenizer(text)

    res = {'input_ids': np.expand_dims(np.array(tokens['input_ids']), 0),
           'attention_mask': np.expand_dims(np.array(tokens['attention_mask']), 0)}
    # Zero-fill any other inputs the exported model expects (e.g. position_ids)
    for name, shape in input_shapes.items():
        if name not in res:
            res[name] = np.zeros(shape)
    return res

model_id = 'model-id'  # placeholder: HuggingFace id of the model being compressed
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_shapes = {}  # names and shapes of extra model inputs to zero-fill, if any

dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')
nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer, input_shapes=input_shapes))
```

</details>

- How to compress 80% of layers to 4-bit integer with the default data-based mixed-precision algorithm and AWQ with Scale Estimation: set `awq` to `True` and `scale_estimation` to `True` in addition to the data-based mixed-precision parameters, as shown below.
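
A sketch of the corresponding call, reusing the `nncf_dataset` prepared above; `model.model` is the underlying `openvino.Model` of the `OVModelForCausalLM` wrapper, and the mode and ratio values are illustrative:

```python
compressed_model = compress_weights(
    model.model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    dataset=nncf_dataset,
    awq=True,               # enable AWQ
    scale_estimation=True,  # enable Scale Estimation
)
```

The `gptq` and `lora_correction` flags of `compress_weights` enable the GPTQ and Lora Correction algorithms in the same way.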
To optimize compression time and reuse statistics across multiple configurations, you can use the `statistics_path` option. This feature enables caching of calculated statistics, allowing them to be loaded from a specified path rather than recalculated for each configuration. This approach can significantly reduce compression time during repeated model compression iterations, making it ideal when searching for optimal compression parameters.
To enable statistics caching, set the `statistics_path` parameter to your chosen path.
```python
from nncf import compress_weights, CompressWeightsMode
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    dataset=nncf_dataset,
    advanced_parameters=AdvancedCompressionParameters(statistics_path="statistics"),  # directory used as the statistics cache
)
```

When `statistics_path` is provided, the system first checks if the specified path exists. If it does, the statistics are loaded from this path. If the path does not exist, the statistics are computed and saved to this path for future use.
### Evaluation results
#### Data-free Mixed-Precision on Lambada OpenAI dataset