
Commit 0d42595

Aligned bf16 tuning vs f32 inference for 4bit compression (#3493)
### Changes

Always cast the input to float32 inside FQ + LoRA.

Benchmark results with the new schema on https://github.com/ljaljushkin/nncf_pytorch/tree/nl/ref_benchmark with small modifications from @nikita-malininn's branch https://github.com/nikita-malininn/nncf/tree/nm/ref_benchmark:

| device | dtype | exec_type | tensor_type | granularity | symmetric | narrow_range | timing_mode | num_runs | input_size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cuda | bfloat16 | ExecutionType.REGULAR | TensorType.WEIGHTS | GranularityType.PER_CHANNEL | TRUE | FALSE | TimingMode.KERNEL | 1000 | [2048, 128256] |

| name | Mode | forward_avg, ms | backward_avg, ms | memory, Gb |
| --- | --- | --- | --- | --- |
| compile (PR) | sym | 6.5 | 10.7 | 3.9 |
| compile (PR) | asym | 6.8 | 10.7 | 3.9 |
| compile (before) | sym | 1.6 | 9.5 | 4.2 |
| compile (before) | asym | 1.9 | 9.5 | 3.9 |
| not compiled (PR) | sym | 19.0 | 46.6 | 5.9 |
| not compiled (PR) | asym | 19.6 | 47.0 | 5.9 |
| not compiled (before) | sym | 9.2 | 37.0 | 5.4 |
| not compiled (before) | asym | 9.6 | 37.0 | 5.4 |

There is an overhead on the forward pass, but it is largely offset by using torch.compile. Per epoch, the overhead is 1-6% on RTX; on A100, depending on the setup, there can even be a 6% speedup or a 3% slowdown.

![image](https://github.com/user-attachments/assets/a7eb98e9-a906-4462-98a3-c4e2e061eb5a)
![image](https://github.com/user-attachments/assets/1e803a51-1a2b-4a44-9603-b7987e940bb2)

### Reason for changes

Minimize the disparity in precision between the Torch model and its exported OV equivalent. Full alignment would be very inefficient, so this is a compromise: align accuracy with minimal overhead on the forward pass.

The e2e test on `facebook/opt-125m` confirms that the outputs now match within the default absolute tolerance (1e-8) instead of the previous 1e-2: https://github.com/openvinotoolkit/nncf/pull/3493/files#diff-7a4f90fe4f07d515df355d6fb618112d7d3fe88eb8ba777e502c695a7c715010R170

Previously, there were 3 problematic models with a significant difference in accuracy; now the results are much better aligned:

![image](https://github.com/user-attachments/assets/a22c855a-7c00-4f77-895f-91ce5713387c)

### Related tickets

166195

### Tests

test examples - https://github.com/openvinotoolkit/nncf/actions/runs/15024278726/job/42221028011
1 parent b3c9119 commit 0d42595
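
For orientation, a minimal sketch of the pattern this PR adopts in `nncf/torch/quantization/quantize_functions.py` (see the diff below). This is not the NNCF code itself: the function and variable names are illustrative, and it assumes PyTorch 2.x for `torch.compile`. The idea is to remember the incoming dtype, run the quantize-dequantize math in float32, and cast the result back, letting `torch.compile` absorb most of the extra cast overhead.

```python
import torch


def fake_quantize_sym(x: torch.Tensor, scale: torch.Tensor, levels: int = 16) -> torch.Tensor:
    """Quantize-dequantize x in float32 and return the result in x's original dtype."""
    dtype = x.dtype  # e.g. torch.bfloat16 during tuning
    x32 = x.to(torch.float32)  # do the fake-quantization math in float32
    s32 = scale.to(torch.float32)
    level_low, level_high = -levels // 2, levels // 2 - 1  # -8 .. 7 for INT4
    q = torch.clamp(torch.round(x32 / s32), level_low, level_high)
    return (q * s32).to(dtype)  # cast back so the rest of the graph stays in bf16


# torch.compile fuses the extra casts, which keeps the forward overhead small
fq = torch.compile(fake_quantize_sym)

w = torch.randn(8, 16, dtype=torch.bfloat16)
scale = w.abs().amax(dim=-1, keepdim=True).to(torch.float32) / 7
print(fq(w, scale).dtype)  # torch.bfloat16
```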

File tree

5 files changed: +75 / -57 lines

examples/llm_compression/torch/qat_with_lora/README.md

Lines changed: 31 additions & 22 deletions
@@ -64,28 +64,37 @@ Where:
 - `PPL_PTWC` is the perplexity after applying the best Post-Training Weight Compression method identified
   for each specific model: this was "AWQ + Scale Estimation + GPTQ" for "HuggingFaceTB/SmolLM-1.7B-Instruct",
   and "AWQ + Scale Estimation" for all other models evaluated.
-- `PPL_QAT+LoRA` is the perplexity after applying Quantization-Aware Training with LoRA.
+- `PPL_QAT+LoRA` is the perplexity after applying Quantization-Aware Training with LoRA for 10 epochs.
 
 All quantization methods compressed the models to `INT4_ASYM` precision with a group size of `64`.
 
-| Model                              | Precision         | Wikitext,<br>word_ppl | Improvement |
-|------------------------------------|-------------------|-----------------------|-------------|
-| google/gemma-2-2b-it               | BF16              | 15.02                 |             |
-| google/gemma-2-2b-it               | INT4 (QAT + LoRA) | 15.13                 | 86%         |
-| google/gemma-2-2b-it               | INT4 (best PTWC)  | 15.80                 |             |
-| microsoft/phi3-mini-4k-instruct    | BF16              | 9.49                  |             |
-| microsoft/phi3-mini-4k-instruct    | INT4 (QAT + LoRA) | 10.12                 | 27%         |
-| microsoft/phi3-mini-4k-instruct    | INT4 (best PTWC)  | 10.36                 |             |
-| Qwen/Qwen2.5-3B-Instruct           | BF16              | 11.01                 |             |
-| Qwen/Qwen2.5-3B-Instruct           | INT4 (QAT + LoRA) | 11.49                 | 25%         |
-| Qwen/Qwen2.5-3B-Instruct           | INT4 (best PTWC)  | 11.65                 |             |
-| HuggingFaceTB/SmolLM-1.7B-Instruct | BF16              | 19.11                 |             |
-| HuggingFaceTB/SmolLM-1.7B-Instruct | INT4 (QAT + LoRA) | 19.25                 | 79%         |
-| HuggingFaceTB/SmolLM-1.7B-Instruct | INT4 (best PTWC)  | 19.79                 |             |
-| mistralai/Mistral-7B-v0.3          | BF16              | 8.21                  |             |
-| mistralai/Mistral-7B-v0.3          | INT4 (QAT + LoRA) | 8.38                  | 12%         |
-| mistralai/Mistral-7B-v0.3          | INT4 (best PTWC)  | 8.40                  |             |
-| meta-llama/Llama-3.2-3B-Instruct   | BF16              | 12.67                 |             |
-| meta-llama/Llama-3.2-3B-Instruct   | INT4 (QAT + LoRA) | 12.82                 | 73%         |
-| meta-llama/Llama-3.2-3B-Instruct   | INT4 (best PTWC)  | 13.22                 |             |
-|                                    |                   | Average               | 50.4%       |
+| Model                               | Precision         | Wikitext,<br>word_ppl | Improvement |
+|-------------------------------------|-------------------|-----------------------|-------------|
+| google/gemma-2-2b-it                | BF16              | 15.02                 |             |
+| google/gemma-2-2b-it                | INT4 (QAT + LoRA) | 15.09                 | 91%         |
+| google/gemma-2-2b-it                | INT4 (best PTWC)  | 15.80                 |             |
+| microsoft/phi3-mini-4k-instruct     | BF16              | 9.49                  |             |
+| microsoft/phi3-mini-4k-instruct     | INT4 (QAT + LoRA) | 10.04                 | 37%         |
+| microsoft/phi3-mini-4k-instruct     | INT4 (best PTWC)  | 10.36                 |             |
+| Qwen/Qwen2.5-3B-Instruct            | BF16              | 11.01                 |             |
+| Qwen/Qwen2.5-3B-Instruct            | INT4 (QAT + LoRA) | 11.44                 | 33%         |
+| Qwen/Qwen2.5-3B-Instruct            | INT4 (best PTWC)  | 11.65                 |             |
+| HuggingFaceTB/SmolLM-1.7B-Instruct  | BF16              | 19.11                 |             |
+| HuggingFaceTB/SmolLM-1.7B-Instruct  | INT4 (QAT + LoRA) | 19.34                 | 66%         |
+| HuggingFaceTB/SmolLM-1.7B-Instruct  | INT4 (best PTWC)  | 19.79                 |             |
+| mistralai/Mistral-7B-v0.3           | BF16              | 8.21                  |             |
+| mistralai/Mistral-7B-v0.3           | INT4 (QAT + LoRA) | 8.36                  | 20%         |
+| mistralai/Mistral-7B-v0.3           | INT4 (best PTWC)  | 8.40                  |             |
+| meta-llama/Llama-3.2-1B-Instruct    | BF16              | 16.30                 |             |
+| meta-llama/Llama-3.2-1B-Instruct    | INT4 (QAT + LoRA) | 17.12                 | 40%         |
+| meta-llama/Llama-3.2-1B-Instruct    | INT4 (best PTWC)  | 17.67                 |             |
+| meta-llama/Llama-3.2-3B-Instruct    | BF16              | 12.67                 |             |
+| meta-llama/Llama-3.2-3B-Instruct    | INT4 (QAT + LoRA) | 13.00                 | 39%         |
+| meta-llama/Llama-3.2-3B-Instruct    | INT4 (best PTWC)  | 13.22                 |             |
+| meta-llama/Meta-Llama-3-8B-Instruct | BF16              | 10.22                 |             |
+| meta-llama/Meta-Llama-3-8B-Instruct | INT4 (QAT + LoRA) | 10.30                 | 62%         |
+| meta-llama/Meta-Llama-3-8B-Instruct | INT4 (best PTWC)  | 10.45                 |             |
+| microsoft/phi3.5-mini-instruct      | BF16              | 10.00                 |             |
+| microsoft/phi3.5-mini-instruct      | INT4 (QAT + LoRA) | 10.53                 | 37%         |
+| microsoft/phi3.5-mini-instruct      | INT4 (best PTWC)  | 10.71                 |             |
+|                                     |                   | Average               | 46%         |
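
As a reading aid for the updated table (an assumption based on the `Where:` legend this hunk extends, not something stated in the diff itself): the `Improvement` column appears to be the share of the perplexity gap between the best PTWC result and BF16 that QAT + LoRA closes. A quick check against the `google/gemma-2-2b-it` row:

```python
# Hypothetical reconstruction of the "Improvement" metric from the rounded table values.
ppl_bf16, ppl_qat_lora, ppl_ptwc = 15.02, 15.09, 15.80
improvement = (ppl_ptwc - ppl_qat_lora) / (ppl_ptwc - ppl_bf16)
print(f"{improvement:.0%}")  # 91%, matching the table; other rows may differ slightly due to rounding
```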

nncf/torch/quantization/quantize_functions.py

Lines changed: 6 additions & 13 deletions
@@ -126,23 +126,19 @@ def forward(ctx, input_, input_shape, scale, level_low, level_high, levels):
         input_low = torch.where(scale > 0, -scale, -scale / level_low * level_high)
         # 15/8 * scale or (2-1/8) * scale
         input_range = torch.abs((2 + 1 / level_low) * scale)
-
-        if input_.dtype in [torch.bfloat16, torch.float16]:
-            input_low = input_low.type(input_.dtype)
-            input_range = input_range.type(input_.dtype)
-
+        dtype = input_.dtype
         original_shape = input_.shape
         input_ = input_.reshape(input_shape)
 
-        output = RQ.Quantize_forward(input_, input_low, input_range, levels)
+        output = RQ.Quantize_forward(input_.type(torch.float32), input_low, input_range, levels)
 
         ctx.save_for_backward(input_, input_low, input_range)
         ctx.level_low = level_low
         ctx.level_high = level_high
         ctx.levels = levels
 
         output = output.reshape(original_shape)
-        return output
+        return output.type(dtype)
 
     @staticmethod
     def backward(ctx, grad_output):
@@ -168,14 +164,11 @@ def backward(ctx, grad_output):
 class QuantizeAsymmetricTorch(torch.autograd.Function):
     @staticmethod
     def forward(ctx, input_, input_shape, input_low, input_range, level_low, level_high, levels):
-        if input_.dtype in [torch.bfloat16, torch.float16]:
-            input_low = input_low.type(input_.dtype)
-            input_range = input_range.type(input_.dtype)
-
+        dtype = input_.dtype
         original_shape = input_.shape
         input_ = input_.reshape(input_shape)
 
-        output = RQ.Quantize_forward(input_, input_low, input_range, levels)
+        output = RQ.Quantize_forward(input_.type(torch.float32), input_low, input_range, levels)
 
         # Save tensors for backward pass
         ctx.save_for_backward(input_, input_low, input_range)
@@ -184,7 +177,7 @@ def forward(ctx, input_, input_shape, input_low, input_range, level_low, level_h
         ctx.levels = levels
 
         output = output.reshape(original_shape)
-        return output
+        return output.type(dtype)
 
     @staticmethod
     def backward(ctx, grad_output):
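
The comment `# 15/8 * scale or (2-1/8) * scale` in the first hunk above can be sanity-checked in isolation. The snippet below uses illustrative values and is not part of the PR; it reproduces the symmetric range computation for signed INT4, where `level_low = -8` and `level_high = 7`:

```python
import torch

scale = torch.tensor([2.0])
level_low, level_high = -8, 7

# Same formulas as in the diff above
input_low = torch.where(scale > 0, -scale, -scale / level_low * level_high)
input_range = torch.abs((2 + 1 / level_low) * scale)

assert torch.equal(input_low, torch.tensor([-2.0]))            # input_low = -scale
assert torch.equal(input_range, torch.tensor([2.0 * 15 / 8]))  # (2 - 1/8) * scale
print(input_low + input_range)  # tensor([1.7500]) == 7/8 * scale, the top of the quantization range
```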

tests/cross_fw/examples/example_scope.json

Lines changed: 2 additions & 2 deletions
@@ -282,8 +282,8 @@
         "cpu": "Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz",
         "accuracy_tolerance": 0.1,
         "accuracy_metrics": {
-            "perplexity_diff_torch": 0.6,
-            "best_ov_perplexity": 35.1
+            "perplexity_diff_torch": 0.75,
+            "best_ov_perplexity": 34.94
         }
     },
     "quantization_aware_training_tensorflow_mobilenet_v2": {

tests/torch2/function_hook/quantization/strip/test_strip_dequantize.py

Lines changed: 32 additions & 17 deletions
@@ -12,19 +12,24 @@
 from dataclasses import dataclass
 from typing import Any
 
+import openvino as ov
 import pytest
 import torch
+from openvino._pyopenvino.properties.hint import inference_precision
+from openvino.tools.ovc import convert_model
 from pytest_mock import MockerFixture
 from torch import nn
 
 import nncf
 import nncf.torch
 from nncf.common.quantization.structs import QuantizationScheme
+from nncf.openvino.optimized_functions.models import _compile_ov_model
 from nncf.parameters import CompressWeightsMode
 from nncf.parameters import StripFormat
 from nncf.torch.function_hook.wrapper import get_hook_storage
 from nncf.torch.quantization.layers import AsymmetricLoraQuantizer
 from nncf.torch.quantization.layers import BaseQuantizer
+from nncf.torch.quantization.layers import BaseWeightsDecompressor
 from nncf.torch.quantization.layers import INT4AsymmetricWeightsDecompressor as INT4AsymDQ
 from nncf.torch.quantization.layers import INT4SymmetricWeightsDecompressor as INT4SymDQ
 from nncf.torch.quantization.layers import INT8AsymmetricWeightsDecompressor as INT8AsymDQ
@@ -53,7 +58,8 @@ class ParamStripLora:
     mode: CompressWeightsMode
     decompressor_class: type
     torch_dtype: torch.dtype
-    atol: float
+    torch_atol: float
+    ov_atol: float
     weight_dtype: torch.dtype
 
     def __str__(self) -> str:
@@ -76,17 +82,14 @@ def num_call_pack_weight(self) -> int:
 @pytest.mark.parametrize(
     ("param"),
     (
-        ParamStripLora(CompressWeightsMode.INT4_ASYM, INT4AsymDQ, torch.float32, 1e-3, torch.uint8),
-        ParamStripLora(CompressWeightsMode.INT4_ASYM, INT4AsymDQ, torch.float16, 1e-8, torch.uint8),
-        ParamStripLora(CompressWeightsMode.INT4_ASYM, INT4AsymDQ, torch.bfloat16, 1e-2, torch.uint8),
-        ParamStripLora(CompressWeightsMode.INT4_SYM, INT4SymDQ, torch.float32, 1e-3, torch.uint8),
-        # torch.compile introduces bigger diff for sym
-        ParamStripLora(CompressWeightsMode.INT4_SYM, INT4SymDQ, torch.float16, 1e-3, torch.uint8),
-        ParamStripLora(CompressWeightsMode.INT4_SYM, INT4SymDQ, torch.bfloat16, 1e-2, torch.uint8),
-        # int8 uses per-channel vs int4 group-wise
-        ParamStripLora(CompressWeightsMode.INT8_SYM, INT8SymDQ, torch.bfloat16, 1e-2, torch.int8),
-        # int8 uses per-channel vs int4 group-wise
-        ParamStripLora(CompressWeightsMode.INT8_ASYM, INT8AsymDQ, torch.bfloat16, 1e-8, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT4_ASYM, INT4AsymDQ, torch.float32, 1e-3, 1e-3, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT4_ASYM, INT4AsymDQ, torch.float16, 1e-3, 1e-3, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT4_ASYM, INT4AsymDQ, torch.bfloat16, 1e-8, 1e-1, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT4_SYM, INT4SymDQ, torch.float32, 1e-3, 1e-3, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT4_SYM, INT4SymDQ, torch.float16, 1e-8, 1e-3, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT4_SYM, INT4SymDQ, torch.bfloat16, 1e-8, 1e-2, torch.uint8),
+        ParamStripLora(CompressWeightsMode.INT8_SYM, INT8SymDQ, torch.bfloat16, 1e-2, 1e-3, torch.int8),
+        ParamStripLora(CompressWeightsMode.INT8_ASYM, INT8AsymDQ, torch.bfloat16, 1e-8, 1e-3, torch.uint8),
     ),
     ids=str,
 )
@@ -114,12 +117,24 @@ def test_nncf_strip_lora_model(param: ParamStripLora, mocker: MockerFixture):
         compressed_model, do_copy=True, strip_format=StripFormat.DQ, example_input=example_input
     )
     stripped_output = strip_compressed_model(example_input)
-
     assert pack_weight_spy.call_count == param.num_call_pack_weight
     assert strip_compressed_model.linear.weight.dtype == param.weight_dtype
 
     check_compression_modules(strip_compressed_model, param.decompressor_class)
-    assert torch.allclose(compressed_output, stripped_output, atol=param.atol)
+    assert torch.allclose(compressed_output, stripped_output, atol=param.torch_atol)
+
+    example_input = example_input.type(torch.float32)
+    hook_storage = get_hook_storage(strip_compressed_model)
+    for _, module in hook_storage.named_hooks():
+        if isinstance(module, BaseWeightsDecompressor):
+            module.result_dtype = torch.float32
+    ov_model = convert_model(strip_compressed_model, example_input=example_input)
+    compiled_model = _compile_ov_model(ov_model, device_name="CPU", config={inference_precision(): ov.Type.f32})
+    infer_request = compiled_model.create_infer_request()
+    res = infer_request.infer(example_input)
+    out_name = compiled_model.outputs[0]
+    ov_output = torch.from_numpy(res[out_name])
+    assert torch.allclose(compressed_output.type(torch.float32), ov_output, atol=param.ov_atol)
 
 
 SIGNED_WEIGHT_SAMPLE = [-1.0, -0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75]
@@ -155,7 +170,7 @@ def test_sym_fq_to_decompressor(param: ParamSymFQ):
 
     scale_shape = (1, 1)
     scale = torch.tensor(SCALE_SAMPLE)
-    scale = scale.expand(scale_shape).to(torch.float16)
+    scale = scale.expand(scale_shape)
 
     # reference scale calculates with this formula:
     # levels = (2 ** num_bits)
@@ -246,10 +261,10 @@ def test_asym_fq_to_decompressor(param: ParamAsymFQ):
     ref_zero_point = ref_zero_point.expand(scale_shape).to(torch.uint8)
 
     input_low = torch.tensor(INPUT_LOW_SAMPLE)
-    input_low = input_low.expand(scale_shape).to(param.torch_dtype)
+    input_low = input_low.expand(scale_shape)
 
     input_range = torch.tensor(INPUT_RANGE_SAMPLE)
-    input_range = input_range.expand(scale_shape).to(param.torch_dtype)
+    input_range = input_range.expand(scale_shape)
 
     qspec = PTQuantizerSpec(
         num_bits=param.num_bits,

tests/torch2/function_hook/quantization/test_fq_lora.py

Lines changed: 4 additions & 3 deletions
@@ -166,9 +166,10 @@ def test_fq_lora_tuning(tmp_path, mode, backup_mode, compression_kwargs, ref_num
     tuned_vs_stripped = vm.calculate_similarity(tuned_output, stripped_output)
     tuned_vs_stripped_ov = vm.calculate_similarity(tuned_output, stripped_ov_output)
 
-    atol = 0.03 if mode == nncf.CompressWeightsMode.INT4_SYM else 0.01  # torch.compile introduces bigger diff
-    assert torch.allclose(tuned_vs_stripped, vm.validation_ref, atol=atol)
-    assert torch.allclose(tuned_vs_stripped_ov, vm.validation_ref, atol=atol)
+    # torch.compiled version of FQ+LoRA leads to a small error
+    atol = 1e-2 if mode == nncf.CompressWeightsMode.INT4_SYM else 1e-8
+    assert torch.allclose(tuned_vs_stripped, vm.validation_ref, atol)
+    assert torch.allclose(tuned_vs_stripped_ov, vm.validation_ref, atol)
 
 
 def test_checkpoint_loading(tmp_path: Path, use_cuda: bool):
