Aligned bf16 tuning vs f32 inference for 4bit compression #3493
Conversation
```diff
@@ -155,7 +170,7 @@ def test_sym_fq_to_decompressor(param: ParamSymFQ):

     scale_shape = (1, 1)
     scale = torch.tensor(SCALE_SAMPLE)
-    scale = scale.expand(scale_shape).to(torch.float16)
+    scale = scale.expand(scale_shape)
```
Doesn't influence the result, just more aligned with default precision in FQ (float32).
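For reference, a tiny standalone check (0.5 stands in for SCALE_SAMPLE, which is not shown here) confirming the point: `torch.tensor` already produces float32, the default precision used by FQ, so dropping the cast keeps the scale in that default precision.

```python
import torch

scale = torch.tensor(0.5)          # arbitrary value standing in for SCALE_SAMPLE
print(scale.dtype)                 # torch.float32 - PyTorch's default floating dtype
print(scale.expand((1, 1)).dtype)  # expand() keeps the dtype unchanged
```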
```python
ParamStripLora(CompressWeightsMode.INT8_SYM, INT8SymDQ, torch.bfloat16, 1e-2, 1e-3, torch.int8),
ParamStripLora(CompressWeightsMode.INT8_ASYM, INT8AsymDQ, torch.bfloat16, 1e-8, 1e-3, torch.uint8),
```
Note: it's expected that the ov_tol value is higher than the torch_tol value, since the OV model executes in f32 while the torch model has activations in bf16 or f16 during tuning. Even though ov_tol isn't very small, in a few cases it was larger before this PR.
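For context, a standalone illustration (not taken from the PR) of why the tolerance against the f32 OV model has to stay looser when tuning runs in bf16: a single bf16 round-trip of unit-scale values already introduces errors on the order of 1e-3 to 1e-2.

```python
import torch

x = torch.randn(10_000)                # f32 reference values
x_bf16 = x.to(torch.bfloat16).float()  # simulate computing/storing them in bf16
print((x - x_bf16).abs().max())        # roughly 1e-2 at the tails: bf16 has only a 7-bit mantissa
```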
I don't get the results from the performance/memory table. Either this PR slows down the reference and increases memory consumption, or the table was filled in incorrectly.
Yes, your understanding is correct. This is the price for accuracy.
About a ~4x slowdown for the compiled forward version in the symmetric case (or a ~2x slowdown for the non-compiled one) and a 10% increase in memory consumption - is that acceptable?
It's still 1.4x faster than the non-compiled version before the PR and 2.9x faster than the non-compiled version with the PR. Do you have any suggestions for improvement?
Updated the description with the total time required for 1 epoch before and with the PR.
Changes
Always cast input to float32 inside FQ + LoRA.
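A minimal sketch of the idea, using a hypothetical module (the names and the exact quantization formula are illustrative, not the NNCF implementation): regardless of the bf16/f16 dtype of the tuned model, the input is cast to f32, the fake-quantize and LoRA math runs in f32, and the result is cast back to the activation dtype, mirroring the f32 execution of the exported OV model.

```python
import torch


class FQLoraF32Linear(torch.nn.Module):
    """Sketch: linear layer whose fake-quantize + LoRA path always runs in float32."""

    def __init__(self, weight: torch.Tensor, scale: torch.Tensor, rank: int = 8):
        super().__init__()
        self.weight = torch.nn.Parameter(weight.float())
        self.scale = torch.nn.Parameter(scale.float())
        out_features, in_features = weight.shape
        self.lora_a = torch.nn.Parameter(torch.zeros(rank, in_features))
        self.lora_b = torch.nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x32 = x.to(torch.float32)                         # always cast the input to f32
        # symmetric 4-bit fake quantization of the weight, computed in f32
        q = torch.clamp(torch.round(self.weight / self.scale), -8, 7)
        w_fq = q * self.scale
        lora = self.lora_b @ self.lora_a                  # low-rank correction, also in f32
        y = torch.nn.functional.linear(x32, w_fq + lora)
        return y.to(input_dtype)                          # back to bf16/f16 for the rest of the model


# usage: bf16 activations in, bf16 activations out, but all FQ + LoRA math is f32
w = torch.randn(32, 16)
layer = FQLoraF32Linear(w, scale=w.abs().amax(dim=1, keepdim=True) / 7)
out = layer(torch.randn(2, 16, dtype=torch.bfloat16))
print(out.dtype)  # torch.bfloat16
```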
Benchmark results with the new schema were collected on https://github.com/ljaljushkin/nncf_pytorch/tree/nl/ref_benchmark, with small modifications from @nikita-malininn's branch https://github.com/nikita-malininn/nncf/tree/nm/ref_benchmark:
There's an overhead on forward, but it's offset by using torch.compile.
Per epoch, there's a 1-6% overhead on RTX; on A100, depending on the setup, there can even be a 6% speedup or a 3% slowdown.
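As a rough illustration of that mitigation (the model and shapes below are arbitrary, not the benchmark setup), torch.compile lets the compiler backend fuse pointwise ops, including the extra f32 casts, into the surrounding kernels:

```python
import torch

# bf16 model as used during tuning; compiling it hides most of the cast overhead
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()).bfloat16()
compiled = torch.compile(model)  # PyTorch 2.x
x = torch.randn(8, 1024, dtype=torch.bfloat16)
y = compiled(x)
```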


Reason for changes
Minimize the disparity in precision between the Torch model and its exported OV equivalent.
Full alignment would be very inefficient, so this is a compromise: align accuracy with minimal overhead on the forward pass.
The e2e test on facebook/opt-125m proves that the output is now the same within the default absolute tolerance (1e-8) instead of 1e-2: https://github.com/openvinotoolkit/nncf/pull/3493/files#diff-7a4f90fe4f07d515df355d6fb618112d7d3fe88eb8ba777e502c695a7c715010R170
Previously, there were 3 problematic models with a significant difference in accuracy; now they are much better aligned:

Related tickets
166195
Tests
test examples - https://github.com/openvinotoolkit/nncf/actions/runs/15024278726/job/42221028011