Replies: 1 comment
@ljaljushkin, please take a look.
I wanted to float an idea for an experimental training-time compression feature that could sit alongside existing PTQ and QAT workflows in NNCF.
The core idea is self-compression: instead of manually configuring mixed precision, sparsity schedules, or multi-stage compression pipelines, the model learns its own optimal bit-widths and channel usage during training via gradients.
**What this adds (at a high level)**
**Learnable bit-widths**
Introduce a differentiable quantizer where the bit-depth ($b$) is a trainable parameter. This gives users an automated alternative to hand-crafted mixed-precision setups.
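To make this concrete, here is a minimal sketch of what such a quantizer could look like (my illustration, not existing NNCF code): the bit-width is stored as a continuous parameter, and a straight-through estimator lets gradients reach both the weights and the bit-width.

```python
import torch
from torch import nn


class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; pass gradients through unchanged (straight-through)."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class DifferentiableQuantizer(nn.Module):
    """Uniform fake-quantizer whose bit-width is itself a trainable parameter."""

    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(float(init_bits)))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # Clamp to keep this sketch numerically safe; a fuller (per-channel)
        # version could let bit-widths shrink toward zero, eliminating channels.
        bits = RoundSTE.apply(self.bits.clamp(min=2.0, max=8.0))
        q_max = 2.0 ** (bits - 1.0) - 1.0               # signed symmetric range
        scale = w.abs().max().clamp(min=1e-8) / q_max
        w_q = torch.clamp(RoundSTE.apply(w / scale), -q_max, q_max)
        return w_q * scale                              # fake-quantized weights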
**Single unified compression objective**
Add a `SelfCompressionLoss` term that penalizes total network bit-count. This naturally pushes the optimizer toward both quantization and pruning in one pass (a sketch of such a loss follows after this list).

**A modern alternative to deprecated structural pruning**
Channel-level elimination happens implicitly through gradients rather than explicit pruning schedules, which feels more aligned with current training practices.
**Size-targeted optimization**
Users can control aggressiveness via a single size/memory penalty ($\gamma$), letting the model discover the best weight-to-bit tradeoff on its own.
**Minimal disruption to existing workflows**
This could be opt-in, experimental, and designed to coexist cleanly with PTQ and QAT.
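As a sketch of the unified objective and the single $\gamma$ knob mentioned above (again illustrative; `SelfCompressionLoss` is a proposed name, not an existing NNCF class), the loss can simply be the total bit-count of all quantized weight tensors:

```python
import torch
from torch import nn


class SelfCompressionLoss(nn.Module):
    """Penalty on the total bit-count of all quantized weight tensors.

    `quantized_weights` is a list of (DifferentiableQuantizer, weight) pairs,
    reusing the quantizer sketch above.  In a per-channel variant, channels
    whose learned bit-width reaches zero would be pruned implicitly.
    """

    def __init__(self, quantized_weights):
        super().__init__()
        self.quantized_weights = quantized_weights

    def forward(self) -> torch.Tensor:
        return sum(
            q.bits.clamp(min=0.0) * w.numel()
            for q, w in self.quantized_weights
        )
```

The whole compression objective then reduces to `loss = task_loss + gamma * compression_loss()`, with $\gamma$ as the single size/memory penalty.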
Implementation-wise, this could live as a new `DifferentiableQuantizer` and a corresponding `CompressionAlgorithm`. From a user perspective, this becomes a more “set-and-forget” option:
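For illustration only (wiring together the two sketches above, not a proposed final API), a training step gains just one extra loss term:

```python
import torch
import torch.nn.functional as F
from torch import nn

# Reusing the DifferentiableQuantizer / SelfCompressionLoss sketches above.
layer = nn.Linear(256, 128)
quantizer = DifferentiableQuantizer(init_bits=8.0)
compression_loss = SelfCompressionLoss([(quantizer, layer.weight)])

optimizer = torch.optim.Adam(
    list(layer.parameters()) + list(quantizer.parameters()), lr=1e-3
)
gamma = 1e-6  # the single size/memory penalty the user controls

x, target = torch.randn(32, 256), torch.randn(32, 128)

optimizer.zero_grad()
out = F.linear(x, quantizer(layer.weight), layer.bias)       # quantize on the fly
loss = F.mse_loss(out, target) + gamma * compression_loss()  # task + size penalty
loss.backward()                                               # gradients also reach the bit-width
optimizer.step()
```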
Does this sound like something that could be worked on?