[Compressors] Refactor compressors, remove sparsity & CompressedLinear#610
Conversation
The quality checks have failed. Please run …
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 071b11b to e41e147 (compare)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed from 215e6cf to 9e7435b (compare)
brian-dellabetta left a comment:

I discussed this in a screenshare with Kyle. I like the changes and the code looks a lot cleaner, but there's a lot in this PR. It would be good to run e2e and example tests.
Approving with a handful of nits.
> This method iterates over the dense_weight_generator and
> updates the corresponding weights in the model. If a parameter
> name does not exist in the model, it will be skipped.
> The hook automatically removes itself after decompression, allowing the model
nit -- this tripped me up a bit when reviewing. I don't see any code in the hook to do this, but it does live on `.decompress_model`.
Suggested change:
- The hook automatically removes itself after decompression, allowing the model
+ The hook is automatically removed after decompression, allowing the model
This is called out in the code comment:
`# decompress_model already removes the hook via remove_decompression_hook`

> return state_dict
>
> @classmethod
> def match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
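For context on the hook discussion above, here is a minimal, hypothetical sketch (plain PyTorch, not the PR's actual code; `ToyModel` and the hook body are placeholders) of a decompression pre-forward hook that is removed after the first forward pass:

```python
import torch


class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.decompressed = False  # stand-in for real compressed state

    def forward(self, x):
        return self.linear(x)


def register_decompression_hook(model: torch.nn.Module):
    def hook(module, args):
        # stand-in for real decompression work, which would replace
        # compressed parameters with dense ones here
        module.decompressed = True
        handle.remove()  # remove the hook after the first forward pass

    handle = model.register_forward_pre_hook(hook)
    return handle
```

After the first call to `model(x)`, the hook no longer fires, so later forward passes pay no decompression overhead.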
nit -- can we prefix with `is_` to indicate that it works a little differently than our other `match_` functions and returns a bool?
Suggested change:
- def match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
+ def is_match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
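To illustrate the naming nit, a hypothetical sketch of an `is_match` boolean classmethod used to pick a compressor for a (module type, scheme) pair. `QuantizationScheme` and the compressor class here are stand-ins, not the library's actual API:

```python
from typing import Optional


class QuantizationScheme:
    """Stand-in for the library's quantization scheme object."""

    def __init__(self, weights: Optional[str] = None):
        self.weights = weights


class PackedQuantizationCompressor:
    @classmethod
    def is_match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
        # returns a bool, unlike match_* helpers that yield matched names
        return module_type.__name__ == "Linear" and scheme.weights is not None


def find_compressor(module_type: type, scheme: QuantizationScheme):
    # try each candidate compressor and return the first that matches
    candidates = [PackedQuantizationCompressor]
    for compressor in candidates:
        if compressor.is_match(module_type, scheme):
            return compressor
    return None
```

The `is_` prefix makes it clear at the call site that the method is a predicate rather than a lookup.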
> # in compressed mode, the weight is already compressed and quantized so we don't
> # need to run fake quantization
> # TODO: remove this line, as this is already guarded by `set_forward_quantized`
> # force zero points during initialization
> force_zero_point = config.quantization_status != QuantizationStatus.COMPRESSED
> # TODO: remove zero points from initialization
you had this as a TODO on another line. I think this is better served as a good first issue than a TODO.
this, for example, is a clear case where `git mv` should be used; it is helpful to retain the git history as much as possible.
rahul-tuli left a comment:

The diff looks good, pending merge conflicts. However, I agree that this is too big a change to review in one PR.
> return state_dict
>
> @classmethod
> def match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
nit: maybe rename to `is_match`?
Purpose
Refactor compressors, remove sparsity support and `CompressedLinear`.

Corequisites
Entrypoints
Compressed Tensors has the following entrypoints into compression:
1. HF: `CompressedTensorsConfig`, representing a model config. Used when loading a compressed model using transformers for inference. Entrypoints: `ModelCompressor.quantization_config`, `ModelCompressor.compress_model`, `ModelCompressor.decompress_model`
2. vLLM: `dict[str, Any]` representing a model config. The layer is decompressed using ours, then recompressed using `ops.cutlass_sparse_compress`, to be recompressed later using a better format. Entrypoints: `ModelCompressor.sparsity_config.format`, `ModelCompressor.sparsity_compressor.decompress_weight`
3. `ModelCompressor.compress_model`, `ModelCompressor.update_config`

This PR removes support for (2), as vLLM will no longer support 2:4 sparsity in the future. The functionality of the other two entrypoints remains unchanged.
Changes
- `Compressor.compress_module()` if the format is known, or `compress_module()` if the format should be inferred
- Removed `CompressedLinear`; `ModelCompressor` adds a `pre_forward` hook to the model which triggers decompression on the first forward pass
- `QuantizationStatus.DECOMPRESSED` is used for efficient inference
- Added `QuantizationStatus.DECOMPRESSED`, which defines what was previously implicitly defined: the state where `CompressedLinear` had decompressed itself and runs forward passes without any weight qdq
  - Note: `CompressedLinear` was actually broken, in that it did not actually perform activation quantization
  - This differs from `QuantizationStatus.FROZEN` in that `FROZEN` will still perform weight qdq during the forward pass (in order to create correct emulation), but `DECOMPRESSED` does not need to perform additional weight qdq (because the weight has already been qdq'ed permanently). See the documentation for `QuantizationStatus`

Testing
- `ModelCompressor`

Follow-ups
- `dequantize()` method implementation on transformers
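The `FROZEN` vs `DECOMPRESSED` distinction described in the changes can be sketched with a toy example (illustrative only; `fake_quantize` here is a crude stand-in for the library's real quantize-dequantize logic, and the enum is reduced to the two relevant states):

```python
from enum import Enum


class QuantizationStatus(Enum):
    FROZEN = "frozen"
    DECOMPRESSED = "decompressed"


def fake_quantize(w: float, scale: float = 0.5) -> float:
    # toy symmetric qdq: snap the weight to a grid of step `scale`,
    # then dequantize back to a float
    return round(w / scale) * scale


def forward_weight(w: float, status: QuantizationStatus) -> float:
    if status is QuantizationStatus.FROZEN:
        # FROZEN: apply qdq on every forward pass to emulate quantization
        return fake_quantize(w)
    # DECOMPRESSED: the stored weight was already qdq'ed permanently
    # during decompression, so no extra qdq is needed
    return w


print(forward_weight(0.74, QuantizationStatus.FROZEN))        # 0.5 (qdq applied)
print(forward_weight(0.74, QuantizationStatus.DECOMPRESSED))  # 0.74 (no extra qdq)
```

Both paths produce quantized numerics; `DECOMPRESSED` simply bakes the qdq into the stored weight once instead of recomputing it per forward pass.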