[Compressors] Refactor compressors, remove sparsity & CompressedLinear#610
Conversation
The quality checks have failed. Please run …
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 071b11b to e41e147 (compare)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Force-pushed from 215e6cf to 9e7435b (compare)
brian-dellabetta left a comment:

I discussed this in a screenshare with Kyle. I like the changes and the code looks a lot cleaner, but there's a lot in this PR. It would be good to run e2e and example tests.
Approving with a handful of nits.
> This method iterates over the dense_weight_generator and
> updates the corresponding weights in the model. If a parameter
> name does not exist in the model, it will be skipped.
> The hook automatically removes itself after decompression, allowing the model
nit -- this tripped me up a bit when reviewing. I don't see any code in the hook to do this, but it does live on `.decompress_model`.
Suggested change:
- The hook automatically removes itself after decompression, allowing the model
+ The hook is automatically removed after decompression, allowing the model
This is called out in the code comment:
`# decompress_model already removes the hook via remove_decompression_hook`

> return state_dict
>
> @classmethod
> def match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
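For context on the hook discussion above, here is a minimal, hypothetical sketch (plain PyTorch, not the PR's actual code; `ToyModel` and the hook body are placeholders) of a decompression pre-forward hook that is removed after the first forward pass:

```python
import torch


class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.decompressed = False  # stand-in for real compressed state

    def forward(self, x):
        return self.linear(x)


def register_decompression_hook(model: torch.nn.Module):
    def hook(module, args):
        # stand-in for real decompression work, which would replace
        # compressed parameters with dense ones here
        module.decompressed = True
        handle.remove()  # remove the hook after the first forward pass

    handle = model.register_forward_pre_hook(hook)
    return handle
```

After the first call to `model(x)`, the hook no longer fires, so later forward passes pay no decompression overhead.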
nit -- can we prefix with `is_` to indicate that it works a little differently than our other `match_` functions and returns a bool?
Suggested change:
- def match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
+ def is_match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
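To illustrate the naming nit, a hypothetical sketch of an `is_match` boolean classmethod used to pick a compressor for a (module type, scheme) pair. `QuantizationScheme` and the compressor class here are stand-ins, not the library's actual API:

```python
from typing import Optional


class QuantizationScheme:
    """Stand-in for the library's quantization scheme object."""

    def __init__(self, weights: Optional[str] = None):
        self.weights = weights


class PackedQuantizationCompressor:
    @classmethod
    def is_match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
        # returns a bool, unlike match_* helpers that yield matched names
        return module_type.__name__ == "Linear" and scheme.weights is not None


def find_compressor(module_type: type, scheme: QuantizationScheme):
    # try each candidate compressor and return the first that matches
    candidates = [PackedQuantizationCompressor]
    for compressor in candidates:
        if compressor.is_match(module_type, scheme):
            return compressor
    return None
```

The `is_` prefix makes it clear at the call site that the method is a predicate rather than a lookup.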
> # in compressed mode, the weight is already compressed and quantized so we don't
> # need to run fake quantization
> # TODO: remove this line, as this is already guarded by `set_forward_quantized`
> # force zero points during initialization
> force_zero_point = config.quantization_status != QuantizationStatus.COMPRESSED
> # TODO: remove zero points from initialization
you had this as a TODO on another line. I think this is better served as a good first issue than a TODO.
this, for example, is a clear case where `git mv` should be used; it is helpful to retain the git history as much as possible.
rahul-tuli left a comment:

The diff looks good, pending merge conflicts. However, I agree that this is too big a change to review in one PR.
> return state_dict
>
> @classmethod
> def match(cls, module_type: type, scheme: QuantizationScheme) -> bool:
nit: maybe rename to `is_match`?
Purpose
Refactor compressors, remove sparsity support and `CompressedLinear`.

Corequisites
Entrypoints
Compressed Tensors has the following entrypoints into compression:
1. HF: `CompressedTensorsConfig`, representing a model config. Used when loading a compressed model using transformers for inference. Entrypoints: `ModelCompressor.quantization_config`, `ModelCompressor.compress_model`, `ModelCompressor.decompress_model`
2. vLLM: `dict[str, Any]` representing a model config. The layer is decompressed using ours, then recompressed using `ops.cutlass_sparse_compress`, to be recompressed later using a better format. Entrypoints: `ModelCompressor.sparsity_config.format`, `ModelCompressor.sparsity_compressor.decompress_weight`
3. `ModelCompressor.compress_model`, `ModelCompressor.update_config`

This PR removes support for (2), as vLLM will no longer support 2:4 sparsity in the future. The functionality of the other two entrypoints remains unchanged.
Changes
- `Compressor.compress_module()` if the format is known, or `compress_module()` if the format should be inferred
- Removed `CompressedLinear`; `ModelCompressor` adds a `pre_forward` hook to the model which triggers decompression on the first forward pass
- `QuantizationStatus.DECOMPRESSED` is used for efficient inference
- Added `QuantizationStatus.DECOMPRESSED`, which defines what was previously implicitly defined: the state where `CompressedLinear` had decompressed itself and runs forward passes without any weight qdq
  - Note: `CompressedLinear` was actually broken, in that it did not actually perform activation quantization
  - This differs from `QuantizationStatus.FROZEN` in that `FROZEN` will still perform weight qdq during the forward pass (in order to create correct emulation), but `DECOMPRESSED` does not need to perform additional weight qdq (because the weight has already been qdq'ed permanently). See the documentation for `QuantizationStatus`

Testing
- `ModelCompressor`

Follow-ups
- `dequantize()` method implementation on transformers
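The `FROZEN` vs `DECOMPRESSED` distinction described in the changes can be sketched with a toy example (illustrative only; `fake_quantize` here is a crude stand-in for the library's real quantize-dequantize logic, and the enum is reduced to the two relevant states):

```python
from enum import Enum


class QuantizationStatus(Enum):
    FROZEN = "frozen"
    DECOMPRESSED = "decompressed"


def fake_quantize(w: float, scale: float = 0.5) -> float:
    # toy symmetric qdq: snap the weight to a grid of step `scale`,
    # then dequantize back to a float
    return round(w / scale) * scale


def forward_weight(w: float, status: QuantizationStatus) -> float:
    if status is QuantizationStatus.FROZEN:
        # FROZEN: apply qdq on every forward pass to emulate quantization
        return fake_quantize(w)
    # DECOMPRESSED: the stored weight was already qdq'ed permanently
    # during decompression, so no extra qdq is needed
    return w


print(forward_weight(0.74, QuantizationStatus.FROZEN))        # 0.5 (qdq applied)
print(forward_weight(0.74, QuantizationStatus.DECOMPRESSED))  # 0.74 (no extra qdq)
```

Both paths produce quantized numerics; `DECOMPRESSED` simply bakes the qdq into the stored weight once instead of recomputing it per forward pass.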