feat: AutoAWQ to compressed-tensors conversion tool (#2440)
NJX-njx wants to merge 1 commit into vllm-project:main from
Conversation
Add a conversion module that converts AutoAWQ quantized checkpoints to the compressed-tensors pack_quantized format, enabling direct loading in vLLM.

Key features:

- Handles AutoAWQ's interleaved int4 packing order [0, 2, 4, 6, 1, 3, 5, 7] and repacks weights into compressed-tensors sequential order
- Converts tensor naming: qweight → weight_packed, scales → weight_scale, qzeros → weight_zero_point
- Generates a proper quantization_config in config.json with compressed-tensors metadata
- Supports multi-shard models with proper index file rewriting
- Provides both a Python API and a CLI entry point

Closes vllm-project#2087
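For readers unfamiliar with the two layouts, here is a minimal stdlib-only sketch of the repacking for a single packed int32 word. The helper names are illustrative (the PR's actual converter vectorizes this with torch ops), and it assumes nibble i of an AWQ word holds logical element [0, 2, 4, 6, 1, 3, 5, 7][i], while compressed-tensors stores element j in bits 4j..4j+3:

```python
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def unpack_awq_word(word: int) -> list[int]:
    """Extract 8 int4 values from one AWQ-packed int32 word.

    Nibble i (lowest bits first) holds logical element AWQ_ORDER[i].
    """
    nibbles = [(word >> (4 * i)) & 0xF for i in range(8)]
    values = [0] * 8
    for i, logical in enumerate(AWQ_ORDER):
        values[logical] = nibbles[i]
    return values


def pack_ct_word(values: list[int]) -> int:
    """Pack 8 int4 values sequentially: element j goes in bits 4j..4j+3."""
    word = 0
    for j, v in enumerate(values):
        word |= (v & 0xF) << (4 * j)
    return word


def repack_awq_to_ct_word(word: int) -> int:
    """AWQ-packed int32 word → compressed-tensors-packed int32 word."""
    return pack_ct_word(unpack_awq_word(word))
```

For example, the logical values 0..7 packed in AWQ order give the word 0x75316420, and repacking yields the sequential word 0x76543210.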
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new conversion tool to transform AutoAWQ quantized models into the compressed-tensors format.
Code Review
This pull request introduces a valuable conversion tool to transform AutoAWQ quantized models into the compressed-tensors format. The implementation is well-structured, with clear separation of concerns for packing/unpacking logic, key renaming, and the main conversion workflow. The inclusion of a CLI entry point and comprehensive unit tests, including a reference implementation for AWQ packing, is commendable.
My review focuses on improving the robustness of file handling, enhancing code clarity, and increasing test coverage. Specifically, I've suggested a more robust method for copying auxiliary model files, simplified a redundant conditional block in the tensor conversion loop, and recommended adding a check for zero-point correctness in the integration test.
```python
suffix = key[len(matched_prefix):]

if suffix == ".qweight":
    converted[f"{matched_prefix}.weight_packed"] = (
        _repack_awq_to_ct(tensor)
    )
elif suffix == ".scales":
    converted[f"{matched_prefix}.weight_scale"] = tensor
elif suffix == ".qzeros":
    # Zero-points are also packed with the AWQ interleave.
    zp = _unpack_awq_int4(tensor)
    converted[f"{matched_prefix}.weight_zero_point"] = zp
elif suffix == ".bias":
    converted[key] = tensor
else:
    converted[key] = tensor
```
This section for handling different tensor suffixes can be simplified. The `elif suffix == ".bias":` block is redundant because its logic is identical to the `else:` block that follows. Combining them will make the code more concise and easier to read.
```python
suffix = key[len(matched_prefix):]

if suffix == ".qweight":
    converted[f"{matched_prefix}.weight_packed"] = (
        _repack_awq_to_ct(tensor)
    )
elif suffix == ".scales":
    converted[f"{matched_prefix}.weight_scale"] = tensor
elif suffix == ".qzeros":
    # Zero-points are also packed with the AWQ interleave.
    zp = _unpack_awq_int4(tensor)
    converted[f"{matched_prefix}.weight_zero_point"] = zp
else:
    # Pass through other parameters like bias.
    converted[key] = tensor
```

```python
_auxiliary_globs = [
    "generation_config.json",
    "special_tokens_map.json",
    "merges.txt",
]
for pattern in _auxiliary_globs:
    for src in model_path.glob(pattern):
        dst = output_path / src.name
        if not dst.exists():
            shutil.copy2(src, dst)
```
The current method of copying auxiliary files using a hardcoded list of globs is brittle. It may miss important files required for the model to load correctly, such as tokenizer.json or other tokenizer-related files not covered by save_pretrained. A more robust approach is to iterate through all files in the source directory and copy any that are not explicitly generated or modified by this script. This ensures a more complete and reliable model conversion.
```python
# ----- Copy any remaining auxiliary files -----
for src in model_path.glob("*"):
    if src.is_dir() or src.suffix == ".safetensors":
        continue
    dst = output_path / src.name
    if not dst.exists():
        shutil.copy2(src, dst)
```

```python
for i in range(8):
    ct_unpacked[:, i::8] = (ct_packed >> (i * 4)) & 0xF

torch.testing.assert_close(ct_unpacked, ground_truth["weights"])
```
The integration test verifies that the repacked weights are correct, but it's missing a similar verification for the zero points. Since zero points are also transformed (unpacked from the AWQ format), it's important to add an assertion to ensure they are correctly handled in the conversion process. This will improve the test's coverage and confidence in the conversion logic.
```python
torch.testing.assert_close(ct_unpacked, ground_truth["weights"])

# Verify zero-point values are correct after unpacking
zp_unpacked = f.get_tensor(
    "model.layers.0.self_attn.q_proj.weight_zero_point"
)
torch.testing.assert_close(zp_unpacked, ground_truth["zeros"])
```
Pull request overview
Adds a new conversion utility to transform AutoAWQ int4-packed checkpoints into the compressed-tensors pack_quantized format (including renaming tensors and writing updated quantization metadata) so the result can be loaded directly by vLLM.
Changes:

- Introduces llmcompressor.conversion.autoawq_to_ct with AWQ→CT int4 repacking, tensor key renaming, config.json rewriting, and optional safetensors index rewriting.
- Adds CLI entrypoint support via module execution and exports the converter from llmcompressor.conversion.
- Adds unit + integration tests for packing/unpacking, key renaming, and an end-to-end synthetic checkpoint conversion.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/llmcompressor/conversion/autoawq_to_ct.py | Core conversion implementation (packing, renaming, config + index rewriting, CLI parser). |
| src/llmcompressor/conversion/__init__.py | Exposes convert_autoawq_to_ct from the conversion package. |
| src/llmcompressor/conversion/__main__.py | Adds a python -m llmcompressor.conversion ... entrypoint that delegates to the converter CLI. |
| tests/llmcompressor/conversion/test_autoawq_to_ct.py | New tests covering packing correctness, key renaming, and a single-shard end-to-end conversion. |
```python
def _repack_awq_to_ct(packed_awq: torch.Tensor) -> torch.Tensor:
    """One-shot conversion: AWQ-packed int32 → CT-packed int32."""
    return _pack_ct_int4(_unpack_awq_int4(packed_awq))
```
_repack_awq_to_ct expands the packed int32 weights into a full int32 matrix in _unpack_awq_int4 (8× more elements) and then repacks. For large models this intermediate can be tens of GB and may OOM CPU RAM. Consider repacking by extracting/reordering nibbles within the packed int32 tensor (staying in the packed shape) or using a chunked/streaming approach to cap peak memory.
```python
def _repack_awq_to_ct(
    packed_awq: torch.Tensor,
    max_chunk_bytes: int = 256 * 1024 * 1024,
) -> torch.Tensor:
    """Convert AWQ-packed int32 → CT-packed int32 with bounded peak memory.

    The naive implementation would unpack the entire tensor to an 8× larger
    int32 matrix and then repack it. For large models this can require tens
    of GB of RAM. To avoid that, we process the tensor in row-wise chunks:
    each chunk is unpacked and repacked independently, and the intermediate
    is immediately discarded.

    :param packed_awq: AWQ-packed int32 tensor of shape ``(rows, cols_packed)``.
    :param max_chunk_bytes: Approximate upper bound on the size of the
        unpacked intermediate per chunk, in bytes.
    :return: CT-packed int32 tensor with the same shape as ``packed_awq``.
    """
    if packed_awq.dim() != 2:
        # Keep behavior simple and explicit: this helper is for 2D weight
        # matrices. If other shapes are needed, they should be reshaped by
        # the caller.
        raise ValueError(
            f"_repack_awq_to_ct expects a 2D tensor, got shape {tuple(packed_awq.shape)}"
        )
    rows, cols_packed = packed_awq.shape
    if rows == 0 or cols_packed == 0:
        return packed_awq.clone()
    # Each packed column expands to 8 int32 values in the unpacked matrix.
    cols_unpacked = cols_packed * 8
    bytes_per_row_unpacked = cols_unpacked * 4  # int32 = 4 bytes
    # Compute a chunk size (number of rows) that keeps the unpacked
    # intermediate for a chunk under `max_chunk_bytes`. Always process at
    # least one row.
    max_rows_per_chunk = max(1, max_chunk_bytes // max(bytes_per_row_unpacked, 1))
    # Preallocate output tensor in the packed CT layout.
    packed_ct = torch.empty_like(packed_awq)
    for start in range(0, rows, max_rows_per_chunk):
        end = min(rows, start + max_rows_per_chunk)
        # Slice the current chunk of rows, convert layout, and write back.
        chunk_packed_awq = packed_awq[start:end]
        chunk_unpacked = _unpack_awq_int4(chunk_packed_awq)
        chunk_packed_ct = _pack_ct_int4(chunk_unpacked)
        packed_ct[start:end] = chunk_packed_ct
    return packed_ct
```
```python
num_bits = awq_config.get("bits", num_bits)
group_size = awq_config.get("group_size", group_size)
# AutoAWQ uses ``zero_point: True`` to indicate *asymmetric* quant.
symmetric = not awq_config.get("zero_point", True)
```
CLI-provided num_bits / group_size / symmetric are always overwritten when config.quantization_config is present, so users cannot override AutoAWQ metadata even if they pass explicit flags. If override is intended, consider using None defaults (and argparse defaults of None) so you can distinguish “not provided” from “provided”, or add an explicit --no-autodetect / --prefer-cli switch.
```python
# Only apply AutoAWQ metadata when the corresponding value is still at
# its default, so that explicit CLI arguments can override it.
if num_bits == 4:
    num_bits = awq_config.get("bits", num_bits)
if group_size == 128:
    group_size = awq_config.get("group_size", group_size)
# AutoAWQ uses ``zero_point: True`` to indicate *asymmetric* quant.
if symmetric is False:
    symmetric = not awq_config.get("zero_point", True)
```
```python
# ----- Build compressed-tensors quantization_config -----
strategy = "group" if group_size > 0 else "channel"
quant_config = {
    "quant_method": "compressed-tensors",
    "format": "pack_quantized",
    "global_compression_ratio": None,
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": num_bits,
                "type": "int",
                "symmetric": symmetric,
                "strategy": strategy,
                "group_size": group_size if group_size > 0 else None,
            },
            "input_activations": None,
            "output_activations": None,
        }
    },
    "ignore": ["lm_head"],
}
```
The generated quantization_config schema differs from the one produced elsewhere in this repo (e.g., entrypoints/model_free/save_utils.update_config), which includes fields like compression_version and quantization_status and constructs the config via compressed_tensors.quantization.QuantizationConfig. To avoid incompatibilities with downstream loaders expecting the standard compressed-tensors config shape, consider building this dict using QuantizationConfig/QuantizationScheme and dumping it similarly to update_config (including format, ignore, and quantization_status).
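As a rough illustration of the reviewer's point, the sketch below merges the PR's dict with the extra metadata fields they mention. The field names quantization_status and compression_version are taken from the reviewer's description of update_config and are assumptions, not verified against the library; the real fix would build the config via compressed_tensors QuantizationConfig as suggested.

```python
def with_standard_fields(quant_config: dict) -> dict:
    """Return a copy of the PR's quantization_config with the extra
    metadata fields the reviewer says the standard config carries.

    Field names/values here are illustrative placeholders, NOT the
    compressed-tensors library's verified schema.
    """
    merged = dict(quant_config)
    merged.setdefault("quantization_status", "compressed")
    merged.setdefault("compression_version", "unknown")
    return merged
```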
```python
"""Allow ``python -m llmcompressor.conversion.autoawq_to_ct``."""
```
The module docstring says python -m llmcompressor.conversion.autoawq_to_ct, but src/llmcompressor/conversion/__main__.py is only used by python -m llmcompressor.conversion. Either update the docstring to reflect the actual invocation, or consider moving this entrypoint to autoawq_to_ct/__main__.py (or rely solely on if __name__ == '__main__' already present in autoawq_to_ct.py).
```python
"""Allow ``python -m llmcompressor.conversion``."""
```
```python
def test_convert_autoawq_to_ct(fake_awq_model: Path, tmp_path: Path):
    """Full conversion pipeline: verify tensor contents and config."""
    output_dir = tmp_path / "ct_model"
    convert_autoawq_to_ct(model_path=fake_awq_model, output_path=output_dir)

    # --- config.json ---
    with open(output_dir / "config.json") as f:
        cfg = json.load(f)
    qcfg = cfg["quantization_config"]
    assert qcfg["quant_method"] == "compressed-tensors"
    assert qcfg["format"] == "pack_quantized"
    group_cfg = qcfg["config_groups"]["group_0"]["weights"]
    assert group_cfg["num_bits"] == 4
    assert group_cfg["group_size"] == 16
    assert group_cfg["symmetric"] is False

    # --- safetensors ---
    from safetensors import safe_open

    with safe_open(str(output_dir / "model.safetensors"), framework="pt") as f:
        keys = set(f.keys())
        assert "model.layers.0.self_attn.q_proj.weight_packed" in keys
        assert "model.layers.0.self_attn.q_proj.weight_scale" in keys
        assert "model.layers.0.self_attn.q_proj.weight_zero_point" in keys
        assert "model.embed_tokens.weight" in keys

        # Old AWQ keys must be gone
        assert "model.layers.0.self_attn.q_proj.qweight" not in keys
        assert "model.layers.0.self_attn.q_proj.scales" not in keys
        assert "model.layers.0.self_attn.q_proj.qzeros" not in keys
```
The integration test exercises only a single-shard model.safetensors case and doesn’t cover the multi-shard path (*.safetensors.index.json) or key rewriting in the index. Since the converter has dedicated logic for index rewriting, consider adding a test that creates 2 shards plus an index file and validates that (1) renamed keys exist in the correct output shard and (2) the rewritten weight_map matches the produced tensors.
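A stdlib-only sketch of the index-rewrite half of such a test: build a weight_map spanning two shards, apply the renames the PR documents, and assert on the rewritten map. The rename helper here is illustrative, not the PR's actual API.

```python
def rename_awq_key(key: str) -> str:
    """Apply the PR's documented tensor renames to one weight_map key."""
    renames = {
        ".qweight": ".weight_packed",
        ".scales": ".weight_scale",
        ".qzeros": ".weight_zero_point",
    }
    for old, new in renames.items():
        if key.endswith(old):
            return key[: -len(old)] + new
    return key


# A toy two-shard weight_map, as found in *.safetensors.index.json.
weight_map = {
    "model.layers.0.self_attn.q_proj.qweight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.scales": "model-00002-of-00002.safetensors",
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
}
rewritten = {rename_awq_key(k): shard for k, shard in weight_map.items()}
```

A real test would additionally open each output shard and check that the renamed tensors actually live in the shard the rewritten weight_map claims.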
```python
# AWQ prefixes in *this* shard
shard_prefixes: set[str] = set()
for key in keys:
    if key.endswith(".qweight"):
        shard_prefixes.add(key.removesuffix(".qweight"))
all_awq_prefixes |= shard_prefixes

for key in tqdm(keys, desc=f" {st_file.name}", leave=False):
    tensor = f.get_tensor(key)

    # Try to match to an AWQ quantised layer
    matched_prefix = None
    for prefix in shard_prefixes:
        if key.startswith(prefix):
            matched_prefix = prefix
            break

    if matched_prefix is None:
        # Non-quantised parameter – pass through unchanged.
        converted[key] = tensor
        continue
```
Shard conversion currently identifies quantized layer prefixes only from keys ending in .qweight within the same shard. If a shard contains .scales/.qzeros for a layer whose .qweight lives in a different shard, those tensors will be passed through unconverted, while the index rewrite later will still rename them based on all_awq_prefixes, producing a broken checkpoint (index points to renamed keys that don't exist). Consider detecting AWQ tensors directly by suffix (.qweight, .scales, .qzeros) and deriving the prefix from the key itself, or do a pre-pass over all shards to collect prefixes and use the global set when converting each shard.
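One way to sketch the suffix-based detection this comment describes (helper name illustrative): derive the prefix from the key itself, so a shard holding only .scales/.qzeros for a layer whose .qweight lives elsewhere is still classified correctly, in O(1) per key.

```python
AWQ_SUFFIXES = (".qweight", ".scales", ".qzeros")


def split_awq_key(key: str):
    """Return (prefix, suffix) if key names an AWQ tensor, else (None, None).

    The prefix comes from the key itself rather than a per-shard set of
    .qweight prefixes, so classification does not depend on shard layout.
    """
    for suffix in AWQ_SUFFIXES:
        if key.endswith(suffix):
            return key[: -len(suffix)], suffix
    return None, None
```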
```python
# Try to match to an AWQ quantised layer
matched_prefix = None
for prefix in shard_prefixes:
    if key.startswith(prefix):
        matched_prefix = prefix
        break
```
The per-tensor conversion does an O(#keys × #quant_prefixes) scan (for prefix in shard_prefixes: if key.startswith(prefix)) for every key. On large sharded checkpoints this can be a noticeable CPU cost. Consider determining the prefix via known suffixes (e.g., if key.endswith('.qweight'): prefix=removesuffix(...)) or precomputing a lookup so each key is classified in O(1).
```python
"""Tests for the AutoAWQ → compressed-tensors conversion tool."""
```
Not sure how useful this test is; better to make an example that downloads an AutoAWQ model from HF and converts it. Ideally you'd run lm-eval on the output model to verify accuracy approximately matches before/after conversion.
HDCharles left a comment:
Thank you for your contribution. Please address the comments.
Summary
Adds a conversion module (llmcompressor.conversion.autoawq_to_ct) that converts AutoAWQ quantized checkpoints to the compressed-tensors pack_quantized format, enabling direct loading in vLLM without accuracy loss.

Closes #2087
Key Changes
Core conversion logic
- Unpacks AutoAWQ's interleaved int4 order [0, 2, 4, 6, 1, 3, 5, 7] and repacks weights into compressed-tensors' sequential order [0, 1, 2, 3, 4, 5, 6, 7]
- Renames tensors: qweight → weight_packed, scales → weight_scale, qzeros → weight_zero_point

Metadata generation
- Writes a quantization_config in config.json with quant_method: compressed-tensors and format: pack_quantized
- Propagates bits, group_size, and symmetric from the AWQ config

Usage
Python API:
CLI:
```shell
python -m llmcompressor.conversion.autoawq_to_ct \
  --model-path /path/to/autoawq-model \
  --output-path /path/to/output
```

Testing
References