
feat: AutoAWQ to compressed-tensors conversion tool#2440

Open
NJX-njx wants to merge 1 commit into vllm-project:main from NJX-njx:feat/autoawq-to-compressed-tensors-converter

Conversation


@NJX-njx NJX-njx commented Mar 4, 2026

Summary

Adds a conversion module (llmcompressor.conversion.autoawq_to_ct) that converts AutoAWQ quantized checkpoints to the compressed-tensors pack_quantized format, enabling direct loading in vLLM without accuracy loss.

Closes #2087

Key Changes

Core conversion logic

  • Int4 repacking: Correctly handles AutoAWQ's interleaved packing order [0, 2, 4, 6, 1, 3, 5, 7] and repacks weights into compressed-tensors' sequential order [0, 1, 2, 3, 4, 5, 6, 7]
  • Tensor renaming: Maps AWQ parameter names to compressed-tensors conventions:
    • qweight → weight_packed
    • scales → weight_scale
    • qzeros → weight_zero_point
  • Zero-point conversion: Unpacks AWQ's packed zero points (also interleaved)
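
The interleave handling above can be sketched roughly as follows (an illustrative reconstruction, not the PR's actual code; the helper names and the exact nibble-to-column mapping are assumptions based on the order described above):

```python
import torch

# AWQ stores 8 int4 values per int32; assumed here: the nibble at bit
# position shift*4 holds the value for logical column offset AWQ_ORDER[shift].
AWQ_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def unpack_awq_int4(packed: torch.Tensor) -> torch.Tensor:
    """Expand an AWQ-packed int32 matrix to one int32 per int4 value."""
    rows, cols = packed.shape
    out = torch.empty(rows, cols * 8, dtype=torch.int32)
    for shift, col in enumerate(AWQ_ORDER):
        out[:, col::8] = (packed >> (shift * 4)) & 0xF
    return out


def pack_ct_int4(unpacked: torch.Tensor) -> torch.Tensor:
    """Pack int4 values sequentially, 8 per int32 (compressed-tensors layout)."""
    rows, cols = unpacked.shape
    packed = torch.zeros(rows, cols // 8, dtype=torch.int32)
    for i in range(8):
        packed |= (unpacked[:, i::8] & 0xF) << (i * 4)
    return packed


def repack_awq_to_ct(packed_awq: torch.Tensor) -> torch.Tensor:
    """Interleaved AWQ packing -> sequential compressed-tensors packing."""
    return pack_ct_int4(unpack_awq_int4(packed_awq))
```

A round-trip (pack with the AWQ order, repack, unpack sequentially) should recover the original int4 values exactly, which is the property the PR's unit tests check.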

Metadata generation

  • Generates proper quantization_config in config.json with quant_method: compressed-tensors and format: pack_quantized
  • Auto-detects bits, group_size, and symmetric from AWQ config
  • Supports multi-shard models with proper safetensors index file rewriting
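
For multi-shard checkpoints, the index rewriting amounts to renaming keys inside the `weight_map` of `model.safetensors.index.json`. A minimal sketch of that step (hypothetical helper name; the PR's implementation may differ):

```python
# Suffix renames applied to AWQ-quantized layers (mapping taken from the PR
# description; the helper itself is illustrative).
RENAME = {
    ".qweight": ".weight_packed",
    ".scales": ".weight_scale",
    ".qzeros": ".weight_zero_point",
}


def rewrite_weight_map(weight_map, awq_prefixes):
    """Return a weight_map with AWQ keys renamed to compressed-tensors names.

    Only keys whose layer prefix is a known AWQ-quantized layer are renamed;
    everything else (embeddings, norms, biases) passes through unchanged.
    """
    out = {}
    for key, shard in weight_map.items():
        new_key = key
        for old_suffix, new_suffix in RENAME.items():
            prefix = key[: -len(old_suffix)]
            if key.endswith(old_suffix) and prefix in awq_prefixes:
                new_key = prefix + new_suffix
                break
        out[new_key] = shard
    return out
```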

Usage

Python API:

from llmcompressor.conversion.autoawq_to_ct import convert_autoawq_to_ct

convert_autoawq_to_ct(
    model_path="/path/to/autoawq-model",
    output_path="/path/to/output",
)

CLI:

python -m llmcompressor.conversion.autoawq_to_ct \
    --model-path /path/to/autoawq-model \
    --output-path /path/to/output

Testing

  • Unit tests covering packing round-trip, sequential repacking, key renaming, and AWQ order verification (all passing)
  • Integration test with a synthetic AutoAWQ checkpoint validates full pipeline correctness

References

Add a conversion module that converts AutoAWQ quantized checkpoints to the
compressed-tensors pack_quantized format, enabling direct loading in vLLM.

Key features:
- Handles AutoAWQ's interleaved int4 packing order [0,2,4,6,1,3,5,7]
  and repacks weights into compressed-tensors sequential order
- Converts tensor naming: qweight -> weight_packed, scales -> weight_scale,
  qzeros -> weight_zero_point
- Generates proper quantization_config in config.json with compressed-tensors
  metadata
- Supports multi-shard models with proper index file rewriting
- Both Python API and CLI entry point

Closes vllm-project#2087
Copilot AI review requested due to automatic review settings March 4, 2026 11:34


github-actions bot commented Mar 4, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.


Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new conversion tool to transform AutoAWQ quantized models into the compressed-tensors pack_quantized format. This enables direct loading of AutoAWQ models within vLLM, ensuring compatibility and maintaining accuracy by correctly handling the differences in int4 weight packing and metadata.

Highlights

  • Int4 Weight Repacking: Implemented logic to correctly re-pack AutoAWQ's interleaved int4 weights (order [0, 2, 4, 6, 1, 3, 5, 7]) into compressed-tensors' sequential order ([0, 1, 2, 3, 4, 5, 6, 7]).
  • Tensor Renaming: Mapped AutoAWQ tensor names (e.g., 'qweight', 'scales', 'qzeros') to compressed-tensors conventions ('weight_packed', 'weight_scale', 'weight_zero_point').
  • Zero-Point Conversion: Added functionality to unpack AutoAWQ's interleaved zero points for correct conversion to the compressed-tensors format.
  • Quantization Metadata Generation: Automatically generates the appropriate 'quantization_config' in 'config.json' for compressed-tensors, including auto-detection of 'bits', 'group_size', and 'symmetric' parameters from the AutoAWQ config.
  • Multi-Shard Model Support: Ensured proper handling and rewriting of safetensors index files for models distributed across multiple shards.
  • Usage API: Provided both Python API and CLI interfaces for easy model conversion.


Changelog
  • src/llmcompressor/conversion/__init__.py
    • Added convert_autoawq_to_ct to the module's __all__ export list.
  • src/llmcompressor/conversion/__main__.py
    • Created a new module to enable CLI execution of the conversion tool.
  • src/llmcompressor/conversion/autoawq_to_ct.py
    • Implemented the core logic for converting AutoAWQ models to compressed-tensors format.
    • Added functions for unpacking AutoAWQ's interleaved int4 values and packing them into compressed-tensors sequential format.
    • Included logic for renaming tensor keys and generating quantization_config metadata.
  • tests/llmcompressor/conversion/test_autoawq_to_ct.py
    • Added comprehensive unit tests for int4 packing/unpacking, repacking, and key renaming.
    • Included an integration test with a synthetic AutoAWQ model to validate the full conversion pipeline.
Activity
  • Unit tests were added covering packing round-trip, sequential repacking, key renaming, and AWQ order verification, all reported as passing.
  • An integration test with a synthetic AutoAWQ checkpoint was implemented and validated the full pipeline correctness.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a valuable conversion tool to transform AutoAWQ quantized models into the compressed-tensors format. The implementation is well-structured, with clear separation of concerns for packing/unpacking logic, key renaming, and the main conversion workflow. The inclusion of a CLI entry point and comprehensive unit tests, including a reference implementation for AWQ packing, is commendable.

My review focuses on improving the robustness of file handling, enhancing code clarity, and increasing test coverage. Specifically, I've suggested a more robust method for copying auxiliary model files, simplified a redundant conditional block in the tensor conversion loop, and recommended adding a check for zero-point correctness in the integration test.

Comment on lines +224 to +243
suffix = key[len(matched_prefix):]

if suffix == ".qweight":
    converted[f"{matched_prefix}.weight_packed"] = (
        _repack_awq_to_ct(tensor)
    )

elif suffix == ".scales":
    converted[f"{matched_prefix}.weight_scale"] = tensor

elif suffix == ".qzeros":
    # Zero-points are also packed with the AWQ interleave.
    zp = _unpack_awq_int4(tensor)
    converted[f"{matched_prefix}.weight_zero_point"] = zp

elif suffix == ".bias":
    converted[key] = tensor

else:
    converted[key] = tensor

Severity: medium

This section for handling different tensor suffixes can be simplified. The elif suffix == ".bias": block is redundant because its logic is identical to the else: block that follows. Combining them will make the code more concise and easier to read.

                suffix = key[len(matched_prefix):]

                if suffix == ".qweight":
                    converted[f"{matched_prefix}.weight_packed"] = (
                        _repack_awq_to_ct(tensor)
                    )
                elif suffix == ".scales":
                    converted[f"{matched_prefix}.weight_scale"] = tensor
                elif suffix == ".qzeros":
                    # Zero-points are also packed with the AWQ interleave.
                    zp = _unpack_awq_int4(tensor)
                    converted[f"{matched_prefix}.weight_zero_point"] = zp
                else:
                    # Pass through other parameters like bias.
                    converted[key] = tensor

Comment on lines +300 to +309
_auxiliary_globs = [
    "generation_config.json",
    "special_tokens_map.json",
    "merges.txt",
]
for pattern in _auxiliary_globs:
    for src in model_path.glob(pattern):
        dst = output_path / src.name
        if not dst.exists():
            shutil.copy2(src, dst)

Severity: medium

The current method of copying auxiliary files using a hardcoded list of globs is brittle. It may miss important files required for the model to load correctly, such as tokenizer.json or other tokenizer-related files not covered by save_pretrained. A more robust approach is to iterate through all files in the source directory and copy any that are not explicitly generated or modified by this script. This ensures a more complete and reliable model conversion.

    # ----- Copy any remaining auxiliary files -----
    for src in model_path.glob("*"):
        if src.is_dir() or src.suffix == ".safetensors":
            continue
        dst = output_path / src.name
        if not dst.exists():
            shutil.copy2(src, dst)

for i in range(8):
    ct_unpacked[:, i::8] = (ct_packed >> (i * 4)) & 0xF

torch.testing.assert_close(ct_unpacked, ground_truth["weights"])

Severity: medium

The integration test verifies that the repacked weights are correct, but it's missing a similar verification for the zero points. Since zero points are also transformed (unpacked from the AWQ format), it's important to add an assertion to ensure they are correctly handled in the conversion process. This will improve the test's coverage and confidence in the conversion logic.

Suggested change
torch.testing.assert_close(ct_unpacked, ground_truth["weights"])
# Verify zero-point values are correct after unpacking
zp_unpacked = f.get_tensor(
    "model.layers.0.self_attn.q_proj.weight_zero_point"
)
torch.testing.assert_close(zp_unpacked, ground_truth["zeros"])


Copilot AI left a comment


Pull request overview

Adds a new conversion utility to transform AutoAWQ int4-packed checkpoints into the compressed-tensors pack_quantized format (including renaming tensors and writing updated quantization metadata) so the result can be loaded directly by vLLM.

Changes:

  • Introduces llmcompressor.conversion.autoawq_to_ct with AWQ→CT int4 repacking, tensor key renaming, config.json rewriting, and optional safetensors index rewriting.
  • Adds CLI entrypoint support via module execution and exports the converter from llmcompressor.conversion.
  • Adds unit + integration tests for packing/unpacking, key renaming, and an end-to-end synthetic checkpoint conversion.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 7 comments.

File Description
src/llmcompressor/conversion/autoawq_to_ct.py Core conversion implementation (packing, renaming, config + index rewriting, CLI parser).
src/llmcompressor/conversion/__init__.py Exposes convert_autoawq_to_ct from the conversion package.
src/llmcompressor/conversion/__main__.py Adds a python -m llmcompressor.conversion ... entrypoint that delegates to the converter CLI.
tests/llmcompressor/conversion/test_autoawq_to_ct.py New tests covering packing correctness, key renaming, and a single-shard end-to-end conversion.


Comment on lines +120 to +124
def _repack_awq_to_ct(packed_awq: torch.Tensor) -> torch.Tensor:
    """One-shot conversion: AWQ-packed int32 → CT-packed int32."""
    return _pack_ct_int4(_unpack_awq_int4(packed_awq))



Copilot AI Mar 4, 2026


_repack_awq_to_ct expands the packed int32 weights into a full int32 matrix in _unpack_awq_int4 (8× more elements) and then repacks. For large models this intermediate can be tens of GB and may OOM CPU RAM. Consider repacking by extracting/reordering nibbles within the packed int32 tensor (staying in the packed shape) or using a chunked/streaming approach to cap peak memory.

Suggested change
def _repack_awq_to_ct(
    packed_awq: torch.Tensor,
    max_chunk_bytes: int = 256 * 1024 * 1024,
) -> torch.Tensor:
    """Convert AWQ-packed int32 → CT-packed int32 with bounded peak memory.

    The naive implementation would unpack the entire tensor to an 8× larger
    int32 matrix and then repack it. For large models this can require tens
    of GB of RAM. To avoid that, we process the tensor in row-wise chunks:
    each chunk is unpacked and repacked independently, and the intermediate
    is immediately discarded.

    :param packed_awq: AWQ-packed int32 tensor of shape ``(rows, cols_packed)``.
    :param max_chunk_bytes: Approximate upper bound on the size of the
        unpacked intermediate per chunk, in bytes.
    :return: CT-packed int32 tensor with the same shape as ``packed_awq``.
    """
    if packed_awq.dim() != 2:
        # Keep behavior simple and explicit: this helper is for 2D weight
        # matrices. If other shapes are needed, they should be reshaped by
        # the caller.
        raise ValueError(
            f"_repack_awq_to_ct expects a 2D tensor, got shape {tuple(packed_awq.shape)}"
        )
    rows, cols_packed = packed_awq.shape
    if rows == 0 or cols_packed == 0:
        return packed_awq.clone()
    # Each packed column expands to 8 int32 values in the unpacked matrix.
    cols_unpacked = cols_packed * 8
    bytes_per_row_unpacked = cols_unpacked * 4  # int32 = 4 bytes
    # Compute a chunk size (number of rows) that keeps the unpacked
    # intermediate for a chunk under `max_chunk_bytes`. Always process at
    # least one row.
    max_rows_per_chunk = max(1, max_chunk_bytes // max(bytes_per_row_unpacked, 1))
    # Preallocate output tensor in the packed CT layout.
    packed_ct = torch.empty_like(packed_awq)
    for start in range(0, rows, max_rows_per_chunk):
        end = min(rows, start + max_rows_per_chunk)
        # Slice the current chunk of rows, convert layout, and write back.
        chunk_packed_awq = packed_awq[start:end]
        chunk_unpacked = _unpack_awq_int4(chunk_packed_awq)
        chunk_packed_ct = _pack_ct_int4(chunk_unpacked)
        packed_ct[start:end] = chunk_packed_ct
    return packed_ct

Comment on lines +173 to +176
num_bits = awq_config.get("bits", num_bits)
group_size = awq_config.get("group_size", group_size)
# AutoAWQ uses ``zero_point: True`` to indicate *asymmetric* quant.
symmetric = not awq_config.get("zero_point", True)

Copilot AI Mar 4, 2026


CLI-provided num_bits / group_size / symmetric are always overwritten when config.quantization_config is present, so users cannot override AutoAWQ metadata even if they pass explicit flags. If override is intended, consider using None defaults (and argparse defaults of None) so you can distinguish “not provided” from “provided”, or add an explicit --no-autodetect / --prefer-cli switch.

Suggested change
# Only apply AutoAWQ metadata when the corresponding value is still at
# its default, so that explicit CLI arguments can override it.
if num_bits == 4:
    num_bits = awq_config.get("bits", num_bits)
if group_size == 128:
    group_size = awq_config.get("group_size", group_size)
# AutoAWQ uses ``zero_point: True`` to indicate *asymmetric* quant.
if symmetric is False:
    symmetric = not awq_config.get("zero_point", True)

Comment on lines +247 to +268
# ----- Build compressed-tensors quantization_config -----
strategy = "group" if group_size > 0 else "channel"
quant_config = {
    "quant_method": "compressed-tensors",
    "format": "pack_quantized",
    "global_compression_ratio": None,
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": num_bits,
                "type": "int",
                "symmetric": symmetric,
                "strategy": strategy,
                "group_size": group_size if group_size > 0 else None,
            },
            "input_activations": None,
            "output_activations": None,
        }
    },
    "ignore": ["lm_head"],
}

Copilot AI Mar 4, 2026


The generated quantization_config schema differs from the one produced elsewhere in this repo (e.g., entrypoints/model_free/save_utils.update_config), which includes fields like compression_version and quantization_status and constructs the config via compressed_tensors.quantization.QuantizationConfig. To avoid incompatibilities with downstream loaders expecting the standard compressed-tensors config shape, consider building this dict using QuantizationConfig/QuantizationScheme and dumping it similarly to update_config (including format, ignore, and quantization_status).

@@ -0,0 +1,5 @@
"""Allow ``python -m llmcompressor.conversion.autoawq_to_ct``."""

Copilot AI Mar 4, 2026


The module docstring says python -m llmcompressor.conversion.autoawq_to_ct, but src/llmcompressor/conversion/__main__.py is only used by python -m llmcompressor.conversion. Either update the docstring to reflect the actual invocation, or consider moving this entrypoint to autoawq_to_ct/__main__.py (or rely solely on if __name__ == '__main__' already present in autoawq_to_ct.py).

Suggested change
"""Allow ``python -m llmcompressor.conversion``."""

Comment on lines +225 to +255
def test_convert_autoawq_to_ct(fake_awq_model: Path, tmp_path: Path):
    """Full conversion pipeline: verify tensor contents and config."""
    output_dir = tmp_path / "ct_model"
    convert_autoawq_to_ct(model_path=fake_awq_model, output_path=output_dir)

    # --- config.json ---
    with open(output_dir / "config.json") as f:
        cfg = json.load(f)
    qcfg = cfg["quantization_config"]
    assert qcfg["quant_method"] == "compressed-tensors"
    assert qcfg["format"] == "pack_quantized"
    group_cfg = qcfg["config_groups"]["group_0"]["weights"]
    assert group_cfg["num_bits"] == 4
    assert group_cfg["group_size"] == 16
    assert group_cfg["symmetric"] is False

    # --- safetensors ---
    from safetensors import safe_open

    with safe_open(str(output_dir / "model.safetensors"), framework="pt") as f:
        keys = set(f.keys())
        assert "model.layers.0.self_attn.q_proj.weight_packed" in keys
        assert "model.layers.0.self_attn.q_proj.weight_scale" in keys
        assert "model.layers.0.self_attn.q_proj.weight_zero_point" in keys
        assert "model.embed_tokens.weight" in keys

        # Old AWQ keys must be gone
        assert "model.layers.0.self_attn.q_proj.qweight" not in keys
        assert "model.layers.0.self_attn.q_proj.scales" not in keys
        assert "model.layers.0.self_attn.q_proj.qzeros" not in keys


Copilot AI Mar 4, 2026


The integration test exercises only a single-shard model.safetensors case and doesn’t cover the multi-shard path (*.safetensors.index.json) or key rewriting in the index. Since the converter has dedicated logic for index rewriting, consider adding a test that creates 2 shards plus an index file and validates that (1) renamed keys exist in the correct output shard and (2) the rewritten weight_map matches the produced tensors.

Comment on lines +202 to +223
# AWQ prefixes in *this* shard
shard_prefixes: set[str] = set()
for key in keys:
    if key.endswith(".qweight"):
        shard_prefixes.add(key.removesuffix(".qweight"))
all_awq_prefixes |= shard_prefixes

for key in tqdm(keys, desc=f"  {st_file.name}", leave=False):
    tensor = f.get_tensor(key)

    # Try to match to an AWQ quantised layer
    matched_prefix = None
    for prefix in shard_prefixes:
        if key.startswith(prefix):
            matched_prefix = prefix
            break

    if matched_prefix is None:
        # Non-quantised parameter – pass through unchanged.
        converted[key] = tensor
        continue


Copilot AI Mar 4, 2026


Shard conversion currently identifies quantized layer prefixes only from keys ending in .qweight within the same shard. If a shard contains .scales/.qzeros for a layer whose .qweight lives in a different shard, those tensors will be passed through unconverted, while the index rewrite later will still rename them based on all_awq_prefixes, producing a broken checkpoint (index points to renamed keys that don't exist). Consider detecting AWQ tensors directly by suffix (.qweight, .scales, .qzeros) and deriving the prefix from the key itself, or do a pre-pass over all shards to collect prefixes and use the global set when converting each shard.
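
The suffix-driven classification suggested here could look roughly like this (a sketch of the reviewer's proposal, not code from the PR; `classify_awq_key` is a hypothetical name):

```python
# AWQ tensor suffixes; deriving the layer prefix from the key itself means
# each key is classified in O(1), and .scales/.qzeros tensors are recognised
# even when the matching .qweight lives in a different shard.
AWQ_SUFFIXES = (".qweight", ".scales", ".qzeros")


def classify_awq_key(key):
    """Return (layer_prefix, suffix) when key names an AWQ tensor, else None."""
    for suffix in AWQ_SUFFIXES:
        if key.endswith(suffix):
            return key[: -len(suffix)], suffix
    return None
```

One caveat: a non-AWQ parameter that happened to end in `.scales` would be misclassified, so a real implementation might also confirm the derived prefix against a global set collected in a pre-pass over all shards.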

Comment on lines +212 to +218
# Try to match to an AWQ quantised layer
matched_prefix = None
for prefix in shard_prefixes:
    if key.startswith(prefix):
        matched_prefix = prefix
        break


Copilot AI Mar 4, 2026


The per-tensor conversion does an O(#keys × #quant_prefixes) scan (for prefix in shard_prefixes: if key.startswith(prefix)) for every key. On large sharded checkpoints this can be a noticeable CPU cost. Consider determining the prefix via known suffixes (e.g., if key.endswith('.qweight'): prefix=removesuffix(...)) or precomputing a lookup so each key is classified in O(1).

@@ -0,0 +1,272 @@
"""Tests for the AutoAWQ → compressed-tensors conversion tool."""
Collaborator


Not sure how useful this test is; better to make an example that downloads an AutoAWQ model from HF and converts it. Ideally you'd run lm-eval on the output model to verify that accuracy approximately matches before/after conversion.


@HDCharles HDCharles left a comment


Thank you for your contribution. Please address the comments.

Successfully merging this pull request may close these issues.

[Feature Request][Help Wanted] Convert AutoAWQ checkpoints to compressed-tensors
