Commit 967b43e

Merge branch 'main' into awq-mse-observer-alignment
2 parents 148a178 + 41c5c26 commit 967b43e

File tree: 99 files changed, +4967 −1415 lines

.github/mergify.yml

Lines changed: 5 additions & 39 deletions

```diff
@@ -1,43 +1,4 @@
-queue_rules:
-  - name: default
-    merge_method: merge
-    commit_message_template: |
-      {{ title }} (#{{ number }})
-
-      {{ body }}
-
-      Signed-off-by: Mergify <noreply@mergify.com>
-    queue_conditions:
-      - check-success=DCO
-      - check-success=quality-check
-      - check-success=transformers-tests
-      - check-success=base-tests (3.10)
-      - check-success=base-tests (3.13)
-      - check-success=pytorch-tests (3.10)
-      - check-success=pytorch-tests (3.13)
-      - check-success=markdown-link-check
-
 pull_request_rules:
-  - name: Automatically merge when ready
-    conditions:
-      - base=main
-      - label=ready
-      - "#approved-reviews-by>=2"
-      - check-success=DCO
-      - check-success=quality-check
-      - check-success=transformers-tests
-      - check-success=base-tests (3.10)
-      - check-success=base-tests (3.13)
-      - check-success=pytorch-tests (3.10)
-      - check-success=pytorch-tests (3.13)
-      - check-success=markdown-link-check
-      - check-success=ready-label-check
-      - -conflict
-      - -draft
-    actions:
-      queue:
-        name: default
-
   - name: label-documentation
     description: Automatically apply documentation label
     conditions:
@@ -47,6 +8,11 @@ pull_request_rules:
       - files~=^[^/]+\.md$
       - files~=^docs/
       - files~=^examples/
+      - -files~=^src/
+      - -files~=^tests/
+      - -files~=^\.github/
+      - -files~=^Makefile$
+      - -files~=^pyproject\.toml$
     actions:
       label:
         add:
```

README.md

Lines changed: 10 additions & 4 deletions

````diff
@@ -37,24 +37,24 @@ Big updates have landed in LLM Compressor! To get a more in-depth look, check ou
 
 Some of the exciting new features include:
 
+* **Qwen3.5 Support**: Qwen 3.5 can now be quantized using LLM Compressor. You will need to update your local transformers version using `uv pip install --upgrade transformers` and install LLM Compressor from source if using `<0.11`. Once updated, you should be able to run examples for the [MoE](examples/quantization_w4a4_fp4/qwen3_5_example.py) and [non-MoE](examples/quantization_w4a4_fp4/qwen3_5_example.py) variants of Qwen 3.5 end-to-end. For models quantized and published by the RedHat team, consider using the [NVFP4](https://huggingface.co/RedHatAI/Qwen3.5-122B-A10B-NVFP4) and FP8 checkpoints for [Qwen3.5-122B](https://huggingface.co/RedHatAI/Qwen3.5-122B-A10B-FP8-dynamic) and [Qwen3.5-397B](https://huggingface.co/RedHatAI/Qwen3.5-397B-A17B-FP8-dynamic).
 * **Updated offloading and model loading support**: Loading transformers models that are offloaded to disk and/or offloaded across distributed process ranks is now supported. Disk offloading allows users to load and compress very large models which normally would not fit in CPU memory. Offloading functionality is no longer supported through accelerate but through model loading utilities added to compressed-tensors. For a full summary of updated loading and offloading functionality, for both single-process and distributed flows, see the [Big Models and Distributed Support guide](docs/guides/big_models_and_distributed/model_loading.md).
 * **Distributed GPTQ Support**: GPTQ now supports Distributed Data Parallel (DDP) functionality to significantly improve calibration runtime. An example using DDP with GPTQ can be found [here](examples/quantization_w4a16/llama3_ddp_example.py).
 * **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models.
 * **New Model-Free PTQ Pathway**: A new model-free PTQ pathway has been added to LLM Compressor, called [`model_free_ptq`](src/llmcompressor/entrypoints/model_free/__init__.py#L36). This pathway allows you to quantize your model without requiring a Hugging Face model definition and is especially useful in cases where `oneshot` may fail. This pathway currently supports data-free schemes only (i.e. FP8 quantization) and was leveraged to quantize the [Mistral Large 3 model](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512). Additional [examples](examples/model_free_ptq) have been added illustrating how LLM Compressor can be used for Kimi K2.
+* **MXFP8 Microscale Support (Experimental)**: LLM Compressor now supports MXFP8 quantization via PTQ. Both W8A8 ([MXFP8](experimental/mxfp8/qwen3_example_w8a8_mxfp8.py)) and W8A16 weight-only ([MXFP8A16](experimental/mxfp8/qwen3_example_w8a16_mxfp8.py)) modes are available.
 * **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization. KV Cache quantization, which previously only supported per-tensor scales, has been extended to support any quantization scheme, including a new `per-head` quantization scheme. Support for these checkpoints is ongoing in vLLM, and scripts to get started have been added to the [experimental folder](experimental/attention).
 
 
 ### Supported Formats
-* Activation Quantization: W8A8 (int8 and fp8)
-* Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support)
-* 2:4 Semi-structured and Unstructured Sparsity
+* Activation Quantization: W8A8 (int8 and fp8), MXFP8 (experimental)
+* Mixed Precision: W4A16, W8A16, MXFP8A16 (experimental), NVFP4 (W4A4 and W4A16 support)
 
 ### Supported Algorithms
 * Simple PTQ
 * GPTQ
 * AWQ
 * SmoothQuant
-* SparseGPT
 * AutoRound
 
 ### When to Use Which Optimization
@@ -75,6 +75,8 @@ pip install llmcompressor
 Applying quantization with `llmcompressor`:
 * [Activation quantization to `int8`](examples/quantization_w8a8_int8/README.md)
 * [Activation quantization to `fp8`](examples/quantization_w8a8_fp8/README.md)
+* [Activation quantization to MXFP8 (experimental)](experimental/mxfp8/qwen3_example_w8a8_mxfp8.py)
+* [Weight-only quantization to MXFP8A16 (experimental)](experimental/mxfp8/qwen3_example_w8a16_mxfp8.py)
 * [Activation quantization to `fp4`](examples/quantization_w4a4_fp4/llama3_example.py)
 * [Activation quantization to `fp4` using AutoRound](examples/autoround/quantization_w4a4_fp4/README.md)
 * [Activation quantization to `fp8` and weight quantization to `int4`](examples/quantization_w4a8_fp8/)
@@ -183,3 +185,7 @@ If you find LLM Compressor useful in your research or projects, please consider
     url={https://github.com/vllm-project/llm-compressor},
 }
 ```
+
+
+!!! warning
+    Sparse compression (2:4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and usage.
````

docs/.nav.yml

Lines changed: 12 additions & 4 deletions

```diff
@@ -19,20 +19,28 @@ nav:
     - Qwen3:
       - key-models/qwen3/index.md
       - FP8 Example: key-models/qwen3/fp8-example.md
+    - Qwen3.5:
+      - key-models/qwen3.5/index.md
+      - NVFP4A16 VL Example: key-models/qwen3.5/nvfp4-vl-example.md
+      - NVFP4 MoE Example: key-models/qwen3.5/nvfp4-moe-example.md
     - Kimi-K2:
       - key-models/kimi-k2/index.md
       - FP8 Example: key-models/kimi-k2/fp8-example.md
     - Mistral Large 3:
       - key-models/mistral-large-3/index.md
       - FP8 Example: key-models/mistral-large-3/fp8-example.md
-  - Guides:
+  - User Guides:
+    - Entrypoints:
+      - guides/entrypoints/index.md
+      - oneshot: guides/entrypoints/oneshot.md
+      - model-free-ptq: guides/entrypoints/model-free-ptq.md
+    - Compression Schemes: guides/compression_schemes.md
+    - Observers: guides/observers.md
     - Big Models and Distributed Support:
      - Model Loading: guides/big_models_and_distributed/model_loading.md
      - Sequential Onloading: guides/big_models_and_distributed/sequential_onloading.md
      - Distributed Oneshot: guides/big_models_and_distributed/distributed_oneshot.md
-    - Compression Schemes: guides/compression_schemes.md
-    - Saving a Model: guides/saving_a_model.md
-    - Observers: guides/observers.md
+    - Saving a Compressed Model: guides/saving_a_model.md
     - Memory Requirements: guides/memory.md
     - Runtime Performance: guides/runtime.md
   - Examples:
```

docs/guides/compression_schemes.md

Lines changed: 2 additions & 26 deletions

```diff
@@ -8,8 +8,6 @@ A full list of supported schemes can be found [here](https://github.com/vllm-pro
 - [W8A8-INT8](#int8_w8a8)
 - [W4A16 and W8A16](#w4a16-and-w8a16)
 - [NVFP4](#nvfp4)
-- [2:4 Semi-structured Sparsity](#semi-structured)
-- [Unstructured Sparsity](#unstructured)
 
 ## PTQ Compression Schemes
 
@@ -63,27 +61,5 @@ A full list of supported schemes can be found [here](https://github.com/vllm-pro
 | Calibration | Requires a calibration dataset to calibrate activation global scales |
 | Use case | Supported on all NVIDIA Blackwell GPUs or later
 
-## Sparsification Compression Schemes
-
-Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:
-
-
-### Semi-Structured
-| Feature | Description |
-|---------------|----------------------------------------------------------------------------------------------|
-| 2:4 Semi-structured Sparsity | Uses semi-structured sparsity (SparseGPT), where 2 of every 4 contiguous weights are set to zero. |
-| Weights | 2:4 sparsity |
-| Activations | N/A |
-| Calibration | Requires a calibration dataset |
-| Use case | Fine-grained sparsity for compression and speedups |
-
-
-
-### Unstructured
-| Feature | Description |
-|---------------|----------------------------------------------------------------------------------------------|
-| Unstructured Sparsity | Zeros out individual weights without a regular pattern, removing weights wherever they contribute least. Produces a fine-grained sparse matrix. |
-| Weights | Sparsified individually (no structure) |
-| Activations | N/A |
-| Calibration | Does not require a calibration dataset |
-| Use case | Fine-grained sparsity for compression and speedups |
+!!! warning
+    Sparse compression (including 2:4 sparsity) is no longer supported by LLM Compressor due to lack of hardware support and user interest. See https://github.com/vllm-project/vllm/pull/36799 for more information.
```

docs/guides/entrypoints/index.md

Lines changed: 26 additions & 0 deletions

# Entrypoints

LLM Compressor provides two entrypoints for post-training quantization (PTQ), each suited to different scenarios.

## Choosing an Entrypoint

| | [`oneshot`](oneshot.md) | [`model_free_ptq`](model-free-ptq.md) |
|---|---|---|
| **Can apply calibration data** | Yes | No — data-free only |
| **Requires HF model definition** | Yes | No |
| **Supports GPTQ / AWQ / SmoothQuant** | Yes | No |
| **Supports FP8 / NVFP4 data-free** | Yes | Yes |
| **Works when model has no transformers definition** | No | Yes |
| **Fallback when `oneshot` fails** | N/A | Yes |

## oneshot

Use `oneshot` when your quantization algorithm or scheme **requires calibration data**, such as GPTQ, AWQ, SmoothQuant, or static activation quantization (FP8 or INT8 with static per-tensor activations). It loads the model through Hugging Face `transformers`, runs calibration forward passes, and applies recipe-defined modifiers.

[:octicons-arrow-right-24: oneshot documentation](oneshot.md)

## model_free_ptq

Use `model_free_ptq` when your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16) and either the model has no Hugging Face model definition, or `oneshot` fails for your model. It works directly on the safetensors checkpoint without loading the model through `transformers`.

[:octicons-arrow-right-24: model_free_ptq documentation](model-free-ptq.md)
docs/guides/entrypoints/model-free-ptq.md

Lines changed: 139 additions & 0 deletions

# model_free_ptq

`model_free_ptq` is a PTQ entrypoint for **data-free quantization schemes** that operates directly on safetensors checkpoint files, without requiring a Hugging Face model definition or loading the model through `transformers`.

## When to Use

Use `model_free_ptq` when:

- Your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16, MXFP4/MXFP8)
- The model **does not have a Hugging Face transformers definition** (e.g. a newly released model not yet in transformers)
- `oneshot` **fails** for your model

For schemes that require calibration data (GPTQ, AWQ, SmoothQuant, static activation quantization), use [`oneshot`](oneshot.md) instead.
## Basic Usage

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
    save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
    device="cuda:0",
)
```
## How It Works

`model_free_ptq` processes each `.safetensors` file in the checkpoint independently, without ever loading the full model into memory as a `torch.nn.Module`. For each file:

1. **Validate** — check that all quantizable tensors can be quantized with the given scheme
2. **Initialize** — create a minimal `torch.nn.Linear` module for each weight tensor
3. **Calibrate** — compute scale and zero point directly from the weight tensor (data-free)
4. **Compress** — call `compress_module` from `compressed-tensors` to pack/quantize the weights
5. **Save** — write the compressed tensors back to disk

After all files are processed, the safetensors index and model config are updated with the quantization metadata.

Multiple files can be processed in parallel using the `max_workers` argument.
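The per-file flow above can be sketched as a simple fan-out over a thread pool. This is a hypothetical illustration, not the actual llmcompressor internals; `process_file` is a stand-in for the validate/initialize/calibrate/compress/save pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path: str) -> str:
    # Stand-in for the real per-file work:
    # validate -> initialize -> calibrate -> compress -> save
    return f"{path}: compressed"

def run_all(files: list[str], max_workers: int = 1) -> list[str]:
    # Each safetensors file is independent, so shards can be
    # processed concurrently, mirroring the `max_workers` argument.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, files))

results = run_all(
    ["model-00001.safetensors", "model-00002.safetensors"],
    max_workers=2,
)
```

`pool.map` preserves input order, so results line up with the shard list regardless of which worker finishes first.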
## Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_stub` | `str \| PathLike` | (required) | Hugging Face model ID or path to a local directory containing safetensors files |
| `save_directory` | `str \| PathLike` | (required) | Directory to save the quantized checkpoint |
| `scheme` | `QuantizationScheme \| str` | (required) | Quantization scheme to apply. Can be a preset string (e.g. `"FP8_BLOCK"`, `"NVFP4A16"`) or a `QuantizationScheme` object |
| `ignore` | `Iterable[str]` | `()` | Module names or regex patterns to skip. Modules ending in `"norm"` are always ignored automatically |
| `max_workers` | `int` | `1` | Number of parallel worker threads for processing safetensors files |
| `device` | `str \| torch.device \| None` | `None` | Device to use for quantization. Defaults to GPU if available, otherwise CPU |
| `converter` | `Converter \| None` | `None` | Optional `compressed-tensors` converter to apply before quantization, e.g. to convert modelopt-format checkpoints to compressed-tensors format |
## Standard Flow (Non-Microscale Schemes)

For schemes without a global scale (e.g. `FP8_BLOCK`, `FP8_DYNAMIC`), call `model_free_ptq` directly:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```
## Microscale Flow (NVFP4)

NVFP4 requires a **global scale** that is fused across related weight groups (e.g. qkv projections, gate/up projections). For this fusion to work correctly, the weights of each fused group must reside in the **same safetensors shard**.

Standard model checkpoints often split these weights across different shards. To fix this, run the `reindex_fused_weights` CLI tool first to reorganize the checkpoint:

```bash
llmcompressor.reindex_fused_weights \
    unsloth/Kimi-K2-Thinking-BF16 \
    Kimi-K2-Thinking-BF16-reindexed \
    --num_workers=10
```

Then run `model_free_ptq` on the reindexed checkpoint:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="Kimi-K2-Thinking-BF16-reindexed",
    save_directory="Kimi-K2-Thinking-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

!!! note
    Reindexing is only required for **NVFP4**, which uses a global scale. MXFP4 does not use a global scale and does not require reindexing.
## Ignoring Layers

The `ignore` argument accepts module name strings or regex patterns prefixed with `re:`. Modules whose names end in `"norm"` are automatically ignored regardless of the `ignore` list.

```python
ignore=[
    "lm_head",            # exact name match
    "re:.*gate$",         # regex: any module ending in "gate"
    "model.embed_tokens", # exact name match
]
```
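To make the matching rules concrete, here is a minimal sketch of how such an ignore list could be resolved. This is a hypothetical re-implementation for illustration only (the real logic lives inside `model_free_ptq`), assuming `re:`-prefixed patterns are matched from the start of the module name:

```python
import re

def is_ignored(name: str, ignore: list[str]) -> bool:
    # Modules whose names end in "norm" are always skipped,
    # regardless of the ignore list.
    if name.endswith("norm"):
        return True
    for pattern in ignore:
        if pattern.startswith("re:"):
            # "re:" entries are treated as regular expressions
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # everything else is an exact name match
            return True
    return False

ignore = ["lm_head", "re:.*gate$", "model.embed_tokens"]
```

Under these assumptions, `model.layers.0.mlp.gate` is skipped via the regex entry, while a projection such as `model.layers.0.self_attn.q_proj` is still quantized.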
## Supported Schemes

`model_free_ptq` supports any data-free weight quantization scheme. Common presets:

| Scheme | Description |
|--------|-------------|
| `FP8_DYNAMIC` | FP8 weights with dynamic per-token activation quantization |
| `FP8_BLOCK` | FP8 weights with block-wise scaling (Blackwell-optimized) |
| `NVFP4A16` | NVFP4 weight-only quantization with FP8 group scales and a global scale |
| `MXFP4`/`MXFP8` | MXFP4 or MXFP8 quantization with MX-format microscales |

Note: Many of these schemes, such as NVFP4 and MXFP4, may achieve improved recovery when applied with a calibration algorithm that requires data, such as GPTQ. Consider comparing results against `oneshot`.

For the full list of supported schemes and formats, see [Compression Schemes](../compression_schemes.md).
