# model_free_ptq

`model_free_ptq` is a PTQ entrypoint for **data-free quantization schemes** that operates directly on safetensors checkpoint files without requiring a Hugging Face model definition or loading the model through `transformers`.

## When to Use

Use `model_free_ptq` when:

- Your quantization scheme is **data-free** (e.g. FP8 dynamic, FP8 block, NVFP4A16, MXFP4/MXFP8)
- The model **does not have a Hugging Face transformers definition** (e.g. a newly released model not yet in transformers)
- `oneshot` **fails** for your model

For schemes that require calibration data (GPTQ, AWQ, SmoothQuant, static activation quantization), use [`oneshot`](oneshot.md) instead.

## Basic Usage

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="meta-llama/Meta-Llama-3-8B-Instruct",
    save_directory="Meta-Llama-3-8B-Instruct-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=["lm_head"],
    device="cuda:0",
)
```

## How It Works

`model_free_ptq` processes each `.safetensors` file in the checkpoint independently, without ever loading the full model into memory as a `torch.nn.Module`. For each file:

1. **Validate** — check that all quantizable tensors can be quantized with the given scheme
2. **Initialize** — create a minimal `torch.nn.Linear` module for each weight tensor
3. **Calibrate** — compute scale and zero point directly from the weight tensor (data-free)
4. **Compress** — call `compress_module` from `compressed-tensors` to pack/quantize the weights
5. **Save** — write the compressed tensors back to disk

After all files are processed, the safetensors index and model config are updated with the quantization metadata.

Multiple files can be processed in parallel using the `max_workers` argument.

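The calibrate step is what makes the flow data-free: scales come from weight statistics alone, with no forward passes. A simplified sketch in plain Python (integer rounding stands in for real FP8 E4M3 casting, and `fp8_tensor_scale`/`quantize_dequantize` are illustrative names, not library APIs):

```python
def fp8_tensor_scale(weight, fp8_max=448.0):
    """Data-free per-tensor scale: map the weight's absolute maximum
    onto the FP8 E4M3 range (448.0 is the largest E4M3 value)."""
    absmax = max(abs(w) for w in weight)
    return absmax / fp8_max


def quantize_dequantize(weight, scale):
    """Fake-quantize: scale down, round, clamp, then scale back up.
    Real FP8 casting rounds to representable E4M3 values; integer
    rounding here is a simplified stand-in showing the scale's role."""
    quantized = [max(-448.0, min(448.0, round(w / scale))) for w in weight]
    return [q * scale for q in quantized]


weights = [0.5, -1.25, 2.0, -0.75]
scale = fp8_tensor_scale(weights)
recovered = quantize_dequantize(weights, scale)
```

Because only the weight tensor itself is needed, each safetensors file can be calibrated and compressed in isolation, which is what allows per-file parallelism via `max_workers`.
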
## Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_stub` | `str \| PathLike` | — | Hugging Face model ID or path to a local directory containing safetensors files |
| `save_directory` | `str \| PathLike` | — | Directory to save the quantized checkpoint |
| `scheme` | `QuantizationScheme \| str` | — | Quantization scheme to apply. Can be a preset string (e.g. `"FP8_BLOCK"`, `"NVFP4A16"`) or a `QuantizationScheme` object |
| `ignore` | `Iterable[str]` | `()` | Module names or regex patterns to skip. Modules ending in `"norm"` are always ignored automatically |
| `max_workers` | `int` | `1` | Number of parallel worker threads for processing safetensors files |
| `device` | `str \| torch.device \| None` | `None` | Device to use for quantization. Defaults to GPU if available, otherwise CPU |
| `converter` | `Converter \| None` | `None` | Optional `compressed-tensors` converter to apply before quantization, e.g. to convert modelopt-format checkpoints to compressed-tensors format |

## Standard Flow (Non-Microscale Schemes)

For schemes without a global scale (e.g. `FP8_BLOCK`, `FP8_DYNAMIC`), call `model_free_ptq` directly:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

## Microscale Flow (NVFP4)

NVFP4 requires a **global scale** that is fused across related weight groups (e.g. qkv projections, gate/up projections). For this fusion to work correctly, the weights of each fused group must reside in the **same safetensors shard**.

Standard model checkpoints often split these weights across different shards. To fix this, run the `reindex_fused_weights` CLI tool first to reorganize the checkpoint:

```bash
llmcompressor.reindex_fused_weights \
    unsloth/Kimi-K2-Thinking-BF16 \
    Kimi-K2-Thinking-BF16-reindexed \
    --num_workers=10
```

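Conceptually, reindexing rewrites the weight-to-shard mapping so that every member of a fused group lands in the same shard. A minimal sketch of the idea (the real CLI also moves the tensor data between shard files; the group patterns below are illustrative assumptions, not the tool's actual rules):

```python
import re

# Hypothetical fused groups: members of each group must share a shard.
FUSED_PATTERNS = [
    r"(.*\.self_attn)\.(q_proj|k_proj|v_proj)\.weight",
    r"(.*\.mlp)\.(gate_proj|up_proj)\.weight",
]


def reindex(weight_map):
    """Given a {tensor_name: shard_file} index (as in
    model.safetensors.index.json), reassign every member of a fused
    group to the shard of the group's first member."""
    new_map = dict(weight_map)
    groups = {}
    for name in weight_map:
        for pattern in FUSED_PATTERNS:
            match = re.fullmatch(pattern, name)
            if match:
                # Group by the shared module prefix, e.g. "...self_attn"
                groups.setdefault((match.group(1), pattern), []).append(name)
    for members in groups.values():
        anchor_shard = new_map[members[0]]
        for name in members:
            new_map[name] = anchor_shard
    return new_map
```

After reindexing, each fused group can be loaded from a single shard, so the global-scale fusion during quantization never needs tensors from two files at once.
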
Then run `model_free_ptq` on the reindexed checkpoint:

```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="Kimi-K2-Thinking-BF16-reindexed",
    save_directory="Kimi-K2-Thinking-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

!!! note
    Reindexing is only required for **NVFP4**, which uses a global scale. MXFP4 does not use a global scale and does not require reindexing.

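To see why the global scale forces co-location: every member of a fused group (e.g. q/k/v) shares one global scale derived from the group-wide absolute maximum, so all members must be visible at once when the scale is computed. The sketch below uses the common NVFP4 two-level scaling formula `FP8_MAX * FP4_MAX / absmax` as a simplified assumption, not the library's exact code:

```python
def fused_global_scale(group_weights, fp8_max=448.0, fp4_max=6.0):
    """One global scale for a fused group of weight tensors.

    `group_weights` maps member names (e.g. q_proj/k_proj/v_proj) to
    flat lists of weight values. The scale is derived from the largest
    absolute value across *all* members, which is why the members must
    live in the same safetensors shard at quantization time.
    """
    group_absmax = max(
        max(abs(w) for w in weights) for weights in group_weights.values()
    )
    # FP8 group scales are themselves scaled by this global factor so
    # that the full dynamic range of the group fits the FP4 grid.
    return (fp8_max * fp4_max) / group_absmax
```

A per-member scale computed file-by-file would diverge between members split across shards, producing mismatched fused kernels at inference time; hence the reindexing requirement.
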
## Ignoring Layers

The `ignore` argument accepts module name strings or regex patterns prefixed with `re:`. Modules whose names end in `"norm"` are automatically ignored regardless of the `ignore` list.

```python
ignore=[
    "lm_head",              # exact name match
    "re:.*gate$",           # regex: any module ending in "gate"
    "model.embed_tokens",   # exact name match
]
```

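The matching rules can be sketched as follows (an illustrative reimplementation for clarity, not the library's actual code; whether the real matcher anchors regexes at the start of the name is an assumption here):

```python
import re


def is_ignored(name, ignore):
    """Return True if a module name should be skipped during quantization."""
    if name.endswith("norm"):  # modules ending in "norm" are always ignored
        return True
    for pattern in ignore:
        if pattern.startswith("re:"):  # regex entry, "re:" prefix stripped
            if re.match(pattern[len("re:"):], name):
                return True
        elif name == pattern:  # plain entry: exact name match
            return True
    return False
```

Note that `"re:.*gate$"` matches `model.layers.0.mlp.gate` but not `model.layers.0.mlp.gate_proj`, since the `$` anchor requires the name to end at `gate`.
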
## Supported Schemes

`model_free_ptq` supports any data-free weight quantization scheme. Common presets:

| Scheme | Description |
|--------|-------------|
| `FP8_DYNAMIC` | FP8 weights with dynamic per-token activation quantization |
| `FP8_BLOCK` | FP8 weights with block-wise scaling (Blackwell-optimized) |
| `NVFP4A16` | NVFP4 weight-only quantization with FP8 group scales and a global scale |
| `MXFP4/MXFP8` | MXFP4 or MXFP8 quantization with MX-format microscales |

Note: schemes such as NVFP4 and MXFP4 may achieve better accuracy recovery when paired with a data-driven calibration algorithm such as GPTQ. Consider comparing results against [`oneshot`](oneshot.md).

For the full list of supported schemes and formats, see [Compression Schemes](../compression_schemes.md).