Commit 3219b16

kylesayrs and brian-dellabetta authored and committed
Add Attention Quantization Examples (vllm-project#2484)
## Purpose ##
* Add attention and kv quantization examples to reflect current vLLM support

## Changes ##
* Add fp8 tensor attention example
* Add fp8 kv head example

## Testing ##
* Ran examples e2e and ran in vLLM

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: Ziming <frankziming26@outlook.com>
1 parent 0c142a3 commit 3219b16

File tree

5 files changed: +111 additions, −24 deletions

README.md

Lines changed: 3 additions & 2 deletions
@@ -43,7 +43,7 @@ Some of the exciting new features include:
  * **Updated FP4 Microscale Support**: GPTQ now supports FP4 quantization schemes, including both [MXFP4](examples/quantization_w4a16_fp4/mxfp4/llama3_example.py) and [NVFP4](examples/quantization_w4a4_fp4/llama3_gptq_example.py). MXFP4 support has also been improved with updated weight scale generation. Models with weight-only quantization in the MXFP4 format can now run in vLLM as of vLLM v0.14.0. MXFP4 models with activation quantization are not yet supported in vLLM for compressed-tensors models
  * **New Model-Free PTQ Pathway**: A new model-free PTQ pathway has been added to LLM Compressor, called [`model_free_ptq`](src/llmcompressor/entrypoints/model_free/__init__.py#L36). This pathway allows you to quantize your model without requiring a Hugging Face model definition and is especially useful in cases where `oneshot` may fail. This pathway currently supports data-free pathways only, i.e. FP8 quantization, and was leveraged to quantize the [Mistral Large 3 model](https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512). Additional [examples](examples/model_free_ptq) have been added illustrating how LLM Compressor can be used for Kimi K2
  * **MXFP8 Microscale Support (Experimental)**: LLM Compressor now supports MXFP8 quantization via PTQ. Both W8A8 ([MXFP8](experimental/mxfp8/qwen3_example_w8a8_mxfp8.py)) and W8A16 weight-only ([MXFP8A16](experimental/mxfp8/qwen3_example_w8a16_mxfp8.py)) modes are available.
- * **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization. KV Cache quantization, which previously only supported per-tensor scales, has been extended to support any quantization scheme including a new `per-head` quantization scheme. Support for these checkpoints is on-going in vLLM and scripts to get started have been added to the [experimental folder](experimental/attention)
+ * **Extended KV Cache and Attention Quantization Support**: LLM Compressor now supports attention quantization, as well as fine-grained KV Cache quantization. Previously, only per-tensor KV cache quantization was supported. Now, you can quantize the KV cache with `per-head` scales and run with vLLM. Examples of more generalized attention and kv cache quantization can be found in the [experimental folder](experimental/attention).

  ### Supported Formats

@@ -86,7 +86,8 @@ Applying quantization with `llmcompressor`:
  * [Weight only quantization to `int4` using AWQ](examples/awq/README.md)
  * [Weight only quantization to `int4` using AutoRound](examples/autoround/quantization_w4a16/README.md)
  * [KV Cache quantization to `fp8`](examples/quantization_kv_cache/README.md)
- * [Attention quantization to `fp8` (experimental)](experimental/attention/README.md)
+ * [KV Cache quantization to `fp8` using per-head scales](examples/quantization_kv_cache/llama3_fp8_head_kv_example.py)
+ * [Attention quantization to `fp8`](examples/quantization_attention/README.md)
  * [Attention quantization to `nvfp4` with SpinQuant (experimental)](experimental/attention/README.md)
  * [Quantizing MoE LLMs](examples/quantizing_moe/README.md)
  * [Quantizing Vision-Language Models](examples/multimodal_vision/README.md)
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# Attention Quantization in LLM Compressor #
LLM Compressor supports applying static attention quantization to models.

## Per-Head FP8 Attention Example ##
For an example applying attention quantization, see [llama3_attention.py](/examples/quantization_attention/llama3_attention.py).

```python
recipe = QuantizationModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=["LlamaAttention"],
            input_activations=QuantizationArgs(
                num_bits=8, type="float", strategy="attn_head"
            ),
        )
    }
)
```

Accuracy should be almost identical to the base model for FP8 attention.
Note that attention quantization also implicitly applies kv cache quantization with the same quantization arguments.
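The idea behind the `attn_head` strategy in the recipe above can be sketched outside of LLM Compressor: compute one scale per attention head rather than one for the whole tensor. This is a minimal numpy illustration, not library internals; the shapes and the FP8 maximum are assumptions.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max finite magnitude of float8_e4m3 (assumed here)


def head_scales(x):
    # x: activations shaped [num_heads, seq_len, head_dim].
    # Per-head strategy computes one scale per attention head,
    # instead of a single scale for the whole tensor.
    absmax = np.abs(x).reshape(x.shape[0], -1).max(axis=1)
    return absmax / FP8_E4M3_MAX


rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 128, 64)).astype(np.float32)
acts[3] *= 50.0  # one head with a much wider dynamic range

scales = head_scales(acts)
print(scales.shape)  # (8,) -- one scale per head
```

A head with a wide dynamic range gets its own larger scale, so it no longer stretches the quantization grid of the other heads.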

experimental/attention/llama3_attention.py renamed to examples/quantization_attention/llama3_attention.py

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
+ from compressed_tensors.offload import dispatch_model
  from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
  from datasets import load_dataset
  from transformers import AutoModelForCausalLM, AutoTokenizer

  from llmcompressor import oneshot
  from llmcompressor.modifiers.quantization import QuantizationModifier
- from compressed_tensors.offload import dispatch_model

  # Select model and load it.
  model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

@@ -56,7 +56,7 @@ def tokenize(sample):
          "attention": QuantizationScheme(
              targets=["LlamaAttention"],
              input_activations=QuantizationArgs(
-                 num_bits=8, type="float", strategy="attn_head"
+                 num_bits=8, type="float", strategy="tensor"
              ),
          )
      }

@@ -82,6 +82,6 @@ def tokenize(sample):
  print("==========================================\n\n")

  # Save to disk compressed.
- SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-attention-fp8-head"
+ SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-attention-fp8"
  model.save_pretrained(SAVE_DIR, save_compressed=True)
  tokenizer.save_pretrained(SAVE_DIR)
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
from compressed_tensors.offload import dispatch_model
from compressed_tensors.quantization import QuantizationArgs
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Select model and load it.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
    kv_cache_scheme=QuantizationArgs(num_bits=8, type="float", strategy="attn_head"),
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_model(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + "-fp8-kv-head"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
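As a back-of-the-envelope check on why the `strategy="attn_head"` KV cache scheme used above helps, the round-trip error of per-tensor versus per-head scaling can be compared on synthetic activations. This is a sketch using a uniform symmetric 8-bit grid as a stand-in for FP8; shapes and values are illustrative.

```python
import numpy as np


def fake_quant(x, scale, qmax=127):
    # symmetric quantize/dequantize round trip on a uniform 8-bit grid
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128, 64)).astype(np.float32)
kv[3] *= 50.0  # one outlier head stretches a shared per-tensor scale

# One scale for the whole tensor vs. one scale per head
per_tensor = fake_quant(kv, np.abs(kv).max() / 127)
head_scale = np.abs(kv).reshape(8, -1).max(axis=1) / 127
per_head = fake_quant(kv, head_scale[:, None, None])

err_tensor = np.mean((kv - per_tensor) ** 2)
err_head = np.mean((kv - per_head) ** 2)
print(err_head < err_tensor)  # True: per-head scales shrink the error
```

The non-outlier heads quantize on a much finer grid when each head carries its own scale, which is the motivation for extending KV cache quantization beyond per-tensor scales.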

experimental/attention/README.md

Lines changed: 1 addition & 19 deletions
@@ -1,23 +1,5 @@
  # Attention Quantization in LLM Compressor #
- LLM Compressor supports applying static attention quantization to models. Please note that attention quantization support in vLLM is still ongoing and is not fully supported as of this writing.
-
- ## FP8 Attention Example ##
- For an example applying attention quantization, see [llama3_attention.py](/experimental/attention/llama3_attention.py).
-
- ```python
- recipe = QuantizationModifier(
-     config_groups={
-         "attention": QuantizationScheme(
-             targets=["LlamaAttention"],
-             input_activations=QuantizationArgs(
-                 num_bits=8, type="float", strategy="attn_head"
-             ),
-         )
-     }
- )
- ```
-
- Note that attention quantization also implicitly applies kv cache quantization with the same quantization arguments.
+ LLM Compressor supports applying static attention quantization to models. Please note that NVFP4 attention quantization and R3 support in vLLM are still ongoing and not fully supported as of this writing.

  ## NVFP4 Attention + R3 Example ##
  Attention quantization can be improved using the R3 transform, as described by [SpinQuant](https://arxiv.org/abs/2405.16406). This transform reduces the presence of outliers in the attention activation distribution, thereby improving accuracy recovery.
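The outlier-reduction effect of such rotations can be seen with a tiny experiment. This sketch uses a Sylvester-construction Hadamard rotation as a stand-in for R3; it is illustrative, not the SpinQuant implementation.

```python
import numpy as np


def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


n = 64
x = np.zeros(n)
x[0] = 100.0  # a single large activation outlier

R = hadamard(n) / np.sqrt(n)  # orthogonal: norm-preserving and invertible
y = R @ x

print(np.abs(x).max())  # 100.0
print(np.abs(y).max())  # 12.5, i.e. 100 / sqrt(64)
```

The rotation spreads the outlier's energy evenly across all coordinates while preserving the vector's norm, so the quantization grid no longer has to cover a single extreme value.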
