llm-compressor supports AutoRound, an advanced post-training quantization technique that delivers high accuracy at low bit widths. The quantized results are fully compatible with compressed-tensors and can be served directly with vLLM.
AutoRound introduces three trainable parameters (V, α, and β) to optimize rounding values and clipping ranges during quantization. The method processes each decoder layer sequentially, using block-wise output reconstruction error as the training objective to fine-tune these parameters. This approach combines the efficiency of post-training quantization with the adaptability of parameter tuning, delivering robust compression for large language models while maintaining strong performance.
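To make the roles of the three parameters concrete, here is a minimal NumPy sketch of the quantize/dequantize step that AutoRound learns. The parameter names `V`, `alpha`, and `beta` follow the description above; the tensor shapes, the asymmetric int4 scheme, and the plain round-to-nearest baseline are illustrative assumptions, not the library's actual implementation (which tunes these parameters per decoder layer against a block-wise output reconstruction loss):

```python
import numpy as np

def autoround_quant_dequant(W, V, alpha, beta, bits=4):
    """Illustrative weight quantize/dequantize with AutoRound-style parameters.

    V           : per-element rounding offset that nudges round-to-nearest
    alpha, beta : scalars that rescale the observed max/min, i.e. the
                  learned clipping range
    (In AutoRound these are trained to minimize the decoder block's
    output reconstruction error; here they are just passed in.)
    """
    qmax = 2 ** (bits - 1) - 1          # e.g.  7 for signed int4
    qmin = -(2 ** (bits - 1))           # e.g. -8

    # Learned clipping range: shrink/expand the observed extrema.
    w_max = W.max() * alpha
    w_min = W.min() * beta
    scale = (w_max - w_min) / (qmax - qmin)
    zero = qmin - w_min / scale

    # Rounding with the learned offset V instead of plain round-to-nearest.
    q = np.clip(np.round(W / scale + zero + V), qmin, qmax)
    return (q - zero) * scale           # dequantized weights

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)

# With V = 0 and alpha = beta = 1 this reduces to ordinary RTN quantization.
W_rtn = autoround_quant_dequant(W, V=np.zeros_like(W), alpha=1.0, beta=1.0)
err = np.abs(W - W_rtn).mean()
```

Training then amounts to adjusting `V`, `alpha`, and `beta` so that the quantized layer's outputs match the original layer's outputs on calibration data, one decoder block at a time.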
To get started, install llm-compressor from source:

```shell
git clone https://github.com/vllm-project/llm-compressor.git
cd llm-compressor
pip install -e .
```

The example includes end-to-end scripts for applying the AutoRound quantization algorithm.
```shell
python3 llama3.1_example.py
```

The resulting model, `Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound`, is ready to be loaded into vLLM.
With the model created, we can now load and run it in vLLM (after installing vLLM):

```python
from vllm import LLM

model = LLM("./Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound")
```

Note: quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
Run the following to test accuracy on GSM-8K:
```shell
lm_eval --model vllm \
  --model_args pretrained="./Meta-Llama-3.1-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 'auto'
```

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.7710 | ± | 0.0116 |
| | | strict-match | 5 | exact_match | ↑ | 0.7043 | ± | 0.0126 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.7248 | ± | 0.0123 |
| | | strict-match | 5 | exact_match | ↑ | 0.6611 | ± | 0.0130 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.7362 | ± | 0.0121 |
| | | strict-match | 5 | exact_match | ↑ | 0.6702 | ± | 0.0129 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.7210 | ± | 0.0124 |
| | | strict-match | 5 | exact_match | ↑ | 0.6945 | ± | 0.0127 |
Note: quantized model accuracy may vary slightly due to nondeterminism.
```shell
python3 qwen3_vl_example.py
```

The resulting model, `Qwen3-VL-8B-Instruct-NVFP4-AutoRound`, is ready to be loaded into vLLM.
Run the following to test accuracy on GSM-8K and ChartQA:
```shell
lm_eval --model vllm-vlm \
  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 'auto'
```

```shell
lm_eval --model vllm-vlm \
  --model_args pretrained="./Qwen3-VL-8B-Instruct-NVFP4-AutoRound",add_bos_token=true \
  --tasks chartqa \
  --batch_size 'auto' \
  --apply_chat_template
```

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.8628 | ± | 0.0095 |
| | | strict-match | 5 | exact_match | ↑ | 0.8453 | ± | 0.0100 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| chartqa | 0 | none | 0 | anywhere_accuracy | ↑ | 0.7908 | ± | 0.0081 |
| | | none | 0 | exact_match | ↑ | 0.5592 | ± | 0.0099 |
| | | none | 0 | relaxed_accuracy | ↑ | 0.7696 | ± | 0.0084 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | ↑ | 0.8415 | ± | 0.0101 |
| | | strict-match | 5 | exact_match | ↑ | 0.8408 | ± | 0.0101 |

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| chartqa | 0 | none | 0 | anywhere_accuracy | ↑ | 0.8220 | ± | 0.0077 |
| | | none | 0 | exact_match | ↑ | 0.5748 | ± | 0.0099 |
| | | none | 0 | relaxed_accuracy | ↑ | 0.8044 | ± | 0.0079 |
Note: quantized model accuracy may vary slightly due to nondeterminism.
If you have questions or run into problems, please open an issue on vllm-project/llm-compressor or intel/auto-round.