This is a simple example to demonstrate calibrating and serving ModelOpt fakequant models in vLLM.
Compared with realquant, fakequant is 2-5x slower, but doesn't require dedicated kernel support and facilitates research.
This example is tested with vLLM 0.9.0 and 0.19.1.
Follow the instructions below to build a Docker environment, or install vllm with pip.
docker build -f examples/vllm_serve/Dockerfile -t vllm-modelopt .
Step 1: Configure quantization settings.
You can either edit the quant_config dictionary in vllm_serve_fakequant.py, or set the following environment variables to control quantization behavior:
| Variable | Description | Default |
|---|---|---|
| QUANT_DATASET | Dataset name for calibration | cnn_dailymail |
| QUANT_CALIB_SIZE | Number of samples used for calibration | 512 |
| QUANT_CFG | Quantization config | None |
| KV_QUANT_CFG | KV-cache quantization config | None |
| QUANT_FILE_PATH | Optional path to exported quantizer state dict quantizer_state.pth | None |
| MODELOPT_STATE_PATH | Optional path to exported vllm_fq_modelopt_state.pth (restores quantizer state and parameters) | None |
| CALIB_BATCH_SIZE | Calibration batch size | 1 |
| RECIPE_PATH | Optional path to a ModelOpt PTQ recipe YAML | None |
Set these variables in your shell or Docker environment as needed to customize calibration.
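As a rough illustration of how the variables in the table could be consumed, the sketch below reads them from the environment and falls back to the documented defaults. The helper `read_quant_env` and the `DEFAULTS` / `INT_KEYS` names are hypothetical and exist only for this example; the actual parsing in vllm_serve_fakequant.py may differ.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above. The helper itself is hypothetical
# and is not taken from vllm_serve_fakequant.py.
DEFAULTS = {
    "QUANT_DATASET": "cnn_dailymail",
    "QUANT_CALIB_SIZE": "512",
    "QUANT_CFG": None,
    "KV_QUANT_CFG": None,
    "QUANT_FILE_PATH": None,
    "MODELOPT_STATE_PATH": None,
    "CALIB_BATCH_SIZE": "1",
    "RECIPE_PATH": None,
}

# Variables that the table describes as counts or sizes.
INT_KEYS = {"QUANT_CALIB_SIZE", "CALIB_BATCH_SIZE"}


def read_quant_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the quantization settings, using table defaults when unset."""
    env = os.environ if env is None else env
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, default)
        if raw is not None and name in INT_KEYS:
            settings[name] = int(raw)  # environment values arrive as strings
        else:
            settings[name] = raw
    return settings
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a combination of settings before launching the server.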
Step 2: Run the following command, which accepts all flags supported by `vllm serve`:
python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
Step 3: Test the API server with curl:
curl -X POST "http://127.0.0.1:8000/v1/chat/completions" -H "Content-Type: application/json" -d '{
"model": "<model_path>",
"messages": [
{"role": "user", "content": "Hi, what is your name"}
],
"max_tokens": 8
}'
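The same request can be sent from Python with only the standard library. This is a minimal sketch that mirrors the curl payload above; the function names `build_chat_request` and `send` are illustrative, and the host and port are assumed to match the serve command.

```python
import json
import urllib.request


def build_chat_request(model, prompt, max_tokens=8,
                       base_url="http://127.0.0.1:8000"):
    """Build a POST request equivalent to the curl command above."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_chat_request("<model_path>", "Hi, what is your name"))` should return the decoded response. Under the OpenAI-compatible schema the generated text is normally found at `reply["choices"][0]["message"]["content"]`.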
Step 4 (Optional): Run an evaluation with lm_eval:
lm_eval --model local-completions --tasks gsm8k --model_args model=<model_name>,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False,batch_size=128,tokenizer_backend=None
Step 1: Export the model with bf16 weights and quantizer state. To export the model:
- For HF models, use `examples/llm_ptq/hf_ptq.py` with `--vllm_fakequant_export`:
python ../llm_ptq/hf_ptq.py \
--pyt_ckpt_path <MODEL_PATH> \
--recipe <PATH_TO_RECIPE> \
--calib_size 512 \
--export_path <EXPORT_DIR> \
--vllm_fakequant_export \
--trust_remote_code
This creates `<EXPORT_DIR>/vllm_fq_modelopt_state.pth` (ModelOpt quantizer state for vLLM fake-quant reload) and saves the HF-exported model under `<EXPORT_DIR>` (config/tokenizer/weights).
Note: --pyt_ckpt_path can point to either an HF checkpoint or a ModelOpt-saved checkpoint (e.g., a QAT/QAD checkpoint produced by examples/llm_qat/main.py). If the input checkpoint is already quantized, the script will skip re-quantization and only export artifacts for vLLM fakequant reload.
- For MCore models, export the model with the flag `--export-vllm-fq` as described in the Megatron-LM README. This generates `quantizer_state.pth`, which contains quantizer tensors for vLLM reload via `QUANT_FILE_PATH`.
Step 2: Use the exported artifacts when serving:
- HF export: pass the exported `vllm_fq_modelopt_state.pth` via `MODELOPT_STATE_PATH`
# HF
MODELOPT_STATE_PATH=<vllm_fq_modelopt_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
- MCore export: pass the exported `quantizer_state.pth` via `QUANT_FILE_PATH` and set `QUANT_CFG` to match the MCore quantization recipe
# MCore
QUANT_CFG=<quant_cfg> QUANT_FILE_PATH=<quantizer_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
- MCore reload does not use `MODELOPT_STATE_PATH`; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won’t align).
- AWQ reload is not supported yet.
- KV cache quantization export and reload is not supported in MCore yet.
- `NVFP4_KV_CFG` and `NVFP4_AFFINE_KV_CFG` require `--enforce-eager`; these configs use a dynamic-block Triton kernel for KV-cache quantization that is incompatible with CUDA graph capture (the kernel grid is computed from Python-level tensor shapes, which get baked in at capture time). Without `--enforce-eager`, the captured grid will be wrong for different batch sizes, producing incorrect outputs.
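The effect of a grid that is fixed at capture time can be illustrated with a toy model in plain Python. This is only a conceptual sketch: it uses no Triton or CUDA, the class `CapturedLaunch` and function `eager_launch` are hypothetical, and doubling each value stands in for the quantization kernel. The failure mode of the real kernel may look different.

```python
def ceil_div(a, b):
    """Number of blocks of size b needed to cover a elements."""
    return -(-a // b)


class CapturedLaunch:
    """Toy stand-in for a graph-captured launch: the grid is computed once."""

    def __init__(self, num_tokens, block_size):
        self.block_size = block_size
        # Computed from the shape seen at capture time, then frozen.
        self.grid = ceil_div(num_tokens, block_size)

    def replay(self, values):
        # The stale grid only covers grid * block_size elements.
        covered = min(len(values), self.grid * self.block_size)
        # Elements past `covered` are never visited and stay unprocessed.
        return [v * 2 for v in values[:covered]] + list(values[covered:])


def eager_launch(values, block_size):
    """Eager mode: the grid is recomputed from the current shape each call."""
    grid = ceil_div(len(values), block_size)
    covered = min(len(values), grid * block_size)
    return [v * 2 for v in values[:covered]] + list(values[covered:])
```

In this toy model, a launch captured with 4 tokens and replayed on 8 tokens processes only the first block, while `eager_launch` recomputes the grid and processes every element. That mirrors why these configs need `--enforce-eager`.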