This repository contains code necessary to reproduce the experimental results presented in our paper.
- Sep 2025: Q-Palette is accepted to NeurIPS 2025.
We introduce Q-Palette, a versatile collection of fractional-bit quantizers, ranging from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Built on Q-Palette, we propose fusion-aware MSQ, a novel mixed-scheme quantization (MSQ) framework that jointly optimizes quantizer selection and layer fusion.
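For intuition only, the sketch below illustrates the general idea behind mixed-scheme quantization: each layer picks one quantizer option from a palette, trading reconstruction error against bits under a global bit budget. Everything here is hypothetical — the layer names, candidate quantizers, error numbers, and the simple Lagrangian bisection rule are illustrative and are not the paper's actual solver (which also accounts for layer fusion and is implemented in `solve_mem_const.py` / `solve_lat_const.py`).

```python
# Hypothetical sketch of mixed-scheme quantizer selection (NOT the paper's solver).
# Each layer picks one (quantizer, bitwidth, proxy error) option; a Lagrange
# multiplier on bits is bisected until the average bitwidth meets the budget.
from typing import Dict, List, Tuple

# Illustrative candidates per layer: (name, bitwidth, proxy reconstruction error).
CANDIDATES: Dict[str, List[Tuple[str, float, float]]] = {
    "q_proj":    [("tcq-2.25", 2.25, 0.09), ("tcq-3.25", 3.25, 0.03), ("vq-4.0", 4.0, 0.015)],
    "k_proj":    [("tcq-2.25", 2.25, 0.07), ("tcq-3.25", 3.25, 0.02), ("vq-4.0", 4.0, 0.010)],
    "gate_proj": [("tcq-2.25", 2.25, 0.12), ("tcq-3.25", 3.25, 0.05), ("vq-4.0", 4.0, 0.020)],
}

def select(lmbda: float) -> Dict[str, Tuple[str, float, float]]:
    """Per layer, pick the option minimizing error + lmbda * bits."""
    return {layer: min(opts, key=lambda o: o[2] + lmbda * o[1])
            for layer, opts in CANDIDATES.items()}

def solve(target_avg_bits: float, iters: int = 50) -> Dict[str, Tuple[str, float, float]]:
    lo, hi = 0.0, 1.0            # bisection range for the bit-penalty multiplier
    best = select(hi)            # feasible for this toy data (picks the fewest bits)
    for _ in range(iters):
        mid = (lo + hi) / 2
        choice = select(mid)
        avg_bits = sum(o[1] for o in choice.values()) / len(choice)
        if avg_bits > target_avg_bits:
            lo = mid             # over budget -> penalize bits more
        else:
            hi, best = mid, choice  # feasible -> try spending more bits on error
    return best

if __name__ == "__main__":
    for layer, (name, bits, err) in solve(target_avg_bits=3.25).items():
        print(f"{layer}: {name} ({bits} bits, proxy error {err})")
```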
We provide our machine information below for reference to facilitate reproduction.
- GPU: NVIDIA RTX 4090
- CPU: AMD EPYC 7B13 64-Core Processor
- OS: Ubuntu 22.04.5
- CUDA Version: 12.4
- Initialize environment and install CUDA kernels:
  Navigate to `kernels/` and follow the instructions in the `README.md` file there. This step sets up the conda environment `qpal` and installs the optimized CUDA kernels for Q-Palette.
- Install Python dependencies:
  Return to the current directory and run:
  ```bash
  pip install -r requirements.txt
  ```
- Initialize environment variables:
  ```bash
  export CUDA_HOME=<YOUR_CUDA12.4_HOME>
  export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:.venv/lib/python3.11/site-packages/torch/lib
  ```
- Install Python dependencies and CUDA kernels:
  ```bash
  uv sync --preview-features extra-build-dependencies --extra kernels
  source .venv/bin/activate
  ```
  A quick sanity check of the resulting environment is sketched below.
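After setup, you can verify that PyTorch sees CUDA with a short snippet (this check is not part of the repository; it only uses standard PyTorch APIs):

```python
# Quick environment sanity check (not part of the repository): verifies that
# PyTorch was installed with CUDA support and that a GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA (build) version:", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))
```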
| Type | Models | 🤗 Hugging Face |
|---|---|---|
| Data-free quantization (Figures 1, 5; Table 3) | Llama-3.1-8B, Llama-3.1-70B, Llama-3.2-1B, Llama-3.2-3B, Qwen-2.5-7B | Link |
| Data-aware quantization (Table 4) | Llama-2-7b-hf, Llama-2-13b-hf | Link |
Example usage:
```bash
python solve_mem_const.py --model meta-llama/Llama-3.1-8B --target_bitwidth 3.25
```

- `--model`: Hugging Face model name or path
- `--target_bitwidth`: Target average bitwidth for the memory constraint
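As a rough sanity check on what the target bitwidth implies for weight memory (a back-of-envelope estimate only; it ignores embeddings, quantizer metadata such as scales, and the KV cache):

```python
# Back-of-envelope weight-memory estimate for a given average bitwidth
# (ignores embeddings, quantizer metadata such as scales, and the KV cache).
num_params = 8.0e9  # roughly the number of linear-layer parameters in an 8B model
for avg_bits in (16, 4.0, 3.25, 2.25):
    gib = num_params * avg_bits / 8 / 2**30
    print(f"{avg_bits:>5} bits/param -> ~{gib:.2f} GiB of weights")
```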
The resulting quantization configuration is stored at:
`codes/msq_results/3_8b/mem_constrained/default/3.25bit.pt`
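If you want to peek at the saved configuration, `torch.load` works; note that the dictionary layout is codebase-specific (whatever `solve_mem_const.py` produced), so the sketch below only prints whatever is there and does not assume a documented schema:

```python
# Inspect the saved quantization configuration (structure is codebase-specific).
import torch

qdict = torch.load(
    "codes/msq_results/3_8b/mem_constrained/default/3.25bit.pt",
    map_location="cpu",
    weights_only=False,  # the file may hold plain Python objects, not only tensors
)
print(type(qdict))
if isinstance(qdict, dict):
    for i, (key, value) in enumerate(qdict.items()):
        print(key, "->", value)
        if i >= 9:  # only show the first few entries
            break
```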
Evaluate the resulting model's perplexity on WikiText2 (the first quantization takes about 1–2 hours):

```bash
python eval_qdict.py --qdict_path msq_results/3_8b/mem_constrained/default/3.25bit.pt
```

Expected perplexity: ~6.10
Example usage:
```bash
python solve_lat_const.py --target_thp 200
```

- `--target_thp`: Target throughput (tokens/sec) at batch size 1 on an RTX 4090 GPU.
- `--use_cc`: Flag that enables the CUDA-core implementation (optional).
- `--no_fuse`: Flag that disables fusion-aware MSQ.
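For intuition on how a throughput target relates to bitwidth (not from the paper; a crude memory-bandwidth roofline that ignores the KV cache, activations, and kernel overheads): at batch size 1, each generated token must stream the quantized weights from GPU memory once, so throughput is at most bandwidth divided by weight bytes.

```python
# Crude roofline (illustrative only): throughput <= memory_bandwidth / weight_bytes.
# Inverting this gives the largest average bitwidth compatible with a target.
BANDWIDTH = 1008e9   # approximate RTX 4090 memory bandwidth, bytes/sec
NUM_PARAMS = 8.0e9   # roughly the linear-layer parameters of Llama-3.1-8B
target_thp = 200.0   # tokens/sec, as in --target_thp 200
max_avg_bits = BANDWIDTH / target_thp / NUM_PARAMS * 8
print(f"average bitwidth must be <= ~{max_avg_bits:.1f} bits/param to sustain "
      f"{target_thp:.0f} tokens/sec (ideal bound; real overheads lower this)")
```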
Solving is quick; the resulting configurations are saved at:
- `codes/msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt`
- `codes/msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp_merge_info.pt`
Evaluate WikiText2 perplexity (the first quantization takes about 1–2 hours):

```bash
python eval_qdict.py --qdict_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt
```

Expected perplexity: ~6.37
Evaluate throughput:
```bash
python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp_merge_info.pt \
    --print_result
```

(You may need to modify `PYTHONPATH` to contain this codebase, i.e., `export PYTHONPATH=./:$PYTHONPATH`.)
Expected throughput: ~190–200 tokens/sec (RTX 4090 GPU)
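For a rough point of comparison, the snippet below measures batch-1 decode throughput of the unquantized FP16 model with plain 🤗 Transformers. This is not the repository's measurement script; the prompt, token count, and generation settings are arbitrary, and the resulting number will be well below the quantized-kernel throughput reported above.

```python
# Rough batch-1 decode-throughput check with plain Transformers (FP16 baseline,
# NOT the repository's optimized quantized kernels); absolute numbers will differ.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

inputs = tok("The capital of France is", return_tensors="pt").to("cuda")
new_tokens = 256
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(
    **inputs,
    max_new_tokens=new_tokens,
    min_new_tokens=new_tokens,  # force a fixed number of decode steps
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"~{new_tokens / elapsed:.1f} tokens/sec (FP16 baseline)")
```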
Evaluate perplexity:
```bash
python eval_qdict.py --quantizer_key tcq-3.25
```

Evaluate throughput:
```bash
python eval/measure_latency.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --quantizer_str tcq-3.25 \
    --use_inc_mlp --use_inc_attn --print_result
```

We provide the exact quantization configuration used for Figure 1 (c) in this codebase as follows:
Evaluate perplexity:
```bash
python eval_qdict.py --qdict_path msq_results/figure1c/0.0_8.0bit_1.11.pt
```

Evaluate throughput:
```bash
python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/figure1c/0.0_8.0bit_1.11.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/figure1c/0.0_8.0bit_1.11_merge_info.pt \
    --print_result
```

We provide the exact quantization configuration used for Figure 1 (d) in this codebase as follows:
Evaluate perplexity:
```bash
python eval_qdict.py --qdict_path msq_results/figure1d/0.0_8.0bit_1.17.pt
```

Evaluate throughput:
```bash
python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/figure1d/0.0_8.0bit_1.17.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/figure1d/0.0_8.0bit_1.17_merge_info.pt \
    --print_result
```

The current release provides a minimal implementation to reproduce the main results of the paper.
Planned updates include:
- Setup support with `uv`
- Upload Hugging Face checkpoints for some key results
- Add tutorials and usage examples for practitioners
- Broader model, kernel, and usability support
- Clean evaluation code
- Release code for loss and cost term computation
Stay tuned!
```bibtex
@inproceedings{lee2025qpalette,
  title={Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment},
  author={Deokjae Lee and Hyun Oh Song},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
}
```