This repository contains code necessary to reproduce the experimental results presented in our paper.
- Sep 2025: Q-Palette is accepted to NeurIPS 2025.
We introduce Q-Palette, a versatile collection of fractional-bit quantizers, ranging from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Built on Q-Palette, we propose fusion-aware MSQ, a novel mixed-scheme quantization (MSQ) framework that jointly optimizes quantizer selection and layer fusion.
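For intuition only, the sketch below illustrates the general idea behind mixed-scheme quantization: each layer picks one quantizer option from a palette, trading reconstruction error against bits under a global bit budget. Everything here is hypothetical — the layer names, candidate quantizers, error numbers, and the simple Lagrangian bisection rule are illustrative and are not the paper's actual solver (which also accounts for layer fusion and is implemented in `solve_mem_const.py` / `solve_lat_const.py`).

```python
# Hypothetical sketch of mixed-scheme quantizer selection (NOT the paper's solver).
# Each layer picks one (quantizer, bitwidth, proxy error) option; a Lagrange
# multiplier on bits is bisected until the average bitwidth meets the budget.
from typing import Dict, List, Tuple

# Illustrative candidates per layer: (name, bitwidth, proxy reconstruction error).
CANDIDATES: Dict[str, List[Tuple[str, float, float]]] = {
    "q_proj":    [("tcq-2.25", 2.25, 0.09), ("tcq-3.25", 3.25, 0.03), ("vq-4.0", 4.0, 0.015)],
    "k_proj":    [("tcq-2.25", 2.25, 0.07), ("tcq-3.25", 3.25, 0.02), ("vq-4.0", 4.0, 0.010)],
    "gate_proj": [("tcq-2.25", 2.25, 0.12), ("tcq-3.25", 3.25, 0.05), ("vq-4.0", 4.0, 0.020)],
}

def select(lmbda: float) -> Dict[str, Tuple[str, float, float]]:
    """Per layer, pick the option minimizing error + lmbda * bits."""
    return {layer: min(opts, key=lambda o: o[2] + lmbda * o[1])
            for layer, opts in CANDIDATES.items()}

def solve(target_avg_bits: float, iters: int = 50) -> Dict[str, Tuple[str, float, float]]:
    lo, hi = 0.0, 1.0            # bisection range for the bit-penalty multiplier
    best = select(hi)            # feasible for this toy data (picks the fewest bits)
    for _ in range(iters):
        mid = (lo + hi) / 2
        choice = select(mid)
        avg_bits = sum(o[1] for o in choice.values()) / len(choice)
        if avg_bits > target_avg_bits:
            lo = mid             # over budget -> penalize bits more
        else:
            hi, best = mid, choice  # feasible -> try spending more bits on error
    return best

if __name__ == "__main__":
    for layer, (name, bits, err) in solve(target_avg_bits=3.25).items():
        print(f"{layer}: {name} ({bits} bits, proxy error {err})")
```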
We provide our machine information below for reference to facilitate reproduction.
- GPU: NVIDIA RTX 4090
- CPU: AMD EPYC 7B13 64-Core Processor
- OS: Ubuntu 22.04.5
- CUDA Version: 12.4
- Initialize environment and install CUDA kernels:
  Navigate to `kernels/` and follow the instructions in the `README.md` file there. This step sets up the conda environment `qpal` and installs the optimized CUDA kernels for Q-Palette.
- Install Python dependencies:
  Return to the current directory and run:
  ```bash
  pip install -r requirements.txt
  ```
- Initialize environment variables:
  ```bash
  export CUDA_HOME=<YOUR_CUDA12.4_HOME>
  export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:.venv/lib/python3.11/site-packages/torch/lib
  ```
- Install Python dependencies and CUDA kernels:
  ```bash
  uv sync --preview-features extra-build-dependencies --extra kernels
  source .venv/bin/activate
  ```
  A quick sanity check of the resulting environment is sketched below.
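After setup, you can verify that PyTorch sees CUDA with a short snippet (this check is not part of the repository; it only uses standard PyTorch APIs):

```python
# Quick environment sanity check (not part of the repository): verifies that
# PyTorch was installed with CUDA support and that a GPU is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA (build) version:", torch.version.cuda)
    print("device:", torch.cuda.get_device_name(0))
```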
| Type | Models | 🤗 Hugging Face |
|---|---|---|
| Data-free quantization (Figures 1, 5; Table 3) | Llama-3.1-8B, Llama-3.1-70B, Llama-3.2-1B, Llama-3.2-3B, Qwen-2.5-7B | Link |
| Data-aware quantization (Table 4) | Llama-2-7b-hf, Llama-2-13b-hf | Link |
Example usage:
```bash
python solve_mem_const.py --model meta-llama/Llama-3.1-8B --target_bitwidth 3.25
```

- `--model`: Hugging Face model name or path
- `--target_bitwidth`: Target average bitwidth for the memory constraint
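As a rough sanity check on what the target bitwidth implies for weight memory (a back-of-envelope estimate only; it ignores embeddings, quantizer metadata such as scales, and the KV cache):

```python
# Back-of-envelope weight-memory estimate for a given average bitwidth
# (ignores embeddings, quantizer metadata such as scales, and the KV cache).
num_params = 8.0e9  # roughly the number of linear-layer parameters in an 8B model
for avg_bits in (16, 4.0, 3.25, 2.25):
    gib = num_params * avg_bits / 8 / 2**30
    print(f"{avg_bits:>5} bits/param -> ~{gib:.2f} GiB of weights")
```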
The resulting quantization configuration is stored at:
`codes/msq_results/3_8b/mem_constrained/default/3.25bit.pt`
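If you want to peek at the saved configuration, `torch.load` works; note that the dictionary layout is codebase-specific (whatever `solve_mem_const.py` produced), so the sketch below only prints whatever is there and does not assume a documented schema:

```python
# Inspect the saved quantization configuration (structure is codebase-specific).
import torch

qdict = torch.load(
    "codes/msq_results/3_8b/mem_constrained/default/3.25bit.pt",
    map_location="cpu",
    weights_only=False,  # the file may hold plain Python objects, not only tensors
)
print(type(qdict))
if isinstance(qdict, dict):
    for i, (key, value) in enumerate(qdict.items()):
        print(key, "->", value)
        if i >= 9:  # only show the first few entries
            break
```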
Evaluate the resulting model's perplexity on WikiText2 (the first quantization takes about 1–2 hours):

```bash
python eval_qdict.py --qdict_path msq_results/3_8b/mem_constrained/default/3.25bit.pt
```

Expected perplexity: ~6.10
Example usage:
```bash
python solve_lat_const.py --target_thp 200
```

- `--target_thp`: Target throughput (tokens/sec) at batch size 1 on an RTX 4090 GPU.
- `--use_cc`: Flag that enables the CUDA-core implementation (optional).
- `--no_fuse`: Flag that disables fusion-aware MSQ.
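For intuition on how a throughput target relates to bitwidth (not from the paper; a crude memory-bandwidth roofline that ignores the KV cache, activations, and kernel overheads): at batch size 1, each generated token must stream the quantized weights from GPU memory once, so throughput is at most bandwidth divided by weight bytes.

```python
# Crude roofline (illustrative only): throughput <= memory_bandwidth / weight_bytes.
# Inverting this gives the largest average bitwidth compatible with a target.
BANDWIDTH = 1008e9   # approximate RTX 4090 memory bandwidth, bytes/sec
NUM_PARAMS = 8.0e9   # roughly the linear-layer parameters of Llama-3.1-8B
target_thp = 200.0   # tokens/sec, as in --target_thp 200
max_avg_bits = BANDWIDTH / target_thp / NUM_PARAMS * 8
print(f"average bitwidth must be <= ~{max_avg_bits:.1f} bits/param to sustain "
      f"{target_thp:.0f} tokens/sec (ideal bound; real overheads lower this)")
```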
Solving is quick; the resulting configurations are saved at:
- `codes/msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt`
- `codes/msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp_merge_info.pt`
Evaluate WikiText2 perplexity (the first quantization takes about 1–2 hours):

```bash
python eval_qdict.py --qdict_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt
```

Expected perplexity: ~6.37
Evaluate throughput:
```bash
python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp_merge_info.pt \
    --print_result
```

(You may need to modify `PYTHONPATH` to contain this codebase, i.e., `export PYTHONPATH=./:$PYTHONPATH`.)
Expected throughput: ~190–200 tokens/sec (RTX 4090 GPU)
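For a rough point of comparison, the snippet below measures batch-1 decode throughput of the unquantized FP16 model with plain 🤗 Transformers. This is not the repository's measurement script; the prompt, token count, and generation settings are arbitrary, and the resulting number will be well below the quantized-kernel throughput reported above.

```python
# Rough batch-1 decode-throughput check with plain Transformers (FP16 baseline,
# NOT the repository's optimized quantized kernels); absolute numbers will differ.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

inputs = tok("The capital of France is", return_tensors="pt").to("cuda")
new_tokens = 256
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(
    **inputs,
    max_new_tokens=new_tokens,
    min_new_tokens=new_tokens,  # force a fixed number of decode steps
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"~{new_tokens / elapsed:.1f} tokens/sec (FP16 baseline)")
```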
Evaluate perplexity:
```bash
python eval_qdict.py --quantizer_key tcq-3.25
```

Evaluate throughput:
```bash
python eval/measure_latency.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --quantizer_str tcq-3.25 \
    --use_inc_mlp --use_inc_attn --print_result
```

We provide the exact quantization configuration used for Figure 1 (c) in this codebase as follows:
Evaluate perplexity:
```bash
python eval_qdict.py --qdict_path msq_results/figure1c/0.0_8.0bit_1.11.pt
```

Evaluate throughput:
```bash
python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/figure1c/0.0_8.0bit_1.11.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/figure1c/0.0_8.0bit_1.11_merge_info.pt \
    --print_result
```

We provide the exact quantization configuration used for Figure 1 (d) in this codebase as follows:
Evaluate perplexity:
```bash
python eval_qdict.py --qdict_path msq_results/figure1d/0.0_8.0bit_1.17.pt
```

Evaluate throughput:
```bash
python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/figure1d/0.0_8.0bit_1.17.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/figure1d/0.0_8.0bit_1.17_merge_info.pt \
    --print_result
```

The current release provides a minimal implementation to reproduce the main results of the paper.
Planned updates include:
- Setup support with `uv`
- Upload Hugging Face checkpoints for some key results
- Add tutorials and usage examples for practitioners
- Broader model, kernel, and usability support
- Clean evaluation code
- Release code for loss and cost term computation
Stay tuned!
```bibtex
@inproceedings{lee2025qpalette,
  title={Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment},
  author={Deokjae Lee and Hyun Oh Song},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
}
```