Q-Palette

This repository contains the official PyTorch implementation of "Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment" (NeurIPS 2025), along with the code necessary to reproduce the experimental results presented in the paper.

News

  • Sep 2025: Q-Palette is accepted to NeurIPS 2025.

Overview

We introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Built on Q-Palette, we propose a novel mixed-scheme quantization (MSQ) framework, fusion-aware MSQ, that jointly optimizes quantizer selection and layer fusion.
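
The solvers themselves are described in the paper; the short Python sketch below is only a conceptual illustration of the bit-allocation idea behind MSQ. The layer names, candidate quantizers, and error values are made up for this example and do not come from this repository: the goal is simply to pick one quantizer per layer so that a proxy error is minimized under an average-bitwidth budget.

# Conceptual sketch only, not the repository's solver. Each layer gets a list of
# candidate (bitwidth, proxy error) options; we greedily upgrade the layer with
# the best error reduction per extra bit while the average bitwidth stays in budget.
candidates = {
    "q_proj":    [(2.25, 0.90), (3.25, 0.30), (4.25, 0.10)],
    "gate_proj": [(2.25, 0.70), (3.25, 0.25), (4.25, 0.08)],
}
budget = 3.25  # target average bitwidth

choice = {layer: 0 for layer in candidates}  # start every layer at the cheapest option

def avg_bits():
    return sum(opts[choice[l]][0] for l, opts in candidates.items()) / len(candidates)

while True:
    best_layer, best_gain = None, 0.0
    for layer, opts in candidates.items():
        i = choice[layer]
        if i + 1 == len(opts):
            continue
        extra_bits = opts[i + 1][0] - opts[i][0]
        gain = (opts[i][1] - opts[i + 1][1]) / extra_bits
        within_budget = avg_bits() + extra_bits / len(candidates) <= budget
        if within_budget and gain > best_gain:
            best_layer, best_gain = layer, gain
    if best_layer is None:
        break
    choice[best_layer] += 1

print({layer: candidates[layer][choice[layer]] for layer in candidates})
# e.g., {'q_proj': (3.25, 0.3), 'gate_proj': (3.25, 0.25)} at a 3.25-bit budget

The actual solvers (solve_mem_const.py and solve_lat_const.py below) additionally handle latency constraints and layer fusion; see the paper for the exact formulation.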

Machine Information

For reference, we list the machine used for our experiments below to facilitate reproduction.

  • GPU: NVIDIA RTX 4090
  • CPU: AMD EPYC 7B13 64-Core Processor
  • OS: Ubuntu 22.04.5
  • CUDA Version: 12.4

Setup

Option 1. Using Conda

  1. Initialize environment and install CUDA kernels:

    Navigate to kernels/ and follow the instructions in the README.md file there. This step sets up the conda environment qpal and installs the optimized CUDA kernels for Q-Palette.

  2. Install Python dependencies:

    Return to this directory and run:

    pip install -r requirements.txt

Option 2. Using uv

  1. Set the environment variables:

    export CUDA_HOME=<YOUR_CUDA12.4_HOME>
    export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:.venv/lib/python3.11/site-packages/torch/lib
  2. Install Python dependencies and CUDA kernels:

    uv sync --preview-features extra-build-dependencies --extra kernels 
    source .venv/bin/activate
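
With either option, a quick sanity check (a minimal sketch, assuming the qpal conda environment or the .venv is active) verifies that PyTorch was built against CUDA and can see the GPU:

# Quick environment sanity check (illustrative, not part of the repository).
import torch

print(torch.__version__, torch.version.cuda)    # expect a CUDA 12.x build of PyTorch
print(torch.cuda.is_available())                # expect True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))        # e.g., NVIDIA GeForce RTX 4090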

Pre-quantized Models

Type                                              | Models                                                                | 🤗 Hugging Face
Data-free quantization (Figures 1 and 5, Table 3) | Llama-3.1-8B, Llama-3.1-70B, Llama-3.2-1B, Llama-3.2-3B, Qwen-2.5-7B  | Link
Data-aware quantization (Table 4)                 | Llama-2-7b-hf, Llama-2-13b-hf                                         | Link
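
To fetch a pre-quantized checkpoint locally, the standard Hugging Face Hub client can be used (a generic sketch, assuming huggingface_hub is installed; the repo id and destination below are placeholders, not values from this repository):

# Generic download sketch using the Hugging Face Hub client.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<pre-quantized-model-repo-id>",   # placeholder, use the id from the link above
    local_dir="checkpoints/qpalette",          # illustrative destination directory
)
print(local_path)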

General Usage

Memory-Constrained Mixed-Scheme Quantization

Example usage:

python solve_mem_const.py --model meta-llama/Llama-3.1-8B --target_bitwidth 3.25
  • --model: Hugging Face model name or path
  • --target_bitwidth: Target average bitwidth for memory constraint

The resulting quantization configuration is stored at:

codes/msq_results/3_8b/mem_constrained/default/3.25bit.pt

Evaluate the resulting model's perplexity on WikiText2 (the first quantization takes about 1–2 hours):

python eval_qdict.py --qdict_path msq_results/3_8b/mem_constrained/default/3.25bit.pt

Expected perplexity: ~6.10
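
If you want to inspect the configuration before evaluating, it is an ordinary PyTorch-serialized object (a minimal sketch; the exact structure of the file is repository-specific, so the dict assumption below is only a guess):

# Peek at the saved mixed-scheme configuration. The file is produced locally by
# solve_mem_const.py, so loading the full pickle (weights_only=False) is fine here.
import torch

path = "codes/msq_results/3_8b/mem_constrained/default/3.25bit.pt"  # path as listed above; adjust to your working directory
qdict = torch.load(path, map_location="cpu", weights_only=False)
print(type(qdict))
if isinstance(qdict, dict):                      # structure assumed, not documented
    for name, cfg in list(qdict.items())[:5]:    # show the first few entries
        print(name, cfg)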

Latency-Constrained Mixed-Scheme Quantization (MSQ)

Example usage:

python solve_lat_const.py --target_thp 200 
  • --target_thp: Target throughput (tokens/sec) at batch size=1 on an RTX 4090 GPU.
  • --use_cc: Optional flag that enables the CUDA-core implementation.
  • --no_fuse: Optional flag that disables fusion-aware MSQ.

The solver runs quickly, and the resulting configurations are saved at:

codes/msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt
codes/msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp_merge_info.pt

Evaluate WikiText2 perplexity (the first quantization takes about 1–2 hours):

python eval_qdict.py --qdict_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt

Expected perplexity: ~6.37

Evaluate throughput:

python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/3_8b/lat_constrained/4090_cc/default_err/200.0thp_merge_info.pt \
    --print_result

(You may need to add this codebase to PYTHONPATH, e.g., export PYTHONPATH=./:$PYTHONPATH.)

Expected throughput: ~190–200 tokens/sec (RTX 4090 GPU)
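
As a rough plausibility check (back-of-the-envelope arithmetic under assumed numbers, not a measurement), batch-1 decoding is largely memory-bandwidth bound, so an upper bound on throughput follows from the weight footprint alone:

# Rough upper bound on batch-1 decode throughput: every generated token must read
# all quantized weights once, so tokens/sec <= memory bandwidth / weight bytes.
# The parameter count, average bitwidth, and bandwidth below are assumed, approximate values.
params = 8.0e9           # ~8B weights in Llama-3.1-8B (approximate)
avg_bits = 3.25          # illustrative average bitwidth chosen by the MSQ solver
bandwidth = 1008e9       # RTX 4090 peak memory bandwidth in bytes/sec (~1 TB/s, spec value)

weight_bytes = params * avg_bits / 8
print(f"weights ~= {weight_bytes / 1e9:.2f} GB")                # ~3.25 GB
print(f"upper bound ~= {bandwidth / weight_bytes:.0f} tok/s")   # ~310 tokens/sec
# Measured throughput (~190-200 tokens/sec above) is lower because of KV-cache
# traffic, activations, and kernel overheads.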

Commands for Reproducing Speedup Experiments (Figure 1)

Figure 1 (b): Single-Scheme Quantization with TCQ-3.25

Evaluate perplexity:

python eval_qdict.py --quantizer_key tcq-3.25

Evaluate throughput:

python eval/measure_latency.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --quantizer_str tcq-3.25 \
    --use_inc_mlp --use_inc_attn --print_result

Figure 1 (c): Latency-Aware MSQ without Fusion

We include the exact quantization configuration used for Figure 1 (c) in this codebase.

Evaluate perplexity:

python eval_qdict.py --qdict_path msq_results/figure1c/0.0_8.0bit_1.11.pt

Evaluate throughput:

python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/figure1c/0.0_8.0bit_1.11.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/figure1c/0.0_8.0bit_1.11_merge_info.pt \
    --print_result

Figure 1 (d): Latency-Aware MSQ with Fusion

We include the exact quantization configuration used for Figure 1 (d) in this codebase.

Evaluate perplexity:

python eval_qdict.py --qdict_path msq_results/figure1d/0.0_8.0bit_1.17.pt

Evaluate throughput:

python eval/measure_latency_merge_simt.py \
    --hf_path meta-llama/Llama-3.1-8B \
    --qdict_path msq_results/figure1d/0.0_8.0bit_1.17.pt \
    --use_inc_mlp --use_inc_attn \
    --merge_info_path msq_results/figure1d/0.0_8.0bit_1.17_merge_info.pt \
    --print_result

Planned Updates

The current release provides a minimal implementation to reproduce the main results of the paper.

Planned updates include:

  • Setup support with uv
  • Upload HuggingFace checkpoints for some key results
  • Add tutorials and usage examples for practitioners
  • Broader model, kernel, and usability support
  • Clean evaluation code
  • Release code for loss and cost term computation

Stay tuned!

Citation

@inproceedings{lee2025qpalette,
  title={Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment},
  author={Deokjae Lee and Hyun Oh Song},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
}
