
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Shih-Yang Liu*, Maksim Khadkevich, Nai Chit FUNG, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
(*Work done during the internship at NVIDIA Research)

[Paper] [NV Blog] [BibTeX]

EoRA is a novel fine-tuning-free method that augments compressed LLMs with low-rank matrices, allowing users to rapidly enhance task-specific performance and freely balance the trade-off between accuracy and computational overhead beyond the constraints of compression formats. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs, achieving notable accuracy improvements (e.g., 10.84% on ARC-Challenge, 6.74% on MathQA, and 11.45% on GSM8K for LLaMA3-8B compressed to 3-bit). We also introduce an optimized CUDA kernel that accelerates inference by up to 1.4x and reduces memory overhead by quantizing EoRA. Overall, EoRA offers a ready-to-use solution for improving the accuracy of compressed models under varying user requirements, enabling more efficient and flexible deployment of LLMs.
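To make the idea concrete, here is a minimal NumPy sketch of eigenspace low-rank compensation: the compression error is projected into the eigenspace of the calibration activations, truncated by SVD, and mapped back. This is an illustrative toy (function name, shapes, and the exact weighting are our assumptions, not the repository's implementation):

```python
import numpy as np

def eora_compensate(W, W_q, X, rank):
    """Toy eigenspace low-rank compensation (illustrative, not the repo's code).

    W    : original weight, shape (out, in)
    W_q  : compressed weight, same shape
    X    : calibration activations, shape (in, n_samples)
    rank : rank of the compensation factors B @ A
    """
    delta = W - W_q                              # compression error to compensate
    # Eigendecomposition of the input correlation matrix X X^T
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    scale = np.sqrt(np.clip(eigvals, 0.0, None))
    # Weight error directions by how strongly calibration data excites them
    delta_proj = (delta @ eigvecs) * scale
    U, S, Vt = np.linalg.svd(delta_proj, full_matrices=False)
    B = U[:, :rank] * S[:rank]                   # (out, rank)
    A = Vt[:rank]                                # (rank, in), still in eigenspace
    # Map the low-rank factors back out of the eigenspace
    inv_scale = np.where(scale > 0, 1.0 / scale, 0.0)
    A = (A * inv_scale) @ eigvecs.T
    return B, A
```

With `rank` equal to the full dimension this recovers the error exactly; smaller ranks keep only the directions the calibration data cares about most.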

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing.

💥 News 💥

  • [March 2, 2026] 🔥🔥 EoRA is accepted to ICLR 2026 for both ICBINB and TTU workshops!! See you in Rio de Janeiro, Brazil!!
  • [June 13, 2025] 🔥🔥 Release the code for reproducing the paper's results!!
  • [June 9, 2025] 🔥🔥 The official NVIDIA Tech Blog of EoRA is released HERE!!
  • [May 15, 2025] 🔥🔥 Check out an awesome blog post, 2-bit+EoRA, which shows that EoRA significantly boosts 2-bit LLM accuracy without training!!
  • [February 24, 2025] 🔥🔥 EoRA has been integrated into GPTQModel HERE!!

🔧 GPTQModel Support

EoRA is now seamlessly integrated into GPTQModel (HERE). Check here for detailed instructions on running EoRA with GPTQModel.

🛠 Installation

# clone repo
git clone https://github.com/ModelCloud/GPTQModel.git && cd GPTQModel

# pip: compile and install
# You can install optional modules like auto_round, ipex, vllm, sglang, and bitblas.
# Example: pip install -v --no-build-isolation .[vllm,sglang,bitblas,ipex,auto_round]
pip install -v . --no-build-isolation

⚡ Quick Start

Step 1: Quantize the model with GPTQModel

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "meta-llama/Llama-3.2-3B"
quant_path = "Llama-3.2-3B-gptqmodel-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)
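The `bits` and `group_size` options in `QuantizeConfig` control the numeric precision and how many weights share one scale. A toy NumPy quantizer (our own illustration, not GPTQModel's actual algorithm, which also uses Hessian-based error correction) shows what these two knobs do:

```python
import numpy as np

def quantize_groupwise(w, bits=4, group_size=128):
    """Toy asymmetric group-wise quantize/dequantize round trip."""
    qmax = 2 ** bits - 1                          # e.g. 15 levels for 4-bit
    groups = w.reshape(-1, group_size)            # each group shares one scale
    wmin = groups.min(axis=1, keepdims=True)
    wmax = groups.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard constant groups
    q = np.clip(np.round((groups - wmin) / scale), 0, qmax)
    return (q * scale + wmin).reshape(w.shape)    # dequantized weights
```

Smaller `group_size` means more scales (more overhead) but a tighter fit per group; the residual error left by this rounding is exactly what EoRA's low-rank adapter compensates.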

Step 2: Generate EoRA given the quantized model

from gptqmodel.adapter.adapter import Lora
from gptqmodel import GPTQModel, QuantizeConfig

eora = Lora(
  # for eora generation, path is adapter save path; for load, it is loading path
  path=f"{quant_path}/eora_rank16", 
  rank=16,
)

# provide a previously gptq quantized model path
GPTQModel.adapter.generate(
  adapter=eora,
  model_id_or_path=model_id,
  quantized_model_id_or_path=quant_path,
  calibration_dataset=calibration_dataset,
  calibration_dataset_concat_size=0,
  auto_gc=False)

# post-eora inference
model = GPTQModel.load(
  model_id_or_path=quant_path,
  adapter=eora
)
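At inference time the adapter adds only two thin matrix multiplies on top of the quantized layer, which is why the runtime overhead stays small. A minimal NumPy sketch of the equivalence (shapes are hypothetical; this is not GPTQModel's kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
out_f, in_f, rank, batch = 16, 32, 4, 2

W_q = rng.standard_normal((out_f, in_f))   # (dequantized) quantized weight
B = rng.standard_normal((out_f, rank))     # EoRA low-rank factors
A = rng.standard_normal((rank, in_f))
x = rng.standard_normal((batch, in_f))

# Folding B @ A into the weight ...
y_folded = x @ (W_q + B @ A).T
# ... matches the cheap adapter path: base matmul plus two thin matmuls
y_adapter = x @ W_q.T + (x @ A.T) @ B.T
assert np.allclose(y_folded, y_adapter)
```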

Step 3: Evaluate the accuracy of the original quantized model vs EoRA

Evaluating the original quantized model

python GPTQModel/examples/eora/evaluation.py --quantized_model {quant_path}

Evaluating EoRA

python GPTQModel/examples/eora/evaluation.py --quantized_model {quant_path} \
    --eora_save_path {quant_path}/eora_rank16 \
    --eora_rank 16

Reproducing Paper Results

You can find full reproduction instructions in the EoRA directory.

Star History

Star History Chart

Contact

Shih-Yang Liu: shihyangl@nvidia.com or sliuau@connect.ust.hk

Citation

If you find EoRA useful, please consider giving a star and citation:

@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and Khadkevich, Maksim and Fung, Nai Chit and Sakr, Charbel and Yang, Chao-Han Huck and Wang, Chien-Yi and Muralidharan, Saurav and Yin, Hongxu and Cheng, Kwang-Ting and Kautz, Jan and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

Licenses

Copyright © 2025, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.

About

[ICLRW'26] EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
