G-KV

Introduction

This library implements several KV cache compression algorithms, including H2O, SnapKV, R-KV, StreamingLLM, and the proposed G-KV. It is compatible with a range of models, including the Qwen 2 series, Qwen3 (inference only), and the Llama series (versions 1 to 3).
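To illustrate what KV cache compression means in practice, here is a minimal sketch of two common eviction policies: a StreamingLLM-style policy (keep a few initial "sink" positions plus a recent window) and a heavy-hitter-style policy (keep recent positions plus the highest-scoring older ones). The function names, parameters, and scoring input are hypothetical and chosen for illustration only; they are not this library's API, and the sketch does not reproduce G-KV's scoring.

```python
from typing import List, Sequence


def streaming_keep(seq_len: int, num_sink: int, window: int) -> List[int]:
    """StreamingLLM-style policy (illustrative): keep the first `num_sink`
    positions and the last `window` positions; evict everything in between."""
    if seq_len <= num_sink + window:
        return list(range(seq_len))
    return list(range(num_sink)) + list(range(seq_len - window, seq_len))


def score_based_keep(scores: Sequence[float], budget: int, recent: int) -> List[int]:
    """Heavy-hitter-style policy (illustrative): always keep the `recent`
    newest positions, then fill the remaining budget with the
    highest-scoring older positions."""
    seq_len = len(scores)
    if seq_len <= budget:
        return list(range(seq_len))
    recent = min(recent, budget)
    older = range(seq_len - recent)
    # Rank older positions by score; ties go to the earlier position.
    top = sorted(older, key=lambda i: (-scores[i], i))[: budget - recent]
    return sorted(top) + list(range(seq_len - recent, seq_len))


if __name__ == "__main__":
    print(streaming_keep(seq_len=10, num_sink=2, window=3))
    print(score_based_keep([0.1, 0.9, 0.2, 0.8, 0.3, 0.4], budget=4, recent=2))
```

Both functions return the sorted cache positions to retain; a real implementation would use these indices to gather the corresponding key and value tensors per layer and head.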

The library also supports post-training of models under KV cache compression. It includes a complete GRPO reinforcement learning pipeline that performs generation with KV cache compression and constructs the corresponding sparse attention masks for training. It also provides pipelines for supervised fine-tuning (SFT) and distillation, so that models can be adapted to KV cache compression settings.
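One way to think about a sparse attention mask for training is that each query position may attend only to keys that would still be in the compressed cache at that point of generation. The sketch below assumes a hypothetical record `evicted_at`, where `evicted_at[k]` is the step at which key `k` was dropped (or `None` if it was never dropped). This is an assumption made for illustration, not a description of how this library builds its masks.

```python
from typing import List, Optional


def build_sparse_causal_mask(evicted_at: List[Optional[int]]) -> List[List[bool]]:
    """Return mask[q][k] == True when query q may attend to key k.

    A key is visible if it is not in the future (k <= q) and it has not
    yet been evicted when query q is processed (q < evicted_at[k]).
    """
    n = len(evicted_at)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            causal = k <= q
            alive = evicted_at[k] is None or q < evicted_at[k]
            row.append(causal and alive)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Key 1 is evicted at step 3, so queries 3 and later cannot see it.
    for row in build_sparse_causal_mask([None, 3, None, None]):
        print(["x" if visible else "." for visible in row])
```

In a tensor-based pipeline the same boolean pattern would typically be passed as an attention mask, so that training sees the same context the model sees when generating with a compressed cache.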

Environment

Python >= 3.10

pip install -r requirement.txt
pip install flash-attn==2.7.4.post1 --no-cache-dir
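After installation, a quick sanity check can confirm that the interpreter meets the version requirement and that the attention package is importable. The helper names below are hypothetical; the module name `flash_attn` is the import name used by the flash-attn package.

```python
import importlib.util
import sys
from typing import Tuple


def meets_minimum(version: Tuple[int, ...], minimum: Tuple[int, int] = (3, 10)) -> bool:
    """Return True if `version` (e.g. sys.version_info[:3]) is at least `minimum`."""
    return tuple(version[:2]) >= minimum


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python version OK:", meets_minimum(sys.version_info[:3]))
    print("flash_attn importable:", has_module("flash_attn"))
```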

Quick Start

The scripts contain detailed descriptions of parameter settings.

Inference

bash scripts/inference.sh

Train (SFT or Distillation)

bash scripts/sft.sh

Train (RL)

bash scripts/rl.sh

Evaluate on LiveCodeBench

python datasets/lcb_precess.py

bash scripts/lcb_eval.sh

Citation

@misc{liao2025gkvdecodingtimekvcache,
      title={G-KV: Decoding-Time KV Cache Eviction with Global Attention}, 
      author={Mengqi Liao and Lu Wang and Chaoyun Zhang and Zekai Shen and Xiaowei Mao and Si Qin and Qingwei Lin and Saravan Rajmohan and Dongmei Zhang and Huaiyu Wan},
      year={2025},
      eprint={2512.00504},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.00504}, 
}