ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution (Paper)
Accepted at ICML 2026
This repository contains the training and evaluation code for ForesightKV.
Install the Python dependencies first:
conda create -n foresightkv python=3.10
conda activate foresightkv
pip install -r requirements.txt
The training scripts use flash_attention_2. Install a compatible flash-attn
build separately if you plan to run supervised training or reinforcement
learning on GPU.
Supervised training:
cd supervised_training
python train.py \
--model_name path/to/qwen3-base-model \
--dataset path/to/supervised-data \
--checkpoint_path checkpoints/r1kv-sl
Qwen2 variant:
cd supervised_training
python train_qwen2.py \
--model_name path/to/qwen2-base-model \
--dataset path/to/supervised-data \
--checkpoint_path checkpoints/r1kv-qwen2-sl
Notes:
- --dataset should point to a Hugging Face dataset directory written by save_to_disk, so that it can be read back with load_from_disk.
- train.py and train_qwen2.py infer the layer count and KV head layout from the loaded config, so they are not limited to a single model size.
- the current script expects at least 2 CUDA devices because it places the train model on cuda:0 and the reference model on cuda:1.
Reinforcement learning:
cd reinforcment_learning
torchrun --nproc_per_node=NUM_GPUS train.py \
--model_name checkpoints/r1kv-sl \
--data_name path/to/reinforcement-data \
--checkpoint_path checkpoints/r1kv-rl
Qwen2 variant:
cd reinforcment_learning
torchrun --nproc_per_node=NUM_GPUS train_qwen2.py \
--model_name checkpoints/r1kv-qwen2-sl \
--data_name path/to/reinforcement-data \
--checkpoint_path checkpoints/r1kv-qwen2-rl \
--judge_init_path checkpoints/r1kv-qwen2-sl
Notes:
- the directory name is spelled reinforcment_learning (without the second "e") in this repository, so use that spelling in cd commands.
- --data_name should point to a Hugging Face dataset directory written by save_to_disk, so that it can be read back with load_from_disk.
- train.py and train_qwen2.py both accept --total_training_steps, --rollouts_per_step, --checkpoint_interval, and related RL hyperparameters as CLI arguments.
Generation:
cd evaluation
python run_math.py \
--dataset_path ./data/aime24.jsonl \
--save_path ./outputs/example.jsonl \
--model_path path/to/model \
--method fullkv
Common arguments:
- --method: KV cache strategy. Supported choices are fullkv, rkv, snapkv, streamingllm, h2o, foresightkv, and foresightkv_topk.
- --kv_budget: KV retention budget used by compressed methods. Leave it unset for fullkv.
- --max_length: maximum sequence length during generation. We recommend using 32768 for long-context reasoning evaluation.
- --eval_batch_size: evaluation batch size. The default is 1.
- --times: repeat count per example, useful when sampling multiple outputs from the same prompt.
- --attn_implementation: attention backend, with choices flash_attention_2, sdpa, and eager.
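The arguments above can be assembled programmatically when sweeping over methods. The following is a minimal sketch, not part of the repository: build_run_math_command is a hypothetical helper, and it covers only flags documented in this README. It assumes the script is launched from the evaluation directory.

```python
import shlex

# Method names as listed in this README.
SUPPORTED_METHODS = ("fullkv", "rkv", "snapkv", "streamingllm", "h2o",
                     "foresightkv", "foresightkv_topk")


def build_run_math_command(dataset_path, save_path, model_path,
                           method="fullkv", kv_budget=None, max_length=None):
    """Hypothetical helper: build a run_math.py argv list from documented flags."""
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"unsupported method: {method}")
    if method == "fullkv" and kv_budget is not None:
        # The README says to leave --kv_budget unset for fullkv.
        raise ValueError("leave kv_budget unset for fullkv")
    argv = ["python", "run_math.py",
            "--dataset_path", dataset_path,
            "--save_path", save_path,
            "--model_path", model_path,
            "--method", method]
    # Optional flags are appended only when the caller sets them.
    if kv_budget is not None:
        argv += ["--kv_budget", str(kv_budget)]
    if max_length is not None:
        argv += ["--max_length", str(max_length)]
    return argv


command = build_run_math_command(
    "./data/aime24.jsonl", "./outputs/snapkv.jsonl", "path/to/model",
    method="snapkv", kv_budget=1024, max_length=32768)
print(shlex.join(command))
```

The returned list can be passed to subprocess.run as-is; shlex.join is used here only to print a copy-pasteable command line.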
Method-related hyperparameters:
- --window_size: local sliding-window size used by compressed KV methods. Default is 8.
- --first_tokens: always-retained prefix token count for some methods. Default is 4.
- --mix_lambda: mixing weight used by specific heuristics such as h2o. Default is 0.1.
- --retain_ratio: token retention ratio used by rkv. Default is 0.2.
- --retain_direction: retention direction, either last or first. Default is last.
- --update_kv: whether to update the KV cache online during generation. Default is True.
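To make the roles of window_size and first_tokens concrete, here is an illustrative sketch of a prefix-plus-recent-window retention rule in the style of StreamingLLM. It is an assumption for illustration only: retained_positions is a hypothetical function, and the way each method in this repository combines these settings with kv_budget may differ.

```python
def retained_positions(seq_len, first_tokens=4, window_size=8):
    """Positions kept by a simple prefix-plus-recent-window policy (illustrative)."""
    # Always keep the first `first_tokens` positions (clipped to the sequence).
    prefix = range(min(first_tokens, seq_len))
    # Always keep the most recent `window_size` positions.
    window = range(max(seq_len - window_size, 0), seq_len)
    # The two ranges can overlap on short sequences, so deduplicate.
    return sorted(set(prefix) | set(window))


kept = retained_positions(seq_len=20)
print(kept)
```

With the documented defaults and a 20-token sequence, the rule keeps the first 4 positions and the last 8, and everything in between becomes a candidate for eviction.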
ForesightKV model-side options:
- For foresightkv, window_size should be larger than kv_budget + divide_length.
- --divide_method: segment split rule for reasoning traces, with choices step_length and newline.
- --divide_length: segment length when divide_method=step_length. Default is 128.
- --compression_content: whether to compress all generated content or only the think part.
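The options above can be sketched in a few lines. This is a hedged illustration rather than the repository's implementation: divide_segments and window_is_valid are hypothetical names, and newline_ids (the set of token ids treated as line breaks) is an assumption, since the README does not say how newline boundaries are detected.

```python
def divide_segments(token_ids, divide_method="step_length",
                    divide_length=128, newline_ids=frozenset()):
    """Split a token sequence into segments (illustrative, not the repo's code)."""
    if divide_method == "step_length":
        # Fixed-size chunks; the last chunk may be shorter.
        return [token_ids[i:i + divide_length]
                for i in range(0, len(token_ids), divide_length)]
    if divide_method == "newline":
        segments, current = [], []
        for tok in token_ids:
            current.append(tok)
            if tok in newline_ids:  # close the segment at each newline token
                segments.append(current)
                current = []
        if current:  # trailing tokens after the last newline
            segments.append(current)
        return segments
    raise ValueError(f"unknown divide_method: {divide_method}")


def window_is_valid(window_size, kv_budget, divide_length):
    """Check the documented constraint: window_size > kv_budget + divide_length."""
    return window_size > kv_budget + divide_length


chunks = divide_segments(list(range(300)), "step_length", divide_length=128)
print([len(c) for c in chunks])
print(window_is_valid(window_size=2048, kv_budget=1024, divide_length=128))
```

For instance, kv_budget 1024 with divide_length 128 requires window_size above 1152, so a window_size of 2048 passes the check.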
Example with ForesightKV compression:
cd evaluation
python run_math.py \
--dataset_path ./data/aime24.jsonl \
--save_path ./outputs/foresightkv-aime24.jsonl \
--model_path path/to/model \
--method foresightkv \
--max_length 32768 \
--kv_budget 1024 \
--window_size 2048 \
--first_tokens 4 \
--divide_method step_length \
--divide_length 128 \
--compression_content all
Scoring:
cd evaluation
python evaluation/eval_math.py \
--exp_name example \
--output_dir ./eval_outputs_example \
--base_dir ./outputs \
--dataset aime24
GPQA data is available at evaluation/data/gpqa.jsonl. The public copy keeps
only the question (input) and gold answer (output).
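A quick way to inspect this file is to parse it line by line. The sketch below assumes each line is a JSON object with the input and output fields named above; parse_jsonl and read_jsonl are hypothetical helpers, and the sample records are made up for illustration.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def read_jsonl(path):
    """Read a .jsonl file such as evaluation/data/gpqa.jsonl."""
    with open(path, encoding="utf-8") as f:
        return parse_jsonl(f)


# Made-up sample lines standing in for the real file contents.
sample = [
    '{"input": "Example question 1?", "output": "A"}',
    '',
    '{"input": "Example question 2?", "output": "C"}',
]
records = parse_jsonl(sample)
print(len(records), records[0]["output"])
```

Calling read_jsonl("data/gpqa.jsonl") from the evaluation directory would load the actual records in the same shape, provided the field assumption holds.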
GPQA generation:
cd evaluation
MODEL_PATH=path/to/model bash scripts/run_gpqa.sh
GPQA scoring:
cd evaluation
BASE_DIR=./outputs OUTPUT_DIR=./eval_outputs_gpqa bash scripts/eval_gpqa.sh
If you use this repository, please cite:
@article{dong2026foresightkv,
title={ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution},
author={Dong, Zican and Liu, Peiyu and Li, Junyi and Chen, Zhipeng and Peng, Han and Wang, Shuo and Zhao, Wayne Xin},
journal={arXiv preprint arXiv:2602.03203},
year={2026}
}