This repository contains the code and dataset for our paper: HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization.
- Run `nvcc -V` and make sure your system CUDA version is 12.1 or newer.
- Clone this repository, and then create a new conda environment with the required packages:
conda create -n hapo python=3.11
conda activate hapo
cd HAPO
pip install -r requirements.txt
- Sign in to Hugging Face and Weights & Biases with your respective tokens:
huggingface-cli login
wandb init
Our training and validation sets are in the `datasets` folder. We use `train_samples_dsr_2000.json` for training and `valid_samples_dsr_500.json` for validation in our main experiments. The examples are sampled from DeepScaleR-Preview-Dataset.
Each example in the dataset has the question in the `"question"` key, the answer in the `"ground_truth"` key, and optionally a complete solution in the `"solution"` key.
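As a quick sanity check on the format described above, the following minimal sketch verifies that each example carries the two required keys. It assumes each dataset file is a single JSON array of objects; the helper names `load_examples` and `check_examples` and the sample records are hypothetical and not part of this repository.

```python
import json

# Keys every example is expected to have; "solution" is optional.
REQUIRED_KEYS = ("question", "ground_truth")


def load_examples(path):
    """Load examples from a JSON file (assumed to hold one JSON array)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def check_examples(examples):
    """Return the indices of examples that miss a required key."""
    return [
        i for i, example in enumerate(examples)
        if any(key not in example for key in REQUIRED_KEYS)
    ]


# Hypothetical records for illustration only (not taken from the dataset).
sample = [
    {"question": "What is 2 + 3?", "ground_truth": "5"},
    {"question": "What is 7 * 6?", "ground_truth": "42", "solution": "7 * 6 = 42"},
]
print(check_examples(sample))
```

From the repository root, the same check could be applied to the training set with `check_examples(load_examples("datasets/train_samples_dsr_2000.json"))`, provided the file matches the assumed layout.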
To train with GRPO, run `bash train_grpo.sh`. To train with PPO, run `bash train_ppo.sh`.
- `w_lr`: Weight of the length reward. Defaults to `1.0`. A higher value puts more emphasis on the length reward. The weight must be in the range [0, 1] to ensure that any correct response has a higher combined reward than any incorrect response.
- `type_lr`: Type of the length reward function. Defaults to `"cosine"`. The other option, `"linear"`, produces a piecewise linear function for length reward computation.
- `mode`: The update mode, also denoted as the aggregate function in the paper. Defaults to `"min"`. `"mean"` makes `h_i` keep track of the mean length of historical correct responses instead of their minimum length.
- `rep_ngram_size`: Size of the n-grams for the repetition reward. Defaults to `3`. A higher value means that only repeated n-grams of at least the specified size are penalized.
- `rep_penalty`: Maximum penalty for the repetition reward. Defaults to `0.0`, meaning no penalty. The penalty must be a non-positive number. A smaller (more negative) value puts more emphasis on the repetition reward.
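The constraints and the two update modes described above can be illustrated with a short sketch. This is not the repository's implementation: the names `LengthRewardConfig` and `update_history` are hypothetical, the lower bound on `rep_ngram_size` is an assumption, and the reward functions themselves are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class LengthRewardConfig:
    """Hypothetical container mirroring the documented defaults."""
    w_lr: float = 1.0
    type_lr: str = "cosine"
    mode: str = "min"
    rep_ngram_size: int = 3
    rep_penalty: float = 0.0

    def validate(self):
        if not 0.0 <= self.w_lr <= 1.0:
            raise ValueError("w_lr must be in the range [0, 1]")
        if self.type_lr not in ("cosine", "linear"):
            raise ValueError('type_lr must be "cosine" or "linear"')
        if self.mode not in ("min", "mean"):
            raise ValueError('mode must be "min" or "mean"')
        if self.rep_ngram_size < 1:  # assumption: n-gram size is positive
            raise ValueError("rep_ngram_size must be at least 1")
        if self.rep_penalty > 0.0:
            raise ValueError("rep_penalty must be non-positive")
        return self


def update_history(correct_lengths, mode="min"):
    """Aggregate the lengths of past correct responses into h_i.

    Returns None when no correct response has been seen yet.
    """
    if not correct_lengths:
        return None
    if mode == "min":
        return min(correct_lengths)
    if mode == "mean":
        return sum(correct_lengths) / len(correct_lengths)
    raise ValueError('mode must be "min" or "mean"')


config = LengthRewardConfig().validate()
print(update_history([120, 80, 100], mode=config.mode))
```

With the same history of lengths, `"mean"` gives a looser target than `"min"`, since the mean of the correct response lengths is never below their minimum.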
We use `lighteval` to evaluate the trained models. The evaluation scripts are provided in the `evaluation` folder.
- Run `bash evaluate_{benchmark}.sh` to evaluate the model, where `{benchmark}` can be `gsm8k` (GSM8K), `math_500` (MATH500), `ae_2024` (AIME2024), `gpqa` (GPQA), or `lcb` (LiveCodeBench).
- Since `lighteval` only computes accuracy, we provide an additional script to compute response lengths. Run `python compute_length.py --path {path_to_parquet}`, where `{path_to_parquet}` points to the `parquet` file produced by `lighteval` for a specific evaluation run.
- Additionally, we notice that `lighteval` reports an accuracy of 0 on the GSM8K benchmark, so we provide another script to fix this. Run `python verify.py --path {path_to_parquet}` to compute the correct GSM8K accuracy and the response lengths.
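The evaluation steps above can be summarized as a small helper that builds the commands for one benchmark. This is only a sketch: `evaluation_commands` is a hypothetical helper, the parquet path is a placeholder, and the commands are printed rather than executed.

```python
import shlex

# Benchmark names as listed above.
BENCHMARKS = ("gsm8k", "math_500", "ae_2024", "gpqa", "lcb")


def evaluation_commands(benchmark, parquet_path):
    """Return the evaluation command and the post-processing command."""
    if benchmark not in BENCHMARKS:
        raise ValueError(f"unknown benchmark: {benchmark}")
    # GSM8K uses verify.py, which recomputes accuracy as well as lengths;
    # every other benchmark only needs compute_length.py.
    script = "verify.py" if benchmark == "gsm8k" else "compute_length.py"
    return [
        ["bash", f"evaluate_{benchmark}.sh"],
        ["python", script, "--path", parquet_path],
    ]


# Placeholder path: substitute the parquet file written by lighteval.
for command in evaluation_commands("gsm8k", "path/to/details.parquet"):
    print(shlex.join(command))
```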
The trained model checkpoints are provided for DeepSeek-R1-1.5B, DeepScaleR-1.5B, and Qwen-2.5-1.5B-Inst.
@article{huang2025hapo,
title={HAPO: Training Language Models to Reason Concisely via History-Aware Policy Optimization},
author={Huang, Chengyu and Zhang, Zhengxin and Cardie, Claire},
journal={arXiv preprint arXiv:2505.11225},
year={2025}
}