The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models

Are expensive labeled data and rejection sampling truly necessary for developing self-improving reasoning models?

We introduce Unsupervised Prefix Fine-Tuning (UPFT) -- an efficient method that trains models on only the first few tokens of single self-generated solutions, exploiting Prefix Self-Consistency across different reasoning trajectories. This repo contains the source code of UPFT.

Ke Ji, Jiahao Xu, Tian Liang, Qiuzhi Liu, Zhiwei He, Xingyu Chen, Xiaoyuan Liu, Zhijie Wang, Junying Chen, Benyou Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Paper: https://arxiv.org/abs/2503.02875
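The Prefix Self-Consistency idea mentioned above can be illustrated with a toy measurement: given several sampled solutions to one question, check how many share the same first k tokens. This is only a sketch, not the paper's metric or this repo's code. The function name `prefix_agreement`, the whitespace tokenization (a stand-in for a real model tokenizer), and the example solutions are all assumptions made for illustration.

```python
from collections import Counter


def prefix_agreement(solutions, k):
    """Fraction of solutions that share the most common k-token prefix.

    Tokens are whitespace-separated words here, a stand-in for a real
    model tokenizer (assumption made for illustration only).
    """
    if not solutions:
        raise ValueError("need at least one solution")
    prefixes = [tuple(s.split()[:k]) for s in solutions]
    _, count = Counter(prefixes).most_common(1)[0]
    return count / len(prefixes)


# Hypothetical sampled solutions to one question (made up for this example).
samples = [
    "First, let x be the number of apples. Then 2x + 3 = 11, so x = 4.",
    "First, let x be the number of apples. We get 2x = 8, hence x = 4.",
    "First, let x be the unknown count. Solving 2x + 3 = 11 gives x = 4.",
]

# Agreement is high for short prefixes and drops as the prefix grows.
for k in (4, 8, 12):
    print(k, round(prefix_agreement(samples, k), 2))
```

In this toy data the trajectories agree fully on the first 4 tokens and diverge later, which is the kind of pattern that motivates training on prefixes only.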

Experimental results under the unsupervised sampling setting:

| Fine-Tuning Method | Data | Avg. Length | GSM8K | MATH500 | AIME2024 | GPQA | Avg. |
|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | | | 82.0 | 51.0 | 3.3 | 8.6 | 36.2 |
| + SFT | PRM (12K) | 175.8 | 83.8 | 48.4 | 3.3 | 8.6 | 36.0 |
| + UPFT | PRM (12K) | 15.8 | 85.4 | 52.0 | 6.7 | 9.1 | 38.3 |
| Qwen2.5-Math-7B-Instruct | | | 95.2 | 84.0 | 16.7 | 9.6 | 51.4 |
| + SFT | PRM (12K) | 300.1 | 95.8 | 83.4 | 13.3 | 9.1 | 50.4 |
| + UPFT | PRM (12K) | 51.4 | 95.5 | 85.6 | 20.0 | 9.6 | 52.6 |
| + SFT | OMI2 (600K) | 533.2 | 95.4 | 83.4 | 13.3 | 6.6 | 49.7 |
| + UPFT | OMI2 (600K) | 67.5 | 95.4 | 86.4 | 20.0 | 9.6 | 52.9 |
| + SFT | LIMO | 491.8 | 95.8 | 84.2 | 20.0 | 7.6 | 51.9 |
| + UPFT | LIMO (0.8K) | 77.8 | 95.6 | 85.8 | 20.0 | 8.6 | 52.5 |
| + SFT | U-Hard | 393.3 | 95.5 | 83.4 | 16.7 | 9.6 | 51.3 |
| + UPFT | U-Hard (100K) | 68.2 | 96.0 | 85.6 | 26.6 | 9.6 | 54.5 |
| DeepSeek-R1-Distill-Qwen-7B | | | 88.6 | 87.0 | 40.0 | 13.1 | 57.2 |
| + SFT | LIMO (0.8K) | 2029.5 | 89.7 | 87.0 | 40.0 | 12.1 | 57.2 |
| + UPFT | LIMO (0.8K) | 757.7 | 92.0 | 89.4 | 43.3 | 17.7 | 60.6 |
| + SFT | U-Hard (100K) | 3440.4 | 89.7 | 87.0 | 36.7 | 12.1 | 56.4 |
| + UPFT | U-Hard (100K) | 561.7 | 91.4 | 89.2 | 50.0 | 15.7 | 61.6 |

Results under the supervised sampling setting:

| Method | #Tokens (Sampling) | #Tokens (Tuning) | GSM8K | MATH500 | AIME2024 | GPQA | Avg. |
|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | | | 82.0 | 51.0 | 3.3 | 8.6 | 36.2 |
| + RFT | 36.9M | 2.3M | 86.0 | 52.0 | 6.7 | 9.1 | 38.5 |
| + V-STaR | 36.9M | 6.8M | 85.4 | 52.6 | 6.7 | 8.6 | 38.3 |
| + UPFT (Ours) | 0.2M | 0.2M | 85.4 | 52.0 | 6.7 | 9.1 | 38.3 |
| + Label Filter | 36.9M | 0.2M | 85.8 | 53.4 | 6.7 | 9.1 | 38.8 |
| Qwen2.5-Math-7B-Instruct | | | 95.2 | 84.0 | 16.7 | 9.6 | 51.4 |
| + RFT | 51.7M | 3.2M | 95.7 | 85.2 | 20.0 | 9.6 | 52.6 |
| + V-STaR | 51.7M | 9.6M | 96.0 | 85.4 | 20.0 | 10.1 | 52.9 |
| + UPFT (Ours) | 0.6M | 0.6M | 95.5 | 85.6 | 20.0 | 9.6 | 52.6 |
| + Label Filter | 51.7M | 0.6M | 96.0 | 85.6 | 20.0 | 10.1 | 52.9 |
| DeepSeek-R1-Distill-Qwen-7B | | | 88.6 | 87.0 | 40.0 | 13.1 | 57.2 |
| + RFT | 318.0M | 19.9M | 90.7 | 87.0 | 40.0 | 11.1 | 57.2 |
| + UPFT (Ours) | 5.0M | 5.0M | 91.9 | 88.4 | 40.0 | 14.6 | 58.7 |
| + Label Filter | 318.0M | 4.5M | 92.3 | 89.2 | 40.0 | 13.6 | 58.8 |

Supervised experiments show that our method still achieves competitive performance under settings with label filtering, while reducing the number of training tokens by 75%.
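To make the token savings concrete, the reduction relative to RFT can be computed from the token counts in the supervised-setting table above. The text does not state which rows the 75% figure refers to, so the snippet below only makes the arithmetic explicit. The helper name `reduction_pct` and the dictionary layout are assumptions for illustration.

```python
# Token counts in millions, copied from the supervised-setting table.
# Each entry is (sampling tokens, tuning tokens).
tokens = {
    "Llama-3.1-8B-Instruct": {"RFT": (36.9, 2.3), "UPFT": (0.2, 0.2)},
    "Qwen2.5-Math-7B-Instruct": {"RFT": (51.7, 3.2), "UPFT": (0.6, 0.6)},
    "DeepSeek-R1-Distill-Qwen-7B": {"RFT": (318.0, 19.9), "UPFT": (5.0, 5.0)},
}


def reduction_pct(baseline, ours):
    """Percentage reduction of `ours` relative to `baseline`."""
    return 100.0 * (1.0 - ours / baseline)


for model, counts in tokens.items():
    rft_sampling, rft_tuning = counts["RFT"]
    upft_sampling, upft_tuning = counts["UPFT"]
    print(
        f"{model}: "
        f"sampling -{reduction_pct(rft_sampling, upft_sampling):.1f}%, "
        f"tuning -{reduction_pct(rft_tuning, upft_tuning):.1f}%"
    )
```

With these numbers, the tuning-token reduction versus RFT on DeepSeek-R1-Distill-Qwen-7B comes out at roughly 75%, and it is larger for the other two models.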

Getting Started

git clone https://github.com/1KE-JI/UPFT.git
cd UPFT
conda create -n torch240 python==3.10.13
conda activate torch240
pip install -r requirements.txt

Training

We use the open-source framework OpenRLHF for training.

Step 1: Use vLLM to sample solutions

python inference/vllm_inference.py
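Because UPFT trains only on the first few tokens of a single self-generated solution, sampled outputs have to be cut down to prefixes before fine-tuning. The following is a minimal sketch of that truncation, not the repository's actual code. The record fields (`question`, `solution`, `prompt`, `response`), the whitespace tokenization, and the prefix length of 8 are all hypothetical choices for illustration; a real pipeline would use the model's tokenizer and whatever format the training script expects.

```python
import json


def truncate_to_prefix(text, max_tokens):
    """Keep only the first `max_tokens` whitespace-separated tokens.

    Whitespace splitting is an assumption standing in for a real tokenizer.
    """
    return " ".join(text.split()[:max_tokens])


def build_prefix_records(samples, max_tokens):
    """Turn one sampled solution per question into a prompt/prefix pair."""
    return [
        {
            "prompt": s["question"],
            "response": truncate_to_prefix(s["solution"], max_tokens),
        }
        for s in samples
    ]


# Hypothetical sampling output: one self-generated solution per question.
sampled = [
    {
        "question": "What is 2 + 3 * 4?",
        "solution": "First, compute the product 3 * 4 = 12. Then add 2 to get 14.",
    }
]

records = build_prefix_records(sampled, max_tokens=8)
for record in records:
    print(json.dumps(record))
```

Note that no answer label or correctness check appears anywhere in this sketch: each question contributes a single truncated sample.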

Step 2: Run the command below to fine-tune a 7B model.

bash scripts/train_sft.sh

Citation

@article{ji2025first,
  title={The first few tokens are all you need: An efficient and effective unsupervised prefix fine-tuning method for reasoning models},
  author={Ji, Ke and Xu, Jiahao and Liang, Tian and Liu, Qiuzhi and He, Zhiwei and Chen, Xingyu and Liu, Xiaoyuan and Wang, Zhijie and Chen, Junying and Wang, Benyou and others},
  journal={arXiv preprint arXiv:2503.02875},
  year={2025}
}