- [2026.05.27] We open-source the code of ICPO!
- [2025.12.01] We have published the paper of ICPO!
We propose ICPO (Intrinsic Confidence-Driven Group Relative Preference Optimization), a method designed to enhance the reasoning capabilities of large language models (LLMs). ICPO leverages intrinsic confidence to drive policy exploration, effectively encouraging novel and pedagogically valuable responses while suppressing overconfident erroneous predictions. This approach mitigates key challenges in GRPO training—namely, sparse rewards, noisy reward signals, and entropy collapse. ICPO delivers robust reasoning improvements across general domains and can be seamlessly integrated into existing reinforcement learning frameworks. Key advantages of ICPO include:
💡 Stronger Reasoning Enhancement.
ICPO achieves superior reasoning performance gains on both mathematical and general-domain reasoning benchmarks, outperforming a wide range of state-of-the-art baselines.
🛠️ Simple and Highly Extensible.
ICPO introduces a mechanism that maps generation probabilities to intrinsic rewards, modeling the relative learning value among responses within a group based on their generation probabilities. The ICPO method can be readily implemented by simply modifying the final reward computation component.
🚀 More Fine-grained Rewards and Robust Training.
Compared with standard GRPO, which relies on binary rewards, ICPO drives stronger policy exploration.
We evaluated GRPO and ICPO under noisy reward conditions and found that ICPO still achieves effective exploration and positive performance gains in such environments.
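The idea of mapping generation probabilities to intrinsic rewards within a group can be pictured with a minimal sketch. This is not the formula from the paper, nor the code in core_algos.py: the softmax over negative log-probabilities, the function names, and the `beta` and `temperature` parameters are all assumptions chosen for illustration. It only shows how a probability-based term can keep group-relative advantages informative when the verifier reward is sparse.

```python
import math
from statistics import mean, pstdev


def intrinsic_rewards(seq_logprobs, temperature=1.0):
    """Hypothetical mapping from generation probabilities to intrinsic rewards.

    Responses the policy considers less likely receive a larger share of the
    reward. Illustrative only; not the mapping used by ICPO.
    """
    scores = [-lp / temperature for lp in seq_logprobs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def group_advantages(verifier_rewards, seq_logprobs, beta=0.5, eps=1e-6):
    """GRPO-style group normalization over verifier plus intrinsic rewards."""
    bonus = intrinsic_rewards(seq_logprobs)
    combined = [r + beta * b for r, b in zip(verifier_rewards, bonus)]
    mu, sigma = mean(combined), pstdev(combined)
    return [(c - mu) / (sigma + eps) for c in combined]


# A group in which the verifier rejects every response (sparse reward).
rewards = [0.0, 0.0, 0.0, 0.0]
logprobs = [-0.2, -0.9, -1.5, -3.0]  # made-up mean token log-probabilities

binary_only = group_advantages(rewards, logprobs, beta=0.0)
with_bonus = group_advantages(rewards, logprobs, beta=0.5)
print(binary_only)
print(with_bonus)
```

With `beta=0.0` every advantage in this group is zero, so the group provides no learning signal. With the assumed bonus, the advantages are non-zero, and the most confident rejected response (log-probability -0.2) receives the lowest one.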
In our experiments, we used the RLPR Train Dataset and evaluation benchmarks.
- Install package
    pip3 install -e .[vllm]
    pip3 install -e .[sglang]
- Prepare data: Download the train and test datasets. Move rlpr_train.parquet to ./datasets/train, and move all the test datasets to ./datasets/test.
    huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Train-Dataset --local-dir ./datasets/train
    huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Evaluation --local-dir ./datasets/test
- Specify the base model path in examples/ICPO_<model>.sh, where <model> can be qwen, llama, or gemma.
    export BASE_MODEL=path_to_base_model
- (Optional) To log with wandb, set WANDB_MODE and WANDB_API_KEY in examples/ICPO_<model>.sh.
    export WANDB_MODE="offline"
    export WANDB_API_KEY="xxxxx"
- (Optional) Follow the steps below to use the LLM-as-a-judge evaluation method. Skip this step if you want to use a rule-based verifier to judge the answer.
  - Open-source model as judge
    - Create a new environment for the server and deploy the model. (Specify the judge model, host, and port in setup_server.sh.)
        bash scripts/setup_server.sh
    - Modify the evaluation model configuration in ICPO/config/eval.yaml; by default, qwen_model is used.
        base_url: "http://localhost:port/v1"
        api_key: "EMPTY"
        model_name: "default"
  - API-based model (gpt-4o / gpt-4.1) as judge
    Specify the token and the judge model in examples/ICPO_<model>.sh to use the OpenAI API.
        export OPENAI_API_KEY=your_api_token
        export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
        export USED_MODEL=gpt-4.1
- Run the training script
    bash examples/ICPO_<model>.sh
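Before launching training, the setup steps above can be sanity-checked with a small pre-flight script. The sketch below is not part of the repository: the function names are hypothetical, and the checks only mirror what the steps describe (the location of rlpr_train.parquet, a non-empty test directory, and a judge base_url whose `port` placeholder has been replaced by a real number).

```python
from pathlib import Path
from urllib.parse import urlsplit


def check_datasets(root="./datasets"):
    """Return a list of problems with the dataset layout from the setup steps."""
    problems = []
    root = Path(root)
    if not (root / "train" / "rlpr_train.parquet").is_file():
        problems.append("missing train/rlpr_train.parquet")
    test_dir = root / "test"
    if not test_dir.is_dir() or not any(test_dir.iterdir()):
        problems.append("test/ is missing or empty")
    return problems


def check_judge_url(base_url):
    """Return a list of problems with an OpenAI-compatible judge base_url."""
    try:
        parts = urlsplit(base_url)
        _ = parts.port  # raises ValueError for a non-numeric port such as "port"
    except ValueError:
        return [f"invalid port in {base_url!r}"]
    problems = []
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.hostname:
        problems.append("missing host")
    return problems


if __name__ == "__main__":
    # The template value from eval.yaml is reported until "port" is replaced.
    for issue in check_datasets() + check_judge_url("http://localhost:port/v1"):
        print("setup issue:", issue)
```

Both functions return an empty list when nothing is wrong, so they can be combined freely or extended with further checks.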
The core modifications are in the following code files:
- ICPO/verl/trainer/ppo/ray_trainer.py -- lines 1006-1046
- ICPO/verl/trainer/ppo/core_algos.py -- lines 559-625
If you need to continue training from a specific training step, navigate to the checkpoint save directory (default is data/checkpoints/), modify the value in the latest_checkpointed_iteration.txt file to the target step, and then rerun the training script.
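The manual edit described above can also be scripted. The following is a minimal sketch, not a utility shipped with the repository: the function name is hypothetical, and it assumes that the global_step_<step> folders sit in the same directory as latest_checkpointed_iteration.txt.

```python
from pathlib import Path


def set_resume_step(checkpoint_dir, step):
    """Point latest_checkpointed_iteration.txt at an already saved step.

    Returns the previous value of the file, or None if the file did not exist.
    """
    checkpoint_dir = Path(checkpoint_dir)
    # Assumption: saved steps live in global_step_<step> next to the tracker file.
    step_dir = checkpoint_dir / f"global_step_{step}"
    if not step_dir.is_dir():
        raise FileNotFoundError(f"no saved checkpoint at {step_dir}")
    tracker = checkpoint_dir / "latest_checkpointed_iteration.txt"
    previous = tracker.read_text().strip() if tracker.exists() else None
    tracker.write_text(str(step))
    return previous
```

Refusing to write when the target folder is missing avoids pointing the trainer at a step that was never saved. After calling the function on the checkpoint save directory, rerun the training script as usual.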
To merge a saved actor checkpoint with scripts/model_merger.py, run:
    python scripts/model_merger.py --local_dir <checkpoint_folder>/<exp_name>/global_step_<step>/actor --target_dir <target_dir>

- veRL: The codebase we built upon.
- RLPR: The training and evaluation data sources employed in this project.
If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!
@misc{wang2025icpo,
title={ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning},
author={Jinpeng Wang and Chao Li and Ting Ye and Mengyuan Zhang and Wei Liu and Jian Luan},
year={2025},
eprint={2511.21005},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2511.21005},
}


