ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

中文 | English

🎊 News

[2026.05.27] We open-source the code of ICPO!
[2025.12.01] We have published the paper of ICPO!

📜 Brief Introduction

We propose ICPO (Intrinsic Confidence-Driven Group Relative Preference Optimization), a method designed to enhance the reasoning capabilities of large language models (LLMs). ICPO leverages intrinsic confidence to drive policy exploration, effectively encouraging novel and pedagogically valuable responses while suppressing overconfident erroneous predictions. This approach mitigates key challenges in GRPO training—namely, sparse rewards, noisy reward signals, and entropy collapse. ICPO delivers robust reasoning improvements across general domains and can be seamlessly integrated into existing reinforcement learning frameworks. Key advantages of ICPO include:

💡 Stronger Reasoning Enhancement.

ICPO achieves superior reasoning performance gains on both mathematical and general-domain reasoning benchmarks, outperforming a wide range of state-of-the-art baselines.

🛠️ Simple and Highly Extensible.

ICPO introduces a mechanism that maps generation probabilities to intrinsic rewards, modeling the relative learning value among responses within a group based on their generation probabilities. The ICPO method can be readily implemented by simply modifying the final reward computation component.

🚀 More Fine-grained Rewards and Robust Training.

Compared to the standard GRPO approach based on binary rewards, ICPO demonstrates stronger policy exploration drive.

We evaluated GRPO and ICPO under noisy reward conditions and found that ICPO still achieves effective exploration and positive performance gains in such environments.

Dataset

In our experiments, we used the RLPR Train Dataset and evaluation benchmarks.

Install

Install package

pip3 install -e .[vllm]
pip3 install -e .[sglang]

Train

Prepare data：Download the train and test dataset. Move rlpr_train.parquet to ./datasets/train, and move all the test datasets to ./datasets/test.

huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Train-Dataset --local-dir ./datasets/train
huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Evaluation --local-dir ./datasets/test

Specify the base model path in examples/ICPO_<model>.sh, where <model> can be qwen, llama and gemma.
```
export BASE_MODEL=path_to_base_model
```
(Optional) Login wandb, modify WANDB_MODE and WANDB_API_KEY in the examples/ICPO_<model>.sh if you want to use wandb for logging.
```
export WANDB_MODE="offline"
export WANDB_API_KEY="xxxxx"
```
(Optional) Follow the following steps to use the llm as a judge eval method. Skip this step if you want to use a rule-based verifier to judge the answer.
- Open-Source Model as judge
  1. Create a new environment for the server and deploy the model. (Specify judge model, host and port in the setup_server.sh)
```
bash scripts/setup_server.sh
```
  2. Modify the evaluation model configuration in the ICPO/config/eval.yaml; by default, qwen_model is used.
```
base_url: "http://localhost:port/v1"
api_key: "EMPTY"
model_name: "default"
```
- API-Based Model (gpt-4o / 4pt-4.1) as judge
  
  Specify token and the judge model in the examples/ICPO_<model>.sh to use OpenAI API.
```
export OPENAI_API_KEY=your_api_token
export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
export USED_MODEL=gpt-4.1
```
Run the training script
```
bash examples/ICPO_<model>.sh
```

The core modifications are based on the following code files

ICPO/verl/trainer/ppo/ray_trainer.py -- Line 1006-1046
ICPO/verl/trainer/ppo/core_algos.py -- Line 559-625

Resuming Training

If you need to continue training from a specific training step, navigate to the checkpoint save directory (default is data/checkpoints/), modify the value in the latest_checkpointed_iteration.txt file to the target step, and then rerun the training script.

Convert checkpoints to HuggingFace format model

Run the code below:

python scripts/model_merger.py --local_dir <checkpoint_folder>/<exp_name>/global_step_<step>/actor --target_dir <target_dir>

Acknowledgement

veRL: The codebase we built upon.
RLPR: The training and evaluation data sources employed in this project.

Citation

If you find our model/code/data/paper helpful, please consider cite our papers 📝 and star us ⭐️！

@misc{wang2025icpo,
      title={ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning}, 
      author={Jinpeng Wang and Chao Li and Ting Ye and Mengyuan Zhang and Wei Liu and Jian Luan},
      year={2025},
      eprint={2511.21005},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.21005}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
ICPO		ICPO
assets		assets
examples		examples
scripts		scripts
verl		verl
.gitignore		.gitignore
.gitlab-ci.yml		.gitlab-ci.yml
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
pyproject.toml		pyproject.toml
requirements-npu.txt		requirements-npu.txt
requirements.txt		requirements.txt
requirements_sglang.txt		requirements_sglang.txt
setup.py		setup.py
textfile.txt		textfile.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

中文 | English

🎊 News

📜 Brief Introduction

Dataset

Install

Train

Resuming Training

Convert checkpoints to HuggingFace format model

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

中文 | English

🎊 News

📜 Brief Introduction

Dataset

Install

Train

Resuming Training

Convert checkpoints to HuggingFace format model

Acknowledgement

Citation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages