xiaomi-research/icpo

ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

🎊 News

  • [2026.05.27] We have open-sourced the ICPO code!
  • [2025.12.01] We have published the ICPO paper!

📜 Brief Introduction

We propose ICPO (Intrinsic Confidence-Driven Group Relative Preference Optimization), a method designed to enhance the reasoning capabilities of large language models (LLMs). ICPO leverages intrinsic confidence to drive policy exploration, effectively encouraging novel and pedagogically valuable responses while suppressing overconfident erroneous predictions. This approach mitigates key challenges in GRPO training—namely, sparse rewards, noisy reward signals, and entropy collapse. ICPO delivers robust reasoning improvements across general domains and can be seamlessly integrated into existing reinforcement learning frameworks. Key advantages of ICPO include:

💡 Stronger Reasoning Enhancement.

ICPO achieves superior reasoning performance gains on both mathematical and general-domain reasoning benchmarks, outperforming a wide range of state-of-the-art baselines.

🛠️ Simple and Highly Extensible.

ICPO introduces a mechanism that maps generation probabilities to intrinsic rewards, modeling the relative learning value among responses within a group based on their generation probabilities. The ICPO method can be readily implemented by simply modifying the final reward computation component.
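This README does not spell out the mapping itself, so the following is only a minimal sketch of the general idea under stated assumptions. It scores each response in a group by the fraction of peers that were generated with higher probability, adds that score as a small bonus to the task reward, and then applies GRPO-style group normalization. The function names, the `beta` weight, and the rank-based mapping are hypothetical and are not taken from the paper or from `core_algos.py`.

```python
import math
from statistics import mean, pstdev


def sequence_confidence(token_logprobs):
    """Length-normalized generation probability of one response."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def intrinsic_rewards(confidences):
    """Hypothetical mapping: fraction of group peers that are MORE confident.

    Less probable responses score higher, which nudges the policy toward
    exploration. Illustrative only; not the formula from the ICPO paper.
    """
    n = len(confidences)
    if n < 2:
        return [0.0] * n
    return [
        sum(1 for j, other in enumerate(confidences) if j != i and other > c) / (n - 1)
        for i, c in enumerate(confidences)
    ]


def group_advantages(task_rewards, confidences, beta=0.1, eps=1e-6):
    """GRPO-style normalized advantages on (task reward + beta * bonus)."""
    bonus = intrinsic_rewards(confidences)
    total = [r + beta * b for r, b in zip(task_rewards, bonus)]
    mu, sigma = mean(total), pstdev(total)
    return [(t - mu) / (sigma + eps) for t in total]


# One group of four sampled responses: per-token log-probs and binary rewards.
logprobs = [[-0.1, -0.2], [-1.0, -1.5], [-0.5, -0.4], [-2.0, -2.2]]
conf = [sequence_confidence(lp) for lp in logprobs]
print(group_advantages([1.0, 1.0, 0.0, 0.0], conf))
```

In this sketch, two responses with the same binary reward no longer receive identical advantages: the less confident one is ranked higher, so a confidently wrong response ends up with the lowest advantage in the group.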

🚀 More Fine-grained Rewards and Robust Training.

Compared with standard GRPO, which relies on binary rewards, ICPO provides finer-grained reward signals and drives stronger policy exploration.

We evaluated GRPO and ICPO under noisy reward conditions and found that ICPO still achieves effective exploration and positive performance gains in such environments.

Dataset

In our experiments, we used the RLPR Train Dataset and evaluation benchmarks.

Install

  1. Install package
    pip3 install -e .[vllm]
    pip3 install -e .[sglang]

Train

  1. Prepare data: Download the training and test datasets. Move rlpr_train.parquet to ./datasets/train, and move all the test datasets to ./datasets/test.

    huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Train-Dataset --local-dir ./datasets/train
    huggingface-cli download --repo-type dataset --resume-download openbmb/RLPR-Evaluation --local-dir ./datasets/test
  2. Specify the base model path in examples/ICPO_<model>.sh, where <model> can be qwen, llama, or gemma.

    export BASE_MODEL=path_to_base_model
  3. (Optional) To log with wandb, log in and set WANDB_MODE and WANDB_API_KEY in examples/ICPO_<model>.sh.

    export WANDB_MODE="offline"
    export WANDB_API_KEY="xxxxx"
  4. (Optional) Follow the steps below to use an LLM as the judge for evaluation. Skip this step if you want to use a rule-based verifier to judge the answers.

    • Open-Source Model as judge

      1. Create a new environment for the server and deploy the model. (Specify the judge model, host, and port in setup_server.sh.)

        bash scripts/setup_server.sh
      2. Edit the evaluation model configuration in ICPO/config/eval.yaml; qwen_model is used by default.

        base_url: "http://localhost:port/v1"
        api_key: "EMPTY"
        model_name: "default"
    • API-Based Model (gpt-4o / gpt-4.1) as judge

      Specify the API token and the judge model in examples/ICPO_<model>.sh to use the OpenAI API.

      export OPENAI_API_KEY=your_api_token
      export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
      export USED_MODEL=gpt-4.1
  5. Run the training script

    bash examples/ICPO_<model>.sh
  6. The core modifications are located in the following code files:

    ICPO/verl/trainer/ppo/ray_trainer.py -- Lines 1006-1046
    ICPO/verl/trainer/ppo/core_algos.py -- Lines 559-625
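Before launching a long run, it can help to confirm that the data from step 1 landed where the training script expects it. The following is a minimal sketch of such a check, assuming only the layout described above (rlpr_train.parquet under ./datasets/train and at least one file under ./datasets/test). The helper name `check_dataset_layout` is hypothetical and not part of this repository.

```python
from pathlib import Path


def check_dataset_layout(root="./datasets"):
    """Return a list of human-readable problems; an empty list means OK."""
    root = Path(root)
    problems = []

    # Step 1 places the training file at <root>/train/rlpr_train.parquet.
    train_file = root / "train" / "rlpr_train.parquet"
    if not train_file.is_file():
        problems.append(f"missing training file: {train_file}")

    # The test set names are not listed here, so only require that
    # <root>/test exists and contains at least one regular file.
    test_dir = root / "test"
    has_test_file = test_dir.is_dir() and any(
        p.is_file() for p in test_dir.rglob("*")
    )
    if not has_test_file:
        problems.append(f"no test datasets found under: {test_dir}")

    return problems


for problem in check_dataset_layout():
    print("WARNING:", problem)
```

Running it from the repository root prints one warning per missing item and nothing when the layout matches.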

Resuming Training

If you need to continue training from a specific training step, navigate to the checkpoint save directory (default is data/checkpoints/), modify the value in the latest_checkpointed_iteration.txt file to the target step, and then rerun the training script.
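The manual edit above can also be scripted. The sketch below assumes that latest_checkpointed_iteration.txt sits in the same directory as the global_step_<step> folders (the folder naming is taken from the conversion command in this README) and that the file holds only the step number. The helper `set_resume_step` is a hypothetical name, not part of the repository.

```python
from pathlib import Path


def set_resume_step(ckpt_dir, step):
    """Point the tracker file at `step` so the next run resumes from it."""
    ckpt_dir = Path(ckpt_dir)

    # Refuse to point at a step that was never saved.
    step_dir = ckpt_dir / f"global_step_{step}"
    if not step_dir.is_dir():
        raise FileNotFoundError(f"no checkpoint directory: {step_dir}")

    # Assumption: the tracker stores the step as plain text.
    tracker = ckpt_dir / "latest_checkpointed_iteration.txt"
    tracker.write_text(str(step))
    return tracker
```

Call it with the directory that holds the global_step_<step> folders for your experiment and the step to resume from, then rerun the training script.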

Convert checkpoints to a HuggingFace-format model

Run the command below:

python scripts/model_merger.py --local_dir <checkpoint_folder>/<exp_name>/global_step_<step>/actor --target_dir <target_dir>
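When converting several steps, filling in the placeholders by hand is error-prone. Here is a small sketch that assembles the same command from its parts. It uses only the two flags shown above; the helper name `merger_command` and the example values `my_exp`, `200`, and `merged/my_exp_step200` are hypothetical.

```python
import shlex
from pathlib import Path


def merger_command(checkpoint_folder, exp_name, step, target_dir):
    """Build the model_merger.py invocation for one saved step."""
    local_dir = Path(checkpoint_folder) / exp_name / f"global_step_{step}" / "actor"
    return [
        "python",
        "scripts/model_merger.py",
        "--local_dir",
        str(local_dir),
        "--target_dir",
        str(target_dir),
    ]


# Print the command for inspection; pass the list to subprocess.run to execute it.
cmd = merger_command("data/checkpoints", "my_exp", 200, "merged/my_exp_step200")
print(shlex.join(cmd))
```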

Acknowledgement

  • veRL: The codebase we built upon.
  • RLPR: The training and evaluation data sources employed in this project.

Citation

If you find our model/code/data/paper helpful, please consider citing our paper 📝 and starring us ⭐️!

@misc{wang2025icpo,
      title={ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning}, 
      author={Jinpeng Wang and Chao Li and Ting Ye and Mengyuan Zhang and Wei Liu and Jian Luan},
      year={2025},
      eprint={2511.21005},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.21005}, 
}
