NeurIPS 2025 Accepted Paper NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Artanic30/NoisyGRPO

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

(NeurIPS 2025)



This repository contains the reference code for the paper NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation.

🎯 Project web page | Paper | 🤗 HuggingFace Model

Table of Contents

  1. ToDo
  2. Citation
  3. Overview
  4. Installation
  5. Model
  6. Dataset
  7. Training
  8. Inference
  9. Acknowledgements

ToDo

  • Release the Training Code.
  • Release the Training Data.
  • Release the Models Weights.
  • Release the Training Scripts.
  • Release the Evaluation Code.

Citation

Please cite this work with the following BibTeX:

@article{qiu2025noisygrpo,
  title={NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation},
  author={Qiu, Longtian and Ning, Shan and Sun, Jiaxuan and He, Xuming},
  journal={arXiv preprint arXiv:2510.21122},
  year={2025},
}

Overview

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones.
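
The two components above can be illustrated with a minimal, standard-library Python sketch. It is not the repository's implementation: the function names, the prior mean `1 - noise_level`, the variances `prior_var` and `obs_var`, and the final group normalization are all assumptions chosen for illustration. The sketch treats the injected noise level as a Gaussian prior over trajectory quality, treats the observed reward as a Gaussian observation, and fuses the two with the standard precision-weighted posterior mean.

```python
import random
import statistics


def perturb_pixels(pixels, noise_level, rng):
    """Add zero-mean Gaussian noise (std = noise_level) to pixel values in [0, 1].

    Hypothetical helper: a flat list of floats stands in for an image tensor.
    """
    return [min(1.0, max(0.0, p + rng.gauss(0.0, noise_level))) for p in pixels]


def posterior_advantages(noise_levels, rewards, prior_var=1.0, obs_var=1.0, eps=1e-6):
    """Fuse a noise-based prior with observed rewards, then normalize within the group.

    Assumptions (not taken from the paper):
      * prior mean = 1 - noise_level, i.e. cleaner inputs are expected to do better;
      * prior and observation are Gaussian with fixed variances.
    """
    prior_prec = 1.0 / prior_var
    obs_prec = 1.0 / obs_var
    # Precision-weighted mean of the prior guess and the observed reward.
    post_means = [
        (prior_prec * (1.0 - s) + obs_prec * r) / (prior_prec + obs_prec)
        for s, r in zip(noise_levels, rewards)
    ]
    # Normalize across the sampled group, as in GRPO-style advantages.
    mean = statistics.fmean(post_means)
    std = statistics.pstdev(post_means)
    return [(m - mean) / (std + eps) for m in post_means]


# Usage: three rollouts with equal reward but increasing noise.
# The rollout generated from the cleanest input gets the largest advantage.
rng = random.Random(0)
noisy_image = perturb_pixels([0.2, 0.5, 0.9], noise_level=0.3, rng=rng)
advantages = posterior_advantages([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])
```

With equal rewards, the prior alone breaks the tie in favor of the low-noise rollout, which matches the stated goal of preferring visually grounded trajectories over noisy ones. Lowering `obs_var` relative to `prior_var` makes the reward dominate instead.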

Installation

To create the conda environment named noisygrpo, use the following instructions. This environment provides all the packages needed to run the code in this repo.

conda create -n noisygrpo python=3.10
conda activate noisygrpo
bash setup.sh

Model

You can access the official model weights for the NoisyGRPO model on 🤗 Hugging Face.

Dataset

The annotations for the training dataset are provided in annotations/mm_rlhf_train13k.json. Note that the JSON file references images only by absolute path, so you may need to change these paths to match your own system.

The images can be downloaded from MM-RLHF. After downloading the .zip files, unzip the images into a single directory and update the image paths in the annotation file accordingly.

The image directory should be laid out as follows:

MM_RLHF
├── long
├── mcq
├── safety
├── short
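
Because the annotation schema is not documented here, one way to update the image paths is to rewrite a path prefix wherever it appears as the start of a string value. The sketch below is a hypothetical helper, not part of this repository; the function names, the example prefixes, and the output filename are assumptions.

```python
import json
from pathlib import Path


def replace_prefix(node, old_prefix, new_prefix):
    """Recursively replace old_prefix at the start of every string value."""
    if isinstance(node, str):
        if node.startswith(old_prefix):
            return new_prefix + node[len(old_prefix):]
        return node
    if isinstance(node, list):
        return [replace_prefix(item, old_prefix, new_prefix) for item in node]
    if isinstance(node, dict):
        return {key: replace_prefix(value, old_prefix, new_prefix)
                for key, value in node.items()}
    return node  # numbers, booleans, and None pass through unchanged


def rewrite_annotations(src, dst, old_prefix, new_prefix):
    """Load a JSON annotation file, rewrite path prefixes, and save a copy."""
    data = json.loads(Path(src).read_text(encoding="utf-8"))
    updated = replace_prefix(data, old_prefix, new_prefix)
    Path(dst).write_text(json.dumps(updated, ensure_ascii=False, indent=2),
                         encoding="utf-8")


# Example call (both prefixes and the output name are placeholders):
# rewrite_annotations(
#     "annotations/mm_rlhf_train13k.json",
#     "annotations/mm_rlhf_train13k.local.json",
#     old_prefix="/original/root/MM_RLHF",
#     new_prefix="/your/data/MM_RLHF",
# )
```

Writing to a new file keeps the original annotations intact, so the prefix can be corrected and the rewrite repeated if the first guess at `old_prefix` is wrong.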

Training

Before training NoisyGRPO, set up the environment and download the dataset to your local machine. Then update the absolute paths in the functions whose names start with fill_abs_path so that they point to the image locations on your system.

Once everything is set up, train the NoisyGRPO 3B model with the following script:

cd ./NoisyGRPO

bash scripts/noisy_grpo_3B_8gpu.sh

We also provide scripts for vanilla GRPO. All scripts are under scripts/.

Acknowledgements

We would like to express our sincere gratitude to DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal and VLM-R1 for providing open-source resources that contributed to the development of this project.
