This repository contains the reference code for the paper *NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation*.
- Release the Training Code.
- Release the Training Data.
- Release the Model Weights.
- Release the Training Scripts.
- Release the Evaluation Code.
Please cite this work with the following BibTeX:
```bibtex
@article{qiu2025noisygrpo,
  title={NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation},
  author={Qiu, Longtian and Ning, Shan and Sun, Jiaxuan and He, Xuming},
  journal={arXiv preprint arXiv:2510.21122},
  year={2025},
}
```
Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones.
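To make the two components concrete, here is a minimal, self-contained sketch of noise-injected rollouts and a Bayesian-style fusion of the noise prior with the reward likelihood. This is not the paper's exact formulation: the noise schedule, the prior/likelihood forms, and the fusion weight `w_prior` below are all placeholder assumptions.

```python
import torch

def inject_noise(pixel_values: torch.Tensor, sigma: float) -> torch.Tensor:
    """Perturb visual inputs with Gaussian noise to widen exploration."""
    return pixel_values + sigma * torch.randn_like(pixel_values)

def bayesian_advantage(rewards: torch.Tensor, sigmas: torch.Tensor,
                       w_prior: float = 0.5) -> torch.Tensor:
    """Fuse a noise-level prior with a reward likelihood into a posterior advantage.

    Illustrative only: the prior favors trajectories produced under cleaner
    inputs (lower sigma), and the likelihood is the group-normalized reward
    as in vanilla GRPO. The linear blend below stands in for the paper's
    principled posterior.
    """
    likelihood = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    prior = 1.0 - sigmas / (sigmas.max() + 1e-6)  # cleaner input -> higher prior
    prior = prior - prior.mean()                  # center so it acts as an advantage
    return w_prior * prior + (1.0 - w_prior) * likelihood

# Toy usage: a group of four rollouts with different noise levels and rewards.
sigmas = torch.tensor([0.0, 0.1, 0.2, 0.4])
rewards = torch.tensor([1.0, 0.8, 0.9, 0.3])
print(bayesian_advantage(rewards, sigmas))
```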
To create the conda environment named noisygrpo, use the following commands. This environment includes all the packages needed to run the code in this repo.
```bash
conda create -n noisygrpo python=3.10
conda activate noisygrpo
bash setup.sh
```
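After setup completes, a quick sanity check (assuming a CUDA-capable machine) can confirm that PyTorch was installed and sees your GPUs:

```python
import torch

# Print the installed PyTorch version and whether CUDA devices are visible.
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```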
You can access the official NoisyGRPO model weights on 🤗 Hugging Face.
The annotations for the training dataset are provided in annotations/mm_rlhf_train13k.json.
Please note that the JSON file stores absolute paths to the images; you may need to change them to fit your own system (see the sketch after the directory layout below).
The images can be downloaded from MM-RLHF. After downloading the .zip files, unzip the images into a single directory and update the image paths in the annotation file accordingly.
The image directory should be laid out as follows:
```
MM_RLHF
├── long
├── mcq
├── safety
└── short
```
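Since the exact annotation schema is not documented here, the following sketch assumes each record stores its image path under an "image" key; adjust the key and both prefixes for your setup:

```python
import json

OLD_PREFIX = "/original/abs/path/MM_RLHF"  # assumption: the prefix baked into the released JSON
NEW_PREFIX = "/your/local/path/MM_RLHF"    # where you unzipped the images

with open("annotations/mm_rlhf_train13k.json") as f:
    records = json.load(f)

for rec in records:
    # Assumption: each record keeps its absolute image path under an "image" key.
    if isinstance(rec.get("image"), str):
        rec["image"] = rec["image"].replace(OLD_PREFIX, NEW_PREFIX)

with open("annotations/mm_rlhf_train13k.json", "w") as f:
    json.dump(records, f, indent=2)
```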
Before starting NoisyGRPO training, make sure the environment is set up and the dataset is downloaded to your local machine. Additionally, update the absolute paths in the functions whose names start with fill_abs_path so that they point to the correct image locations in your configuration.
Once everything is set up, you can launch a training job. To train NoisyGRPO 3B, run:
```bash
cd ./NoisyGRPO
bash scripts/noisy_grpo_3B_8gpu.sh
```
We also provide scripts for vanilla GRPO; all training scripts are under scripts/.
We would like to express our sincere gratitude to DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal and VLM-R1 for providing open-source resources that contributed to the development of this project.
