A novel policy gradient reinforcement learning algorithm that extends GRPO to discrete diffusion. We provide efficient sequence-level log-probability estimation as well as training code for GDPO on standard reasoning and coding benchmarks.
This repository implements GDPO (Group Diffusion Policy Optimization) to scale reasoning in diffusion large language models. Our codebase is built on top of the D1 repository.
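As in GRPO, group-based training of this kind relies on group-relative advantages: sample a group of completions for the same prompt, score each with a reward function, then normalize each reward by the group's mean and standard deviation. A minimal sketch of that normalization (the function name and epsilon handling are illustrative, not the repository's API):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its group.

    `rewards` holds scalar rewards for G completions sampled from the
    same prompt. Illustrative sketch, not the repository's interface.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Two correct completions (reward 1.0) and two incorrect ones (reward 0.0)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions receive positive advantages and incorrect ones negative, so the policy gradient pushes probability mass toward the better completions within each group.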
We use uv to set up our environment; alternatively, if you prefer conda, you can use the same environment as d1. Set up the environment with:

```bash
uv sync
```

For training, you can submit a job via sbatch using gdpo/slurm_scripts/countdown_base.sbatch. Modify the contents of the file to adjust hyperparameters as needed.
For evaluation, you can use the eval.py script. A sample usage is:
```bash
cd eval/
uv run torchrun --nproc_per_node=1 eval.py \
    --dataset math \
    --gen_length 256 \
    --output_dir sample-generations \
    --model_path GSAI-ML/LLaDA-8B-Instruct \
    --checkpoint_path path-to-lora-adapter
```

If evaluating the base model, do not pass the `--checkpoint_path` flag. To compute accuracy, modify line 511 of `eval/parse_and_get_acc.py` with the directory used above and run:
```bash
python3 parse_and_get_acc.py
```

This will print the results for all evaluated checkpoints.
The reinforcement learning code is located in the gdpo directory.
- `gdpo/slurm_scripts/` contains a sample SLURM script used to run the RL experiments.
- `gdpo/likelihood_estimators.py` contains the likelihood estimators implemented in our paper. Implement your own and train using GDPO!
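The estimators themselves live in the repository; as an illustration of the kind of quantity they approximate, here is a toy Monte Carlo estimate of a masked-diffusion sequence log-likelihood bound (all names and the dummy uniform model are hypothetical, not the repo's interface; `t` is sampled away from 0 here only to tame variance in the toy):

```python
import math
import random

VOCAB = 4  # toy vocabulary size

def dummy_token_logprob(seq, pos):
    """Stand-in for a diffusion LM: uniform over the vocabulary.
    A real estimator would query the model's logits at masked positions."""
    return math.log(1.0 / VOCAB)

def mc_sequence_logprob(seq, n_samples=5000, t_min=0.1, rng=None):
    """Monte Carlo estimate of a masked-diffusion ELBO-style bound:
        E_t [ (1/t) * sum over masked positions of log p(x_i | x_masked) ]
    where each token is masked independently with probability t."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(t_min, 1.0)
        masked = [i for i in range(len(seq)) if rng.random() < t]
        total += sum(dummy_token_logprob(seq, i) for i in masked) / t
    return total / n_samples

est = mc_sequence_logprob(list(range(8)))
# For the uniform dummy model the exact value is 8 * log(1/4) ~ -11.09,
# since E[num_masked | t] = 8t cancels the 1/t weight.
```

Swapping `dummy_token_logprob` for real model log-probabilities is where the design choices studied in the paper (how to sample `t`, how many masks, variance reduction) come into play.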
If you find this work useful, please consider citing:
```bibtex
@article{rojas2025improving,
  title={Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization},
  author={Rojas, Kevin and Lin, Jiahe and Rasul, Kashif and Schneider, Anderson and Nevmyvaka, Yuriy and Tao, Molei and Deng, Wei},
  journal={arXiv preprint arXiv:2510.08554},
  year={2025}
}
```

This work is built on top of the D1 repository by Zhao et al. (2025). The coding components also benefited from open-r1 and ml-diffucoder. We gratefully acknowledge the original authors for their pioneering work on scaling reasoning in diffusion large language models via reinforcement learning.
Please refer to the LICENSE file for details on the license terms.
