Weijia Mao1 Hao Chen2✉ Zhenheng Yang2 Mike Zheng Shou1✉
1 Show Lab, National University of Singapore, 2 ByteDance
We introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. Adv-GRPO improves text-to-image (T2I) generation in three ways:

- **Alleviate Reward Hacking**: achieves higher perceptual quality while maintaining comparable benchmark performance (e.g., PickScore, OCR), as shown in the top-left human evaluation panel;
- **Visual Foundation Model as Reward**: leverages visual foundation models (e.g., DINO) for rich visual priors, leading to overall improvements, as shown in the middle-top human evaluation results;
- **RL-based Distribution Transfer**: enables style customization by aligning generations with reference domains.
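To make the alternating update concrete, here is a minimal, hypothetical PyTorch sketch; `generator`, `reward_model`, and the Gaussian policy are toy stand-ins for the diffusion model and the learned reward, not the repo's actual API:

```python
# Toy, self-contained sketch of the alternating update in Adv-GRPO
# (illustrative stand-ins only; see the training code in this repo for
# the real diffusion-based implementation).
import torch
import torch.nn as nn

GROUP = 8                                  # GRPO group size per prompt
generator = nn.Linear(16, 64)              # stand-in for the T2I generator
reward_model = nn.Linear(64, 1)            # stand-in for the learned reward
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
r_opt = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

for step in range(100):
    z = torch.randn(GROUP, 16)                     # "prompt" conditioning
    dist = torch.distributions.Normal(generator(z), 1.0)
    samples = dist.sample()                        # a group of generations
    references = torch.randn(GROUP, 64)            # stand-in for reference images

    # (1) Adversarial reward update: push reference images up, samples down.
    r_loss = reward_model(samples).mean() - reward_model(references).mean()
    r_opt.zero_grad(); r_loss.backward(); r_opt.step()

    # (2) GRPO-style generator update: group-normalized rewards as advantages.
    rewards = reward_model(samples).squeeze(-1)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
    log_prob = dist.log_prob(samples).sum(-1)      # policy log-likelihood
    g_loss = -(adv.detach() * log_prob).mean()     # REINFORCE surrogate
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The point is only the two-step structure: the reward model is trained to separate reference images from the policy's own samples, and the generator is then updated with group-normalized adversarial rewards.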
2026-02-21
🎉 Our paper Adv-GRPO has been accepted to CVPR 2026 (Main Conference / Main Track)!
2025-12-15
- Released the checkpoints trained with OCR and GenEval prompts under the DINO reward framework:
2025-11-27
- We also released our reference datasets.
- The Adv-GRPO demo is now available on Hugging Face:
2025-11-25
- Released the Adv-GRPO training code, inference code, and the pretrained checkpoint.
- Release the reference dataset used in our work.
- Release the DINO reward checkpoints trained with GenEval and OCR prompts.
- Release the style transfer checkpoint.
- Try more base models like Qwen-Image.
Download the reward models:

| Reward | Model |
|---|---|
| PickScore | 🤗PickScore |
| DINOv2 | 🤗DINOv2 |
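For intuition about how such models can act as rewards, here is a hedged sketch that scores a generated image by its DINOv2 feature similarity to a reference image. The model id `facebook/dinov2-base`, the pooled-feature choice, and the file names are assumptions; the repo's actual reward code may differ:

```python
# Hedged sketch: DINOv2 cosine similarity as a reward signal.
# "facebook/dinov2-base" and the image paths are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def dino_reward(gen_path: str, ref_path: str) -> float:
    imgs = [Image.open(p).convert("RGB") for p in (gen_path, ref_path)]
    inputs = processor(images=imgs, return_tensors="pt")
    with torch.no_grad():
        feats = model(**inputs).pooler_output        # CLS features, one per image
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return (feats[0] @ feats[1]).item()              # cosine similarity in [-1, 1]

print(dino_reward("generated.png", "reference.png"))
```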
Clone this repository and install packages.
```bash
git clone https://github.com/showlab/Adv-GRPO.git
cd Adv-GRPO
conda create -n adv_grpo python=3.10.16 -y
conda activate adv_grpo
pip install -e .
```

We use the Qwen-Image model (https://github.com/QwenLM/Qwen-Image) to generate reference images.
First, install the dependencies required by Qwen-Image.
Then generate the reference images:

```bash
python reference_imgs_scripts/qwen_generate_multi.py \
    --node_rank 0 \
    --num_nodes 1 \
    --num_variations 8 \
    --output_dir "" \
    --text_file ""
```

The reference images will be saved in `output_dir`, and the JSON metadata file will look like this:
```json
{
  "middle-aged man with a beard giving a thumbs up, upper body, green fields in the background": [
    "node0_rank3_00000_0.png",
    "node0_rank3_00000_1.png",
    ...
  ],
  "king charles spaniel with planets for eyes, ethereal, midjourney style lighting and shadows, insanely detailed, 8k, photorealistic": [
    "node0_rank3_00001_0.png",
    "node0_rank3_00001_1.png",
    ...
  ],
  ...
}
```

If you do not want to generate the images yourself, you can use our generated ones: 🤗QWen_PickScore
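As a quick sanity check, a few lines like these (the file and directory names are placeholders) can verify that a metadata file matches the format above:

```python
# Verify the prompt -> reference-image-list JSON (placeholder paths).
import json
import os

json_path = "reference_imgs.json"          # assumed metadata file
image_dir = "reference_imgs"               # assumed output_dir from the script above

with open(json_path) as f:
    meta = json.load(f)

for prompt, files in meta.items():
    assert isinstance(files, list) and files, f"no images for prompt: {prompt!r}"
    for name in files:
        assert os.path.exists(os.path.join(image_dir, name)), f"missing file: {name}"
print(f"OK: {len(meta)} prompts, all reference images found")
```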
Some tips:

- Our reference dataset is relatively large: the full set is about 60 GB if you choose to download it.
- We do not actually use all of the images during training; similarly, not all prompts are covered when using DINOv2.
- In our ablation studies, a smaller subset of reference images and prompts achieves performance comparable to the full dataset, as measured by DINO similarity.
- If you prefer not to use our dataset or have a better alternative, you can use your own dataset and simply adapt it to the required format, as sketched below.
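For instance, here is a minimal conversion sketch; `prompts.txt`, the directory names, and the index-based file-matching rule are all assumptions to adapt to your own layout:

```python
# Hedged sketch: build the prompt -> reference-image-list JSON from your own
# images. Assumes prompts.txt holds one prompt per line and that images for
# prompt i contain a zero-padded index like "_00001_" in their file names.
import glob
import json
import os

image_dir = "my_reference_imgs"            # placeholder directory
with open("prompts.txt") as f:             # placeholder prompt list
    prompts = [line.strip() for line in f if line.strip()]

meta = {}
for i, prompt in enumerate(prompts):
    files = sorted(glob.glob(os.path.join(image_dir, f"*_{i:05d}_*.png")))
    meta[prompt] = [os.path.basename(p) for p in files]

with open("my_reference_imgs.json", "w") as f:
    json.dump(meta, f, indent=2, ensure_ascii=False)
```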
To run inference, first set the fields in the config file `config/grpo.py`:
```python
def eval_sd3_fast():
    ...
    config.train.lora_path = ""
    config.save_folder = ""
    config.json_path = ""
    config.reference_image_path = ""
    config.test_reference_image_path = ""
    ...
```

- `lora_path`: LoRA checkpoint path
- `save_folder`: output directory
- `json_path`: JSON metadata file where each key is a prompt and each value is a list of the file paths of the corresponding reference images
- `reference_image_path`: reference images for inference (optional)
- `test_reference_image_path`: test-time reference images (optional)
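For reference, here is a hypothetical filled-in version; every path below is a placeholder, not a file shipped with the repo:

```python
# Hypothetical example values for the fields above (placeholder paths).
config.train.lora_path = "ckpts/adv_grpo_lora"             # downloaded LoRA checkpoint
config.save_folder = "outputs/eval_sd3_fast"               # where results are written
config.json_path = "reference_imgs.json"                   # prompt -> reference image list
config.reference_image_path = "reference_imgs"             # optional
config.test_reference_image_path = "reference_imgs_test"   # optional
```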
Then run the evaluation script:

```bash
bash scripts/multi_node/sd3_fast/eval.sh
```

If you want to generate a single image:

```bash
python3 inference_t2i.py --config config/grpo.py:eval_sd3_fast --prompts "a flower on a planet"
```

You can modify the value after `--prompts` to try any text prompt you like.
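If you prefer to call the pipeline directly instead of the provided script, a rough diffusers-based sketch might look like the following; the base model id and the LoRA path are assumptions, and `inference_t2i.py` above remains the supported route:

```python
# Rough diffusers sketch (assumed model id, placeholder LoRA path);
# inference_t2i.py above is the supported entry point.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("ckpts/adv_grpo_lora")   # placeholder LoRA checkpoint

image = pipe("a flower on a planet", num_inference_steps=28).images[0]
image.save("flower.png")
```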
For training, the config is also in `config/grpo.py`:
```python
def dino_cotrain_sd3_patch_fast():
    ...
    config.json_path = ""
    config.reference_image_path = ""
    config.test_reference_image_path = ""
    ...
```

We use DeepSpeed ZeRO Stage 2 to save memory:
```bash
# zero2
accelerate launch --config_file scripts/accelerate_configs/deepspeed_zero2.yaml
# zero3
accelerate launch --config_file scripts/accelerate_configs/deepspeed_zero3.yaml
```

Single-node training:
```bash
# sd3 grpo with DINO reward
bash scripts/grpo_dino.sh
# sd3 grpo with PickScore reward
bash scripts/grpo_pickscore.sh
```
- You can adjust the parameters in `config/grpo.py` to tune different hyperparameters.
This repo is based on Flow-GRPO. We thank the authors for their valuable contributions to the AIGC community.
If you find Adv-GRPO useful for your research or projects, we would greatly appreciate it if you could cite the following paper:
```bibtex
@article{mao2025image,
  title={The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation},
  author={Mao, Weijia and Chen, Hao and Yang, Zhenheng and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2511.20256},
  year={2025}
}
```
