
[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage #5297

Open
zhtmike wants to merge 70 commits into verl-project:main from zhtmike:verl-omni-pr

Conversation

@zhtmike

@zhtmike zhtmike commented Feb 12, 2026

What does this PR do?

Follow-up work to #4639

  • A training script for the Flow-GRPO algorithm on Qwen-Image is provided.
  • Support for vLLM-Omni has been added to the rollout engine.
  • Diffusers has been integrated as the training engine for the diffusion model.
  • The scripts have been tested with CFG/non-CFG training, with and without KL loss, LoRA/full-model tuning, FSDP/FSDP2, collocated mode, and async reward with a standalone resource pool.
  • Unit tests for the diffusion rollout/reward loops and the Diffusers training engine have been added.
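For readers unfamiliar with the algorithm: Flow-GRPO assigns each generated image a group-relative advantage computed over the rollout.n samples drawn for the same prompt. The following is a generic sketch of that idea, not this PR's actual implementation; the function name is illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sample's reward by the
    mean/std of its own prompt group (the rollout.n samples)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt group with rollout.n = 4 sampled images:
# above-average samples get positive advantage, below-average negative.
adv = group_relative_advantages([0.9, 0.5, 0.7, 0.3])
```

Because the advantages are centered within each group, no separate value/critic network is needed, which is what makes the GRPO family attractive for diffusion RL.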

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Install Method

We provide a quick way to test with our example code. Formal support will be released together with vllm-omni 0.17.

pip install vllm==0.16
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni && git reset --hard 92e1a544 && pip install . && cd ../

Experiments were conducted on 4 NVIDIA A100 GPUs using the OCR reward (Levenshtein distance) computed with the Qwen3-VL-8B-Instruct model.

Test Dataset: OCR Dataset
Training Sample Size: 19653
Test Sample Size: 1018
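The OCR reward mentioned above scores how closely the text rendered in the generated image (as read back by the judge model) matches the target string, using Levenshtein edit distance. A self-contained sketch follows; the exact normalization convention is an assumption, not necessarily the PR's formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_reward(predicted: str, target: str) -> float:
    """1.0 for an exact match, decaying linearly with edit distance.
    (Illustrative scoring convention, clipped to [0, 1].)"""
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, target) / len(target))

ocr_reward("Fortis et Fidelis", "Fortis et Fidelis")  # exact match -> 1.0
```

A dense reward like this gives partial credit for near-miss renderings, which is generally easier to optimize with RL than a binary exact-match signal.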

Training GPU hours required to reach and maintain a validation reward score of approximately 0.9:

| Repo | Model | Algorithm | Hybrid Engine | # Cards | Reward Fn | # GPUs (Actor) | # GPUs (Rollout) | # GPUs (Async Reward) | Batch Size | rollout.n | Learning Rate | # Val Samples | Throughput (samples/s) | GPU Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| verl-omni (ours) | Qwen-Image | Flow-GRPO-Fast | True | 4+1 | qwenvl-ocr-vllm | 4 | 4 | 1 | 32 | 16 | 3e-4 | 1k (full set) | 0.04 | 49 |
| Flow-GRPO | Qwen-Image | Flow-GRPO-Fast | True | 4+1 | qwenvl-ocr-vllm | 4 | 4 | 1 | 32 | 16 | 3e-4 | 1k (full set) | 0.03 | 65 |
  • Results reported are based on the script examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh
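As a quick sanity check on the table: with the same GPU count (4+1) and the same workload, GPU-hours to reach the target reward should scale roughly inversely with throughput, which the reported numbers match.

```python
# Reported throughputs (samples/s) and GPU-hours from the table above.
ours_throughput, baseline_throughput = 0.04, 0.03
ours_gpu_hours, baseline_gpu_hours = 49, 65

# At equal work and equal GPU count, time (and hence GPU-hours) is
# proportional to 1 / throughput.
predicted_ratio = baseline_throughput / ours_throughput   # 0.75
observed_ratio = ours_gpu_hours / baseline_gpu_hours      # ~0.754
```

The two ratios agree to within half a percentage point, so the throughput gain accounts for essentially all of the GPU-hour saving.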

Some results from wandb logging:

Loss:
[screenshot: training loss curve]

Validation Score:
[screenshot: validation score curve, 2026-03-11]

Performance:
[screenshot: performance curve, 2026-03-11]

Visualization Result:

Prompt: A medieval knight holds a shield emblazoned with the crest "Fortis et Fidelis", standing resolutely in a sunlit courtyard, surrounded by ancient stone walls and banners fluttering in the breeze.

Before RL | After RL
[image: generation before RL] | [image: generation after RL]

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# e.g. bash examples/flowgrpo_trainer/run_flowgrpo.sh
#  or  bash examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh

Design & Code Changes

Please refer to #4639

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

zhtmike and others added 30 commits January 26, 2026 09:46
* add training engine

* fix init

* fix typs
* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright
* update customized reward_fn
* Update 20260109

* update

* fix CI
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* Update 20260109 (#8)

* Update 20260109

* update

* fix CI

* [data] feat: Add dataset for Qwen-Image (#6)

* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>

* add new config; debug actor

* debug; add reward config; add adv, policy loss

* debug reward loop

* init diffusers engine UT

* debug

* debug

* deubg actor forward

* debug

* merge

* add UT for adv and loss

* pass adv&loss UTs; pass engine backward UT

* clean debug code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>

* update to align verl data format

* debug

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add agent loop

* add server manager

* Add single turn loop

* add test case

* add replica

* clean dummy input

* fix bugs

* fix bugs 2

* fix bugs 3

* fix bugs 4 and add vllm-omni patch

* implement sde

* add custom_pipeline option in verl

* fix some bugs in custom pipeline

* fix OOM

* add intermediate outputs

* support inputs without mask

* clean & bug fix

* rebase master

* fix some bugs

* fix chat template (temproraly fix)

* fix several bugs & add custom pipeline

* fix several bugs

* fix reward loop

* pass CI (single card)

* minor fix

* fix import

* fix bugs

* fix import

* merge master

* add sleep mode back

* merge main

* support passing num_inference steps

* update accoriding to suggestion

* align with master

* add input_id & attention_mask back, drop hard code of chat template

* support varlen prompt input
* update scripts

* fix engine name & use image compressibility temporarily

* fix some bugs

* clean uncessary change

* fix some bugs

* fix bugs & clean configs

* add autogen

* fix CI

* clean args

* fix typo

* update script

* fix update weight

* add hijack

* fix checkpoint loading

* disable free cache engine temporaily
* support wandb val visual log; support async genrm/rule reward_loop in val

* update script

* add comment
* enable reward loop

* add timeout check for replica sleep

* fix train script

* consistent naming & fix mask

* fix UT for multi-card

* fix seq_len & clean files

* drop sleep due to bug fix in vllm-omni side
* fix bugs

* fix timesteps

* fix lora

* consistent script

* fix image size

* fix pipeline parse

* add max model len to qwen-image

* by pass bug

* fix misc. bugs
* fix bugs

* fix bugs

* fix advantage cal
* support sync reward for val

* wake up rollout after reward in val

* debug
* fix sleep mode & non-lora weight update

* fix from review
* fix bugs

* update UT

* fix config

* update config

* fix lora weight exporting

* revert noise

* revert size

* format
* fix training

* update script
zhtmike and others added 4 commits March 10, 2026 14:06
* revert python change

* Api compatible with vllm omni

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Bug fix on qwen image transformers

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Bug fix on verl

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix import path bug

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Address PR review comments: rename sp_kwargs, restore priority, real finish_reason, delete local data.py, revert build_app import

* Minor fix

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

---------

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Mike Cheung <zhtmike@gmail.com>
* fix CI

* fix CI

* fix bug

* update script
@zhtmike zhtmike marked this pull request as ready for review March 11, 2026 08:56
@lrq619

lrq619 commented Mar 13, 2026

Hi, thanks for the great PR. I'm also trying out RL post-training with Flow-GRPO.
Could you post the reward curve over the training steps? I'd like to validate whether my training is correct, thank you!

@zhtmike
Author

zhtmike commented Mar 13, 2026

Hi, thanks for the great PR. I'm also trying out RL post-training with Flow-GRPO. Could you post the reward curve over the training steps? I'd like to validate whether my training is correct, thank you!

The reward curve is the chart val-aux/flow_grpo/ocr/score/mean@1 mentioned in the PR description. It was produced by the script examples/flowgrpo_trainer/run_flowgrpo.sh with CFG training :)

@lrq619

lrq619 commented Mar 13, 2026

Hi, thanks for the great PR. I'm also trying out RL post-training with Flow-GRPO. Could you post the reward curve over the training steps? I'd like to validate whether my training is correct, thank you!

The reward curve is the chart val-aux/flow_grpo/ocr/score/mean@1 mentioned in the PR description. It was produced by the script examples/flowgrpo_trainer/run_flowgrpo.sh with CFG training :)

Thanks! Is it possible to get the reward on the training set, e.g. critic/rewards/mean? My reward on the training set does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.

@zhtmike
Author

zhtmike commented Mar 13, 2026

Thanks! Is it possible to get the reward on the training set, e.g. critic/rewards/mean? My reward on the training set does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.

This is the reward on the training set.

[screenshot: training reward curve, 2026-03-13]

If you are training Qwen-Image with CFG, the score should be fairly high from the start because the model already performs quite well on OCR tasks. If you are training without CFG, a relatively low score at the beginning is reasonable.
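For readers unfamiliar with CFG: classifier-free guidance combines a conditional and an unconditional prediction at every denoising step, pushing generations toward the prompt. This is the standard formula, shown as a generic sketch (not this PR's code; names are illustrative).

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: move the prediction away from the
    unconditional output, toward the prompt-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# scale 1.0 recovers the conditional prediction; scale 0.0 the unconditional.
cfg_combine([0.0, 1.0], [1.0, 3.0], 1.0)  # -> [1.0, 3.0]
```

Stronger guidance scales generally improve prompt adherence (hence the high initial OCR score with CFG), at the cost of extra compute from the second forward pass per step.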

@lrq619

lrq619 commented Mar 14, 2026

Thanks! Is it possible to get the reward on the training set, e.g. critic/rewards/mean? My reward on the training set does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.

This is the reward on the training set.

[screenshot: training reward curve] If you are training Qwen-Image with CFG, the score should be fairly high from the start because the model already performs quite well on OCR tasks. If you are training without CFG, a relatively low score at the beginning is reasonable.

Thanks for the information, that's really helpful. I have fixed the model and it starts from ~0.8 now.
I have a question about performance: I see around 500 s/step on 4 A100 GPUs, but I get around 2000 s/step even on 4 H800 GPUs. May I know whether you used the full number of denoising steps (50) during the rollout phase, as in the configuration?

@zhtmike
Author

zhtmike commented Mar 14, 2026

Thanks for the information, that's really helpful. I have fixed the model and it starts from ~0.8 now. I have a question about performance: I see around 500 s/step on 4 A100 GPUs, but I get around 2000 s/step even on 4 H800 GPUs. May I know whether you used the full number of denoising steps (50) during the rollout phase, as in the configuration?

We follow the paper's setting: 10 denoising steps in the rollout phase during training, and 50 steps during evaluation.
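In config terms, the rollout sampler runs far fewer denoising steps than evaluation, a 5x per-sample saving during training. A hedged sketch of the relevant knobs follows; the key names here are illustrative, so check the example scripts for the real configuration options.

```python
# Illustrative config fragment: fewer denoising steps for training
# rollouts than for evaluation, per the Flow-GRPO paper setting.
num_inference_steps = {
    "train_rollout": 10,  # cheap, noisier samples for RL exploration
    "eval": 50,           # full-quality sampling for validation scores
}

# Per-sample denoising cost drops ~5x during training rollouts.
speedup = num_inference_steps["eval"] / num_inference_steps["train_rollout"]
```

Running 50 steps during rollout, as in the question above, would largely account for a ~4-5x slower step time.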

@BesmingY

Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you please specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.

@zhtmike
Author

zhtmike commented Mar 16, 2026

Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you please specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.

Hi, you can refer to this PR branch for validation. The versions of vllm and vllm-omni are listed under Install Method in the Test section of the PR description.

@BesmingY

Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you please specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.

Hi, you can refer to this PR branch for validation. The versions of vllm and vllm-omni are listed under Install Method in the Test section of the PR description.

Hi, thank you for your reply! I noticed that your PR repository is still being updated. Could you please confirm whether I should use the latest vllm-omni-pr branch or the vllm-omni-20260211 branch to reproduce the results?

@zhtmike
Author

zhtmike commented Mar 16, 2026

Hi, thank you for your reply! I noticed that your PR repository is still being updated. Could you please confirm whether I should use the latest vllm-omni-pr branch or the vllm-omni-20260211 branch to reproduce the results?

This PR will no longer be updated except for small bug/typo fixes (which will not affect the reproduction results). You can use the latest version of vllm-omni-pr for your test.
