
[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage #5297

Open
zhtmike wants to merge 70 commits into verl-project:main from zhtmike:verl-omni-pr

Conversation

@zhtmike

@zhtmike zhtmike commented Feb 12, 2026

What does this PR do?

Follow-up work to #4639

  • A training script for the Flow-GRPO algorithm on Qwen-Image is provided.
  • Support for vLLM-Omni has been added to the rollout engine.
  • Diffusers has been integrated as the training engine for the diffusion model.
  • The scripts have been tested with CFG/non-CFG training, with and without KL loss, LoRA/full-model tuning, FSDP/FSDP2, collocated mode, and async reward with a standalone resource pool.
  • Unit tests for the diffusion rollout/reward loops and the Diffusers training engine have been added.
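For readers unfamiliar with the algorithm: Flow-GRPO assigns each generated image a group-relative advantage computed over the rollout.n samples drawn for the same prompt. The following is a generic sketch of that idea, not this PR's actual implementation; the function name is illustrative.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each sample's reward by the
    mean/std of its own prompt group (the rollout.n samples)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt group with rollout.n = 4 sampled images:
# above-average samples get positive advantage, below-average negative.
adv = group_relative_advantages([0.9, 0.5, 0.7, 0.3])
```

Because the advantages are centered within each group, no separate value/critic network is needed, which is what makes the GRPO family attractive for diffusion RL.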

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Install Method

We provide a quick way to test with our example code. Formal support will be released together with vllm-omni 0.17.

pip install vllm==0.16
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni && git reset --hard 92e1a544 && pip install . && cd ../

Experiments were conducted on 4 NVIDIA A100 GPUs using the OCR reward (Levenshtein distance) computed with the Qwen3-VL-8B-Instruct model.

Test Dataset: OCR Dataset
Training Sample Size: 19653
Test Sample Size: 1018
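The OCR reward mentioned above scores how closely the text rendered in the generated image (as read back by the judge model) matches the target string, using Levenshtein edit distance. A self-contained sketch follows; the exact normalization convention is an assumption, not necessarily the PR's formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row DP edit distance (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_reward(predicted: str, target: str) -> float:
    """1.0 for an exact match, decaying linearly with edit distance.
    (Illustrative scoring convention, clipped to [0, 1].)"""
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, target) / len(target))

ocr_reward("Fortis et Fidelis", "Fortis et Fidelis")  # exact match -> 1.0
```

A dense reward like this gives partial credit for near-miss renderings, which is generally easier to optimize with RL than a binary exact-match signal.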

Training GPU hours required to reach and maintain a validation reward score of approximately 0.9:

| Repo | Model | Algorithm | Hybrid Engine | # Cards | Reward Fn | # GPUs (Actor) | # GPUs (Rollout) | # GPUs (Async Reward) | Batch Size | rollout.n | Learning Rate | # Val Samples | Throughput (samples/s) | GPU Hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| verl-omni (ours) | Qwen-Image | Flow-GRPO-Fast | True | 4+1 | qwenvl-ocr-vllm | 4 | 4 | 1 | 32 | 16 | 3e-4 | 1k (full set) | 0.04 | 49 |
| Flow-GRPO | Qwen-Image | Flow-GRPO-Fast | True | 4+1 | qwenvl-ocr-vllm | 4 | 4 | 1 | 32 | 16 | 3e-4 | 1k (full set) | 0.03 | 65 |
  • Results reported are based on the script examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh
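As a quick sanity check on the table: with the same GPU count (4+1) and the same workload, GPU-hours to reach the target reward should scale roughly inversely with throughput, which the reported numbers match.

```python
# Reported throughputs (samples/s) and GPU-hours from the table above.
ours_throughput, baseline_throughput = 0.04, 0.03
ours_gpu_hours, baseline_gpu_hours = 49, 65

# At equal work and equal GPU count, time (and hence GPU-hours) is
# proportional to 1 / throughput.
predicted_ratio = baseline_throughput / ours_throughput   # 0.75
observed_ratio = ours_gpu_hours / baseline_gpu_hours      # ~0.754
```

The two ratios agree to within half a percentage point, so the throughput gain accounts for essentially all of the GPU-hour saving.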

Some results from wandb logging:

Loss:
[screenshot: training loss curve]

Validation Score:
[screenshot: validation score curve, 2026-03-11]

Performance:
[screenshot: performance curve, 2026-03-11]

Visualization Result:

Prompt: A medieval knight holds a shield emblazoned with the crest "Fortis et Fidelis", standing resolutely in a sunlit courtyard, surrounded by ancient stone walls and banners fluttering in the breeze.

Before RL | After RL
[image: generation before RL] | [image: generation after RL]

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# e.g. bash examples/flowgrpo_trainer/run_flowgrpo.sh
#  or  bash examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh

Design & Code Changes

Please refer to #4639

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

zhtmike and others added 30 commits January 26, 2026 09:46
* add training engine

* fix init

* fix typs
* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright
* update customized reward_fn
* Update 20260109

* update

* fix CI
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* Update 20260109 (#8)

* Update 20260109

* update

* fix CI

* [data] feat: Add dataset for Qwen-Image (#6)

* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>

* add new config; debug actor

* debug; add reward config; add adv, policy loss

* debug reward loop

* init diffusers engine UT

* debug

* debug

* deubg actor forward

* debug

* merge

* add UT for adv and loss

* pass adv&loss UTs; pass engine backward UT

* clean debug code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>

* update to align verl data format

* debug

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add agent loop

* add server manager

* Add single turn loop

* add test case

* add replica

* clean dummy input

* fix bugs

* fix bugs 2

* fix bugs 3

* fix bugs 4 and add vllm-omni patch

* implement sde

* add custom_pipeline option in verl

* fix some bugs in custom pipeline

* fix OOM

* add intermediate outputs

* support inputs without mask

* clean & bug fix

* rebase master

* fix some bugs

* fix chat template (temproraly fix)

* fix several bugs & add custom pipeline

* fix several bugs

* fix reward loop

* pass CI (single card)

* minor fix

* fix import

* fix bugs

* fix import

* merge master

* add sleep mode back

* merge main

* support passing num_inference steps

* update accoriding to suggestion

* align with master

* add input_id & attention_mask back, drop hard code of chat template

* support varlen prompt input
* update scripts

* fix engine name & use image compressibility temporarily

* fix some bugs

* clean uncessary change

* fix some bugs

* fix bugs & clean configs

* add autogen

* fix CI

* clean args

* fix typo

* update script

* fix update weight

* add hijack

* fix checkpoint loading

* disable free cache engine temporaily
* support wandb val visual log; support async genrm/rule reward_loop in val

* update script

* add comment
* enable reward loop

* add timeout check for replica sleep

* fix train script

* consistent naming & fix mask

* fix UT for multi-card

* fix seq_len & clean files

* drop sleep due to bug fix in vllm-omni side
* fix bugs

* fix timesteps

* fix lora

* consistent script

* fix image size

* fix pipeline parse

* add max model len to qwen-image

* by pass bug

* fix misc. bugs
* fix bugs

* fix bugs

* fix advantage cal
* support sync reward for val

* wake up rollout after reward in val

* debug
* fix sleep mode & non-lora weight update

* fix from review
* fix bugs

* update UT

* fix config

* update config

* fix lora weight exporting

* revert noise

* revert size

* format
* fix training

* update script
zhtmike and others added 4 commits March 10, 2026 14:06
* revert python change

* Api compatible with vllm omni

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Bug fix on qwen image transformers

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Bug fix on verl

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Fix import path bug

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

* Address PR review comments: rename sp_kwargs, restore priority, real finish_reason, delete local data.py, revert build_app import

* Minor fix

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>

---------

Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com>
Co-authored-by: Mike Cheung <zhtmike@gmail.com>
* fix CI

* fix CI

* fix bug

* update script
@zhtmike zhtmike marked this pull request as ready for review March 11, 2026 08:56
@lrq619

lrq619 commented Mar 13, 2026

Hi, thanks for the great PR. I'm also trying out RL post-training with Flow-GRPO.
Could you post the reward curve over the training steps? I'd like to validate whether my training is correct, thank you!

@zhtmike
Author

zhtmike commented Mar 13, 2026

Hi, thanks for the great PR. I'm also trying out RL post-training with Flow-GRPO. Could you post the reward curve over the training steps? I'd like to validate whether my training is correct, thank you!

The reward curve is the chart val-aux/flow_grpo/ocr/score/mean@1 mentioned in the PR description. It was produced by the script examples/flowgrpo_trainer/run_flowgrpo.sh with CFG training :)

@lrq619

lrq619 commented Mar 13, 2026

Hi, thanks for the great PR. I'm also trying out RL post-training with Flow-GRPO. Could you post the reward curve over the training steps? I'd like to validate whether my training is correct, thank you!

The reward curve is the chart val-aux/flow_grpo/ocr/score/mean@1 mentioned in the PR description. It was produced by the script examples/flowgrpo_trainer/run_flowgrpo.sh with CFG training :)

Thanks! Is it possible to get the reward on the training set, e.g. critic/rewards/mean? My reward on the training set does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.

@zhtmike
Author

zhtmike commented Mar 13, 2026

Thanks! Is it possible to get the reward on the training set, e.g. critic/rewards/mean? My reward on the training set does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.

This is the reward on the training set.

[screenshot: training reward curve, 2026-03-13]

If you are training Qwen-Image with CFG, the score should be fairly high from the start because the model already performs quite well on OCR tasks. If you are training without CFG, a relatively low score at the beginning is reasonable.
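For readers unfamiliar with CFG: classifier-free guidance combines a conditional and an unconditional prediction at every denoising step, pushing generations toward the prompt. This is the standard formula, shown as a generic sketch (not this PR's code; names are illustrative).

```python
def cfg_combine(uncond, cond, guidance_scale):
    """Classifier-free guidance: move the prediction away from the
    unconditional output, toward the prompt-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]

# scale 1.0 recovers the conditional prediction; scale 0.0 the unconditional.
cfg_combine([0.0, 1.0], [1.0, 3.0], 1.0)  # -> [1.0, 3.0]
```

Stronger guidance scales generally improve prompt adherence (hence the high initial OCR score with CFG), at the cost of extra compute from the second forward pass per step.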

@lrq619

lrq619 commented Mar 14, 2026

Thanks! Is it possible to get the reward on the training set, e.g. critic/rewards/mean? My reward on the training set does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.

This is the reward on the training set.

[screenshot: training reward curve] If you are training Qwen-Image with CFG, the score should be fairly high from the start because the model already performs quite well on OCR tasks. If you are training without CFG, a relatively low score at the beginning is reasonable.

Thanks for the information, that's really helpful. I have fixed the model and it starts from ~0.8 now.
I have a question about performance: I see around 500 s/step on 4 A100 GPUs, but I get around 2000 s/step even on 4 H800 GPUs. May I know whether you used the full number of denoising steps (50) during the rollout phase, as in the configuration?

@zhtmike
Author

zhtmike commented Mar 14, 2026

Thanks for the information, that's really helpful. I have fixed the model and it starts from ~0.8 now. I have a question about performance: I see around 500 s/step on 4 A100 GPUs, but I get around 2000 s/step even on 4 H800 GPUs. May I know whether you used the full number of denoising steps (50) during the rollout phase, as in the configuration?

We follow the paper's setting: 10 denoising steps in the rollout phase during training, and 50 steps during evaluation.
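In config terms, the rollout sampler runs far fewer denoising steps than evaluation, a 5x per-sample saving during training. A hedged sketch of the relevant knobs follows; the key names here are illustrative, so check the example scripts for the real configuration options.

```python
# Illustrative config fragment: fewer denoising steps for training
# rollouts than for evaluation, per the Flow-GRPO paper setting.
num_inference_steps = {
    "train_rollout": 10,  # cheap, noisier samples for RL exploration
    "eval": 50,           # full-quality sampling for validation scores
}

# Per-sample denoising cost drops ~5x during training rollouts.
speedup = num_inference_steps["eval"] / num_inference_steps["train_rollout"]
```

Running 50 steps during rollout, as in the question above, would largely account for a ~4-5x slower step time.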

@BesmingY

Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you please specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.

@zhtmike
Author

zhtmike commented Mar 16, 2026

Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you please specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.

Hi, you can refer to this PR branch for validation. The versions of vllm and vllm-omni are listed under Install Method in the Test section of the PR description.

@BesmingY

Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you please specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.

Hi, you can refer to this PR branch for validation. The versions of vllm and vllm-omni are listed under Install Method in the Test section of the PR description.

Hi, thank you for your reply! I noticed that your PR repository is still being updated. Could you please confirm whether I should use the latest vllm-omni-pr branch or the vllm-omni-20260211 branch to reproduce the results?

@zhtmike
Author

zhtmike commented Mar 16, 2026

Hi, thank you for your reply! I noticed that your PR repository is still being updated. Could you please confirm whether I should use the latest vllm-omni-pr branch or the vllm-omni-20260211 branch to reproduce the results?

This PR will no longer be updated except for small bug/typo fixes (which will not affect the reproduction results). You can use the latest version of vllm-omni-pr for your test.
