[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage#5297
zhtmike wants to merge 70 commits into verl-project:main from
Conversation
* add entrypoint (#1)
* add training engine (#2): add training engine; fix init; fix typos
* move folders & make for two-forward pass in training loop (#4)
* Add diffusion reward loop (#3): init reward; add OCR reward; update disrm input; add unit test; pass UT; fix typos/bugs; update copyright
* [fix] update customized reward func in UT (#5): update customized reward_fn
* Update 20260109 (#8): update; fix CI
* [data] feat: Add dataset for Qwen-Image (#6): init dataset for Qwen-Image; pass UT; update return, update UT; align with rl_dataset; update filter long prompts; clean code
* add new config; debug actor; add reward config; add adv, policy loss; debug reward loop; init diffusers engine UT; debug actor forward; add UT for adv and loss; pass adv & loss UTs; pass engine backward UT; clean debug code
* update to align verl data format; debug --------- Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add agent loop * add server manager * Add single turn loop * add test case * add replica * clean dummy input * fix bugs * fix bugs 2 * fix bugs 3 * fix bugs 4 and add vllm-omni patch * implement sde * add custom_pipeline option in verl * fix some bugs in custom pipeline * fix OOM * add intermediate outputs * support inputs without mask * clean & bug fix * rebase master * fix some bugs * fix chat template (temporary fix) * fix several bugs & add custom pipeline * fix several bugs * fix reward loop * pass CI (single card) * minor fix * fix import * fix bugs * fix import * merge master * add sleep mode back * merge main * support passing num_inference_steps * update according to suggestion * align with master * add input_ids & attention_mask back, drop hard-coded chat template * support varlen prompt input
* update scripts * fix engine name & use image compressibility temporarily * fix some bugs * clean unnecessary change * fix some bugs * fix bugs & clean configs * add autogen * fix CI * clean args * fix typo * update script * fix update weight * add hijack * fix checkpoint loading * disable free cache engine temporarily
* support wandb val visual log; support async genrm/rule reward_loop in val * update script * add comment
* enable reward loop * add timeout check for replica sleep * fix train script * consistent naming & fix mask * fix UT for multi-card * fix seq_len & clean files * drop sleep due to bug fix in vllm-omni side
* fix bugs * fix timesteps * fix lora * consistent script * fix image size * fix pipeline parse * add max model len to qwen-image * bypass bug * fix misc. bugs
* fix bugs * fix bugs * fix advantage cal
* support sync reward for val * wake up rollout after reward in val * debug
* fix sleep mode & non-lora weight update * fix from review
* fix bugs * update UT * fix config * update config * fix lora weight exporting * revert noise * revert size * format
* fix training * update script
* revert python change * Api compatible with vllm omni Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Bug fix on qwen image transformers Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Bug fix on verl Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Fix import path bug Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> * Address PR review comments: rename sp_kwargs, restore priority, real finish_reason, delete local data.py, revert build_app import * Minor fix Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> --------- Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com> Co-authored-by: Mike Cheung <zhtmike@gmail.com>
* fix CI * fix CI * fix bug * update script
Hi, thanks for the great PR. I'm also trying out RL post-training with FlowGRPO.
The reward curve can be found in the chart.
Thanks! Is it possible to get the reward on the training set (e.g. critic/rewards/mean)? My training-set reward does not converge. Is it normal that the reward starts from 0.2 in the initial steps? I see your validation-set reward is around 0.9 even at step 0.
We follow the paper's setting: 10 inference steps in the rollout phase during training, and 50 steps during evaluation.
Hello, may I ask which repository should be cloned to directly reproduce the training curves you reported? Additionally, could you specify the exact versions of vllm and vllm-omni that were used? Thank you for your help.
Hi, you can refer to the PR branch for validation. The versions of vllm and vllm-omni are listed in the
Hi, thank you for your reply! I noticed that your PR repository is still being updated. Could you please confirm whether I should use the latest vllm-omni-pr or the vllm-omni-20260211 branch to reproduce the results?
This PR will no longer be updated except for small bug/typo fixes (which will not affect the reproduction result). You can use the latest version of vllm-omni-pr for your test.


What does this PR do?
Follow-up Work for #4639
vLLM-Omni has been added as the rollout engine. Diffusers has been integrated as the training engine for the diffusion model.

Checklist Before Starting
* Title format: `[{modules}] {type}: {description}` (this will be checked by the CI).
  * `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
  * `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  * If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
Install Method
We provide a quick way to test with our example code. Formal support will be released together with vllm-omni 0.17.
Experiments were conducted on 4 NVIDIA A100 GPUs using the OCR reward (Levenshtein distance) with the Qwen3-VL-8B-Instruct model.
Test Dataset: OCR Dataset
Training Sample Size: 19653
Test Sample Size: 1018
Training GPU hours required to reach and maintain a validation reward score of approximately 0.9:
(table relating `rollout.n` to GPU hours; see `examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh`)

Some results from wandb logging:
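The OCR reward used in these experiments scores how closely the text rendered in the generated image matches the target string via Levenshtein edit distance. A minimal sketch of how such a reward could be computed, assuming the OCR output is already extracted as a string (the PR's actual reward function may normalize or weight differently):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution
            ))
        prev = curr
    return prev[-1]


def ocr_reward(prediction: str, target: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means an exact match."""
    if not target:
        return 1.0 if not prediction else 0.0
    dist = levenshtein(prediction, target)
    return max(0.0, 1.0 - dist / len(target))
```

With a normalized reward like this, an exact OCR match on the target text scores 1.0, which is consistent with the ~0.9 validation scores discussed above.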
Loss:
Validation Score:

Performance:

Visualization Result:
Prompt: A medieval knight holds a shield emblazoned with the crest "Fortis et Fidelis", standing resolutely in a sunlit courtyard, surrounded by ancient stone walls and banners fluttering in the breeze.
API and Usage Example
# Add code snippet or script demonstrating how to use this

Design & Code Changes
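FlowGRPO inherits GRPO's group-relative advantage: the rewards of the `rollout.n` images generated from the same prompt are normalized within that group. A minimal sketch of that computation under those assumptions — the function name and the exact std handling are illustrative, not the PR's code:

```python
import math


def grpo_advantages(rewards: list[float], group_size: int, eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: within each group of samples drawn from
    the same prompt, subtract the group mean reward and divide by the
    group std (eps added for numerical stability)."""
    assert len(rewards) % group_size == 0, "rewards must be a whole number of groups"
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mean = sum(group) / group_size
        var = sum((r - mean) ** 2 for r in group) / group_size
        std = math.sqrt(var)
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages
```

Because the normalization is per-group, a prompt whose rollouts all score the same (e.g. all 0.9) contributes near-zero advantages, which is why reward variance within a group matters for learning signal.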
Please refer to #4639
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
* Run pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
* Request CI via the `ci-request` channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
* If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.