Commits (showing changes from 34 of 70 commits)
70a155a
add entroypoint (#1)
zhtmike Jan 6, 2026
62c5286
add training engine (#2)
zhtmike Jan 7, 2026
c0150da
move folders & make for two-forward pass in training loop (#4)
zhtmike Jan 8, 2026
43915bc
Add diffusion reward loop (#3)
chenyingshu Jan 8, 2026
0833f81
[fix] update customized reward func in UT (#5)
chenyingshu Jan 8, 2026
4d0a8d8
Update 20260109 (#8)
zhtmike Jan 9, 2026
4480199
[data] feat: Add dataset for Qwen-Image (#6)
chenyingshu Jan 9, 2026
3c354d1
small fix after rebase (#12)
zhtmike Jan 26, 2026
01f6f7c
[trainer, cfg] fix: actor engine and trainer debug (#10)
chenyingshu Jan 27, 2026
b418656
Merge branch 'main' into verl-omni
zhtmike Jan 28, 2026
7d522ee
merge main (#13)
zhtmike Jan 28, 2026
abdb5d4
Merge remote-tracking branch 'origin/main' into verl-omni
zhtmike Jan 28, 2026
647c043
[data] fix: QwenDataset update (#14)
chenyingshu Jan 29, 2026
d3d2ac4
[rollout] feat: Add vllm-omni for rollout (#9)
zhtmike Jan 29, 2026
80738a3
fix worker extension (#15)
zhtmike Jan 29, 2026
a9b88f3
fix worker extension
zhtmike Jan 29, 2026
cf314d0
Merge branch 'main' into verl-omni
zhtmike Feb 2, 2026
6eb395a
merge main
zhtmike Feb 2, 2026
a32de27
[rollout] feat: flowgrpo with vllm-omni (rollout part) (#16)
zhtmike Feb 3, 2026
24d00a7
[reward, misc] fix: support async reward loop for validation (#18)
chenyingshu Feb 5, 2026
be667a3
[rollout] feat: enable reward model (#17)
zhtmike Feb 5, 2026
8edd6d5
[trainer] feat: fix training loop (#19)
zhtmike Feb 6, 2026
b008b15
[rollout] fix: fix misc. bugs (#20)
zhtmike Feb 6, 2026
46ffce8
turn on offload to avoid oom
zhtmike Feb 6, 2026
af7ab01
[misc] feat: support sync reward loop for validation (#21)
chenyingshu Feb 6, 2026
109427b
[rollout] fix: fix sleep mode & non-lora weight update (#22)
zhtmike Feb 9, 2026
37f60a3
add padding conversion (#24)
chenyingshu Feb 11, 2026
8fe64da
[rollout] fix: fix lora weight export from trainer (#23)
zhtmike Feb 11, 2026
838e28c
[trainer] fix: fix training (#25)
zhtmike Feb 11, 2026
ac8122a
Merge branch 'main' into verl-omni-main
zhtmike Feb 11, 2026
c903e63
Merge branch 'main' into verl-omni-main
zhtmike Feb 12, 2026
937abd0
Merge branch 'main' into verl-omni-main
zhtmike Feb 12, 2026
e3b41ff
[fsdp,vllm_omni,algo] fix: Merge main (#26)
zhtmike Feb 12, 2026
1942ed3
revert python change
zhtmike Feb 12, 2026
89b49e5
fix bug during ckpt saving (#27)
chenyingshu Feb 13, 2026
5da0abe
[vllm_omni] fix: add cfg & clean codes (#28)
zhtmike Feb 13, 2026
0c0acfd
update license (#29)
zhtmike Feb 13, 2026
0a2e3b9
Merge branch 'main' into verl-omni
zhtmike Feb 20, 2026
156014f
Merge branch 'main' into verl-omni
zhtmike Feb 20, 2026
4ec5021
[trainer] refactor: support kl training & clean codes (#30)
zhtmike Feb 23, 2026
0905629
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Feb 23, 2026
e53770c
update ocr model (#31)
zhtmike Feb 24, 2026
c59767f
[cfg] refactor: refactor rollout configurations (#32)
chenyingshu Feb 24, 2026
0df76a9
[reward] feat: async reward via a separate api call (#34)
chenyingshu Feb 24, 2026
b4d5f80
[misc] chore: change to fast UT (#33)
zhtmike Feb 24, 2026
d8eb0d2
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Feb 24, 2026
9a81399
[rollout] feat: support bypass mode (#35)
zhtmike Feb 26, 2026
01fe220
[perf] chore: align flowgrpo Qwen-Image training config (#36)
chenyingshu Feb 27, 2026
a80c0c4
Merge branch 'main' into verl-omni
zhtmike Mar 2, 2026
6a7798f
merge main (#37)
zhtmike Mar 2, 2026
8dc25ae
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 2, 2026
3169944
update script (#38)
zhtmike Mar 2, 2026
f89a7e2
[doc] chore: add README (#39)
zhtmike Mar 3, 2026
9ff7986
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 3, 2026
5fd3362
update doc (#40)
zhtmike Mar 3, 2026
574363f
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 3, 2026
41d0173
Merge branch 'main' into verl-omni
zhtmike Mar 4, 2026
edfdee2
merge main (#43)
zhtmike Mar 4, 2026
3e28e06
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 4, 2026
1e0fd88
[misc] chore: misc changes (#44)
zhtmike Mar 5, 2026
cad2165
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 6, 2026
79e6427
Merge branch 'main' into verl-omni
zhtmike Mar 9, 2026
36177ff
Merge branch 'main' into verl-omni
zhtmike Mar 9, 2026
d1379df
[misc] chore: merge main (#46)
zhtmike Mar 9, 2026
716436b
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 10, 2026
9938cd8
[rollout] feat: Rebase with vllm-omni 0.16.0 (#42)
knlnguyen1802 Mar 10, 2026
a5fdd4b
[misc] chore: fix CI & bugs after vllm-omni upgrade (#47)
zhtmike Mar 11, 2026
2e428f5
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 11, 2026
6b7a4f0
fix mask
zhtmike Mar 13, 2026
1aa6693
Merge branch 'verl-omni' into verl-omni-pr
zhtmike Mar 16, 2026
68 changes: 68 additions & 0 deletions examples/flowgrpo_trainer/run_flowgrpo.sh
@@ -0,0 +1,68 @@
# Qwen-Image lora, vllm_omni rollout
set -x
export TOKENIZERS_PARALLELISM="false"

ENGINE=vllm_omni
REWARD_ENGINE=vllm

reward_path=tests/experimental/reward_loop/reward_fn.py
reward_model_name=$HOME/models/Qwen/Qwen2.5-VL-3B-Instruct


python3 -m verl.trainer.main_flowgrpo \
    algorithm.adv_estimator=flow_grpo \
    data.train_files=$HOME/dataset/ocr/train.txt \
    data.val_files=$HOME/dataset/ocr/test.txt \
    data.train_batch_size=32 \
    data.val_max_samples=128 \
    data.max_prompt_length=1058 \
    data.filter_overlong_prompts=True \
    data.data_source=ocr \
    data.custom_cls.path=verl/utils/dataset/qwen_dataset.py \
    data.custom_cls.name=QwenDataset \
    +data.apply_chat_template_kwargs.max_length=1058 \
    +data.apply_chat_template_kwargs.padding=True \
    +data.apply_chat_template_kwargs.truncation=True \
    actor_rollout_ref.model.path=$HOME/models/Qwen/Qwen-Image \
    actor_rollout_ref.model.tokenizer_path=$HOME/models/Qwen/Qwen-Image/tokenizer \
    actor_rollout_ref.model.lora_rank=64 \
    actor_rollout_ref.model.lora_alpha=128 \
    actor_rollout_ref.model.target_modules="['to_q','to_k','to_v','to_out.0','add_q_proj','add_k_proj','add_v_proj','to_add_out','img_mlp.net.0.proj','img_mlp.net.2','txt_mlp.net.0.proj','txt_mlp.net.2']" \
    actor_rollout_ref.actor.optim.lr=3e-4 \
    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
    actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=$ENGINE \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.rollout.guidance_scale=1.0 \
    actor_rollout_ref.rollout.agent.default_agent_loop=diffusion_single_turn_agent \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.rollout.max_model_len=1058 \
    actor_rollout_ref.rollout.sde_window_size=3 \
    actor_rollout_ref.rollout.sde_window_range="[0,5]" \
    +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.custom_pipeline=verl.workers.utils.vllm_omni_patch.pipelines.pipeline_qwenimage.QwenImagePipelineWithLogProb \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    reward.reward_manager.name=diffusion \
    reward.reward_model.model_path=$reward_model_name \
    reward.reward_model.enable=True \
    reward.reward_model.rollout.name=$REWARD_ENGINE \
    reward.custom_reward_function.path=$reward_path \
    reward.custom_reward_function.name=compute_score_ocr \
    trainer.use_legacy_worker_impl=disable \
    trainer.logger='["console", "wandb"]' \
    trainer.project_name=flow_grpo \
    trainer.experiment_name=qwen_image_ocr \
    trainer.log_val_generations=8 \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 "$@"
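The value 1058 repeated in `data.max_prompt_length`, `apply_chat_template_kwargs.max_length`, and `rollout.max_model_len` is not arbitrary: the unit test in this PR derives it as a 1024-token tokenizer budget plus a 34-token chat-template prefix. A minimal sketch of that arithmetic (the 1024/34 split is an observation from the test, not a documented contract):

```python
# The three 1058 values in the script above must stay in sync; per the unit
# test, they decompose as tokenizer budget + chat-template prefix tokens.
tokenizer_max_length = 1024            # tokens available for the user prompt
prompt_template_encode_start_idx = 34  # template tokens preceding the prompt
max_length = tokenizer_max_length + prompt_template_encode_start_idx
assert max_length == 1058
```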
1 change: 1 addition & 0 deletions scripts/generate_trainer_config.sh
@@ -6,6 +6,7 @@ set -euox pipefail
CONFIG_SPECS=(
    "ppo_trainer:_generated_ppo_trainer.yaml:"
    "ppo_megatron_trainer:_generated_ppo_megatron_trainer.yaml:--config-name=ppo_megatron_trainer.yaml"
    "ppo_diffusion_trainer:_generated_ppo_diffusion_trainer.yaml:--config-name=ppo_diffusion_trainer.yaml"
    "ppo_trainer:_generated_ppo_veomni_trainer.yaml:model_engine=veomni"
)
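Each `CONFIG_SPECS` entry packs three colon-separated fields: config name, generated output file, and optional extra CLI arguments. A Python sketch of how one entry decomposes (the actual parsing lives in the shell script; this only illustrates the field layout):

```python
# New entry added by this PR; split on the first two colons only, so the
# trailing args field may itself contain ':' characters.
spec = "ppo_diffusion_trainer:_generated_ppo_diffusion_trainer.yaml:--config-name=ppo_diffusion_trainer.yaml"
name, output, extra_args = spec.split(":", 2)
assert name == "ppo_diffusion_trainer"
assert output == "_generated_ppo_diffusion_trainer.yaml"
assert extra_args == "--config-name=ppo_diffusion_trainer.yaml"
```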

131 changes: 131 additions & 0 deletions tests/experimental/agent_loop/test_diffusion_agent_loop.py
@@ -0,0 +1,131 @@
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os

import numpy as np
import pytest
import ray
from omegaconf import DictConfig
from PIL import Image

from verl.experimental.agent_loop.diffusion_agent_loop import DiffusionAgentLoopManager
from verl.protocol import DataProto


@pytest.fixture
def init_config() -> DictConfig:
    from hydra import compose, initialize_config_dir

    with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
        config = compose(config_name="ppo_diffusion_trainer")

    model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")
    config.actor_rollout_ref.model.path = model_path
    config.actor_rollout_ref.model.tokenizer_path = os.path.join(model_path, "tokenizer")
    config.actor_rollout_ref.rollout.name = "vllm_omni"
    config.actor_rollout_ref.rollout.mode = "async"
    config.actor_rollout_ref.rollout.enforce_eager = True
    config.actor_rollout_ref.rollout.n = 4
    config.actor_rollout_ref.rollout.num_inference_steps = 10
    config.actor_rollout_ref.rollout.guidance_scale = 1.0
    config.actor_rollout_ref.rollout.agent.num_workers = 2
    config.actor_rollout_ref.rollout.skip_tokenizer_init = True
    config.actor_rollout_ref.rollout.agent.default_agent_loop = "diffusion_single_turn_agent"
    config.actor_rollout_ref.rollout.sde_window_size = 3
    config.actor_rollout_ref.rollout.sde_window_range = [0, 5]

    qwen_pipeline = "verl.workers.utils.vllm_omni_patch.pipelines.pipeline_qwenimage.QwenImagePipelineWithLogProb"
    config.actor_rollout_ref.rollout.engine_kwargs.vllm_omni = {"custom_pipeline": qwen_pipeline}
    config.data.custom_cls.path = "verl/utils/dataset/qwen_dataset.py"
    config.data.custom_cls.name = "QwenDataset"
    config.reward.reward_manager.name = "diffusion"
    config.trainer.n_gpus_per_node = 4

    tokenizer_max_length = 1024
    prompt_template_encode_start_idx = 34
    max_length = tokenizer_max_length + prompt_template_encode_start_idx

    config.data.apply_chat_template_kwargs = dict(max_length=max_length, padding=True, truncation=True)
    config.data.max_prompt_length = max_length
    config.actor_rollout_ref.rollout.max_model_len = max_length

    # TODO (mike): test with TP later
    config.actor_rollout_ref.rollout.tensor_model_parallel_size = 1
    return config


def test_single_turn(init_config):
    ray.init(
        runtime_env={
            "env_vars": {
                "TOKENIZERS_PARALLELISM": "true",
                "NCCL_DEBUG": "WARN",
                "VLLM_LOGGING_LEVEL": "INFO",
            }
        }
    )

    agent_loop_manager = DiffusionAgentLoopManager(init_config)

    system_prompt = (
        "Describe the image by detailing the color, shape, size, texture, quantity, text, "
        "spatial relationships of the objects and background:"
    )
    user_prompts = ["A photo of cute cat with long fur and big eyes.", "A photo of cute dog with short hair."]

    raw_prompts = []
    for user_prompt in user_prompts:
        raw_prompts.append(
            [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ]
        )

    batch = DataProto(
        non_tensor_batch={
            "raw_prompt": np.array(raw_prompts),
            "data_source": np.array(["jpeg_compressibility"] * len(raw_prompts)),
            "reward_model": np.array([{"style": "rule", "ground_truth": ""}] * len(raw_prompts)),
        },
    )
    n = init_config.actor_rollout_ref.rollout.n
    batch = batch.repeat(n)
    result = agent_loop_manager.generate_sequences(prompts=batch)
    assert len(result) == len(raw_prompts) * n

    expected_batch_keys = [
        "responses",
        "all_latents",
        "all_timesteps",
        "prompt_embeds",
        "prompt_embeds_mask",
        "input_ids",
        "attention_mask",
    ]
    for key in expected_batch_keys:
        assert key in result.batch, f"Key {key} not found in result batch."

    # check turns
    num_turns = result.non_tensor_batch["__num_turns__"]
    assert np.all(num_turns == 2)

    # TODO: for visualization, drop later
    images_pil = (result.batch["responses"].permute(0, 2, 3, 1).numpy() * 255.0).astype("uint8")
    for i, image in enumerate(images_pil):
        image_path = os.path.join(f"{i}.jpg")
        Image.fromarray(image).save(image_path)

    print("Test passed!")
    ray.shutdown()
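The visualization step at the end of the test assumes the rollout returns float images in [0, 1] with NCHW layout. A NumPy-only sketch of that same NCHW→NHWC, float→uint8 conversion (random data stands in for actual model responses; the shapes here are illustrative):

```python
import numpy as np

# Fake batch standing in for result.batch["responses"]: 2 images, CHW floats in [0, 1].
batch = np.random.default_rng(0).random((2, 3, 8, 8), dtype=np.float32)

# NCHW -> NHWC, then scale to uint8 exactly as the test does before saving JPEGs.
images = (np.transpose(batch, (0, 2, 3, 1)) * 255.0).astype("uint8")
assert images.shape == (2, 8, 8, 3)
assert images.dtype == np.uint8
```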
91 changes: 90 additions & 1 deletion tests/experimental/reward_loop/reward_fn.py
@@ -1,4 +1,5 @@
# Copyright 2025 Bytedance Ltd. and/or its affiliates
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -12,11 +13,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import base64
import json
import os
from io import BytesIO

import aiohttp
import numpy as np
import torch
from openai.types.chat import ChatCompletion
from PIL import Image
from transformers import PreTrainedTokenizer

GRM_PROMPT_TEMPLATE = """
@@ -28,7 +34,7 @@
Solution:
{solution}

Please evaluate how well the solution addresses the problem.
Give a score from 1 to 10, where:
- 1 means the solution is completely irrelevant or incorrect.
- 5 means the solution is partially correct but incomplete or not well reasoned.
@@ -98,3 +104,86 @@ def compute_score_math_verify(
        model_output=solution_str,
        ground_truth=ground_truth,
    )


def _pil_image_to_base64(image: Image.Image) -> str:
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    encoded_image_text = base64.b64encode(buffered.getvalue()).decode("utf-8")
    # include the full MIME type, as RFC 2397 data URLs expect
    base64_image = f"data:image/png;base64,{encoded_image_text}"
    return base64_image
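A stdlib-only illustration of the data-URL shape this helper produces, using just the PNG signature bytes in place of a real image (note that RFC 2397 data URLs normally carry a full MIME type such as `image/png`):

```python
import base64

# PNG signature bytes stand in for real encoded image data.
png_magic = b"\x89PNG\r\n\x1a\n"
encoded = base64.b64encode(png_magic).decode("utf-8")
data_url = f"data:image/png;base64,{encoded}"
assert data_url == "data:image/png;base64,iVBORw0KGgo="
```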


async def compute_score_ocr(
    data_source: str,
    solution_image: Image.Image | np.ndarray | torch.Tensor,
    ground_truth: str,
    extra_info: dict,
    reward_router_address: str,
    reward_model_tokenizer: PreTrainedTokenizer = None,
    model_name: str = None,
):
    """Compute the reward score."""
    import re

    import Levenshtein

    from verl.utils.ray_utils import get_event_loop

    # preprocess image to base64
    image = solution_image
    if isinstance(image, torch.Tensor):
        image = image.float().permute(1, 2, 0).cpu().numpy()
    if isinstance(image, np.ndarray):
        assert image.shape[-1] == 3, "must be in HWC format"
        image = (image * 255).round().clip(0, 255).astype(np.uint8)
        image = Image.fromarray(image)
    assert isinstance(image, Image.Image)

    image_base64 = await get_event_loop().run_in_executor(None, _pil_image_to_base64, image)

    # prepare chat template
    grm_prompt = "Please output only the text content from the image without any additional descriptions or formatting."
    query = [
        {
            "type": "image_url",
            "image_url": {"url": image_base64},
        },
        {"type": "text", "text": grm_prompt},
    ]
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": query,
        },
    ]

    sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
    model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
    chat_complete_request = {
        "messages": messages,
        "model": model_name,
        **sampling_params,
    }
    result = await chat_complete(
        router_address=reward_router_address,
        chat_complete_request=chat_complete_request,
    )
    grm_response = result.choices[0].message.content

    # compute OCR score
    text = grm_response
    # strip all whitespace and lowercase before comparison
    gt = re.sub(r"\s+", "", ground_truth).lower()
    text = re.sub(r"\s+", "", text).lower()
    if gt in text:
        dist = 0
    else:
        dist = Levenshtein.distance(text, gt)

    # cap the distance at the ground-truth length so a long, unrelated
    # response incurs at most the full penalty
    dist = min(dist, len(gt))
    score = 1 - dist / len(gt)

    return {"score": score, "acc": score == 1, "genrm_response": grm_response}
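The score normalization at the end of `compute_score_ocr` can be exercised in isolation. A self-contained sketch mirroring that logic, with a plain-Python edit distance standing in for the external `Levenshtein` package (function names here are illustrative, not part of the PR):

```python
import re


def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance, stand-in for Levenshtein.distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]


def ocr_score(prediction: str, ground_truth: str) -> float:
    """Normalized edit-distance score in [0, 1], mirroring compute_score_ocr."""
    gt = re.sub(r"\s+", "", ground_truth).lower()
    text = re.sub(r"\s+", "", prediction).lower()
    if gt in text:
        return 1.0
    dist = levenshtein(text, gt)
    dist = min(dist, len(gt))  # cap penalty at the ground-truth length
    return 1 - dist / len(gt)


assert ocr_score("Hello World", "helloworld") == 1.0   # exact match after normalization
assert abs(ocr_score("hellp", "hello") - 0.8) < 1e-9   # one substitution out of five chars
assert ocr_score("zzzzzzzz", "ab") == 0.0              # capped: fully wrong scores zero
```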