diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index 91c0d21f2a3..dcc710d2270 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -6,7 +6,7 @@
 
 - [ ] Search for similar PRs. Paste at least one query link here: ...
 - [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
-  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
+  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `vllm_omni`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
   - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
   - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
   - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
diff --git a/docs/algo/flowgrpo.md b/docs/algo/flowgrpo.md
new file mode 100644
index 00000000000..bb719133994
--- /dev/null
+++ b/docs/algo/flowgrpo.md
@@ -0,0 +1,136 @@
+# Training Flow Matching Models via Online RL (Flow-GRPO)
+
+Flow-GRPO ([paper](https://arxiv.org/abs/2505.05470), [code](https://github.com/yifan123/flow_grpo)) is the first method to integrate online policy gradient reinforcement learning into **flow matching** generative models (e.g., Stable Diffusion 3, FLUX). It enables direct reward optimization for tasks such as compositional text-to-image generation, visual text rendering, and human preference alignment, without modifying the standard inference pipeline.
+
+Two core technical contributions make this possible:
+
+1. **ODE-to-SDE Conversion**: Flow matching models natively use a deterministic ODE sampler. Flow-GRPO converts this ODE into an equivalent SDE that preserves the model's marginal distribution at every timestep. This introduces the stochasticity required for group sampling and RL exploration.
+
+2. **Denoising Reduction**: Training on all denoising steps is expensive. Flow-GRPO reduces the number of *training* steps while keeping the original number of *inference* steps, significantly improving sampling efficiency without sacrificing reward performance.
+
+In the paper's experiments, tuning SD3.5-M with Flow-GRPO raises GenEval accuracy from 63% to 95% and visual text rendering accuracy from 59% to 92%.
+
+## Key Components
+
+- **Flow Matching Backbone**: operates on continuous-time flow matching models (e.g., SD3.5, FLUX) rather than discrete-token LLMs.
+- **ODE-to-SDE Rollout**: generates a group of diverse image trajectories by injecting controlled noise via SDE sampling at selected denoising steps.
+- **Denoising Reduction**: trains on a reduced subset of denoising steps (configurable via `sde_window_size` and `sde_window_range`) while inference uses the full step count.
+- **Image Reward Models**: rewards are assigned by external reward models (e.g., GenEval, OCR, PickScore, aesthetic score) rather than rule-based verifiers.
+- **No Critic**: like GRPO for LLMs, no separate value network is trained; advantages are computed from group-relative rewards.
+
+## Key Differences: GRPO vs. Flow-GRPO
+
+| Dimension | GRPO (LLM) | Flow-GRPO (Diffusion) |
+|---|---|---|
+| **Model type** | Autoregressive language model | Flow matching / diffusion model |
+| **Action space** | Discrete token sequences | Continuous denoising trajectories (SDE paths) |
+| **Rollout mechanism** | Sample `n` token sequences per prompt | Convert ODE to SDE; sample `n` image trajectories per prompt via stochastic denoising |
+| **Log-probability** | Standard next-token log-prob | Log-prob of the sampled SDE transition (a Gaussian step) at each selected denoising step |
+| **Training steps** | Every generated token contributes to the loss | Denoising Reduction: train on a small window of steps, infer with full steps |
+| **Reward signal** | Rule-based verifiers or LLM judges on text | Image reward models (GenEval, OCR, PickScore, aesthetic, etc.) |
+| **KL regularization** | KL penalty added to reward or directly to loss | KL loss applied to SDE steps; `use_kl_loss=True` recommended |
+| **CFG (guidance)** | Not applicable | CFG distillation occurs naturally; CFG can be disabled at both train and test time |
+| **Advantage estimator** | `algorithm.adv_estimator=grpo` | `algorithm.adv_estimator=flow_grpo` |
+| **Loss mode** | `actor_rollout_ref.actor.policy_loss.loss_mode` uses an LLM loss mode (not diffusion-specific) | `actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo` |
+
+## Configuration
+
+### Core parameters
+
+- `algorithm.adv_estimator`: Set to `flow_grpo` (instead of `grpo`).
+
+- `actor_rollout_ref.actor.policy_loss.loss_mode`: Set to `flow_grpo`.
+
+- `actor_rollout_ref.rollout.n`: Number of image trajectories to sample per prompt for group-relative advantage computation. Analogous to GRPO's group size; should be > 1 (default in examples: `16`).
+
+- `actor_rollout_ref.rollout.noise_level`: Controls the SDE noise injection level during rollout. Larger values increase diversity but may degrade image quality. Typical value: `1.2`.
+
+- `actor_rollout_ref.rollout.sde_window_size`: Number of denoising steps to train on per trajectory (Denoising Reduction). Reducing this from the full step count speeds up training significantly.
+
+- `actor_rollout_ref.rollout.sde_window_range`: The range of denoising steps from which the training window is sampled, e.g., `[0, 5]` to focus on early (high-noise) steps.
+
+- `actor_rollout_ref.rollout.val_kwargs.num_inference_steps`: Full number of denoising steps used during inference/evaluation. This is kept at its original value (e.g., `50`) and is independent of `sde_window_size`.
+
+- `actor_rollout_ref.rollout.guidance_scale`: Classifier-free guidance scale during rollout. Can be set to `1.0` (no CFG) because the RL process naturally performs CFG distillation.
+
+- `actor_rollout_ref.actor.use_kl_loss`: Set to `True` to add a KL divergence term between the trained policy and the reference policy to the loss.
+
+- `actor_rollout_ref.actor.kl_loss_coef`: Coefficient for the KL loss term.
+
+## Data Preprocessing
+
+All training scripts expect the dataset in parquet format. The examples use an OCR dataset from the [Flow-GRPO repository](https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr). The raw dataset consists of text files where each ground-truth answer is stored in the format `The image displays "xxx".`. Before running any training script, convert it to parquet format using the provided preprocessing script.
+
+### Step 1: Download the raw dataset
+
+Download the OCR dataset from the Flow-GRPO repository and place it at `~/dataset/ocr/` (or any path of your choice):
+
+```bash
+# Clone or download from https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr
+# Place the dataset directory at ~/dataset/ocr/
+# Expected structure:
+#   ~/dataset/ocr/
+#       train/   (or train split files)
+#       test/    (or test split files)
+```
+
+### Step 2: Run the preprocessing script
+
+```bash
+python examples/data_preprocess/qwenimage_ocr.py \
+    --local_dataset_path ~/dataset/ocr \
+    --local_save_dir ~/data/ocr
+```
+
+The output parquet files are consumed directly by all training scripts via `data.train_files` and `data.val_files`.
+
+## Variants
+
+### Flow-GRPO-Fast
+
+Flow-GRPO-Fast accelerates training by confining stochasticity to only one or two denoising steps per trajectory:
+
+1. Generate a deterministic ODE trajectory for each prompt.
+2. At a randomly chosen intermediate step, inject noise and switch to SDE sampling to produce the group.
+3. Continue the remaining steps with ODE sampling.
+
+This significantly reduces training cost: only the selected step(s) require gradient computation, and sampling before the branching point does not need group expansion. Flow-GRPO-Fast with 2 training steps matches full Flow-GRPO reward performance.
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo_fast.sh
+```
+
+### Async Reward
+
+For reward models that are expensive to evaluate (e.g., a VLM judge), the reward model can be allocated its own dedicated GPU resource pool and run asynchronously alongside the policy. This avoids blocking policy training on reward computation.
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh
+```
+
+### Full Fine-Tuning
+
+To fine-tune all model weights instead of using LoRA:
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo_full_ft.sh
+```
+
+## Reference Example
+
+LoRA training of Qwen-Image on 4 GPUs with the OCR reward, with CFG (`guidance_scale=4.0`) and KL loss (`kl_loss_coef=0.04`) enabled:
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo.sh
+```
+
+## Citation
+
+```bibtex
+@article{liu2025flow,
+  title={Flow-GRPO: Training Flow Matching Models via Online RL},
+  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
+  journal={arXiv preprint arXiv:2505.05470},
+  year={2025}
+}
+```
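The "No Critic" component described in `docs/algo/flowgrpo.md` above computes advantages from group-relative rewards. A minimal sketch of that normalization follows, assuming the usual GRPO form (reward minus the group mean, divided by the group standard deviation). The function name `group_relative_advantages`, the choice of population standard deviation, and the `eps` value are illustrative assumptions, not the `flow_grpo` estimator shipped in verl.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's group to zero mean and unit scale."""
    mean = fmean(rewards)
    # Population std is an assumption; the trainer may use a different estimator.
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical OCR rewards for four image trajectories sampled from the same prompt.
ocr_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(ocr_rewards)
print(advantages)
```

When every reward in a group is identical, all advantages are zero, so that prompt contributes no policy gradient for the step.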
diff --git a/examples/data_preprocess/qwenimage_ocr.py b/examples/data_preprocess/qwenimage_ocr.py
new file mode 100644
index 00000000000..3953e86a646
--- /dev/null
+++ b/examples/data_preprocess/qwenimage_ocr.py
@@ -0,0 +1,103 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Preprocess the OCR dataset to parquet format (for Qwen-Image training).
+You can obtain the raw dataset from https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr
+"""
+
+import argparse
+import os
+
+import datasets
+
+from verl.utils.hdfs_io import copy, makedirs
+
+
+def extract_solution(solution_str):
+    # The solution is stored in the format: 'The image displays "xxx".'
+    return solution_str.split('"')[1]
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--local_dir", default=None)
+    parser.add_argument("--hdfs_dir", default=None)
+    parser.add_argument(
+        "--local_dataset_path", default="~/dataset/ocr/", help="The local path to the raw dataset, if it exists."
+    )
+    parser.add_argument(
+        "--local_save_dir", default="~/data/ocr", help="The save directory for the preprocessed dataset."
+    )
+
+    args = parser.parse_args()
+    # Expand "~" only when a path is given; otherwise keep None so the check below can raise.
+    local_dataset_path = os.path.expanduser(args.local_dataset_path) if args.local_dataset_path else None
+
+    data_source = "flow_grpo/ocr"
+
+    if local_dataset_path is not None:
+        dataset = datasets.load_dataset(local_dataset_path)
+    else:
+        raise NotImplementedError(
+            "This dataset is not available on the Hugging Face Hub. "
+            "Please get dataset from https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr"
+        )
+
+    train_dataset = dataset["train"]
+    test_dataset = dataset["test"]
+
+    system_prompt = (
+        "Describe the image by detailing the color, shape, size, "
+        "texture, quantity, text, spatial relationships of the objects and background:"
+    )
+    negative_user_prompt = " "
+
+    def make_map_fn(split):
+        def process_fn(example, idx):
+            text = example.pop("text")
+            solution = extract_solution(text)
+            data = {
+                "data_source": data_source,
+                "prompt": [
+                    {"role": "system", "content": system_prompt},
+                    {"role": "user", "content": text},
+                ],
+                "negative_prompt": [
+                    {"role": "system", "content": system_prompt},
+                    {"role": "user", "content": negative_user_prompt},
+                ],
+                "ability": "ocr",
+                "reward_model": {"style": "model", "ground_truth": solution},
+                "extra_info": {"split": split, "index": idx},
+            }
+            return data
+
+        return process_fn
+
+    train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
+    test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True)
+
+    hdfs_dir = args.hdfs_dir
+    local_save_dir = args.local_dir
+    if local_save_dir is not None:
+        print("Warning: Argument 'local_dir' is deprecated. Please use 'local_save_dir' instead.")
+    else:
+        local_save_dir = args.local_save_dir
+    local_save_dir = os.path.expanduser(local_save_dir)
+    train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
+    test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))
+
+    if hdfs_dir is not None:
+        makedirs(hdfs_dir)
+        copy(src=local_save_dir, dst=hdfs_dir)
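To make the record layout produced by `qwenimage_ocr.py` concrete, the sketch below applies the same `extract_solution` rule to one sample line. `build_record` is a hypothetical helper that mirrors the script's `process_fn`; the sample text and the shortened `system_prompt` are made up for illustration.

```python
def extract_solution(solution_str):
    # Same rule as the script: return the text between the first pair of double quotes.
    return solution_str.split('"')[1]


def build_record(text, split, idx, system_prompt="Describe the image:"):
    # Hypothetical helper mirroring process_fn; the real script uses a longer system prompt.
    return {
        "data_source": "flow_grpo/ocr",
        "prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "negative_prompt": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": " "},
        ],
        "ability": "ocr",
        "reward_model": {"style": "model", "ground_truth": extract_solution(text)},
        "extra_info": {"split": split, "index": idx},
    }


# Made-up sample line in the dataset's 'The image displays "xxx".' format.
record = build_record('The image displays "OPEN 24 HOURS".', split="train", idx=0)
print(record["reward_model"]["ground_truth"])
```

A line without a double-quoted span makes `extract_solution` raise `IndexError`, so malformed rows fail loudly during `map` instead of producing an empty ground truth.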
diff --git a/examples/flowgrpo_trainer/README.md b/examples/flowgrpo_trainer/README.md
new file mode 100644
index 00000000000..bb719133994
--- /dev/null
+++ b/examples/flowgrpo_trainer/README.md
@@ -0,0 +1,136 @@
+# Training Flow Matching Models via Online RL (Flow-GRPO)
+
+Flow-GRPO ([paper](https://arxiv.org/abs/2505.05470), [code](https://github.com/yifan123/flow_grpo)) is the first method to integrate online policy gradient reinforcement learning into **flow matching** generative models (e.g., Stable Diffusion 3, FLUX). It enables direct reward optimization for tasks such as compositional text-to-image generation, visual text rendering, and human preference alignment, without modifying the standard inference pipeline.
+
+Two core technical contributions make this possible:
+
+1. **ODE-to-SDE Conversion**: Flow matching models natively use a deterministic ODE sampler. Flow-GRPO converts this ODE into an equivalent SDE that preserves the model's marginal distribution at every timestep. This introduces the stochasticity required for group sampling and RL exploration.
+
+2. **Denoising Reduction**: Training on all denoising steps is expensive. Flow-GRPO reduces the number of *training* steps while keeping the original number of *inference* steps, significantly improving sampling efficiency without sacrificing reward performance.
+
+In the paper's experiments, tuning SD3.5-M with Flow-GRPO raises GenEval accuracy from 63% to 95% and visual text rendering accuracy from 59% to 92%.
+
+## Key Components
+
+- **Flow Matching Backbone**: operates on continuous-time flow matching models (e.g., SD3.5, FLUX) rather than discrete-token LLMs.
+- **ODE-to-SDE Rollout**: generates a group of diverse image trajectories by injecting controlled noise via SDE sampling at selected denoising steps.
+- **Denoising Reduction**: trains on a reduced subset of denoising steps (configurable via `sde_window_size` and `sde_window_range`) while inference uses the full step count.
+- **Image Reward Models**: rewards are assigned by external reward models (e.g., GenEval, OCR, PickScore, aesthetic score) rather than rule-based verifiers.
+- **No Critic**: like GRPO for LLMs, no separate value network is trained; advantages are computed from group-relative rewards.
+
+## Key Differences: GRPO vs. Flow-GRPO
+
+| Dimension | GRPO (LLM) | Flow-GRPO (Diffusion) |
+|---|---|---|
+| **Model type** | Autoregressive language model | Flow matching / diffusion model |
+| **Action space** | Discrete token sequences | Continuous denoising trajectories (SDE paths) |
+| **Rollout mechanism** | Sample `n` token sequences per prompt | Convert ODE to SDE; sample `n` image trajectories per prompt via stochastic denoising |
+| **Log-probability** | Standard next-token log-prob | Log-prob of the sampled SDE transition (a Gaussian step) at each selected denoising step |
+| **Training steps** | Every generated token contributes to the loss | Denoising Reduction: train on a small window of steps, infer with full steps |
+| **Reward signal** | Rule-based verifiers or LLM judges on text | Image reward models (GenEval, OCR, PickScore, aesthetic, etc.) |
+| **KL regularization** | KL penalty added to reward or directly to loss | KL loss applied to SDE steps; `use_kl_loss=True` recommended |
+| **CFG (guidance)** | Not applicable | CFG distillation occurs naturally; CFG can be disabled at both train and test time |
+| **Advantage estimator** | `algorithm.adv_estimator=grpo` | `algorithm.adv_estimator=flow_grpo` |
+| **Loss mode** | `actor_rollout_ref.actor.policy_loss.loss_mode` uses an LLM loss mode (not diffusion-specific) | `actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo` |
+
+## Configuration
+
+### Core parameters
+
+- `algorithm.adv_estimator`: Set to `flow_grpo` (instead of `grpo`).
+
+- `actor_rollout_ref.actor.policy_loss.loss_mode`: Set to `flow_grpo`.
+
+- `actor_rollout_ref.rollout.n`: Number of image trajectories to sample per prompt for group-relative advantage computation. Analogous to GRPO's group size; should be > 1 (default in examples: `16`).
+
+- `actor_rollout_ref.rollout.noise_level`: Controls the SDE noise injection level during rollout. Larger values increase diversity but may degrade image quality. Typical value: `1.2`.
+
+- `actor_rollout_ref.rollout.sde_window_size`: Number of denoising steps to train on per trajectory (Denoising Reduction). Reducing this from the full step count speeds up training significantly.
+
+- `actor_rollout_ref.rollout.sde_window_range`: The range of denoising steps from which the training window is sampled, e.g., `[0, 5]` to focus on early (high-noise) steps.
+
+- `actor_rollout_ref.rollout.val_kwargs.num_inference_steps`: Full number of denoising steps used during inference/evaluation. This is kept at its original value (e.g., `50`) and is independent of `sde_window_size`.
+
+- `actor_rollout_ref.rollout.guidance_scale`: Classifier-free guidance scale during rollout. Can be set to `1.0` (no CFG) because the RL process naturally performs CFG distillation.
+
+- `actor_rollout_ref.actor.use_kl_loss`: Set to `True` to add a KL divergence term between the trained policy and the reference policy to the loss.
+
+- `actor_rollout_ref.actor.kl_loss_coef`: Coefficient for the KL loss term.
+
+## Data Preprocessing
+
+All training scripts expect the dataset in parquet format. The examples use an OCR dataset from the [Flow-GRPO repository](https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr). The raw dataset consists of text files where each ground-truth answer is stored in the format `The image displays "xxx".`. Before running any training script, convert it to parquet format using the provided preprocessing script.
+
+### Step 1: Download the raw dataset
+
+Download the OCR dataset from the Flow-GRPO repository and place it at `~/dataset/ocr/` (or any path of your choice):
+
+```bash
+# Clone or download from https://github.com/yifan123/flow_grpo/tree/main/dataset/ocr
+# Place the dataset directory at ~/dataset/ocr/
+# Expected structure:
+#   ~/dataset/ocr/
+#       train/   (or train split files)
+#       test/    (or test split files)
+```
+
+### Step 2: Run the preprocessing script
+
+```bash
+python examples/data_preprocess/qwenimage_ocr.py \
+    --local_dataset_path ~/dataset/ocr \
+    --local_save_dir ~/data/ocr
+```
+
+The output parquet files are consumed directly by all training scripts via `data.train_files` and `data.val_files`.
+
+## Variants
+
+### Flow-GRPO-Fast
+
+Flow-GRPO-Fast accelerates training by confining stochasticity to only one or two denoising steps per trajectory:
+
+1. Generate a deterministic ODE trajectory for each prompt.
+2. At a randomly chosen intermediate step, inject noise and switch to SDE sampling to produce the group.
+3. Continue the remaining steps with ODE sampling.
+
+This significantly reduces training cost: only the selected step(s) require gradient computation, and sampling before the branching point does not need group expansion. Flow-GRPO-Fast with 2 training steps matches full Flow-GRPO reward performance.
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo_fast.sh
+```
+
+### Async Reward
+
+For reward models that are expensive to evaluate (e.g., a VLM judge), the reward model can be allocated its own dedicated GPU resource pool and run asynchronously alongside the policy. This avoids blocking policy training on reward computation.
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh
+```
+
+### Full Fine-Tuning
+
+To fine-tune all model weights instead of using LoRA:
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo_full_ft.sh
+```
+
+## Reference Example
+
+LoRA training of Qwen-Image on 4 GPUs with the OCR reward, with CFG (`guidance_scale=4.0`) and KL loss (`kl_loss_coef=0.04`) enabled:
+
+```bash
+bash examples/flowgrpo_trainer/run_flowgrpo.sh
+```
+
+## Citation
+
+```bibtex
+@article{liu2025flow,
+  title={Flow-GRPO: Training Flow Matching Models via Online RL},
+  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},
+  journal={arXiv preprint arXiv:2505.05470},
+  year={2025}
+}
+```
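The README above describes `sde_window_size` as the number of denoising steps trained per trajectory and `sde_window_range` as the range from which that window is sampled. One plausible reading is sketched below: a contiguous window is drawn inside the range, steps inside it use SDE sampling, and all other steps stay on the deterministic ODE path. The helpers `sample_sde_window` and `step_modes`, the half-open window convention, and `num_steps=10` are assumptions for illustration; the actual rollout pipeline may select steps differently.

```python
import random


def sample_sde_window(window_size, window_range, rng):
    """Pick a contiguous window [start, end) of steps that lies inside window_range."""
    low, high = window_range
    if not 0 < window_size <= high - low:
        raise ValueError("window_size must fit inside window_range")
    start = rng.randint(low, high - window_size)  # randint is inclusive on both ends
    return start, start + window_size


def step_modes(num_steps, window):
    """Label each denoising step: SDE inside the window, ODE elsewhere."""
    start, end = window
    return ["sde" if start <= i < end else "ode" for i in range(num_steps)]


rng = random.Random(0)  # seeded only to make the sketch reproducible
window = sample_sde_window(window_size=2, window_range=(0, 5), rng=rng)
modes = step_modes(num_steps=10, window=window)
print(window, modes)
```

Under this reading, only the steps labeled `sde` carry a log-probability and contribute to the loss, which is why a small window keeps training cost low.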
diff --git a/examples/flowgrpo_trainer/run_flowgrpo.sh b/examples/flowgrpo_trainer/run_flowgrpo.sh
new file mode 100644
index 00000000000..e52b70e692c
--- /dev/null
+++ b/examples/flowgrpo_trainer/run_flowgrpo.sh
@@ -0,0 +1,74 @@
+# Qwen-Image lora, vllm_omni rollout
+set -x
+
+ocr_train_path=$HOME/data/ocr/train.parquet
+ocr_test_path=$HOME/data/ocr/test.parquet
+
+ENGINE=vllm_omni
+REWARD_ENGINE=vllm
+
+reward_path=tests/experimental/reward_loop/reward_fn.py
+reward_model_name=$HOME/models/Qwen/Qwen3-VL-8B-Instruct
+
+
+python3 -m verl.trainer.main_ppo --config-path=config \
+    --config-name='ppo_diffusion_trainer.yaml' \
+    algorithm.adv_estimator=flow_grpo \
+    data.train_files=$ocr_train_path \
+    data.val_files=$ocr_test_path \
+    data.train_batch_size=32 \
+    data.max_prompt_length=1058 \
+    data.filter_overlong_prompts=True \
+    +data.apply_chat_template_kwargs.max_length=1058 \
+    +data.apply_chat_template_kwargs.padding=True \
+    +data.apply_chat_template_kwargs.truncation=True \
+    actor_rollout_ref.model.path=$HOME/models/Qwen/Qwen-Image \
+    actor_rollout_ref.model.tokenizer_path=$HOME/models/Qwen/Qwen-Image/tokenizer \
+    actor_rollout_ref.model.lora_rank=64 \
+    actor_rollout_ref.model.lora_alpha=128 \
+    actor_rollout_ref.model.target_modules="['to_q','to_k','to_v','to_out.0','add_q_proj','add_k_proj','add_v_proj','to_add_out','img_mlp.net.0.proj','img_mlp.net.2','txt_mlp.net.0.proj','txt_mlp.net.2']" \
+    actor_rollout_ref.actor.optim.lr=3e-4 \
+    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
+    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo \
+    actor_rollout_ref.actor.use_kl_loss=True \
+    actor_rollout_ref.actor.kl_loss_coef=0.04 \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=$ENGINE \
+    actor_rollout_ref.rollout.n=16 \
+    actor_rollout_ref.rollout.guidance_scale=4.0 \
+    actor_rollout_ref.rollout.agent.default_agent_loop=diffusion_single_turn_agent \
+    actor_rollout_ref.rollout.agent.num_workers=4 \
+    actor_rollout_ref.rollout.load_format=safetensors \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.max_model_len=1058 \
+    actor_rollout_ref.rollout.noise_level=1.2 \
+    actor_rollout_ref.rollout.sde_window_size=2 \
+    actor_rollout_ref.rollout.sde_window_range="[0,5]" \
+    actor_rollout_ref.rollout.val_kwargs.num_inference_steps=50 \
+    +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.custom_pipeline=verl.utils.vllm_omni.pipelines.QwenImagePipelineWithLogProb \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
+    reward.num_workers=4 \
+    reward.reward_manager.name=image \
+    reward.reward_model.enable=True \
+    reward.reward_model.model_path=$reward_model_name \
+    reward.reward_model.rollout.name=$REWARD_ENGINE \
+    reward.reward_model.rollout.tensor_model_parallel_size=4 \
+    reward.custom_reward_function.path=$reward_path \
+    reward.custom_reward_function.name=compute_score_ocr \
+    trainer.use_legacy_worker_impl=disable \
+    trainer.logger='["console", "wandb"]' \
+    trainer.project_name=flow_grpo \
+    trainer.experiment_name=qwen_image_ocr \
+    trainer.log_val_generations=8 \
+    trainer.val_before_train=False \
+    trainer.n_gpus_per_node=4 \
+    trainer.nnodes=1 \
+    trainer.save_freq=30 \
+    trainer.test_freq=30 \
+    trainer.total_epochs=15 "$@"
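The batch-related overrides in `run_flowgrpo.sh` interact multiplicatively. The arithmetic below is a rough sizing sketch using the values from the script; it assumes that each trajectory contributes one trained transition per step in the SDE window and that trajectories are split evenly across GPUs, neither of which is confirmed by the script itself.

```python
# Values copied from run_flowgrpo.sh.
train_batch_size = 32  # data.train_batch_size: prompts per training step
rollout_n = 16         # actor_rollout_ref.rollout.n: trajectories per prompt
sde_window_size = 2    # actor_rollout_ref.rollout.sde_window_size
n_gpus = 4             # trainer.n_gpus_per_node * trainer.nnodes

trajectories_per_step = train_batch_size * rollout_n
# Assumption: one trained transition per trajectory per step in the SDE window.
transitions_per_step = trajectories_per_step * sde_window_size
# Assumption: trajectories are distributed evenly across GPUs.
trajectories_per_gpu = trajectories_per_step // n_gpus

print(trajectories_per_step, transitions_per_step, trajectories_per_gpu)
```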
diff --git a/examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh b/examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh
new file mode 100644
index 00000000000..f7923046700
--- /dev/null
+++ b/examples/flowgrpo_trainer/run_flowgrpo_async_reward.sh
@@ -0,0 +1,78 @@
+# Qwen-Image LoRA, vllm_omni rollout, reward model on a dedicated GPU resource pool
+set -x
+
+ocr_train_path=$HOME/data/ocr/train.parquet
+ocr_test_path=$HOME/data/ocr/test.parquet
+
+ENGINE=vllm_omni
+REWARD_ENGINE=vllm
+
+reward_path=tests/experimental/reward_loop/reward_fn.py
+reward_model_name=$HOME/models/Qwen/Qwen3-VL-8B-Instruct
+
+
+python3 -m verl.trainer.main_ppo --config-path=config \
+    --config-name='ppo_diffusion_trainer.yaml' \
+    algorithm.adv_estimator=flow_grpo \
+    data.train_files=$ocr_train_path \
+    data.val_files=$ocr_test_path \
+    data.train_batch_size=32 \
+    data.max_prompt_length=1058 \
+    data.filter_overlong_prompts=True \
+    +data.apply_chat_template_kwargs.max_length=1058 \
+    +data.apply_chat_template_kwargs.padding=True \
+    +data.apply_chat_template_kwargs.truncation=True \
+    actor_rollout_ref.model.path=$HOME/models/Qwen/Qwen-Image \
+    actor_rollout_ref.model.tokenizer_path=$HOME/models/Qwen/Qwen-Image/tokenizer \
+    actor_rollout_ref.model.lora_rank=64 \
+    actor_rollout_ref.model.lora_alpha=128 \
+    actor_rollout_ref.model.target_modules="['to_q','to_k','to_v','to_out.0','add_q_proj','add_k_proj','add_v_proj','to_add_out','img_mlp.net.0.proj','img_mlp.net.2','txt_mlp.net.0.proj','txt_mlp.net.2']" \
+    actor_rollout_ref.actor.optim.lr=3e-4 \
+    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
+    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=$ENGINE \
+    actor_rollout_ref.rollout.n=16 \
+    actor_rollout_ref.rollout.guidance_scale=1.0 \
+    actor_rollout_ref.rollout.agent.default_agent_loop=diffusion_single_turn_agent \
+    actor_rollout_ref.rollout.agent.num_workers=4 \
+    actor_rollout_ref.rollout.load_format=safetensors \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.max_model_len=1058 \
+    actor_rollout_ref.rollout.noise_level=1.2 \
+    actor_rollout_ref.rollout.sde_window_size=2 \
+    actor_rollout_ref.rollout.sde_window_range="[0,5]" \
+    actor_rollout_ref.rollout.val_kwargs.num_inference_steps=50 \
+    +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.custom_pipeline=verl.utils.vllm_omni.pipelines.QwenImagePipelineWithLogProb \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
+    reward.num_workers=4 \
+    reward.reward_manager.name=image \
+    reward.reward_model.enable=True \
+    reward.reward_model.model_path=$reward_model_name \
+    reward.reward_model.rollout.name=$REWARD_ENGINE \
+    reward.reward_model.enable_resource_pool=True \
+    reward.reward_model.nnodes=1 \
+    reward.reward_model.n_gpus_per_node=1 \
+    reward.reward_model.rollout.gpu_memory_utilization=0.9 \
+    reward.reward_model.rollout.free_cache_engine=False \
+    reward.reward_model.rollout.tensor_model_parallel_size=1 \
+    reward.reward_model.rollout.enforce_eager=False \
+    reward.custom_reward_function.path=$reward_path \
+    reward.custom_reward_function.name=compute_score_ocr \
+    trainer.use_legacy_worker_impl=disable \
+    trainer.logger='["console", "wandb"]' \
+    trainer.project_name=flow_grpo \
+    trainer.experiment_name=qwen_image_ocr \
+    trainer.log_val_generations=8 \
+    trainer.val_before_train=False \
+    trainer.n_gpus_per_node=4 \
+    trainer.nnodes=1 \
+    trainer.save_freq=30 \
+    trainer.test_freq=30 \
+    trainer.total_epochs=15 "$@"
diff --git a/examples/flowgrpo_trainer/run_flowgrpo_fast.sh b/examples/flowgrpo_trainer/run_flowgrpo_fast.sh
new file mode 100644
index 00000000000..57074241954
--- /dev/null
+++ b/examples/flowgrpo_trainer/run_flowgrpo_fast.sh
@@ -0,0 +1,72 @@
+# Qwen-Image LoRA (Flow-GRPO-Fast), vllm_omni rollout, no CFG (guidance_scale=1.0)
+set -x
+
+ocr_train_path=$HOME/data/ocr/train.parquet
+ocr_test_path=$HOME/data/ocr/test.parquet
+
+ENGINE=vllm_omni
+REWARD_ENGINE=vllm
+
+reward_path=tests/experimental/reward_loop/reward_fn.py
+reward_model_name=$HOME/models/Qwen/Qwen3-VL-8B-Instruct
+
+
+python3 -m verl.trainer.main_ppo --config-path=config \
+    --config-name='ppo_diffusion_trainer.yaml' \
+    algorithm.adv_estimator=flow_grpo \
+    data.train_files=$ocr_train_path \
+    data.val_files=$ocr_test_path \
+    data.train_batch_size=32 \
+    data.max_prompt_length=1058 \
+    data.filter_overlong_prompts=True \
+    +data.apply_chat_template_kwargs.max_length=1058 \
+    +data.apply_chat_template_kwargs.padding=True \
+    +data.apply_chat_template_kwargs.truncation=True \
+    actor_rollout_ref.model.path=$HOME/models/Qwen/Qwen-Image \
+    actor_rollout_ref.model.tokenizer_path=$HOME/models/Qwen/Qwen-Image/tokenizer \
+    actor_rollout_ref.model.lora_rank=64 \
+    actor_rollout_ref.model.lora_alpha=128 \
+    actor_rollout_ref.model.target_modules="['to_q','to_k','to_v','to_out.0','add_q_proj','add_k_proj','add_v_proj','to_add_out','img_mlp.net.0.proj','img_mlp.net.2','txt_mlp.net.0.proj','txt_mlp.net.2']" \
+    actor_rollout_ref.actor.optim.lr=3e-4 \
+    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
+    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=$ENGINE \
+    actor_rollout_ref.rollout.n=16 \
+    actor_rollout_ref.rollout.guidance_scale=1.0 \
+    actor_rollout_ref.rollout.agent.default_agent_loop=diffusion_single_turn_agent \
+    actor_rollout_ref.rollout.agent.num_workers=4 \
+    actor_rollout_ref.rollout.load_format=safetensors \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.max_model_len=1058 \
+    actor_rollout_ref.rollout.noise_level=1.2 \
+    actor_rollout_ref.rollout.sde_window_size=2 \
+    actor_rollout_ref.rollout.sde_window_range="[0,5]" \
+    actor_rollout_ref.rollout.val_kwargs.num_inference_steps=50 \
+    +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.custom_pipeline=verl.utils.vllm_omni.pipelines.QwenImagePipelineWithLogProb \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
+    reward.num_workers=4 \
+    reward.reward_manager.name=image \
+    reward.reward_model.enable=True \
+    reward.reward_model.model_path=$reward_model_name \
+    reward.reward_model.rollout.name=$REWARD_ENGINE \
+    reward.reward_model.rollout.tensor_model_parallel_size=4 \
+    reward.custom_reward_function.path=$reward_path \
+    reward.custom_reward_function.name=compute_score_ocr \
+    trainer.use_legacy_worker_impl=disable \
+    trainer.logger='["console", "wandb"]' \
+    trainer.project_name=flow_grpo \
+    trainer.experiment_name=qwen_image_ocr \
+    trainer.log_val_generations=8 \
+    trainer.val_before_train=False \
+    trainer.n_gpus_per_node=4 \
+    trainer.nnodes=1 \
+    trainer.save_freq=30 \
+    trainer.test_freq=30 \
+    trainer.total_epochs=15 "$@"
diff --git a/examples/flowgrpo_trainer/run_flowgrpo_full_ft.sh b/examples/flowgrpo_trainer/run_flowgrpo_full_ft.sh
new file mode 100644
index 00000000000..691d5ed2042
--- /dev/null
+++ b/examples/flowgrpo_trainer/run_flowgrpo_full_ft.sh
@@ -0,0 +1,69 @@
+# Qwen-Image full weight finetuning, vllm_omni rollout
+set -x
+
+ocr_train_path=$HOME/data/ocr/train.parquet
+ocr_test_path=$HOME/data/ocr/test.parquet
+
+ENGINE=vllm_omni
+REWARD_ENGINE=vllm
+
+reward_path=tests/experimental/reward_loop/reward_fn.py
+reward_model_name=$HOME/models/Qwen/Qwen3-VL-8B-Instruct
+
+
+python3 -m verl.trainer.main_ppo --config-path=config \
+    --config-name='ppo_diffusion_trainer.yaml' \
+    algorithm.adv_estimator=flow_grpo \
+    data.train_files=$ocr_train_path \
+    data.val_files=$ocr_test_path \
+    data.train_batch_size=32 \
+    data.max_prompt_length=1058 \
+    data.filter_overlong_prompts=True \
+    +data.apply_chat_template_kwargs.max_length=1058 \
+    +data.apply_chat_template_kwargs.padding=True \
+    +data.apply_chat_template_kwargs.truncation=True \
+    actor_rollout_ref.model.path=$HOME/models/Qwen/Qwen-Image \
+    actor_rollout_ref.model.tokenizer_path=$HOME/models/Qwen/Qwen-Image/tokenizer \
+    actor_rollout_ref.actor.optim.lr=3e-5 \
+    actor_rollout_ref.actor.optim.weight_decay=0.0001 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.actor.fsdp_config.param_offload=True \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
+    actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
+    actor_rollout_ref.actor.policy_loss.loss_mode=flow_grpo \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=$ENGINE \
+    actor_rollout_ref.rollout.n=16 \
+    actor_rollout_ref.rollout.guidance_scale=1.0 \
+    actor_rollout_ref.rollout.agent.default_agent_loop=diffusion_single_turn_agent \
+    actor_rollout_ref.rollout.agent.num_workers=4 \
+    actor_rollout_ref.rollout.load_format=safetensors \
+    actor_rollout_ref.rollout.layered_summon=True \
+    actor_rollout_ref.rollout.max_model_len=1058 \
+    actor_rollout_ref.rollout.noise_level=1.2 \
+    actor_rollout_ref.rollout.sde_window_size=2 \
+    actor_rollout_ref.rollout.sde_window_range="[0,5]" \
+    actor_rollout_ref.rollout.val_kwargs.num_inference_steps=50 \
+    +actor_rollout_ref.rollout.engine_kwargs.vllm_omni.custom_pipeline=verl.utils.vllm_omni.pipelines.QwenImagePipelineWithLogProb \
+    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
+    reward.num_workers=4 \
+    reward.reward_manager.name=image \
+    reward.reward_model.enable=True \
+    reward.reward_model.model_path=$reward_model_name \
+    reward.reward_model.rollout.name=$REWARD_ENGINE \
+    reward.reward_model.rollout.tensor_model_parallel_size=4 \
+    reward.custom_reward_function.path=$reward_path \
+    reward.custom_reward_function.name=compute_score_ocr \
+    trainer.use_legacy_worker_impl=disable \
+    trainer.logger='["console", "wandb"]' \
+    trainer.project_name=flow_grpo \
+    trainer.experiment_name=qwen_image_ocr \
+    trainer.log_val_generations=8 \
+    trainer.val_before_train=False \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=30 \
+    trainer.test_freq=30 \
+    trainer.total_epochs=15 "$@"
diff --git a/scripts/generate_trainer_config.sh b/scripts/generate_trainer_config.sh
index c4c89cdbdba..bfd3ba12ef3 100755
--- a/scripts/generate_trainer_config.sh
+++ b/scripts/generate_trainer_config.sh
@@ -6,6 +6,7 @@ set -euox pipefail
 CONFIG_SPECS=(
     "ppo_trainer:_generated_ppo_trainer.yaml:"
     "ppo_megatron_trainer:_generated_ppo_megatron_trainer.yaml:--config-name=ppo_megatron_trainer.yaml"
+    "ppo_diffusion_trainer:_generated_ppo_diffusion_trainer.yaml:--config-name=ppo_diffusion_trainer.yaml"
     "ppo_trainer:_generated_ppo_veomni_trainer.yaml:model_engine=veomni"
     "ppo_trainer:_generated_ppo_torchtitan_trainer.yaml:model_engine=torchtitan"
 )
diff --git a/tests/experimental/agent_loop/test_diffusion_agent_loop.py b/tests/experimental/agent_loop/test_diffusion_agent_loop.py
new file mode 100644
index 00000000000..6f615b2db29
--- /dev/null
+++ b/tests/experimental/agent_loop/test_diffusion_agent_loop.py
@@ -0,0 +1,135 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+
+import numpy as np
+import pytest
+import ray
+from omegaconf import DictConfig
+
+from verl.experimental.agent_loop.agent_loop import AgentLoopManager
+from verl.protocol import DataProto
+
+
+@pytest.fixture
+def init_config() -> DictConfig:
+    from hydra import compose, initialize_config_dir
+
+    with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
+        config = compose(config_name="ppo_diffusion_trainer")
+
+    model_path = os.path.expanduser("~/models/tiny-random/Qwen-Image")
+    config.actor_rollout_ref.model.path = model_path
+    config.actor_rollout_ref.model.tokenizer_path = os.path.join(model_path, "tokenizer")
+    config.actor_rollout_ref.rollout.name = "vllm_omni"
+    config.actor_rollout_ref.rollout.mode = "async"
+    config.actor_rollout_ref.rollout.enforce_eager = True
+    config.actor_rollout_ref.rollout.n = 4
+    config.actor_rollout_ref.rollout.num_inference_steps = 10
+    config.actor_rollout_ref.rollout.guidance_scale = 4.0
+    config.actor_rollout_ref.rollout.agent.num_workers = 2
+    config.actor_rollout_ref.rollout.agent.default_agent_loop = "diffusion_single_turn_agent"
+    config.actor_rollout_ref.rollout.noise_level = 1.0
+    config.actor_rollout_ref.rollout.sde_window_size = 2
+    config.actor_rollout_ref.rollout.sde_window_range = [0, 5]
+    config.actor_rollout_ref.rollout.calculate_log_probs = True
+    config.actor_rollout_ref.rollout.nnodes = 1
+
+    qwen_pipeline = "verl.utils.vllm_omni.pipelines.QwenImagePipelineWithLogProb"
+    config.actor_rollout_ref.rollout.engine_kwargs.vllm_omni = {"custom_pipeline": qwen_pipeline}
+    config.reward.reward_manager.name = "image"
+    config.trainer.n_gpus_per_node = 4
+
+    tokenizer_max_length = 1024
+    prompt_template_encode_start_idx = 34
+    max_length = tokenizer_max_length + prompt_template_encode_start_idx
+
+    config.data.apply_chat_template_kwargs = dict(max_length=max_length, padding=True, truncation=True)
+    config.data.max_prompt_length = max_length
+    config.actor_rollout_ref.rollout.max_model_len = max_length
+
+    # TODO (mike): test with TP later
+    config.actor_rollout_ref.rollout.tensor_model_parallel_size = 1
+    return config
+
+
+def test_single_turn(init_config):
+    ray.init(
+        runtime_env={
+            "env_vars": {
+                "TOKENIZERS_PARALLELISM": "true",
+                "NCCL_DEBUG": "WARN",
+                "VLLM_LOGGING_LEVEL": "INFO",
+            }
+        }
+    )
+
+    agent_loop_manager = AgentLoopManager.create(init_config)
+
+    system_prompt = (
+        "Describe the image by detailing the color, shape, size, texture, quantity, text, "
+        "spatial relationships of the objects and background:"
+    )
+    user_prompts = ["A photo of cute cat with long fur and big eyes.", "A photo of cute dog with short hair."]
+
+    raw_prompts = []
+    for user_prompt in user_prompts:
+        raw_prompts.append(
+            [
+                {"role": "system", "content": system_prompt},
+                {"role": "user", "content": user_prompt},
+            ]
+        )
+
+    raw_negative_prompts = []
+    for user_prompt in user_prompts:
+        raw_negative_prompts.append(
+            [
+                {"role": "system", "content": system_prompt},
+                {"role": "user", "content": " "},
+            ]
+        )
+
+    batch = DataProto(
+        non_tensor_batch={
+            "raw_prompt": np.array(raw_prompts),
+            "raw_negative_prompt": np.array(raw_negative_prompts),
+            "data_source": np.array(["jpeg_compressibility"] * len(raw_prompts)),
+            "reward_model": np.array([{"style": "rule", "ground_truth": ""}] * len(raw_prompts)),
+        },
+    )
+    n = init_config.actor_rollout_ref.rollout.n
+    batch = batch.repeat(n)
+    result = agent_loop_manager.generate_sequences(prompts=batch)
+    assert len(result) == len(raw_prompts) * n
+
+    expected_batch_keys = [
+        "responses",
+        "all_latents",
+        "all_timesteps",
+        "prompt_embeds",
+        "prompt_embeds_mask",
+        "input_ids",
+        "attention_mask",
+        "rollout_log_probs",
+    ]
+    for key in expected_batch_keys:
+        assert key in result.batch, f"Key {key} not found in result batch with keys {list(result.batch.keys())}."
+
+    # check turns
+    num_turns = result.non_tensor_batch["__num_turns__"]
+    assert np.all(num_turns == 2)
+
+    print("Test passed!")
+    ray.shutdown()
diff --git a/tests/experimental/reward_loop/assets/ocr.jpg b/tests/experimental/reward_loop/assets/ocr.jpg
new file mode 100644
index 00000000000..3d80bacfdf5
Binary files /dev/null and b/tests/experimental/reward_loop/assets/ocr.jpg differ
diff --git a/tests/experimental/reward_loop/reward_fn.py b/tests/experimental/reward_loop/reward_fn.py
index 27da6ff1884..6e24782e3d5 100644
--- a/tests/experimental/reward_loop/reward_fn.py
+++ b/tests/experimental/reward_loop/reward_fn.py
@@ -12,11 +12,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import base64
 import json
 import os
+from io import BytesIO
 
 import aiohttp
+import numpy as np
+import torch
 from openai.types.chat import ChatCompletion
+from PIL import Image
 from transformers import PreTrainedTokenizer
 
 GRM_PROMPT_TEMPLATE = """
@@ -98,3 +103,86 @@ def compute_score_math_verify(
         model_output=solution_str,
         ground_truth=ground_truth,
     )
+
+
+def _pil_image_to_base64(image: Image.Image) -> str:
+    buffered = BytesIO()
+    image.save(buffered, format="PNG")
+    encoded_image_text = base64.b64encode(buffered.getvalue()).decode("utf-8")
+    base64_image = f"data:image;base64,{encoded_image_text}"
+    return base64_image
+
+
+async def compute_score_ocr(
+    data_source: str,
+    solution_image: Image.Image | np.ndarray | torch.Tensor,
+    ground_truth: str,
+    extra_info: dict,
+    reward_router_address: str,
+    reward_model_tokenizer: PreTrainedTokenizer = None,
+    model_name: str = None,
+):
+    """Compute the OCR reward: 1 minus the edit distance to the ground truth, normalized by its length."""
+    import re
+
+    import Levenshtein
+
+    from verl.utils.ray_utils import get_event_loop
+
+    # preprocess image to base64
+    image = solution_image
+    if isinstance(image, torch.Tensor):
+        image = image.float().permute(1, 2, 0).cpu().numpy()
+    if isinstance(image, np.ndarray):
+        assert image.shape[-1] == 3, "must be in HWC format"
+        image = (image * 255).round().clip(0, 255).astype(np.uint8)
+        image = Image.fromarray(image)
+    assert isinstance(image, Image.Image)
+
+    image_base64 = await get_event_loop().run_in_executor(None, _pil_image_to_base64, image)
+
+    # prepare chat template
+    grm_prompt = "Please output only the text content from the image without any additional descriptions or formatting."
+    query = [
+        {
+            "type": "image_url",
+            "image_url": {"url": image_base64},
+        },
+        {"type": "text", "text": grm_prompt},
+    ]
+    messages = [
+        {"role": "system", "content": "You are a helpful assistant."},
+        {
+            "role": "user",
+            "content": query,
+        },
+    ]
+
+    sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
+    model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
+    chat_complete_request = {
+        "messages": messages,
+        "model": model_name,
+        **sampling_params,
+    }
+    result = await chat_complete(
+        router_address=reward_router_address,
+        chat_complete_request=chat_complete_request,
+    )
+    grm_response = result.choices[0].message.content
+
+    # compute OCR score
+    text = grm_response
+    # remove all whitespace and convert to lowercase
+    gt = re.sub(r"\s+", "", ground_truth).lower()
+    text = re.sub(r"\s+", "", text).lower()
+    if gt in text:
+        dist = 0
+    else:
+        dist = Levenshtein.distance(text, gt)
+
+    # cap the distance at len(gt) so that many unrelated recognized characters cannot push the score below 0
+    dist = min(dist, len(gt))
+    score = 1 - dist / len(gt)
+
+    return {"score": score, "acc": score == 1, "genrm_response": grm_response}
diff --git a/tests/experimental/reward_loop/test_diffusion_reward_model_genrm.py b/tests/experimental/reward_loop/test_diffusion_reward_model_genrm.py
new file mode 100644
index 00000000000..c20760d7e6d
--- /dev/null
+++ b/tests/experimental/reward_loop/test_diffusion_reward_model_genrm.py
@@ -0,0 +1,111 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+import numpy as np
+import ray
+import torch
+from hydra import compose, initialize_config_dir
+from PIL import Image
+
+from verl.experimental.reward_loop import RewardLoopManager
+from verl.protocol import DataProto
+from verl.utils import hf_tokenizer
+
+
+def create_data_samples(tokenizer) -> DataProto:
+    images = ["tests/experimental/reward_loop/assets/ocr.jpg"]
+    prompts = ['a photo of displaying "OCR"']
+    pil_images = [np.array(Image.open(img).convert("RGB").resize((512, 512))) for img in images]
+    responses = [torch.tensor(img).permute(2, 0, 1) / 255.0 for img in pil_images]
+    data_source = ["ocr"] * len(images)
+    reward_info = [{"ground_truth": "OCR"}] * len(images)
+    extra_info = [{}] * len(images)
+
+    responses = torch.stack(responses)
+    prompt_length = 1024
+    pad_token_id = tokenizer.pad_token_id
+    prompt_ids = []
+    for prompt in prompts:
+        prompt_tokens = tokenizer.encode(prompt)
+        padded_prompt = [pad_token_id] * (prompt_length - len(prompt_tokens)) + prompt_tokens
+        prompt_ids.append(torch.tensor(padded_prompt))
+    prompt_ids = torch.stack(prompt_ids)
+
+    data = DataProto.from_dict(
+        tensors={
+            "input_ids": prompt_ids,
+            "responses": responses,
+        },
+        non_tensors={
+            "data_source": data_source,
+            "reward_model": reward_info,
+            "extra_info": extra_info,
+        },
+    )
+    return data
+
+
+def test_diffusion_reward_model_manager():
+    ray.init(
+        runtime_env={
+            "env_vars": {
+                "TOKENIZERS_PARALLELISM": "true",
+                "NCCL_DEBUG": "WARN",
+                "VLLM_LOGGING_LEVEL": "INFO",
+                "VLLM_USE_V1": "1",
+            }
+        }
+    )
+    with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
+        config = compose(config_name="ppo_trainer")
+
+    rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
+    reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
+
+    config.actor_rollout_ref.model.path = rollout_model_name
+    config.actor_rollout_ref.model.tokenizer_path = os.path.join(rollout_model_name, "tokenizer")
+    config.reward.custom_reward_function.path = "tests/experimental/reward_loop/reward_fn.py"
+    config.reward.custom_reward_function.name = "compute_score_ocr"
+    config.reward.num_workers = 1
+    config.reward.reward_manager.name = "image"
+    config.reward.reward_model.enable = True
+    config.reward.reward_model.enable_resource_pool = True
+    config.reward.reward_model.n_gpus_per_node = 2
+    config.reward.reward_model.nnodes = 1
+    config.reward.reward_model.model_path = reward_model_name
+    config.reward.reward_model.rollout.name = os.getenv("ROLLOUT_NAME", "vllm")
+    config.reward.reward_model.rollout.gpu_memory_utilization = 0.9
+    config.reward.reward_model.rollout.tensor_model_parallel_size = 2
+    config.reward.reward_model.rollout.skip_tokenizer_init = False
+    config.reward.reward_model.rollout.prompt_length = 2048
+    config.reward.reward_model.rollout.response_length = 4096
+
+    # 1. init reward model manager
+    reward_loop_manager = RewardLoopManager(config)
+
+    # 2. init test data
+    rollout_tokenizer = hf_tokenizer(config.actor_rollout_ref.model.tokenizer_path)
+    data = create_data_samples(rollout_tokenizer)
+
+    # 3. generate responses
+    outputs = reward_loop_manager.compute_rm_score(data)
+
+    for idx, output in enumerate(outputs):
+        print(f"GRM Response {idx}:\n{output.non_tensor_batch['genrm_response']}\n")
+        print(f"Score:\n{output.non_tensor_batch['score']}\n")
+        print("=" * 50 + "\n")
+
+    ray.shutdown()
diff --git a/tests/models/test_diffusers_fsdp_engine.py b/tests/models/test_diffusers_fsdp_engine.py
new file mode 100644
index 00000000000..cd4fc7e50e3
--- /dev/null
+++ b/tests/models/test_diffusers_fsdp_engine.py
@@ -0,0 +1,221 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+from functools import partial
+
+import numpy as np
+import pytest
+import ray
+import torch
+
+from verl import DataProto
+from verl.single_controller.ray import RayClassWithInitArgs, RayResourcePool, RayWorkerGroup
+from verl.utils import tensordict_utils as tu
+from verl.utils.diffusers.schedulers import FlowMatchSDEDiscreteScheduler
+from verl.utils.diffusers.utils import set_timesteps
+from verl.workers.config import DiffusersModelConfig, FSDPActorConfig, TrainingWorkerConfig
+from verl.workers.engine_workers import TrainingWorker
+from verl.workers.utils.losses import ppo_loss
+from verl.workers.utils.padding import embeds_padding_2_no_padding
+
+
+def create_training_config(model_type, strategy, device_count, model):
+    if device_count == 1:
+        cp = fsdp_size = 1
+    else:
+        cp = 1  # TODO (mike): test with cp = 2
+        fsdp_size = 4
+    path = os.path.expanduser(model)
+    tokenizer_path = os.path.join(path, "tokenizer")
+    model_config = DiffusersModelConfig(
+        path=path,
+        tokenizer_path=tokenizer_path,
+        use_remove_padding=True,
+    )
+
+    if strategy in ["fsdp", "fsdp2"]:
+        from hydra import compose, initialize_config_dir
+
+        from verl.utils.config import omega_conf_to_dataclass
+
+        with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config/model")):
+            cfg = compose(
+                config_name="diffusion_model",
+                overrides=[
+                    "path=" + path,
+                    "tokenizer_path=" + tokenizer_path,
+                    "lora_rank=8",
+                    "lora_alpha=16",
+                ],
+            )
+        model_config: DiffusersModelConfig = omega_conf_to_dataclass(cfg)
+
+        with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config/actor")):
+            cfg = compose(
+                config_name="dp_actor",
+                overrides=[
+                    "strategy=" + strategy,
+                    "clip_ratio=0.0001",
+                    "clip_ratio_high=5.0",
+                    "ppo_mini_batch_size=4",
+                    "ppo_micro_batch_size_per_gpu=4",
+                    "optim.lr=1e-4",
+                    "optim.weight_decay=0.0001",
+                    "fsdp_config.param_offload=False",
+                    "fsdp_config.optimizer_offload=False",
+                    "fsdp_config.model_dtype='bfloat16'",
+                    "fsdp_config.dtype='bfloat16'",
+                    "+fsdp_config.mixed_precision.param_dtype='bfloat16'",
+                    "fsdp_config.forward_only=False",
+                    "fsdp_config.fsdp_size=" + str(fsdp_size),
+                    "fsdp_config.ulysses_sequence_parallel_size=" + str(cp),
+                    "policy_loss.loss_mode='flow_grpo'",
+                ],
+            )
+        actor_config: FSDPActorConfig = omega_conf_to_dataclass(cfg)
+
+        engine_config = actor_config.engine
+        optimizer_config = actor_config.optim
+        checkpoint_config = actor_config.checkpoint
+    else:
+        raise NotImplementedError(f"strategy {strategy} is not supported")
+
+    training_config = TrainingWorkerConfig(
+        model_type=model_type,
+        model_config=model_config,
+        engine_config=engine_config,
+        optimizer_config=optimizer_config,
+        checkpoint_config=checkpoint_config,
+    )
+    return training_config, actor_config
+
+
+def create_data_samples(num_device: int, model_config: DiffusersModelConfig) -> DataProto:
+    from tensordict import TensorDict
+
+    scheduler = FlowMatchSDEDiscreteScheduler.from_pretrained(
+        pretrained_model_name_or_path=model_config.local_path, subfolder="scheduler"
+    )
+    set_timesteps(scheduler, model_config)
+
+    batch_size = 8 * num_device
+    seq_len = 64
+    img_size = 512
+    latent_dim = 64
+    encoder_latent_dim = 32
+    inference_steps = 40
+    vocab_size = 99
+    vae_scale_factor = 8
+    height, width = img_size, img_size
+    latent_height, latent_width = height // vae_scale_factor // 2, width // vae_scale_factor // 2
+    num_diffusion_steps = 10
+    timesteps = scheduler.timesteps[None].repeat(batch_size, 1)
+
+    torch.manual_seed(1)
+    np.random.seed(1)
+
+    batch = TensorDict(
+        {
+            "input_ids": torch.randint(0, vocab_size, (batch_size, seq_len)),
+            "attention_mask": torch.ones((batch_size, inference_steps)),
+            "response_mask": torch.ones((batch_size, inference_steps)),
+            "old_log_probs": torch.randn((batch_size, num_diffusion_steps)),
+            "advantages": torch.randn((batch_size, num_diffusion_steps)),
+            "responses": torch.randn((batch_size, 3, height, width)),
+            "all_latents": torch.randn((batch_size, inference_steps, latent_height * latent_width, latent_dim)),
+            "rollout_log_probs": torch.randn((batch_size, num_diffusion_steps)),
+            "all_timesteps": timesteps,
+            "prompt_embeds": torch.randn((batch_size, seq_len, encoder_latent_dim)),
+            "prompt_embeds_mask": torch.ones((batch_size, seq_len), dtype=torch.int32),
+            "negative_prompt_embeds": torch.randn((batch_size, seq_len, encoder_latent_dim)),
+            "negative_prompt_embeds_mask": torch.ones((batch_size, seq_len), dtype=torch.int32),
+            "loss_mask": torch.ones((batch_size, inference_steps), dtype=torch.int32),
+        },
+        batch_size=batch_size,
+    )
+    data = DataProto(batch=batch)
+    data.meta_info["global_token_num"] = torch.sum(data.batch["attention_mask"], dim=-1).tolist()
+    data.meta_info["use_dynamic_bsz"] = False
+    data.meta_info["micro_batch_size_per_gpu"] = 4
+    data.meta_info["height"] = height
+    data.meta_info["width"] = width
+
+    return data
+
+
+@pytest.mark.parametrize("strategy", ["fsdp", "fsdp2"])
+def test_diffusers_fsdp_engine(strategy):
+    # Create configs
+    ray.init()
+    device_count = torch.cuda.device_count()
+    training_config, actor_config = create_training_config(
+        model_type="diffusion_model",
+        strategy=strategy,
+        device_count=device_count,
+        model="~/models/tiny-random/Qwen-Image",
+    )
+    # init model
+    ray_cls_with_init = RayClassWithInitArgs(cls=ray.remote(TrainingWorker), config=training_config)
+    resource_pool = RayResourcePool(process_on_nodes=[device_count])
+    wg = RayWorkerGroup(resource_pool=resource_pool, ray_cls_with_init=ray_cls_with_init)  # TrainingWorker
+    wg.reset()
+
+    # forward only without loss function
+    data_td = create_data_samples(device_count, training_config.model_config).to_tensordict()
+    data_td = embeds_padding_2_no_padding(data_td)
+    tu.assign_non_tensor(
+        data_td,
+        compute_loss=False,
+        image_height=training_config.model_config.get("image_height", 512),
+        image_width=training_config.model_config.get("image_width", 512),
+        vae_scale_factor=training_config.model_config.get("vae_scale_factor", 8),
+    )
+    output = wg.infer_batch(data_td)
+    output_dict = output.get()
+
+    print("Output:", output_dict)
+    for key in ["log_probs", "metrics"]:
+        assert key in output_dict
+
+    # forward and backward with loss function
+    # set loss function
+    loss_fn = partial(ppo_loss, config=actor_config)
+    wg.set_loss_fn(loss_fn)
+
+    # train batch
+    data_td = create_data_samples(device_count, training_config.model_config).to_tensordict()
+    data_td = embeds_padding_2_no_padding(data_td)
+    ppo_mini_batch_size = 4
+    ppo_epochs = actor_config.ppo_epochs
+    seed = 42
+    shuffle = actor_config.shuffle
+    tu.assign_non_tensor(
+        data_td,
+        global_batch_size=ppo_mini_batch_size * device_count,
+        mini_batch_size=ppo_mini_batch_size * device_count,
+        epochs=ppo_epochs,
+        seed=seed,
+        dataloader_kwargs={"shuffle": shuffle},
+        image_height=training_config.model_config.get("image_height", 512),
+        image_width=training_config.model_config.get("image_width", 512),
+        vae_scale_factor=training_config.model_config.get("vae_scale_factor", 8),
+    )
+    output = wg.train_mini_batch(data_td)
+    output_dict = output.get()
+
+    print("Output:", output_dict)
+    assert "metrics" in output_dict.keys()
+
+    ray.shutdown()
diff --git a/tests/special_sanity/check_pr_title.py b/tests/special_sanity/check_pr_title.py
index 1153d9d77af..26fb412cba7 100644
--- a/tests/special_sanity/check_pr_title.py
+++ b/tests/special_sanity/check_pr_title.py
@@ -19,7 +19,7 @@
 pr_title = os.environ.get("PR_TITLE", "").strip()
 
 # Define rules
-allowed_modules = ["fsdp", "megatron", "veomni", "sglang", "vllm", "trtllm", "rollout", "trainer"]
+allowed_modules = ["fsdp", "megatron", "veomni", "sglang", "vllm", "vllm_omni", "trtllm", "rollout", "trainer"]
 allowed_modules += ["tests", "training_utils", "recipe", "hardware", "deployment"]
 allowed_modules += ["ray", "worker", "single_controller", "misc", "docker", "ci"]
 allowed_modules += ["perf", "model", "algo", "env", "tool", "ckpt", "doc", "data", "cfg", "reward"]
diff --git a/tests/trainer/ppo/test_flow_grpo_core_algos.py b/tests/trainer/ppo/test_flow_grpo_core_algos.py
new file mode 100644
index 00000000000..8ebdd8ae315
--- /dev/null
+++ b/tests/trainer/ppo/test_flow_grpo_core_algos.py
@@ -0,0 +1,95 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import unittest
+import uuid
+
+import numpy as np
+import pytest
+import torch
+
+from verl.trainer.ppo.core_algos import (
+    compute_flow_grpo_outcome_advantage,
+    compute_policy_loss_flow_grpo,
+)
+from verl.utils.config import omega_conf_to_dataclass
+
+
+@pytest.mark.parametrize("norm_adv_by_std_in_grpo", [True, False])
+@pytest.mark.parametrize("global_std", [True, False])
+def test_flow_grpo_advantage_return(norm_adv_by_std_in_grpo: bool, global_std: bool) -> None:
+    """Test flow-GRPO advantage and return computation."""
+
+    # prepare input
+    batch_size = 8
+    steps = 10
+    token_level_rewards = torch.randn((batch_size, 1), dtype=torch.float32)
+    response_mask = torch.ones((batch_size, steps), dtype=torch.int32)
+    uid = np.array([uuid.uuid4().hex for _ in range(batch_size)])
+
+    advantages, returns = compute_flow_grpo_outcome_advantage(
+        token_level_rewards=token_level_rewards,
+        response_mask=response_mask,
+        index=uid,
+        norm_adv_by_std_in_grpo=norm_adv_by_std_in_grpo,
+        global_std=global_std,
+    )
+
+    assert advantages.shape == returns.shape == (batch_size, steps)
+
+
+def test_compute_policy_loss_flow_grpo() -> None:
+    """Test flow-GRPO policy loss computation."""
+
+    # prepare input
+    batch_size = 8
+    steps = 10
+    rollout_log_probs = torch.randn((batch_size, steps), dtype=torch.float32)
+    current_log_probs = torch.randn((batch_size, steps), dtype=torch.float32)
+    advantages = torch.randn((batch_size, steps), dtype=torch.float32)
+    response_mask = torch.ones((batch_size, steps), dtype=torch.int32)
+    from hydra import compose, initialize_config_dir
+
+    from verl.workers.config.actor import FSDPActorConfig
+
+    with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config/actor")):
+        cfg = compose(
+            config_name="dp_actor",
+            overrides=[
+                "strategy=fsdp",
+                "clip_ratio=0.0001",
+                "clip_ratio_high=5.0",
+                "ppo_micro_batch_size_per_gpu=8",
+            ],
+        )
+    actor_config: FSDPActorConfig = omega_conf_to_dataclass(cfg)
+
+    for step in range(steps):
+        pg_loss, pg_metrics = compute_policy_loss_flow_grpo(
+            old_log_prob=rollout_log_probs[:, step],
+            log_prob=current_log_probs[:, step],
+            advantages=advantages[:, step],
+            response_mask=response_mask[:, step],
+            loss_agg_mode="token-mean",
+            config=actor_config,
+        )
+
+        assert pg_loss.shape == ()
+        assert isinstance(pg_loss.item(), float)
+        assert "actor/ppo_kl" in pg_metrics
+
+
+if __name__ == "__main__":
+    unittest.main()
diff --git a/verl/experimental/agent_loop/__init__.py b/verl/experimental/agent_loop/__init__.py
index d43683df3e4..e819dd134a5 100644
--- a/verl/experimental/agent_loop/__init__.py
+++ b/verl/experimental/agent_loop/__init__.py
@@ -12,10 +12,16 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from .agent_loop import AgentLoopBase, AgentLoopManager, AgentLoopWorker, AsyncLLMServerManager
+from .agent_loop import (
+    AgentLoopBase,
+    AgentLoopManager,
+    AgentLoopWorker,
+    AsyncLLMServerManager,
+    DiffusionAgentLoopWorker,
+)
 from .single_turn_agent_loop import SingleTurnAgentLoop
 from .tool_agent_loop import ToolAgentLoop
 
 _ = [SingleTurnAgentLoop, ToolAgentLoop]
 
-__all__ = ["AgentLoopBase", "AgentLoopManager", "AsyncLLMServerManager", "AgentLoopWorker"]
+__all__ = ["AgentLoopBase", "AgentLoopManager", "AsyncLLMServerManager", "AgentLoopWorker", "DiffusionAgentLoopWorker"]
diff --git a/verl/experimental/agent_loop/agent_loop.py b/verl/experimental/agent_loop/agent_loop.py
index 5383ae4a2a5..c17ca93398c 100644
--- a/verl/experimental/agent_loop/agent_loop.py
+++ b/verl/experimental/agent_loop/agent_loop.py
@@ -23,6 +23,7 @@
 import numpy as np
 import ray
 import torch
+import torch.nn.functional as F
 from cachetools import LRUCache
 from omegaconf import DictConfig, OmegaConf
 from PIL import Image
@@ -45,8 +46,8 @@
     rollout_trace_op,
 )
 from verl.utils.tokenizer import normalize_token_ids
-from verl.workers.config import HFModelConfig, RolloutConfig
-from verl.workers.rollout.replica import TokenOutput, get_rollout_replica_class
+from verl.workers.config import DiffusersModelConfig, HFModelConfig, RolloutConfig
+from verl.workers.rollout.replica import ImageOutput, TokenOutput, get_rollout_replica_class
 
 logger = logging.getLogger(__file__)
 logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
@@ -141,7 +142,8 @@ async def generate(
         sampling_params: dict[str, Any],
         image_data: Optional[list[Any]] = None,
         video_data: Optional[list[Any]] = None,
-    ) -> TokenOutput:
+        **kwargs: Any,
+    ) -> TokenOutput | ImageOutput:
         """Generate tokens from prompt ids.
 
         Args:
@@ -150,7 +152,7 @@ async def generate(
             sampling_params (Dict[str, Any]): Sampling parameters for the chat completion.
 
         Returns:
-            TokenOutput: token output
+            TokenOutput | ImageOutput: token or image output
         """
         server_id, server = await self._acquire_server(request_id)
         try:
@@ -160,6 +162,7 @@ async def generate(
                 sampling_params=sampling_params,
                 image_data=image_data,
                 video_data=video_data,
+                **kwargs,
             )
             return output
         finally:
@@ -226,6 +229,48 @@ class _InternalAgentLoopOutput(AgentLoopOutput):
     """Extra fields for dynamic addition."""
 
 
+class DiffusionAgentLoopOutput(BaseModel):
+    """Diffusion agent loop output."""
+
+    prompt_ids: list[int]
+    """Prompt token ids."""
+    response_image: list[list[list[float]]]
+    """Response image (CHW format)."""
+    response_logprobs: Optional[list[float]] = None
+    """Log probabilities for the response tokens."""
+    multi_modal_data: Optional[dict[str, Any]] = None
+    """Multi-modal data for multi-modal tools."""
+    reward_score: Optional[float] = None
+    """Reward score for the trajectory."""
+    num_turns: int = 0
+    """Number of chat turns, including user, assistant, tool."""
+    metrics: AgentLoopMetrics
+    """Auxiliary performance metrics"""
+    extra_fields: dict[str, Any] = {}
+    """Extra fields for dynamic addition."""
+
+
+class _InternalDiffusionAgentLoopOutput(DiffusionAgentLoopOutput):
+    """Internal agent loop output with padded sequences."""
+
+    model_config = ConfigDict(arbitrary_types_allowed=True)
+
+    prompt_ids: torch.Tensor
+    """Padded prompt token ids."""
+    response_image: torch.Tensor
+    """Response image (NCHW format)."""
+    input_ids: torch.Tensor
+    """Padded input ids (same as prompt_ids)."""
+    attention_mask: torch.Tensor
+    """Padded attention mask."""
+    response_logprobs: Optional[torch.Tensor] = None
+    """Log probabilities for the response tokens."""
+    multi_modal_inputs: Optional[dict[str, torch.Tensor]] = None
+    """Multi-modal inputs for processors (e.g., pixel_values, image_grid_thw)."""
+    extra_fields: dict[str, Any] = {}
+    """Extra fields for dynamic addition."""
+
+
 class DictConfigWrap:
     """Wrapper for DictConfig to avoid hydra.utils.instantiate recursive resolve."""
 
@@ -866,6 +911,371 @@ def _postprocess(
         )
 
 
+class DiffusionAgentLoopWorker:
+    """Diffusion agent loop worker that takes a batch of messages and runs each message in an agent loop.
+
+    Args:
+        config (DictConfig): whole config for main entrypoint.
+        servers (list[tuple[str, ray.actor.ActorHandle]]): (address, handle) pairs for each LLM server.
+        reward_loop_worker_handles (List[ray.actor.ActorHandle]): Actor handles for streaming reward computation.
+    """
+
+    def __init__(
+        self,
+        config: DictConfig,
+        servers: list[tuple[str, ray.actor.ActorHandle]],
+        load_balancer_handle: ray.actor.ActorHandle,
+        reward_loop_worker_handles: Optional[list[ray.actor.ActorHandle]] = None,
+    ):
+        """Initialize agent loop worker.
+        Args:
+            config (DictConfig): YAML config.
+            servers (list[tuple[str, ray.actor.ActorHandle]]): (address, handle) pairs for each LLM server.
+            load_balancer_handle (ray.actor.ActorHandle): shared global load balancer actor.
+            reward_loop_worker_handles (list[ray.actor.ActorHandle]): Actor handles for streaming reward computation.
+        """
+        self.config = config
+        rollout_config, model_config = _get_rollout_and_model_config(config)
+        self.rollout_config: RolloutConfig = omega_conf_to_dataclass(rollout_config)
+        self.model_config: DiffusersModelConfig = omega_conf_to_dataclass(model_config)
+
+        # for recipe to change
+        if not hasattr(self, "server_manager"):
+            self.server_manager = AsyncLLMServerManager(
+                config,
+                servers,
+                load_balancer_handle=load_balancer_handle,
+            )
+
+        self.dataset_cls = get_dataset_class(config.data)
+        self.reward_loop_worker_handles = reward_loop_worker_handles
+
+        self.tokenizer = self.model_config.tokenizer
+        self.processor = self.model_config.processor
+
+        agent_loop_config_path = self.rollout_config.agent.agent_loop_config_path
+        if agent_loop_config_path:
+            resolved_path = resolve_config_path(agent_loop_config_path)
+            agent_loop_configs = OmegaConf.load(resolved_path)
+            for agent_loop_config in agent_loop_configs:
+                _agent_loop_registry[agent_loop_config.name] = agent_loop_config
+        if self.model_config.get("custom_chat_template", None) is not None:
+            if self.model_config.processor is not None:
+                self.model_config.processor.chat_template = self.model_config.custom_chat_template
+            self.model_config.tokenizer.chat_template = self.model_config.custom_chat_template
+
+        trace_config = self.rollout_config.trace
+        RolloutTraceConfig.init(
+            trace_config.project_name,
+            trace_config.experiment_name,
+            trace_config.get("backend"),
+            trace_config.get("token2text", False),
+            trace_config.get("max_samples_per_step_per_worker", None),
+        )
+
+    async def generate_sequences(self, batch: DataProto) -> DataProto:
+        """Generate sequences from agent loop.
+
+        Args:
+            batch (DataProto): Input batch.
+
+        Returns:
+            DataProto: Output batch.
+            - prompts: [bsz, prompt_length], prompt token ids from dataset.
+            - responses: [bsz, channel, height, width], output images from diffusion generation.
+            ...
+        """
+        config = self.rollout_config
+
+        # TODO (mike): it is for Qwen-Image only, need to generalize later
+        sampling_params = dict(
+            logprobs=config.calculate_log_probs,
+            height=config.image_height,
+            width=config.image_width,
+            true_cfg_scale=config.guidance_scale,
+            max_sequence_length=config.max_model_len,
+            sde_type=config.sde_type,
+            sde_window_size=config.sde_window_size,
+            sde_window_range=config.sde_window_range,
+        )
+
+        # override sampling params for validation
+        if batch.meta_info.get("validate", False):
+            sampling_params["num_inference_steps"] = config.val_kwargs.num_inference_steps
+            sampling_params["seed"] = config.val_kwargs.seed
+            sampling_params["noise_level"] = config.val_kwargs.noise_level
+        else:
+            sampling_params["num_inference_steps"] = config.num_inference_steps
+            sampling_params["noise_level"] = config.noise_level
+
+        # by default, we assume it's a single turn agent
+        if "agent_name" not in batch.non_tensor_batch:
+            default_agent_loop = config.agent.default_agent_loop
+            batch.non_tensor_batch["agent_name"] = np.array([default_agent_loop] * len(batch), dtype=object)
+
+        if "index" in batch.non_tensor_batch:
+            index = batch.non_tensor_batch["index"]
+        else:
+            index = np.arange(len(batch))
+
+        max_samples_per_worker = RolloutTraceConfig.get_instance().max_samples_per_step_per_worker
+
+        # For n rollouts per sample, we trace all n rollouts for selected samples
+        # Note: This sampling happens per-worker, so total traces = max_samples_per_worker * num_workers * n
+        if max_samples_per_worker is not None:
+            unique_sample_indices = np.unique(index)
+            if max_samples_per_worker < len(unique_sample_indices):
+                selected_samples = set(
+                    np.random.choice(unique_sample_indices, max_samples_per_worker, replace=False).tolist()
+                )
+                traced_indices = set(i for i in range(len(batch)) if index[i] in selected_samples)
+            else:
+                traced_indices = set(range(len(batch)))
+        else:
+            traced_indices = set(range(len(batch)))
+
+        trajectory_info = await get_trajectory_info(
+            batch.meta_info.get("global_steps", -1), index.tolist(), batch.meta_info.get("validate", False)
+        )
+
+        tasks = []
+        for i in range(len(batch)):
+            trace_this_sample = i in traced_indices
+            kwargs = {k: v[i] for k, v in batch.non_tensor_batch.items()}
+            tasks.append(
+                asyncio.create_task(
+                    self._run_agent_loop(sampling_params, trajectory_info[i], trace=trace_this_sample, **kwargs)
+                )
+            )
+        outputs = await asyncio.gather(*tasks)
+
+        output = self._postprocess(outputs, input_non_tensor_batch=batch.non_tensor_batch)
+
+        return output
+
+    async def _run_agent_loop(
+        self,
+        sampling_params: dict[str, Any],
+        trajectory: dict[str, Any],
+        *,
+        agent_name: str,
+        trace: bool = True,
+        **kwargs,
+    ) -> _InternalDiffusionAgentLoopOutput:
+        with rollout_trace_attr(
+            step=trajectory["step"],
+            sample_index=trajectory["sample_index"],
+            rollout_n=trajectory["rollout_n"],
+            validate=trajectory["validate"],
+            name="agent_loop",
+            trace=trace,
+        ):
+            assert agent_name in _agent_loop_registry, (
+                f"Agent loop {agent_name} not registered, registered agent loops: {_agent_loop_registry.keys()}"
+            )
+
+            agent_loop_config = _agent_loop_registry[agent_name]
+            agent_loop = hydra.utils.instantiate(
+                config=agent_loop_config,
+                trainer_config=DictConfigWrap(config=self.config),
+                server_manager=self.server_manager,
+                tokenizer=self.tokenizer,
+                processor=self.processor,
+                dataset_cls=self.dataset_cls,
+                data_config=DictConfigWrap(self.config.data),
+            )
+            output: DiffusionAgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
+            return await self._agent_loop_postprocess(output, **kwargs)
+
+    async def _agent_loop_postprocess(self, output, **kwargs) -> _InternalDiffusionAgentLoopOutput:
+        """Perform post-processing operations on the output of each individual agent loop."""
+        # Handle extra tensor outputs from vllm-omni (e.g., prompt embeddings).
+        extra_fields = {}
+        for k, v in output.extra_fields.items():
+            if isinstance(v, torch.Tensor):
+                # handle prompt embedding padding
+                if k in ["prompt_embeds", "negative_prompt_embeds"]:
+                    pad_tuple = (0, 0, 0, self.config.actor_rollout_ref.rollout.prompt_length - v.shape[0])
+                    v = F.pad(v, pad_tuple, value=0)
+                elif k in ["prompt_embeds_mask", "negative_prompt_embeds_mask"]:
+                    pad_tuple = (0, self.config.actor_rollout_ref.rollout.prompt_length - v.shape[0])
+                    v = F.pad(v, pad_tuple, value=0)
+                extra_fields[k] = v.unsqueeze(0)
+            else:
+                extra_fields[k] = v
+
+        extra_fields["raw_prompt"] = kwargs["raw_prompt"]
+
+        # TODO(wuxibin): remove padding and use tensordict.
+        self.tokenizer.padding_side = "left"
+        prompt_output = self.tokenizer.pad(
+            {"input_ids": output.prompt_ids},
+            padding="max_length",
+            max_length=self.rollout_config.prompt_length,
+            return_tensors="pt",
+            return_attention_mask=True,
+        )
+        if prompt_output["input_ids"].dim() == 1:
+            prompt_output["input_ids"] = prompt_output["input_ids"].unsqueeze(0)
+            prompt_output["attention_mask"] = prompt_output["attention_mask"].unsqueeze(0)
+
+        self.tokenizer.padding_side = "right"
+
+        response_image = torch.tensor(output.response_image)
+        if response_image.dim() == 3:
+            response_image = response_image.unsqueeze(0)
+
+        response_logprobs = None
+        if output.response_logprobs is not None:
+            response_logprobs = torch.tensor(output.response_logprobs).unsqueeze(0)
+
+        attention_mask = prompt_output["attention_mask"]
+        input_ids = prompt_output["input_ids"]
+
+        multi_modal_inputs = self._compute_multi_modal_inputs(output, input_ids)
+        await self._compute_score(
+            output,
+            prompts=input_ids,
+            responses=response_image,
+            attention_mask=attention_mask,
+            input_ids=input_ids,
+            kwargs=kwargs,
+        )
+
+        if "reward_extra_info" in output.extra_fields:
+            extra_fields["reward_extra_info"] = output.extra_fields["reward_extra_info"]
+
+        return _InternalDiffusionAgentLoopOutput(
+            prompt_ids=input_ids,
+            response_image=response_image,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            response_logprobs=response_logprobs,
+            multi_modal_inputs=multi_modal_inputs,
+            multi_modal_data=output.multi_modal_data,
+            reward_score=output.reward_score,
+            num_turns=output.num_turns,
+            metrics=output.metrics,
+            extra_fields=extra_fields,
+        )
+
+    def _compute_multi_modal_inputs(self, output, input_ids) -> dict[str, torch.Tensor]:
+        """Compute multi-modal inputs with image and video."""
+        multi_modal_inputs = {}
+        if self.processor is None:
+            return multi_modal_inputs
+
+        raise NotImplementedError("Multi-modal input processing not implemented yet.")
+
+    async def _compute_score(self, output, prompts, responses, attention_mask, input_ids, kwargs):
+        """Compute reward score for single sample."""
+        enable_async_reward = self.reward_loop_worker_handles is not None
+
+        if output.reward_score is None and enable_async_reward:
+            batch = TensorDict(
+                {
+                    "prompts": prompts,  # [1, prompt_length]
+                    "responses": responses,  # [1, channel, height, width]
+                    "attention_mask": attention_mask,  # [1, prompt_length]
+                    "input_ids": input_ids,  # [1, prompt_length]
+                },
+                batch_size=1,
+            )
+            non_tensor_batch = {
+                **{k: np.array([v]) for k, v in kwargs.items()},
+                "__num_turns__": np.array([output.num_turns]),
+                "tool_extra_fields": np.array([output.extra_fields], dtype=object),
+            }
+
+            data = DataProto(
+                batch=batch,
+                non_tensor_batch=non_tensor_batch,
+            )
+            selected_reward_loop_worker_handle = random.choice(self.reward_loop_worker_handles)
+            result = await selected_reward_loop_worker_handle.compute_score.remote(data)
+            output.reward_score = result["reward_score"]
+            output.extra_fields["reward_extra_info"] = result["reward_extra_info"]
+
+    def _postprocess(
+        self,
+        inputs: list[_InternalDiffusionAgentLoopOutput],
+        input_non_tensor_batch: dict | None = None,
+    ) -> DataProto:
+        """Process the padded outputs from _run_agent_loop and combine them into a batch."""
+        # Concatenate the per-sample tensors along the batch dimension.
+        prompt_ids = torch.cat([input.prompt_ids for input in inputs], dim=0)
+        response_image = torch.cat([input.response_image for input in inputs], dim=0)
+        attention_mask = torch.cat([input.attention_mask for input in inputs], dim=0)
+        input_ids = torch.cat([input.input_ids for input in inputs], dim=0)
+        optional_outputs = {}
+        if inputs[0].response_logprobs is not None:
+            optional_outputs["rollout_log_probs"] = torch.cat([input.response_logprobs for input in inputs], dim=0)
+
+        # Handle extra fields that are tensors
+        extra_keys = [k for k, v in inputs[0].extra_fields.items() if isinstance(v, torch.Tensor)]
+        for key in extra_keys:
+            optional_outputs[key] = torch.cat([input.extra_fields[key] for input in inputs], dim=0)
+            for input in inputs:
+                del input.extra_fields[key]
+
+        batch = TensorDict(
+            {
+                "prompts": prompt_ids,  # [bsz, prompt_length]
+                "responses": response_image,  # [bsz, channel, height, width]
+                "input_ids": input_ids,  # [bsz, prompt_length]
+                "attention_mask": attention_mask,  # [bsz, prompt_length]
+                **optional_outputs,
+            },
+            batch_size=len(inputs),
+        )
+
+        scores = [input.reward_score for input in inputs]
+        if all(score is not None for score in scores):
+            rm_scores = torch.tensor(scores, dtype=torch.float32).unsqueeze(-1)
+            batch["rm_scores"] = rm_scores
+
+        non_tensor_batch = {
+            "__num_turns__": np.array([input.num_turns for input in inputs], dtype=np.int32),
+        }
+        if self.reward_loop_worker_handles is None and input_non_tensor_batch:
+            non_tensor_batch.update(input_non_tensor_batch)
+
+        # add reward_extra_info to non_tensor_batch
+        reward_extra_infos = [input.extra_fields.get("reward_extra_info", {}) for input in inputs]
+        reward_extra_keys = list(reward_extra_infos[0].keys())
+        for key in reward_extra_keys:
+            non_tensor_batch[key] = np.array([info[key] for info in reward_extra_infos])
+
+        # Add multi_modal_inputs to non_tensor_batch if any samples have them
+        multi_modal_inputs_list = [input.multi_modal_inputs for input in inputs]
+        if any(mmi is not None for mmi in multi_modal_inputs_list):
+            non_tensor_batch["multi_modal_inputs"] = np.array(multi_modal_inputs_list, dtype=object)
+
+        metrics = [input.metrics.model_dump() for input in inputs]
+        # Collect extra fields from all inputs and convert them to np.ndarray
+        extra_fields = {}
+        all_keys = set(key for input_item in inputs for key in input_item.extra_fields)
+        for key in all_keys:
+            temp_arr = np.empty(len(inputs), dtype=object)
+            temp_arr[:] = [input.extra_fields.get(key) for input in inputs]
+            extra_fields[key] = temp_arr
+
+        non_tensor_batch.update(extra_fields)
+
+        # Only include reward_extra_keys in meta_info if rm_scores is in batch
+        # This avoids conflicts when reward_tensor is merged later in ray_trainer.py
+        if "rm_scores" in batch.keys():
+            meta_info = {"metrics": metrics, "reward_extra_keys": reward_extra_keys}
+        else:
+            meta_info = {"metrics": metrics}
+
+        return DataProto(
+            batch=batch,
+            non_tensor_batch=non_tensor_batch,
+            meta_info=meta_info,
+        )
+
+
 async def get_trajectory_info(step, index, validate):
     """Get trajectory info.
 
@@ -920,7 +1330,10 @@ def __init__(
         if not hasattr(self, "rollout_replica_class"):
             self.rollout_replica_class = get_rollout_replica_class(self.rollout_config.name)
         if not hasattr(self, "agent_loop_workers_class"):
-            self.agent_loop_workers_class = ray.remote(AgentLoopWorker)
+            if self.config.actor_rollout_ref.model.model_type == "diffusion_model":
+                self.agent_loop_workers_class = ray.remote(DiffusionAgentLoopWorker)
+            else:
+                self.agent_loop_workers_class = ray.remote(AgentLoopWorker)
 
     @classmethod
     @auto_await
@@ -1059,14 +1472,16 @@ def _performance_metrics(self, metrics: list[list[dict[str, str]]], output: Data
 
         # batch sequence generation is bounded by the slowest sample
         slowest = np.argmax(t_generate_sequences + t_tool_calls)
-        attention_mask = output.batch["attention_mask"][slowest]
         prompt_length = output.batch["prompts"].shape[1]
         timing["agent_loop/slowest/generate_sequences"] = t_generate_sequences[slowest]
         timing["agent_loop/slowest/tool_calls"] = t_tool_calls[slowest]
-        timing["agent_loop/slowest/prompt_length"] = attention_mask[:prompt_length].sum().item()
-        timing["agent_loop/slowest/response_length"] = attention_mask[prompt_length:].sum().item()
         timing["agent_loop/slowest/num_preempted"] = num_preempted[slowest]
 
+        if "attention_mask" in output.batch:
+            attention_mask = output.batch["attention_mask"][slowest]
+            timing["agent_loop/slowest/prompt_length"] = attention_mask[:prompt_length].sum().item()
+            timing["agent_loop/slowest/response_length"] = attention_mask[prompt_length:].sum().item()
+
         return timing
 
     @auto_await
diff --git a/verl/experimental/agent_loop/single_turn_agent_loop.py b/verl/experimental/agent_loop/single_turn_agent_loop.py
index 6ad3aa429b3..ab045e3f4ad 100644
--- a/verl/experimental/agent_loop/single_turn_agent_loop.py
+++ b/verl/experimental/agent_loop/single_turn_agent_loop.py
@@ -16,7 +16,7 @@
 from typing import Any
 from uuid import uuid4
 
-from verl.experimental.agent_loop.agent_loop import AgentLoopBase, AgentLoopOutput, register
+from verl.experimental.agent_loop.agent_loop import AgentLoopBase, AgentLoopOutput, DiffusionAgentLoopOutput, register
 from verl.utils.profiler import simple_timer
 
 logger = logging.getLogger(__file__)
@@ -80,3 +80,52 @@ async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutpu
         output.extra_fields.update({"turn_scores": [], "tool_rewards": []})
 
         return output
+
+
+@register("diffusion_single_turn_agent")
+class DiffusionSingleTurnAgentLoop(AgentLoopBase):
+    """Agent loop for diffusion model serving."""
+
+    async def run(self, sampling_params: dict[str, Any], **kwargs) -> DiffusionAgentLoopOutput:
+        raw_prompt = kwargs["raw_prompt"]
+
+        if self.config.actor_rollout_ref.rollout.guidance_scale > 0:
+            raw_negative_prompt = kwargs["raw_negative_prompt"]
+        else:
+            raw_negative_prompt = None
+
+        # 1. extract images and videos from messages
+        multi_modal_data = await self.process_vision_info(raw_prompt)
+        images = multi_modal_data.get("images")
+        videos = multi_modal_data.get("videos")
+
+        # 2. apply chat template and tokenize
+        prompt_ids = await self.apply_chat_template(raw_prompt, images=images, videos=videos)
+        negative_prompt_ids = None  # stays None when no negative prompt is used
+        if raw_negative_prompt is not None:
+            negative_prompt_ids = await self.apply_chat_template(raw_negative_prompt, images=images, videos=videos)
+
+        # 3. generate sequences
+        metrics = {}
+        with simple_timer("generate_sequences", metrics):
+            output = await self.server_manager.generate(
+                request_id=uuid4().hex,
+                prompt_ids=prompt_ids,
+                sampling_params=sampling_params,
+                image_data=images,
+                video_data=videos,
+                negative_prompt_ids=negative_prompt_ids,
+            )
+        if metrics.get("num_preempted") is None:
+            metrics["num_preempted"] = output.num_preempted if output.num_preempted is not None else -1
+
+        output = DiffusionAgentLoopOutput(
+            prompt_ids=prompt_ids,
+            response_image=output.image,
+            response_logprobs=output.log_probs,
+            multi_modal_data=multi_modal_data,
+            num_turns=2,
+            metrics=metrics,
+            extra_fields=output.extra_info,
+        )
+        return output
diff --git a/verl/experimental/reward_loop/reward_loop.py b/verl/experimental/reward_loop/reward_loop.py
index 151089ec5c6..c2ba1e7470c 100644
--- a/verl/experimental/reward_loop/reward_loop.py
+++ b/verl/experimental/reward_loop/reward_loop.py
@@ -13,14 +13,17 @@
 # limitations under the License.
 
 import asyncio
+import base64
 import logging
 import os
+from io import BytesIO
 
 import aiohttp
 import numpy as np
 import ray
 import torch
 from omegaconf import DictConfig, open_dict
+from PIL import Image
 from tensordict import TensorDict
 
 from verl.protocol import DataProto
@@ -28,6 +31,7 @@
 from verl.trainer.ppo.reward import load_reward_manager
 from verl.utils import hf_tokenizer
 from verl.utils.fs import copy_to_local
+from verl.utils.ray_utils import get_event_loop
 
 from .reward_model import RewardModelManager
 
@@ -114,9 +118,10 @@ def __init__(self, config: DictConfig, reward_router_address: str = None):
         self.config = config
         self.reward_router_address = reward_router_address
         self._init_reward_fn()
+        self.loop = get_event_loop()
 
     def _init_reward_fn(self):
-        input_tokenizer_local_path = copy_to_local(self.config.actor_rollout_ref.model.path)
+        input_tokenizer_local_path = copy_to_local(self.config.actor_rollout_ref.model.tokenizer_path)
         self.input_tokenizer = hf_tokenizer(input_tokenizer_local_path, trust_remote_code=True)
         self.reward_model_tokenizer = None
         if self.config.reward.reward_model.enable:
@@ -199,17 +204,32 @@ async def _preprocess_reward_inputs(self, data: DataProto) -> str:
         chat: list = list(data_item.non_tensor_batch["raw_prompt"])
 
         # extract response
-        response_ids = data_item.batch["responses"]
-        response_length = response_ids.shape[-1]
-        valid_response_length = data_item.batch["attention_mask"][-response_length:].sum()
-        valid_response_ids = response_ids[:valid_response_length]
+        response = data_item.batch["responses"]
+        if response.ndim == 3:
+            # image response in CHW format: convert to an HWC uint8 PIL image
+            response_image = response
+            if isinstance(response_image, torch.Tensor):
+                response_image = response_image.float().permute(1, 2, 0).cpu().numpy()
+            assert response_image.shape[-1] == 3, "must be in HWC format"
+            response_image = (response_image * 255).round().clip(0, 255).astype(np.uint8)
+            response_image = Image.fromarray(response_image)
+
+            image_base64 = await self.loop.run_in_executor(None, self._pil_image_to_base64, response_image)
+            query = self.prepare_query_for_multi_modal(image_base64)
+
+            chat.append({"role": "assistant", "content": query})
+        else:
+            response_ids = response
+            response_length = response_ids.shape[-1]
+            valid_response_length = data_item.batch["attention_mask"][-response_length:].sum()
+            valid_response_ids = response_ids[:valid_response_length]
 
-        # decode
-        rollout_response = self.input_tokenizer.decode(valid_response_ids)
-        # remove bos and eos
-        rollout_response = rollout_response.replace(self.input_tokenizer.eos_token, "")
+            # decode
+            rollout_response = self.input_tokenizer.decode(valid_response_ids)
+            # remove bos and eos
+            rollout_response = rollout_response.replace(self.input_tokenizer.eos_token, "")
 
-        chat.append({"role": "assistant", "content": rollout_response})
+            chat.append({"role": "assistant", "content": rollout_response})
 
         rm_prompt = self.reward_model_tokenizer.apply_chat_template(
             chat,
@@ -267,6 +287,22 @@ async def compute_score_disrm(self, data: DataProto) -> dict:
 
         return {"reward_score": rm_score}
 
+    def _pil_image_to_base64(self, image: Image.Image) -> str:
+        buffered = BytesIO()
+        image.save(buffered, format="PNG")
+        encoded_image_text = base64.b64encode(buffered.getvalue()).decode("utf-8")
+        base64_image = f"data:image;base64,{encoded_image_text}"
+        return base64_image
+
+    def prepare_query_for_multi_modal(self, image_base64: str) -> list:
+        query = [
+            {
+                "type": "image_url",
+                "image_url": {"url": image_base64},
+            },
+        ]
+        return query
+
 
 class RewardLoopManager:
     """
@@ -281,7 +317,7 @@ def __init__(self, config: DictConfig, rm_resource_pool: RayResourcePool = None)
             self.reward_router_address = self.reward_model_manager.get_router_address()
         else:
             self.reward_model_manager = None
-            self.reward_router_address = None
+            self.reward_router_address = self.config.reward.reward_model.get("reward_router_address", None)
 
         self.reward_loop_workers_class = ray.remote(RewardLoopWorker)
         self._init_reward_loop_workers()
@@ -319,12 +355,15 @@ def compute_rm_score(self, data: DataProto) -> DataProto:
 
         # compute rm score
         scores = [item["reward_score"] for item in outputs_flat]
-        prompt_length = data.batch["prompts"].size(1)
-        valid_response_length = data.batch["attention_mask"][:, prompt_length:].sum(dim=1)
-        rm_scores = torch.zeros_like(data.batch["responses"], dtype=torch.float32)
-        rm_scores[torch.arange(rm_scores.size(0)), valid_response_length - 1] = torch.tensor(
-            scores, dtype=torch.float32
-        )
+        if self.config.reward.reward_manager.name == "image":
+            rm_scores = torch.tensor(scores, dtype=torch.float32).unsqueeze(-1)
+        else:
+            prompt_length = data.batch["prompts"].size(1)
+            valid_response_length = data.batch["attention_mask"][:, prompt_length:].sum(dim=1)
+            rm_scores = torch.zeros_like(data.batch["responses"], dtype=torch.float32)
+            rm_scores[torch.arange(rm_scores.size(0)), valid_response_length - 1] = torch.tensor(
+                scores, dtype=torch.float32
+            )
         batch = TensorDict({"rm_scores": rm_scores}, batch_size=len(data))
 
         reward_extra_infos = [output.get("reward_extra_info", {}) for output in outputs_flat]
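The branch on `reward_manager.name` above produces two different `rm_scores` layouts: one score per sample for images, and a token-level tensor with the score on the last valid response token otherwise. A dependency-free sketch of both layouts follows, using nested lists in place of tensors. The helper names `image_rm_scores` and `token_rm_scores` are hypothetical and not part of this PR.

```python
def image_rm_scores(scores):
    """Image case: one score per sample, i.e. shape (batch, 1)."""
    return [[float(s)] for s in scores]


def token_rm_scores(scores, attention_mask, prompt_length, response_length):
    """Token case: zeros everywhere except the last valid response token."""
    rm_scores = []
    for score, mask in zip(scores, attention_mask):
        # Count non-padding tokens in the response part of the mask.
        valid_response_length = sum(mask[prompt_length:])
        row = [0.0] * response_length
        row[valid_response_length - 1] = float(score)
        rm_scores.append(row)
    return rm_scores


# Two samples: 2 prompt tokens followed by 4 response slots each.
example_mask = [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
image_layout = image_rm_scores([0.5, 1.0])
token_layout = token_rm_scores([0.5, 1.0], example_mask, prompt_length=2, response_length=4)
```

As in the tensor version, a response with no valid tokens would index position `-1`, so the score would land on the final slot.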
diff --git a/verl/experimental/reward_loop/reward_manager/__init__.py b/verl/experimental/reward_loop/reward_manager/__init__.py
index 75a440a2324..dc0a541c09b 100644
--- a/verl/experimental/reward_loop/reward_manager/__init__.py
+++ b/verl/experimental/reward_loop/reward_manager/__init__.py
@@ -17,12 +17,14 @@
 from .naive import NaiveRewardManager
 from .limited import RateLimitedRewardManager
 from .remote import RemoteRewardManager
+from .image import ImageRewardManager
 
 __all__ = [
     "DAPORewardManager",
     "NaiveRewardManager",
     "RateLimitedRewardManager",
     "RemoteRewardManager",
+    "ImageRewardManager",
     "register",
     "get_reward_manager_cls",
 ]
diff --git a/verl/experimental/reward_loop/reward_manager/image.py b/verl/experimental/reward_loop/reward_manager/image.py
new file mode 100644
index 00000000000..1852c83041d
--- /dev/null
+++ b/verl/experimental/reward_loop/reward_manager/image.py
@@ -0,0 +1,92 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import inspect
+
+from verl import DataProto
+from verl.experimental.reward_loop.reward_manager import register
+from verl.experimental.reward_loop.reward_manager.base import RewardManagerBase
+from verl.utils.reward_score import default_compute_score_image
+
+
+@register("image")
+class ImageRewardManager(RewardManagerBase):
+    """The reward manager for image response."""
+
+    def __init__(self, config, tokenizer, compute_score, reward_router_address=None, reward_model_tokenizer=None):
+        super().__init__(config, tokenizer, compute_score)
+        self.compute_score = compute_score or default_compute_score_image
+        self.is_async_reward_score = inspect.iscoroutinefunction(self.compute_score)
+        self.reward_router_address = reward_router_address
+        self.reward_model_tokenizer = reward_model_tokenizer
+
+    async def run_single(self, data: DataProto) -> dict:
+        assert len(data) == 1, "Only a single data item is supported"
+        data_item = data[0]
+        response_image = data_item.batch["responses"]
+        data_source = data_item.non_tensor_batch["data_source"]
+        ground_truth = data_item.non_tensor_batch["reward_model"]["ground_truth"]
+        extra_info = data_item.non_tensor_batch.get("extra_info", {})
+        tool_extra_fields = data_item.non_tensor_batch.get("tool_extra_fields", None)
+        if tool_extra_fields is not None:
+            extra_info.update(tool_extra_fields.items())
+
+        num_turns = data_item.non_tensor_batch.get("__num_turns__", None)
+        rollout_reward_scores = data_item.non_tensor_batch.get("reward_scores", {})
+        extra_info["num_turns"] = num_turns
+        extra_info["rollout_reward_scores"] = rollout_reward_scores
+
+        extra_reward_kwargs = (
+            {
+                "reward_router_address": self.reward_router_address,
+                "reward_model_tokenizer": self.reward_model_tokenizer,
+                "model_name": self.config.reward.reward_model.model_path,
+            }
+            if self.reward_router_address is not None
+            else {}
+        )
+        if self.is_async_reward_score:
+            result = await self.compute_score(
+                data_source=data_source,
+                solution_image=response_image,
+                ground_truth=ground_truth,
+                extra_info=extra_info,
+                **extra_reward_kwargs,
+            )
+        else:
+            result = await self.loop.run_in_executor(
+                None,
+                lambda: self.compute_score(
+                    data_source=data_source,
+                    solution_image=response_image,
+                    ground_truth=ground_truth,
+                    extra_info=extra_info,
+                    **extra_reward_kwargs,
+                ),
+            )
+
+        reward_extra_info = {}
+
+        score: float
+        if isinstance(result, dict):
+            score = result["score"]
+            for key, value in result.items():
+                reward_extra_info[key] = value
+        else:
+            score = result
+            reward_extra_info["acc"] = score
+
+        reward = score
+
+        return {"reward_score": reward, "reward_extra_info": reward_extra_info}
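`ImageRewardManager.run_single` does two things that are easy to miss in the diff: it dispatches to the scorer differently depending on whether it is a coroutine function, and it normalizes a float or dict result into one return format. A minimal standalone sketch of that pattern follows. The scorers `sync_score` and `async_score` are hypothetical placeholders, and `asyncio.get_running_loop()` stands in for the manager's `self.loop`, which is an assumption about how that attribute is obtained.

```python
import asyncio
import inspect


async def call_score_fn(compute_score, **kwargs):
    """Await coroutine functions directly; run plain functions in the default executor."""
    if inspect.iscoroutinefunction(compute_score):
        return await compute_score(**kwargs)
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: compute_score(**kwargs))


def normalize_result(result):
    """Map a bare score, or a dict with a "score" key, to the manager's return format."""
    if isinstance(result, dict):
        return {"reward_score": result["score"], "reward_extra_info": dict(result)}
    return {"reward_score": result, "reward_extra_info": {"acc": result}}


def sync_score(solution_image, ground_truth):
    # Hypothetical blocking scorer returning a bare float.
    return 1.0 if solution_image == ground_truth else 0.0


async def async_score(solution_image, ground_truth):
    # Hypothetical async scorer returning a dict with extra fields.
    return {"score": 0.25, "detail": "placeholder"}


async def demo():
    a = normalize_result(await call_score_fn(sync_score, solution_image="x", ground_truth="x"))
    b = normalize_result(await call_score_fn(async_score, solution_image="x", ground_truth="y"))
    return a, b


result_sync, result_async = asyncio.run(demo())
```

Running the blocking scorer in an executor keeps the event loop free while other samples are being scored.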
diff --git a/verl/trainer/config/_generated_ppo_diffusion_trainer.yaml b/verl/trainer/config/_generated_ppo_diffusion_trainer.yaml
new file mode 100644
index 00000000000..42f546d365f
--- /dev/null
+++ b/verl/trainer/config/_generated_ppo_diffusion_trainer.yaml
@@ -0,0 +1,703 @@
+# This reference configuration YAML is automatically generated via 'scripts/generate_trainer_config.sh',
+# which invokes 'python3 scripts/print_cfg.py --cfg job --config-name=ppo_diffusion_trainer.yaml' to flatten the 'verl/trainer/config/ppo_diffusion_trainer.yaml' config fields into a single file.
+# Do not modify this file directly.
+# The file is usually only for reference and never used.
+
+actor_rollout_ref:
+  actor:
+    optim:
+      _target_: verl.workers.config.FSDPOptimizerConfig
+      optimizer: AdamW
+      optimizer_impl: torch.optim
+      lr: 1.0e-06
+      lr_warmup_steps_ratio: 0.0
+      total_training_steps: -1
+      weight_decay: 0.01
+      lr_warmup_steps: -1
+      betas:
+      - 0.9
+      - 0.999
+      clip_grad: 1.0
+      min_lr_ratio: 0.0
+      num_cycles: 0.5
+      lr_scheduler_type: constant
+      zero_indexed_step: true
+      warmup_style: null
+      override_optimizer_config: null
+    fsdp_config:
+      _target_: verl.workers.config.FSDPEngineConfig
+      wrap_policy:
+        min_num_params: 0
+      param_offload: false
+      optimizer_offload: false
+      offload_policy: false
+      reshard_after_forward: true
+      fsdp_size: -1
+      forward_prefetch: false
+      model_dtype: fp32
+      use_orig_params: false
+      seed: 42
+      full_determinism: false
+      ulysses_sequence_parallel_size: 1
+      entropy_from_logits_with_chunking: false
+      use_torch_compile: true
+      entropy_checkpointing: false
+      forward_only: false
+      strategy: fsdp
+      dtype: bfloat16
+      qat:
+        _target_: verl.workers.config.QATEngineConfig
+        enable: false
+        mode: w4a16
+        group_size: 16
+        ignore_patterns:
+        - lm_head
+        - embed_tokens
+        - re:.*mlp.gate$
+        activation_observer: static_minmax
+        quantization_config_path: null
+    _target_: verl.workers.config.FSDPActorConfig
+    rollout_n: ${oc.select:actor_rollout_ref.rollout.n,1}
+    strategy: fsdp
+    ppo_mini_batch_size: 256
+    ppo_micro_batch_size: null
+    ppo_micro_batch_size_per_gpu: null
+    use_dynamic_bsz: false
+    ppo_max_token_len_per_gpu: 16384
+    clip_ratio: 0.0001
+    clip_ratio_low: 0.2
+    clip_ratio_high: 5.0
+    tau_pos: 1.0
+    tau_neg: 1.05
+    freeze_vision_tower: false
+    policy_loss:
+      _target_: verl.workers.config.PolicyLossConfig
+      loss_mode: vanilla
+      clip_cov_ratio: 0.0002
+      clip_cov_lb: 1.0
+      clip_cov_ub: 5.0
+      kl_cov_ratio: 0.0002
+      ppo_kl_coef: 0.1
+    clip_ratio_c: 3.0
+    loss_agg_mode: token-mean
+    loss_scale_factor: null
+    entropy_coeff: 0
+    calculate_entropy: false
+    use_kl_loss: false
+    use_prefix_grouper: false
+    use_torch_compile: true
+    kl_loss_coef: 0.001
+    kl_loss_type: low_var_kl
+    ppo_epochs: 1
+    shuffle: false
+    data_loader_seed: 42
+    checkpoint:
+      _target_: verl.trainer.config.CheckpointConfig
+      save_contents:
+      - model
+      - optimizer
+      - extra
+      load_contents: ${.save_contents}
+      async_save: false
+      mbridge_config: {}
+    use_fused_kernels: ${oc.select:actor_rollout_ref.model.use_fused_kernels,false}
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: false
+      all_ranks: false
+      ranks: []
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config:
+        nsys:
+          _target_: verl.utils.profiler.config.NsightToolConfig
+          discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+        npu:
+          _target_: verl.utils.profiler.config.NPUToolConfig
+          contents: []
+          level: level0
+          analysis: true
+          discrete: false
+        torch:
+          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+          contents: []
+          discrete: false
+        torch_memory:
+          _target_: verl.utils.profiler.config.TorchMemoryToolConfig
+          trace_alloc_max_entries: ${oc.select:global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries,100000}
+          stack_depth: ${oc.select:global_profiler.global_tool_config.torch_memory.stack_depth,32}
+    router_replay:
+      _target_: verl.workers.config.RouterReplayConfig
+      mode: disabled
+      record_file: null
+      replay_file: null
+    grad_clip: 1.0
+    ulysses_sequence_parallel_size: 1
+    entropy_from_logits_with_chunking: false
+    entropy_checkpointing: false
+    use_remove_padding: ${oc.select:actor_rollout_ref.model.use_remove_padding,false}
+    calculate_sum_pi_squared: false
+    sum_pi_squared_checkpointing: false
+    qat:
+      enable: false
+      mode: w4a16
+      group_size: 16
+      ignore_patterns:
+      - lm_head
+      - embed_tokens
+      - re:.*mlp.gate$
+      activation_observer: static_minmax
+      quantization_config_path: null
+  ref:
+    rollout_n: ${oc.select:actor_rollout_ref.rollout.n,1}
+    strategy: ${actor_rollout_ref.actor.strategy}
+    use_torch_compile: ${oc.select:actor_rollout_ref.actor.use_torch_compile,true}
+    log_prob_micro_batch_size: null
+    log_prob_micro_batch_size_per_gpu: null
+    log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
+    log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: false
+      all_ranks: false
+      ranks: []
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config:
+        nsys:
+          _target_: verl.utils.profiler.config.NsightToolConfig
+          discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+        npu:
+          _target_: verl.utils.profiler.config.NPUToolConfig
+          contents: []
+          level: level0
+          analysis: true
+          discrete: false
+        torch:
+          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+          contents: []
+          discrete: false
+        torch_memory:
+          _target_: verl.utils.profiler.config.TorchMemoryToolConfig
+          trace_alloc_max_entries: ${oc.select:global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries,100000}
+          stack_depth: ${oc.select:global_profiler.global_tool_config.torch_memory.stack_depth,32}
+    router_replay:
+      _target_: verl.workers.config.RouterReplayConfig
+      mode: disabled
+      record_file: null
+      replay_file: null
+    fsdp_config:
+      _target_: verl.workers.config.FSDPEngineConfig
+      wrap_policy:
+        min_num_params: 0
+      param_offload: false
+      optimizer_offload: false
+      offload_policy: false
+      reshard_after_forward: true
+      fsdp_size: -1
+      forward_prefetch: false
+      model_dtype: fp32
+      use_orig_params: false
+      seed: 42
+      full_determinism: false
+      ulysses_sequence_parallel_size: 1
+      entropy_from_logits_with_chunking: false
+      use_torch_compile: true
+      entropy_checkpointing: false
+      forward_only: true
+      strategy: fsdp
+      dtype: bfloat16
+      qat:
+        _target_: verl.workers.config.QATEngineConfig
+        enable: false
+        mode: w4a16
+        group_size: 16
+        ignore_patterns:
+        - lm_head
+        - embed_tokens
+        - re:.*mlp.gate$
+        activation_observer: static_minmax
+        quantization_config_path: null
+    _target_: verl.workers.config.FSDPActorConfig
+    ulysses_sequence_parallel_size: ${oc.select:actor_rollout_ref.actor.ulysses_sequence_parallel_size,1}
+    entropy_from_logits_with_chunking: false
+    entropy_checkpointing: false
+  rollout:
+    _target_: verl.workers.config.DiffusionRolloutConfig
+    name: ???
+    mode: async
+    nnodes: 0
+    n_gpus_per_node: ${oc.select:trainer.n_gpus_per_node,8}
+    prompt_length: ${oc.select:data.max_prompt_length,512}
+    dtype: bfloat16
+    gpu_memory_utilization: 0.5
+    enforce_eager: false
+    cudagraph_capture_sizes: null
+    free_cache_engine: true
+    tensor_model_parallel_size: 2
+    data_parallel_size: 1
+    expert_parallel_size: 1
+    pipeline_model_parallel_size: 1
+    max_num_batched_tokens: 8192
+    max_model_len: null
+    max_num_seqs: 1024
+    enable_chunked_prefill: true
+    enable_prefix_caching: true
+    logprobs_mode: processed_logprobs
+    scheduling_policy: fcfs
+    load_format: dummy
+    log_prob_micro_batch_size: null
+    log_prob_micro_batch_size_per_gpu: null
+    log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
+    log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+    disable_log_stats: true
+    do_sample: true
+    'n': 1
+    image_height: 512
+    image_width: 512
+    num_inference_steps: 10
+    noise_level: 0.7
+    guidance_scale: 4.5
+    sde_type: sde
+    sde_window_size: null
+    sde_window_range: null
+    engine_kwargs:
+      vllm_omni: {}
+    val_kwargs:
+      _target_: verl.workers.config.DiffusionSamplingConfig
+      'n': 1
+      do_sample: false
+      num_inference_steps: 40
+      noise_level: 0.0
+      seed: 42
+    multi_turn:
+      _target_: verl.workers.config.MultiTurnConfig
+      enable: false
+      max_assistant_turns: null
+      tool_config_path: null
+      max_user_turns: null
+      max_parallel_calls: 1
+      max_tool_response_length: 256
+      tool_response_truncate_side: middle
+      interaction_config_path: null
+      use_inference_chat_template: false
+      tokenization_sanity_check_mode: strict
+      format: hermes
+      num_repeat_rollouts: null
+    calculate_log_probs: false
+    agent:
+      _target_: verl.workers.config.AgentLoopConfig
+      num_workers: 8
+      default_agent_loop: single_turn_agent
+      agent_loop_config_path: null
+      custom_async_server:
+        _target_: verl.workers.config.CustomAsyncServerConfig
+        path: null
+        name: null
+    checkpoint_engine:
+      _target_: verl.workers.config.CheckpointEngineConfig
+      backend: naive
+      update_weights_bucket_megabytes: 2048
+      engine_kwargs: {}
+    enable_rollout_routing_replay: false
+    profiler:
+      _target_: verl.utils.profiler.ProfilerConfig
+      tool: ${oc.select:global_profiler.tool,null}
+      enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+      all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+      ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+      save_path: ${oc.select:global_profiler.save_path,null}
+      tool_config:
+        npu:
+          _target_: verl.utils.profiler.config.NPUToolConfig
+          contents: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.contents,[]}
+          level: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.level,level0}
+          analysis: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.analysis,false}
+          discrete: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.discrete,false}
+        torch:
+          _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+          contents: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.torch.contents,[]}
+          discrete: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.torch.discrete,false}
+    prometheus:
+      _target_: verl.workers.config.PrometheusConfig
+      enable: false
+      port: 9090
+      file: /tmp/ray/session_latest/metrics/prometheus/prometheus.yml
+      served_model_name: ${oc.select:actor_rollout_ref.model.path,null}
+    quantization: null
+    quantization_config_file: null
+    layered_summon: false
+  model:
+    _target_: verl.workers.config.DiffusersModelConfig
+    path: ~/models/Qwen/Qwen-Image
+    tokenizer_path: null
+    use_shm: false
+    trust_remote_code: false
+    custom_chat_template: null
+    external_lib: null
+    enable_gradient_checkpointing: true
+    enable_activation_offload: false
+    use_remove_padding: true
+    lora_rank: 0
+    lora_alpha: 16
+    target_modules: all-linear
+    exclude_modules: null
+    lora_adapter_path: null
+    use_liger: false
+    use_fused_kernels: false
+    fused_kernel_options:
+      impl_backend: torch
+    tiled_mlp:
+      enabled: false
+      num_shards: 4
+    image_height: ${oc.select:actor_rollout_ref.rollout.image_height,512}
+    image_width: ${oc.select:actor_rollout_ref.rollout.image_width,512}
+    num_inference_steps: ${oc.select:actor_rollout_ref.rollout.num_inference_steps,10}
+    noise_level: ${oc.select:actor_rollout_ref.rollout.noise_level,0.7}
+    guidance_scale: ${oc.select:actor_rollout_ref.rollout.guidance_scale,1.0}
+    sde_type: ${oc.select:actor_rollout_ref.rollout.sde_type,sde}
+    model_type: diffusion_model
+  hybrid_engine: true
+  nccl_timeout: 600
+data:
+  tokenizer: null
+  use_shm: false
+  train_files: ~/data/rlhf/gsm8k/train.parquet
+  val_files: ~/data/rlhf/gsm8k/test.parquet
+  train_max_samples: -1
+  val_max_samples: -1
+  prompt_key: prompt
+  reward_fn_key: data_source
+  max_prompt_length: 512
+  max_response_length: 512
+  train_batch_size: 1024
+  val_batch_size: null
+  tool_config_path: ${oc.select:actor_rollout_ref.rollout.multi_turn.tool_config_path,
+    null}
+  return_raw_input_ids: false
+  return_raw_chat: true
+  return_full_prompt: false
+  shuffle: true
+  seed: null
+  dataloader_num_workers: 8
+  image_patch_size: 14
+  validation_shuffle: false
+  filter_overlong_prompts: false
+  filter_overlong_prompts_workers: 1
+  truncation: error
+  image_key: images
+  video_key: videos
+  trust_remote_code: false
+  custom_cls:
+    path: null
+    name: null
+  return_multi_modal_inputs: true
+  sampler:
+    class_path: null
+    class_name: null
+  datagen:
+    path: null
+    name: null
+  apply_chat_template_kwargs: {}
+  data_source: prompt
+critic:
+  optim:
+    _target_: verl.workers.config.FSDPOptimizerConfig
+    optimizer: AdamW
+    optimizer_impl: torch.optim
+    lr: 1.0e-05
+    lr_warmup_steps_ratio: 0.0
+    total_training_steps: -1
+    weight_decay: 0.01
+    lr_warmup_steps: -1
+    betas:
+    - 0.9
+    - 0.999
+    clip_grad: 1.0
+    min_lr_ratio: 0.0
+    num_cycles: 0.5
+    lr_scheduler_type: constant
+    zero_indexed_step: true
+    warmup_style: null
+    override_optimizer_config: null
+  model:
+    fsdp_config:
+      _target_: verl.workers.config.FSDPEngineConfig
+      wrap_policy:
+        min_num_params: 0
+      param_offload: false
+      optimizer_offload: false
+      offload_policy: false
+      reshard_after_forward: true
+      fsdp_size: -1
+      forward_prefetch: false
+      model_dtype: fp32
+      use_orig_params: false
+      seed: 42
+      full_determinism: false
+      ulysses_sequence_parallel_size: 1
+      entropy_from_logits_with_chunking: false
+      use_torch_compile: true
+      entropy_checkpointing: false
+      forward_only: false
+      strategy: fsdp
+      dtype: bfloat16
+      qat:
+        _target_: verl.workers.config.QATEngineConfig
+        enable: false
+        mode: w4a16
+        group_size: 16
+        ignore_patterns:
+        - lm_head
+        - embed_tokens
+        - re:.*mlp.gate$
+        activation_observer: static_minmax
+        quantization_config_path: null
+    path: ~/models/deepseek-llm-7b-chat
+    tokenizer_path: ${oc.select:actor_rollout_ref.model.path,"~/models/deepseek-llm-7b-chat"}
+    override_config: {}
+    external_lib: ${oc.select:actor_rollout_ref.model.external_lib,null}
+    trust_remote_code: ${oc.select:actor_rollout_ref.model.trust_remote_code,false}
+    _target_: verl.workers.config.FSDPCriticModelCfg
+    use_shm: false
+    enable_gradient_checkpointing: true
+    enable_activation_offload: false
+    use_remove_padding: false
+    lora_rank: 0
+    lora_alpha: 16
+    target_modules: all-linear
+    tiled_mlp:
+      enabled: false
+      num_shards: 4
+  _target_: verl.workers.config.FSDPCriticConfig
+  rollout_n: ${oc.select:actor_rollout_ref.rollout.n,1}
+  strategy: fsdp
+  enable: null
+  ppo_mini_batch_size: ${oc.select:actor_rollout_ref.actor.ppo_mini_batch_size,256}
+  ppo_micro_batch_size: null
+  ppo_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size,null}
+  use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
+  ppo_max_token_len_per_gpu: 32768
+  forward_max_token_len_per_gpu: ${.ppo_max_token_len_per_gpu}
+  ppo_epochs: ${oc.select:actor_rollout_ref.actor.ppo_epochs,1}
+  shuffle: ${oc.select:actor_rollout_ref.actor.shuffle,false}
+  data_loader_seed: 42
+  cliprange_value: 0.5
+  loss_agg_mode: ${oc.select:actor_rollout_ref.actor.loss_agg_mode,token-mean}
+  checkpoint:
+    _target_: verl.trainer.config.CheckpointConfig
+    save_contents:
+    - model
+    - optimizer
+    - extra
+    load_contents: ${.save_contents}
+    async_save: false
+    mbridge_config: {}
+  profiler:
+    _target_: verl.utils.profiler.ProfilerConfig
+    tool: ${oc.select:global_profiler.tool,null}
+    enable: false
+    all_ranks: false
+    ranks: []
+    save_path: ${oc.select:global_profiler.save_path,null}
+    tool_config:
+      nsys:
+        _target_: verl.utils.profiler.config.NsightToolConfig
+        discrete: ${oc.select:global_profiler.global_tool_config.nsys.discrete}
+      npu:
+        _target_: verl.utils.profiler.config.NPUToolConfig
+        contents: []
+        level: level0
+        analysis: true
+        discrete: false
+      torch:
+        _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+        contents: []
+        discrete: false
+      torch_memory:
+        _target_: verl.utils.profiler.config.TorchMemoryToolConfig
+        trace_alloc_max_entries: ${oc.select:global_profiler.global_tool_config.torch_memory.trace_alloc_max_entries,100000}
+        stack_depth: ${oc.select:global_profiler.global_tool_config.torch_memory.stack_depth,32}
+  forward_micro_batch_size: ${oc.select:.ppo_micro_batch_size,null}
+  forward_micro_batch_size_per_gpu: ${oc.select:.ppo_micro_batch_size_per_gpu,null}
+  ulysses_sequence_parallel_size: 1
+  grad_clip: 1.0
+custom_reward_function:
+  path: null
+  name: null
+reward_model:
+  num_workers: null
+  reward_manager: null
+  enable: null
+  enable_resource_pool: null
+  n_gpus_per_node: null
+  nnodes: null
+  reward_loop_source: null
+  reward_loop_module_path: null
+  reward_loop_class_name: null
+  model:
+    path: null
+    external_lib: null
+    trust_remote_code: null
+  rollout:
+    name: null
+    dtype: null
+    gpu_memory_utilization: null
+    enforce_eager: null
+    cudagraph_capture_sizes: null
+    free_cache_engine: null
+    data_parallel_size: null
+    expert_parallel_size: null
+    tensor_model_parallel_size: null
+    max_num_batched_tokens: null
+    max_model_len: null
+    max_num_seqs: null
+    load_format: null
+    engine_kwargs: null
+    limit_images: null
+    enable_chunked_prefill: null
+    enable_prefix_caching: null
+    disable_log_stats: null
+    skip_tokenizer_init: null
+    prompt_length: null
+    response_length: null
+sandbox_fusion:
+  url: null
+  max_concurrent: null
+  memory_limit_mb: null
+reward:
+  num_workers: 8
+  custom_reward_function:
+    path: null
+    name: compute_score
+  reward_manager:
+    _target_: verl.workers.config.reward_model.RewardManagerConfig
+    source: register
+    name: naive
+    module:
+      _target_: verl.trainer.config.config.ModuleConfig
+      path: null
+      name: custom_reward_manager
+  reward_model:
+    enable: false
+    enable_resource_pool: false
+    n_gpus_per_node: 8
+    nnodes: 0
+    model_path: null
+    rollout:
+      _target_: verl.workers.config.RolloutConfig
+      name: ???
+      dtype: bfloat16
+      gpu_memory_utilization: 0.5
+      enforce_eager: true
+      cudagraph_capture_sizes: null
+      free_cache_engine: true
+      data_parallel_size: 1
+      expert_parallel_size: 1
+      tensor_model_parallel_size: 2
+      max_num_batched_tokens: 8192
+      max_model_len: null
+      max_num_seqs: 1024
+      load_format: auto
+      engine_kwargs: {}
+      limit_images: null
+      enable_chunked_prefill: true
+      enable_prefix_caching: true
+      disable_log_stats: true
+      skip_tokenizer_init: false
+      prompt_length: 2048
+      response_length: 2048
+  sandbox_fusion:
+    url: null
+    max_concurrent: 64
+    memory_limit_mb: 1024
+algorithm:
+  rollout_correction:
+    rollout_is: null
+    rollout_is_threshold: 2.0
+    rollout_rs: null
+    rollout_rs_threshold: null
+    bypass_mode: false
+    loss_type: ppo_clip
+    rollout_is_batch_normalize: false
+  _target_: verl.trainer.config.AlgoConfig
+  gamma: 1.0
+  lam: 1.0
+  adv_estimator: gae
+  norm_adv_by_std_in_grpo: true
+  use_kl_in_reward: false
+  kl_penalty: kl
+  kl_ctrl:
+    _target_: verl.trainer.config.KLControlConfig
+    type: fixed
+    kl_coef: 0.001
+    horizon: 10000
+    target_kl: 0.1
+  use_pf_ppo: false
+  pf_ppo:
+    reweight_method: pow
+    weight_pow: 2.0
+  global_std: true
+trainer:
+  balance_batch: true
+  total_epochs: 30
+  total_training_steps: null
+  project_name: verl_examples
+  experiment_name: gsm8k
+  logger:
+  - console
+  - wandb
+  log_val_generations: 0
+  rollout_data_dir: null
+  validation_data_dir: null
+  nnodes: 1
+  n_gpus_per_node: 8
+  save_freq: -1
+  esi_redundant_time: 0
+  resume_mode: auto
+  resume_from_path: null
+  val_before_train: true
+  val_only: false
+  test_freq: -1
+  critic_warmup: 0
+  default_hdfs_dir: null
+  del_local_ckpt_after_load: false
+  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
+  max_actor_ckpt_to_keep: null
+  max_critic_ckpt_to_keep: null
+  ray_wait_register_center_timeout: 300
+  device: cuda
+  use_legacy_worker_impl: auto
+global_profiler:
+  _target_: verl.utils.profiler.ProfilerConfig
+  tool: null
+  steps: null
+  profile_continuous_steps: false
+  save_path: outputs/profile
+  global_tool_config:
+    nsys:
+      _target_: verl.utils.profiler.config.NsightToolConfig
+      discrete: false
+      controller_nsight_options:
+        trace: cuda,nvtx,cublas,ucx
+        cuda-memory-usage: 'true'
+        cuda-graph-trace: graph
+      worker_nsight_options:
+        trace: cuda,nvtx,cublas,ucx
+        cuda-memory-usage: 'true'
+        cuda-graph-trace: graph
+        capture-range: cudaProfilerApi
+        capture-range-end: null
+        kill: none
+    torch_memory:
+      trace_alloc_max_entries: 100000
+      stack_depth: 32
+      context: all
+      stacks: all
+      kw_args: {}
+transfer_queue:
+  enable: false
+ray_kwargs:
+  ray_init:
+    num_cpus: null
+  timeline_json_file: null
diff --git a/verl/trainer/config/algorithm.py b/verl/trainer/config/algorithm.py
index 5aa650d7bf9..1c9497e4b77 100644
--- a/verl/trainer/config/algorithm.py
+++ b/verl/trainer/config/algorithm.py
@@ -612,3 +612,4 @@ class AlgoConfig(BaseConfig):
     # Rollout Correction: corrects off-policy issues (policy mismatch, model staleness, distribution shifts)
     # Set to None to disable, use RolloutCorrectionConfig presets (e.g., .tis(), .mis()), or pass dict
     rollout_correction: Optional[RolloutCorrectionConfig] = None
+    global_std: bool = True
diff --git a/verl/trainer/config/model/diffusion_model.yaml b/verl/trainer/config/model/diffusion_model.yaml
new file mode 100644
index 00000000000..12fd0241a41
--- /dev/null
+++ b/verl/trainer/config/model/diffusion_model.yaml
@@ -0,0 +1,95 @@
+# Format checks enforced on CI:
+# 1. Comments must appear above each field.
+# 2. There must be a blank line between each field.
+# 3. Inline comments (after a field on the same line) are not allowed.
+# 4. Indentation level is respected for nested fields.
+
+_target_: verl.workers.config.DiffusersModelConfig
+
+# path to the huggingface model
+path: ~/models/Qwen/Qwen-Image
+
+# path to the huggingface config. In case it is not the same as path
+# hf_config_path: null
+
+# path to the huggingface tokenizer. In case it is not the same as path
+tokenizer_path: null
+
+# whether to use shared memory for model loading
+use_shm: False
+
+# whether to trust remote code.
+trust_remote_code: False
+
+# custom chat template for the model
+custom_chat_template: null
+
+# whether to use external libs for the model
+external_lib: null
+
+# override hf config
+# override_config: {}
+
+# whether to enable gradient checkpointing. Only valid when we use hf model definition
+enable_gradient_checkpointing: True
+
+# whether to enable activation offload. Only valid when we use hf model definition
+enable_activation_offload: False
+
+# whether to use remove padding. Only valid when we use hf model definition
+use_remove_padding: True
+
+# Set to positive value to enable LoRA (e.g., 32)
+lora_rank: 0
+
+# LoRA scaling factor
+lora_alpha: 16
+
+# Target modules for LoRA adaptation
+target_modules: all-linear
+
+# Exclude modules from LoRA adaptation
+exclude_modules: null
+
+# Path to pre-trained LoRA adapter to load for continued training
+lora_adapter_path: null
+
+# whether to use liger. Only valid when we use hf model definition
+use_liger: False
+
+# whether to use fused kernels.
+use_fused_kernels: False
+
+# fused kernel options.
+fused_kernel_options:
+
+  # the implementation backend for fused kernels.
+  impl_backend: torch
+
+# TiledMLP configuration for memory-efficient MLP computation.
+# Reduces peak memory by processing MLP forward/backward in tiles.
+tiled_mlp:
+
+  # whether to enable TiledMLP
+  enabled: False
+
+  # number of shards to split the input. Higher values reduce peak memory but may slightly impact performance.
+  num_shards: 4
+
+# image height
+image_height: ${oc.select:actor_rollout_ref.rollout.image_height,512}
+
+# image width
+image_width: ${oc.select:actor_rollout_ref.rollout.image_width,512}
+
+# inference steps
+num_inference_steps: ${oc.select:actor_rollout_ref.rollout.num_inference_steps,10}
+
+# noise in SDE
+noise_level: ${oc.select:actor_rollout_ref.rollout.noise_level,0.7}
+
+# guidance scale for classifier-free guidance
+guidance_scale: ${oc.select:actor_rollout_ref.rollout.guidance_scale,1.0}
+
+# SDE type during rollout
+sde_type: ${oc.select:actor_rollout_ref.rollout.sde_type,sde}
diff --git a/verl/trainer/config/ppo_diffusion_trainer.yaml b/verl/trainer/config/ppo_diffusion_trainer.yaml
new file mode 100644
index 00000000000..3da7c98adfd
--- /dev/null
+++ b/verl/trainer/config/ppo_diffusion_trainer.yaml
@@ -0,0 +1,332 @@
+# Format checks enforced on CI:
+# 1. Comments must appear above each field.
+# 2. There must be a blank line between each field.
+# 3. Inline comments (after a field on the same line) are not allowed.
+# 4. Indentation level is respected for nested fields.
+
+# specify the default per-component configs
+defaults:
+
+  - model_engine: dp
+
+  # <folder_name>@<field_name>.<field_name>: <yaml_file_name>
+  # actor_rollout_ref.actor: trainer/config/actor/dp_actor.yaml
+  - actor@actor_rollout_ref.actor: ${model_engine}_actor
+
+  # data: trainer/config/data/legacy_data.yaml
+  - data@data: legacy_data
+
+  # Reference model config.
+  # The reference model is enabled when actor.use_kl_loss and/or algorithm.use_kl_in_reward is True.
+  - ref@actor_rollout_ref.ref: ${model_engine}_ref
+
+  # Rollout model config.
+  - rollout@actor_rollout_ref.rollout: diffusion_rollout
+
+  # Model config.
+  - model@actor_rollout_ref.model: diffusion_model
+
+  # Critic model config.
+  - critic@critic: ${model_engine}_critic
+
+  # legacy reward impl config, for backward compatibility
+  - legacy_reward_impl
+
+  # Reward config.
+  - reward@reward: reward
+
+  # Rollout correction config.
+  - algorithm@algorithm.rollout_correction: rollout_correction
+
+  # load the reference default config, then apply the fields in the current yaml
+  # the current config overrides anything above
+  - _self_
+
+data:
+
+  # get ground truth based on data_source; currently supports ["ocr", "prompt"]
+  data_source: "prompt"
+
+# config for actor, rollout and reference model
+actor_rollout_ref:
+
+  # Model config
+  model:
+    model_type: "diffusion_model"
+
+  # Whether it's a hybrid engine, currently only supports hybrid engine
+  hybrid_engine: true
+
+  # Timeout for operations executed against the process group
+  nccl_timeout: 600
+
+  # Actor config
+  actor:
+    # PPO clip ratio
+    clip_ratio: 0.0001
+
+    # Maximum absolute value for advantage clipping
+    clip_ratio_high: 5.0
+
+  # Rollout model config.
+  rollout:
+
+    # for huge models, layered summon can save memory (prevent OOM) but makes rollout slower
+    layered_summon: False
+
+# config for the algorithm
+algorithm:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.trainer.config.AlgoConfig
+
+  # Discount factor for future rewards
+  gamma: 1.0
+
+  # Trade-off between bias and variance in the GAE estimator
+  lam: 1.0
+
+  # Advantage estimator type: "gae", "grpo", "reinforce_plus_plus", etc.
+  adv_estimator: gae
+
+  # Whether to normalize advantages by std (specific to GRPO)
+  norm_adv_by_std_in_grpo: True
+
+  # Whether to enable in-reward KL penalty
+  use_kl_in_reward: False
+
+  # How to estimate KL divergence: "kl", "abs", "mse", "low_var_kl", or "full"
+  kl_penalty: kl
+
+  # KL control configuration
+  kl_ctrl:
+
+    # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+    _target_: verl.trainer.config.KLControlConfig
+
+    # KL control type: "fixed" or "adaptive"
+    type: fixed
+
+    # Initial coefficient for KL penalty
+    kl_coef: 0.001
+
+    # Horizon value for adaptive controller (if enabled)
+    horizon: 10000
+
+    # Target KL divergence (used for adaptive controller)
+    target_kl: 0.1
+
+  # Whether to enable preference feedback PPO
+  use_pf_ppo: False
+
+  # Preference feedback PPO settings
+  pf_ppo:
+
+    # Method for reweighting samples: "pow", "max_min", or "max_random"
+    reweight_method: pow
+
+    # Power used for weight scaling in "pow" method
+    weight_pow: 2.0
+
+  # Whether to normalize advantages using global standard deviation
+  global_std: True
+
+# config for the trainer
+trainer:
+
+  # Whether to balance batch sizes across distributed workers
+  balance_batch: True
+
+  # Number of epochs in training
+  total_epochs: 30
+
+  # Total training steps (can be set explicitly or derived from epochs)
+  total_training_steps: null
+
+  # Project name for experiment tracking (e.g., wandb)
+  project_name: verl_examples
+
+  # Experiment name for run identification in tracking tools
+  experiment_name: gsm8k
+
+  # Logging backends to use: "console", "wandb", etc.
+  logger: ["console", "wandb"]
+
+  # Number of generations to log during validation
+  log_val_generations: 0
+
+  # Directory for logging rollout data; no dump if null
+  rollout_data_dir: null
+
+  # Directory for logging validation data; no dump if null
+  validation_data_dir: null
+
+  # Number of nodes used in the training
+  nnodes: 1
+
+  # Number of GPUs per node
+  n_gpus_per_node: 8
+
+  # Save frequency (by iteration) for model checkpoints
+  save_freq: -1
+
+  # ESI refers to the elastic server instance used during training, similar to the training plan. For example,
+  # if you purchase 10 hours of computing power, the ESI will automatically shut down after 10 hours of training.
+  # To ensure a checkpoint is saved before ESI shuts down, the system will start saving a checkpoint in advance.
+  # The advance time is calculated as: Advance Time = Longest historical step duration + Checkpoint save duration + esi_redundant_time.
+  # Here, esi_redundant_time is a user-defined value that further extends the advance time for added safety.
+  esi_redundant_time: 0
+
+  # Resume mode: "auto", "disable", or "resume_path"
+  # "auto": resume from last checkpoint if available
+  # "disable": start from scratch
+  # "resume_path": resume from a user-defined path
+  resume_mode: auto
+
+  # Path to resume training from (only used when resume_mode is "resume_path")
+  resume_from_path: null
+
+  # Whether to run validation before training begins
+  val_before_train: True
+
+  # Whether to run validation only
+  val_only: False
+
+  # Validation frequency (in training iterations)
+  test_freq: -1
+
+  # Number of iterations to warm up the critic before updating policy
+  critic_warmup: 0
+
+  # Default path to distributed filesystem for saving checkpoints
+  default_hdfs_dir: null
+
+  # Whether to delete local checkpoints after loading
+  del_local_ckpt_after_load: False
+
+  # Default local directory for saving checkpoints
+  default_local_dir: checkpoints/${trainer.project_name}/${trainer.experiment_name}
+
+  # Maximum number of actor checkpoints to keep
+  max_actor_ckpt_to_keep: null
+
+  # Maximum number of critic checkpoints to keep
+  max_critic_ckpt_to_keep: null
+
+  # Timeout (in seconds) for Ray worker to wait for registration
+  ray_wait_register_center_timeout: 300
+
+  # Device to run training on (e.g., "cuda", "cpu")
+  device: cuda
+
+  # whether to use legacy worker implementation
+  #  mode: "auto", "enable", or "disable"
+  use_legacy_worker_impl: auto
+
+# profiler configs
+global_profiler:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.utils.profiler.ProfilerConfig
+
+  # Profiling tool: choose between nsys, npu, torch, torch_memory
+  tool: null
+
+  # profile steps
+  steps: null
+
+  # Whether to combine continuous steps into one database.
+  ## If True, worker.profiler.discrete must be False, [1,2] in one, [5] in another.
+  ## If False, [1] in one, [2] in another, [5] in another.
+  profile_continuous_steps: False
+
+  # Path to save profiling contents
+  save_path: "outputs/profile"
+
+  # Specific tool configs, can use +profiler.tool_config.[tool].xxx to config
+  global_tool_config:
+
+    # nsys config
+    nsys:
+
+      # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+      _target_: verl.utils.profiler.config.NsightToolConfig
+
+      # If True, each task has its own database; if False, all tasks in one training step share one database.
+      discrete: False
+
+      # Controller NVIDIA Nsight Systems options. Must be set when profile_steps is not None.
+      ## reference https://docs.nvidia.com/nsight-systems/UserGuide/index.html
+      ## reference https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html
+      controller_nsight_options:
+
+        # Select the API(s) to be traced.
+        trace: "cuda,nvtx,cublas,ucx"
+
+        # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+        cuda-memory-usage: "true"
+
+        # CUDA graphs will be traced as a whole
+        cuda-graph-trace: "graph"
+
+      # Worker NVIDIA Nsight Systems options. Must be set when profile_steps is not None.
+      worker_nsight_options:
+
+        # Select the API(s) to be traced.
+        trace: "cuda,nvtx,cublas,ucx"
+
+        # Track the GPU memory usage by CUDA kernels. Must be string type "true" or "false".
+        cuda-memory-usage: "true"
+
+        # CUDA graphs will be traced as a whole
+        cuda-graph-trace: "graph"
+
+        # Profiling only in a range of torch.cuda.profiler.start and stop. Do not change this config.
+        capture-range: "cudaProfilerApi"
+
+        # Specify the desired behavior when a capture range ends.
+        # In verl we need the torch.cuda.profiler.start/stop pair to repeat n times.
+        # valid values are "repeat-shutdown:n" or null.
+        # For normal whole step profiling, n = len(profile_steps);
+        # but for discrete profiling, n = len(profile_steps) * Number(subtasks).
+        # Or you can just leave it null and the program will use n = len(profile_steps) * 6;
+        capture-range-end: null
+
+        # Send signal to the target application's process group. We let the program exit by itself.
+        kill: none
+
+    # enable memory visualization for debugging memory usage
+    torch_memory:
+
+      #  Maximum number of allocation entries to record
+      trace_alloc_max_entries: 100_000
+
+      # The depth of the call stack to capture for each allocation
+      stack_depth: 32
+
+      # 'alloc': records only allocation events || 'state': records memory state changes || 'all': records both.
+      context: "all"
+
+      # 'python': records Python stacks || 'cpp': records C++ stacks (available in some versions) || 'all': records both.
+      stacks: "all"
+
+      # devices, record_context etc.
+      kw_args: {}
+
+# configs for TransferQueue
+transfer_queue:
+
+  # Whether to enable transfer queue
+  enable: False
+
+# configs related to ray
+ray_kwargs:
+
+  # configs related to ray initialization
+  ray_init:
+
+    # Number of CPUs for Ray. Use a fixed number instead of null when using SLURM.
+    num_cpus: null
+
+  # Path to save Ray timeline JSON for performance profiling
+  timeline_json_file: null
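
Note on `trainer.esi_redundant_time`: the config comment above gives the formula Advance Time = Longest historical step duration + Checkpoint save duration + esi_redundant_time. The following is a minimal sketch of that arithmetic only. The function names `esi_advance_time` and `should_save_now` are hypothetical and are not taken from verl; the actual trainer logic may differ.

```python
def esi_advance_time(step_durations_s, ckpt_save_duration_s, esi_redundant_time_s=0.0):
    """Seconds before ESI expiry at which a checkpoint save should begin.

    Implements the formula from the config comment:
    longest historical step + checkpoint save duration + esi_redundant_time.
    """
    if not step_durations_s:
        raise ValueError("need at least one recorded step duration")
    return max(step_durations_s) + ckpt_save_duration_s + esi_redundant_time_s


def should_save_now(seconds_until_expiry, step_durations_s, ckpt_save_duration_s,
                    esi_redundant_time_s=0.0):
    """Return True once the remaining ESI time is within the advance window."""
    advance = esi_advance_time(step_durations_s, ckpt_save_duration_s, esi_redundant_time_s)
    return seconds_until_expiry <= advance


if __name__ == "__main__":
    # Assumed example numbers: steps took 100 s, 140 s, 120 s; saving takes 60 s;
    # the user adds a 30 s safety margin via esi_redundant_time.
    print(esi_advance_time([100.0, 140.0, 120.0], 60.0, 30.0))
    print(should_save_now(200.0, [100.0, 140.0, 120.0], 60.0, 30.0))
```

With the default `esi_redundant_time: 0`, the window is just the longest step plus the save duration; raising it trades a little training time for a larger safety margin.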
diff --git a/verl/trainer/config/rollout/diffusion_rollout.yaml b/verl/trainer/config/rollout/diffusion_rollout.yaml
new file mode 100644
index 00000000000..669b6616b73
--- /dev/null
+++ b/verl/trainer/config/rollout/diffusion_rollout.yaml
@@ -0,0 +1,371 @@
+# Target class for this configuration
+_target_: verl.workers.config.DiffusionRolloutConfig
+
+# actor_rollout_ref.rollout.name: vllm_omni/hf/vllm/sglang/trtllm. The default value will be removed in the future
+name: ???
+
+# sync: LLM, async: AsyncLLM
+mode: async
+
+# Number of nodes for standalone rollout server, must be > 0 in one-step-off/fully async training.
+nnodes: 0
+
+# Number of GPUs per node for rollout server.
+n_gpus_per_node: ${oc.select:trainer.n_gpus_per_node,8}
+
+# typically the same as data max prompt length
+# same as data.max_prompt_length if it exists
+prompt_length: ${oc.select:data.max_prompt_length,512}
+
+# for vllm rollout
+# Rollout model parameters type. Align with actor model's FSDP/Megatron type.
+dtype: bfloat16
+
+# Fraction of GPU memory used by vLLM/SGLang/TRTLLM for KV cache.
+gpu_memory_utilization: 0.5
+
+# Whether to disable CUDA graph. Defaults to False for best performance.
+enforce_eager: False
+
+# batch size of cudagraph to capture. Require enforce_eager: False to use this option
+# Since cudagraph in inference engine can not be offloaded during update policy,
+# you can use smaller batch sizes to save memory used by CUDA graphs, e.g., [1, 2, 4, 8, 16, 32]
+# supported engines: vllm
+cudagraph_capture_sizes: null
+
+# Whether to free engine KVCache after generation.
+free_cache_engine: True
+
+# TP size for rollout. Not effective for hf
+tensor_model_parallel_size: 2
+
+# DP size for rollout
+data_parallel_size: 1
+
+# EP size for rollout
+expert_parallel_size: 1
+
+# PP size for rollout.
+pipeline_model_parallel_size: 1
+
+# max number of tokens in a batch
+max_num_batched_tokens: 8192
+
+# max length for rollout
+max_model_len: null
+
+# max number of sequences in a batch
+max_num_seqs: 1024
+
+# may get higher throughput when set to True. When activated, please increase max_num_batched_tokens or decrease max_model_len.
+enable_chunked_prefill: True
+
+# Prefix caching kv-cache blocks is a popular optimization in LLM inference to avoid redundant prompt computations.
+enable_prefix_caching: True
+
+# logprobs mode for rollout logprobs
+logprobs_mode: processed_logprobs
+
+# scheduling policy for vllm rollout
+scheduling_policy: fcfs
+
+# Which loader to use for rollout model weights: dummy, hf, megatron, etc.
+# safetensors (for huge model, and set use_shm=True); dummy: randomly init model weight
+load_format: dummy
+
+# [Will be deprecated, use log_prob_micro_batch_size_per_gpu] The batch size for one forward pass in the computation of log_prob. Global batch size.
+log_prob_micro_batch_size: null
+
+# The batch size for one forward pass in the computation of log_prob. Local batch size per GPU.
+log_prob_micro_batch_size_per_gpu: null
+
+# enable dynamic batch size (sequence packing) for log_prob computation
+# same as actor_rollout_ref.actor.use_dynamic_bsz if it exists, otherwise false
+log_prob_use_dynamic_bsz: ${oc.select:actor_rollout_ref.actor.use_dynamic_bsz,false}
+
+# max token length for log_prob computation
+# same as actor_rollout_ref.actor.ppo_max_token_len_per_gpu if it exists, otherwise 16384
+log_prob_max_token_len_per_gpu: ${oc.select:actor_rollout_ref.actor.ppo_max_token_len_per_gpu,16384}
+
+# disable logging statistics
+disable_log_stats: True
+
+# for hf rollout
+# Whether to sample during training rollout. False uses greedy sampling.
+do_sample: True
+
+# number of responses (i.e. num sample times). > 1 for grpo
+n: 1
+
+# image height for diffusion model rollout
+image_height: 512
+
+# image width for diffusion model rollout
+image_width: 512
+
+# number of inference steps for diffusion model rollout
+num_inference_steps: 10
+
+# noise level for diffusion model rollout
+noise_level: 0.7
+
+# guidance scale for classifier-free guidance
+guidance_scale: 4.5
+
+# SDE type during rollout
+sde_type: "sde"
+
+# SDE window size
+sde_window_size: null
+
+# SDE window range
+sde_window_range: null
+
+# Extra inference engine arguments (vllm, sglang, trtllm); please refer to the official vllm/sglang/trtllm docs for details
+engine_kwargs:
+
+  # vllm-omni engine config
+  vllm_omni: {}
+
+# Sampling parameters used during validation.
+val_kwargs:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.workers.config.DiffusionSamplingConfig
+
+  # whether to repeat n times for validation
+  n: 1
+
+  # Whether to sample during validation rollout. False uses greedy sampling.
+  do_sample: False
+
+  # number of inference steps for diffusion model rollout
+  num_inference_steps: 40
+
+  # noise level for diffusion model rollout
+  noise_level: 0.0
+
+  # random seed for validation
+  seed: 42
+
+# Multi-turn interaction config for tools or chat.
+multi_turn:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.workers.config.MultiTurnConfig
+
+  # set to True for multi-turn tool interaction tasks; should set rollout.name to sglang as well
+  enable: False
+
+  # null for no limit (default max_length // 3)
+  max_assistant_turns: null
+
+  # null for no tool
+  tool_config_path: null
+
+  # null for no limit (default max_length // 3)
+  max_user_turns: null
+
+  # max parallel call for tools in single turn
+  max_parallel_calls: 1
+
+  # max length of tool response
+  max_tool_response_length: 256
+
+  # truncate side of tool response: left, middle, right
+  tool_response_truncate_side: middle
+
+  # null for no interaction
+  interaction_config_path: null
+
+  # - When set to True, the model's default chat template is used for multi-turn rollout, which typically matches production behavior.
+  # - When set to False, the token ids recorded for training are used instead; unlike the default chat template, these always include the model's full output,
+  #   which may contain additional content such as reasoning content. This maintains the consistency between training and rollout, but it will lead to longer prompts.
+  use_inference_chat_template: False
+
+  # Tokenization is performed turn by turn and the resulting token ids are concatenated to form the full conversation.
+  # To ensure this matches the result of tokenizing the entire conversation at once, a sanity check is run at the end of each multi-turn rollout to compare the two sets of token ids.
+  # Some models are known to produce different tokenization results when tokenizing turn by turn vs. all at once. This behavior has already been validated for them.
+  # To reduce excessive warnings, you can turn off the sanity check for these models if you are using their default chat template:
+  # Qwen/QwQ-32B, Qwen/Qwen3-xxB
+  # - disable: disable tokenization sanity check
+  # - strict: enable strict tokenization sanity check (default)
+  # - ignore_strippable: ignore strippable tokens when checking tokenization sanity
+  tokenization_sanity_check_mode: strict
+
+  # Format of the multi-turn interaction. Options: hermes, llama3_json, ...
+  format: hermes
+
+  # Number of repeat rollouts for each interaction
+  num_repeat_rollouts: null
+
+# support logging rollout prob for debugging purpose
+# "Truncated importance sampling" requires rollout log probs, set to True when turning on Truncated importance sampling
+calculate_log_probs: False
+
+# [Experimental] agent loop based rollout configs
+agent:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.workers.config.AgentLoopConfig
+
+  # Number of agent loop workers
+  num_workers: 8
+
+  # default agent loop to use if `agent_name` not set in RL dataset
+  default_agent_loop: single_turn_agent
+
+  # custom agent loop config path, which should contain list of configs to initialize AgentLoop instances.
+  # https://hydra.cc/docs/advanced/instantiate_objects/overview/
+  #
+  # - name: react_agent
+  #   _target_: recipe.langgraph_agent.react_agent_loop.ReactAgentLoop
+  #   tools: ["get_current_temperature"]
+  # - name: math_expression
+  #   _target_: recipe.langgraph_agent.example.math_expression.MathExpressionReactAgentLoop
+  #   min_terms: 2
+  #   max_terms: 6
+  agent_loop_config_path: null
+
+  # custom async server configs
+  custom_async_server:
+
+    # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+    _target_: verl.workers.config.CustomAsyncServerConfig
+
+    # Path to the custom async server implementation
+    path: null
+
+    # Class name of the custom async server class (e.g. AsyncvLLMServer)
+    name: null
+
+# Checkpoint Engine config for update weights from trainer to rollout
+checkpoint_engine:
+
+  # Target class for checkpoint engine config
+  _target_: verl.workers.config.CheckpointEngineConfig
+
+  # Backend for checkpoint engine: naive, nccl, nixl, hccl
+  backend: naive
+
+  # Specifies the tensor bucket size (in megabytes) for batch weight updates during rollout operations.
+  # This parameter controls the maximum payload size for a single weight update request.
+  # Reference: https://github.com/volcengine/verl/pull/2418
+  # Currently only supported in SGLang rollout implementations
+  # Larger values may improve throughput but increase memory overhead
+  # Detailed performance comparison:
+  # https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/issues/169#issuecomment-3070686720
+  # Default value (512MB) is optimized for typical GPU memory configurations
+  # For the best performance of `rebuild_cuda_tensor`, it is recommended to:
+  # 1. Enable `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES`
+  # 2. Manually set `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
+  # when using Tensor Parallelism (TP) >= 8.
+  update_weights_bucket_megabytes: 2048
+
+  # Additional keyword arguments to pass to the checkpoint engine constructor
+  engine_kwargs: {}
+
+# trace rollout data
+# trace:
+
+#   # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+#   _target_: verl.workers.config.TraceConfig
+
+#   # trace backend, support mlflow, weave
+#   backend: null
+
+#   # whether translate token id to text in output
+#   token2text: False
+
+#   # Maximum number of unique samples to trace per agent worker per training step.
+#   # If null, all samples are traced. If set to N, each agent loop worker will randomly
+#   # select N unique samples to trace (including all their rollouts for GRPO).
+#   # Total traces per step = max_samples_per_step_per_worker * num_workers * n_rollouts_per_sample
+#   max_samples_per_step_per_worker: null
+
+# Whether to enable rollout routing replay for MoE models
+# When enabled (True), the rollout will record the routing decisions.
+enable_rollout_routing_replay: False
+
+
+# profile the rollout model in `generate_sequence`
+profiler:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.utils.profiler.ProfilerConfig
+
+  # profiler tool, default same as profiler.tool in global config
+  # choices: npu, torch
+  tool: ${oc.select:global_profiler.tool,null}
+
+  # whether to enable profiling on rollout
+  enable: ${oc.select:actor_rollout_ref.actor.profiler.enable,false}
+
+  # Whether to profile all ranks.
+  all_ranks: ${oc.select:actor_rollout_ref.actor.profiler.all_ranks,false}
+
+  # The ranks that will be profiled. [] or [0,1,...]
+  ranks: ${oc.select:actor_rollout_ref.actor.profiler.ranks,[]}
+
+  # profile results saving path
+  save_path: ${oc.select:global_profiler.save_path,null}
+
+  # specific tool config
+  tool_config:
+
+    # npu config
+    npu:
+
+      # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+      _target_: verl.utils.profiler.config.NPUToolConfig
+
+      # Contents to profile, can be empty
+      # options: npu, cpu, memory, shapes, module, stack
+      contents: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.contents,[]}
+
+      # Collection level, optional values: level_none, level0, level1, level2.
+      level: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.level,level0}
+
+      # Whether to automatically parse the data.
+      analysis: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.analysis,false}
+
+      # If True, each task has its own database; if False, all tasks in one training step share one database.
+      discrete: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.npu.discrete,false}
+
+    # torch profiler config
+    torch:
+
+      # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+      _target_: verl.utils.profiler.config.TorchProfilerToolConfig
+
+      # Contents to profile, can be empty
+      # options: cuda, cpu, memory, shapes, stack
+      contents: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.torch.contents,[]}
+
+      # If True, each task has its own database; if False, all tasks in one training step share one database.
+      discrete: ${oc.select:actor_rollout_ref.actor.profiler.tool_config.torch.discrete,false}
+
+# prometheus configuration for vllm/sglang server mode
+prometheus:
+
+  # Required when using verl.utils.omega_conf_to_dataclass to instantiate dataclass configs
+  _target_: verl.workers.config.PrometheusConfig
+
+  # whether to enable prometheus on server mode rollout
+  enable: false
+
+  # Port number that Prometheus listens on, default is 9090
+  port: 9090
+
+  # Path to Prometheus configuration file
+  file: /tmp/ray/session_latest/metrics/prometheus/prometheus.yml
+
+  # Specify served_model_name to avoid displaying overly long model paths in Grafana
+  served_model_name: ${oc.select:actor_rollout_ref.model.path,null}
+
+# type of quantization in vllm; currently supports fp8 and torchao
+quantization: null
+
+# extra quantization information serialized in a config file, e.g. torchao_config.json
+quantization_config_file: null
+
diff --git a/verl/trainer/main_ppo.py b/verl/trainer/main_ppo.py
index 2c84374d245..ea177564057 100644
--- a/verl/trainer/main_ppo.py
+++ b/verl/trainer/main_ppo.py
@@ -25,6 +25,7 @@
 from verl.experimental.dataset.sampler import AbstractSampler
 from verl.experimental.reward_loop import migrate_legacy_reward_impl
 from verl.trainer.constants_ppo import get_ppo_ray_runtime_env
+from verl.trainer.ppo.ray_diffusion_trainer import RayFlowGRPOTrainer
 from verl.trainer.ppo.ray_trainer import RayPPOTrainer
 from verl.trainer.ppo.utils import need_critic, need_reference_policy
 from verl.utils.config import validate_config
@@ -312,9 +313,13 @@ def run(self, config):
         from verl.utils import hf_processor, hf_tokenizer
 
         trust_remote_code = config.data.get("trust_remote_code", False)
-        tokenizer = hf_tokenizer(local_path, trust_remote_code=trust_remote_code)
+        tokenizer = hf_tokenizer(config.actor_rollout_ref.model.tokenizer_path, trust_remote_code=trust_remote_code)
         # Used for multimodal LLM, could be None
-        processor = hf_processor(local_path, trust_remote_code=trust_remote_code, use_fast=True)
+        if os.path.exists(os.path.join(local_path, "processor")):
+            processor_path = os.path.join(local_path, "processor")
+        else:
+            processor_path = local_path
+        processor = hf_processor(processor_path, trust_remote_code=trust_remote_code, use_fast=True)
 
         resource_pool_manager = self.init_resource_pool_mgr(config)
 
@@ -340,7 +345,12 @@ def run(self, config):
         train_sampler = create_rl_sampler(config.data, train_dataset)
 
         # Initialize the PPO trainer.
-        trainer = RayPPOTrainer(
+        trainer_cls = (
+            RayFlowGRPOTrainer
+            if config.actor_rollout_ref.model.get("model_type", None) == "diffusion_model"
+            else RayPPOTrainer
+        )
+        trainer = trainer_cls(
             config=config,
             tokenizer=tokenizer,
             processor=processor,
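
Note on the `flow_grpo` estimator and policy loss that this PR adds to `verl/trainer/ppo/core_algos.py`: the sketch below restates the same arithmetic in plain Python so it can be checked by hand. It assumes one scalar reward per sample (the PR code expands rewards to the shape of `response_mask` before taking the standard deviation, so values can differ when the response length is greater than 1) and uses the sample standard deviation, which matches the default of `torch.std`. The names `flow_grpo_advantages` and `flow_grpo_clipped_loss` are hypothetical and exist only for this illustration.

```python
import math
import statistics
from collections import defaultdict


def flow_grpo_advantages(rewards, group_ids, epsilon=1e-4, norm_by_std=True, global_std=True):
    """Group-mean-centered advantages, optionally scaled by a batch-wide or per-group std."""
    groups = defaultdict(list)
    for reward, gid in zip(rewards, group_ids):
        groups[gid].append(reward)

    # Sample (Bessel-corrected) std over the whole batch; needs at least two rewards.
    batch_std = statistics.stdev(rewards) if global_std else None

    mean, std = {}, {}
    for gid, vals in groups.items():
        if len(vals) == 1:
            # Singleton group: no baseline is subtracted, as in the PR code.
            mean[gid] = 0.0
            std[gid] = batch_std if global_std else 1.0
        else:
            mean[gid] = statistics.fmean(vals)
            std[gid] = batch_std if global_std else statistics.stdev(vals)

    if norm_by_std:
        return [(r - mean[g]) / (std[g] + epsilon) for r, g in zip(rewards, group_ids)]
    return [r - mean[g] for r, g in zip(rewards, group_ids)]


def flow_grpo_clipped_loss(log_probs, old_log_probs, advantages, clip_ratio=1e-4, adv_clip_max=5.0):
    """Mean PPO-style clipped loss; advantages are clamped to [-adv_clip_max, adv_clip_max] first."""
    losses = []
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        adv = max(-adv_clip_max, min(adv_clip_max, adv))
        ratio = math.exp(lp - old_lp)
        clipped_ratio = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
        # Pessimistic bound: take the larger (worse) of the two losses.
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rewards = [1.0, 3.0, 2.0, 6.0]
    group_ids = ["a", "a", "b", "b"]
    advs = flow_grpo_advantages(rewards, group_ids)
    print(advs)
    print(flow_grpo_clipped_loss([0.0] * 4, [0.0] * 4, advs))
```

In this sketch `adv_clip_max` plays the role that `actor.clip_ratio_high` plays in `ppo_diffusion_trainer.yaml` (default 5.0), and `clip_ratio` corresponds to `actor.clip_ratio` (default 0.0001).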
diff --git a/verl/trainer/ppo/core_algos.py b/verl/trainer/ppo/core_algos.py
index a78bd400e10..046fd8e0728 100644
--- a/verl/trainer/ppo/core_algos.py
+++ b/verl/trainer/ppo/core_algos.py
@@ -107,6 +107,7 @@ class AdvantageEstimator(str, Enum):
     GRPO_VECTORIZED = "grpo_vectorized"
     OPTIMAL_TOKEN_BASELINE = "optimal_token_baseline"
     TIR_OPTIMAL_TOKEN_BASELINE = "tir_optimal_token_baseline"
+    FLOW_GRPO = "flow_grpo"
 
 
 ADV_ESTIMATOR_REGISTRY: dict[str, Any] = {}
@@ -1006,6 +1007,89 @@ def compute_multi_turn_optimal_token_baseline_advantage(
     return advantages, token_returns
 
 
+@register_adv_est(AdvantageEstimator.FLOW_GRPO)
+def compute_flow_grpo_outcome_advantage(
+    token_level_rewards: torch.Tensor,
+    response_mask: torch.Tensor,
+    index: np.ndarray,
+    epsilon: float = 1e-4,
+    norm_adv_by_std_in_grpo: bool = True,
+    global_std: bool = True,
+    config: Optional[DictConfig] = None,
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """
+    Compute advantage for Flow-GRPO, operating only on outcome rewards
+    (with only one scalar reward for each response).
+
+    Args:
+        token_level_rewards: `(torch.Tensor)`
+            shape is (bs, ), (bs, 1) or (bs, response_length)
+        response_mask: `(torch.Tensor)`
+            shape is (bs, response_length)
+        index: `(np.ndarray)`
+            index array for grouping
+        epsilon: `(float)`
+            small value to avoid division by zero
+        norm_adv_by_std_in_grpo: `(bool)`
+            whether to scale the GRPO advantage
+        global_std: `(bool)`
+            whether to use global std for advantage normalization
+        config: `(Optional[DictConfig])`
+            algorithm configuration object
+
+    Note:
+        If norm_adv_by_std_in_grpo is True, the advantage is scaled by the std, as in the original GRPO.
+        If False, the advantage is not scaled, as in Dr.GRPO (https://arxiv.org/abs/2503.20783).
+
+    Returns:
+        advantages: `(torch.Tensor)`
+            shape is (bs, response_length)
+        returns: `(torch.Tensor)`
+            shape is (bs, response_length)
+    """
+    scores = token_level_rewards
+    if scores.ndim == 1:
+        scores = scores.unsqueeze(-1)
+    scores = scores.expand_as(response_mask).clone()
+
+    id2score = defaultdict(list)
+    id2mean = {}
+    id2std = {}
+
+    with torch.no_grad():
+        if global_std:
+            batch_std = torch.std(scores)
+        else:
+            batch_std = None
+
+        bsz = scores.shape[0]
+        for i in range(bsz):
+            id2score[index[i]].append(scores[i])
+        for idx in id2score:
+            if len(id2score[idx]) == 1:
+                id2mean[idx] = torch.tensor(0.0)
+                if global_std:
+                    id2std[idx] = batch_std
+                else:
+                    id2std[idx] = torch.tensor(1.0)
+            elif len(id2score[idx]) > 1:
+                scores_tensor = torch.stack(id2score[idx])
+                id2mean[idx] = torch.mean(scores_tensor)
+                if global_std:
+                    id2std[idx] = batch_std
+                else:
+                    id2std[idx] = torch.std(scores_tensor)
+            else:
+                raise ValueError(f"no score in prompt index: {idx}")
+        for i in range(bsz):
+            if norm_adv_by_std_in_grpo:
+                scores[i] = (scores[i] - id2mean[index[i]]) / (id2std[index[i]] + epsilon)
+            else:
+                scores[i] = scores[i] - id2mean[index[i]]
+
+    return scores, scores
+
+
 def compute_rewards(token_level_scores, old_log_prob, ref_log_prob, kl_ratio):
     """Compute token-level rewards with KL penalty.
 
@@ -1951,6 +2035,58 @@ def compute_policy_loss_cispo(
     return pg_loss, pg_metrics
 
 
+@register_policy_loss("flow_grpo")
+def compute_policy_loss_flow_grpo(
+    old_log_prob: torch.Tensor,
+    log_prob: torch.Tensor,
+    advantages: torch.Tensor,
+    response_mask: torch.Tensor,
+    loss_agg_mode: str = "token-mean",
+    config: Optional[DictConfig | ActorConfig] = None,
+    rollout_is_weights: torch.Tensor | None = None,
+) -> tuple[torch.Tensor, dict[str, Any]]:
+    """
+    Compute the clipped policy objective and related metrics for FlowGRPO.
+
+    Adapted from
+    https://github.com/yifan123/flow_grpo/blob/main/scripts/train_sd3_fast.py#L885
+
+    Args:
+        old_log_prob (torch.Tensor):
+            Log-probabilities of actions under the old policy, shape (batch_size,).
+        log_prob (torch.Tensor):
+            Log-probabilities of actions under the current policy, shape (batch_size,).
+        advantages (torch.Tensor):
+            Advantage estimates for each action, shape (batch_size,).
+        response_mask (torch.Tensor):
+            Not used.
+        loss_agg_mode (str, optional):
+            Not used.
+        config (ActorConfig):
+            Config for the actor. Required; must be an ActorConfig instance.
+        rollout_is_weights (torch.Tensor, optional):
+            Not used.
+    """
+    assert config is not None
+    assert isinstance(config, ActorConfig)
+    advantages = torch.clamp(
+        advantages,
+        -config.clip_ratio_high,
+        config.clip_ratio_high,
+    )
+    ratio = torch.exp(log_prob - old_log_prob)
+    unclipped_loss = -advantages * ratio
+    clipped_loss = -advantages * torch.clamp(
+        ratio,
+        1.0 - config.clip_ratio,
+        1.0 + config.clip_ratio,
+    )
+    pg_loss = torch.mean(torch.maximum(unclipped_loss, clipped_loss))
+
+    pg_metrics = {"actor/ppo_kl": pg_loss.detach().item()}
+    return pg_loss, pg_metrics
+
+
 def compute_entropy_loss(logits, response_mask, loss_agg_mode: str = "token-mean"):
     """Compute categorical entropy loss (For backward compatibility)
 
@@ -2074,6 +2210,20 @@ def kl_penalty_forward(logprob: torch.FloatTensor, ref_logprob: torch.FloatTenso
     raise NotImplementedError
 
 
+def kl_penalty_image(
+    prev_sample_mean: torch.Tensor, ref_prev_sample_mean: torch.Tensor, std_dev_t: torch.Tensor
+) -> torch.Tensor:
+    """Compute KL divergence given previous sample mean and reference previous sample mean (for images or videos).
+
+    Args:
+        prev_sample_mean: (torch.Tensor) shape is (bs, s, c)
+        ref_prev_sample_mean: (torch.Tensor) shape is (bs, s, c)
+        std_dev_t: (torch.Tensor) shape is (bs, 1, 1)
+    """
+    kl_loss = ((prev_sample_mean - ref_prev_sample_mean) ** 2).mean(dim=(1, 2), keepdim=True) / (2 * std_dev_t**2)
+    return kl_loss.mean()
+
+
 def compute_pf_ppo_reweight_data(
     data,
     reweight_method: str = "pow",
diff --git a/verl/trainer/ppo/ray_diffusion_trainer.py b/verl/trainer/ppo/ray_diffusion_trainer.py
new file mode 100644
index 00000000000..c96134c0fa7
--- /dev/null
+++ b/verl/trainer/ppo/ray_diffusion_trainer.py
@@ -0,0 +1,1486 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+PPO Trainer with Ray-based single controller.
+This trainer supports model-agnostic model initialization with Hugging Face
+"""
+
+import json
+import os
+import uuid
+from collections import defaultdict
+from pprint import pprint
+from typing import Any, Optional
+
+import numpy as np
+import torch
+from omegaconf import OmegaConf, open_dict
+from PIL import Image
+from torch.utils.data import Dataset, Sampler
+from torchdata.stateful_dataloader import StatefulDataLoader
+from tqdm import tqdm
+
+from verl import DataProto
+from verl.checkpoint_engine import CheckpointEngineManager
+from verl.experimental.dataset.sampler import AbstractCurriculumSampler
+from verl.protocol import pad_dataproto_to_divisor, unpad_dataproto
+from verl.single_controller.ray import RayClassWithInitArgs, RayWorkerGroup, ResourcePoolManager
+from verl.single_controller.ray.base import create_colocated_worker_cls
+from verl.trainer.config import AlgoConfig
+from verl.trainer.ppo import core_algos
+from verl.trainer.ppo.core_algos import AdvantageEstimator
+from verl.trainer.ppo.metric_utils import (
+    compute_data_metrics,
+    compute_throughout_metrics,
+    compute_timing_metrics,
+    compute_variance_proxy_metrics,
+    process_validation_metrics,
+)
+from verl.trainer.ppo.reward import extract_reward
+from verl.trainer.ppo.utils import Role, WorkerType, need_critic, need_reference_policy, need_reward_model
+from verl.utils import tensordict_utils as tu
+from verl.utils.checkpoint.checkpoint_manager import find_latest_ckpt_path, should_save_ckpt_esi
+from verl.utils.config import omega_conf_to_dataclass
+from verl.utils.debug import marked_timer
+from verl.utils.import_utils import load_class_from_fqn
+from verl.utils.metric import reduce_metrics
+from verl.utils.py_functional import rename_dict
+from verl.utils.seqlen_balancing import calculate_workload, get_seqlen_balanced_partitions, log_seqlen_unbalance
+from verl.utils.tracking import ValidationGenerationsLogger
+from verl.workers.config import FSDPEngineConfig
+from verl.workers.utils.padding import embeds_padding_2_no_padding
+
+
+def apply_kl_penalty(data: DataProto, kl_ctrl: core_algos.AdaptiveKLController, kl_penalty="kl"):
+    raise NotImplementedError("KL penalty is not supported.")
+
+
+def compute_response_mask(data: DataProto):
+    """Compute the attention mask for latents
+
+    Args:
+        data (DataProto): The data containing batched model outputs and inputs.
+
+    Returns:
+        torch.Tensor: An all-ones int32 mask of shape (b, t), the first two dimensions of `all_latents`.
+    """
+    all_latents = data.batch["all_latents"]
+    b, t, _, _ = all_latents.shape
+    response_mask = torch.ones((b, t), dtype=torch.int32)
+    return response_mask
+
+
+def compute_advantage(
+    data: DataProto,
+    adv_estimator: AdvantageEstimator,
+    norm_adv_by_std_in_grpo: bool = True,
+    global_std: bool = True,
+    config: Optional[AlgoConfig] = None,
+) -> DataProto:
+    """Compute advantage estimates for policy optimization.
+
+    This function computes advantage estimates using the configured estimator (e.g., FlowGRPO).
+    The advantage estimates are used to guide policy optimization in RL algorithms.
+
+    Args:
+        data (DataProto): The data containing batched model outputs and inputs.
+        adv_estimator (AdvantageEstimator): The advantage estimator to use (e.g., FLOW_GRPO).
+        norm_adv_by_std_in_grpo (bool, optional): Whether to normalize advantages by standard deviation in
+            GRPO. Defaults to True.
+        global_std (bool, optional): Normalize by the batch-wide std rather than the per-group std. Defaults to True.
+        config (AlgoConfig, optional): Configuration for algorithm settings. Defaults to None.
+
+    Returns:
+        DataProto: The updated data with computed advantages and returns.
+    """
+    # Backward-compatible with trainers that do not compute the response mask in fit
+    if "response_mask" not in data.batch.keys():
+        data.batch["response_mask"] = compute_response_mask(data)
+    # prepare response group
+    if adv_estimator == AdvantageEstimator.FLOW_GRPO:
+        # Initialize the mask for GRPO calculation
+        grpo_calculation_mask = data.batch["response_mask"]
+
+        # Call compute_flow_grpo_outcome_advantage with parameters matching its definition
+        advantages, returns = core_algos.compute_flow_grpo_outcome_advantage(
+            token_level_rewards=data.batch["token_level_rewards"],
+            response_mask=grpo_calculation_mask,
+            index=data.non_tensor_batch["uid"],
+            norm_adv_by_std_in_grpo=norm_adv_by_std_in_grpo,
+            global_std=global_std,
+        )
+        data.batch["advantages"] = advantages
+        data.batch["returns"] = returns
+    else:
+        # handle all advantage estimator types other than FLOW_GRPO
+        adv_estimator_fn = core_algos.get_adv_estimator_fn(adv_estimator)
+        adv_kwargs = {
+            "token_level_rewards": data.batch["token_level_rewards"],
+            "response_mask": data.batch["response_mask"],
+            "config": config,
+        }
+        if "uid" in data.non_tensor_batch:  # optional
+            adv_kwargs["index"] = data.non_tensor_batch["uid"]
+        if "reward_baselines" in data.batch:  # optional
+            adv_kwargs["reward_baselines"] = data.batch["reward_baselines"]
+
+        # calculate advantage estimator
+        advantages, returns = adv_estimator_fn(**adv_kwargs)
+        data.batch["advantages"] = advantages
+        data.batch["returns"] = returns
+    return data
+
+
+class RayFlowGRPOTrainer:
+    """Distributed Flow-GRPO trainer using Ray for scalable reinforcement learning.
+
+    This trainer orchestrates distributed PPO training across multiple nodes and GPUs,
+    managing actor rollouts, critic training, and reward computation with Ray backend.
+    Supports various model architectures including FSDP, Megatron, vLLM, and SGLang integration.
+    """
+
+    # TODO: support an individual ray_worker_group_cls for each role,
+    # i.e., support a different backend for each role
+    def __init__(
+        self,
+        config,
+        tokenizer,
+        role_worker_mapping: dict[Role, WorkerType],
+        resource_pool_manager: ResourcePoolManager,
+        ray_worker_group_cls: type[RayWorkerGroup] = RayWorkerGroup,
+        processor=None,
+        train_dataset: Optional[Dataset] = None,
+        val_dataset: Optional[Dataset] = None,
+        collate_fn=None,
+        train_sampler: Optional[Sampler] = None,
+        device_name=None,
+    ):
+        """
+        Initialize the distributed Flow-GRPO trainer with a Ray backend.
+        Note that this trainer runs on the driver process on a single CPU/GPU node.
+
+        Args:
+            config: Configuration object containing training parameters.
+            tokenizer: Tokenizer used for encoding and decoding text.
+            role_worker_mapping (dict[Role, WorkerType]): Mapping from roles to worker classes.
+            resource_pool_manager (ResourcePoolManager): Manager for Ray resource pools.
+            ray_worker_group_cls (RayWorkerGroup, optional): Class for Ray worker groups. Defaults to RayWorkerGroup.
+            processor: Optional data processor, used for multimodal data
+            train_dataset (Optional[Dataset], optional): Training dataset. Defaults to None.
+            val_dataset (Optional[Dataset], optional): Validation dataset. Defaults to None.
+            collate_fn: Function to collate data samples into batches.
+            train_sampler (Optional[Sampler], optional): Sampler for the training dataset. Defaults to None.
+            device_name (str, optional): Device name for training (e.g., "cuda", "cpu"). Defaults to None.
+        """
+
+        # Store the tokenizer for text processing
+        self.tokenizer = tokenizer
+        self.processor = processor
+        self.config = config
+
+        self.hybrid_engine = config.actor_rollout_ref.hybrid_engine
+        assert self.hybrid_engine, "Only the hybrid engine is currently supported"
+
+        if self.hybrid_engine:
+            assert Role.ActorRollout in role_worker_mapping or Role.ActorRolloutRef in role_worker_mapping, (
+                f"{role_worker_mapping.keys()=}"
+            )
+
+        self.role_worker_mapping = role_worker_mapping
+        self.resource_pool_manager = resource_pool_manager
+        self.use_reference_policy = need_reference_policy(self.config)
+
+        self.use_rm = need_reward_model(self.config)
+
+        self.use_critic = need_critic(self.config)
+        self.ray_worker_group_cls = ray_worker_group_cls
+        self.device_name = device_name if device_name else self.config.trainer.device
+        self.validation_generations_logger = ValidationGenerationsLogger(
+            project_name=self.config.trainer.project_name,
+            experiment_name=self.config.trainer.experiment_name,
+        )
+
+        # if ref_in_actor is True, the reference policy will be actor without lora applied
+        lora_rank = config.actor_rollout_ref.model.get("lora", {}).get("rank", 0)
+        if lora_rank <= 0:
+            lora_rank = config.actor_rollout_ref.model.get("lora_rank", 0)
+        self.ref_in_actor = lora_rank > 0 or config.actor_rollout_ref.model.get("lora_adapter_path") is not None
+
+        # define in-reward KL control
+        # KL loss control is currently not supported
+        if self.config.algorithm.use_kl_in_reward:
+            self.kl_ctrl_in_reward = core_algos.get_kl_controller(self.config.algorithm.kl_ctrl)
+
+        self.use_prefix_grouper = self.config.actor_rollout_ref.actor.get("use_prefix_grouper", False)
+        self.use_legacy_worker_impl = config.trainer.get("use_legacy_worker_impl", "auto")
+
+        self._create_dataloader(train_dataset, val_dataset, collate_fn, train_sampler)
+
+        self.checkpoint_manager = None
+
+    def _create_dataloader(self, train_dataset, val_dataset, collate_fn, train_sampler: Optional[Sampler]):
+        """
+        Creates the train and validation dataloaders.
+        """
+        # TODO: we have to make sure the batch size is divisible by the dp size
+        from verl.trainer.main_ppo import create_rl_dataset, create_rl_sampler
+
+        if train_dataset is None:
+            train_dataset = create_rl_dataset(
+                self.config.data.train_files,
+                self.config.data,
+                self.tokenizer,
+                self.processor,
+                max_samples=self.config.data.get("train_max_samples", -1),
+            )
+        if val_dataset is None:
+            val_dataset = create_rl_dataset(
+                self.config.data.val_files,
+                self.config.data,
+                self.tokenizer,
+                self.processor,
+                max_samples=self.config.data.get("val_max_samples", -1),
+            )
+        self.train_dataset, self.val_dataset = train_dataset, val_dataset
+
+        if train_sampler is None:
+            train_sampler = create_rl_sampler(self.config.data, self.train_dataset)
+        if collate_fn is None:
+            from verl.utils.dataset.rl_dataset import collate_fn as default_collate_fn
+
+            collate_fn = default_collate_fn
+
+        num_workers = self.config.data["dataloader_num_workers"]
+
+        self.train_dataloader = StatefulDataLoader(
+            dataset=self.train_dataset,
+            batch_size=self.config.data.get("gen_batch_size", self.config.data.train_batch_size),
+            num_workers=num_workers,
+            drop_last=True,
+            collate_fn=collate_fn,
+            sampler=train_sampler,
+        )
+
+        val_batch_size = self.config.data.val_batch_size  # Prefer config value if set
+        if val_batch_size is None:
+            val_batch_size = len(self.val_dataset)
+
+        self.val_dataloader = StatefulDataLoader(
+            dataset=self.val_dataset,
+            batch_size=val_batch_size,
+            num_workers=num_workers,
+            shuffle=self.config.data.get("validation_shuffle", True),
+            drop_last=False,
+            collate_fn=collate_fn,
+        )
+
+        assert len(self.train_dataloader) >= 1, "Train dataloader is empty!"
+        assert len(self.val_dataloader) >= 1, "Validation dataloader is empty!"
+
+        print(
+            f"Size of train dataloader: {len(self.train_dataloader)}, Size of val dataloader: "
+            f"{len(self.val_dataloader)}"
+        )
+
+        total_training_steps = len(self.train_dataloader) * self.config.trainer.total_epochs
+
+        if self.config.trainer.total_training_steps is not None:
+            total_training_steps = self.config.trainer.total_training_steps
+
+        self.total_training_steps = total_training_steps
+        print(f"Total training steps: {self.total_training_steps}")
+
+        try:
+            OmegaConf.set_struct(self.config, True)
+            with open_dict(self.config):
+                if OmegaConf.select(self.config, "actor_rollout_ref.actor.optim"):
+                    self.config.actor_rollout_ref.actor.optim.total_training_steps = total_training_steps
+                if OmegaConf.select(self.config, "critic.optim"):
+                    self.config.critic.optim.total_training_steps = total_training_steps
+        except Exception as e:
+            print(f"Warning: Could not set total_training_steps in config. Structure missing? Error: {e}")
+
+    def _dump_generations(self, inputs, outputs, gts, scores, reward_extra_infos_dict, dump_path):
+        """Dump rollout/validation samples as JSONL."""
+        os.makedirs(dump_path, exist_ok=True)
+
+        visual_folder = os.path.join(dump_path, f"{self.global_steps}")
+        os.makedirs(visual_folder, exist_ok=True)
+
+        output_paths = []
+        images_pil = outputs.cpu().float().permute(0, 2, 3, 1).numpy()
+        images_pil = (images_pil * 255).round().clip(0, 255).astype("uint8")
+        for i, image in enumerate(images_pil):
+            image_path = os.path.join(visual_folder, f"{i}.jpg")
+            Image.fromarray(image).save(image_path)
+            output_paths.append(image_path)
+
+        filename = os.path.join(dump_path, f"{self.global_steps}.jsonl")
+
+        n = len(inputs)
+        base_data = {
+            "input": inputs,
+            "output": output_paths,
+            "gts": gts,
+            "score": scores,
+            "step": [self.global_steps] * n,
+        }
+
+        for k, v in reward_extra_infos_dict.items():
+            if len(v) == n:
+                base_data[k] = v
+
+        lines = []
+        for i in range(n):
+            entry = {k: v[i] for k, v in base_data.items()}
+            lines.append(json.dumps(entry, ensure_ascii=False))
+
+        with open(filename, "w") as f:
+            f.write("\n".join(lines) + "\n")
+
+        print(f"Dumped generations to {filename}")
+
+    def _log_rollout_data(
+        self, batch: DataProto, reward_extra_infos_dict: dict, timing_raw: dict, rollout_data_dir: str
+    ):
+        """Log rollout data to disk.
+        Args:
+            batch (DataProto): The batch containing rollout data
+            reward_extra_infos_dict (dict): Additional reward information to log
+            timing_raw (dict): Timing information for profiling
+            rollout_data_dir (str): Directory path to save the rollout data
+        """
+        with marked_timer("dump_rollout_generations", timing_raw, color="green"):
+            inputs = self.tokenizer.batch_decode(batch.batch["prompts"], skip_special_tokens=True)
+            outputs = batch.batch["responses"]
+            scores = batch.batch["token_level_scores"].sum(-1).cpu().tolist()
+            sample_gts = [item.non_tensor_batch.get("reward_model", {}).get("ground_truth", None) for item in batch]
+
+            reward_extra_infos_to_dump = reward_extra_infos_dict.copy()
+            if "request_id" in batch.non_tensor_batch:
+                reward_extra_infos_to_dump.setdefault(
+                    "request_id",
+                    batch.non_tensor_batch["request_id"].tolist(),
+                )
+
+            self._dump_generations(
+                inputs=inputs,
+                outputs=outputs,
+                gts=sample_gts,
+                scores=scores,
+                reward_extra_infos_dict=reward_extra_infos_to_dump,
+                dump_path=rollout_data_dir,
+            )
+
+    def _maybe_log_val_generations(self, inputs, outputs, scores):
+        """Log a table of validation samples to the configured logger (wandb or swanlab)"""
+
+        generations_to_log = self.config.trainer.log_val_generations
+
+        if generations_to_log == 0:
+            return
+
+        import numpy as np
+
+        # Create tuples of (input, output, score) and sort by input text
+        if "wandb" in self.config.trainer.logger:
+            import wandb
+
+            outputs = [wandb.Image(image.float(), file_type="jpg") for image in outputs]
+        samples = list(zip(inputs, outputs, scores, strict=True))
+        samples.sort(key=lambda x: x[0])  # Sort by input text
+
+        # Use fixed random seed for deterministic shuffling
+        rng = np.random.RandomState(42)
+        rng.shuffle(samples)
+
+        # Take first N samples after shuffling
+        samples = samples[:generations_to_log]
+
+        # Log to each configured logger
+        self.validation_generations_logger.log(self.config.trainer.logger, samples, self.global_steps)
+
+    def _get_gen_batch(self, batch: DataProto) -> DataProto:
+        reward_keys = set({"data_source", "reward_model", "extra_info", "uid"}) & batch.non_tensor_batch.keys()
+
+        # pop those keys for generation
+        batch_keys_to_pop = []
+        non_tensor_batch_keys_to_pop = set(batch.non_tensor_batch.keys()) - reward_keys
+        gen_batch = batch.pop(
+            batch_keys=batch_keys_to_pop,
+            non_tensor_batch_keys=list(non_tensor_batch_keys_to_pop),
+        )
+
+        # For agent loop, we need reward model keys to compute score.
+        gen_batch.non_tensor_batch.update(batch.non_tensor_batch)
+
+        return gen_batch
+
+    def _compute_reward_colocate(self, batch: DataProto) -> tuple[torch.Tensor, dict[str, Any]] | torch.Tensor:
+        """
+        Compute rewards using the colocated reward model.
+        """
+        assert self.reward_loop_manager is not None, "RewardLoopManager is None"
+        batch_reward = self.reward_loop_manager.compute_rm_score(batch)
+        return batch_reward
+
+    def _validate(self, merged: bool = False):
+        data_source_lst = []
+        reward_extra_infos_dict: dict[str, list] = defaultdict(list)
+
+        # Lists to collect samples for the table
+        sample_inputs = []
+        sample_outputs = []
+        sample_gts = []
+        sample_scores = []
+        sample_turns = []
+        sample_uids = []
+
+        for test_data in self.val_dataloader:
+            test_batch = DataProto.from_single_dict(test_data)
+
+            if "uid" not in test_batch.non_tensor_batch:
+                test_batch.non_tensor_batch["uid"] = np.array(
+                    [str(uuid.uuid4()) for _ in range(len(test_batch.batch))], dtype=object
+                )
+
+            # repeat test batch
+            test_batch = test_batch.repeat(
+                repeat_times=self.config.actor_rollout_ref.rollout.val_kwargs.n, interleave=True
+            )
+
+            ground_truths = [
+                item.non_tensor_batch.get("reward_model", {}).get("ground_truth", None) for item in test_batch
+            ]
+            sample_gts.extend(ground_truths)
+
+            test_gen_batch = self._get_gen_batch(test_batch)
+            test_gen_batch.meta_info = {
+                "recompute_log_prob": False,
+                "validate": True,
+                "global_steps": self.global_steps,
+            }
+            print(f"test_gen_batch meta info: {test_gen_batch.meta_info}")
+
+            # pad to be divisible by dp_size
+            size_divisor = self.config.actor_rollout_ref.rollout.agent.num_workers
+            test_gen_batch_padded, pad_size = pad_dataproto_to_divisor(test_gen_batch, size_divisor)
+            test_output_gen_batch_padded = self.async_rollout_manager.generate_sequences(test_gen_batch_padded)
+
+            if self.use_rm and "rm_scores" not in test_output_gen_batch_padded.batch.keys():
+                # for colocated reward models, we need to put the rollout model to sleep
+                # to free GPU memory for the reward model
+                self.checkpoint_manager.sleep_replicas()
+                batch_reward = self._compute_reward_colocate(test_output_gen_batch_padded)
+                test_output_gen_batch_padded = test_output_gen_batch_padded.union(batch_reward)
+                # wake up rollout model
+                # replace with wake_up method once supported
+                self.checkpoint_manager.update_weights(self.global_steps)
+
+            # unpad
+            test_output_gen_batch = unpad_dataproto(test_output_gen_batch_padded, pad_size=pad_size)
+
+            print("validation generation end")
+
+            # Store generated outputs
+            output_images = test_output_gen_batch.batch["responses"]
+            sample_outputs.append(output_images)
+
+            test_batch = test_batch.union(test_output_gen_batch)
+            test_batch.meta_info["validate"] = True
+
+            # Store original inputs
+            input_ids = test_batch.batch["input_ids"]
+            # TODO: Can we keep special tokens except for padding tokens?
+            input_texts = [self.tokenizer.decode(ids, skip_special_tokens=True) for ids in input_ids]
+            sample_inputs.extend(input_texts)
+            sample_uids.extend(test_batch.non_tensor_batch["uid"])
+
+            # evaluate using reward_function
+            reward_tensor, reward_extra_info = extract_reward(test_batch)
+
+            scores = reward_tensor.sum(-1).cpu().tolist()
+            sample_scores.extend(scores)
+
+            reward_extra_infos_dict["reward"].extend(scores)
+            for key, values in reward_extra_info.items():
+                if key not in reward_extra_infos_dict:
+                    reward_extra_infos_dict[key] = []
+                if isinstance(values, np.ndarray):
+                    reward_extra_infos_dict[key].extend(values.tolist())
+                else:
+                    reward_extra_infos_dict[key].extend(values if isinstance(values, list) else [values])
+
+            # collect num_turns of each prompt
+            if "__num_turns__" in test_batch.non_tensor_batch:
+                sample_turns.append(test_batch.non_tensor_batch["__num_turns__"])
+
+            data_source_lst.append(test_batch.non_tensor_batch.get("data_source", ["unknown"] * reward_tensor.shape[0]))
+
+        sample_outputs = torch.cat(sample_outputs, dim=0)
+        self._maybe_log_val_generations(inputs=sample_inputs, outputs=sample_outputs, scores=sample_scores)
+
+        # dump generations
+        val_data_dir = self.config.trainer.get("validation_data_dir", None)
+        if val_data_dir:
+            self._dump_generations(
+                inputs=sample_inputs,
+                outputs=sample_outputs,
+                gts=sample_gts,
+                scores=sample_scores,
+                reward_extra_infos_dict=reward_extra_infos_dict,
+                dump_path=val_data_dir,
+            )
+
+        for key_info, lst in reward_extra_infos_dict.items():
+            assert len(lst) == 0 or len(lst) == len(sample_scores), f"{key_info}: {len(lst)=}, {len(sample_scores)=}"
+
+        if merged:
+            print("_merge_validation_results validate result will be merged")
+            return {
+                "data_sources": data_source_lst,
+                "sample_uids": sample_uids,
+                "sample_turns": sample_turns,
+                "reward_extra_infos_dict": reward_extra_infos_dict,
+            }
+        data_sources = np.concatenate(data_source_lst, axis=0)
+        return self._val_metrics_update(data_sources, sample_uids, reward_extra_infos_dict, sample_turns)
+
+    def _val_metrics_update(self, data_sources, sample_uids, reward_extra_infos_dict, sample_turns):
+        data_src2var2metric2val = process_validation_metrics(data_sources, sample_uids, reward_extra_infos_dict)
+        metric_dict = {}
+        for data_source, var2metric2val in data_src2var2metric2val.items():
+            core_var = "acc" if "acc" in var2metric2val else "reward"
+            for var_name, metric2val in var2metric2val.items():
+                n_max = max([int(name.split("@")[-1].split("/")[0]) for name in metric2val.keys()])
+                for metric_name, metric_val in metric2val.items():
+                    if (
+                        (var_name == core_var)
+                        and any(metric_name.startswith(pfx) for pfx in ["mean", "maj", "best"])
+                        and (f"@{n_max}" in metric_name)
+                    ):
+                        metric_sec = "val-core"
+                    else:
+                        metric_sec = "val-aux"
+                    pfx = f"{metric_sec}/{data_source}/{var_name}/{metric_name}"
+                    metric_dict[pfx] = metric_val
+
+        if len(sample_turns) > 0:
+            sample_turns = np.concatenate(sample_turns)
+            metric_dict["val-aux/num_turns/min"] = sample_turns.min()
+            metric_dict["val-aux/num_turns/max"] = sample_turns.max()
+            metric_dict["val-aux/num_turns/mean"] = sample_turns.mean()
+
+        return metric_dict
+
+    def _merge_validation_results(self, result_a, result_b):
+        if result_a is None and result_b is None:
+            return {}
+        if result_a is None:
+            result_a = {"data_sources": [], "sample_uids": [], "sample_turns": [], "reward_extra_infos_dict": {}}
+        if result_b is None:
+            result_b = {"data_sources": [], "sample_uids": [], "sample_turns": [], "reward_extra_infos_dict": {}}
+
+        if not result_a.get("data_sources") and not result_b.get("data_sources"):
+            return {}
+
+        data_sources = np.concatenate(result_a["data_sources"] + result_b["data_sources"], axis=0)
+        sample_uids = result_a["sample_uids"] + result_b["sample_uids"]
+        sample_turns = result_a["sample_turns"] + result_b["sample_turns"]
+
+        reward_extra_infos_dict = {}
+        all_keys = set(result_a["reward_extra_infos_dict"].keys()) | set(result_b["reward_extra_infos_dict"].keys())
+        for key in all_keys:
+            list_a = result_a["reward_extra_infos_dict"].get(key, [])
+            list_b = result_b["reward_extra_infos_dict"].get(key, [])
+            reward_extra_infos_dict[key] = list_a + list_b
+
+        return self._val_metrics_update(data_sources, sample_uids, reward_extra_infos_dict, sample_turns)
+
+    def init_workers(self):
+        """Initialize distributed training workers using Ray backend.
+
+        Creates:
+        1. Ray resource pools from configuration
+        2. Worker groups for each role (actor, critic, etc.)
+        """
+        self.resource_pool_manager.create_resource_pool()
+
+        self.resource_pool_to_cls = {pool: {} for pool in self.resource_pool_manager.resource_pool_dict.values()}
+
+        # create actor and rollout
+        actor_role = Role.ActorRolloutRef if Role.ActorRolloutRef in self.role_worker_mapping else Role.ActorRollout
+        if self.hybrid_engine:
+            actor_rollout_resource_pool = self.resource_pool_manager.get_resource_pool(actor_role)
+            actor_rollout_cls = RayClassWithInitArgs(
+                cls=self.role_worker_mapping[actor_role],
+                config=self.config.actor_rollout_ref,
+                role=str(actor_role),
+            )
+            self.resource_pool_to_cls[actor_rollout_resource_pool][str(actor_role)] = actor_rollout_cls
+        else:
+            raise NotImplementedError
+
+        # create critic
+        if self.use_critic:
+            resource_pool = self.resource_pool_manager.get_resource_pool(Role.Critic)
+
+            from verl.workers.config import CriticConfig
+
+            critic_cfg: CriticConfig = omega_conf_to_dataclass(self.config.critic)
+
+            if self.use_legacy_worker_impl == "disable":
+                # convert critic_cfg into TrainingWorkerConfig
+                from verl.workers.engine_workers import TrainingWorkerConfig
+
+                orig_critic_cfg = critic_cfg
+                if orig_critic_cfg.strategy == "fsdp":
+                    engine_config: FSDPEngineConfig = orig_critic_cfg.model.fsdp_config
+                    engine_config.infer_max_token_len_per_gpu = critic_cfg.ppo_infer_max_token_len_per_gpu
+                    engine_config.max_token_len_per_gpu = critic_cfg.ppo_max_token_len_per_gpu
+                else:
+                    raise NotImplementedError(f"Unknown strategy {orig_critic_cfg.strategy=}")
+
+                critic_cfg = TrainingWorkerConfig(
+                    model_type="value_model",
+                    model_config=orig_critic_cfg.model_config,
+                    engine_config=engine_config,
+                    optimizer_config=orig_critic_cfg.optim,
+                    checkpoint_config=orig_critic_cfg.checkpoint,
+                )
+
+            critic_cls = RayClassWithInitArgs(cls=self.role_worker_mapping[Role.Critic], config=critic_cfg)
+            self.resource_pool_to_cls[resource_pool][str(Role.Critic)] = critic_cls
+
+        # create reference policy if needed
+        if self.use_reference_policy and Role.RefPolicy in self.role_worker_mapping:
+            resource_pool = self.resource_pool_manager.get_resource_pool(Role.RefPolicy)
+            ref_policy_cls = RayClassWithInitArgs(
+                self.role_worker_mapping[Role.RefPolicy],
+                config=self.config.actor_rollout_ref,
+                role=str(Role.RefPolicy),
+            )
+            self.resource_pool_to_cls[resource_pool][str(Role.RefPolicy)] = ref_policy_cls
+
+        # initialize WorkerGroup
+        # NOTE: if you want a separate resource pool for each role (e.g., to support different parallel sizes),
+        # do not use `create_colocated_worker_cls`.
+        # Instead, pass a different resource pool directly to each worker group.
+        # See https://github.com/volcengine/verl/blob/master/examples/ray/tutorial.ipynb for more information.
+        all_wg = {}
+        wg_kwargs = {}  # Setting up kwargs for RayWorkerGroup
+        if OmegaConf.select(self.config.trainer, "ray_wait_register_center_timeout") is not None:
+            wg_kwargs["ray_wait_register_center_timeout"] = self.config.trainer.ray_wait_register_center_timeout
+        if OmegaConf.select(self.config.global_profiler, "steps") is not None:
+            wg_kwargs["profile_steps"] = OmegaConf.select(self.config.global_profiler, "steps")
+            # Only require nsight worker options when tool is nsys
+            if OmegaConf.select(self.config.global_profiler, "tool") == "nsys":
+                assert (
+                    OmegaConf.select(self.config.global_profiler.global_tool_config.nsys, "worker_nsight_options")
+                    is not None
+                ), "worker_nsight_options must be set when using nsys with profile_steps"
+                wg_kwargs["worker_nsight_options"] = OmegaConf.to_container(
+                    OmegaConf.select(self.config.global_profiler.global_tool_config.nsys, "worker_nsight_options")
+                )
+        wg_kwargs["device_name"] = self.device_name
+
+        for resource_pool, class_dict in self.resource_pool_to_cls.items():
+            if not class_dict:
+                continue
+            worker_dict_cls = create_colocated_worker_cls(class_dict=class_dict)
+            wg_dict = self.ray_worker_group_cls(
+                resource_pool=resource_pool,
+                ray_cls_with_init=worker_dict_cls,
+                **wg_kwargs,
+            )
+            spawn_wg = wg_dict.spawn(prefix_set=class_dict.keys())
+            all_wg.update(spawn_wg)
+
+        if self.use_critic:
+            self.critic_wg = all_wg[str(Role.Critic)]
+            if self.use_legacy_worker_impl == "disable":
+                self.critic_wg.reset()
+                # assign critic loss
+                from functools import partial
+
+                from verl.workers.utils.losses import value_loss
+
+                value_loss_ = partial(value_loss, config=orig_critic_cfg)
+                self.critic_wg.set_loss_fn(value_loss_)
+            else:
+                self.critic_wg.init_model()
+
+        if self.use_reference_policy and not self.ref_in_actor:
+            if str(Role.RefPolicy) in all_wg:
+                self.ref_policy_wg = all_wg[str(Role.RefPolicy)]
+                self.ref_policy_wg.init_model()
+            else:
+                # Model engine: ActorRolloutRefWorker
+                assert str(Role.ActorRolloutRef) in all_wg, f"{all_wg.keys()=}"
+                self.ref_policy_wg = all_wg[str(Role.ActorRolloutRef)]
+
+        # we should create rollout at the end so that vllm can have a better estimation of kv cache memory
+        self.actor_rollout_wg = all_wg[str(actor_role)]
+        self.actor_rollout_wg.init_model()
+
+        if self.ref_in_actor:
+            self.ref_policy_wg = self.actor_rollout_wg
+
+        # create reward loop manager
+        from verl.experimental.reward_loop import RewardLoopManager
+
+        # initialize reward loop manager
+        # reward model (colocate or standalone): get resource_pool
+        # no reward model: resource_pool = None
+        resource_pool = self.resource_pool_manager.get_resource_pool(Role.RewardModel) if self.use_rm else None
+        self.reward_loop_manager = RewardLoopManager(
+            config=self.config,
+            rm_resource_pool=resource_pool,
+        )
+
+        # create async rollout manager and request scheduler
+        # Note: mode is always "async" since sync mode is deprecated
+        self.async_rollout_mode = True
+
+        # Support custom AgentLoopManager via config
+        manager_class_fqn = self.config.actor_rollout_ref.rollout.get("agent", {}).get("agent_loop_manager_class")
+        if manager_class_fqn:
+            AgentLoopManager = load_class_from_fqn(manager_class_fqn, "AgentLoopManager")
+        else:
+            from verl.experimental.agent_loop import AgentLoopManager
+
+        # infrastructure overview: https://verl.readthedocs.io/en/latest/advance/reward_loop.html#architecture-design
+        # agent_reward_loop: streaming reward computation with actor rollout
+        # enabled when either condition holds: (1) no reward model, or (2) reward model with extra resource pool
+        enable_agent_reward_loop = not self.use_rm or self.config.reward.reward_model.enable_resource_pool
+
+        # if enable_agent_reward_loop, we directly pass reward_loop_workers to agent loop manager
+        # to stream reward computation with actor rollout
+        reward_loop_worker_handles = self.reward_loop_manager.reward_loop_workers if enable_agent_reward_loop else None
+        self.async_rollout_manager = AgentLoopManager.create(
+            config=self.config,
+            worker_group=self.actor_rollout_wg,
+            rollout_resource_pool=actor_rollout_resource_pool,
+            reward_loop_worker_handles=reward_loop_worker_handles,
+        )
+
+        checkpoint_engine_config = omega_conf_to_dataclass(self.config.actor_rollout_ref.rollout.checkpoint_engine)
+        self.checkpoint_manager = CheckpointEngineManager(
+            config=checkpoint_engine_config,
+            trainer=self.actor_rollout_wg,
+            replicas=self.async_rollout_manager.rollout_replicas,
+        )
+
+        # sleep all replicas to load checkpoint
+        self.checkpoint_manager.sleep_replicas()
+
+    def _save_checkpoint(self):
+        from verl.utils.fs import local_mkdir_safe
+
+        # path: given_path + `/global_step_{global_steps}` + `/actor`
+        local_global_step_folder = os.path.join(
+            self.config.trainer.default_local_dir, f"global_step_{self.global_steps}"
+        )
+
+        print(f"local_global_step_folder: {local_global_step_folder}")
+        actor_local_path = os.path.join(local_global_step_folder, "actor")
+
+        actor_remote_path = (
+            None
+            if self.config.trainer.default_hdfs_dir is None
+            else os.path.join(self.config.trainer.default_hdfs_dir, f"global_step_{self.global_steps}", "actor")
+        )
+
+        remove_previous_ckpt_in_save = self.config.trainer.get("remove_previous_ckpt_in_save", False)
+        if remove_previous_ckpt_in_save:
+            print(
+                "Warning: remove_previous_ckpt_in_save is deprecated,"
+                + " set max_actor_ckpt_to_keep=1 and max_critic_ckpt_to_keep=1 instead"
+            )
+        max_actor_ckpt_to_keep = (
+            self.config.trainer.get("max_actor_ckpt_to_keep", None) if not remove_previous_ckpt_in_save else 1
+        )
+        max_critic_ckpt_to_keep = (
+            self.config.trainer.get("max_critic_ckpt_to_keep", None) if not remove_previous_ckpt_in_save else 1
+        )
+
+        self.actor_rollout_wg.save_checkpoint(
+            actor_local_path, actor_remote_path, self.global_steps, max_ckpt_to_keep=max_actor_ckpt_to_keep
+        )
+
+        if self.use_critic:
+            critic_local_path = os.path.join(local_global_step_folder, str(Role.Critic))
+            critic_remote_path = (
+                None
+                if self.config.trainer.default_hdfs_dir is None
+                else os.path.join(
+                    self.config.trainer.default_hdfs_dir, f"global_step_{self.global_steps}", str(Role.Critic)
+                )
+            )
+            self.critic_wg.save_checkpoint(
+                critic_local_path, critic_remote_path, self.global_steps, max_ckpt_to_keep=max_critic_ckpt_to_keep
+            )
+
+        # save dataloader
+        local_mkdir_safe(local_global_step_folder)
+        dataloader_local_path = os.path.join(local_global_step_folder, "data.pt")
+        dataloader_state_dict = self.train_dataloader.state_dict()
+        torch.save(dataloader_state_dict, dataloader_local_path)
+
+        # latest checkpointed iteration tracker (for atomic usage)
+        if (
+            hasattr(self.config.actor_rollout_ref.actor.checkpoint, "async_save")
+            and self.config.actor_rollout_ref.actor.checkpoint.async_save
+        ) or (
+            "async_save" in self.config.actor_rollout_ref.actor.checkpoint
+            and self.config.actor_rollout_ref.actor.checkpoint["async_save"]
+        ):
+            print("skip write latest_checkpointed_iteration.txt when async_save is True")
+            return
+        local_latest_checkpointed_iteration = os.path.join(
+            self.config.trainer.default_local_dir, "latest_checkpointed_iteration.txt"
+        )
+        with open(local_latest_checkpointed_iteration, "w") as f:
+            f.write(str(self.global_steps))
+
+    def _load_checkpoint(self):
+        if self.config.trainer.resume_mode == "disable":
+            return 0
+
+        # load from hdfs
+        if self.config.trainer.default_hdfs_dir is not None:
+            raise NotImplementedError("load from hdfs is not implemented yet")
+        else:
+            checkpoint_folder = self.config.trainer.default_local_dir  # TODO: check path
+            if not os.path.isabs(checkpoint_folder):
+                working_dir = os.getcwd()
+                checkpoint_folder = os.path.join(working_dir, checkpoint_folder)
+            global_step_folder = find_latest_ckpt_path(checkpoint_folder)  # None if no latest
+
+        # find global_step_folder
+        if self.config.trainer.resume_mode == "auto":
+            if global_step_folder is None:
+                print("Training from scratch")
+                return 0
+        else:
+            if self.config.trainer.resume_mode == "resume_path":
+                assert isinstance(self.config.trainer.resume_from_path, str), "resume ckpt must be str type"
+                assert "global_step_" in self.config.trainer.resume_from_path, (
+                    "resume ckpt must specify the global_steps"
+                )
+                global_step_folder = self.config.trainer.resume_from_path
+                if not os.path.isabs(global_step_folder):
+                    working_dir = os.getcwd()
+                    global_step_folder = os.path.join(working_dir, global_step_folder)
+        print(f"Load from checkpoint folder: {global_step_folder}")
+        # set global step
+        self.global_steps = int(global_step_folder.split("global_step_")[-1])
+
+        print(f"Setting global step to {self.global_steps}")
+        print(f"Resuming from {global_step_folder}")
+
+        actor_path = os.path.join(global_step_folder, "actor")
+        critic_path = os.path.join(global_step_folder, str(Role.Critic))
+        # load actor
+        self.actor_rollout_wg.load_checkpoint(
+            actor_path, del_local_after_load=self.config.trainer.del_local_ckpt_after_load
+        )
+        # load critic
+        if self.use_critic:
+            self.critic_wg.load_checkpoint(
+                critic_path, del_local_after_load=self.config.trainer.del_local_ckpt_after_load
+            )
+
+        # load dataloader state
+        # TODO: loading from remote storage is not implemented yet
+        dataloader_local_path = os.path.join(global_step_folder, "data.pt")
+        if os.path.exists(dataloader_local_path):
+            dataloader_state_dict = torch.load(dataloader_local_path, weights_only=False)
+            self.train_dataloader.load_state_dict(dataloader_state_dict)
+        else:
+            print(f"Warning: No dataloader state found at {dataloader_local_path}, will start from scratch")
+
+    def _start_profiling(self, do_profile: bool) -> None:
+        """Start profiling for all worker groups if profiling is enabled."""
+        if do_profile:
+            self.actor_rollout_wg.start_profile(role="e2e", profile_step=self.global_steps)
+            if self.use_reference_policy:
+                self.ref_policy_wg.start_profile(profile_step=self.global_steps)
+            if self.use_critic:
+                self.critic_wg.start_profile(profile_step=self.global_steps)
+
+    def _stop_profiling(self, do_profile: bool) -> None:
+        """Stop profiling for all worker groups if profiling is enabled."""
+        if do_profile:
+            self.actor_rollout_wg.stop_profile()
+            if self.use_reference_policy:
+                self.ref_policy_wg.stop_profile()
+            if self.use_critic:
+                self.critic_wg.stop_profile()
+
+    def _get_dp_size(self, worker_group, role: str) -> int:
+        """Get data parallel size from worker group dispatch info.
+
+        This method retrieves the data parallel size by querying the dispatch info
+        for the specified role. The dispatch info is cached for subsequent calls.
+
+        Args:
+            worker_group: The worker group to query dispatch info from.
+            role: The role name (e.g., "actor", "critic") to get DP size for.
+
+        Returns:
+            The data parallel size (number of DP ranks).
+        """
+        if role not in worker_group._dispatch_info:
+            dp_rank_mapping = worker_group._query_dispatch_info(role)
+            worker_group._dispatch_info[role] = dp_rank_mapping
+        else:
+            dp_rank_mapping = worker_group._dispatch_info[role]
+        return max(dp_rank_mapping) + 1
+
+    def _balance_batch(self, batch: DataProto, metrics, logging_prefix="global_seqlen", keep_minibatch=False):
+        """Reorder the data on single controller such that each dp rank gets similar total tokens.
+
+        When use_prefix_grouper is enabled, uses group-level balancing to keep samples with
+        the same uid together on the same rank for prefix sharing optimization.
+        """
+        attention_mask = batch.batch["attention_mask"]
+        batch_size = attention_mask.shape[0]
+        global_seqlen_lst = attention_mask.view(batch_size, -1).sum(-1)  # (train_batch_size,)
+        workload_lst = calculate_workload(global_seqlen_lst)
+        # Get dp_size from dispatch info to correctly balance across data parallel ranks
+        # Note: world_size may include tensor/pipeline parallel dimensions, but we only want DP
+        dp_size = self._get_dp_size(self.actor_rollout_wg, "actor")
+
+        # Use group-level balancing for PrefixGrouper to keep same-uid samples together
+        if getattr(self, "use_prefix_grouper", False) and "uid" in batch.non_tensor_batch:
+            from verl.utils.seqlen_balancing import get_group_balanced_partitions
+
+            uid_list = list(batch.non_tensor_batch["uid"])
+            seqlen_list = global_seqlen_lst.tolist()
+
+            # Count number of uid groups
+            num_groups = len(set(uid_list))
+
+            if num_groups % dp_size != 0:
+                raise ValueError(
+                    f"PrefixGrouper with balance_batch requires num_uid_groups ({num_groups}) "
+                    f"% dp_size ({dp_size}) == 0. "
+                    f"This ensures each rank gets an equal number of groups. "
+                    f"Current batch_size={batch_size}; adjust batch_size to be a multiple of "
+                    f"dp_size * rollout.n."
+                )
+
+            global_partition_lst = get_group_balanced_partitions(
+                seqlen_list=seqlen_list,
+                uid_list=uid_list,
+                k_partitions=dp_size,
+            )
+
+        elif keep_minibatch:
+            # Decouple the DP balancing and mini-batching.
+            minibatch_size = self.config.actor_rollout_ref.actor.get("ppo_mini_batch_size")
+            minibatch_num = len(workload_lst) // minibatch_size
+            global_partition_lst = [[] for _ in range(dp_size)]
+            for i in range(minibatch_num):
+                rearrange_minibatch_lst = get_seqlen_balanced_partitions(
+                    workload_lst[i * minibatch_size : (i + 1) * minibatch_size],
+                    k_partitions=dp_size,
+                    equal_size=True,
+                )
+                for j, part in enumerate(rearrange_minibatch_lst):
+                    global_partition_lst[j].extend([x + minibatch_size * i for x in part])
+        else:
+            global_partition_lst = get_seqlen_balanced_partitions(workload_lst, k_partitions=dp_size, equal_size=True)
+        # Place smaller micro-batches at both ends to reduce the bubbles in pipeline parallel.
+        # Skip reordering within partitions for PrefixGrouper to maintain uid grouping
+        if not getattr(self, "use_prefix_grouper", False):
+            for idx, partition in enumerate(global_partition_lst):
+                partition.sort(key=lambda x: (workload_lst[x], x))
+                ordered_partition = partition[::2] + partition[1::2][::-1]
+                global_partition_lst[idx] = ordered_partition
+
+        # reorder based on index. The data will be automatically equally partitioned by dispatch function
+        global_idx = torch.tensor([j for partition in global_partition_lst for j in partition])
+        batch.reorder(global_idx)
+        global_balance_stats = log_seqlen_unbalance(
+            seqlen_list=global_seqlen_lst.tolist(), partitions=global_partition_lst, prefix=logging_prefix
+        )
+        metrics.update(global_balance_stats)
+
+    def _compute_values(self, batch: DataProto) -> DataProto:
+        if self.use_legacy_worker_impl == "disable":
+            # step 1: convert dataproto to tensordict.
+            batch_td = batch.to_tensordict()
+            # step 2: convert from padding to nopadding
+            batch_td = embeds_padding_2_no_padding(batch_td)
+            # step 3: add meta info
+            tu.assign_non_tensor(batch_td, compute_loss=False)
+            output = self.critic_wg.infer_batch(batch_td)
+            output = output.get()
+            values = tu.get(output, "values")
+            values = tu.get_tensordict({"values": values.float()})
+            values = DataProto.from_tensordict(values)
+        else:
+            values = self.critic_wg.compute_values(batch)
+        return values
+
+    def _compute_ref_log_prob(self, batch: DataProto) -> DataProto:
+        if self.use_legacy_worker_impl == "disable":
+            # step 1: convert dataproto to tensordict.
+            batch_td = batch.to_tensordict()
+            # step 2: convert from padding to nopadding
+            batch_td = embeds_padding_2_no_padding(batch_td)
+            # step 3: add meta info
+            metadata = {
+                "compute_loss": False,
+                "height": self.config.actor_rollout_ref.model.image_height,
+                "width": self.config.actor_rollout_ref.model.image_width,
+                "vae_scale_factor": self.config.actor_rollout_ref.model.get("vae_scale_factor", 8),
+            }
+            if self.ref_in_actor:
+                metadata["no_lora_adapter"] = True
+            tu.assign_non_tensor(batch_td, **metadata)
+            if self.ref_in_actor:
+                output = self.actor_rollout_wg.compute_log_prob(batch_td)
+            else:
+                output = self.ref_policy_wg.compute_ref_log_prob(batch_td)
+            # gather output
+            log_probs = tu.get(output, "log_probs")
+            prev_sample_mean = tu.get(output, "prev_sample_mean")
+            # step 5: rebuild a tensordict and convert to dataproto
+            ref_log_prob = tu.get_tensordict(
+                {"ref_log_prob": log_probs.float(), "ref_prev_sample_mean": prev_sample_mean.float()}
+            )
+            ref_log_prob = DataProto.from_tensordict(ref_log_prob)
+        else:
+            ref_log_prob = self.ref_policy_wg.compute_ref_log_prob(batch)
+
+        return ref_log_prob
+
+    def _compute_old_log_prob(self, batch: DataProto):
+        if self.use_legacy_worker_impl == "disable":
+            # TODO: remove step 1, 2, 4 after we make the whole training tensordict and padding free
+            # step 1: convert dataproto to tensordict.
+            batch_td = batch.to_tensordict()
+            # step 2: convert from padding to nopadding
+            batch_td = embeds_padding_2_no_padding(batch_td)
+            # step 3: add meta info
+            tu.assign_non_tensor(
+                batch_td,
+                compute_loss=False,
+                height=self.config.actor_rollout_ref.model.image_height,
+                width=self.config.actor_rollout_ref.model.image_width,
+                vae_scale_factor=self.config.actor_rollout_ref.model.get("vae_scale_factor", 8),
+            )
+            output = self.actor_rollout_wg.compute_log_prob(batch_td)
+            # gather output
+            log_probs = tu.get(output, "log_probs")
+            # step 5: rebuild a tensordict and convert to dataproto
+            old_log_prob = tu.get_tensordict({"old_log_probs": log_probs.float()})
+            old_log_prob = DataProto.from_tensordict(old_log_prob)
+        else:
+            old_log_prob = self.actor_rollout_wg.compute_log_prob(batch)
+        return old_log_prob
+
+    def _update_actor(self, batch: DataProto) -> DataProto:
+        rollout_config = self.config.actor_rollout_ref.rollout
+        batch.meta_info["multi_turn"] = rollout_config.multi_turn.enable
+        # update actor
+        if self.use_legacy_worker_impl == "disable":
+            # step 1: convert dataproto to tensordict.
+            batch_td = batch.to_tensordict()
+            # step 2: convert from padding to no-padding
+            batch_td = embeds_padding_2_no_padding(batch_td)
+            ppo_mini_batch_size = self.config.actor_rollout_ref.actor.ppo_mini_batch_size
+            ppo_mini_batch_size = ppo_mini_batch_size * self.config.actor_rollout_ref.rollout.n
+            ppo_epochs = self.config.actor_rollout_ref.actor.ppo_epochs
+            seed = self.config.actor_rollout_ref.actor.data_loader_seed
+            shuffle = self.config.actor_rollout_ref.actor.shuffle
+            tu.assign_non_tensor(
+                batch_td,
+                global_batch_size=ppo_mini_batch_size,
+                mini_batch_size=ppo_mini_batch_size,
+                epochs=ppo_epochs,
+                seed=seed,
+                dataloader_kwargs={"shuffle": shuffle},
+                height=self.config.actor_rollout_ref.model.image_height,
+                width=self.config.actor_rollout_ref.model.image_width,
+                vae_scale_factor=self.config.actor_rollout_ref.model.get("vae_scale_factor", 8),
+            )
+
+            actor_output = self.actor_rollout_wg.update_actor(batch_td)
+            actor_output = tu.get(actor_output, "metrics")
+            actor_output = rename_dict(actor_output, "actor/")
+            actor_output = DataProto.from_single_dict(data={}, meta_info={"metrics": actor_output})
+        else:
+            actor_output = self.actor_rollout_wg.update_actor(batch)
+
+        return actor_output
+
+    def _update_critic(self, batch: DataProto) -> DataProto:
+        if self.use_legacy_worker_impl == "disable":
+            # step 1: convert dataproto to tensordict.
+            batch_td = batch.to_tensordict()
+            # step 2: convert from padding to no-padding
+            batch_td = embeds_padding_2_no_padding(batch_td)
+            ppo_mini_batch_size = self.config.critic.ppo_mini_batch_size
+            ppo_mini_batch_size = ppo_mini_batch_size * self.config.actor_rollout_ref.rollout.n
+            ppo_epochs = self.config.critic.ppo_epochs
+            seed = self.config.critic.data_loader_seed
+            shuffle = self.config.critic.shuffle
+            tu.assign_non_tensor(
+                batch_td,
+                global_batch_size=ppo_mini_batch_size,
+                mini_batch_size=ppo_mini_batch_size,
+                epochs=ppo_epochs,
+                seed=seed,
+                dataloader_kwargs={"shuffle": shuffle},
+                height=self.config.actor_rollout_ref.model.image_height,
+                width=self.config.actor_rollout_ref.model.image_width,
+                vae_scale_factor=self.config.actor_rollout_ref.model.get("vae_scale_factor", 8),
+            )
+
+            output = self.critic_wg.train_mini_batch(batch_td)
+            output = output.get()
+            output = tu.get(output, "metrics")
+            output = rename_dict(output, "critic/")
+            # modify key name
+            output["perf/mfu/critic"] = output.pop("critic/mfu")
+            critic_output = DataProto.from_single_dict(data={}, meta_info={"metrics": output})
+        else:
+            critic_output = self.critic_wg.update_critic(batch)
+        return critic_output
+
+    def fit(self):
+        """
+        The training loop of FlowGRPO.
+        The driver process only needs to call the compute functions of the worker group through RPC
+        to construct the PPO dataflow.
+        The light-weight advantage computation is done on the driver process.
+        """
+        from omegaconf import OmegaConf
+
+        from verl.utils.tracking import Tracking
+
+        logger = Tracking(
+            project_name=self.config.trainer.project_name,
+            experiment_name=self.config.trainer.experiment_name,
+            default_backend=self.config.trainer.logger,
+            config=OmegaConf.to_container(self.config, resolve=True),
+        )
+
+        self.global_steps = 0
+
+        # load checkpoint and update weights before doing anything
+        self._load_checkpoint()
+        self.checkpoint_manager.update_weights(self.global_steps)
+
+        current_epoch = self.global_steps // len(self.train_dataloader)
+
+        # perform validation before training
+        # currently, we only support validation using the reward_function.
+        if self.config.trainer.get("val_before_train", True):
+            val_metrics = self._validate()
+            assert val_metrics, f"{val_metrics=}"
+            pprint(f"Initial validation metrics: {val_metrics}")
+            logger.log(data=val_metrics, step=self.global_steps)
+            if self.config.trainer.get("val_only", False):
+                return
+
+        # add tqdm
+        progress_bar = tqdm(total=self.total_training_steps, initial=self.global_steps, desc="Training Progress")
+
+        # we start from step 1
+        self.global_steps += 1
+        last_val_metrics = None
+        self.max_steps_duration = 0
+
+        prev_step_profile = False
+        curr_step_profile = (
+            self.global_steps in self.config.global_profiler.steps
+            if self.config.global_profiler.steps is not None
+            else False
+        )
+        next_step_profile = False
+
+        for epoch in range(current_epoch, self.config.trainer.total_epochs):
+            for batch_dict in self.train_dataloader:
+                if hasattr(self.actor_rollout_wg, "async_calls_finalize_fn_exec"):
+                    self.actor_rollout_wg.async_calls_finalize_fn_exec(blocking=False)
+                metrics = {}
+                timing_raw = {}
+
+                with marked_timer("start_profile", timing_raw):
+                    self._start_profiling(
+                        not prev_step_profile and curr_step_profile
+                        if self.config.global_profiler.profile_continuous_steps
+                        else curr_step_profile
+                    )
+                batch: DataProto = DataProto.from_single_dict(batch_dict)
+
+                # add uid to batch
+                batch.non_tensor_batch["uid"] = np.array(
+                    [str(uuid.uuid4()) for _ in range(len(batch.batch))], dtype=object
+                )
+
+                gen_batch = self._get_gen_batch(batch)
+
+                # pass global_steps to trace
+                gen_batch.meta_info["global_steps"] = self.global_steps
+                gen_batch_output = gen_batch.repeat(
+                    repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True
+                )
+
+                is_last_step = self.global_steps >= self.total_training_steps
+                with marked_timer("step", timing_raw):
+                    # generate a batch
+                    with marked_timer("gen", timing_raw, color="red"):
+                        if curr_step_profile:
+                            self.async_rollout_manager.start_profile()
+                        gen_batch_output = self.async_rollout_manager.generate_sequences(gen_batch_output)
+                        self.checkpoint_manager.sleep_replicas()
+                        if curr_step_profile:
+                            self.async_rollout_manager.stop_profile()
+
+                        timing_raw.update(gen_batch_output.meta_info["timing"])
+                        gen_batch_output.meta_info.pop("timing", None)
+
+                    # repeat to align with repeated responses in rollout
+                    batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
+                    batch = batch.union(gen_batch_output)
+
+                    if "response_mask" not in batch.batch.keys():
+                        batch.batch["response_mask"] = compute_response_mask(batch)
+                    # Balance the number of valid tokens across DP ranks.
+                    # NOTE: This usually changes the order of data in the `batch`,
+                    # which won't affect the advantage calculation (since it's based on uid),
+                    # but might affect the loss calculation (due to the change of mini-batching).
+                    if self.config.trainer.balance_batch:
+                        self._balance_batch(batch, metrics=metrics)
+
+                    # compute global_valid tokens
+                    batch.meta_info["global_token_num"] = torch.sum(batch.batch["attention_mask"], dim=-1).tolist()
+                    # get images_seqlens
+                    images_seqlens_all = []
+                    for multi_modal_input in batch.non_tensor_batch["multi_modal_inputs"]:
+                        if "image_grid_thw" not in multi_modal_input.keys():
+                            continue
+                        images_seqlens_all.extend(multi_modal_input["images_seqlens"].tolist())
+                    batch.meta_info["images_seqlens"] = images_seqlens_all
+                    with marked_timer("reward", timing_raw, color="yellow"):
+                        # compute reward model score
+                        if self.use_rm and "rm_scores" not in batch.batch.keys():
+                            batch_reward = self._compute_reward_colocate(batch)
+                            batch = batch.union(batch_reward)
+
+                        # extract reward_tensor and reward_extra_infos_dict for training
+                        reward_tensor, reward_extra_infos_dict = extract_reward(batch)
+
+                    # Operating Mode Selection:
+                    # - Bypass mode: Sets old_log_probs = rollout_log_probs (2 policies: π_rollout, π_θ)
+                    # - Decoupled mode: Recomputes old_log_probs as proximal anchor (3 policies: π_rollout, π_old, π_θ)
+                    #   Note: π_old computed once per data batch, serves as stable reference during mini-batch updates
+                    rollout_corr_config = self.config.algorithm.get("rollout_correction", None)
+                    bypass_recomputing_logprobs = rollout_corr_config and rollout_corr_config.get("bypass_mode", False)
+                    if bypass_recomputing_logprobs:  # Use `rollout_log_probs`
+                        batch.batch["old_log_probs"] = batch.batch["rollout_log_probs"]
+                    else:  # Recompute old_log_probs
+                        with marked_timer("old_log_prob", timing_raw, color="blue"):
+                            old_log_prob = self._compute_old_log_prob(batch)
+                            batch = batch.union(old_log_prob)
+                            if "rollout_log_probs" in batch.batch.keys():
+                                # TODO: we may want to add diff of probs too.
+                                from verl.utils.debug.metrics import calculate_debug_metrics
+
+                                metrics.update(calculate_debug_metrics(batch))
+
+                    assert "old_log_probs" in batch.batch, f'"old_log_probs" not in {batch.batch.keys()=}'
+
+                    if self.use_reference_policy:
+                        # compute reference log_prob
+                        with marked_timer(str(Role.RefPolicy), timing_raw, color="olive"):
+                            ref_log_prob = self._compute_ref_log_prob(batch)
+                            batch = batch.union(ref_log_prob)
+
+                    # compute values
+                    if self.use_critic:
+                        with marked_timer("values", timing_raw, color="cyan"):
+                            values = self._compute_values(batch)
+                            batch = batch.union(values)
+
+                    with marked_timer("adv", timing_raw, color="brown"):
+                        # we combine with rule-based rm
+                        reward_extra_infos_dict: dict[str, list]
+                        batch.batch["token_level_scores"] = reward_tensor
+
+                        if reward_extra_infos_dict:
+                            batch.non_tensor_batch.update({k: np.array(v) for k, v in reward_extra_infos_dict.items()})
+
+                        # compute rewards. apply_kl_penalty if available
+                        if self.config.algorithm.use_kl_in_reward:
+                            batch, kl_metrics = apply_kl_penalty(
+                                batch, kl_ctrl=self.kl_ctrl_in_reward, kl_penalty=self.config.algorithm.kl_penalty
+                            )
+                            metrics.update(kl_metrics)
+                        else:
+                            batch.batch["token_level_rewards"] = batch.batch["token_level_scores"]
+
+                        # Compute rollout correction: IS weights, rejection sampling, and metrics
+                        # Only runs in decoupled mode (computes once per batch using stable π_old)
+                        # In bypass mode, this is skipped - actor computes metrics from evolving π_θ vs π_rollout
+                        if (
+                            rollout_corr_config is not None
+                            and "rollout_log_probs" in batch.batch
+                            and not bypass_recomputing_logprobs  # Only in decoupled mode
+                        ):
+                            from verl.trainer.ppo.rollout_corr_helper import compute_rollout_correction_and_add_to_batch
+
+                            # Compute IS weights, apply rejection sampling, compute metrics
+                            batch, is_metrics = compute_rollout_correction_and_add_to_batch(batch, rollout_corr_config)
+                            # IS and off-policy metrics already have rollout_corr/ prefix
+                            metrics.update(is_metrics)
+
+                        # compute advantages, executed on the driver process
+                        norm_adv_by_std_in_grpo = self.config.algorithm.get(
+                            "norm_adv_by_std_in_grpo", True
+                        )  # GRPO adv normalization factor
+
+                        batch = compute_advantage(
+                            batch,
+                            adv_estimator=self.config.algorithm.adv_estimator,
+                            norm_adv_by_std_in_grpo=norm_adv_by_std_in_grpo,
+                            global_std=self.config.algorithm.global_std,
+                            config=self.config.algorithm,
+                        )
+
+                    # update critic
+                    if self.use_critic:
+                        with marked_timer("update_critic", timing_raw, color="pink"):
+                            critic_output = self._update_critic(batch)
+                        critic_output_metrics = reduce_metrics(critic_output.meta_info["metrics"])
+                        metrics.update(critic_output_metrics)
+
+                    # implement critic warmup
+                    if self.config.trainer.critic_warmup <= self.global_steps:
+                        # update actor
+                        with marked_timer("update_actor", timing_raw, color="red"):
+                            actor_output = self._update_actor(batch)
+
+                        # Check if the ESI (Elastic Server Instance)/training plan is close to expiration.
+                        esi_close_to_expiration = should_save_ckpt_esi(
+                            max_steps_duration=self.max_steps_duration,
+                            redundant_time=self.config.trainer.esi_redundant_time,
+                        )
+                        # Check if the conditions for saving a checkpoint are met.
+                        # The conditions include a mandatory condition (1) and
+                        # one of the following optional conditions (2/3/4):
+                        # 1. The save frequency is set to a positive value.
+                        # 2. It's the last training step.
+                        # 3. The current step number is a multiple of the save frequency.
+                        # 4. The ESI(Elastic Server Instance)/training plan is close to expiration.
+                        if self.config.trainer.save_freq > 0 and (
+                            is_last_step
+                            or self.global_steps % self.config.trainer.save_freq == 0
+                            or esi_close_to_expiration
+                        ):
+                            if esi_close_to_expiration:
+                                print("Force saving checkpoint: ESI instance expiration approaching.")
+                            with marked_timer("save_checkpoint", timing_raw, color="green"):
+                                self._save_checkpoint()
+
+                        # update weights from trainer to rollout
+                        with marked_timer("update_weights", timing_raw, color="red"):
+                            self.checkpoint_manager.update_weights(self.global_steps)
+
+                        actor_output_metrics = reduce_metrics(actor_output.meta_info["metrics"])
+                        metrics.update(actor_output_metrics)
+
+                    # Log rollout generations if enabled
+                    rollout_data_dir = self.config.trainer.get("rollout_data_dir", None)
+                    if rollout_data_dir:
+                        self._log_rollout_data(batch, reward_extra_infos_dict, timing_raw, rollout_data_dir)
+
+                # validate
+                if self.config.trainer.test_freq > 0 and (
+                    is_last_step or self.global_steps % self.config.trainer.test_freq == 0
+                ):
+                    with marked_timer("testing", timing_raw, color="green"):
+                        val_metrics: dict = self._validate()
+                        if is_last_step:
+                            last_val_metrics = val_metrics
+                    metrics.update(val_metrics)
+
+                with marked_timer("stop_profile", timing_raw):
+                    next_step_profile = (
+                        self.global_steps + 1 in self.config.global_profiler.steps
+                        if self.config.global_profiler.steps is not None
+                        else False
+                    )
+                    self._stop_profiling(
+                        curr_step_profile and not next_step_profile
+                        if self.config.global_profiler.profile_continuous_steps
+                        else curr_step_profile
+                    )
+                    prev_step_profile = curr_step_profile
+                    curr_step_profile = next_step_profile
+
+                steps_duration = timing_raw["step"]
+                self.max_steps_duration = max(self.max_steps_duration, steps_duration)
+
+                # training metrics
+                metrics.update(
+                    {
+                        "training/global_step": self.global_steps,
+                        "training/epoch": epoch,
+                    }
+                )
+                # collect metrics
+                metrics.update(compute_data_metrics(batch=batch, use_critic=self.use_critic))
+                metrics.update(compute_timing_metrics(batch=batch, timing_raw=timing_raw))
+                # TODO: implement actual TFLOPS and theoretical TFLOPS
+                n_gpus = self.resource_pool_manager.get_n_gpus()
+                metrics.update(compute_throughout_metrics(batch=batch, timing_raw=timing_raw, n_gpus=n_gpus))
+                # compute variance proxy metrics
+                gradient_norm = metrics.get("actor/grad_norm", None)
+                metrics.update(compute_variance_proxy_metrics(batch=batch, gradient_norm=gradient_norm))
+                # Note: mismatch metrics (KL, PPL, etc.) are collected earlier in this step, in the rollout-correction block
+
+                # this is experimental and may be changed/removed in the future in favor of a general-purpose one
+                if isinstance(self.train_dataloader.sampler, AbstractCurriculumSampler):
+                    self.train_dataloader.sampler.update(batch=batch)
+
+                # TODO: make a canonical logger that supports various backend
+                logger.log(data=metrics, step=self.global_steps)
+
+                progress_bar.update(1)
+                self.global_steps += 1
+
+                if (
+                    hasattr(self.config.actor_rollout_ref.actor, "profiler")
+                    and self.config.actor_rollout_ref.actor.profiler.tool == "torch_memory"
+                ):
+                    self.actor_rollout_wg.dump_memory_snapshot(
+                        tag=f"post_update_step{self.global_steps}", sub_dir=f"step{self.global_steps}"
+                    )
+
+                if is_last_step:
+                    if hasattr(self.actor_rollout_wg, "async_calls_finalize_fn_exec"):
+                        self.actor_rollout_wg.async_calls_finalize_fn_exec(blocking=True)
+                    pprint(f"Final validation metrics: {last_val_metrics}")
+                    progress_bar.close()
+                    return
+
+                # this is experimental and may be changed/removed in the future
+                # in favor of a general-purpose data buffer pool
+                if hasattr(self.train_dataset, "on_batch_end"):
+                    # The dataset may be changed after each training batch
+                    self.train_dataset.on_batch_end(batch=batch)
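Review note on the checkpoint condition in the training loop above: the comment lists one mandatory condition (a positive save frequency) and three optional triggers. A minimal sketch of that predicate, pulled out into a standalone function so it can be unit-tested, might look like the following. The helper name `should_save_checkpoint` is hypothetical and not part of this PR.

```python
def should_save_checkpoint(
    step: int,
    save_freq: int,
    is_last_step: bool,
    esi_close_to_expiration: bool,
) -> bool:
    """Hypothetical helper mirroring the save condition in the training loop."""
    # Mandatory condition (1): checkpointing must be enabled at all.
    if save_freq <= 0:
        return False
    # Optional conditions (2/3/4): any one of them triggers a save.
    return is_last_step or step % save_freq == 0 or esi_close_to_expiration


# Example: with save_freq=5, step 7 only saves when expiration is near.
print(should_save_checkpoint(7, 5, False, False))
print(should_save_checkpoint(7, 5, False, True))
```

Keeping the predicate separate from `_save_checkpoint()` would make the ESI-expiration path easy to test without a running trainer.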
diff --git a/verl/trainer/ppo/reward.py b/verl/trainer/ppo/reward.py
index f13a3abf976..0d8b108e1dd 100644
--- a/verl/trainer/ppo/reward.py
+++ b/verl/trainer/ppo/reward.py
@@ -19,7 +19,7 @@
 from typing import TYPE_CHECKING, Any, Optional, cast
 
 from verl import DataProto
-from verl.utils.reward_score import default_compute_score
+from verl.utils.reward_score import default_compute_score, default_compute_score_image
 
 if TYPE_CHECKING:
     from omegaconf import DictConfig
@@ -123,6 +123,9 @@ def load_reward_manager(config: DictConfig, tokenizer: Any, **reward_kwargs: Any
             load_extern_object(module_path=module_cfg.path, object_name=reward_manager_cls_name),
         )
 
+    default_compute_score_ = (
+        default_compute_score_image if reward_manager_cfg.name == "image" else default_compute_score
+    )
     if compute_score is None:
         sandbox_config = config.reward.get("sandbox_fusion")
         sandbox_url = sandbox_config.get("url") if sandbox_config else None
@@ -132,13 +135,13 @@ def load_reward_manager(config: DictConfig, tokenizer: Any, **reward_kwargs: Any
             # Create a semaphore to control concurrent access to the sandbox
             _concurrent_semaphore = sandbox_manager.Semaphore(sandbox_config.get("max_concurrent", 64))
             final_compute_score = partial(
-                default_compute_score,
+                default_compute_score_,
                 sandbox_fusion_url=sandbox_url,
                 concurrent_semaphore=_concurrent_semaphore,
                 memory_limit_mb=memory_limit_mb,
             )
         else:
-            final_compute_score = default_compute_score
+            final_compute_score = default_compute_score_
 
     # Instantiate and return the reward manager with the specified parameters
     return reward_manager_cls(
diff --git a/verl/utils/dataset/rl_dataset.py b/verl/utils/dataset/rl_dataset.py
index 117f2df8d41..63fe574fd83 100644
--- a/verl/utils/dataset/rl_dataset.py
+++ b/verl/utils/dataset/rl_dataset.py
@@ -141,6 +141,9 @@ def __init__(
         self.shuffle = config.get("shuffle", False)
         self.seed = config.get("seed")
 
+        # For diffusion model training only
+        self.negative_prompt_key = config.get("negative_prompt_key", "negative_prompt")
+
         self._download()
         self._read_files_and_tokenize()
 
@@ -289,7 +292,7 @@ def __getstate__(self):
     def __len__(self):
         return len(self.dataframe)
 
-    def _build_messages(self, example: dict):
+    def _build_messages(self, example: dict, key: str):
         """Replace <image> and <video> placeholder in messages with corresponding image and video
         which is required by processor.apply_chat_template.
         - <image>: {"type": "image", **image}
@@ -301,7 +304,7 @@ def _build_messages(self, example: dict):
         Returns:
             messages: List of messages with replaced placeholder.
         """
-        messages: list = example[self.prompt_key]
+        messages: list = example[key]
         # When concatenating image and video datasets, pop will return None for image or video sample
         images = example.pop(self.image_key, None) or []
         videos = example.pop(self.video_key, None) or []
@@ -348,7 +351,11 @@ def _build_messages(self, example: dict):
     def __getitem__(self, item):
         """For rollout, apply_chat_template has been moved to AgentLoop, so we only return raw_prompt here."""
         row_dict: dict = self.dataframe[item]
-        row_dict["raw_prompt"] = self._build_messages(row_dict)
+        row_dict["raw_prompt"] = self._build_messages(row_dict, key=self.prompt_key)
+        try:
+            row_dict["raw_negative_prompt"] = self._build_messages(row_dict, key=self.negative_prompt_key)
+        except (KeyError, IndexError):
+            pass
 
         # TODO(wuxibin): We still need a dummy tensor to make sure DataProto.batch is not empty.
         # Remove this after deprecate DataProto by TensorDict.
diff --git a/verl/utils/diffusers/__init__.py b/verl/utils/diffusers/__init__.py
new file mode 100644
index 00000000000..1cd1e8433df
--- /dev/null
+++ b/verl/utils/diffusers/__init__.py
@@ -0,0 +1,13 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/verl/utils/diffusers/schedulers/__init__.py b/verl/utils/diffusers/schedulers/__init__.py
new file mode 100644
index 00000000000..4127227d7c7
--- /dev/null
+++ b/verl/utils/diffusers/schedulers/__init__.py
@@ -0,0 +1,17 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .scheduling_flow_match_sde_discrete import FlowMatchSDEDiscreteScheduler
+
+__all__ = ["FlowMatchSDEDiscreteScheduler"]
diff --git a/verl/utils/diffusers/schedulers/scheduling_flow_match_sde_discrete.py b/verl/utils/diffusers/schedulers/scheduling_flow_match_sde_discrete.py
new file mode 100644
index 00000000000..645b0ffbe24
--- /dev/null
+++ b/verl/utils/diffusers/schedulers/scheduling_flow_match_sde_discrete.py
@@ -0,0 +1,228 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+from dataclasses import dataclass
+from typing import Literal, Optional
+
+import torch
+from diffusers import FlowMatchEulerDiscreteScheduler
+from diffusers.utils import BaseOutput
+from diffusers.utils.torch_utils import randn_tensor
+
+
+@dataclass
+class FlowMatchSDEDiscreteSchedulerOutput(BaseOutput):
+    """
+    Output class for the scheduler's `step` function output.
+
+    Args:
+        prev_sample (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_channels)` for images):
+            Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
+            denoising loop.
+        log_prob (`torch.FloatTensor` of shape `(batch_size,)`, *optional*):
+            The log probability of the previous sample.
+        prev_sample_mean (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_channels)` for images):
+            The mean of the computed sample of previous timestep.
+        std_dev_t (`torch.FloatTensor` of shape `(batch_size, 1, 1)`):
+            The standard deviation used to compute `prev_sample`.
+    """
+
+    prev_sample: torch.FloatTensor
+    log_prob: Optional[torch.FloatTensor]
+    prev_sample_mean: torch.FloatTensor
+    std_dev_t: torch.FloatTensor
+
+
+class FlowMatchSDEDiscreteScheduler(FlowMatchEulerDiscreteScheduler):
+    def step(
+        self,
+        model_output: torch.FloatTensor,
+        timestep: float | torch.FloatTensor,
+        sample: torch.FloatTensor,
+        s_churn: float = 0.0,
+        s_tmin: float = 0.0,
+        s_tmax: float = float("inf"),
+        s_noise: float = 1.0,
+        generator: Optional[torch.Generator] = None,
+        per_token_timesteps: Optional[torch.Tensor] = None,
+        return_dict: bool = True,
+        noise_level: float = 0.7,
+        prev_sample: Optional[torch.FloatTensor] = None,
+        sde_type: Literal["sde", "cps"] = "sde",
+        logprobs: bool = True,
+    ) -> FlowMatchSDEDiscreteSchedulerOutput | tuple:
+        """
+        Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion
+        process from the learned model outputs (most often the predicted noise).
+
+        Modified from https://github.com/yifan123/flow_grpo/blob/main/flow_grpo/diffusers_patch/sd3_sde_with_logprob.py
+
+        Args:
+            model_output (`torch.FloatTensor`):
+                The direct output from learned diffusion model.
+            timestep (`float`):
+                The current discrete timestep in the diffusion chain.
+            sample (`torch.FloatTensor`):
+                A current instance of a sample created by the diffusion process.
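+            s_churn (`float`): Unused; kept for signature compatibility with the parent scheduler.
+            s_tmin  (`float`): Unused; kept for signature compatibility with the parent scheduler.
+            s_tmax  (`float`): Unused; kept for signature compatibility with the parent scheduler.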
+            s_noise (`float`, defaults to 1.0):
+                Scaling factor for noise added to the sample.
+            generator (`torch.Generator`, *optional*):
+                A random number generator.
+            per_token_timesteps (`torch.Tensor`, *optional*):
+                The timesteps for each token in the sample.
+            return_dict (`bool`):
+                Whether or not to return a
+                [`~schedulers.scheduling_flow_match_sde_discrete.FlowMatchSDEDiscreteSchedulerOutput`] or tuple.
+            noise_level (`float`, *optional*, defaults to 0.7):
+                The noise level used in the SDE.
+            prev_sample (`torch.FloatTensor`, *optional*):
+                The sample from the previous timestep. If not provided, it will be sampled inside the function.
+            sde_type (`str`, *optional*, defaults to "sde"):
+                The type of SDE to use. Choose between "sde" and "cps".
+
+        Returns:
+            [`~schedulers.scheduling_flow_match_sde_discrete.FlowMatchSDEDiscreteSchedulerOutput`] or `tuple`:
+                If return_dict is `True`,
+                [`~schedulers.scheduling_flow_match_sde_discrete.FlowMatchSDEDiscreteSchedulerOutput`] is returned,
+                otherwise a tuple is returned where the first element is the sample tensor.
+        """
+
+        if isinstance(timestep, (int, torch.IntTensor, torch.LongTensor)):
+            raise ValueError(
+                (
+                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
+                    " `FlowMatchSDEDiscreteScheduler.step()` is not supported. Make sure to pass"
+                    " one of the `scheduler.timesteps` as a timestep."
+                ),
+            )
+
+        if self.step_index is None:
+            self._init_step_index(timestep)
+
+        # Upcast to avoid precision issues when computing prev_sample
+        sample = sample.to(torch.float32)
+        if prev_sample is not None:
+            prev_sample = prev_sample.to(torch.float32)
+
+        prev_sample, log_prob, prev_sample_mean, std_dev_t = self.sample_previous_step(
+            sample=sample,
+            model_output=model_output,
+            generator=generator,
+            per_token_timesteps=per_token_timesteps,
+            noise_level=noise_level,
+            prev_sample=prev_sample,
+            sde_type=sde_type,
+            logprobs=logprobs,
+        )
+
+        # upon completion increase step index by one
+        self._step_index += 1
+        if per_token_timesteps is None:
+            # Cast sample back to model compatible dtype
+            prev_sample = prev_sample.to(model_output.dtype)
+
+        if not return_dict:
+            return (prev_sample, log_prob, prev_sample_mean, std_dev_t)
+        return FlowMatchSDEDiscreteSchedulerOutput(
+            prev_sample=prev_sample, log_prob=log_prob, prev_sample_mean=prev_sample_mean, std_dev_t=std_dev_t
+        )
+
+    def sample_previous_step(
+        self,
+        sample: torch.Tensor,
+        model_output: torch.Tensor,
+        timestep: Optional[torch.FloatTensor] = None,
+        generator: Optional[torch.Generator] = None,
+        per_token_timesteps: Optional[torch.Tensor] = None,
+        noise_level: float = 0.7,
+        prev_sample: Optional[torch.Tensor] = None,
+        sde_type: Literal["cps", "sde"] = "sde",
+        logprobs: bool = True,
+    ):
+        # check inputs
+        assert sample.dtype == torch.float32
+        if prev_sample is not None:
+            assert prev_sample.dtype == torch.float32
+
+        if per_token_timesteps is not None:
+            raise NotImplementedError("per_token_timesteps is not supported yet for FlowMatchSDEDiscreteScheduler.")
+        else:
+            if timestep is None:
+                sigma_idx = self.step_index
+                sigma = self.sigmas[sigma_idx]
+                sigma_next = self.sigmas[sigma_idx + 1]
+            else:
+                sigma_idx = torch.tensor([self.index_for_timestep(t) for t in timestep])
+                sigma = self.sigmas[sigma_idx].view(-1, *([1] * (len(sample.shape) - 1)))
+                sigma_next = self.sigmas[sigma_idx + 1].view(-1, *([1] * (len(sample.shape) - 1)))
+
+            sigma_max = self.sigmas[1]
+            dt = sigma_next - sigma
+
+        if sde_type == "sde":
+            std_dev_t = torch.sqrt(sigma / (1 - torch.where(sigma == 1, sigma_max, sigma))) * noise_level
+
+            # our sde
+            prev_sample_mean = (
+                sample * (1 + std_dev_t**2 / (2 * sigma) * dt)
+                + model_output * (1 + std_dev_t**2 * (1 - sigma) / (2 * sigma)) * dt
+            )
+
+            if prev_sample is None:
+                variance_noise = randn_tensor(
+                    model_output.shape,
+                    generator=generator,
+                    device=model_output.device,
+                    dtype=model_output.dtype,
+                )
+                prev_sample = prev_sample_mean + std_dev_t * torch.sqrt(-1 * dt) * variance_noise
+
+            if logprobs:
+                log_prob = (
+                    -((prev_sample.detach() - prev_sample_mean) ** 2) / (2 * ((std_dev_t * torch.sqrt(-1 * dt)) ** 2))
+                    - torch.log(std_dev_t * torch.sqrt(-1 * dt))
+                    - torch.log(torch.sqrt(2 * torch.as_tensor(math.pi)))
+                )
+            else:
+                log_prob = None
+
+        elif sde_type == "cps":
+            std_dev_t = sigma_next * math.sin(noise_level * math.pi / 2)  # sigma_t in paper
+            pred_original_sample = sample - sigma * model_output  # predicted x_0 in paper
+            noise_estimate = sample + model_output * (1 - sigma)  # predicted x_1 in paper
+            prev_sample_mean = pred_original_sample * (1 - sigma_next) + noise_estimate * torch.sqrt(
+                sigma_next**2 - std_dev_t**2
+            )
+
+            if prev_sample is None:
+                variance_noise = randn_tensor(
+                    model_output.shape,
+                    generator=generator,
+                    device=model_output.device,
+                    dtype=model_output.dtype,
+                )
+                prev_sample = prev_sample_mean + std_dev_t * variance_noise
+
+            # remove all constants
+            if logprobs:
+                log_prob = -((prev_sample.detach() - prev_sample_mean) ** 2)
+            else:
+                log_prob = None
+
+        # mean along all but batch dimension
+        log_prob = log_prob.mean(dim=tuple(range(1, log_prob.ndim))) if log_prob is not None else None
+        return prev_sample, log_prob, prev_sample_mean, std_dev_t
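Review note on the `"sde"` branch of `sample_previous_step` above: each denoising step is treated as a draw from a Gaussian whose mean and standard deviation depend on `sigma`, `sigma_next`, and `noise_level`, and `log_prob` is that Gaussian's log-density at `prev_sample`. The sketch below restates the same arithmetic for a single scalar using only the standard library, so the formulas can be checked by hand. The function names are illustrative assumptions, not part of the PR, and the real code operates on tensors and averages over non-batch dimensions.

```python
import math


def sde_step_stats(sample, model_output, sigma, sigma_next, sigma_max, noise_level=0.7):
    """Scalar restatement of the "sde" branch: returns (mean, std) of the next sample."""
    dt = sigma_next - sigma  # negative, because sigmas decrease while denoising
    # At sigma == 1 the denominator would be zero, so sigma_max is used instead.
    denom_sigma = sigma_max if sigma == 1 else sigma
    std_dev_t = math.sqrt(sigma / (1 - denom_sigma)) * noise_level
    mean = (
        sample * (1 + std_dev_t**2 / (2 * sigma) * dt)
        + model_output * (1 + std_dev_t**2 * (1 - sigma) / (2 * sigma)) * dt
    )
    std = std_dev_t * math.sqrt(-dt)
    return mean, std


def gaussian_log_prob(x, mean, std):
    """Log-density of N(mean, std^2) at x, written in the same three terms as the diff."""
    return -((x - mean) ** 2) / (2 * std**2) - math.log(std) - math.log(math.sqrt(2 * math.pi))


mean, std = sde_step_stats(sample=0.5, model_output=0.2, sigma=0.8, sigma_next=0.6, sigma_max=0.9)
# One standard deviation away from the mean lowers the log-density by exactly 0.5.
print(gaussian_log_prob(mean, mean, std) - gaussian_log_prob(mean + std, mean, std))
```

Note that the `"cps"` branch drops the constant terms and keeps only the squared error, so its `log_prob` values are not comparable in scale with the ones from this branch.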
diff --git a/verl/utils/diffusers/utils.py b/verl/utils/diffusers/utils.py
new file mode 100644
index 00000000000..b6059ff7b72
--- /dev/null
+++ b/verl/utils/diffusers/utils.py
@@ -0,0 +1,43 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import numpy as np
+from diffusers import SchedulerMixin
+
+from verl.utils.device import get_device_name
+from verl.workers.config import DiffusersModelConfig
+
+
+def set_timesteps(scheduler: SchedulerMixin, model_config: DiffusersModelConfig):
+    # TODO (mike): using path name is not robust, refactor later
+    if model_config.path.endswith("Qwen-Image"):
+        from diffusers.pipelines.qwenimage.pipeline_qwenimage import calculate_shift
+
+        vae_scale_factor = 8
+        latent_height, latent_width = (
+            model_config.image_height // vae_scale_factor // 2,
+            model_config.image_width // vae_scale_factor // 2,
+        )
+        num_inference_steps = model_config.num_inference_steps
+        sigmas = np.linspace(1.0, 1 / num_inference_steps, num_inference_steps)
+        mu = calculate_shift(
+            latent_height * latent_width,
+            scheduler.config.get("base_image_seq_len", 256),
+            scheduler.config.get("max_image_seq_len", 4096),
+            scheduler.config.get("base_shift", 0.5),
+            scheduler.config.get("max_shift", 1.15),
+        )
+        scheduler.set_timesteps(num_inference_steps, device=get_device_name(), sigmas=sigmas, mu=mu)
+    else:
+        raise NotImplementedError("unsupported model for custom scheduler settings")
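Review note on `set_timesteps` above: the Qwen-Image branch derives a latent token count from the image size, builds a linear sigma schedule, and passes a shift `mu` to the scheduler. The sketch below reproduces those three quantities in plain Python. It assumes `calculate_shift` is a linear interpolation between `base_shift` and `max_shift` over the sequence-length range, which is how the diffusers pipeline helpers are commonly written; `linear_shift`, `linear_sigmas`, and `latent_seq_len` are hypothetical names used only for illustration.

```python
def latent_seq_len(image_height, image_width, vae_scale_factor=8):
    """Token count after VAE downsampling and the extra factor of 2 used in the diff."""
    return (image_height // vae_scale_factor // 2) * (image_width // vae_scale_factor // 2)


def linear_shift(image_seq_len, base_seq_len=256, max_seq_len=4096, base_shift=0.5, max_shift=1.15):
    """Assumed behaviour of calculate_shift: interpolate the shift linearly in sequence length."""
    slope = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    return base_shift + slope * (image_seq_len - base_seq_len)


def linear_sigmas(num_inference_steps):
    """Pure-Python equivalent of np.linspace(1.0, 1 / n, n)."""
    n = num_inference_steps
    if n == 1:
        return [1.0]
    step = (1.0 / n - 1.0) / (n - 1)
    return [1.0 + i * step for i in range(n)]


seq_len = latent_seq_len(1024, 1024)
print(seq_len, linear_shift(seq_len), linear_sigmas(4))
```

Under these assumptions a 1024x1024 image maps to 4096 latent tokens, which sits at the top of the default range and therefore receives the maximum shift.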
diff --git a/verl/utils/fsdp_utils.py b/verl/utils/fsdp_utils.py
index 8f7f8bef0d0..1f2be9fba7b 100644
--- a/verl/utils/fsdp_utils.py
+++ b/verl/utils/fsdp_utils.py
@@ -568,7 +568,7 @@ def fsdp2_clip_grad_norm_(parameters, max_norm, norm_type=2.0, error_if_nonfinit
     return total_norm
 
 
-def layered_summon_lora_params(fsdp_module) -> OrderedDict:
+def layered_summon_lora_params(fsdp_module, is_diffusers=False) -> OrderedDict:
     from peft.utils.save_and_load import get_peft_model_state_dict
 
     def __prefix_submodules(module, prefix):
@@ -577,22 +577,33 @@ def __prefix_submodules(module, prefix):
                 yield name, submodule
 
     lora_params = OrderedDict()
-    prefix_list = [
-        # fsdp
-        "_fsdp_wrapped_module.base_model.model.",
-        "_fsdp_wrapped_module.base_model.model.model.",
-        "_fsdp_wrapped_module.base_model.model.model.layers.",
-        "_fsdp_wrapped_module.base_model.model.model.language_model.layers.",
-        # fsdp2
-        "base_model.model.",
-        "base_model.model.model.",
-        "base_model.model.model.layers.",
-        "base_model.model.model.language_model.layers.",
-    ]
+    if is_diffusers:
+        prefix_list = [
+            # fsdp
+            "_fsdp_wrapped_module.transformer_blocks.",
+            # fsdp2
+            "transformer_blocks.",
+        ]
+    else:
+        prefix_list = [
+            # fsdp
+            "_fsdp_wrapped_module.base_model.model.",
+            "_fsdp_wrapped_module.base_model.model.model.",
+            "_fsdp_wrapped_module.base_model.model.model.layers.",
+            "_fsdp_wrapped_module.base_model.model.model.language_model.layers.",
+            # fsdp2
+            "base_model.model.",
+            "base_model.model.model.",
+            "base_model.model.model.layers.",
+            "base_model.model.model.language_model.layers.",
+        ]
     peft_model = getattr(fsdp_module, "_fsdp_wrapped_module", fsdp_module)
     for prefix in prefix_list:
         for name, submodule in __prefix_submodules(fsdp_module, prefix):
-            prefix = name.replace("_fsdp_wrapped_module.base_model.model.", "base_model.model.")
+            if is_diffusers:
+                prefix = name.replace("_fsdp_wrapped_module.", "")
+            else:
+                prefix = name.replace("_fsdp_wrapped_module.base_model.model.", "base_model.model.")
             if name.endswith(".model") or name.endswith(".layers"):
                 continue
             if fsdp_version(submodule) > 0:
@@ -610,7 +621,9 @@ def __prefix_submodules(module, prefix):
     return lora_params
 
 
-def collect_lora_params(module: FSDP, layered_summon: bool, base_sync_done: bool) -> OrderedDict:
+def collect_lora_params(
+    module: FSDP, layered_summon: bool, base_sync_done: bool, is_diffusers: bool = False
+) -> OrderedDict:
     """
     collect lora params or full params if base model is not ready in vllm
     work with if isinstance(self.module._fsdp_wrapped_module, PeftModel)
@@ -626,7 +639,7 @@ def collect_lora_params(module: FSDP, layered_summon: bool, base_sync_done: bool
                     "To use layered_summon, you must make sure base-model is preloaded in vllm, e.g. let "
                     "rollout.load_format=safetensors"
                 )
-            lora_params = layered_summon_lora_params(module)
+            lora_params = layered_summon_lora_params(module, is_diffusers=is_diffusers)
         else:
             with FSDP.summon_full_params(module, writeback=False):
                 if base_sync_done:
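
The prefix handling added to `layered_summon_lora_params` comes down to a string rewrite of the submodule name. The sketch below isolates that rewrite; `normalize_lora_prefix` is a hypothetical helper name, and the module names in the examples are illustrative rather than taken from a real model.

```python
def normalize_lora_prefix(name: str, is_diffusers: bool = False) -> str:
    """Strip the FSDP1 wrapper segment so FSDP1 and FSDP2 names agree."""
    if is_diffusers:
        # Diffusers transformers are not wrapped in a PEFT base_model here,
        # so only the FSDP wrapper segment is removed.
        return name.replace("_fsdp_wrapped_module.", "")
    # PEFT-wrapped language models keep the base_model.model. prefix.
    return name.replace("_fsdp_wrapped_module.base_model.model.", "base_model.model.")


# FSDP1-style names lose the wrapper segment.
print(normalize_lora_prefix("_fsdp_wrapped_module.transformer_blocks.0", is_diffusers=True))
print(normalize_lora_prefix("_fsdp_wrapped_module.base_model.model.model.layers.3"))
# FSDP2-style names have no wrapper segment and pass through unchanged.
print(normalize_lora_prefix("transformer_blocks.0", is_diffusers=True))
```

Note that `str.replace` removes every occurrence of the wrapper segment, so nested wrappers in a diffusers module name would all be stripped.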
diff --git a/verl/utils/reward_score/__init__.py b/verl/utils/reward_score/__init__.py
index b65d94ec14d..4e6ede51d05 100644
--- a/verl/utils/reward_score/__init__.py
+++ b/verl/utils/reward_score/__init__.py
@@ -114,6 +114,47 @@ def default_compute_score(
         return float(res[0])
 
 
+def default_compute_score_image(
+    data_source,
+    solution_image,
+    ground_truth,
+    extra_info=None,
+    sandbox_fusion_url=None,
+    concurrent_semaphore=None,
+    memory_limit_mb=None,
+    **kwargs,
+):
+    """Compute the score for a given solution based on the data source.
+
+    Args:
+        data_source (str): The source dataset identifier which determines the scoring method.
+        solution_image (Image.Image or torch.Tensor): The solution image to be evaluated.
+        ground_truth (str): The ground truth answer for comparison.
+        extra_info (dict, optional): Additional information that might be needed for scoring. Defaults to None.
+
+    Returns:
+        float | dict: The computed score as a float, or the result dictionary unchanged if the
+            reward function returns a dictionary.
+
+    Raises:
+        NotImplementedError: If the reward function is not implemented for the given data source.
+    """
+    if data_source == "jpeg_compressibility":
+        from . import jpeg_compressibility
+
+        res = jpeg_compressibility.compute_score(solution_image)
+
+    else:
+        raise NotImplementedError(f"Reward function is not implemented for {data_source=}")
+
+    if isinstance(res, dict):
+        return res
+    elif isinstance(res, int | float | bool):
+        return float(res)
+    else:
+        return float(res[0])
+
+
 @deprecated("verl.utils.reward_score.default_compute_score")
 def _default_compute_score(
     data_source,
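
The tail of `default_compute_score_image` normalizes whatever the reward function returns. A minimal sketch of that normalization in isolation follows; `normalize_score` is a hypothetical name, and a plain list stands in for the length-1 array that the JPEG scorer returns.

```python
def normalize_score(res):
    """Mirror the three return branches: dict, scalar, or indexable sequence."""
    if isinstance(res, dict):
        return res  # dictionaries are passed through unchanged
    if isinstance(res, (int, float, bool)):
        return float(res)
    return float(res[0])  # e.g. a length-1 array of per-image scores


print(normalize_score({"score": 1.0}))
print(normalize_score(True))
print(normalize_score([-0.25]))
```

Only the first element of a sequence is used, so a caller that passes a batch of several images would get the score of the first one.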
diff --git a/verl/utils/reward_score/jpeg_compressibility.py b/verl/utils/reward_score/jpeg_compressibility.py
new file mode 100644
index 00000000000..edf8b971a14
--- /dev/null
+++ b/verl/utils/reward_score/jpeg_compressibility.py
@@ -0,0 +1,56 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import io
+
+import numpy as np
+import torch
+from PIL import Image
+
+
+def jpeg_incompressibility():
+    def _fn(images, prompts):
+        if isinstance(images, torch.Tensor):
+            images = (images * 255).round().clamp(0, 255).to(torch.uint8).cpu().numpy()
+            images = images.transpose(0, 2, 3, 1)  # NCHW -> NHWC
+        images = [Image.fromarray(image) for image in images]
+        buffers = [io.BytesIO() for _ in images]
+        for image, buffer in zip(images, buffers, strict=False):
+            image.save(buffer, format="JPEG", quality=95)
+        sizes = [buffer.tell() / 1000 for buffer in buffers]
+        return np.array(sizes), {}
+
+    return _fn
+
+
+def jpeg_compressibility():
+    jpeg_fn = jpeg_incompressibility()
+
+    def _fn(images, prompts):
+        rew, meta = jpeg_fn(images, prompts)
+        return -rew / 500, meta
+
+    return _fn
+
+
+def compute_score(solution_image):
+    """The scoring function for JPEG compressibility.
+
+    Args:
+        solution_image: image tensor in CHW or NCHW layout with values in [0, 1]; a CHW tensor is treated as a batch of one.
+    """
+    if isinstance(solution_image, torch.Tensor) and solution_image.ndim == 3:
+        solution_image = solution_image.unsqueeze(0)
+    score = jpeg_compressibility()(solution_image, None)[0]
+    return score
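
The reward arithmetic in `jpeg_compressibility` is easy to check by hand: the encoded size in bytes is divided by 1000, then negated and divided by 500. The sketch below applies the same two steps to a byte count directly, without encoding an image; `compressibility_reward` is a hypothetical name and the byte counts are made-up inputs.

```python
def compressibility_reward(num_bytes: int) -> float:
    """Map a JPEG file size in bytes to the reward used above."""
    size_kb = num_bytes / 1000  # same scaling as buffer.tell() / 1000
    return -size_kb / 500       # smaller files get rewards closer to zero


print(compressibility_reward(250_000))
print(compressibility_reward(100_000) > compressibility_reward(200_000))
```

The reward is never positive, and it increases as the image compresses better.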
diff --git a/verl/utils/vllm/__init__.py b/verl/utils/vllm/__init__.py
index 00aa7bdb642..a5384ed6760 100644
--- a/verl/utils/vllm/__init__.py
+++ b/verl/utils/vllm/__init__.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from .utils import TensorLoRARequest, VLLMHijack, is_version_ge
+from .utils import OmniTensorLoRARequest, TensorLoRARequest, VLLMHijack, VLLMOmniHijack, is_version_ge
 
 # The contents of vllm/patch.py should not be imported here, because the contents of
 # patch.py should be imported after the vllm LLM instance is created. Therefore,
@@ -21,6 +21,8 @@
 
 __all__ = [
     "TensorLoRARequest",
+    "OmniTensorLoRARequest",
     "VLLMHijack",
+    "VLLMOmniHijack",
     "is_version_ge",
 ]
diff --git a/verl/utils/vllm/utils.py b/verl/utils/vllm/utils.py
index 1ac655fcf60..1a646e28e19 100644
--- a/verl/utils/vllm/utils.py
+++ b/verl/utils/vllm/utils.py
@@ -21,9 +21,12 @@
 except ImportError:
     from vllm.lora.models import LoRAModel
 
+from vllm.lora.peft_helper import PEFTHelper
 from vllm.lora.request import LoRARequest
 from vllm.lora.utils import get_adapter_absolute_path
 from vllm.lora.worker_manager import LRUCacheWorkerLoRAManager
+from vllm_omni.diffusion.lora.manager import DiffusionLoRAManager, logger
+from vllm_omni.lora.request import LoRARequest as OmniLoRARequest
 
 from verl.third_party.vllm import get_version
 
@@ -33,6 +36,11 @@ class TensorLoRARequest(LoRARequest):
     lora_tensors: dict = field(default=None)
 
 
+class OmniTensorLoRARequest(OmniLoRARequest):
+    peft_config: dict = field(default=None)
+    lora_tensors: dict = field(default=None)
+
+
 class VLLMHijack:
     @staticmethod
     def hijack():
@@ -58,7 +66,6 @@ def hijack__load_adapter(self, lora_request: TensorLoRARequest) -> LoRAModel:
                 expected_lora_modules = list(set(expected_lora_modules))
 
                 lora_tensors = None
-                from vllm.lora.peft_helper import PEFTHelper
 
                 if isinstance(lora_request, TensorLoRARequest):
                     peft_config = lora_request.peft_config
@@ -126,3 +133,85 @@ def do_hijack(target_cls, target_method_name, hooking_method):
 def is_version_ge(pkg: str = "vllm", minver: str = "0.7.3"):
     """check if the package version is greater than or equal to the minimum version"""
     return vs.parse(get_version(pkg)) >= vs.parse(minver)
+
+
+class VLLMOmniHijack:
+    @staticmethod
+    def hijack():
+        def hijack__load_adapter(self, lora_request: OmniTensorLoRARequest) -> tuple[LoRAModel, PEFTHelper]:
+            """
+            Based on vllm_omni.diffusion.lora.manager.DiffusionLoRAManager._load_adapter,
+            extended to load an adapter from in-memory LoRA tensors.
+
+            Reason:
+            vLLM-Omni only supports adding LoRA adapters from file paths, not directly from tensors.
+            To synchronize the LoRA tensors of the actor model, this override lets vLLM-Omni
+            load LoRA tensors held in memory.
+            """
+            if not self._expected_lora_modules:
+                raise ValueError("No supported LoRA modules found in the diffusion pipeline.")
+
+            logger.debug("Supported LoRA modules: %s", self._expected_lora_modules)
+
+            lora_tensors = None
+
+            if isinstance(lora_request, OmniTensorLoRARequest):
+                peft_config = lora_request.peft_config
+                lora_tensors = lora_request.lora_tensors
+                peft_helper = PEFTHelper.from_dict(peft_config)
+            else:
+                lora_path = get_adapter_absolute_path(lora_request.lora_path)
+                logger.debug("Resolved LoRA path: %s", lora_path)
+
+                peft_helper = PEFTHelper.from_local_dir(
+                    lora_path,
+                    max_position_embeddings=None,  # not needed for diffusion
+                    tensorizer_config_dict=lora_request.tensorizer_config_dict,
+                )
+
+            logger.info(
+                "Loaded PEFT config: r=%d, lora_alpha=%d, target_modules=%s",
+                peft_helper.r,
+                peft_helper.lora_alpha,
+                peft_helper.target_modules,
+            )
+
+            if isinstance(lora_request, OmniTensorLoRARequest):
+                lora_model = LoRAModel.from_lora_tensors(
+                    tensors=lora_tensors,
+                    peft_helper=peft_helper,
+                    lora_model_id=lora_request.lora_int_id,
+                    device="cpu",  # consistent w/ vllm's behavior
+                    dtype=self.dtype,
+                    model_vocab_size=None,
+                    weights_mapper=None,
+                )
+            else:
+                lora_model = LoRAModel.from_local_checkpoint(
+                    lora_path,
+                    expected_lora_modules=self._expected_lora_modules,
+                    peft_helper=peft_helper,
+                    lora_model_id=lora_request.lora_int_id,
+                    device="cpu",  # consistent w/ vllm's behavior
+                    dtype=self.dtype,
+                    model_vocab_size=None,
+                    tensorizer_config_dict=lora_request.tensorizer_config_dict,
+                    weights_mapper=None,
+                )
+
+            logger.info(
+                "Loaded LoRA model: id=%d, num_modules=%d, modules=%s",
+                lora_model.id,
+                len(lora_model.loras),
+                list(lora_model.loras.keys()),
+            )
+
+            for lora in lora_model.loras.values():
+                lora.optimize()  # ref: _create_merged_loras_inplace, internal scaling
+
+            return lora_model, peft_helper
+
+        def do_hijack(target_cls, target_method_name, hooking_method):
+            setattr(target_cls, target_method_name, hooking_method)
+
+        do_hijack(DiffusionLoRAManager, "_load_adapter", hijack__load_adapter)
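
The hijack above is a class-level monkey patch: `setattr` swaps the method on the class, so existing and future instances both pick up the new behaviour. The sketch below shows the pattern on a toy class. `Manager`, `patched_load_adapter`, and the dict-based request are hypothetical stand-ins, and unlike the PR (which reimplements both branches inside the override) this sketch falls back to the saved original method.

```python
class Manager:
    """Hypothetical stand-in for a manager that only loads adapters from disk."""

    def _load_adapter(self, request: dict) -> str:
        return f"loaded from path {request['path']}"


_original_load_adapter = Manager._load_adapter


def patched_load_adapter(self, request: dict) -> str:
    # New branch: the request carries tensors directly.
    if "tensors" in request:
        return f"loaded {len(request['tensors'])} in-memory tensors"
    # Fallback: defer to the original path-based loader.
    return _original_load_adapter(self, request)


def do_hijack(target_cls, target_method_name, hooking_method):
    setattr(target_cls, target_method_name, hooking_method)


manager = Manager()  # created before patching
do_hijack(Manager, "_load_adapter", patched_load_adapter)

print(manager._load_adapter({"path": "/tmp/adapter"}))
print(manager._load_adapter({"tensors": {"a": 1, "b": 2}}))
```

Because method lookup goes through the class, the patch must run before the first call that needs it, but it does not need to run before the instance is constructed.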
diff --git a/verl/utils/vllm_omni/__init__.py b/verl/utils/vllm_omni/__init__.py
new file mode 100644
index 00000000000..1cd1e8433df
--- /dev/null
+++ b/verl/utils/vllm_omni/__init__.py
@@ -0,0 +1,13 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/verl/utils/vllm_omni/pipelines/__init__.py b/verl/utils/vllm_omni/pipelines/__init__.py
new file mode 100644
index 00000000000..c14448db8db
--- /dev/null
+++ b/verl/utils/vllm_omni/pipelines/__init__.py
@@ -0,0 +1,16 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .pipeline_qwenimage import QwenImagePipelineWithLogProb
+
+__all__ = ["QwenImagePipelineWithLogProb"]
diff --git a/verl/utils/vllm_omni/pipelines/pipeline_qwenimage.py b/verl/utils/vllm_omni/pipelines/pipeline_qwenimage.py
new file mode 100644
index 00000000000..52e5c8a357a
--- /dev/null
+++ b/verl/utils/vllm_omni/pipelines/pipeline_qwenimage.py
@@ -0,0 +1,438 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+from typing import Any, Literal
+
+import torch
+from diffusers.models.autoencoders.autoencoder_kl_qwenimage import AutoencoderKLQwenImage
+from transformers import Qwen2_5_VLForConditionalGeneration
+from vllm_omni.diffusion.data import DiffusionOutput, OmniDiffusionConfig
+from vllm_omni.diffusion.distributed.utils import get_local_device
+from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader
+from vllm_omni.diffusion.models.qwen_image import QwenImagePipeline
+from vllm_omni.diffusion.request import OmniDiffusionRequest
+
+from verl.utils.diffusers.schedulers import FlowMatchSDEDiscreteScheduler
+from verl.utils.vllm_omni.pipelines.qwen_image.qwen_image_transformer import QwenImageTransformer2DModelFixed
+
+
+def _maybe_to_cpu(v):
+    if isinstance(v, torch.Tensor):
+        return v.detach().cpu()
+    return v
+
+
+class QwenImagePipelineWithLogProb(QwenImagePipeline):
+    def __init__(self, *, od_config: OmniDiffusionConfig, prefix: str = ""):
+        super(QwenImagePipeline, self).__init__()
+        self.od_config = od_config
+        self.parallel_config = od_config.parallel_config
+        self.weights_sources = [
+            DiffusersPipelineLoader.ComponentSource(
+                model_or_path=od_config.model,
+                subfolder="transformer",
+                revision=None,
+                prefix="transformer.",
+                fall_back_to_pt=True,
+            )
+        ]
+
+        self.device = get_local_device()
+        model = od_config.model
+        # Check if model is a local path
+        local_files_only = os.path.exists(model)
+
+        self.scheduler = FlowMatchSDEDiscreteScheduler.from_pretrained(
+            model, subfolder="scheduler", local_files_only=local_files_only
+        )
+        self.text_encoder = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+            model, subfolder="text_encoder", local_files_only=local_files_only
+        )
+        self.vae = AutoencoderKLQwenImage.from_pretrained(model, subfolder="vae", local_files_only=local_files_only).to(
+            self.device
+        )
+        self.transformer = QwenImageTransformer2DModelFixed(od_config=od_config)
+
+        self.stage = None
+
+        self.vae_scale_factor = 2 ** len(self.vae.temperal_downsample) if getattr(self, "vae", None) else 8
+        # QwenImage latents are turned into 2x2 patches and packed.
+        # This means the latent width and height have to be divisible
+        # by the patch size, so the VAE scale factor is multiplied by the patch size to account for this.
+        # self.image_processor = VaeImageProcessor(
+        #     vae_scale_factor=self.vae_scale_factor * 2
+        # )
+        self.prompt_template_encode_start_idx = 34
+        self.default_sample_size = 128
+
+    def _get_qwen_prompt_embeds(
+        self,
+        prompt_ids: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+        dtype: torch.dtype | None = None,
+    ):
+        dtype = dtype or self.text_encoder.dtype
+
+        if attention_mask is None:
+            attention_mask = torch.ones_like(prompt_ids, dtype=torch.long)
+
+        prompt_ids = prompt_ids.unsqueeze(0) if prompt_ids.ndim == 1 else prompt_ids
+        attention_mask = attention_mask.unsqueeze(0) if attention_mask.ndim == 1 else attention_mask
+        drop_idx = self.prompt_template_encode_start_idx
+        encoder_hidden_states = self.text_encoder(
+            input_ids=prompt_ids.to(self.device),
+            attention_mask=attention_mask.to(self.device),
+            output_hidden_states=True,
+        )
+        hidden_states = encoder_hidden_states.hidden_states[-1]
+        split_hidden_states = self._extract_masked_hidden(hidden_states, attention_mask)
+        split_hidden_states = [e[drop_idx:] for e in split_hidden_states]
+        attn_mask_list = [torch.ones(e.size(0), dtype=torch.long, device=e.device) for e in split_hidden_states]
+        max_seq_len = max([e.size(0) for e in split_hidden_states])
+        prompt_embeds = torch.stack(
+            [torch.cat([u, u.new_zeros(max_seq_len - u.size(0), u.size(1))]) for u in split_hidden_states]
+        )
+        encoder_attention_mask = torch.stack(
+            [torch.cat([u, u.new_zeros(max_seq_len - u.size(0))]) for u in attn_mask_list]
+        )
+
+        prompt_embeds = prompt_embeds.to(dtype=dtype)
+
+        return prompt_embeds, encoder_attention_mask
+
+    def encode_prompt(
+        self,
+        prompt_ids: torch.Tensor,
+        attention_mask: torch.Tensor | None = None,
+        num_images_per_prompt: int = 1,
+        prompt_embeds: torch.Tensor | None = None,
+        prompt_embeds_mask: torch.Tensor | None = None,
+        max_sequence_length: int = 1024,
+    ):
+        prompt_ids = prompt_ids.unsqueeze(0) if prompt_ids.ndim == 1 else prompt_ids
+        attention_mask = (
+            attention_mask.unsqueeze(0) if attention_mask is not None and attention_mask.ndim == 1 else attention_mask
+        )
+        batch_size = prompt_ids.shape[0] if prompt_embeds is None else prompt_embeds.shape[0]
+
+        if prompt_embeds is None:
+            prompt_embeds, prompt_embeds_mask = self._get_qwen_prompt_embeds(prompt_ids, attention_mask=attention_mask)
+
+        prompt_embeds = prompt_embeds[:, :max_sequence_length]
+        prompt_embeds_mask = prompt_embeds_mask[:, :max_sequence_length]
+
+        _, seq_len, _ = prompt_embeds.shape
+        prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
+        prompt_embeds = prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)
+        prompt_embeds_mask = prompt_embeds_mask.repeat(1, num_images_per_prompt, 1)
+        prompt_embeds_mask = prompt_embeds_mask.view(batch_size * num_images_per_prompt, seq_len)
+
+        return prompt_embeds, prompt_embeds_mask
+
+    def diffuse(
+        self,
+        prompt_embeds,
+        prompt_embeds_mask,
+        negative_prompt_embeds,
+        negative_prompt_embeds_mask,
+        latents,
+        img_shapes,
+        txt_seq_lens,
+        negative_txt_seq_lens,
+        timesteps,
+        do_true_cfg,
+        guidance,
+        true_cfg_scale,
+        noise_level,
+        sde_window,
+        sde_type,
+        generator,
+        logprobs,
+    ):
+        all_latents = []
+        all_log_probs = []
+        all_timesteps = []
+        self.scheduler.set_begin_index(0)
+        for i, t in enumerate(timesteps):
+            if self.interrupt:
+                continue
+
+            if i < sde_window[0]:
+                cur_noise_level = 0.0
+            elif i == sde_window[0]:
+                cur_noise_level = noise_level
+                all_latents.append(latents)
+            elif i > sde_window[0] and i < sde_window[1]:
+                cur_noise_level = noise_level
+            else:
+                cur_noise_level = 0.0
+
+            self._current_timestep = t
+
+            # Broadcast timestep to match batch size
+            timestep = t.expand(latents.shape[0]).to(device=latents.device, dtype=latents.dtype)
+
+            # Forward pass for positive prompt (or unconditional if no CFG)
+            self.transformer.do_true_cfg = do_true_cfg
+            noise_pred = self.transformer(
+                hidden_states=latents,
+                timestep=timestep / 1000,
+                guidance=guidance,
+                encoder_hidden_states_mask=prompt_embeds_mask,
+                encoder_hidden_states=prompt_embeds,
+                img_shapes=img_shapes,
+                txt_seq_lens=txt_seq_lens,
+                attention_kwargs=self.attention_kwargs,
+                return_dict=False,
+            )[0]
+            # Forward pass for negative prompt (CFG)
+            if do_true_cfg:
+                neg_noise_pred = self.transformer(
+                    hidden_states=latents,
+                    timestep=timestep / 1000,
+                    guidance=guidance,
+                    encoder_hidden_states_mask=negative_prompt_embeds_mask,
+                    encoder_hidden_states=negative_prompt_embeds,
+                    img_shapes=img_shapes,
+                    txt_seq_lens=negative_txt_seq_lens,
+                    attention_kwargs=self.attention_kwargs,
+                    return_dict=False,
+                )[0]
+                comb_pred = neg_noise_pred + true_cfg_scale * (noise_pred - neg_noise_pred)
+                cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
+                noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
+                noise_pred = comb_pred * (cond_norm / noise_norm)
+            # compute the previous noisy sample x_t -> x_t-1
+            latents, log_prob, _, _ = self.scheduler.step(
+                noise_pred,
+                t,
+                latents,
+                generator=generator,
+                noise_level=cur_noise_level,
+                sde_type=sde_type,
+                logprobs=logprobs,
+                return_dict=False,
+            )
+
+            if i >= sde_window[0] and i < sde_window[1]:
+                all_latents.append(latents)
+                all_log_probs.append(log_prob)
+                all_timesteps.append(t)
+
+        all_latents = torch.stack(all_latents, dim=1)
+
+        if all_log_probs[0] is not None:
+            all_log_probs = torch.stack(all_log_probs, dim=1)
+        else:
+            all_log_probs = None
+
+        all_timesteps = torch.stack(all_timesteps).unsqueeze(0).expand(latents.shape[0], -1)
+
+        return latents, all_latents, all_log_probs, all_timesteps
+
+    def forward(
+        self,
+        req: OmniDiffusionRequest,
+        prompt_ids: torch.Tensor | list[int] | None = None,
+        prompt_mask: torch.Tensor | None = None,
+        negative_prompt_ids: torch.Tensor | list[int] | None = None,
+        negative_prompt_mask: torch.Tensor | None = None,
+        true_cfg_scale: float = 4.0,
+        height: int | None = None,
+        width: int | None = None,
+        num_inference_steps: int = 50,
+        sigmas: list[float] | None = None,
+        guidance_scale: float = 1.0,
+        num_images_per_prompt: int = 1,
+        generator: torch.Generator | list[torch.Generator] | None = None,
+        latents: torch.Tensor | None = None,
+        prompt_embeds: torch.Tensor | None = None,
+        prompt_embeds_mask: torch.Tensor | None = None,
+        negative_prompt_embeds: torch.Tensor | None = None,
+        negative_prompt_embeds_mask: torch.Tensor | None = None,
+        output_type: str | None = "pil",
+        attention_kwargs: dict[str, Any] | None = None,
+        callback_on_step_end_tensor_inputs: tuple[str, ...] = ("latents",),
+        max_sequence_length: int = 512,
+        noise_level: float = 0.7,
+        sde_window_size: int | None = None,
+        sde_window_range: tuple[int, int] = (0, 5),
+        sde_type: Literal["sde", "cps"] = "sde",
+        logprobs: bool = True,
+    ) -> DiffusionOutput:
+        # Extract prompt data from OmniCustomPrompt in req.prompts[0]
+        custom_prompt = req.prompts[0] if req.prompts else {}
+        if isinstance(custom_prompt, dict):
+            prompt_ids = custom_prompt.get("prompt_ids", prompt_ids)
+            prompt_mask = custom_prompt.get("prompt_mask", prompt_mask)
+            negative_prompt_ids = custom_prompt.get("negative_prompt_ids", negative_prompt_ids)
+            negative_prompt_mask = custom_prompt.get("negative_prompt_mask", negative_prompt_mask)
+
+        # Read sampling params from req.sampling_params
+        sp = req.sampling_params
+        height = sp.height or self.default_sample_size * self.vae_scale_factor
+        width = sp.width or self.default_sample_size * self.vae_scale_factor
+        num_inference_steps = sp.num_inference_steps or num_inference_steps
+        max_sequence_length = sp.max_sequence_length or max_sequence_length
+
+        noise_level = sp.extra_args.get("noise_level", None) or noise_level
+        sde_window_size = sp.extra_args.get("sde_window_size", None) or sde_window_size
+        sde_window_range = sp.extra_args.get("sde_window_range", None) or sde_window_range
+        sde_type = sp.extra_args.get("sde_type", None) or sde_type
+        logprobs = sp.extra_args.get("logprobs", logprobs)
+
+        generator = sp.generator or generator
+        if generator is None and sp.seed is not None:
+            generator = torch.Generator(device=self.device).manual_seed(sp.seed)
+        true_cfg_scale = sp.true_cfg_scale or true_cfg_scale
+        req_num_outputs = getattr(sp, "num_outputs_per_prompt", None)
+        if req_num_outputs and req_num_outputs > 0:
+            num_images_per_prompt = req_num_outputs
+
+        self._guidance_scale = guidance_scale
+        self._attention_kwargs = attention_kwargs
+        self._current_timestep = None
+        self._interrupt = False
+
+        if prompt_ids is not None:
+            if isinstance(prompt_ids, list):
+                prompt_ids = torch.tensor(prompt_ids, device=self.device)
+            batch_size = prompt_ids.shape[0] if prompt_ids.ndim == 2 else 1
+        elif prompt_embeds is not None:
+            batch_size = prompt_embeds.shape[0]
+        else:
+            # Both prompt_ids and prompt_embeds are None (e.g. during warmup/dummy run).
+            # Return a minimal dummy output to avoid crashing.
+            return DiffusionOutput(output=None, custom_output={})
+
+        if isinstance(negative_prompt_ids, list):
+            negative_prompt_ids = torch.tensor(negative_prompt_ids, device=self.device)
+
+        has_neg_prompt = negative_prompt_ids is not None or (
+            negative_prompt_embeds is not None and negative_prompt_embeds_mask is not None
+        )
+
+        do_true_cfg = true_cfg_scale > 1 and has_neg_prompt
+        prompt_embeds, prompt_embeds_mask = self.encode_prompt(
+            prompt_ids=prompt_ids,
+            attention_mask=prompt_mask,
+            prompt_embeds=prompt_embeds,
+            prompt_embeds_mask=prompt_embeds_mask,
+            num_images_per_prompt=num_images_per_prompt,
+            max_sequence_length=max_sequence_length,
+        )
+        if do_true_cfg:
+            negative_prompt_embeds, negative_prompt_embeds_mask = self.encode_prompt(
+                prompt_ids=negative_prompt_ids,
+                attention_mask=negative_prompt_mask,
+                prompt_embeds=negative_prompt_embeds,
+                prompt_embeds_mask=negative_prompt_embeds_mask,
+                num_images_per_prompt=num_images_per_prompt,
+                max_sequence_length=max_sequence_length,
+            )
+
+        num_channels_latents = self.transformer.in_channels // 4
+        latents = self.prepare_latents(
+            batch_size * num_images_per_prompt,
+            num_channels_latents,
+            height,
+            width,
+            prompt_embeds.dtype,
+            self.device,
+            generator,
+            latents,
+        )
+        img_shapes = [[(1, height // self.vae_scale_factor // 2, width // self.vae_scale_factor // 2)]] * batch_size
+
+        timesteps, num_inference_steps = self.prepare_timesteps(num_inference_steps, sigmas, latents.shape[1])
+        # num_warmup_steps = max(len(timesteps) - num_inference_steps * self.scheduler.order, 0)
+        self._num_timesteps = len(timesteps)
+
+        # handle guidance
+        if self.transformer.guidance_embeds:
+            guidance = torch.full([1], guidance_scale, dtype=torch.float32)
+            guidance = guidance.expand(latents.shape[0])
+        else:
+            guidance = None
+
+        if self.attention_kwargs is None:
+            self._attention_kwargs = {}
+
+        txt_seq_lens = prompt_embeds_mask.sum(dim=1).tolist() if prompt_embeds_mask is not None else None
+        negative_txt_seq_lens = (
+            negative_prompt_embeds_mask.sum(dim=1).tolist() if negative_prompt_embeds_mask is not None else None
+        )
+
+        if sde_window_size is not None:
+            start = torch.randint(
+                sde_window_range[0],
+                sde_window_range[1] - sde_window_size + 1,
+                (1,),
+                generator=generator,
+                device=self.device,
+            ).item()
+            end = start + sde_window_size
+            sde_window = (start, end)
+        else:
+            sde_window = (0, len(timesteps) - 1)
+
+        latents, all_latents, all_log_probs, all_timesteps = self.diffuse(
+            prompt_embeds,
+            prompt_embeds_mask,
+            negative_prompt_embeds,
+            negative_prompt_embeds_mask,
+            latents,
+            img_shapes,
+            txt_seq_lens,
+            negative_txt_seq_lens,
+            timesteps,
+            do_true_cfg,
+            guidance,
+            true_cfg_scale,
+            noise_level,
+            sde_window,
+            sde_type,
+            generator,
+            logprobs,
+        )
+
+        self._current_timestep = None
+        if output_type == "latent":
+            image = latents
+        else:
+            latents = self._unpack_latents(latents, height, width, self.vae_scale_factor)
+            latents = latents.to(self.vae.dtype)
+            latents_mean = (
+                torch.tensor(self.vae.config.latents_mean)
+                .view(1, self.vae.config.z_dim, 1, 1, 1)
+                .to(latents.device, latents.dtype)
+            )
+            latents_std = 1.0 / torch.tensor(self.vae.config.latents_std).view(1, self.vae.config.z_dim, 1, 1, 1).to(
+                latents.device, latents.dtype
+            )
+            latents = latents / latents_std + latents_mean
+            image = self.vae.decode(latents, return_dict=False)[0][:, :, 0]
+
+        return DiffusionOutput(
+            output=_maybe_to_cpu(image),
+            custom_output={
+                "all_latents": _maybe_to_cpu(all_latents),
+                "all_log_probs": _maybe_to_cpu(all_log_probs),
+                "all_timesteps": _maybe_to_cpu(all_timesteps),
+                "prompt_embeds": _maybe_to_cpu(prompt_embeds),
+                "prompt_embeds_mask": _maybe_to_cpu(prompt_embeds_mask),
+                "negative_prompt_embeds": _maybe_to_cpu(negative_prompt_embeds),
+                "negative_prompt_embeds_mask": _maybe_to_cpu(negative_prompt_embeds_mask),
+            },
+        )
diff --git a/verl/utils/vllm_omni/pipelines/qwen_image/__init__.py b/verl/utils/vllm_omni/pipelines/qwen_image/__init__.py
new file mode 100644
index 00000000000..1cd1e8433df
--- /dev/null
+++ b/verl/utils/vllm_omni/pipelines/qwen_image/__init__.py
@@ -0,0 +1,13 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/verl/utils/vllm_omni/pipelines/qwen_image/qwen_image_transformer.py b/verl/utils/vllm_omni/pipelines/qwen_image/qwen_image_transformer.py
new file mode 100644
index 00000000000..8742da447e5
--- /dev/null
+++ b/verl/utils/vllm_omni/pipelines/qwen_image/qwen_image_transformer.py
@@ -0,0 +1,93 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import torch.nn as nn
+from diffusers.models.normalization import AdaLayerNormContinuous
+from vllm.model_executor.layers.layernorm import RMSNorm
+from vllm_omni.diffusion.data import OmniDiffusionConfig
+from vllm_omni.diffusion.models.qwen_image.qwen_image_transformer import (
+    ImageRopePrepare,
+    ModulateIndexPrepare,
+    QwenEmbedLayer3DRope,
+    QwenEmbedRope,
+    QwenImageTransformer2DModel,
+    QwenImageTransformerBlock,
+    QwenTimestepProjEmbeddings,
+)
+
+
+class QwenImageTransformer2DModelFixed(QwenImageTransformer2DModel):
+    def __init__(
+        self,
+        od_config: OmniDiffusionConfig,
+        patch_size: int = 2,
+        in_channels: int = 64,
+        out_channels: int | None = 16,
+        num_layers: int = 60,
+        attention_head_dim: int = 128,
+        num_attention_heads: int = 24,
+        joint_attention_dim: int = 3584,
+        guidance_embeds: bool = False,  # TODO: this should probably be removed
+        axes_dims_rope: tuple[int, int, int] = (16, 56, 56),
+        zero_cond_t: bool = False,
+        use_additional_t_cond: bool = False,
+        use_layer3d_rope: bool = False,
+    ):
+        super(QwenImageTransformer2DModel, self).__init__()
+        self.parallel_config = od_config.parallel_config
+        model_config = od_config.tf_model_config
+        self.num_layers = model_config.num_layers
+        self.attention_head_dim = model_config.attention_head_dim
+        self.num_attention_heads = model_config.num_attention_heads
+        self.joint_attention_dim = model_config.joint_attention_dim
+        self.in_channels = model_config.in_channels
+        self.out_channels = model_config.out_channels or self.in_channels
+        self.inner_dim = self.num_attention_heads * self.attention_head_dim
+        self.guidance_embeds = model_config.guidance_embeds
+        self.axes_dims_rope = model_config.axes_dims_rope
+
+        if not use_layer3d_rope:
+            self.pos_embed = QwenEmbedRope(theta=10000, axes_dim=list(self.axes_dims_rope), scale_rope=True)
+        else:
+            self.pos_embed = QwenEmbedLayer3DRope(theta=10000, axes_dim=list(self.axes_dims_rope), scale_rope=True)
+
+        self.time_text_embed = QwenTimestepProjEmbeddings(
+            embedding_dim=self.inner_dim, use_additional_t_cond=use_additional_t_cond
+        )
+
+        self.txt_norm = RMSNorm(self.joint_attention_dim, eps=1e-6)
+
+        self.img_in = nn.Linear(in_channels, self.inner_dim)
+        self.txt_in = nn.Linear(self.joint_attention_dim, self.inner_dim)
+
+        self.transformer_blocks = nn.ModuleList(
+            [
+                QwenImageTransformerBlock(
+                    dim=self.inner_dim,
+                    num_attention_heads=self.num_attention_heads,
+                    attention_head_dim=self.attention_head_dim,
+                    zero_cond_t=zero_cond_t,
+                )
+                for _ in range(self.num_layers)
+            ]
+        )
+
+        self.norm_out = AdaLayerNormContinuous(self.inner_dim, self.inner_dim, elementwise_affine=False, eps=1e-6)
+        self.proj_out = nn.Linear(self.inner_dim, patch_size * patch_size * self.out_channels, bias=True)
+
+        self.gradient_checkpointing = False
+        self.zero_cond_t = zero_cond_t
+
+        self.image_rope_prepare = ImageRopePrepare(self.img_in, self.pos_embed)
+        self.modulate_index_prepare = ModulateIndexPrepare(zero_cond_t=zero_cond_t)
diff --git a/verl/workers/config/model.py b/verl/workers/config/model.py
index 9205a99f038..0b49c4d8e20 100644
--- a/verl/workers/config/model.py
+++ b/verl/workers/config/model.py
@@ -11,7 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
+import os
 from dataclasses import dataclass, field
 from typing import Any, Optional
 
@@ -24,7 +24,7 @@
 from verl.utils.import_utils import import_external_libs
 from verl.utils.model import get_generation_config, update_model_config
 
-__all__ = ["HFModelConfig", "MtpConfig"]
+__all__ = ["HFModelConfig", "DiffusersModelConfig", "MtpConfig"]
 
 
 @dataclass
@@ -220,3 +220,110 @@ def __post_init__(self):
 
     def get_processor(self):
         return self.processor if self.processor is not None else self.tokenizer
+
+
+@dataclass
+class DiffusersModelConfig(BaseConfig):
+    _mutable_fields = {
+        "tokenizer_path",
+        "tokenizer",
+        "processor",
+        "local_path",
+        "local_tokenizer_path",
+    }
+
+    path: str = MISSING
+    local_path: Optional[str] = None
+    tokenizer_path: Optional[str] = None
+    local_tokenizer_path: Optional[str] = None
+    model_type: str = "diffusion_model"
+
+    # whether to load tokenizer. This is useful when we only want to load model config
+    load_tokenizer: bool = True
+
+    tokenizer: Any = None
+    processor: Any = None
+
+    # whether to use shared memory
+    use_shm: bool = False
+    trust_remote_code: bool = False
+
+    # custom chat template for the model
+    custom_chat_template: Optional[str] = None
+
+    external_lib: Optional[str] = None
+
+    override_config: dict = field(default_factory=dict)
+
+    enable_gradient_checkpointing: bool = False
+    enable_activation_offload: bool = False
+
+    use_remove_padding: bool = True
+
+    # LoRA related. We may set up a separate config later
+    lora_rank: int = 32
+    lora_alpha: int = 64
+    lora_init_weights: str = "gaussian"
+    target_modules: Optional[Any] = "all-linear"  # allow both "all-linear" and ["q_proj","k_proj"]
+    target_parameters: Optional[list[str]] = None  # for lora adapter on nn.Parameter
+
+    exclude_modules: Optional[str] = None
+
+    # megatron lora config
+    lora: dict[str, Any] = field(default_factory=dict)
+
+    # path to pre-trained LoRA adapter to load for continued training
+    lora_adapter_path: Optional[str] = None
+    use_liger: bool = False
+
+    # optimization related
+    use_fused_kernels: bool = False
+    fused_kernel_options: dict = field(default_factory=dict)
+
+    # TiledMLP configuration for memory-efficient MLP computation
+    tiled_mlp: dict = field(default_factory=lambda: {"enabled": False, "num_shards": 4})
+
+    # sample related
+    image_height: int = 512
+    image_width: int = 512
+    num_inference_steps: int = 10
+    noise_level: float = 0.7
+    guidance_scale: float = 4.5
+    sde_type: str = "sde"  # "sde" or "cps"
+
+    def __post_init__(self):
+        import_external_libs(self.external_lib)
+        if self.tokenizer_path is None:
+            self.tokenizer_path = os.path.join(self.path, "tokenizer")
+        self.local_path = copy_to_local(self.path, use_shm=self.use_shm)
+
+        # construct tokenizer
+        if self.load_tokenizer:
+            self.local_tokenizer_path = copy_to_local(self.tokenizer_path, use_shm=self.use_shm)
+            # see issue https://github.com/huggingface/tokenizers/issues/537, we use a non-fast tokenizer here
+            self.tokenizer = hf_tokenizer(
+                self.local_tokenizer_path, trust_remote_code=self.trust_remote_code, use_fast=False
+            )
+            if os.path.exists(os.path.join(self.local_path, "processor")):
+                self.processor = hf_processor(
+                    os.path.join(self.local_path, "processor"), trust_remote_code=self.trust_remote_code
+                )
+            else:
+                self.processor = None
+
+        # Ensure target_modules is a str or list[str] (only if not None)
+        if self.target_modules is not None:
+            if not isinstance(self.target_modules, (str, list)):
+                raise TypeError(
+                    "target_modules must be a string or a list of strings, "
+                    f"but got {type(self.target_modules).__name__}"
+                )
+            if isinstance(self.target_modules, list):
+                for x in self.target_modules:
+                    if not isinstance(x, str):
+                        raise TypeError(
+                            f"All elements in target_modules list must be strings, but found {type(x).__name__}"
+                        )
+
+    def get_processor(self):
+        return self.processor if self.processor is not None else self.tokenizer
diff --git a/verl/workers/config/rollout.py b/verl/workers/config/rollout.py
index ec4f766ffdc..d353a891949 100644
--- a/verl/workers/config/rollout.py
+++ b/verl/workers/config/rollout.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 import warnings
 from dataclasses import dataclass, field
-from typing import Optional
+from typing import Literal, Optional
 
 from omegaconf import MISSING
 
@@ -23,6 +23,7 @@
 
 __all__ = [
     "SamplingConfig",
+    "DiffusionSamplingConfig",
     "MultiTurnConfig",
     "CustomAsyncServerConfig",
     "AgentLoopConfig",
@@ -30,6 +31,7 @@
     "ServerConfig",
     "PrometheusConfig",
     "RolloutConfig",
+    "DiffusionRolloutConfig",
     "CheckpointEngineConfig",
 ]
 
@@ -43,6 +45,15 @@ class SamplingConfig(BaseConfig):
     n: int = 1
 
 
+@dataclass
+class DiffusionSamplingConfig(BaseConfig):
+    do_sample: bool = True
+    n: int = 1
+    noise_level: float = 0.0
+    num_inference_steps: int = 40
+    seed: int = 42
+
+
 @dataclass
 class MultiTurnConfig(BaseConfig):
     _mutable_fields = {"max_assistant_turns", "max_user_turns"}
@@ -291,3 +302,132 @@ def __post_init__(self):
                 raise NotImplementedError(
                     f"Current rollout {self.name=} not implemented pipeline_model_parallel_size > 1 yet."
                 )
+
+
+@dataclass
+class DiffusionRolloutConfig(BaseConfig):
+    _mutable_fields = {"max_model_len", "load_format"}
+
+    name: Optional[str] = MISSING
+    mode: str = "async"
+    nnodes: int = 0
+    n_gpus_per_node: int = 8
+
+    do_sample: bool = True
+    n: int = 1
+
+    # Early termination threshold for multi-turn rollout in sglang.
+    # Abort remaining requests when (1 - over_sample_rate) * total_requests are completed.
+    over_sample_rate: float = 0.0
+
+    prompt_length: int = 512
+    # response_length: int = 512
+
+    dtype: str = "bfloat16"
+    gpu_memory_utilization: float = 0.5
+    enforce_eager: bool = True
+    cudagraph_capture_sizes: Optional[list] = None
+    free_cache_engine: bool = True
+    data_parallel_size: int = 1
+    expert_parallel_size: int = 1
+    tensor_model_parallel_size: int = 2
+    pipeline_model_parallel_size: int = 1
+    max_num_batched_tokens: int = 8192
+    logprobs_mode: Optional[str] = "processed_logprobs"
+    scheduling_policy: Optional[str] = "fcfs"
+
+    val_kwargs: DiffusionSamplingConfig = field(default_factory=DiffusionSamplingConfig)
+
+    max_model_len: Optional[int] = None
+    max_num_seqs: int = 1024
+
+    # note that the logprob computation should belong to the actor
+    log_prob_micro_batch_size: Optional[int] = None
+    log_prob_micro_batch_size_per_gpu: Optional[int] = None
+    log_prob_use_dynamic_bsz: bool = False
+    log_prob_max_token_len_per_gpu: int = 16384
+
+    disable_log_stats: bool = True
+
+    multi_stage_wake_up: bool = False
+    engine_kwargs: dict = field(default_factory=dict)
+
+    calculate_log_probs: bool = False
+
+    agent: AgentLoopConfig = field(default_factory=AgentLoopConfig)
+
+    trace: TraceConfig = field(default_factory=TraceConfig)
+
+    multi_turn: MultiTurnConfig = field(default_factory=MultiTurnConfig)
+
+    # Use Prometheus to collect and monitor rollout statistics
+    prometheus: PrometheusConfig = field(default_factory=PrometheusConfig)
+
+    # Checkpoint Engine config for updating weights from trainer to rollout
+    checkpoint_engine: CheckpointEngineConfig = field(default_factory=CheckpointEngineConfig)
+
+    profiler: Optional[ProfilerConfig] = None
+
+    enable_chunked_prefill: bool = True
+
+    enable_prefix_caching: bool = True
+
+    load_format: str = "dummy"
+
+    layered_summon: bool = False
+
+    limit_images: Optional[int] = None
+
+    quantization: Optional[str] = None
+
+    quantization_config_file: Optional[str] = None
+
+    enable_rollout_routing_replay: bool = False
+
+    enable_sleep_mode: bool = True
+
+    qat: Optional[dict] = None
+
+    # diffusion-specific settings
+    image_height: int = 512
+
+    image_width: int = 512
+
+    num_inference_steps: int = 10
+
+    noise_level: float = 0.7
+
+    guidance_scale: float = 4.5
+
+    sde_type: Literal["sde", "cps"] = "sde"
+
+    sde_window_size: Optional[int] = None
+
+    sde_window_range: Optional[tuple[int, int]] = None
+
+    def __post_init__(self):
+        """Validate the rollout config"""
+        # Deprecation warning for mode field - only async mode is supported
+        if self.mode == "sync":
+            raise ValueError(
+                "Rollout mode 'sync' has been removed. Please set "
+                "`actor_rollout_ref.rollout.mode=async` or remove the mode setting entirely."
+            )
+        if self.mode != "async":
+            warnings.warn(
+                f"Unknown rollout mode '{self.mode}'. Only 'async' mode is supported. "
+                "The 'mode' field is deprecated and will be removed in a future version.",
+                DeprecationWarning,
+                stacklevel=2,
+            )
+
+        if self.expert_parallel_size > 1:
+            assert self.expert_parallel_size == (self.tensor_model_parallel_size * self.data_parallel_size), (
+                "expert_parallel_size must be equal to tensor_model_parallel_size * data_parallel_size"
+            )
+
+        if self.pipeline_model_parallel_size > 1:
+            if self.name in ("vllm_omni", "vllm", "sglang", "trtllm"):
+                raise NotImplementedError(
+                    f"Current rollout {self.name=} not implemented pipeline_model_parallel_size > 1 yet."
+                )
diff --git a/verl/workers/engine/__init__.py b/verl/workers/engine/__init__.py
index 8f01080fdcb..b36d295a9a6 100644
--- a/verl/workers/engine/__init__.py
+++ b/verl/workers/engine/__init__.py
@@ -12,13 +12,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from .base import BaseEngine, EngineRegistry
-from .fsdp import FSDPEngine, FSDPEngineWithLMHead
+from .fsdp import DiffusersFSDPEngine, FSDPEngine, FSDPEngineWithLMHead
 
 __all__ = [
     "BaseEngine",
     "EngineRegistry",
     "FSDPEngine",
     "FSDPEngineWithLMHead",
+    "DiffusersFSDPEngine",
 ]
 
 try:
diff --git a/verl/workers/engine/fsdp/__init__.py b/verl/workers/engine/fsdp/__init__.py
index a1bdb16b47c..167b016d3ca 100644
--- a/verl/workers/engine/fsdp/__init__.py
+++ b/verl/workers/engine/fsdp/__init__.py
@@ -11,6 +11,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from .diffusers_impl import DiffusersFSDPEngine
 from .transformer_impl import FSDPEngine, FSDPEngineWithLMHead
 
-__all__ = ["FSDPEngine", "FSDPEngineWithLMHead"]
+__all__ = ["FSDPEngine", "FSDPEngineWithLMHead", "DiffusersFSDPEngine"]
diff --git a/verl/workers/engine/fsdp/diffusers_impl.py b/verl/workers/engine/fsdp/diffusers_impl.py
new file mode 100644
index 00000000000..6512dcc0f13
--- /dev/null
+++ b/verl/workers/engine/fsdp/diffusers_impl.py
@@ -0,0 +1,970 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+The concrete Engine implementation using PyTorch FullyShardedDataParallel (FSDP)
+"""
+
+import gc
+import json
+import logging
+import os
+import warnings
+from contextlib import contextmanager, nullcontext
+from typing import Callable, Optional
+
+import torch
+import torch.distributed
+from peft import LoraConfig
+from tensordict import TensorDict
+from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+from torch.distributed.fsdp.api import FullStateDictConfig, ShardedStateDictConfig, StateDictType
+from torch.distributed.tensor import DTensor
+
+from verl.trainer.config import CheckpointConfig
+from verl.utils import tensordict_utils as tu
+from verl.utils.activation_offload import enable_activation_offloading
+from verl.utils.checkpoint.fsdp_checkpoint_manager import FSDPCheckpointManager
+from verl.utils.debug import log_gpu_memory_usage
+from verl.utils.device import get_device_id, get_device_name
+from verl.utils.fsdp_utils import (
+    CPUOffloadPolicy,
+    FSDPModule,
+    MixedPrecisionPolicy,
+    apply_fsdp2,
+    collect_lora_params,
+    fsdp2_clip_grad_norm_,
+    fsdp2_load_full_state_dict,
+    fsdp_version,
+    get_fsdp_wrap_policy,
+    get_init_weight_context_manager,
+    init_fn,
+    load_fsdp_model_to_gpu,
+    load_fsdp_optimizer,
+    merged_lora_context,
+    normalize_peft_param_name,
+    offload_fsdp_model_to_cpu,
+    offload_fsdp_optimizer,
+    replace_lora_wrapper,
+)
+from verl.utils.model import convert_weight_keys
+from verl.utils.py_functional import convert_to_regular_types
+from verl.utils.ulysses import get_ulysses_sequence_parallel_group, set_ulysses_sequence_parallel_group
+from verl.workers.config import DiffusersModelConfig, FSDPEngineConfig, FSDPOptimizerConfig
+
+from ..base import BaseEngine, BaseEngineCtx, EngineRegistry
+from ..utils import enable_full_determinism, prepare_micro_batches
+from .utils import create_device_mesh, get_sharding_strategy
+
+logger = logging.getLogger(__file__)
+logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "WARN"))
+
+device_name = get_device_name()
+
+
+@EngineRegistry.register(model_type="diffusion_model", backend=["fsdp", "fsdp2"], device=["cuda", "npu"])
+class DiffusersFSDPEngine(BaseEngine):
+    """
+    Concrete Diffusers Engine implementation using PyTorch FullyShardedDataParallel (FSDP).
+
+    Supports model sharding, activation/optimizer offloading, and LoRA. Sequence parallelism is not supported yet.
+    """
+
+    def __init__(
+        self,
+        model_config: DiffusersModelConfig,
+        engine_config: FSDPEngineConfig,
+        optimizer_config: FSDPOptimizerConfig,
+        checkpoint_config: CheckpointConfig,
+    ):
+        """
+        Initialize the DiffusersFSDPEngine.
+
+        Sets up distributed device meshes, LoRA, and offload policies based on config.
+
+        Args:
+            model_config, engine_config, optimizer_config, checkpoint_config: Settings for each component.
+        """
+        super().__init__()
+
+        self.model_config = model_config
+        self.engine_config = engine_config
+        self.optimizer_config = optimizer_config
+        self.checkpoint_config = checkpoint_config
+
+        self.mode = None
+
+        self.rank = torch.distributed.get_rank()
+
+        # Apply NPU patches for FSDP backend
+        from .utils import apply_npu_fsdp_patches
+
+        apply_npu_fsdp_patches()
+
+        # build device mesh for Ulysses Sequence Parallel
+
+        self._init_device_mesh()
+
+        if self.engine_config.full_determinism:
+            enable_full_determinism(seed=self.engine_config.seed)
+
+        # set FSDP offload params
+        self._is_offload_param = self.engine_config.param_offload
+        self._is_offload_optimizer = self.engine_config.optimizer_offload
+        self._is_lora = self.model_config.lora_rank > 0
+        self._guidance_scale = self.model_config.guidance_scale
+
+        # QAT (Quantization-Aware Training)
+        self._qat_config = getattr(self.engine_config, "qat", None)
+        self._qat_enabled = self._qat_config is not None and getattr(self._qat_config, "enable", False)
+        if self._qat_enabled:
+            raise NotImplementedError("Quantization-Aware Training (QAT) is not supported yet.")
+
+    @property
+    def is_param_offload_enabled(self) -> bool:
+        return self._is_offload_param
+
+    @property
+    def is_optimizer_offload_enabled(self) -> bool:
+        return self._is_offload_optimizer
+
+    def is_mp_src_rank_with_outputs(self):
+        if self.ulysses_device_mesh is not None:
+            is_collect = self.ulysses_device_mesh["sp"].get_local_rank() == 0
+        else:
+            is_collect = True
+        return is_collect
+
+    def initialize(self):
+        """
+        Build the model, optimizer, and learning rate scheduler under FSDP.
+
+        Applies device, dtype, and precision configurations, including mixed precision.
+        Sets up the checkpoint manager and offloads model/optimizer state to CPU when configured.
+        """
+        # Build the module, optimizer, and learning rate scheduler
+        self._build_model_optimizer()
+
+        self.checkpoint_manager = FSDPCheckpointManager(
+            model=self.module,
+            optimizer=self.optimizer,
+            lr_scheduler=self.lr_scheduler,
+            processing_class=self.model_config.get_processor(),
+            checkpoint_config=self.checkpoint_config,
+            trust_remote_code=self.model_config.trust_remote_code,
+        )
+
+        self.to(
+            device="cpu",
+            model=self._is_offload_param,
+            optimizer=self._is_offload_optimizer,
+            grad=self._is_offload_param,
+        )
+
+        log_gpu_memory_usage("After offload model/optimizer/grad during init", logger=logger)
+
+    def _init_device_mesh(self):
+        world_size = torch.distributed.get_world_size()
+        from torch.distributed.device_mesh import init_device_mesh
+
+        fsdp_size = self.engine_config.fsdp_size
+
+        self.device_mesh = create_device_mesh(world_size=world_size, fsdp_size=fsdp_size)
+        self.ulysses_device_mesh = None
+        self.ulysses_parallel_group = None
+        self.ulysses_sequence_parallel_size = self.engine_config.ulysses_sequence_parallel_size
+        dp_size = self.get_data_parallel_size()
+        if self.ulysses_sequence_parallel_size > 1:
+            self.ulysses_device_mesh = init_device_mesh(
+                device_name, mesh_shape=(dp_size, self.ulysses_sequence_parallel_size), mesh_dim_names=["dp", "sp"]
+            )
+            self.ulysses_parallel_group = self.ulysses_device_mesh["sp"].get_group()
+            raise NotImplementedError("Ulysses sequence parallel is not supported yet.")
+
+        self.use_ulysses_sp = self.ulysses_sequence_parallel_size > 1
+
+    def _build_module(self):
+        from diffusers import AutoModel
+
+        from verl.utils.torch_dtypes import PrecisionType
+
+        # for checkpoint saving
+        def save_config(self, save_directory: str | os.PathLike):
+            output_config_file = os.path.join(save_directory, "config.json")
+            with open(output_config_file, "w", encoding="utf-8") as f:
+                json.dump(self, f, indent=4, sort_keys=True)
+
+        torch_dtype = self.engine_config.model_dtype
+
+        if torch_dtype is None:
+            # if it is training, we force torch_dtype to fp32
+            torch_dtype = torch.float32 if not self.engine_config.forward_only else torch.bfloat16
+
+        torch_dtype = PrecisionType.to_dtype(torch_dtype)
+
+        init_context = get_init_weight_context_manager(use_meta_tensor=True, mesh=self.device_mesh)
+
+        with init_context(), warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+
+            module = AutoModel.from_pretrained(
+                self.model_config.local_path,
+                torch_dtype=torch_dtype,
+                trust_remote_code=self.model_config.trust_remote_code,
+                subfolder="transformer",
+            )
+
+            use_liger = self.model_config.use_liger
+            # Apply Liger kernel to the model if use_liger is set to True
+            if use_liger:
+                raise NotImplementedError("Liger kernel is not supported yet.")
+
+            use_fused_kernels = self.model_config.use_fused_kernels
+            if use_fused_kernels:
+                module.fuse_qkv_projections()
+
+            # some parameters may not be in torch_dtype
+            module.to(torch_dtype)
+
+            if self.model_config.enable_gradient_checkpointing:
+                module.enable_gradient_checkpointing()
+
+            # for checkpoint saving
+            module.can_generate = lambda: False
+            module.config.save_pretrained = save_config.__get__(module.config)
+
+        return module
+
+    def _build_lora_module(self, module):
+        lora_adapter_path = getattr(self.model_config, "lora_adapter_path", None)
+        if lora_adapter_path is not None:
+            from verl.utils.fs import copy_to_local
+
+            print(f"Loading pre-trained LoRA adapter from: {lora_adapter_path}")
+            # Copy adapter to local if needed
+            local_adapter_path = copy_to_local(lora_adapter_path, use_shm=self.model_config.use_shm)
+
+            module.load_lora_adapter(local_adapter_path)
+        else:
+            # Convert config to regular Python types before creating PEFT model
+            lora_config = {
+                "r": self.model_config.lora_rank,
+                "lora_alpha": self.model_config.lora_alpha,
+                "init_lora_weights": self.model_config.lora_init_weights,
+                "target_modules": convert_to_regular_types(self.model_config.target_modules),
+                "target_parameters": convert_to_regular_types(self.model_config.target_parameters),
+                "exclude_modules": convert_to_regular_types(self.model_config.exclude_modules),
+                "bias": "none",
+            }
+            module.add_adapter(LoraConfig(**lora_config))
+
+        return module
+
+    def _build_fsdp_module(self, module):
+        # TODO(ziheng): need to improve
+        from torch.distributed.fsdp import CPUOffload, MixedPrecision
+
+        from verl.utils.torch_dtypes import PrecisionType
+
+        mixed_precision_config = self.engine_config.mixed_precision
+        if mixed_precision_config is not None:
+            param_dtype = PrecisionType.to_dtype(mixed_precision_config.get("param_dtype", "bf16"))
+            reduce_dtype = PrecisionType.to_dtype(mixed_precision_config.get("reduce_dtype", "fp32"))
+            buffer_dtype = PrecisionType.to_dtype(mixed_precision_config.get("buffer_dtype", "fp32"))
+        else:
+            param_dtype = torch.bfloat16
+            reduce_dtype = torch.float32
+            buffer_dtype = torch.float32
+
+        mixed_precision = MixedPrecision(param_dtype=param_dtype, reduce_dtype=reduce_dtype, buffer_dtype=buffer_dtype)
+
+        auto_wrap_policy = get_fsdp_wrap_policy(
+            module=module,
+            config=self.engine_config.wrap_policy,
+            is_lora=self.model_config.lora_rank > 0,
+        )
+
+        fsdp_mesh = self.device_mesh
+        sharding_strategy = get_sharding_strategy(fsdp_mesh)
+
+        # Note: We force turn off CPUOffload because it causes incorrect results when using grad accumulation
+        if self.engine_config.strategy == "fsdp":
+            # cpu_offload:
+            # - actor: None
+            # - critic: None
+            # - ref: CPUOffload(offload_params=True)
+
+            # We force reference policy to use CPUOffload to save memory.
+            # We force turn off CPUOffload for actor because it causes incorrect results when using grad accumulation
+            cpu_offload = None
+            if self.engine_config.forward_only:
+                cpu_offload = CPUOffload(offload_params=True)
+                self._is_offload_param = False
+                self._is_offload_optimizer = False
+
+            module = FSDP(
+                module,
+                param_init_fn=init_fn,
+                auto_wrap_policy=auto_wrap_policy,
+                device_id=get_device_id(),
+                sharding_strategy=sharding_strategy,
+                mixed_precision=mixed_precision,
+                sync_module_states=True,
+                device_mesh=self.device_mesh,
+                forward_prefetch=self.engine_config.forward_prefetch,
+                use_orig_params=self.engine_config.use_orig_params,
+                cpu_offload=cpu_offload,
+            )
+        elif self.engine_config.strategy == "fsdp2":
+            # - actor: offload_policy
+            # - critic: offload_policy
+            # - ref: CPUOffloadPolicy(pin_memory=True)
+            assert CPUOffloadPolicy is not None, "PyTorch version >= 2.4 is required for using fully_shard API (FSDP2)"
+            mp_policy = MixedPrecisionPolicy(
+                param_dtype=param_dtype, reduce_dtype=reduce_dtype, cast_forward_inputs=True
+            )
+            offload_policy = None
+            if self.engine_config.offload_policy or self.engine_config.forward_only:
+                self._is_offload_param = False
+                self._is_offload_optimizer = False
+                offload_policy = CPUOffloadPolicy(pin_memory=True)
+
+            fsdp_kwargs = {
+                "mesh": fsdp_mesh,
+                "mp_policy": mp_policy,
+                "offload_policy": offload_policy,
+                "reshard_after_forward": self.engine_config.reshard_after_forward,
+            }
+            full_state = module.state_dict()
+            apply_fsdp2(module, fsdp_kwargs, self.engine_config)
+            fsdp2_load_full_state_dict(module, full_state, fsdp_mesh, offload_policy)
+        else:
+            raise NotImplementedError(f"Unknown strategy {self.engine_config.strategy}")
+
+        if self.model_config.enable_activation_offload:
+            enable_gradient_checkpointing = self.model_config.enable_gradient_checkpointing
+            enable_activation_offloading(module, self.engine_config.strategy, enable_gradient_checkpointing)
+
+        if torch.distributed.get_world_size() == 1 and fsdp_version(module) == 1:
+            FSDP.set_state_dict_type(
+                module,
+                state_dict_type=StateDictType.FULL_STATE_DICT,
+                state_dict_config=FullStateDictConfig(),
+            )
+        elif fsdp_version(module) == 1:
+            FSDP.set_state_dict_type(
+                module,
+                state_dict_type=StateDictType.SHARDED_STATE_DICT,
+                state_dict_config=ShardedStateDictConfig(),
+            )
+
+        return module
+
+    def _build_scheduler(self):
+        # TODO (mike): generalize to other diffusers schedulers later
+        from verl.utils.diffusers.schedulers import FlowMatchSDEDiscreteScheduler
+        from verl.utils.diffusers.utils import set_timesteps
+
+        scheduler = FlowMatchSDEDiscreteScheduler.from_pretrained(
+            pretrained_model_name_or_path=self.model_config.local_path, subfolder="scheduler"
+        )
+        set_timesteps(scheduler, self.model_config)
+        return scheduler
+
+    def _build_optimizer(self, module):
+        from verl.workers.config.optimizer import build_optimizer
+
+        optimizer = build_optimizer(module.parameters(), self.optimizer_config)
+
+        return optimizer
+
+    def _build_lr_scheduler(self, optimizer):
+        from verl.utils.torch_functional import get_constant_schedule_with_warmup, get_cosine_schedule_with_warmup
+
+        optim_config = self.optimizer_config
+
+        total_steps = optim_config.total_training_steps
+        num_warmup_steps = optim_config.lr_warmup_steps
+        lr_scheduler_type = optim_config.lr_scheduler_type
+        min_lr_ratio = optim_config.min_lr_ratio
+        num_cycles = optim_config.num_cycles
+        zero_indexed_step = optim_config.zero_indexed_step
+        if num_warmup_steps <= 0:
+            num_warmup_steps_ratio = optim_config.lr_warmup_steps_ratio
+            num_warmup_steps = int(num_warmup_steps_ratio * total_steps)
+
+        if self.rank == 0:
+            print(f"Total steps: {total_steps}, num_warmup_steps: {num_warmup_steps}")
+
+        if lr_scheduler_type == "constant":
+            lr_scheduler = get_constant_schedule_with_warmup(optimizer=optimizer, num_warmup_steps=num_warmup_steps)
+        elif lr_scheduler_type == "cosine":
+            lr_scheduler = get_cosine_schedule_with_warmup(
+                optimizer=optimizer,
+                num_warmup_steps=num_warmup_steps,
+                num_training_steps=total_steps,
+                min_lr_ratio=min_lr_ratio,
+                num_cycles=num_cycles,
+                zero_indexed_step=zero_indexed_step,
+            )
+        else:
+            raise NotImplementedError(f"LR scheduler type {lr_scheduler_type} is not supported")
+        return lr_scheduler
+
+    def _build_model_optimizer(self):
+        from verl.utils.model import print_model_size
+
+        # Load base model with specified configuration and dtype
+        module = self._build_module()
+        scheduler = self._build_scheduler()
+        # Apply LoRA adapters if low-rank adaptation is enabled
+        if self._is_lora:
+            module = self._build_lora_module(module)
+
+        # Synchronize all distributed processes before proceeding
+        torch.distributed.barrier()
+        if self.rank == 0:
+            print_model_size(module)
+        log_gpu_memory_usage("After init model from Diffusers AutoModel", logger=logger)
+
+        # Wrap model with FSDP for distributed training (sharding, mixed precision, etc.)
+        log_gpu_memory_usage("Before FSDP", logger=None)
+        module = self._build_fsdp_module(module)
+        log_gpu_memory_usage("After FSDP", logger=None)
+
+        if not self.engine_config.forward_only:
+            # Initialize optimizer with model parameters and config settings
+            optimizer = self._build_optimizer(module)
+            # Create learning rate scheduler with warmup and decay settings
+            lr_scheduler = self._build_lr_scheduler(optimizer)
+        else:
+            optimizer = None
+            lr_scheduler = None
+
+        self.module = module
+        self.scheduler = scheduler
+        self.optimizer = optimizer
+        self.lr_scheduler = lr_scheduler
+
+    def train_mode(self, **kwargs):
+        """
+        Return a context manager that switches to training mode with FSDP-specific handling.
+
+        Includes parameter and optimizer offload entry/exit.
+        """
+        return EngineTrainModeCtx(self, **kwargs)
+
+    def eval_mode(self, **kwargs):
+        """
+        Return a context manager that switches to evaluation mode with FSDP-specific handling.
+
+        Includes activation offload entry/exit.
+        """
+        return EngineEvalModeCtx(self, **kwargs)
+
+    def get_data_parallel_rank(self):
+        if self.ulysses_device_mesh is not None:
+            return self.ulysses_device_mesh["dp"].get_local_rank()
+        else:
+            return torch.distributed.get_rank()
+
+    def get_data_parallel_size(self):
+        return torch.distributed.get_world_size() // self.ulysses_sequence_parallel_size
+
+    def get_data_parallel_group(self):
+        if self.ulysses_device_mesh is not None:
+            return self.ulysses_device_mesh.get_group(mesh_dim="dp")
+        else:
+            return torch.distributed.group.WORLD
+
+    def forward_backward_batch(self, data: TensorDict, loss_function: Callable, forward_only=False) -> list[TensorDict]:
+        # note: global_batch_size should count the data across all dp ranks
+        tu.assign_non_tensor(data, sp_size=self.ulysses_sequence_parallel_size)
+
+        # compute num_tokens in global batch for loss normalization
+        batch_num_tokens = data["loss_mask"].sum().to(get_device_id())
+        torch.distributed.all_reduce(
+            batch_num_tokens, op=torch.distributed.ReduceOp.SUM, group=self.get_data_parallel_group()
+        )
+        tu.assign_non_tensor(data, batch_num_tokens=batch_num_tokens.item())
+        tu.assign_non_tensor(data, dp_size=self.get_data_parallel_size())
+
+        micro_batches, indices = prepare_micro_batches(
+            data=data, dp_group=self.get_data_parallel_group(), same_micro_num_in_dp=True
+        )
+
+        output_lst = []
+
+        ctx = torch.no_grad() if forward_only else nullcontext()
+
+        for micro_batch in micro_batches:
+            meta_info_lst = {
+                "model_output": [],
+                "loss": [],
+                "metrics": [],
+            }
+            for step in range(micro_batch["all_timesteps"].shape[1]):
+                with ctx:
+                    loss, meta_info = self.forward_step(
+                        micro_batch, loss_function=loss_function, forward_only=forward_only, step=step
+                    )
+
+                    if not forward_only:
+                        loss.backward()
+                for key, val in meta_info.items():
+                    meta_info_lst[key].append(val)
+
+            output_lst.append(meta_info_lst)
+
+        # postprocess and return
+        return self.postprocess_batch_func(output_lst=output_lst, indices=indices, data=data)
+
+    def postprocess_batch_func(self, output_lst, indices, data: TensorDict):
+        """postprocess the output of a forward_backward_batch.
+
+        output_lst is a list with one dict per micro-batch. Each dict contains
+        1. model_output, 2. loss, 3. metrics, each as a list with one entry per
+        denoising step. Per-step model outputs are stacked along dim 1, and
+        micro-batch results are concatenated along dim 0.
+        """
+
+        from verl.utils.py_functional import append_to_dict
+        # from verl.utils.seqlen_balancing import restore_dynamic_batch
+
+        # use_dynamic_bsz = tu.get_non_tensor_data(data=data, key="use_dynamic_bsz", default=True)
+
+        # output_lst is a list of dicts, one per micro-batch.
+        # Each dict contains 1. model_output, 2. loss, 3. metrics,
+        # with one entry per denoising step.
+
+        # Dynamic-batch reordering (restore_dynamic_batch) is currently disabled;
+        # see the commented-out code below.
+
+        model_output = {}
+        losses = []
+        aggregated_metrics = {}
+
+        for o in output_lst:
+            # model output
+            model_output_lst = {}
+            if "model_output" in o:
+                for model_output_dict in o["model_output"]:
+                    for key, val in model_output_dict.items():
+                        if key not in model_output_lst:
+                            model_output_lst[key] = []
+                        model_output_lst[key].append(val)
+                for key, val in model_output_lst.items():
+                    if key not in model_output:
+                        model_output[key] = []
+                    model_output[key].append(torch.stack(val, dim=1))  # (bsz, steps, ...)
+            # loss
+            if "loss" in o:
+                losses.append(o["loss"])
+
+            # metrics
+            if "metrics" in o:  # TODO: (susan) not sure
+                for metrics in o["metrics"]:
+                    append_to_dict(aggregated_metrics, metrics)
+
+        # concat results from micro batches
+
+        for key, val in model_output.items():
+            model_output[key] = torch.concat(val, dim=0)  # (global_bsz, steps, ...)
+            # reverse with dynamic bsz
+            # if use_dynamic_bsz:
+            #     model_output[key] = restore_dynamic_batch(model_output[key], indices)
+
+        output = {
+            "model_output": model_output,  # a dict of tensors in shape (global_bsz, steps, ...)
+            "loss": losses,  # micro-batch step-wise losses
+            "metrics": aggregated_metrics,
+        }
+
+        return output
+
+    def prepare_model_inputs(self, micro_batch: TensorDict, step: int):
+        latents = micro_batch["all_latents"]
+        timesteps = micro_batch["all_timesteps"]
+        prompt_embeds = micro_batch["prompt_embeds"]
+        prompt_embeds_mask = micro_batch["prompt_embeds_mask"]
+        negative_prompt_embeds = micro_batch["negative_prompt_embeds"]
+        negative_prompt_embeds_mask = micro_batch["negative_prompt_embeds_mask"]
+
+        if prompt_embeds.is_nested:
+            batch_size = prompt_embeds.size(0)
+            seq_len_effective = prompt_embeds.offsets().diff()
+            max_seq_len = int(seq_len_effective.max())
+            embed_dim = prompt_embeds.size(-1)
+            prompt_embeds = torch.nested.to_padded_tensor(
+                prompt_embeds, padding=0, output_size=(batch_size, max_seq_len, embed_dim)
+            )
+            prompt_embeds_mask = torch.nested.to_padded_tensor(
+                prompt_embeds_mask, padding=0, output_size=(batch_size, max_seq_len)
+            )
+        if isinstance(negative_prompt_embeds, torch.Tensor) and negative_prompt_embeds.is_nested:
+            batch_size = negative_prompt_embeds.size(0)
+            seq_len_effective = negative_prompt_embeds.offsets().diff()
+            max_seq_len = int(seq_len_effective.max())
+            embed_dim = negative_prompt_embeds.size(-1)
+            negative_prompt_embeds = torch.nested.to_padded_tensor(
+                negative_prompt_embeds, padding=0, output_size=(batch_size, max_seq_len, embed_dim)
+            )
+            negative_prompt_embeds_mask = torch.nested.to_padded_tensor(
+                negative_prompt_embeds_mask, padding=0, output_size=(batch_size, max_seq_len)
+            )
+
+        height = tu.get_non_tensor_data(data=micro_batch, key="height", default=None)
+        width = tu.get_non_tensor_data(data=micro_batch, key="width", default=None)
+        vae_scale_factor = tu.get_non_tensor_data(data=micro_batch, key="vae_scale_factor", default=None)
+        img_shapes = [[(1, height // vae_scale_factor // 2, width // vae_scale_factor // 2)]]
+
+        if getattr(self.module.config, "guidance_embeds", False):
+            guidance = torch.full([1], self._guidance_scale, dtype=torch.float32)
+        else:
+            guidance = None
+
+        hidden_states = latents[:, step]
+        timestep = timesteps[:, step] / 1000.0
+
+        # TODO (mike): the diffusers main branch no longer accepts txt_seq_lens
+        txt_seq_lens = torch.ones_like(prompt_embeds_mask).sum(dim=1).tolist()
+
+        if isinstance(negative_prompt_embeds_mask, torch.Tensor):
+            negative_txt_seq_lens = torch.ones_like(negative_prompt_embeds_mask).sum(dim=1).tolist()
+        else:
+            negative_txt_seq_lens = None
+
+        model_inputs = {
+            "hidden_states": hidden_states,
+            "timestep": timestep,
+            "guidance": guidance,
+            "encoder_hidden_states_mask": prompt_embeds_mask,
+            "encoder_hidden_states": prompt_embeds,
+            "img_shapes": img_shapes,
+            "txt_seq_lens": txt_seq_lens,
+            "return_dict": False,
+        }
+
+        negative_model_inputs = {
+            "hidden_states": hidden_states,
+            "timestep": timestep,
+            "guidance": guidance,
+            "encoder_hidden_states_mask": negative_prompt_embeds_mask,
+            "encoder_hidden_states": negative_prompt_embeds,
+            "img_shapes": img_shapes,
+            "txt_seq_lens": negative_txt_seq_lens,
+            "return_dict": False,
+        }
+
+        return model_inputs, negative_model_inputs
+
+    def prepare_model_outputs(self, output, micro_batch: TensorDict):
+        log_prob, prev_sample_mean, std_dev_t = output
+        model_output = {}
+        model_output["log_probs"] = log_prob
+        model_output["prev_sample_mean"] = prev_sample_mean
+        model_output["std_dev_t"] = std_dev_t
+        return model_output
+
+    def forward_model_with_scheduler(self, model_inputs, negative_model_inputs, micro_batch, step):
+        latents = micro_batch["all_latents"]
+        timesteps = micro_batch["all_timesteps"]
+
+        noise_pred = self.module(**model_inputs)[0]
+        if self._guidance_scale > 1.0:
+            neg_noise_pred = self.module(**negative_model_inputs)[0]
+            comb_pred = neg_noise_pred + self._guidance_scale * (noise_pred - neg_noise_pred)
+            cond_norm = torch.norm(noise_pred, dim=-1, keepdim=True)
+            noise_norm = torch.norm(comb_pred, dim=-1, keepdim=True)
+            noise_pred = comb_pred * (cond_norm / noise_norm)
+
+        _, log_prob, prev_sample_mean, std_dev_t = self.scheduler.sample_previous_step(
+            sample=latents[:, step].float(),
+            model_output=noise_pred,
+            timestep=timesteps[:, step],
+            noise_level=self.model_config.noise_level,
+            prev_sample=latents[:, step + 1].float(),
+            sde_type=self.model_config.sde_type,
+        )
+        return log_prob, prev_sample_mean, std_dev_t
+
+    def forward_step(self, micro_batch: TensorDict, loss_function, forward_only, step):
+        device_name = get_device_name()
+        # actually, we should avoid assigning like this...
+        micro_batch = micro_batch.to(get_device_id())
+        model_inputs, negative_model_inputs = self.prepare_model_inputs(micro_batch=micro_batch, step=step)
+        raw_output = self.forward_model_with_scheduler(
+            model_inputs=model_inputs, negative_model_inputs=negative_model_inputs, micro_batch=micro_batch, step=step
+        )
+        model_output = self.prepare_model_outputs(output=raw_output, micro_batch=micro_batch)
+
+        if loss_function is not None:
+            data = tu.get_tensordict(
+                {
+                    "old_log_probs": micro_batch["old_log_probs"][:, step],
+                    "advantages": micro_batch["advantages"][:, step],
+                    "response_mask": micro_batch["response_mask"][:, step],
+                },
+                {
+                    "dp_size": tu.get_non_tensor_data(micro_batch, "dp_size", None),
+                    "batch_num_tokens": tu.get_non_tensor_data(micro_batch, "batch_num_tokens", None),
+                    "global_batch_size": tu.get_non_tensor_data(micro_batch, "global_batch_size", None),
+                },
+            )
+            if micro_batch.get("ref_log_prob", None) is not None:
+                data["ref_log_prob"] = micro_batch["ref_log_prob"][:, step]
+
+            if micro_batch.get("ref_prev_sample_mean", None) is not None:
+                data["ref_prev_sample_mean"] = micro_batch["ref_prev_sample_mean"][:, step]
+
+            loss, metrics = loss_function(model_output=model_output, data=data, dp_group=self.get_data_parallel_group())
+        else:
+            assert forward_only, "forward_only must be True when loss_function is None"
+            loss = torch.tensor(1.0, device=device_name)
+            metrics = {}
+
+        output = {
+            "model_output": model_output,
+            "loss": loss.detach().item(),
+            "metrics": metrics,
+        }
+
+        return loss, output
+
+    def optimizer_zero_grad(self):
+        """
+        Zero the gradients of all optimized parameters.
+        """
+        self.optimizer.zero_grad()
+
+    def optimizer_step(self):
+        """
+        Clip gradients, skip update if non-finite, and step optimizer.
+
+        Returns:
+            grad_norm (float): Norm of gradients before clipping.
+        """
+        assert self.optimizer_config.clip_grad is not None
+
+        if isinstance(self.module, FSDP):
+            grad_norm = self.module.clip_grad_norm_(self.optimizer_config.clip_grad)
+        elif isinstance(self.module, FSDPModule):
+            grad_norm = fsdp2_clip_grad_norm_(self.module.parameters(), max_norm=self.optimizer_config.clip_grad)
+        else:
+            grad_norm = torch.nn.utils.clip_grad_norm_(
+                self.module.parameters(), max_norm=self.optimizer_config.clip_grad
+            )
+
+        if isinstance(grad_norm, DTensor):
+            grad_norm = grad_norm.full_tensor()
+
+        # if grad_norm is not finite, skip the update
+        if not torch.isfinite(grad_norm):
+            print(f"WARN: grad_norm is not finite: {grad_norm}")
+            self.optimizer.zero_grad()
+        else:
+            self.optimizer.step()
+        return grad_norm.item()
+
+    def lr_scheduler_step(self):
+        """
+        Advance the learning-rate scheduler and return the updated learning rate.
+        """
+        self.lr_scheduler.step()
+        lr = self.lr_scheduler.get_last_lr()[0]  # only return the first group
+        return lr
+
+    def to(self, device: str, model: bool = True, optimizer: bool = True, grad: bool = True):
+        """
+        Move FSDP model and/or optimizer to CPU or GPU with offload support.
+        Note that this function executes irrespective of the offload config. It serves as manual control.
+        """
+        super().to(device=device, model=model, optimizer=optimizer, grad=grad)
+
+        if self.engine_config.forward_only:
+            # force cpu_offload
+            return
+
+        device_name = get_device_name()
+
+        assert device in (device_name, "cpu")
+        if device == device_name:
+            if model:
+                load_fsdp_model_to_gpu(self.module)
+            if optimizer and self.optimizer is not None:
+                load_fsdp_optimizer(self.optimizer, device)
+            gc.collect()
+        elif device == "cpu":
+            if model:
+                offload_fsdp_model_to_cpu(self.module)
+            if optimizer and self.optimizer is not None:
+                offload_fsdp_optimizer(self.optimizer)
+        else:
+            raise ValueError(f"Invalid device type: {device}")
+
+    def save_checkpoint(
+        self,
+        local_path: str,
+        hdfs_path: Optional[str] = None,
+        global_step: int = 0,
+        max_ckpt_to_keep: Optional[int] = None,
+        **kwargs,
+    ) -> None:
+        """
+        Save FSDP checkpoint, handling parameter offload as needed.
+        """
+        origin_module_device = next(self.module.parameters()).device.type
+        if self._is_offload_param or origin_module_device == "cpu":
+            load_fsdp_model_to_gpu(self.module)
+
+        self.checkpoint_manager.save_checkpoint(
+            local_path=local_path, hdfs_path=hdfs_path, global_step=global_step, max_ckpt_to_keep=max_ckpt_to_keep
+        )
+
+        torch.distributed.barrier()
+        if self._is_offload_param:
+            offload_fsdp_model_to_cpu(self.module)
+
+    def load_checkpoint(
+        self, local_path: str, hdfs_path: Optional[str] = None, del_local_after_load: bool = True, **kwargs
+    ) -> None:
+        """
+        Load FSDP checkpoint, restoring parameters and optimizer state.
+        """
+        import torch
+
+        if self._is_offload_param:
+            load_fsdp_model_to_gpu(self.module)
+
+        self.checkpoint_manager.load_checkpoint(
+            local_path=local_path, hdfs_path=hdfs_path, del_local_after_load=del_local_after_load
+        )
+
+        torch.distributed.barrier()
+        if self._is_offload_param:
+            offload_fsdp_model_to_cpu(self.module)
+
+        if self._is_offload_optimizer:
+            offload_fsdp_optimizer(self.optimizer)
+
+    def get_per_tensor_param(self, layered_summon=False, base_sync_done=False, **kwargs):
+        log_gpu_memory_usage("Before load_fsdp_model_to_gpu", logger=logger)
+
+        load_fsdp_model_to_gpu(self.module)
+
+        log_gpu_memory_usage("After load_fsdp_model_to_gpu", logger=logger)
+
+        peft_config = None
+        merge_lora = self.model_config.lora.get("merge", False)
+
+        peft_model = getattr(self.module, "_fsdp_wrapped_module", self.module)
+        if hasattr(peft_model, "peft_config"):  # LoRA
+            if not merge_lora:
+                peft_config = peft_model.peft_config.get("default", None)
+                params = collect_lora_params(
+                    module=self.module,
+                    layered_summon=layered_summon,
+                    base_sync_done=base_sync_done,
+                    is_diffusers=True,
+                )
+                if not base_sync_done:
+                    params = {replace_lora_wrapper(k, peft_config): v for k, v in params.items()}
+            else:  # merge lora
+                with merged_lora_context(self.module, backup_adapters=True):
+                    params = self.module.state_dict()
+                    params = normalize_peft_param_name(params)
+        else:
+            params = self.module.state_dict()
+
+        params = convert_weight_keys(params, getattr(self.module, "_fsdp_wrapped_module", self.module))
+
+        log_gpu_memory_usage("Before offload_fsdp_model_to_cpu", logger=logger)
+        if self._is_offload_param:
+            offload_fsdp_model_to_cpu(self.module)
+        log_gpu_memory_usage("After offload_fsdp_model_to_cpu", logger=logger)
+
+        if peft_config is not None and base_sync_done:
+            per_tensor_param = params.items()
+        else:
+            device = get_device_id()  # needed when FSDP2 uses a CPU offload policy
+            # TODO: cast fp32 to bf16 to reduce weight sync overhead; needs more fine-grained control, e.g., MoE gate
+            per_tensor_param = (
+                (
+                    name,
+                    param.to(device, non_blocking=True).full_tensor().to(torch.bfloat16, non_blocking=True)
+                    if isinstance(param, DTensor)
+                    else param,
+                )
+                for name, param in params.items()
+            )
+        # return per_tensor_param, peft_config
+        # Convert peft_config to dict for vLLM compatibility (PEFTHelper.from_dict expects dict)
+
+        # diffusers: transformer backbone only
+        # vllm-omni: whole pipeline
+        # thus we need to add the prefix
+        per_tensor_param = ((f"transformer.{name}", tensor) for name, tensor in per_tensor_param)
+        peft_config_dict = peft_config.to_dict() if peft_config is not None else None
+        return per_tensor_param, peft_config_dict
+
+    @contextmanager
+    def disable_adapter(self):
+        try:
+            self.module.disable_adapters()
+            yield
+        finally:
+            self.module.enable_adapters()
+
+
+class EngineEvalModeCtx(BaseEngineCtx):
+    def __init__(self, engine: DiffusersFSDPEngine, **kwargs):
+        super().__init__(engine=engine, mode="eval", **kwargs)
+
+    def __enter__(self):
+        assert isinstance(self.engine, DiffusersFSDPEngine)
+        super().__enter__()
+        self.prev_sp_group = get_ulysses_sequence_parallel_group()
+        set_ulysses_sequence_parallel_group(self.engine.ulysses_parallel_group)
+        self.engine.module.eval()
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        assert isinstance(self.engine, DiffusersFSDPEngine)
+        set_ulysses_sequence_parallel_group(self.prev_sp_group)
+
+        # https://pytorch.org/docs/stable/notes/fsdp.html#fsdp-notes
+        # reshard the root FSDP module, which may still be unsharded after forward
+        if self.engine.engine_config.fsdp_size > 1:
+            if fsdp_version(self.engine.module) == 1:
+                self.engine.module._handle.reshard(True)
+            elif fsdp_version(self.engine.module) == 2:
+                self.engine.module.reshard()
+
+        super().__exit__(exc_type, exc_value, traceback)
+
+
+class EngineTrainModeCtx(BaseEngineCtx):
+    def __init__(self, engine: DiffusersFSDPEngine, **kwargs):
+        super().__init__(engine=engine, mode="train", **kwargs)
+
+    def __enter__(self):
+        assert isinstance(self.engine, DiffusersFSDPEngine)
+        super().__enter__()
+        self.prev_sp_group = get_ulysses_sequence_parallel_group()
+        set_ulysses_sequence_parallel_group(self.engine.ulysses_parallel_group)
+        self.engine.module.train()
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        assert isinstance(self.engine, DiffusersFSDPEngine)
+        set_ulysses_sequence_parallel_group(self.prev_sp_group)
+        self.engine.optimizer_zero_grad()
+        super().__exit__(exc_type, exc_value, traceback)
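Reviewer note on `forward_model_with_scheduler`: when `_guidance_scale > 1.0`, the guided branch combines the conditional and negative predictions and then rescales the result so that its norm matches the norm of the conditional prediction. The sketch below reproduces that arithmetic in plain Python on a single vector. The name `guided_prediction` and its arguments are hypothetical and exist only for illustration; the code in the diff operates on tensors and takes the norm along the last dimension.

```python
import math


def guided_prediction(cond, neg, guidance_scale):
    """Classifier-free guidance with norm rescaling on one vector (illustrative only)."""
    # Combine the two predictions: neg + scale * (cond - neg), as in the diff.
    comb = [n + guidance_scale * (c - n) for c, n in zip(cond, neg)]
    cond_norm = math.sqrt(sum(c * c for c in cond))
    comb_norm = math.sqrt(sum(x * x for x in comb))
    # Rescale so the combined prediction keeps the conditional prediction's norm.
    # Like the diff, this sketch does not guard against comb_norm == 0.
    return [x * (cond_norm / comb_norm) for x in comb]


cond = [3.0, 4.0]
neg = [1.0, 0.0]
out = guided_prediction(cond, neg, guidance_scale=4.0)
out_norm = math.sqrt(sum(x * x for x in out))
```

The rescaling changes only the magnitude of the combined prediction, so the guided direction is kept while the output norm stays equal to the conditional norm. A zero-norm combined prediction would divide by zero in both the sketch and the diff, which may be worth a guard.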
diff --git a/verl/workers/engine_workers.py b/verl/workers/engine_workers.py
index abca5cdb65b..8aef9e0ed77 100644
--- a/verl/workers/engine_workers.py
+++ b/verl/workers/engine_workers.py
@@ -42,7 +42,14 @@
 from verl.utils.py_functional import append_to_dict
 from verl.utils.tensordict_utils import maybe_fix_3d_position_ids
 from verl.utils.torch_functional import allgather_dict_into_dict
-from verl.workers.config import ActorConfig, HFModelConfig, RolloutConfig, TrainingWorkerConfig
+from verl.workers.config import (
+    ActorConfig,
+    DiffusersModelConfig,
+    DiffusionRolloutConfig,
+    HFModelConfig,
+    RolloutConfig,
+    TrainingWorkerConfig,
+)
 from verl.workers.rollout.base import BaseRollout, get_rollout_class
 from verl.workers.utils.losses import ppo_loss
 
@@ -134,7 +141,10 @@ def __init__(self, config: TrainingWorkerConfig):
             is_collect=self.engine.is_mp_src_rank_with_outputs(),
         )
 
-        self.flops_counter = FlopsCounter(self.model_config.hf_config)
+        if hasattr(self.model_config, "hf_config"):
+            self.flops_counter = FlopsCounter(self.model_config.hf_config)
+        else:
+            self.flops_counter = None  # TODO: add diffusion flops counter later
 
         self.loss_fn = None
 
@@ -205,7 +215,7 @@ def _postprocess_output(self, output, *, global_token_num, delta_time, forward_o
                 flatten_v = [sublist[0] for sublist in v]  # sublist should be single element
                 final_metrics[k] = sum(flatten_v) / len(flatten_v)
         # compute mfu
-        if global_token_num is not None:
+        if global_token_num is not None and self.flops_counter is not None:
             estimated_flops, promised_flops = self.flops_counter.estimate_flops(
                 global_token_num, delta_time, images_seqlens=images_seqlens
             )
@@ -267,7 +277,10 @@ def train_mini_batch(self, data: TensorDict) -> TensorDict:
 
             for batch_idx, mini_batch_td in enumerate(dataloader):
                 # add global token num
-                global_token_num = mini_batch_td["input_ids"].offsets().diff().tolist()  # (total_nnz,)
+                if mini_batch_td["input_ids"].is_nested:
+                    global_token_num = mini_batch_td["input_ids"].offsets().diff().tolist()  # (total_nnz,)
+                else:
+                    global_token_num = torch.sum(mini_batch_td["attention_mask"], dim=-1).tolist()
                 # allgather from dp rank
                 global_token_num_output = [None] * self.engine.get_data_parallel_size()
                 torch.distributed.all_gather_object(
@@ -467,7 +480,7 @@ def to(self, device, model=True, optimizer=True, grad=True):
 
     @register(dispatch_mode=Dispatch.ONE_TO_ALL)
     def init_model(self):
-        model_config: HFModelConfig = omega_conf_to_dataclass(self.config.model)
+        model_config: HFModelConfig | DiffusersModelConfig = omega_conf_to_dataclass(self.config.model)
 
         # 1. build reference model
         if "ref" in self.role:
@@ -484,8 +497,9 @@ def init_model(self):
             ref_config.model_config = model_config
 
             # construct TrainingWorkerConfig
+            model_type = model_config.get("model_type", "language_model")
             ref_training_config = TrainingWorkerConfig(
-                model_type="language_model",
+                model_type=model_type,
                 model_config=ref_config.model_config,
                 engine_config=ref_config.engine,
                 optimizer_config=ref_config.optim,
@@ -508,8 +522,9 @@ def init_model(self):
         if "actor" in self.role:
             actor_config: ActorConfig = omega_conf_to_dataclass(self.config.actor)
             actor_config.model_config = model_config
+            model_type = model_config.get("model_type", "language_model")
             actor_training_config = TrainingWorkerConfig(
-                model_type="language_model",
+                model_type=model_type,
                 model_config=actor_config.model_config,
                 engine_config=actor_config.engine,
                 optimizer_config=actor_config.optim,
@@ -547,7 +562,7 @@ def init_model(self):
 
         # 3. build rollout engine
         if "rollout" in self.role:
-            rollout_config: RolloutConfig = omega_conf_to_dataclass(self.config.rollout)
+            rollout_config: RolloutConfig | DiffusionRolloutConfig = omega_conf_to_dataclass(self.config.rollout)
 
             # TODO: move rollout_device_mesh into ServerAdapter
             # 3.1 build rollout device mesh (sglang need only)
diff --git a/verl/workers/rollout/base.py b/verl/workers/rollout/base.py
index c8038606f1f..2e762d6610e 100644
--- a/verl/workers/rollout/base.py
+++ b/verl/workers/rollout/base.py
@@ -21,7 +21,7 @@
 
 from verl import DataProto
 from verl.utils.config import omega_conf_to_dataclass
-from verl.workers.config import HFModelConfig, RolloutConfig
+from verl.workers.config import DiffusersModelConfig, HFModelConfig, RolloutConfig
 
 __all__ = ["BaseRollout"]
 
@@ -32,13 +32,13 @@ class BaseRollout(ABC):
     def __init__(
         self,
         config: RolloutConfig,
-        model_config: HFModelConfig,
+        model_config: HFModelConfig | DiffusersModelConfig,
         device_mesh: DeviceMesh,
         *args,
         **kwargs,
     ):
         self.config = omega_conf_to_dataclass(config)
-        self.model_config: HFModelConfig = omega_conf_to_dataclass(model_config, dataclass_type=HFModelConfig)
+        self.model_config: HFModelConfig | DiffusersModelConfig = omega_conf_to_dataclass(model_config)
         self.device_mesh = device_mesh
 
     @abstractmethod
@@ -82,6 +82,7 @@ def generate_sequences(self, prompts: DataProto) -> DataProto:
 
 _ROLLOUT_REGISTRY = {
     ("vllm", "async"): "verl.workers.rollout.vllm_rollout.ServerAdapter",
+    ("vllm_omni", "async"): "verl.workers.rollout.vllm_rollout.vLLMOmniServerAdapter",
     ("sglang", "async"): "verl.workers.rollout.sglang_rollout.sglang_rollout.ServerAdapter",
     ("trtllm", "async"): "verl.workers.rollout.trtllm_rollout.trtllm_rollout.ServerAdapter",
 }
diff --git a/verl/workers/rollout/replica.py b/verl/workers/rollout/replica.py
index f6571a4ec5e..a479fb09a68 100644
--- a/verl/workers/rollout/replica.py
+++ b/verl/workers/rollout/replica.py
@@ -26,7 +26,7 @@
 from verl.single_controller.ray import RayClassWithInitArgs, RayResourcePool, RayWorkerGroup, ResourcePoolManager
 from verl.utils.config import omega_conf_to_dataclass
 from verl.utils.device import is_torch_npu_available
-from verl.workers.config import HFModelConfig, RolloutConfig
+from verl.workers.config import DiffusionRolloutConfig, HFModelConfig, RolloutConfig
 
 logger = logging.getLogger(__file__)
 
@@ -51,6 +51,19 @@ class TokenOutput(BaseModel):
     """extra info for rollout"""
 
 
+class ImageOutput(BaseModel):
+    image: list[list[list[float]]]
+    """generated image tensor (CHW format)"""
+    log_probs: Optional[list[float]] = None
+    """logprobs of generated image"""
+    stop_reason: Optional[str] = None
+    """stop reason: 'completed', 'aborted', or None for unknown"""
+    num_preempted: Optional[int] = None
+    """number of times the request was preempted, for metric calculation"""
+    extra_info: dict[str, Any] = {}
+    """extra info for rollout"""
+
+
 class RolloutMode(Enum):
     # Rollout engine and training engine(fsdp/megatron) fused in same process
     # Rollout and trainer share GPUs, switch context with weight synchronization.
@@ -93,13 +106,13 @@ class RolloutReplica(ABC):
     def __init__(
         self,
         replica_rank: int,
-        config: RolloutConfig,
+        config: RolloutConfig | DiffusionRolloutConfig,
         model_config: DictConfig,
         gpus_per_node: int = 8,
         is_reward_model: bool = False,
     ) -> None:
         self.replica_rank = replica_rank
-        self.config: RolloutConfig = omega_conf_to_dataclass(config)
+        self.config: RolloutConfig | DiffusionRolloutConfig = omega_conf_to_dataclass(config)
         self.model_config: HFModelConfig = model_config
 
         self.world_size = (
@@ -299,6 +312,12 @@ def _load_vllm():
     return vLLMReplica
 
 
+def _load_vllm_omni():
+    from verl.workers.rollout.vllm_rollout.vllm_omni_async_server import vLLMOmniReplica
+
+    return vLLMOmniReplica
+
+
 def _load_sglang():
     os.environ["SGLANG_USE_CPU_ENGINE"] = "1"
 
@@ -353,6 +372,7 @@ def _load_trtllm():
 RolloutReplicaRegistry.register("vllm", _load_vllm)
 RolloutReplicaRegistry.register("sglang", _load_sglang)
 RolloutReplicaRegistry.register("trtllm", _load_trtllm)
+RolloutReplicaRegistry.register("vllm_omni", _load_vllm_omni)
 
 
 # Original function for backward compatibility
diff --git a/verl/workers/rollout/vllm_rollout/__init__.py b/verl/workers/rollout/vllm_rollout/__init__.py
index 2ecf113c839..a229d4aeff6 100644
--- a/verl/workers/rollout/vllm_rollout/__init__.py
+++ b/verl/workers/rollout/vllm_rollout/__init__.py
@@ -14,6 +14,7 @@
 import os
 from importlib.metadata import PackageNotFoundError, version
 
+from .vllm_omni_rollout import vLLMOmniServerAdapter  # noqa: F401
 from .vllm_rollout import ServerAdapter  # noqa: F401
 
 
diff --git a/verl/workers/rollout/vllm_rollout/utils.py b/verl/workers/rollout/vllm_rollout/utils.py
index 5754f17a81e..5cc5cb90322 100644
--- a/verl/workers/rollout/vllm_rollout/utils.py
+++ b/verl/workers/rollout/vllm_rollout/utils.py
@@ -22,9 +22,10 @@
 from typing import Any, Literal, get_args
 
 import torch
+from vllm_omni.diffusion.worker.diffusion_worker import CustomPipelineWorkerExtension
 
 from verl.utils.device import is_npu_available
-from verl.utils.vllm import TensorLoRARequest, VLLMHijack
+from verl.utils.vllm import OmniTensorLoRARequest, TensorLoRARequest, VLLMHijack, VLLMOmniHijack
 from verl.utils.vllm.patch import patch_vllm_moe_model_weight_loader
 from verl.utils.vllm.vllm_fp8_utils import apply_vllm_fp8_patches, is_fp8_model, load_quanted_weights
 
@@ -236,6 +237,81 @@ def _get_zmq_handle(self) -> str:
         return f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
 
 
+class vLLMOmniColocateWorkerExtension(CustomPipelineWorkerExtension):
+    """
+    The class for vLLM-Omni's worker to inherit from, in the colocate setting.
+    By defining an extension class, the code can work no matter what is
+    the underlying worker class. This way, the code can be compatible
+    with both vLLM V0 and V1.
+    NOTE: we define this class in a separate module, and the main module
+    should pass the fully qualified name as the `worker_extension_cls` argument.
+
+    Feature support:
+    1. LoRA
+    """
+
+    def __new__(cls, **kwargs):
+        set_death_signal()
+
+        # 1. patch for Lora
+        VLLMOmniHijack.hijack()
+
+        # TODO: Ascend NPU is not supported by this extension yet, so fail fast below.
+        # The VLLM_ASCEND_REQUIRED_ENV_VARS variable replacement (a fix for
+        # vllm-ascend versions < v0.13.0) is not applied here.
+        if is_npu_available:
+            raise NotImplementedError("vLLMOmniColocateWorkerExtension is not supported on NPU currently.")
+
+        return super().__new__(cls)
+
+    def update_weights_from_ipc(self, peft_config: dict = None, base_sync_done=False, use_shm: bool = False):
+        """Update the weights of the rollout model."""
+        from vllm.platforms import current_platform
+
+        from verl.workers.rollout.vllm_rollout.bucketed_weight_transfer import BucketedWeightReceiver
+
+        if current_platform.device_type == "npu" and self.device is None:
+            self.device = torch.device(f"npu:{self.local_rank}")
+
+        # In async mode, make sure the old lora is removed before adding the new one
+        if peft_config and base_sync_done:
+            self.remove_lora(VLLM_LORA_INT_ID)
+
+        assert self.device is not None
+        receiver = BucketedWeightReceiver(
+            zmq_handle=self._get_zmq_handle(),
+            device=self.device,
+            use_shm=use_shm,
+        )
+        receiver.receive_weights(
+            on_bucket_received=lambda weights: self._update_weights(
+                weights, peft_config=peft_config, base_sync_done=base_sync_done
+            )
+        )
+
+    def _update_weights(self, weights: list[tuple[str, torch.Tensor]], peft_config: dict, base_sync_done: bool):
+        if peft_config and base_sync_done:
+            weights = dict(weights)
+            lora_request = OmniTensorLoRARequest(
+                lora_name=VLLM_LORA_NAME,
+                lora_int_id=VLLM_LORA_INT_ID,
+                lora_path=VLLM_LORA_PATH,
+                peft_config=peft_config,
+                lora_tensors=weights,
+            )
+            self.add_lora(lora_request)
+            logger.info(f"vLLM-Omni load weights, loaded_params: {len(weights)}")
+        else:
+            logger.info("Loading standard weights (async)")
+            self.load_weights(weights)
+
+    def _get_zmq_handle(self) -> str:
+        """Get ZMQ handle for communication."""
+        if not hasattr(self, "device_uuid") or not self.device_uuid:
+            self.device_uuid = get_device_uuid(self.device.index)
+        return f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
+
+
 class SuppressSignalInThread:
     def __enter__(self):
         self.original_signal = signal.signal
diff --git a/verl/workers/rollout/vllm_rollout/vllm_omni_async_server.py b/verl/workers/rollout/vllm_rollout/vllm_omni_async_server.py
new file mode 100644
index 00000000000..90f8e818523
--- /dev/null
+++ b/verl/workers/rollout/vllm_rollout/vllm_omni_async_server.py
@@ -0,0 +1,767 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import asyncio
+import json
+import logging
+import os
+from dataclasses import asdict
+from pprint import pprint
+from typing import Any, Callable, Optional
+
+import ray
+import torchvision.transforms as T
+import vllm_omni.entrypoints.cli.serve
+from ray.actor import ActorHandle
+from vllm.entrypoints.openai.api_server import build_app
+from vllm.utils.argparse_utils import FlexibleArgumentParser
+from vllm_omni.engine.arg_utils import AsyncOmniEngineArgs
+from vllm_omni.entrypoints import AsyncOmni
+from vllm_omni.entrypoints.openai.api_server import omni_init_app_state
+from vllm_omni.inputs.data import OmniCustomPrompt, OmniDiffusionSamplingParams
+from vllm_omni.lora.request import LoRARequest
+from vllm_omni.outputs import OmniRequestOutput
+
+from verl.utils.config import omega_conf_to_dataclass
+from verl.utils.device import get_resource_name, get_visible_devices_keyword
+from verl.utils.net_utils import get_free_port, is_valid_ipv6_address
+from verl.utils.profiler import DistProfiler
+from verl.utils.tokenizer import normalize_token_ids
+from verl.workers.config import DiffusersModelConfig, DiffusionRolloutConfig
+from verl.workers.rollout.replica import ImageOutput, RolloutMode, RolloutReplica
+from verl.workers.rollout.utils import run_uvicorn
+from verl.workers.rollout.vllm_rollout.utils import (
+    VLLM_LORA_INT_ID,
+    VLLM_LORA_NAME,
+    VLLM_LORA_PATH,
+    build_cli_args_from_config,
+    get_vllm_max_lora_rank,
+)
+
+logger = logging.getLogger(__file__)
+logger.setLevel(logging.INFO)
+
+
+class vLLMOmniHttpServer:
+    """vLLM-Omni HTTP server on a single node; equivalent to launching the server from the command line:
+    ```
+    vllm serve --tensor-parallel-size=8 ...
+    ```
+    """
+
+    def __init__(
+        self,
+        config: DiffusionRolloutConfig,
+        model_config: DiffusersModelConfig,
+        rollout_mode: RolloutMode,
+        workers: list[ActorHandle],
+        replica_rank: int,
+        node_rank: int,
+        gpus_per_node: int,
+        nnodes: int,
+        cuda_visible_devices: str,
+    ):
+        """
+        Args:
+            config (DiffusionRolloutConfig): full config.
+            model_config (DiffusersModelConfig): model config.
+            rollout_mode (RolloutMode): rollout mode.
+            replica_rank (int): replica rank, a replica may contain multiple nodes.
+            node_rank (int): node rank.
+            gpus_per_node (int): number of gpus per node.
+            nnodes (int): number of nodes.
+            cuda_visible_devices (str): cuda visible devices.
+        """
+        os.environ[get_visible_devices_keyword()] = cuda_visible_devices
+
+        self.config: DiffusionRolloutConfig = omega_conf_to_dataclass(config)
+        self.model_config: DiffusersModelConfig = omega_conf_to_dataclass(model_config)
+        self.rollout_mode = rollout_mode
+        self.workers = workers
+
+        self.replica_rank = replica_rank
+        self.node_rank = node_rank
+        self.gpus_per_node = gpus_per_node
+        self.nnodes = nnodes
+        # model weights version, set by ServerAdapter when update weights.
+        self.global_steps = None
+
+        if self.rollout_mode != RolloutMode.HYBRID and self.config.load_format == "dummy":
+            logger.warning(f"rollout mode is {self.rollout_mode}, load_format is dummy, set to auto")
+            self.config.load_format = "auto"
+
+        # used for http server
+        self._server_address = ray.util.get_node_ip_address().strip("[]")
+        self._server_port = None
+
+        # used for controlling vllm server profiler
+        profiler_config = self.config.profiler
+        tool_config = None
+        if profiler_config is not None:
+            if profiler_config.tool in ["torch", "npu"]:
+                tool_config = omega_conf_to_dataclass((profiler_config.tool_config or {}).get(profiler_config.tool))
+            else:
+                logger.warning(f"agent loop only supports torch and npu profilers, got {profiler_config.tool}")
+                profiler_config = None
+        self.profiler_controller = DistProfiler(self.replica_rank, config=profiler_config, tool_config=tool_config)
+
+        # used for data parallel: --data-parallel-address, --data-parallel-rpc-port
+        if self.node_rank == 0:
+            self._master_address = self._server_address
+            # used for torch.distributed.init_process_group
+            self._master_port, self._master_sock = get_free_port(self._server_address, with_alive_sock=True)
+            # used for data parallel: --data-parallel-address, --data-parallel-rpc-port
+            self._dp_rpc_port, self._dp_rpc_sock = get_free_port(self._server_address, with_alive_sock=True)
+            self._dp_master_port, self._dp_master_sock = get_free_port(self._server_address, with_alive_sock=True)
+        else:
+            self._master_address = None
+            self._master_port = None
+            self._dp_rpc_port = None
+            self._dp_master_port = None
+
+        self._to_tensor = T.PILToTensor()
+
+        logger.info(
+            f"vLLMOmniHttpServer, replica_rank: {self.replica_rank}, node_rank: {self.node_rank}, "
+            f"{get_visible_devices_keyword()}: {cuda_visible_devices}, "
+            f"master_address: {self._master_address}, master_port: {self._master_port}, "
+            f"data_parallel_rpc_port: {self._dp_rpc_port}, data_parallel_master_port: {self._dp_master_port}"
+        )
+
+    def get_master_address(self):
+        """Get master address and port for data parallel.
+        Returns:
+            tuple: (master_address, master_port, dp_rpc_port)
+        """
+        return self._master_address, self._master_port, self._dp_rpc_port
+
+    def get_server_address(self):
+        """Get http server address and port."""
+        assert self._server_port is not None, "http server is not launched, port is None"
+        return self._server_address, self._server_port
+
+    @property
+    def lora_as_adapter(self) -> bool:
+        return (
+            self.model_config.lora_rank > 0 or self.model_config.lora.get("rank", 0) > 0
+        ) and not self.model_config.lora.get("merge", False)
+
+    async def collective_rpc(
+        self,
+        method: str | Callable,
+        timeout: float | None = None,
+        args: tuple = (),
+        kwargs: dict[str, Any] | None = None,
+    ):
+        await self.engine.collective_rpc(
+            method=method,
+            timeout=timeout,
+            args=args,
+            kwargs=kwargs,
+        )
+
+    async def launch_server(self, master_address: str = None, master_port: int = None, dp_rpc_port: int = None):
+        if self.node_rank != 0:
+            assert master_address and master_port and dp_rpc_port, (
+                "non-master node should provide master_address, master_port and dp_rpc_port"
+            )
+            self._master_address = master_address
+            self._master_port = master_port
+            self._dp_rpc_port = dp_rpc_port
+
+        # 1. setup vllm-omni serve cli args
+        engine_kwargs = self.config.get("engine_kwargs", {}).get("vllm_omni", {}) or {}
+        engine_kwargs = {key: val for key, val in engine_kwargs.items() if val is not None}
+        if self.config.get("limit_images", None):  # support for multi-image data
+            engine_kwargs["limit_mm_per_prompt"] = {"image": self.config.get("limit_images")}
+        if self.config.cudagraph_capture_sizes:
+            engine_kwargs["cuda_graph_sizes"] = self.config.cudagraph_capture_sizes
+
+        # TODO (mike): support custom pipeline for cli
+        engine_kwargs.pop("custom_pipeline", None)
+
+        # Override the default generation config from the Hugging Face model config;
+        # users can still override it by passing kwargs in each request.
+        override_generation_config = dict()
+        logger.info(f"override_generation_config: {override_generation_config}")
+
+        logger.info(f"enable_sleep_mode: {self.config.enable_sleep_mode}")
+        if not self.config.enable_sleep_mode:
+            from verl.utils.device import set_expandable_segments
+
+            set_expandable_segments(True)
+
+        quantization = self.config.quantization
+        hf_overrides = {}
+
+        if quantization is not None:
+            raise NotImplementedError("vLLM-Omni server does not support quantization yet.")
+
+        compilation_config = engine_kwargs.pop("compilation_config", None) or {}
+        if isinstance(compilation_config, str):
+            compilation_config = json.loads(compilation_config)
+        compilation_config.setdefault("cudagraph_mode", "FULL_AND_PIECEWISE")
+
+        # FULL cuda graph is not yet supported with DCP, downgrade to PIECEWISE
+        dcp_size = engine_kwargs.get("decode_context_parallel_size", 1) or 1
+        if dcp_size > 1 and compilation_config["cudagraph_mode"] == "FULL_AND_PIECEWISE":
+            logger.warning(
+                "FULL cuda graph is not supported with DCP (decode_context_parallel_size=%d), "
+                "downgrading cudagraph_mode to PIECEWISE.",
+                dcp_size,
+            )
+            compilation_config["cudagraph_mode"] = "PIECEWISE"
+
+        compilation_config = json.dumps(compilation_config)
+        args = {
+            "dtype": self.config.dtype,
+            "load_format": self.config.load_format,
+            "distributed_executor_backend": "mp",
+            "worker_extension_cls": "verl.workers.rollout.vllm_rollout.utils.vLLMOmniColocateWorkerExtension",
+            "trust_remote_code": self.model_config.trust_remote_code,
+            "max_model_len": self.config.max_model_len,
+            "max_num_seqs": self.config.max_num_seqs,
+            "enable_chunked_prefill": self.config.enable_chunked_prefill,
+            "max_num_batched_tokens": self.config.max_num_batched_tokens,
+            "enable_prefix_caching": self.config.enable_prefix_caching,
+            "enable_sleep_mode": self.config.enable_sleep_mode,
+            "logprobs_mode": self.config.logprobs_mode,
+            "enforce_eager": self.config.enforce_eager,
+            "gpu_memory_utilization": self.config.gpu_memory_utilization,
+            "disable_log_stats": self.config.disable_log_stats,
+            "tensor_parallel_size": self.config.tensor_model_parallel_size,
+            "seed": self.replica_rank + self.config.get("seed", 0),
+            "override_generation_config": json.dumps(override_generation_config),
+            "quantization": quantization,
+            "hf_overrides": hf_overrides,
+            "scheduling_policy": self.config.scheduling_policy,
+            "compilation_config": compilation_config,
+            **engine_kwargs,
+        }
+
+        if self.config.prometheus.enable:
+            if self.config.prometheus.served_model_name:
+                # Extract model name from path if it's a full path
+                served_model_name = self.config.prometheus.served_model_name
+                if "/" in served_model_name:
+                    # If it's a full path, extract the last part as model name
+                    served_model_name = served_model_name.split("/")[-1]
+                args["served_model_name"] = served_model_name
+
+        if self.config.expert_parallel_size > 1:
+            assert self.gpus_per_node % self.config.tensor_model_parallel_size == 0, (
+                "gpus_per_node should be divisible by tensor_model_parallel_size"
+            )
+            data_parallel_size_local = self.gpus_per_node // self.config.tensor_model_parallel_size
+            assert len(self.workers) == data_parallel_size_local * self.config.tensor_model_parallel_size, (
+                f"num workers ({len(self.workers)}) should be equal to dp_size_local "
+                f"({data_parallel_size_local}) * tp_size ({self.config.tensor_model_parallel_size})"
+            )
+
+            args.update(
+                {
+                    "enable_expert_parallel": self.config.expert_parallel_size > 1,
+                    "data_parallel_size": self.config.data_parallel_size,
+                    "data_parallel_size_local": data_parallel_size_local,
+                    "data_parallel_start_rank": self.node_rank * data_parallel_size_local,
+                    "data_parallel_address": self._master_address,
+                    "data_parallel_rpc_port": self._dp_rpc_port,
+                }
+            )
+
+        # used for torch.distributed.init_process_group
+        if self.nnodes > 1:
+            args.update(
+                {
+                    "master_addr": self._master_address,
+                    "master_port": self._master_port,
+                    "node_rank": self.node_rank,
+                    "nnodes": self.nnodes,
+                    "data_parallel_address": self._master_address,
+                    "data_parallel_rpc_port": self._dp_rpc_port,
+                }
+            )
+
+        # update lora-related args
+        lora_rank = self.model_config.lora.get("rank", 0)
+        if lora_rank <= 0:
+            lora_rank = (
+                self.model_config.lora_rank
+            )  # FIXME: fallback to lora_rank for now, we should unify lora settings.
+
+        if self.model_config.lora.get("merge", False):
+            lora_rank = 0
+
+        if lora_rank > 0:
+            lora_args = {
+                "enable_lora": True,
+                "max_loras": 1,
+                "max_lora_rank": get_vllm_max_lora_rank(lora_rank),
+            }
+            if self.model_config.lora.get("fully_sharded_loras", False):
+                lora_args["fully_sharded_loras"] = True
+            args.update(lora_args)
+
+        if self.config.enable_rollout_routing_replay:
+            args.update({"enable_return_routed_experts": True})
+
+        server_args = ["serve", self.model_config.local_path] + build_cli_args_from_config(args)
+
+        if self.replica_rank == 0:
+            pprint(server_args)
+
+        CMD_MODULES = [vllm_omni.entrypoints.cli.serve]
+        parser = FlexibleArgumentParser(description="vLLM-Omni CLI")
+        subparsers = parser.add_subparsers(required=False, dest="subparser")
+        cmds = {}
+        for cmd_module in CMD_MODULES:
+            new_cmds = cmd_module.cmd_init()
+            for cmd in new_cmds:
+                cmd.subparser_init(subparsers).set_defaults(dispatch_function=cmd.cmd)
+                cmds[cmd.name] = cmd
+        server_args = parser.parse_args(args=server_args)
+        server_args.model = server_args.model_tag
+        if server_args.subparser in cmds:
+            cmds[server_args.subparser].validate(server_args)
+
+        # 2. launch server
+        if self.node_rank == 0:
+            self._master_sock.close()
+            self._dp_rpc_sock.close()
+            self._dp_master_sock.close()
+            await self.run_server(server_args)
+        else:
+            # TODO: avoid connect before master_sock close
+            await asyncio.sleep(3)
+            await self.run_headless(server_args)
+
+    async def run_server(self, args: argparse.Namespace):
+        engine_args = AsyncOmniEngineArgs.from_cli_args(args)
+        engine_args = asdict(engine_args)
+
+        # TODO (mike): read custom_pipeline from CLI
+        custom_pipeline = self.config.engine_kwargs.get("vllm_omni", {}).get("custom_pipeline", None)
+        if custom_pipeline is not None:
+            engine_args["enable_dummy_pipeline"] = True
+            engine_args["custom_pipeline_args"] = {"pipeline_class": custom_pipeline}
+
+        # TODO (mike): support parsing engine config from CLI
+        engine_client = AsyncOmni(**engine_args)
+        app = build_app(args)
+        await omni_init_app_state(engine_client, app.state, args)
+
+        self.engine = engine_client
+        self._server_port, self._server_task = await run_uvicorn(app, args, self._server_address)
+
+    async def run_headless(self, args: argparse.Namespace):
+        """Run the headless server on a non-master node (not implemented yet)."""
+
+        # TODO (mike): support multi node
+        # Until multi-node support lands, fail loudly on non-master nodes.
+        raise NotImplementedError("vLLM-Omni headless mode is not implemented yet.")
+
+    async def generate(
+        self,
+        prompt_ids: list[int],
+        sampling_params: dict[str, Any],
+        request_id: str,
+        image_data: Optional[list[Any]] = None,
+        video_data: Optional[list[Any]] = None,
+        negative_prompt_ids: Optional[list[int]] = None,
+        priority: int = 0,
+    ) -> ImageOutput:
+        """Generate an image from pre-tokenized prompt IDs (token-in, image-out)."""
+        prompt_ids = normalize_token_ids(prompt_ids)
+
+        multi_modal_data = {}
+        if image_data is not None:
+            multi_modal_data["image"] = image_data
+        if video_data is not None:
+            multi_modal_data["video"] = video_data
+
+        # Add lora request
+        lora_request = None
+        if self.lora_as_adapter:
+            # Make sure we also check that the lora is already loaded in the engine
+            lora_loaded = VLLM_LORA_INT_ID in await self.engine.list_loras()
+            if lora_loaded:
+                lora_request = LoRARequest(
+                    lora_name=VLLM_LORA_NAME, lora_int_id=VLLM_LORA_INT_ID, lora_path=VLLM_LORA_PATH
+                )
+
+        # Build OmniCustomPrompt with pre-tokenized IDs
+        custom_prompt: OmniCustomPrompt = {"prompt_ids": prompt_ids}
+        if negative_prompt_ids is not None:
+            custom_prompt["negative_prompt_ids"] = negative_prompt_ids
+        if multi_modal_data:
+            custom_prompt["extra_args"] = {"multi_modal_data": multi_modal_data}
+
+        # Build OmniDiffusionSamplingParams from the incoming dict
+        sampling_kwargs: dict[str, Any] = {}
+        extra_args: dict[str, Any] = {}
+        for k, v in sampling_params.items():
+            if hasattr(OmniDiffusionSamplingParams, k):
+                sampling_kwargs[k] = v
+            else:
+                extra_args[k] = v
+        sampling_kwargs["extra_args"] = extra_args
+        if lora_request is not None:
+            sampling_kwargs["lora_request"] = lora_request
+        diffusion_sampling_params = OmniDiffusionSamplingParams(**sampling_kwargs)
+
+        # Call AsyncOmni.generate() with the correct API
+        generator = self.engine.generate(
+            prompt=custom_prompt,
+            request_id=request_id,
+            sampling_params_list=[diffusion_sampling_params],
+        )
+
+        # Get final response
+        final_res: Optional[OmniRequestOutput] = None
+        async for output in generator:
+            final_res = output
+        assert final_res is not None
+
+        image = (self._to_tensor(final_res.images[0]) / 255.0).tolist()
+
+        # Extract extra data from custom_output (populated by DiffusionEngine)
+        mm_output = final_res.custom_output or {}
+
+        if sampling_params.get("logprobs", False):
+            all_log_probs = mm_output.get("all_log_probs")
+            log_probs = all_log_probs[0].tolist() if all_log_probs is not None else None
+        else:
+            log_probs = None
+
+        all_latents = mm_output.get("all_latents")
+        all_timesteps = mm_output.get("all_timesteps")
+        prompt_embeds = mm_output.get("prompt_embeds")
+        prompt_embeds_mask = mm_output.get("prompt_embeds_mask")
+        negative_prompt_embeds = mm_output.get("negative_prompt_embeds")
+        negative_prompt_embeds_mask = mm_output.get("negative_prompt_embeds_mask")
+
+        extra_info = {
+            "all_latents": all_latents[0] if all_latents is not None else None,
+            "all_timesteps": all_timesteps[0] if all_timesteps is not None else None,
+            "prompt_embeds": prompt_embeds[0] if prompt_embeds is not None else None,
+            "prompt_embeds_mask": prompt_embeds_mask[0] if prompt_embeds_mask is not None else None,
+            "negative_prompt_embeds": negative_prompt_embeds[0] if negative_prompt_embeds is not None else None,
+            "negative_prompt_embeds_mask": negative_prompt_embeds_mask[0]
+            if negative_prompt_embeds_mask is not None
+            else None,
+            "global_steps": self.global_steps,
+        }
+
+        # Determine stop reason from finish_reason
+        if final_res.request_output is not None and hasattr(final_res.request_output, "finish_reason"):
+            finish_reason = final_res.request_output.finish_reason or "stop"
+        else:
+            finish_reason = "stop"
+
+        if finish_reason == "abort":
+            stop_reason = "aborted"
+        elif finish_reason in ("stop", "length"):
+            stop_reason = "completed"
+        else:
+            stop_reason = finish_reason  # for more stop reason in the future
+
+        num_preempted = None
+        if final_res.request_output is not None and hasattr(final_res.request_output, "num_preempted"):
+            num_preempted = final_res.request_output.num_preempted
+
+        return ImageOutput(
+            image=image,
+            log_probs=log_probs,
+            stop_reason=stop_reason,
+            num_preempted=num_preempted,
+            extra_info=extra_info,
+        )
+
+    async def wake_up(self):
+        if self.node_rank != 0:
+            return
+
+        if self.rollout_mode == RolloutMode.HYBRID:
+            # In hybrid mode, rollout is woken up in `update_weights`
+            raise ValueError(f"wake_up does not support rollout_mode {self.rollout_mode}")
+        elif self.rollout_mode == RolloutMode.COLOCATED:
+            # Directly call engine to wake up without sync weights.
+            await self.engine.wake_up(tags=["weights"])
+            await self.engine.reset_prefix_cache()
+        elif self.rollout_mode == RolloutMode.STANDALONE:
+            logger.info("skip wake_up in standalone mode")
+
+    async def sleep(self):
+        if self.node_rank != 0 or not self.config.free_cache_engine:
+            return
+
+        if self.rollout_mode == RolloutMode.HYBRID:
+            await self.engine.sleep(level=1)
+        elif self.rollout_mode == RolloutMode.COLOCATED:
+            await self.engine.sleep(level=1)
+        elif self.rollout_mode == RolloutMode.STANDALONE:
+            logger.info("skip sleep in standalone mode")
+
+    async def start_profile(self, **kwargs):
+        if (
+            self.profiler_controller.check_enable()
+            and self.profiler_controller.check_this_rank()
+            and self.profiler_controller.is_discrete_mode()
+        ):
+            await self.engine.start_profile(**kwargs)
+
+    async def stop_profile(self):
+        if (
+            self.profiler_controller.check_enable()
+            and self.profiler_controller.check_this_rank()
+            and self.profiler_controller.is_discrete_mode()
+        ):
+            await self.engine.stop_profile()
+
+    async def clear_kv_cache(self):
+        pass
+
+    async def set_global_steps(self, global_steps: int):
+        """Set the global steps of the model weights."""
+        self.global_steps = global_steps
+
+    async def wait_for_requests_to_drain(self):
+        # TODO (mike): to be implemented
+        pass
+
+    async def abort_all_requests(self, reset_prefix_cache: bool = True) -> dict[str, Any]:
+        """Abort all ongoing generation requests.
+
+        Returns:
+            dict[str, Any]: Dictionary containing:
+                - aborted_count: Number of requests aborted
+                - request_ids: List of aborted request IDs
+        """
+        try:
+            # Take an atomic snapshot to avoid race conditions with the vLLM engine thread
+            request_states_snapshot = list(self.engine.output_processor.request_states.items())
+            request_ids = [req_id for req_id, _ in request_states_snapshot]
+
+            if not request_ids:
+                return {"aborted_count": 0, "request_ids": []}
+
+            # For each request, create an abort output and put it on the request's queue
+            # so that the waiting generator receives the aborted result.
+            from vllm.v1.engine import FinishReason
+
+            for _, req_state in request_states_snapshot:
+                request_output = req_state.make_request_output(
+                    [], pooling_output=None, finish_reason=FinishReason.ABORT, stop_reason=None
+                )
+                req_state.queue.put(request_output)
+
+            # Abort requests in the output processor and engine core
+            self.engine.output_processor.abort_requests(request_ids)
+            await self.engine.engine_core.abort_requests_async(request_ids)
+
+            # Reset the prefix cache for a clean state (note: clear_kv_cache is currently a no-op)
+            if reset_prefix_cache:
+                await self.clear_kv_cache()
+                logger.info("Prefix cache reset after abort")
+
+            logger.info(f"Aborted {len(request_ids)} requests: {request_ids}")
+            return {"aborted_count": len(request_ids), "request_ids": request_ids}
+
+        except Exception as e:
+            logger.error(f"Error aborting requests: {e}")
+            return {"aborted_count": 0, "request_ids": [], "error": str(e)}
+
+    async def resume_generation(self):
+        """Resume generation after abort_all_requests (pause_generation).
+
+        TODO (mike): currently unused; this is a no-op placeholder.
+        """
+        if self.node_rank != 0:
+            return
+
+    async def abort_request(self, request_id: str, reset_prefix_cache: bool = True) -> dict[str, Any]:
+        """Abort a specific generation request.
+
+        Args:
+            request_id: The ID of the request to abort.
+
+        Returns:
+            dict[str, Any]: Dictionary containing abort result.
+        """
+        try:
+            request_states = self.engine.output_processor.request_states
+            req_state = request_states.get(request_id)
+
+            if req_state is None:
+                return {"aborted": False, "error": f"Request {request_id} not found"}
+
+            # Create abort output and put it to the queue
+            from vllm.v1.engine import FinishReason
+
+            request_output = req_state.make_request_output(
+                [], pooling_output=None, finish_reason=FinishReason.ABORT, stop_reason=None
+            )
+            req_state.queue.put(request_output)
+
+            # Abort in output processor and engine core
+            self.engine.output_processor.abort_requests([request_id])
+            await self.engine.engine_core.abort_requests_async([request_id])
+
+            # Reset the prefix cache for a clean state (note: clear_kv_cache is currently a no-op)
+            if reset_prefix_cache:
+                await self.clear_kv_cache()
+                logger.info(f"Prefix cache reset after abort request {request_id}")
+
+            logger.info(f"Aborted request: {request_id}")
+            return {"aborted": True, "request_id": request_id}
+
+        except Exception as e:
+            logger.error(f"Error aborting request {request_id}: {e}")
+            return {"aborted": False, "request_id": request_id, "error": str(e)}
+
+
+class vLLMOmniReplica(RolloutReplica):
+    def __init__(
+        self,
+        replica_rank: int,
+        config: DiffusionRolloutConfig,
+        model_config: DiffusersModelConfig,
+        gpus_per_node: int = 8,
+        is_reward_model: bool = False,
+    ):
+        super().__init__(replica_rank, config, model_config, gpus_per_node, is_reward_model)
+        self.server_class = ray.remote(vLLMOmniHttpServer)
+
+    async def launch_servers(self):
+        """Launch http server in each node."""
+        assert len(self.workers) == self.world_size, (
+            f"worker number {len(self.workers)} not equal to world size {self.world_size}"
+        )
+
+        # get (node_id, CUDA_VISIBLE_DEVICES) of all workers
+        worker_infos = await asyncio.gather(
+            *[
+                worker.__ray_call__.remote(
+                    lambda self: (
+                        ray.get_runtime_context().get_node_id(),
+                        ray.get_runtime_context().get_accelerator_ids()[get_resource_name()][0],
+                    )
+                )
+                for worker in self.workers
+            ]
+        )
+        worker_cuda_visible_devices = [worker_info[1] for worker_info in worker_infos]
+        worker_node_ids = [worker_info[0] for worker_info in worker_infos]
+
+        # create server actor in each node with node affinity and cuda visible devices
+        nnodes, gpus_per_replica_node = self.nnodes, self.gpus_per_replica_node
+        for node_rank in range(nnodes):
+            workers = self.workers[node_rank * gpus_per_replica_node : (node_rank + 1) * gpus_per_replica_node]
+            node_cuda_visible_devices = ",".join(
+                worker_cuda_visible_devices[node_rank * gpus_per_replica_node : (node_rank + 1) * gpus_per_replica_node]
+            )
+            node_id = worker_node_ids[node_rank * gpus_per_replica_node]
+            name = (
+                f"vllm_omni_server_{self.replica_rank}_{node_rank}"
+                if not self.is_reward_model
+                else f"vllm_omni_server_reward_{self.replica_rank}_{node_rank}"
+            )
+            server = self.server_class.options(
+                scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
+                    node_id=node_id,
+                    soft=False,
+                ),
+                runtime_env={
+                    "env_vars": {
+                        "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
+                        "RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES": "1",
+                    }
+                },
+                name=name,
+                max_concurrency=self.max_concurrency,
+            ).remote(
+                config=self.config,
+                model_config=self.model_config,
+                rollout_mode=self.rollout_mode,
+                workers=workers,
+                replica_rank=self.replica_rank,
+                node_rank=node_rank,
+                gpus_per_node=gpus_per_replica_node,
+                nnodes=nnodes,
+                cuda_visible_devices=node_cuda_visible_devices,
+            )
+            self.servers.append(server)
+
+        # launch http server in each node
+        master_address, master_port, dp_rpc_port = await self.servers[0].get_master_address.remote()
+        await asyncio.gather(
+            *[
+                server.launch_server.remote(
+                    master_address=master_address, master_port=master_port, dp_rpc_port=dp_rpc_port
+                )
+                for server in self.servers
+            ]
+        )
+
+        # get http server address from first server
+        server_address, server_port = await self.servers[0].get_server_address.remote()
+        self._server_handle = self.servers[0]
+        self._server_address = (
+            f"[{server_address}]:{server_port}"
+            if is_valid_ipv6_address(server_address)
+            else f"{server_address}:{server_port}"
+        )
+
+    async def sleep(self):
+        """Sleep each rollout server."""
+        # Drain DP engines for safe sleep.
+        await self.servers[0].wait_for_requests_to_drain.remote()
+        await asyncio.gather(*[server.sleep.remote() for server in self.servers])
+
+    async def abort_all_requests(self) -> dict[str, Any]:
+        """Abort all ongoing generation requests across all servers.
+
+        Returns:
+            dict[str, Any]: Combined abort results from all servers.
+        """
+        results = await asyncio.gather(*[server.abort_all_requests.remote() for server in self.servers])
+
+        total_aborted = sum(r.get("aborted_count", 0) for r in results)
+        all_request_ids = []
+        for r in results:
+            all_request_ids.extend(r.get("request_ids", []))
+
+        return {
+            "aborted_count": total_aborted,
+            "request_ids": all_request_ids,
+            "server_results": results,
+        }
+
+    async def abort_request(self, request_id: str) -> dict[str, Any]:
+        """Abort a specific request. Tries all servers since we don't know which one has it.
+
+        Args:
+            request_id: The ID of the request to abort.
+
+        Returns:
+            dict[str, Any]: Abort result.
+        """
+        # TODO(petersh6): we should only abort on the server that has the request.
+        results = await asyncio.gather(*[server.abort_request.remote(request_id) for server in self.servers])
+
+        for r in results:
+            if r.get("aborted", False):
+                return r
+
+        return {"aborted": False, "request_id": request_id, "error": "Request not found on any server"}
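Note on the server file above: the `finish_reason` handling in the generate path and the result merging in `vLLMOmniReplica.abort_all_requests` are plain data transformations, so they can be checked without a running engine. The following is a minimal sketch, assuming string-valued finish reasons as compared in the diff. `to_stop_reason` and `merge_abort_results` are hypothetical names used for illustration only; they are not part of this PR.

```python
from typing import Any, Optional


def to_stop_reason(finish_reason: Optional[str]) -> str:
    """Hypothetical helper mirroring the finish_reason -> stop_reason branch."""
    # A missing or empty finish reason is treated as a normal stop.
    reason = finish_reason or "stop"
    if reason == "abort":
        return "aborted"
    if reason in ("stop", "length"):
        return "completed"
    # Any other reason is passed through unchanged.
    return reason


def merge_abort_results(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper mirroring the per-server aggregation."""
    request_ids: list[str] = []
    for result in results:
        # Servers that failed return no "request_ids" entries of their own.
        request_ids.extend(result.get("request_ids", []))
    return {
        "aborted_count": sum(r.get("aborted_count", 0) for r in results),
        "request_ids": request_ids,
        "server_results": results,
    }


if __name__ == "__main__":
    per_server = [
        {"aborted_count": 2, "request_ids": ["req-a", "req-b"]},
        {"aborted_count": 0, "request_ids": [], "error": "engine unavailable"},
    ]
    print(to_stop_reason(None), to_stop_reason("abort"))
    print(merge_abort_results(per_server)["aborted_count"])
```

Because a failing server reports `aborted_count` 0 together with an `error` key, callers that need to tell "nothing to abort" apart from "abort failed" would have to inspect `server_results` rather than the totals.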
diff --git a/verl/workers/rollout/vllm_rollout/vllm_omni_rollout.py b/verl/workers/rollout/vllm_rollout/vllm_omni_rollout.py
new file mode 100644
index 00000000000..54c4975b576
--- /dev/null
+++ b/verl/workers/rollout/vllm_rollout/vllm_omni_rollout.py
@@ -0,0 +1,116 @@
+# Copyright 2025 Bytedance Ltd. and/or its affiliates
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+The vLLM-Omni rollout, which can be used with different training backends.
+When working with FSDP:
+- Use DTensor weight loader (recommended) or HF weight loader
+- Use the state_dict from FSDP to synchronize the weights among tp ranks in vLLM
+When working with Megatron:
+- Use Megatron weight loader
+- During training, only the current pp stage holds the parameters
+- Before inference, broadcast the parameters of the current pp rank
+  to all other pp ranks (all pp ranks hold all the parameters)
+- Bind the parameters to the inference engine
+- Do inference in tp. pp is treated as additional dp
+- After inference, all the parameters that don't belong to this pp rank are freed.
+"""
+
+import logging
+import os
+from typing import Any, Optional
+
+import ray
+from torch.distributed.device_mesh import DeviceMesh
+
+from verl.utils.device import get_device_id, is_support_ipc
+from verl.workers.config import HFModelConfig, RolloutConfig
+from verl.workers.rollout.vllm_rollout.utils import get_device_uuid
+from verl.workers.rollout.vllm_rollout.vllm_rollout import ServerAdapter
+
+logger = logging.getLogger(__file__)
+logger.setLevel(os.getenv("VERL_LOGGING_LEVEL", "INFO"))
+
+
+class vLLMOmniServerAdapter(ServerAdapter):
+    """
+    vLLM-Omni server adapter used in native async mode. It serves as a client that asks the vLLM-Omni
+    server to resume/release/update weights and kv_cache.
+    """
+
+    def __init__(
+        self,
+        config: RolloutConfig,
+        model_config: HFModelConfig,
+        device_mesh: DeviceMesh,
+        replica_rank: int = -1,
+    ):
+        super(ServerAdapter, self).__init__(config, model_config, device_mesh)
+        self.server_handle: ray.actor.ActorHandle = None
+
+        rank = int(os.environ["RANK"])
+        local_world_size = int(os.environ["RAY_LOCAL_WORLD_SIZE"])
+        rollout_world_size = (
+            self.config.tensor_model_parallel_size
+            * self.config.data_parallel_size
+            * self.config.pipeline_model_parallel_size
+        )
+        if replica_rank == -1:
+            self.replica_rank = rank // rollout_world_size
+        else:
+            self.replica_rank = replica_rank
+        self.rollout_rank = rank % rollout_world_size
+        self.node_rank = self.rollout_rank // local_world_size
+
+        self.sleep_level = 1
+        self.device_uuid = get_device_uuid(get_device_id())
+        self.zmq_handle = f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
+
+        self.use_shm = not is_support_ipc()
+        if self.use_shm:
+            logger.warning(
+                "IPC is not supported on your devices. Falling back to shared memory for weight transfer, "
+                "which may cause performance degradation. If you are using Ascend NPUs, please ensure that "
+                "your software and CANN toolkit versions meet the requirements for IPC support. (Ascend HDK version "
+                ">= 25.3.rc1 and CANN toolkit version >= 8.3.RC1)"
+            )
+
+    async def _execute_method(
+        self,
+        method: str,
+        non_block: bool = False,
+        timeout: Optional[float] = None,
+        args: tuple = (),
+        kwargs: Optional[dict] = None,
+    ) -> Any:
+        """Execute method on inference engine via ray.
+
+        Args:
+            method: The method name to execute on the server.
+            non_block: If True, execute the method asynchronously and return immediately.
+            timeout: Timeout for the collective_rpc call.
+            args: Positional arguments for the method.
+            kwargs: Keyword arguments for the method.
+
+        Returns:
+            The result of the method execution, or None if non_block=True.
+        """
+        if self.rollout_rank != 0:
+            return None
+
+        # Lazily look up the HTTP server actor because the server is launched after the hybrid engine.
+        if self.server_handle is None:
+            self.server_handle = ray.get_actor(f"vllm_omni_server_{self.replica_rank}_{self.node_rank}")
+
+        future = self.server_handle.collective_rpc.remote(method, timeout=timeout, args=args, kwargs=kwargs)
+        return future if non_block else await future
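Note on `vLLMOmniServerAdapter.__init__` above: the adapter derives three indices from the global `RANK` and uses two of them to build the Ray actor name it later looks up. The arithmetic can be sketched in isolation as follows. `locate_rollout_rank` and `server_actor_name` are hypothetical helpers written for this note, and the parallel sizes in the usage lines are made-up example values.

```python
def locate_rollout_rank(
    rank: int,
    tp_size: int,
    dp_size: int,
    pp_size: int,
    local_world_size: int,
    replica_rank: int = -1,
) -> tuple[int, int, int]:
    """Return (replica_rank, rollout_rank, node_rank) for a global rank.

    Mirrors the integer arithmetic in the adapter's constructor.
    """
    # Number of ranks that together form one rollout replica.
    rollout_world_size = tp_size * dp_size * pp_size
    if replica_rank == -1:
        # Derive the replica index only when the caller did not supply one.
        replica_rank = rank // rollout_world_size
    # Position of this rank inside its replica.
    rollout_rank = rank % rollout_world_size
    # Node index inside the replica, given the number of ranks per node.
    node_rank = rollout_rank // local_world_size
    return replica_rank, rollout_rank, node_rank


def server_actor_name(replica_rank: int, node_rank: int) -> str:
    """Actor name format used by the non-reward server in the diff."""
    return f"vllm_omni_server_{replica_rank}_{node_rank}"


if __name__ == "__main__":
    # Example values only: 8 ranks per replica, 8 ranks per node.
    print(locate_rollout_rank(rank=11, tp_size=4, dp_size=2, pp_size=1, local_world_size=8))
    # Example values only: a 16-rank replica spanning two 8-rank nodes.
    replica, _, node = locate_rollout_rank(rank=13, tp_size=8, dp_size=2, pp_size=1, local_world_size=8)
    print(server_actor_name(replica, node))
```

In the diff, only the rank with `rollout_rank == 0` forwards `_execute_method` calls; every other rank returns `None` before the actor lookup.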
diff --git a/verl/workers/utils/losses.py b/verl/workers/utils/losses.py
index 2932b9d3fa1..a2cad74fa90 100644
--- a/verl/workers/utils/losses.py
+++ b/verl/workers/utils/losses.py
@@ -17,7 +17,7 @@
 import torch.nn.functional as F
 from tensordict import TensorDict
 
-from verl.trainer.ppo.core_algos import agg_loss, compute_value_loss, get_policy_loss_fn, kl_penalty
+from verl.trainer.ppo.core_algos import agg_loss, compute_value_loss, get_policy_loss_fn, kl_penalty, kl_penalty_image
 from verl.utils import tensordict_utils as tu
 from verl.utils.dataset.dataset_utils import DatasetPadMode
 from verl.utils.metric import AggregationType, Metric
@@ -96,7 +96,13 @@ def _slice_response_from_unpad_output(tensor: torch.Tensor, data: TensorDict) ->
 
 def ppo_loss(config: ActorConfig, model_output, data: TensorDict, dp_group=None):
     """Computes ppo loss from model output (log_prob, entropy, values, etc. ) and old_log_probs from data."""
-    log_prob = no_padding_2_padding(model_output["log_probs"], data)
+    loss_mode = config.policy_loss.get("loss_mode", "vanilla")
+    if loss_mode == "flow_grpo":
+        log_prob = model_output["log_probs"]
+    else:
+        # Convert log-probs from the no-padding format back to the padded layout.
+        log_prob = no_padding_2_padding(model_output["log_probs"], data)
+
     entropy = model_output.get("entropy", None)
     if entropy is not None:
         entropy = no_padding_2_padding(entropy, data)
@@ -130,8 +136,6 @@ def ppo_loss(config: ActorConfig, model_output, data: TensorDict, dp_group=None)
 
     loss_agg_mode = config.loss_agg_mode
 
-    loss_mode = config.policy_loss.get("loss_mode", "vanilla")
-
     policy_loss_fn = get_policy_loss_fn(loss_mode)
     pg_loss, pg_metrics = policy_loss_fn(
         old_log_prob=old_log_prob,
@@ -162,12 +166,20 @@ def ppo_loss(config: ActorConfig, model_output, data: TensorDict, dp_group=None)
 
     # add kl loss
     if config.use_kl_loss:
-        ref_log_prob = data["ref_log_prob"]
-        # compute kl loss
-        kld = kl_penalty(logprob=log_prob, ref_logprob=ref_log_prob, kl_penalty=config.kl_loss_type)
-        kl_loss = agg_loss(
-            loss_mat=kld, loss_mask=response_mask, loss_agg_mode=config.loss_agg_mode, **config.global_batch_info
-        )
+        if loss_mode == "flow_grpo":
+            ref_prev_sample_mean = data["ref_prev_sample_mean"]
+            prev_sample_mean = model_output["prev_sample_mean"]
+            std_dev_t = model_output["std_dev_t"]
+            kl_loss = kl_penalty_image(
+                prev_sample_mean=prev_sample_mean, ref_prev_sample_mean=ref_prev_sample_mean, std_dev_t=std_dev_t
+            )
+        else:
+            ref_log_prob = data["ref_log_prob"]
+            # compute kl loss
+            kld = kl_penalty(logprob=log_prob, ref_logprob=ref_log_prob, kl_penalty=config.kl_loss_type)
+            kl_loss = agg_loss(
+                loss_mat=kld, loss_mask=response_mask, loss_agg_mode=config.loss_agg_mode, **config.global_batch_info
+            )
 
         policy_loss += kl_loss * config.kl_loss_coef
         metrics["kl_loss"] = Metric(value=kl_loss, aggregation=metric_aggregation)
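Note on the `flow_grpo` KL branch above: the body of `kl_penalty_image` is not shown in this hunk, so the following is only the standard closed form that its arguments suggest, not a statement of what the function computes. If the policy and reference transitions at a denoising step are both isotropic Gaussians with the same standard deviation $\sigma_t$ over a $d$-dimensional sample, the log-determinant and trace terms of the Gaussian KL cancel and only the mean difference remains. Reading `prev_sample_mean` as $\mu_\theta$, `ref_prev_sample_mean` as $\mu_{\mathrm{ref}}$, and `std_dev_t` as $\sigma_t$ is an assumption, and how the result is reduced over latent dimensions, timesteps, and the batch is not specified here.

```latex
\begin{aligned}
\mathrm{KL}\!\left(\mathcal{N}(\mu_\theta, \sigma_t^2 I)\,\big\|\,\mathcal{N}(\mu_{\mathrm{ref}}, \sigma_t^2 I)\right)
&= \frac{1}{2}\left[\operatorname{tr}\!\left(I\right) - d
   + \frac{\lVert \mu_\theta - \mu_{\mathrm{ref}} \rVert^2}{\sigma_t^2}
   + \ln\frac{\det(\sigma_t^2 I)}{\det(\sigma_t^2 I)}\right] \\
&= \frac{\lVert \mu_\theta - \mu_{\mathrm{ref}} \rVert^2}{2\,\sigma_t^2}
\end{aligned}
```

Because the covariances are equal, this quantity is symmetric in the two means, which is consistent with the branch needing no `kl_loss_type` argument.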
diff --git a/verl/workers/utils/padding.py b/verl/workers/utils/padding.py
index 16242e7731f..7de3b2b789a 100644
--- a/verl/workers/utils/padding.py
+++ b/verl/workers/utils/padding.py
@@ -20,6 +20,56 @@
 from verl.utils.attention_utils import index_first_axis, unpad_input
 
 
+def embeds_padding_2_no_padding(data: TensorDict) -> TensorDict:
+    """
+    Convert padded prompt embeddings in a TensorDict to the no-padding (jagged) format.
+
+    Args:
+        data: TensorDict with "prompt_embeds", "prompt_embeds_mask",
+              "negative_prompt_embeds", "negative_prompt_embeds_mask"
+
+    Returns:
+        data: The input TensorDict, updated in place, where
+        - "prompt_embeds", "prompt_embeds_mask", and (if present) "negative_prompt_embeds",
+          "negative_prompt_embeds_mask" are jagged NestedTensors, and "loss_mask" is "response_mask"
+    """
+
+    prompt_embeds = data["prompt_embeds"]  # (bs, seq_len, dim)
+    prompt_embeds_mask = data["prompt_embeds_mask"]  # (bs, seq_len)
+    prompt_embeds_list = []
+    prompt_embeds_mask_list = []
+    for i in range(prompt_embeds_mask.shape[0]):
+        curr_mask = prompt_embeds_mask[i].bool()
+        curr_prompt_embeds = prompt_embeds[i, curr_mask, :]
+        prompt_embeds_list.append(curr_prompt_embeds)
+        prompt_embeds_mask_list.append(curr_mask[curr_mask])
+    prompt_embeds_nested = torch.nested.as_nested_tensor(prompt_embeds_list, layout=torch.jagged)
+    prompt_embeds_mask_nested = torch.nested.as_nested_tensor(prompt_embeds_mask_list, layout=torch.jagged)
+    data["prompt_embeds"] = prompt_embeds_nested
+    data["prompt_embeds_mask"] = prompt_embeds_mask_nested
+
+    if isinstance(data.get("negative_prompt_embeds", None), torch.Tensor):
+        negative_prompt_embeds = data["negative_prompt_embeds"]  # (bs, seq_len, dim)
+        negative_prompt_embeds_mask = data["negative_prompt_embeds_mask"]  # (bs, seq_len)
+        negative_prompt_embeds_list = []
+        negative_prompt_embeds_mask_list = []
+        for i in range(negative_prompt_embeds_mask.shape[0]):
+            curr_mask = negative_prompt_embeds_mask[i].bool()
+            curr_negative_prompt_embeds = negative_prompt_embeds[i, curr_mask, :]
+            negative_prompt_embeds_list.append(curr_negative_prompt_embeds)
+            negative_prompt_embeds_mask_list.append(curr_mask[curr_mask])
+        negative_prompt_embeds_nested = torch.nested.as_nested_tensor(negative_prompt_embeds_list, layout=torch.jagged)
+        negative_prompt_embeds_mask_nested = torch.nested.as_nested_tensor(
+            negative_prompt_embeds_mask_list, layout=torch.jagged
+        )
+        data["negative_prompt_embeds"] = negative_prompt_embeds_nested
+        data["negative_prompt_embeds_mask"] = negative_prompt_embeds_mask_nested
+
+    data["loss_mask"] = data["response_mask"]
+
+    return data
+
+
 def left_right_2_no_padding(data: TensorDict) -> TensorDict:
     """
     Convert TensorDict from left-right padding to no-padding format.