
Commit f11af2d

Xuekai-Zhu and claude authored
[recipe] fix: FlowRL actor to pure implementation (#4397)
## Summary

This PR refactors the FlowRL actor implementation by removing CISPO-specific features and simplifying to a pure FlowRL trajectory balance objective with importance weight clipping.

## Changes

### Removed

- **Ablation study code**: Deleted `compute_flowrl_cispo_clip_ablation` function and environment variable switching logic

### Modified

- **Function rename**: `compute_flowrl_cispo_clip` → `compute_flowrl` to better reflect the pure implementation
- **Simplified masking**: Now uses `response_mask` directly without additional condition-based filtering
- **Cleaner metrics**: Keeps essential metrics (log_prob, log_z, importance_weight, PPO KL, reference KL)

### Kept

- **Core FlowRL objective**: Trajectory balance loss `L = E[w * (log Z + log p_θ - β*R - log p_ref)²]`
- **Importance weight clipping**: Maintains stability with `max=10` clipping
- **Log partition function (log Z)**: Projection network for estimating partition function

---------

Co-authored-by: Claude <[email protected]>
1 parent cb23607 commit f11af2d
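
For quick reference, the kept objective can be written as a small self-contained function. This is only a sketch in plain PyTorch, with the masked reductions inlined instead of the recipe's `verl_F` helpers; the function name and the `beta` default are illustrative, and the authoritative version is `compute_flowrl` in the diff below. Inputs are `(B, T)` tensors for the log-probs, reward, and mask, and a `(B, 1)` log Z estimate.

```python
import torch


def flowrl_loss_sketch(log_prob, ref_log_prob, old_log_prob, log_z, reward, response_mask, beta=1.0):
    """Pure FlowRL trajectory balance loss with importance-weight clipping (max=10)."""
    mask = response_mask.float()
    denom = mask.sum(dim=1).clamp(min=1e-8)

    # Sequence-level averages over valid response tokens
    avg_log_prob = (log_prob * mask).sum(dim=1) / denom
    avg_ref_log_prob = (ref_log_prob * mask).sum(dim=1) / denom
    seq_log_reward = (reward * mask).sum(dim=1) / denom

    # Trajectory balance residual: log Z + log p_theta - beta * R - log p_ref
    delta = log_z.squeeze(-1) + avg_log_prob - beta * seq_log_reward - avg_ref_log_prob

    # Sequence-level importance weight (current vs. old policy), clipped for stability
    log_w = ((log_prob - old_log_prob) * mask).sum(dim=1)
    imp_w = torch.exp(log_w).detach().clamp(max=10.0)

    # Importance-weighted squared residual, averaged over the batch
    return (imp_w * delta.pow(2)).mean()
```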

File tree

3 files changed: +79 -322 lines

recipe/flowrl/README.md

Lines changed: 52 additions & 37 deletions
@@ -12,7 +12,8 @@
 <p align="center" style="color:#42A5F5; font-size:14px; margin-top:4px;">
 <a href="https://x.com/RoverHM/status/1969113890878259518" target="_blank">𝕏 Post 1</a> |
 <a href="https://x.com/zdhnarsil/status/1969049940774023428" target="_blank">𝕏 Post 2</a> |
-<a href="https://x.com/_akhaliq/status/1968901977376505929" target="_blank">𝕏 Post 3</a>
+<a href="https://x.com/_akhaliq/status/1968901977376505929" target="_blank">𝕏 Post 3</a> |
+<a href="https://x.com/zhu_xuekai/status/1968942580197941563" target="_blank">𝕏 Post 4</a>
 </p>

 <p align="center">
@@ -24,16 +25,16 @@
 - [FlowRL Objective](#flowrl-objective)
 - [Trained Models & Experiment Logs](#trained-models--experiment-logs)
 - [Quick Start](#quick-start)
-- [Option 1: Use verl Recipe](#option-1-use-verl-recipe)
-- [Step 1: Prepare Data and Model](#step-1-prepare-data-and-model)
-- [Step 2: Run Training](#step-2-run-training)
-- [Option 2: Original Paper Reproduction](#option-2-original-paper-reproduction)
+- [Option 1: Original Paper Reproduction (verl 0.4.0)](#option-1-original-paper-reproduction-verl-040--recommended)
 - [Step 1: Installation](#step-1-installation)
 - [Step 2: Data Preparation](#step-2-data-preparation)
 - [Step 3: Model Preparation](#step-3-model-preparation)
-- [Step 4: Training](#step-4-training)
-- [Step 5: Testing](#step-5-testing)
-- [Option 3: Implement FlowRL Yourself](#option-3-implement-flowrl-yourself)
+- [Step 4: Training Scripts](#step-4-training-scripts)
+- [Option 2: Latest verl Recipe FlowRL](#option-3-latest-verl-recipe-flowrl)
+- [Step 1: Prepare Data and Model](#step-1-prepare-data-and-model)
+- [Step 2: Run Training](#step-2-run-training)
+- [Option 3: Implement FlowRL Yourself](#option-4-implement-flowrl-yourself)
+- [Testing](#testing)
 - [Citation](#citation)

 ## FlowRL Objective
@@ -56,30 +57,15 @@ FlowRL is a flow-balanced reinforcement learning method that matches full reward

 There are three ways to use FlowRL:

-### Option 1: Use verl Recipe
-
-For running FlowRL using the verl framework:
-
-#### Step 1: Prepare Data and Model
-
-```bash
-# Prepare dataset
-bash recipe/flowrl/prepare/prepare_data.sh
-
-# Prepare model
-bash recipe/flowrl/prepare/prepare_model.sh
-```
+---

-#### Step 2: Run Training
+**⭐ We recommend using Option 1 as the default choice.** Since verl updates frequently, the newest versions may have unstable factors such as training and inference mismatches. Option 1 uses verl 0.4.0, which is stable and has been thoroughly tested with our paper results.

-```bash
-# Train FlowRL with Qwen2.5-7B
-bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh
-```
+---

-### Option 2: Original Paper Reproduction
+### Option 1: Original Paper Reproduction (verl 0.4.0) ⭐ Recommended

-For exact reproduction of results from the paper, use the original repository:
+For exact reproduction of results from the paper, use the original repository with verl 0.4.0:

 👉 **Original Code:** [https://github.com/Xuekai-Zhu/FlowRL](https://github.com/Xuekai-Zhu/FlowRL)

@@ -115,7 +101,7 @@ bash preprocess/down_load_model.sh
 # For other models, modify MODEL_NAME in the script before running
 ```

-#### Step 4: Training
+#### Step 4: Training Scripts

 ```bash
 cd verl_FlowRL
@@ -129,8 +115,43 @@ bash command/training/math/flowrl_32B_math.sh
 # For 7B code training
 bash command/training/code/flowrl_7B_code.sh
 ```
+----
+### Option 2: Latest verl Recipe FlowRL
+
+For running FlowRL using the latest verl framework:

-#### Step 5: Testing
+**Latest verl:**
+
+- verl recipe: [https://github.com/volcengine/verl/tree/main/recipe/flowrl](https://github.com/volcengine/verl/tree/main/recipe/flowrl)
+
+#### Step 1: Prepare Data and Model
+
+```bash
+# Prepare dataset
+bash recipe/flowrl/prepare/prepare_data.sh
+
+# Prepare model
+bash recipe/flowrl/prepare/prepare_model.sh
+```
+
+#### Step 2: Run Training
+
+```bash
+# Train FlowRL with Qwen2.5-7B
+bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh
+```
+----
+### Option 3: Implement FlowRL Yourself
+
+If you want to implement FlowRL in your own codebase, we provide a detailed implementation guide:
+
+📖 **[FlowRL Implementation Guide](FLOWRL_SIMPLE_GUIDE.md)**
+
+This guide walks you through the key components and steps needed to integrate FlowRL into your existing training pipeline.
+
+## Testing
+
+After training your FlowRL models, you can evaluate them using the following commands:

 ```bash
 cd verl_Test
@@ -145,13 +166,7 @@ bash command/eval/math/flowrl_math_test.sh
 bash command/eval/code/flowrl_code_test.sh
 ```

-### Option 3: Implement FlowRL Yourself
-
-If you want to implement FlowRL in your own codebase, we provide a detailed implementation guide:
-
-📖 **[FlowRL Implementation Guide](FLOWRL_SIMPLE_GUIDE.md)**
-
-This guide walks you through the key components and steps needed to integrate FlowRL into your existing training pipeline.
+**Reference:** For verl v0.5.0.dev merge model script, see [merge_model.sh](https://github.com/Xuekai-Zhu/verl_FlowRL/blob/flowrl-v0.5.0.dev/recipe/flowrl/eval/merge_model.sh)

 ## Citation

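Putting the Option 2 steps from the updated README together, the recipe path reduces to three commands. This sketch assumes you run from the root of a verl checkout and train the default Qwen2.5-7B configuration; adjust the run script for other models.

```bash
# Assumes the current directory is the root of a verl checkout
bash recipe/flowrl/prepare/prepare_data.sh    # download and preprocess the dataset
bash recipe/flowrl/prepare/prepare_model.sh   # download the base model
bash recipe/flowrl/run_flowrl_qwen2.5_7b.sh   # launch FlowRL training with Qwen2.5-7B
```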

recipe/flowrl/flowrl_actor.py

Lines changed: 27 additions & 148 deletions
@@ -359,34 +359,27 @@ def update_policy(self, data: DataProto):
 # vanilla -> verl.trainer.ppo.core_algos.compute_policy_loss_vanilla
 # gpg -> verl.trainer.ppo.core_algos.compute_policy_loss_gpg
 # clip_cov -> verl.trainer.ppo.core_algos.compute_policy_loss_clip_cov
+# policy_loss_fn = get_policy_loss_fn(loss_mode)
+# pg_loss, pg_clipfrac, ppo_kl, pg_clipfrac_lower = policy_loss_fn(
+#     old_log_prob=old_log_prob,
+#     log_prob=log_prob,
+#     advantages=advantages,
+#     response_mask=response_mask,
+#     loss_agg_mode=loss_agg_mode,
+#     config=self.config,
+#     rollout_log_probs=rollout_log_probs,
+# )
 # Compute FlowRL trajectory balance loss
-# Use environment variable to switch between versions
-use_ablation = os.getenv("FLOWRL_CLIP_ABLATION", "false").lower() == "true"
-
-if use_ablation:
-    # Ablation: only clip, no hard mask
-    policy_loss, flowrl_metrics = self.compute_flowrl_cispo_clip_ablation(
-        log_prob=log_prob,
-        ref_log_prob=ref_log_prob,
-        old_log_prob=old_log_prob,
-        log_z=log_z,
-        reward=advantages,
-        response_mask=response_mask,
-        clip_ratio=self.config.clip_ratio,
-        rollout_log_probs=rollout_log_probs,
-    )
-else:
-    # Default: CISPO with hard mask + clip
-    policy_loss, flowrl_metrics = self.compute_flowrl_cispo_clip(
-        log_prob=log_prob,
-        ref_log_prob=ref_log_prob,
-        old_log_prob=old_log_prob,
-        log_z=log_z,
-        reward=advantages,
-        response_mask=response_mask,
-        clip_ratio=self.config.clip_ratio,
-        rollout_log_probs=rollout_log_probs,
-    )
+policy_loss, flowrl_metrics = self.compute_flowrl(
+    log_prob=log_prob,
+    ref_log_prob=ref_log_prob,
+    old_log_prob=old_log_prob,
+    log_z=log_z,
+    reward=advantages,
+    response_mask=response_mask,
+    clip_ratio=self.config.clip_ratio,
+    rollout_log_probs=rollout_log_probs,
+)

 # if entropy_coeff != 0:
 #     entropy_loss = agg_loss(
@@ -438,7 +431,7 @@ def update_policy(self, data: DataProto):
 self.actor_optimizer.zero_grad()
 return metrics

-def compute_flowrl_cispo_clip(
+def compute_flowrl(
     self,
     log_prob=None,
     ref_log_prob=None,
@@ -449,37 +442,23 @@ def compute_flowrl_cispo_clip(
     clip_ratio=None,
     rollout_log_probs=None,
 ):
-    log_ratio = log_prob - old_log_prob  # (B, T)
-    ratio = torch.exp(log_ratio)  # (B, T)
-
-    condition_1 = (reward > 0) & (ratio > 1.0 + 0.28)  # (B, T)
-    condition_2 = (reward < 0) & (ratio < 1.0 - 0.2)  # (B, T)
-
-    # CISPO mask
-    cispo_mask = ~(condition_1 | condition_2)
-    cispo_mask = cispo_mask.float()
-    combined_mask = response_mask * cispo_mask
-
     # squeeze log_z to (B,)
     log_z = log_z.squeeze(-1)

     # Average token log-probs & rewards over valid positions
-    avg_log_prob = verl_F.masked_mean(log_prob, combined_mask, axis=1)
-    avg_ref_log_prob = verl_F.masked_mean(ref_log_prob, combined_mask, axis=1)
-    seq_log_reward = verl_F.masked_mean(reward, combined_mask, axis=1)
+    avg_log_prob = verl_F.masked_mean(log_prob, response_mask, axis=1)
+    avg_ref_log_prob = verl_F.masked_mean(ref_log_prob, response_mask, axis=1)
+    seq_log_reward = verl_F.masked_mean(reward, response_mask, axis=1)

     # FlowRL residual: logZ + logpf - β*R - logpref
     delta = log_z + avg_log_prob - self.flowrl_beta_coef * seq_log_reward - avg_ref_log_prob

     # Importance ratio from current vs old policy (product of token ratios)
-    log_w = verl_F.masked_sum(log_prob - old_log_prob, combined_mask, axis=1)
+    log_w = verl_F.masked_sum(log_prob - old_log_prob, response_mask, axis=1)
     imp_w_raw = torch.exp(log_w).detach()
+    imp_w = torch.clamp(imp_w_raw, max=10)

-    # Clamp importance weight for numerical stability (prevent extreme values)
-    # imp_w = torch.clamp(imp_w_raw, max=10.0)
-    imp_w = torch.clamp(imp_w_raw, 1 - 0.2, 1 + 0.28)
-
-    # Loss: weighted squared residual with clipped importance weights
+    # Loss: weighted squared residual with importance weights
     weighted_losses = imp_w * (delta**2)
     avg_loss = torch.mean(weighted_losses)

@@ -491,11 +470,6 @@ def compute_flowrl_cispo_clip(
     approx_kl_ref = log_prob - ref_log_prob
     ref_kl = verl_F.masked_mean(-approx_kl_ref, response_mask)

-    # cispo
-    total_tokens = response_mask.sum()
-    cispo_dropped = (response_mask * (1 - cispo_mask)).sum()
-    cispo_mask_ratio = cispo_dropped / (total_tokens + 1e-8)
-
     # Metrics
     loss_term_dict = {
         "actor/log_prob": verl_F.masked_mean(log_prob, response_mask).detach().item(),
@@ -504,104 +478,9 @@ def compute_flowrl_cispo_clip(
         "actor/log_z": log_z.mean().detach().item(),
         "actor/log_reward": verl_F.masked_mean(reward, response_mask).detach().item(),
         "actor/final_loss": avg_loss.detach().item(),
-        "actor/importance_weight_raw": imp_w_raw.mean().detach().item(),
         "actor/importance_weight": imp_w.mean().detach().item(),
         "actor/ppo_kl": ppo_kl.detach().item(),  # PPO-style KL (current vs old policy)
         "actor/ref_kl": ref_kl.detach().item(),  # KL with reference policy
-        "actor/cispo_mask_ratio": cispo_mask_ratio.detach().item(),  # cispo
-        "actor/cispo_dropped_tokens": cispo_dropped.detach().item(),  # cispo
-        "actor/condition_1_count": (condition_1 * response_mask).sum().detach().item(),  # cispo
-        "actor/condition_2_count": (condition_2 * response_mask).sum().detach().item(),  # cispo
-    }
-
-    return avg_loss, loss_term_dict
-
-def compute_flowrl_cispo_clip_ablation(
-    self,
-    log_prob=None,
-    ref_log_prob=None,
-    old_log_prob=None,
-    log_z=None,
-    reward=None,
-    response_mask=None,
-    clip_ratio=None,
-    rollout_log_probs=None,
-):
-    """
-    Ablation study: Remove hard CISPO mask, only use importance weight clipping.
-    This version uses response_mask only (no condition-based masking).
-    """
-
-    # log_ratio = log_prob - old_log_prob  # (B, T)
-    # ratio = torch.exp(log_ratio)  # (B, T)
-
-    # === Main change: Remove hard mask, only use clip ===
-    # Original version had:
-    # condition_1 = (reward > 0) & (ratio > 1.0 + 0.28)
-    # condition_2 = (reward < 0) & (ratio < 1.0 - 0.2)
-    # cispo_mask = ~(condition_1 | condition_2)
-    # combined_mask = response_mask * cispo_mask
-
-    # New version: Only use response_mask, no hard masking
-    combined_mask = response_mask  # Only keep response_mask
-    # ====================================================
-
-    # squeeze log_z to (B,)
-    log_z = log_z.squeeze(-1)
-
-    # Average token log-probs & rewards over valid positions
-    avg_log_prob = verl_F.masked_mean(log_prob, combined_mask, axis=1)
-    avg_ref_log_prob = verl_F.masked_mean(ref_log_prob, combined_mask, axis=1)
-    seq_log_reward = verl_F.masked_mean(reward, combined_mask, axis=1)
-
-    # FlowRL residual: logZ + logpf - β*R - logpref
-    delta = log_z + avg_log_prob - self.flowrl_beta_coef * seq_log_reward - avg_ref_log_prob
-
-    # Importance ratio from current vs old policy (product of token ratios)
-    log_w = verl_F.masked_sum(log_prob - old_log_prob, combined_mask, axis=1)
-    imp_w_raw = torch.exp(log_w).detach()
-
-    # === Main change: Clipping is the core of CISPO ===
-    # This clipping is what distinguishes this from vanilla FlowRL
-    imp_w = torch.clamp(imp_w_raw, 1 - 0.2, 1 + 0.28)  # Keep this unchanged
-    # ==================================================
-
-    # Loss: weighted squared residual with clipped importance weights
-    weighted_losses = imp_w * (delta**2)
-    avg_loss = torch.mean(weighted_losses)
-
-    # PPO KL: negative_approx_kl = log_prob - old_log_prob
-    negative_approx_kl = log_prob - old_log_prob
-    ppo_kl = verl_F.masked_mean(-negative_approx_kl, response_mask)
-
-    # Reference KL: approx_kl_ref = log_prob - ref_log_prob
-    approx_kl_ref = log_prob - ref_log_prob
-    ref_kl = verl_F.masked_mean(-approx_kl_ref, response_mask)
-
-    # === Updated statistics ===
-    # Since we're using clipping instead of masking, count clipped samples
-    total_tokens = response_mask.sum()
-    clipped_low = ((imp_w_raw < 1.0 - 0.2) & (imp_w_raw > 0)).sum()
-    clipped_high = (imp_w_raw > 1.0 + 0.28).sum()
-    cispo_clipped_count = clipped_low + clipped_high
-    cispo_clip_ratio = cispo_clipped_count / (total_tokens + 1e-8)
-
-    # Metrics
-    loss_term_dict = {
-        "actor/log_prob": verl_F.masked_mean(log_prob, response_mask).detach().item(),
-        "actor/old_log_prob": verl_F.masked_mean(old_log_prob, response_mask).detach().item(),
-        "actor/ref_log_prob": verl_F.masked_mean(ref_log_prob, response_mask).detach().item(),
-        "actor/log_z": log_z.mean().detach().item(),
-        "actor/log_reward": verl_F.masked_mean(reward, response_mask).detach().item(),
-        "actor/final_loss": avg_loss.detach().item(),
-        "actor/importance_weight_raw": imp_w_raw.mean().detach().item(),
-        "actor/importance_weight": imp_w.mean().detach().item(),
-        "actor/ppo_kl": ppo_kl.detach().item(),
-        "actor/ref_kl": ref_kl.detach().item(),
-        "actor/cispo_clip_ratio": cispo_clip_ratio.detach().item(),  # Renamed from mask_ratio
-        "actor/cispo_clipped_count": cispo_clipped_count.detach().item(),  # Renamed from dropped_tokens
-        "actor/clipped_low_count": clipped_low.detach().item(),
-        "actor/clipped_high_count": clipped_high.detach().item(),
     }

     return avg_loss, loss_term_dict
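
The "Kept" list in the summary mentions a projection network that estimates the log partition function, but its definition is not part of this diff. As a purely illustrative sketch (the class name, pooling choice, and layer sizes here are hypothetical, not the recipe's actual module), such a head might look like:

```python
import torch
from torch import nn


class LogZHead(nn.Module):
    """Hypothetical projection head mapping pooled hidden states to a scalar log Z.

    Illustrative only: the architecture actually used by the recipe is not shown in this diff.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (B, T, H) -> mean-pool over tokens -> (B, 1) log Z estimate,
        # which compute_flowrl later squeezes to (B,) via log_z.squeeze(-1).
        pooled = hidden_states.mean(dim=1)
        return self.proj(pooled)
```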
