Commit 493a397

[doc] feat: update reward loop document (#4404)
### What does this PR do?

as title

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent 27d1ada commit 493a397

docs/advance/reward_loop.rst

Lines changed: 143 additions & 125 deletions
@@ -5,62 +5,167 @@ Reward Loop
 
Author: `Yuyang Ding <https://yyding1.github.io>`_

-Last updated: 10/23/2025.
+Last updated: 12/3/2025.

.. warning::
-    Reward Loop is ready for use, but the API may change in future releaes.
+    Reward Loop is ready for use, but the API may change in future releases.

-Reward Loop is designed for more flexible and easy-to-use reward computation.
+Reward Loop is designed to support flexible and user-friendly reward computation, with implementation mostly in ``verl/experimental/reward``.

-**Design goal**:
+**Supported Types of Rewards:** Reward Loop covers all typical reward-computation scenarios.

-- Make reward computation more efficient
-- Support broader reward model interface (including discriminative and generative models)
-- Make user customized reward function more flexible
+- **Rule-based Reward**: The reward is determined by predefined rules, e.g., checking whether the predicted answer matches the ground truth via simple string matching (see the sketch below).
+- **Discriminative Reward Model (DisRM)**: The reward is produced by a specified discriminative reward model, such as ``Skywork/Skywork-Reward-Llama-3.1-8B-v0.2``.
+- **Generative Reward Model (GenRM)**: The reward is obtained using a generative reward model, for example ``dyyyyyyyy/FAPO-GenRM-4B``.
+- **Hybrid Reward Scenarios**: Reward Loop provides interfaces for plugging in reward models, allowing users to define custom reward logic based on their needs (e.g., combining rule-based methods with GenRM).

-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_overview.svg?raw=true
+**Supported Training Modes:** Reward Loop supports multiple execution modes for reward training.

-Async Reward Computation
-------------------------
+- **Colocate Mode**: The reward model shares the same resource pool as the actor/rollout/reference models. In this setup, all rollouts must complete first, after which the reward model is awakened to perform inference.
+- **Standalone Mode**: The reward model runs on a separate resource pool, independent from the actor/rollout/reference models. In this setup, each sample is evaluated by the reward model immediately after its rollout finishes.

-RewardLoopManager
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop.svg?raw=true
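The rule-based case above often amounts to normalized string matching against the ground truth. A minimal sketch (illustrative only, not verl code; only the ``{"score": ...}`` return shape follows the custom-reward example later in this document):

```python
# Illustrative rule-based reward (not verl code): exact match after light normalization.
def rule_based_reward(solution_str: str, ground_truth: str) -> dict:
    def normalize(text: str) -> str:
        # lowercase and collapse whitespace before comparing
        return " ".join(text.strip().lower().split())

    score = 1.0 if normalize(solution_str) == normalize(ground_truth) else 0.0
    return {"score": score}
```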
+
+Architecture Design
+-------------------
+
+RewardLoopWorker
~~~~~~~~~~~~~~~~~

-The Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the ``run_single`` function.
-This asynchronous design enables the Reward Loop to handle multiple reward computations concurrently, significantly improving computation efficiency.
+The ``RewardLoopWorker`` is responsible for handling batch-level reward computation across all supported execution modes, operating in an asynchronous manner.
+
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_worker.svg?raw=true
+
+For each sample, the reward is computed according to the following logic:
+
+- if ``custom_reward_function`` is provided, we directly use the user-customized reward function
+
+- if ``custom_reward_function`` is not provided:
+
+  - **reward model is not enabled**: use the default rule-based reward function
+  - **reward model is discriminative**: compute the reward score using the DisRM
+  - **reward model is generative**: this is not permitted (a user-customized reward function **must be** provided)
+
+In most cases, we encourage users to define and use their own customized reward functions.

.. code:: python

-    class RewardLoopManagerBase(ABC):
-        async def run_single(self, data: DataProto) -> dict:
-            # ... (data preprocessing)
-            if self.is_async_reward_score:
-                result = await self.compute_score(
-                    data_source=data_source,
-                    solution_str=response_str,
-                    ground_truth=ground_truth,
-                    extra_info=extra_info,
-                    reward_router_address=self.reward_router_address,
-                    reward_model_tokenizer=self.reward_model_tokenizer,
-                )
+    @ray.remote
+    class RewardLoopWorker:
+        async def compute_score_batch(self, data: DataProto) -> list[dict]:
+            tasks = []
+            for i in range(len(data)):
+                tasks.append(asyncio.create_task(self.compute_score(data[i : i + 1])))
+            outputs = await asyncio.gather(*tasks)
+            return outputs
+
+        async def compute_score(self, data: DataProto) -> dict:
+            assert len(data) == 1, "RewardLoopWorker only support single data item"
+            if self.config.custom_reward_function.path is not None:
+                # directly use user-customized reward function
+                return await self.reward_loop.run_single(data)
+            else:
+                if self.config.reward_model.enable:
+                    # we assume the rm is disrm
+                    # genrm must set custom_reward_function
+                    return await self.compute_score_disrm(data)
                else:
-                result = await self.loop.run_in_executor(
-                    None,
-                    lambda: self.compute_score(
-                        data_source=data_source,
-                        solution_str=response_str,
-                        ground_truth=ground_truth,
-                        extra_info=extra_info,
-                        reward_router_address=self.reward_router_address,
-                        reward_model_tokenizer=self.reward_model_tokenizer,
-                    ),
-                )
-            # ... (reward postprocessing)
-            return final_result
+                    return await self.reward_loop.run_single(data)
+
75+
RewardLoopManager
76+
~~~~~~~~~~~~~~~~~
77+
78+
In **standalone mode**, we directly launch one ``RewardLoopWorker`` for each ``AgentLoopWorker`` to handle reward computation independently.
79+
80+
In **colocate mode**, we launch a ``RewardLoopManager`` to
81+
82+
1. launch reward model if enabled
83+
2. manage multiple ``RewardLoopWorker`` instances to handle CPU-intensive tasks such as code.
84+
85+
.. code:: python
86+
87+
class RewardLoopManager:
88+
"""
89+
RewardLoopManager run in single controller.
90+
This class will create reward loop workers and manage them.
91+
RewardLoopManager will deprecate fsdp/megatron RewardModelWorker in the future.
92+
"""
93+
def __init__(self, config: DictConfig, rm_resource_pool: RayResourcePool = None):
94+
self.config = config
95+
if self.config.reward_model.enable:
96+
self.reward_model_manager = RewardModelManager(config.reward_model, rm_resource_pool)
97+
self.reward_router_address = self.reward_model_manager.get_router_address()
98+
else:
99+
self.reward_model_manager = None
100+
self.reward_router_address = None
101+
102+
self._init_reward_loop_workers()
103+
104+
def _init_reward_loop_workers(self):
105+
self.reward_loop_workers = []
106+
num_workers = self.config.reward_model.get("num_workers", 1)
107+
node_ids = [node["NodeID"] for node in ray.nodes() if node["Alive"] and node["Resources"].get("CPU", 0) > 0]
108+
109+
for i in range(num_workers):
110+
# Round-robin scheduling over the all nodes
111+
node_id = node_ids[i % len(node_ids)]
112+
self.reward_loop_workers.append(
113+
RewardLoopWorker.options(
114+
name=f"reward_loop_worker_{i}",
115+
scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
116+
node_id=node_id,
117+
soft=True,
118+
),
119+
).remote(self.config, self.reward_router_address)
120+
)
121+
122+
def compute_rm_score(self, data: DataProto) -> DataProto:
123+
"""
124+
Compute reward score for the given data.
125+
"""
126+
...
127+
128+
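The body of ``compute_rm_score`` is elided above. Purely as an illustration of how a batch entry point could fan work out to these actors, here is a hypothetical helper (not verl's implementation; it relies only on the slicing and ``compute_score_batch`` calls shown in this diff):

```python
import ray

def dispatch_to_workers(workers, data, chunk_size: int = 128):
    """Hypothetical helper: shard a batch across RewardLoopWorker actors
    round-robin and gather the per-sample score dicts."""
    futures = []
    for idx, start in enumerate(range(0, len(data), chunk_size)):
        worker = workers[idx % len(workers)]  # round-robin over the actor pool
        futures.append(worker.compute_score_batch.remote(data[start : start + chunk_size]))
    # ray.get returns results in submission order, so scores stay aligned with the batch
    return [score for chunk in ray.get(futures) for score in chunk]
```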
+RewardModelManager
+~~~~~~~~~~~~~~~~~~
+
+To support flexible and scalable reward model computation, Reward Loop implements a reward router that coordinates requests among multiple reward model servers.
+
+Each reward model runs as an independent server and is registered with the router.
+The router forwards requests to the registered reward servers with load balancing and returns the results.
+This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
+
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.svg?raw=true
+
+.. code:: python
+
+    class RewardModelManager:
+        """Reward model manager."""
+
+        def __init__(
+            self,
+            config: RewardModelConfig,
+            resource_pool: RayResourcePool = None,
+        ):
+            """
+            Initialize the reward model manager.
+
+            Args:
+                config (RewardModelConfig): Reward model configuration.
+                resource_pool (RayResourcePool, optional): Resource pool. Defaults to None.
+            """
+            self.config = config
+            self.resource_pool = resource_pool
+            self._initialize_llm_servers()
+            self._initialize_router()
+            assert self.config.rollout.skip_tokenizer_init is False, "Reward model should not skip tokenizer init."
+            if self.config.rollout.free_cache_engine:
+                self.sleep()

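Because a single router address fronts all registered reward servers, a user-defined reward function can reach a GenRM through plain HTTP. A sketch of such a call (the OpenAI-compatible ``/v1/chat/completions`` path, payload shape, and ``host:port`` address format are assumptions, not something this diff specifies):

```python
import aiohttp

async def query_genrm(reward_router_address: str, question: str, answer: str) -> str:
    """Sketch: send one judge request through the unified router address
    (assumed here to be a bare "host:port" string)."""
    payload = {
        "model": "genrm",  # illustrative model name
        "messages": [
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}\nIs the answer correct?"}
        ],
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"http://{reward_router_address}/v1/chat/completions", json=payload) as resp:
            body = await resp.json()
    return body["choices"][0]["message"]["content"]
```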
User-defined reward functions can be implemented as either synchronous or asynchronous.
``RewardLoopManager`` automatically detects the type of the user-defined function and executes it accordingly, ensuring that the reward computation process remains non-blocking.

+
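One common way to implement this detection, in the spirit of the removed ``RewardLoopManagerBase`` code earlier in this diff (a sketch, not necessarily the current verl internals): inspect the function once, then either await it or push it onto a thread executor.

```python
import asyncio
import inspect

async def call_reward_fn(reward_fn, **kwargs):
    """Sketch: run a user-defined reward function without blocking the event loop."""
    if inspect.iscoroutinefunction(reward_fn):
        # async user function: await it directly
        return await reward_fn(**kwargs)
    # sync user function: hand it to the default thread-pool executor
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: reward_fn(**kwargs))
```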
User-Customized Reward Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -105,90 +210,3 @@ A user-defined reward function may look like the following:
        return {"score": score}

Runnable examples are provided in the ``recipe/fapo`` directory for reference.
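The full example is elided from this hunk; as a rough sketch of the shape such a function takes (the keyword arguments mirror those passed by the removed ``RewardLoopManagerBase`` code earlier in this diff, and the body is illustrative, not the recipe/fapo example):

```python
# Sketch only: the keyword arguments mirror those used in this diff; this is
# not the recipe/fapo example.
def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  reward_router_address=None, reward_model_tokenizer=None, **kwargs):
    # cheap rule-based check; a hybrid setup could instead escalate borderline
    # cases to a GenRM reached via reward_router_address (see the router sketch above)
    score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
    return {"score": score}
```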
-
-Reward Models and Router
-------------------------
-
-To support flexible and scalable reward model computation, RewardLoop implement a reward router that coordinates requests among multiple reward model servers.
-
-Each reward model runs as an independent server and is registered with the router.
-This router will forward the requests to the registered reward servers with load balancing and return the results.
-This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
-
-RewardModelManager
-~~~~~~~~~~~~~~~~~~
-
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.svg?raw=true
-
-``RewardModelManager`` will launch multiple reward servers and register them in the reward router.
-
-.. code:: python
-
-    class RewardModelManager:
-        """Reward model manager."""
-
-        def __init__(self, config: RewardModelConfig, worker_group: RayWorkerGroup = None):
-            """
-            Initialize the reward model manager.
-
-            Args:
-                config (RewardModelConfig): Reward model configuration.
-                worker_group (RayWorkerGroup, optional): Worker group. Defaults to None.
-            """
-            self.config = config
-            self.worker_group = worker_group
-            self._initialize_llm_servers()
-            self._initialize_router()
-            if self.config.rollout.free_cache_engine:
-                self.sleep()
-
-Reward Router
-~~~~~~~~~~~~~
-
-The router is to forward the requests to the registered reward servers with load balancing.
-
-- For sglang reward servers, we directly use the sglang router to forward the requests.
-- For vllm reward servers, we implement a simple round-robin ``NaiveRouter`` to dispatch the requests.
-
-.. code:: python
-
-    class NaiveRouter:
-        def __init__(
-            self,
-            worker_urls: list[str],
-            max_connections: int = 1024,
-            timeout: int = 60,
-            max_attempts: int = 3,
-            retry_delay: float = 2.0,
-            verbose: bool = False,
-        ):
-            """A minimal async load-balancing router."""
-            self.verbose = verbose
-            self.app = FastAPI()
-            self.worker_urls = worker_urls
-            self.request_counts = {url: 0 for url in worker_urls}
-
-            self.max_connections = max_connections
-            self.timeout = timeout
-            self.max_attempts = max_attempts
-            self.retry_delay = retry_delay
-
-            self.app = FastAPI()
-
-            # Register startup / shutdown hooks
-            self.app.on_event("startup")(self._on_startup)
-            self.app.on_event("shutdown")(self._on_shutdown)
-
-            # Catch-all proxy route
-            self.app.api_route("/{endpoint:path}", methods=["GET", "POST"])(self._make_async_request)
-
-            # Placeholder for aiohttp client
-            self.client = None
-
-Agent Reward Loop
------------------
-
-Reward Loop can be integrated with AgentLoop to enable sample-wise rollout and reward computation.
-
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/agent_reward_loop.svg?raw=true
