Commit 493a397

[doc] feat: update reward loop document (#4404)
### What does this PR do?

as title

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent 27d1ada commit 493a397

docs/advance/reward_loop.rst

Lines changed: 143 additions & 125 deletions
@@ -5,62 +5,167 @@ Reward Loop
 
Author: `Yuyang Ding <https://yyding1.github.io>`_

-Last updated: 10/23/2025.
+Last updated: 12/3/2025.

.. warning::
-    Reward Loop is ready for use, but the API may change in future releaes.
+    Reward Loop is ready for use, but the API may change in future releases.

-Reward Loop is designed for more flexible and easy-to-use reward computation.
+Reward Loop is designed to support flexible and user-friendly reward computation, with implementation mostly in ``verl/experimental/reward``.

-**Design goal**:
+**Supported Types of Rewards:** Reward Loop covers all typical reward-computation scenarios.

-- Make reward computation more efficient
-- Support broader reward model interface (including discriminative and generative models)
-- Make user customized reward function more flexible
+- **Rule-based Reward**: The reward is determined by predefined rules, e.g., checking whether the predicted answer matches the ground truth via simple string matching (see the sketch below).
+- **Discriminative Reward Model (DisRM)**: The reward is produced by a specified discriminative reward model, such as ``Skywork/Skywork-Reward-Llama-3.1-8B-v0.2``.
+- **Generative Reward Model (GenRM)**: The reward is obtained using a generative reward model, for example ``dyyyyyyyy/FAPO-GenRM-4B``.
+- **Hybrid Reward Scenarios**: Reward Loop provides interfaces for plugging in reward models, allowing users to define custom reward logic based on their needs (e.g., combining rule-based methods with GenRM).

-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_overview.svg?raw=true
+**Supported Training Modes:** Reward Loop supports multiple execution modes for reward training.

-Async Reward Computation
-------------------------
+- **Colocate Mode**: The reward model shares the same resource pool as the actor/rollout/reference models. In this setup, all rollouts must complete first, after which the reward model is awakened to perform inference.
+- **Standalone Mode**: The reward model runs on a separate resource pool, independent from the actor/rollout/reference models. In this setup, each sample is evaluated by the reward model immediately after its rollout finishes.

-RewardLoopManager
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop.svg?raw=true
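The rule-based case above often amounts to normalized string matching against the ground truth. A minimal sketch (illustrative only, not verl code; only the ``{"score": ...}`` return shape follows the custom-reward example later in this document):

```python
# Illustrative rule-based reward (not verl code): exact match after light normalization.
def rule_based_reward(solution_str: str, ground_truth: str) -> dict:
    def normalize(text: str) -> str:
        # lowercase and collapse whitespace before comparing
        return " ".join(text.strip().lower().split())

    score = 1.0 if normalize(solution_str) == normalize(ground_truth) else 0.0
    return {"score": score}
```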
+
+Architecture Design
+-------------------
+
+RewardLoopWorker
~~~~~~~~~~~~~~~~~

-The Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the ``run_single`` function.
-This asynchronous design enables the Reward Loop to handle multiple reward computations concurrently, significantly improving computation efficiency.
+The ``RewardLoopWorker`` is responsible for handling batch-level reward computation across all supported execution modes, operating in an asynchronous manner.
+
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_worker.svg?raw=true
+
+For each sample, the reward is computed according to the following logic:
+
+- if ``custom_reward_function`` is provided, we directly use the user-customized reward function
+
+- if ``custom_reward_function`` is not provided:
+
+  - **reward model is not enabled**: use the default rule-based reward function
+  - **reward model is discriminative**: compute the reward score using the DisRM
+  - **reward model is generative**: this is not permitted (a user-customized reward function **must be** provided)
+
+In most cases, we encourage users to define and use their own customized reward functions.

.. code:: python

-    class RewardLoopManagerBase(ABC):
-        async def run_single(self, data: DataProto) -> dict:
-            # ... (data preprocessing)
-            if self.is_async_reward_score:
-                result = await self.compute_score(
-                    data_source=data_source,
-                    solution_str=response_str,
-                    ground_truth=ground_truth,
-                    extra_info=extra_info,
-                    reward_router_address=self.reward_router_address,
-                    reward_model_tokenizer=self.reward_model_tokenizer,
-                )
+    @ray.remote
+    class RewardLoopWorker:
+        async def compute_score_batch(self, data: DataProto) -> list[dict]:
+            tasks = []
+            for i in range(len(data)):
+                tasks.append(asyncio.create_task(self.compute_score(data[i : i + 1])))
+            outputs = await asyncio.gather(*tasks)
+            return outputs
+
+        async def compute_score(self, data: DataProto) -> dict:
+            assert len(data) == 1, "RewardLoopWorker only support single data item"
+            if self.config.custom_reward_function.path is not None:
+                # directly use user-customized reward function
+                return await self.reward_loop.run_single(data)
+            else:
+                if self.config.reward_model.enable:
+                    # we assume the rm is disrm
+                    # genrm must set custom_reward_function
+                    return await self.compute_score_disrm(data)
                else:
-                result = await self.loop.run_in_executor(
-                    None,
-                    lambda: self.compute_score(
-                        data_source=data_source,
-                        solution_str=response_str,
-                        ground_truth=ground_truth,
-                        extra_info=extra_info,
-                        reward_router_address=self.reward_router_address,
-                        reward_model_tokenizer=self.reward_model_tokenizer,
-                    ),
-                )
-            # ... (reward postprocessing)
-            return final_result
+                    return await self.reward_loop.run_single(data)
+
75+
RewardLoopManager
76+
~~~~~~~~~~~~~~~~~
77+
78+
In **standalone mode**, we directly launch one ``RewardLoopWorker`` for each ``AgentLoopWorker`` to handle reward computation independently.
79+
80+
In **colocate mode**, we launch a ``RewardLoopManager`` to
81+
82+
1. launch reward model if enabled
83+
2. manage multiple ``RewardLoopWorker`` instances to handle CPU-intensive tasks such as code.
84+
85+
.. code:: python
86+
87+
class RewardLoopManager:
88+
"""
89+
RewardLoopManager run in single controller.
90+
This class will create reward loop workers and manage them.
91+
RewardLoopManager will deprecate fsdp/megatron RewardModelWorker in the future.
92+
"""
93+
def __init__(self, config: DictConfig, rm_resource_pool: RayResourcePool = None):
94+
self.config = config
95+
if self.config.reward_model.enable:
96+
self.reward_model_manager = RewardModelManager(config.reward_model, rm_resource_pool)
97+
self.reward_router_address = self.reward_model_manager.get_router_address()
98+
else:
99+
self.reward_model_manager = None
100+
self.reward_router_address = None
101+
102+
self._init_reward_loop_workers()
103+
104+
def _init_reward_loop_workers(self):
105+
self.reward_loop_workers = []
106+
num_workers = self.config.reward_model.get("num_workers", 1)
107+
node_ids = [node["NodeID"] for node in ray.nodes() if node["Alive"] and node["Resources"].get("CPU", 0) > 0]
108+
109+
for i in range(num_workers):
110+
# Round-robin scheduling over the all nodes
111+
node_id = node_ids[i % len(node_ids)]
112+
self.reward_loop_workers.append(
113+
RewardLoopWorker.options(
114+
name=f"reward_loop_worker_{i}",
115+
scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
116+
node_id=node_id,
117+
soft=True,
118+
),
119+
).remote(self.config, self.reward_router_address)
120+
)
121+
122+
def compute_rm_score(self, data: DataProto) -> DataProto:
123+
"""
124+
Compute reward score for the given data.
125+
"""
126+
...
127+
128+
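The body of ``compute_rm_score`` is elided above. Purely as an illustration of how a batch entry point could fan work out to these actors, here is a hypothetical helper (not verl's implementation; it relies only on the slicing and ``compute_score_batch`` calls shown in this diff):

```python
import ray

def dispatch_to_workers(workers, data, chunk_size: int = 128):
    """Hypothetical helper: shard a batch across RewardLoopWorker actors
    round-robin and gather the per-sample score dicts."""
    futures = []
    for idx, start in enumerate(range(0, len(data), chunk_size)):
        worker = workers[idx % len(workers)]  # round-robin over the actor pool
        futures.append(worker.compute_score_batch.remote(data[start : start + chunk_size]))
    # ray.get returns results in submission order, so scores stay aligned with the batch
    return [score for chunk in ray.get(futures) for score in chunk]
```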
+RewardModelManager
+~~~~~~~~~~~~~~~~~~
+
+To support flexible and scalable reward model computation, Reward Loop implements a reward router that coordinates requests among multiple reward model servers.
+
+Each reward model runs as an independent server and is registered with the router.
+The router forwards requests to the registered reward servers with load balancing and returns the results.
+This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
+
+.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.svg?raw=true
+
+.. code:: python
+
+    class RewardModelManager:
+        """Reward model manager."""
+
+        def __init__(
+            self,
+            config: RewardModelConfig,
+            resource_pool: RayResourcePool = None,
+        ):
+            """
+            Initialize the reward model manager.
+
+            Args:
+                config (RewardModelConfig): Reward model configuration.
+                resource_pool (RayResourcePool, optional): Resource pool. Defaults to None.
+            """
+            self.config = config
+            self.resource_pool = resource_pool
+            self._initialize_llm_servers()
+            self._initialize_router()
+            assert self.config.rollout.skip_tokenizer_init is False, "Reward model should not skip tokenizer init."
+            if self.config.rollout.free_cache_engine:
+                self.sleep()

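Because a single router address fronts all registered reward servers, a user-defined reward function can reach a GenRM through plain HTTP. A sketch of such a call (the OpenAI-compatible ``/v1/chat/completions`` path, payload shape, and ``host:port`` address format are assumptions, not something this diff specifies):

```python
import aiohttp

async def query_genrm(reward_router_address: str, question: str, answer: str) -> str:
    """Sketch: send one judge request through the unified router address
    (assumed here to be a bare "host:port" string)."""
    payload = {
        "model": "genrm",  # illustrative model name
        "messages": [
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}\nIs the answer correct?"}
        ],
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"http://{reward_router_address}/v1/chat/completions", json=payload) as resp:
            body = await resp.json()
    return body["choices"][0]["message"]["content"]
```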
User-defined reward functions can be implemented as either synchronous or asynchronous.
``RewardLoopManager`` automatically detects the type of the user-defined function and executes it accordingly, ensuring that the reward computation process remains non-blocking.

+
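One common way to implement this detection, in the spirit of the removed ``RewardLoopManagerBase`` code earlier in this diff (a sketch, not necessarily the current verl internals): inspect the function once, then either await it or push it onto a thread executor.

```python
import asyncio
import inspect

async def call_reward_fn(reward_fn, **kwargs):
    """Sketch: run a user-defined reward function without blocking the event loop."""
    if inspect.iscoroutinefunction(reward_fn):
        # async user function: await it directly
        return await reward_fn(**kwargs)
    # sync user function: hand it to the default thread-pool executor
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: reward_fn(**kwargs))
```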
User-Customized Reward Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -105,90 +210,3 @@ A user-defined reward function may look like the following:
        return {"score": score}

Runnable examples are provided in the ``recipe/fapo`` directory for reference.
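The full example is elided from this hunk; as a rough sketch of the shape such a function takes (the keyword arguments mirror those passed by the removed ``RewardLoopManagerBase`` code earlier in this diff, and the body is illustrative, not the recipe/fapo example):

```python
# Sketch only: the keyword arguments mirror those used in this diff; this is
# not the recipe/fapo example.
def compute_score(data_source, solution_str, ground_truth, extra_info=None,
                  reward_router_address=None, reward_model_tokenizer=None, **kwargs):
    # cheap rule-based check; a hybrid setup could instead escalate borderline
    # cases to a GenRM reached via reward_router_address (see the router sketch above)
    score = 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
    return {"score": score}
```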
-
-Reward Models and Router
-------------------------
-
-To support flexible and scalable reward model computation, RewardLoop implement a reward router that coordinates requests among multiple reward model servers.
-
-Each reward model runs as an independent server and is registered with the router.
-This router will forward the requests to the registered reward servers with load balancing and return the results.
-This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
-
-RewardModelManager
-~~~~~~~~~~~~~~~~~~
-
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/reward_loop_full.svg?raw=true
-
-``RewardModelManager`` will launch multiple reward servers and register them in the reward router.
-
-.. code:: python
-
-    class RewardModelManager:
-        """Reward model manager."""
-
-        def __init__(self, config: RewardModelConfig, worker_group: RayWorkerGroup = None):
-            """
-            Initialize the reward model manager.
-
-            Args:
-                config (RewardModelConfig): Reward model configuration.
-                worker_group (RayWorkerGroup, optional): Worker group. Defaults to None.
-            """
-            self.config = config
-            self.worker_group = worker_group
-            self._initialize_llm_servers()
-            self._initialize_router()
-            if self.config.rollout.free_cache_engine:
-                self.sleep()
-
-Reward Router
-~~~~~~~~~~~~~
-
-The router is to forward the requests to the registered reward servers with load balancing.
-
-- For sglang reward servers, we directly use the sglang router to forward the requests.
-- For vllm reward servers, we implement a simple round-robin ``NaiveRouter`` to dispatch the requests.
-
-.. code:: python
-
-    class NaiveRouter:
-        def __init__(
-            self,
-            worker_urls: list[str],
-            max_connections: int = 1024,
-            timeout: int = 60,
-            max_attempts: int = 3,
-            retry_delay: float = 2.0,
-            verbose: bool = False,
-        ):
-            """A minimal async load-balancing router."""
-            self.verbose = verbose
-            self.app = FastAPI()
-            self.worker_urls = worker_urls
-            self.request_counts = {url: 0 for url in worker_urls}
-
-            self.max_connections = max_connections
-            self.timeout = timeout
-            self.max_attempts = max_attempts
-            self.retry_delay = retry_delay
-
-            self.app = FastAPI()
-
-            # Register startup / shutdown hooks
-            self.app.on_event("startup")(self._on_startup)
-            self.app.on_event("shutdown")(self._on_shutdown)
-
-            # Catch-all proxy route
-            self.app.api_route("/{endpoint:path}", methods=["GET", "POST"])(self._make_async_request)
-
-            # Placeholder for aiohttp client
-            self.client = None
-
-Agent Reward Loop
------------------
-
-Reward Loop can be integrated with AgentLoop to enable sample-wise rollout and reward computation.
-
-.. image:: https://github.com/yyDing1/verl-materials/blob/main/agent_reward_loop.svg?raw=true
