### What does this PR do?
Add documentation for the Reward Loop: supported reward types, training modes, asynchronous reward computation, the reward model router, and user-customized reward functions.
### Checklist Before Starting
- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This
will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`,
`trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`,
`ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`,
`env`, `tool`, `ckpt`, `doc`, `data`
- If this PR involves multiple modules, separate them with `,` like
`[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If this PR breaks any API (CLI arguments, config, function signature,
etc.), add `[BREAKING]` to the beginning of the title.
- Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`
### Test
> For changes that can not be tested by CI (e.g., algorithm
implementation, new model support), validate by experiment(s) and show
results like training curve plots, evaluation results, etc.
### API and Usage Example
> Demonstrate how the API changes if any, and provide usage example(s)
if possible.
```python
# Add code snippet or script demonstrating how to use this
```
### Design & Code Changes
> Demonstrate the high-level design if this PR is complex, and list the
specific changes.
### Checklist Before Submitting
> [!IMPORTANT]
> Please check all the following items before requesting a review,
otherwise the reviewer might deprioritize this PR for review.
- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting):
`pre-commit install && pre-commit run --all-files --show-diff-on-failure
--color=always`
- [ ] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request`
channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the
`verl` Slack
workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
(If not accessible, please try [the Feishu group
(飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
Reward Loop is ready for use, but the API may change in future releases.
Reward Loop is designed to support flexible and user-friendly reward computation, with implementation mostly in ``verl/experimental/reward``.
**Supported Types of Rewards:** Reward Loop covers all typical reward-computation scenarios (a minimal rule-based sketch follows this list).

- **Rule-based Reward**: The reward is determined by predefined rules, e.g., checking whether the predicted answer matches the ground truth via simple string matching.
- **Discriminative Reward Model (DisRM)**: The reward is produced by a specified discriminative reward model, such as ``Skywork/Skywork-Reward-Llama-3.1-8B-v0.2``.
- **Generative Reward Model (GenRM)**: The reward is obtained using a generative reward model, for example ``dyyyyyyyy/FAPO-GenRM-4B``.
- **Hybrid Reward Scenarios**: Reward Loop provides interfaces for plugging in reward models, allowing users to define custom reward logic based on their needs (e.g., combining rule-based methods with GenRM).
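As a concrete illustration of the rule-based case, here is a minimal sketch of a reward function that scores a response by string matching against the ground truth. The function name and argument names follow verl's conventional custom reward-function style and should be read as assumptions, not as the Reward Loop API.

```python
# Minimal rule-based reward sketch (illustrative only; the signature is an
# assumption modeled on verl's conventional custom reward functions).
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Return 1.0 when the predicted answer exactly matches the ground truth."""
    predicted = solution_str.strip().lower()
    expected = str(ground_truth).strip().lower()
    return {"score": 1.0 if predicted == expected else 0.0}
```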
**Supported Training Modes:** Reward Loop supports multiple execution modes for reward training.

- **Colocate Mode**: The reward model shares the same resource pool as the actor/rollout/reference models. In this setup, all rollouts must complete first, after which the reward model is awakened to perform inference.
- **Standalone Mode**: The reward model runs on a separate resource pool, independent from the actor/rollout/reference models. In this setup, each sample is evaluated by the reward model immediately after its rollout finishes.
The Reward Loop refactors the design of the reward manager so that each sample is processed asynchronously in the ``run_single`` function. This asynchronous design enables the Reward Loop to handle multiple reward computations concurrently, significantly improving computation efficiency. The ``RewardLoopWorker`` is responsible for handling batch-level reward computation across all supported execution modes, operating in an asynchronous manner.
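The per-sample asynchronous design can be sketched as follows. Only the ``run_single`` name echoes the Reward Loop; the batch helper and the stand-in reward-model call are hypothetical.

```python
import asyncio

async def score_with_reward_model(sample):
    # Stand-in for a real reward-model call (e.g., an HTTP request to a server).
    await asyncio.sleep(0)  # simulate non-blocking I/O
    return {"score": 1.0 if sample["response"] == sample["ground_truth"] else 0.0}

async def run_single(sample):
    # Compute the reward for one sample without blocking the event loop.
    return await score_with_reward_model(sample)

async def run_batch(samples):
    # Per-sample coroutines run concurrently, so a slow reward-model call
    # for one sample does not delay the rest of the batch.
    return await asyncio.gather(*(run_single(s) for s in samples))
```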
To support flexible and scalable reward model computation, Reward Loop implements a reward router that coordinates requests among multiple reward model servers. Each reward model runs as an independent server and is registered with the router. The router forwards requests to the registered reward servers with load balancing and returns the results. This design allows us to expose a single unified router address to user-defined reward functions, enabling them to access various reward models seamlessly through the same interface.
The initialization below stores the reward model configuration and optional resource pool, brings up the underlying LLM servers, and registers them with the router:

```python
def __init__(self, config, resource_pool=None):
    # Signature reconstructed from the documented arguments below.
    """
    Args:
        config (RewardModelConfig): Reward model configuration.
        resource_pool (RayResourcePool, optional): Resource pool. Defaults to None.
    """
    self.config = config
    self.resource_pool = resource_pool
    self._initialize_llm_servers()
    self._initialize_router()
    assert self.config.rollout.skip_tokenizer_init is False, "Reward model should not skip tokenizer init."
    if self.config.rollout.free_cache_engine:
        self.sleep()
```
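Because the router exposes one address for all registered reward servers, a user-defined reward function only needs that address to reach a GenRM. A hypothetical sketch, assuming the servers sit behind an OpenAI-compatible chat-completions endpoint (the URL path, payload format, and model name are assumptions, not the Reward Loop API):

```python
import aiohttp

async def query_genrm(router_address: str, prompt: str) -> str:
    # Send a judging prompt to the single router address; the router
    # load-balances it across the registered GenRM servers.
    payload = {
        "model": "dyyyyyyyy/FAPO-GenRM-4B",
        "messages": [{"role": "user", "content": prompt}],
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(f"http://{router_address}/v1/chat/completions", json=payload) as resp:
            data = await resp.json()
    return data["choices"][0]["message"]["content"]
```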
User-defined reward functions can be implemented as either synchronous or asynchronous. ``RewardLoopManager`` automatically detects the type of the user-defined function and executes it accordingly, ensuring that the reward computation process remains non-blocking.
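The detection can plausibly be done with Python's coroutine-function check; this toy dispatcher illustrates the idea and is not the actual ``RewardLoopManager`` implementation.

```python
import asyncio
import inspect

async def call_reward_fn(reward_fn, **kwargs):
    # Async user function: await it directly so the event loop stays free.
    if inspect.iscoroutinefunction(reward_fn):
        return await reward_fn(**kwargs)
    # Sync user function: run it in a worker thread to avoid blocking the loop.
    return await asyncio.to_thread(reward_fn, **kwargs)
```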
User-Customized Reward Function
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A user-defined reward function may look like the following, returning a dictionary such as ``{"score": score}``.
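A hypothetical sketch of such a function, combining an exact-match rule with a GenRM verdict obtained through the ``query_genrm`` helper sketched earlier; the signature and the ``router_address`` entry in ``extra_info`` are assumptions, not the documented API:

```python
async def my_reward_fn(data_source, solution_str, ground_truth, extra_info=None):
    # Hybrid rule: accept an exact match immediately, otherwise ask the GenRM.
    if solution_str.strip() == str(ground_truth).strip():
        score = 1.0
    else:
        prompt = (
            f"Ground truth: {ground_truth}\n"
            f"Model answer: {solution_str}\n"
            "Is the answer correct? Reply yes or no."
        )
        verdict = await query_genrm(extra_info["router_address"], prompt)
        score = 1.0 if "yes" in verdict.lower() else 0.0
    return {"score": score}
```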
Runnable examples are provided in the ``recipe/fapo`` directory for reference.