If the filtering function is very strict, causing a large number of prompt groups to be discarded, the system will monitor the number of pending tasks in `remaining_batch_size`. Once the number of pending tasks drops below the target number (32) due to too many being discarded, the system will automatically trigger a new round of oversampling, requesting `over_sampling_batch_size` (64) new prompts again to repeat the above process.
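The oversampling behaviour described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not slime's implementation: `fake_rewards`, `nonzero_reward_std`, and `collect_batch` are hypothetical names, the reward generator is a random stand-in for real generation plus a reward model, and `remaining_batch_size` is modelled as the number of submitted prompt groups that have not been discarded.

```python
import random
import statistics

TARGET_BATCH_SIZE = 32          # target number of prompt groups (from the text)
OVER_SAMPLING_BATCH_SIZE = 64   # prompts requested per oversampling round
N_SAMPLES_PER_PROMPT = 8        # responses sampled per prompt


def nonzero_reward_std(rewards):
    """Hypothetical filter: keep a group only if its rewards are not all identical."""
    return statistics.pstdev(rewards) > 0.0


def fake_rewards(rng):
    """Hypothetical stand-in for generation + reward model, returning 0/1 rewards."""
    p = rng.choice([0.0, 0.5, 1.0])  # some prompts are always wrong or always right
    return [1.0 if rng.random() < p else 0.0 for _ in range(N_SAMPLES_PER_PROMPT)]


def collect_batch(seed=0):
    rng = random.Random(seed)
    kept = []                  # prompt groups that passed the filter
    in_flight = 0              # submitted groups whose responses are not yet filtered
    remaining_batch_size = 0   # submitted groups that have not been discarded
    rounds = 0
    while len(kept) < TARGET_BATCH_SIZE:
        # Too many groups were discarded: request another oversampled round.
        if remaining_batch_size < TARGET_BATCH_SIZE:
            in_flight += OVER_SAMPLING_BATCH_SIZE
            remaining_batch_size += OVER_SAMPLING_BATCH_SIZE
            rounds += 1
        rewards = fake_rewards(rng)  # one group of 8 responses arrives
        in_flight -= 1
        if nonzero_reward_std(rewards):
            kept.append(rewards)
        else:
            remaining_batch_size -= 1
    return kept, rounds


if __name__ == "__main__":
    kept, rounds = collect_batch(seed=0)
    print(f"kept {len(kept)} groups after {rounds} oversampling round(s)")
```

With a strict filter, most groups are dropped and `rounds` grows, which is the situation the paragraph describes.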
Note: On-policy distillation (OPD) is now orthogonal to the advantage estimator. Use `--use-opd` and `--opd-kl-coef` to enable OPD on top of any estimator.
- `--calculate-per-token-loss`: By default, slime calculates loss on a per-sample basis, i.e., `mean(sum(sample_i) / len(sample_i))`. Enable this flag to calculate loss on a per-token basis, i.e., `sum(sum(sample_i)) / sum(len(sample_i))`.
- `--use-tis`: Enable this setting to use TIS (Truncated Importance Sampling) (https://fengyao.notion.site/off-policy-rl).
- `--true-on-policy-mode`: Enable True On-Policy mode, which strictly ensures that data is generated by the current policy during training.
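
The two aggregation formulas quoted for `--calculate-per-token-loss` can give noticeably different values when samples differ in length. A minimal worked example, assuming each sample is just a list of per-token loss values (the function names are illustrative and not part of slime):

```python
def per_sample_loss(token_losses):
    """mean(sum(sample_i) / len(sample_i)): every sample has equal weight."""
    return sum(sum(s) / len(s) for s in token_losses) / len(token_losses)


def per_token_loss(token_losses):
    """sum(sum(sample_i)) / sum(len(sample_i)): every token has equal weight."""
    return sum(sum(s) for s in token_losses) / sum(len(s) for s in token_losses)


# One short sample with low loss, one long sample with high loss.
samples = [[1.0, 1.0], [4.0] * 6]

print(per_sample_loss(samples))  # (1.0 + 4.0) / 2 = 2.5
print(per_token_loss(samples))   # (2.0 + 24.0) / 8 = 3.25
```

Per-token aggregation lets long responses dominate the loss, while per-sample aggregation weights every response equally regardless of length.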
@@ -266,19 +267,19 @@ slime supports customizing data generation (rollout) to various degrees.
- You can completely replace the `generate_rollout` in sglang\_example.py by using the `--rollout-function-path` parameter. You just need to ensure that the function signature passed via `--rollout-function-path` is as follows:
  rollout_id: int, the id of the rollout, used for deterministic data generation
- data_buffer: the data buffer to store the generated samples
+ data_source: the data source to get and store samples
  evaluation: bool, whether the rollout is for evaluation or not

  Returns:
- list[list[Sample]]: a list of samples generated by the rollout
+ RolloutFnTrainOutput | RolloutFnEvalOutput: the output of the rollout
  """
  ...
- return samples
+ return output
```
Where:
@@ -287,7 +288,7 @@ slime supports customizing data generation (rollout) to various degrees.
- `rollout_id`: The ID of the current data generation round, used to ensure data order when resuming training.
- - `data_buffer`: A globally unique data buffer in slime, which can be used to get initial prompts, data IDs, and store partially generated samples for later use.
+ - `data_source`: A globally unique data source in slime, which can be used to get initial prompts, data IDs, and store partially generated samples for later use.
- `evaluation`: A boolean indicating if the rollout is for evaluation. You can configure a separate evaluation function using `--eval-function-path`.
@@ -296,10 +297,7 @@ slime supports customizing data generation (rollout) to various degrees.
- `tokens`: The tokens for the prompt + response.
- `response_length`: The total length of the response. For multi-turn tasks, this is the length of the tokens remaining after the first-turn prompt.
- `reward`: The reward for this data sample.
- - `truncated`: Whether this data sample was truncated, similar to `finish_reason == length` in SGLang.
-
- And if there are scenarios like tool calls or multi-turn usage, ensure the `loss_mask` is correct:
-
+ - `status`: The status of this data sample (e.g., `Sample.Status.COMPLETED`, `Sample.Status.TRUNCATED`, `Sample.Status.ABORTED`, `Sample.Status.FAILED`).
- `loss_mask` should be the same length as `response_length`, with `1` for tokens that should be included in the loss calculation and `0` for those that should be masked out.
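
As a rough illustration of the `loss_mask` rule for a tool-call or multi-turn rollout, the sketch below builds a mask from the segments that follow the first-turn prompt. The helper `build_loss_mask`, the `(token_ids, is_model_generated)` segment layout, and the token IDs are all assumptions made for this example, not slime APIs.

```python
def build_loss_mask(segments):
    """Build a 0/1 mask over everything after the first-turn prompt.

    segments: list of (token_ids, is_model_generated) pairs in order.
    Model-generated tokens get 1; tool outputs and later user turns get 0.
    """
    mask = []
    for token_ids, is_model_generated in segments:
        mask.extend([1 if is_model_generated else 0] * len(token_ids))
    return mask


prompt = [101, 102, 103]          # first-turn prompt tokens (made-up IDs)
segments = [
    ([11, 12], True),             # model issues a tool call
    ([21, 22, 23], False),        # tool output appended by the environment
    ([31], True),                 # model's final answer
]

tokens = prompt + [t for ids, _ in segments for t in ids]
response_length = len(tokens) - len(prompt)
loss_mask = build_loss_mask(segments)

assert len(loss_mask) == response_length
print(loss_mask)  # [1, 1, 0, 0, 0, 1]
```

The length check at the end mirrors the requirement that `loss_mask` match `response_length` exactly.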
- In some cases, you may only need to replace the data generation logic. You can do this using `--custom-generate-function-path`. A simplified implementation of this function is as follows:
@@ -325,9 +323,14 @@ slime supports customizing data generation (rollout) to various degrees.
get_started/quick_start.html
6 additions & 2 deletions
@@ -847,8 +847,12 @@ <h3>Dynamic Sampling<a class="headerlink" href="#dynamic-sampling" title="Link t
<p>Then each sampling will directly sample 64 prompts, and each prompt will be sampled 8 times. Because slime performs asynchronous sampling internally, we will successively obtain 8 responses for each prompt. When receiving responses, the function corresponding to <code class="docutils literal notranslate"><span class="pre">dynamic_sampling_filter_path</span></code> will be used for filtering. If it passes, these 8 pieces of data will be kept; otherwise, they will be discarded.</p>
<p>The filtering function <code class="docutils literal notranslate"><span class="pre">check_reward_nonzero_std</span></code> in the example will check whether the standard deviation of rewards for a group of samples is greater than zero, ensuring that the reward scores of each group of samples left have differences, thereby avoiding overly homogeneous data and improving data diversity.</p>
<p>If the filtering function is very strict, causing a large number of prompt groups to be discarded, the system will monitor the number of pending tasks in <code class="docutils literal notranslate"><span class="pre">remaining_batch_size</span></code>. Once the number of pending tasks drops below the target number (32) due to too many being discarded, the system will automatically trigger a new round of oversampling, requesting <code class="docutils literal notranslate"><span class="pre">over_sampling_batch_size</span></code> (64) new prompts again to repeat the above process.</p>