THUDM
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
0 Bytes
diff --git a/.doctrees/get_started/customization.doctree b/.doctrees/get_started/customization.doctree
104 Bytes
diff --git a/.doctrees/get_started/quick_start.doctree b/.doctrees/get_started/quick_start.doctree
290 Bytes
diff --git a/.doctrees/get_started/usage.doctree b/.doctrees/get_started/usage.doctree
812 Bytes
diff --git a/_sources/get_started/customization.md b/_sources/get_started/customization.md
Lines changed: 3 additions & 3 deletions
diff --git a/_sources/get_started/quick_start.md b/_sources/get_started/quick_start.md
Lines changed: 6 additions & 2 deletions
diff --git a/_sources/get_started/usage.md b/_sources/get_started/usage.md
Lines changed: 15 additions & 12 deletions
diff --git a/get_started/customization.html b/get_started/customization.html
Lines changed: 3 additions & 3 deletions
diff --git a/get_started/quick_start.html b/get_started/quick_start.html
Lines changed: 6 additions & 2 deletions
@@ -40,7 +40,7 @@ Below is a summary of all available customization interfaces and their purposes.
 
 **Signature**:
 ```python
-async def generate_rollout(args, rollout_id, *, evaluation=False) -> RolloutFnTrainOutput | RolloutFnEvalOutput
+def generate_rollout(args, rollout_id, data_source, evaluation=False) -> RolloutFnTrainOutput | RolloutFnEvalOutput
 ```
 
 **Use Cases**:
@@ -140,7 +140,7 @@ class DynamicFilterOutput:
 
 **Signature**:
 ```python
-def buffer_filter(samples: list[list[Sample]]) -> list[list[Sample]]
+def buffer_filter(args, rollout_id, buffer: list[list[Sample]], num_samples: int) -> list[list[Sample]]
 ```
 
 **Use Cases**:
@@ -177,7 +177,7 @@ def filter_function(args, samples: list[Sample]) -> None
 
 **Signature**:
 ```python
-def process_function(args, samples: list[list[Sample]]) -> None
+def process_function(args, samples: list[list[Sample]], data_source) -> None
 ```
 
 **Use Cases**:
 
@@ -359,8 +359,12 @@ The filtering function `check_reward_nonzero_std` in the example will check whet
 
 ```python
 def check_reward_nonzero_std(args, samples: list[Sample], **kwargs):
-    rewards = [sample.reward for sample in samples]
-    return torch.tensor(rewards, dtype=torch.float).std() > 0.0
+    rewards = [sample.get_reward_value(args) for sample in samples]
+    keep = torch.tensor(rewards, dtype=torch.float).std() > 0.0
+    return DynamicFilterOutput(
+        keep=keep,
+        reason=None if keep else f"zero_std_{round(rewards[0], 1)}",
+    )
 ```
 
 If the filtering function is very strict, causing a large number of prompt groups to be discarded, the system will monitor the number of pending tasks in `remaining_batch_size`. Once the number of pending tasks drops below the target number (32) due to too many being discarded, the system will automatically trigger a new round of oversampling, requesting `over_sampling_batch_size` (64) new prompts again to repeat the above process.
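
The new return contract of the filter can be tried out without slime or torch installed. The sketch below is illustrative only: `Sample` and `DynamicFilterOutput` are minimal stand-ins that carry just the fields used above (the real classes live in slime and have more fields), the reward is read from a plain `reward` attribute instead of `get_reward_value(args)`, and `statistics.pstdev` replaces the torch call.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Optional


@dataclass
class Sample:
    # Stand-in for slime's Sample; only the reward is modelled here.
    reward: float


@dataclass
class DynamicFilterOutput:
    # Stand-in mirroring the `keep` / `reason` fields used in the diff above.
    keep: bool
    reason: Optional[str] = None


def check_reward_nonzero_std(args, samples, **kwargs):
    """Keep a prompt group only if its rewards are not all identical."""
    rewards = [sample.reward for sample in samples]
    keep = pstdev(rewards) > 0.0
    return DynamicFilterOutput(
        keep=keep,
        # The reason string records which constant reward caused the drop.
        reason=None if keep else f"zero_std_{round(rewards[0], 1)}",
    )


mixed = check_reward_nonzero_std(None, [Sample(1.0), Sample(0.0)])
uniform = check_reward_nonzero_std(None, [Sample(1.0), Sample(1.0)])
print(mixed)
print(uniform)
```

A group with mixed rewards is kept with no reason attached, while a group whose rewards are all equal is dropped and tagged with a reason string.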
 
@@ -186,7 +186,8 @@ Additionally, we provide a `metadata_key`, which defaults to `"metadata"`. When
     - `gspo` ([https://arxiv.org/abs/2507.18071](https://arxiv.org/abs/2507.18071))
     - `reinforce_plus_plus` and `reinforce_plus_plus_baseline` ([https://arxiv.org/abs/2501.03262](https://arxiv.org/abs/2501.03262))
     - `ppo` ([https://arxiv.org/abs/1707.06347](https://arxiv.org/abs/1707.06347))
-    - `on_policy_distillation`
+
+  Note: On-policy distillation (OPD) is now orthogonal to the advantage estimator. Use `--use-opd` and `--opd-kl-coef` to enable OPD on top of any estimator.
 - `--calculate-per-token-loss`: By default, slime calculates loss on a per-sample basis, i.e., `mean(sum(sample_i) / len(sample_i))`. Enable this flag to calculate loss on a per-token basis, i.e., `sum(sum(sample_i)) / sum(len(sample_i))`.
 - `--use-tis`: Enable this setting to use TIS (Truncated Importance Sampling) (https://fengyao.notion.site/off-policy-rl).
 - `--true-on-policy-mode`: Enable True On-Policy mode, which strictly ensures that data is generated by the current policy during training.
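
The two reductions described for `--calculate-per-token-loss` differ whenever responses have different lengths. The sketch below works through both formulas on plain Python lists; the function names and the toy loss values are made up for illustration and are not part of slime.

```python
def per_sample_loss(token_losses):
    """mean(sum(sample_i) / len(sample_i)): every sample weighs the same."""
    per_sample_means = [sum(sample) / len(sample) for sample in token_losses]
    return sum(per_sample_means) / len(per_sample_means)


def per_token_loss(token_losses):
    """sum(sum(sample_i)) / sum(len(sample_i)): every token weighs the same."""
    total_loss = sum(sum(sample) for sample in token_losses)
    total_tokens = sum(len(sample) for sample in token_losses)
    return total_loss / total_tokens


# One long response with low per-token loss, one short response with high loss.
token_losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]

sample_level = per_sample_loss(token_losses)  # (1.0 + 3.0) / 2
token_level = per_token_loss(token_losses)    # (4.0 + 3.0) / 5
print(sample_level, token_level)
```

With per-sample averaging the short response contributes half of the final value; with per-token averaging it contributes one token out of five, so long responses dominate.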
@@ -266,19 +267,19 @@ slime supports customizing data generation (rollout) to various degrees.
   - You can completely replace the `generate_rollout` in sglang\_example.py by using the `--rollout-function-path` parameter. You just need to ensure that the function signature passed via `--rollout-function-path` is as follows:
 
     ```python
-    def generate_rollout(args, rollout_id, data_buffer, evaluation=False) -> list[list[Sample]]:
+    def generate_rollout(args, rollout_id, data_source, evaluation=False) -> RolloutFnTrainOutput | RolloutFnEvalOutput:
         """
         Args:
             args: the whole args
             rollout_id: int, the id of the rollout, used for deterministic data generation
-            data_buffer: the data buffer to store the generated samples
+            data_source: the data source to get and store samples
             evaluation: bool, whether the rollout is for evaluation or not
         
         Returns:
-            list[list[Sample]]: a list of samples generated by the rollout
+            RolloutFnTrainOutput | RolloutFnEvalOutput: the output of the rollout
         """
-            ...
-            return samples
+        ...
+        return output
     ```
 
     Where:
@@ -287,7 +288,7 @@ slime supports customizing data generation (rollout) to various degrees.
 
       - `rollout_id`: The ID of the current data generation round, used to ensure data order when resuming training.
 
-      - `data_buffer`: A globally unique data buffer in slime, which can be used to get initial prompts, data IDs, and store partially generated samples for later use.
+      - `data_source`: A globally unique data source in slime, which can be used to get initial prompts, data IDs, and store partially generated samples for later use.
 
       - `evaluation`: A boolean indicating if the rollout is for evaluation. You can configure a separate evaluation function using `--eval-function-path`.
 
@@ -296,10 +297,7 @@ slime supports customizing data generation (rollout) to various degrees.
           - `tokens`: The tokens for the prompt + response.
           - `response_length`: The total length of the response. For multi-turn tasks, this is the length of the tokens remaining after the first-turn prompt.
           - `reward`: The reward for this data sample.
-          - `truncated`: Whether this data sample was truncated, similar to `finish_reason == length` in SGLang.
-
-        And if there are scenarios like tool calls or multi-turn usage, ensure the `loss_mask` is correct:
-
+          - `status`: The status of this data sample (e.g., `Sample.Status.COMPLETED`, `Sample.Status.TRUNCATED`, `Sample.Status.ABORTED`, `Sample.Status.FAILED`).
           - `loss_mask` should be the same length as `response_length`, with `1` for tokens that should be included in the loss calculation and `0` for those that should be masked out.
 
   - In some cases, you may only need to replace the data generation logic. You can do this using `--custom-generate-function-path`. A simplified implementation of this function is as follows:
@@ -325,9 +323,14 @@ slime supports customizing data generation (rollout) to various degrees.
         # set sample
         sample.tokens = prompt_tokens_ids + response_token_ids
         sample.response_length = len(response_token_ids)
-        sample.truncated = output["meta_info"]["finish_reason"]["type"] == "length"
+        finish_reason = output["meta_info"]["finish_reason"]["type"]
+        if finish_reason == "length":
+            sample.status = Sample.Status.TRUNCATED
+        elif finish_reason == "abort":
+            sample.status = Sample.Status.ABORTED
+        else:
+            sample.status = Sample.Status.COMPLETED
         sample.response = output["text"]
-        sample.aborted = output["meta_info"]["finish_reason"]["type"] == "abort"
 
         return sample
     ```
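
The branch that replaces the old `truncated` and `aborted` booleans can be factored into a small helper. The sketch below is a standalone illustration: `Status` is a stand-in enum using the member names listed for `Sample.Status` above, and `status_from_finish_reason` is a hypothetical helper name, not a slime function.

```python
from enum import Enum


class Status(Enum):
    # Stand-in for Sample.Status; member names follow the list above.
    COMPLETED = "completed"
    TRUNCATED = "truncated"
    ABORTED = "aborted"
    FAILED = "failed"


def status_from_finish_reason(finish_reason: str) -> Status:
    """Map an SGLang-style finish reason type to a sample status."""
    if finish_reason == "length":
        return Status.TRUNCATED
    if finish_reason == "abort":
        return Status.ABORTED
    # Any other finish reason (for example "stop") counts as a normal finish.
    return Status.COMPLETED


for reason in ("length", "abort", "stop"):
    print(reason, status_from_finish_reason(reason))
```

A single status field makes the states mutually exclusive, which two independent booleans could not guarantee.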
 
@@ -574,7 +574,7 @@ <h3>1. Rollout Function (<code class="docutils literal notranslate"><span class=
 <p><strong>Default</strong>: <code class="docutils literal notranslate"><span class="pre">slime.rollout.sglang_rollout.generate_rollout</span></code></p>
 <p><strong>Purpose</strong>: Override the entire rollout generation logic.</p>
 <p><strong>Signature</strong>:</p>
-<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">async</span> <span class="k">def</span><span class="w"> </span><span class="nf">generate_rollout</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">rollout_id</span><span class="p">,</span> <span class="o">*</span><span class="p">,</span> <span class="n">evaluation</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">RolloutFnTrainOutput</span> <span class="o">|</span> <span class="n">RolloutFnEvalOutput</span>
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">generate_rollout</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">rollout_id</span><span class="p">,</span> <span class="n">data_source</span><span class="p">,</span> <span class="n">evaluation</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">RolloutFnTrainOutput</span> <span class="o">|</span> <span class="n">RolloutFnEvalOutput</span>
 </pre></div>
 </div>
 <p><strong>Use Cases</strong>:</p>
@@ -662,7 +662,7 @@ <h3>5. Buffer Filter (<code class="docutils literal notranslate"><span class="pr
 <p><strong>Default</strong>: <code class="docutils literal notranslate"><span class="pre">None</span></code></p>
 <p><strong>Purpose</strong>: Filter samples in the rollout buffer before training.</p>
 <p><strong>Signature</strong>:</p>
-<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">buffer_filter</span><span class="p">(</span><span class="n">samples</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">]]</span>
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">buffer_filter</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">rollout_id</span><span class="p">,</span> <span class="n">buffer</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">]],</span> <span class="n">num_samples</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">]]</span>
 </pre></div>
 </div>
 <p><strong>Use Cases</strong>:</p>
@@ -694,7 +694,7 @@ <h3>7. Rollout All Samples Process (<code class="docutils literal notranslate"><
 <p><strong>Default</strong>: <code class="docutils literal notranslate"><span class="pre">None</span></code></p>
 <p><strong>Purpose</strong>: Process all samples (including filtered ones) after rollout.</p>
 <p><strong>Signature</strong>:</p>
-<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">process_function</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">samples</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">]])</span> <span class="o">-&gt;</span> <span class="kc">None</span>
+<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">process_function</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">samples</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">]],</span> <span class="n">data_source</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="kc">None</span>
 </pre></div>
 </div>
 <p><strong>Use Cases</strong>:</p>
 
@@ -847,8 +847,12 @@ <h3>Dynamic Sampling<a class="headerlink" href="#dynamic-sampling" title="Link t
 <p>Then each sampling will directly sample 64 prompts, and each prompt will be sampled 8 times. Because slime performs asynchronous sampling internally, we will successively obtain 8 responses for each prompt. When receiving responses, the function corresponding to <code class="docutils literal notranslate"><span class="pre">dynamic_sampling_filter_path</span></code> will be used for filtering. If it passes, these 8 pieces of data will be kept; otherwise, they will be discarded.</p>
 <p>The filtering function <code class="docutils literal notranslate"><span class="pre">check_reward_nonzero_std</span></code> in the example will check whether the standard deviation of rewards for a group of samples is greater than zero, ensuring that the reward scores of each group of samples left have differences, thereby avoiding overly homogeneous data and improving data diversity.</p>
 <div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span><span class="w"> </span><span class="nf">check_reward_nonzero_std</span><span class="p">(</span><span class="n">args</span><span class="p">,</span> <span class="n">samples</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">Sample</span><span class="p">],</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
-    <span class="n">rewards</span> <span class="o">=</span> <span class="p">[</span><span class="n">sample</span><span class="o">.</span><span class="n">reward</span> <span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]</span>
-    <span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">rewards</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float</span><span class="p">)</span><span class="o">.</span><span class="n">std</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mf">0.0</span>
+    <span class="n">rewards</span> <span class="o">=</span> <span class="p">[</span><span class="n">sample</span><span class="o">.</span><span class="n">get_reward_value</span><span class="p">(</span><span class="n">args</span><span class="p">)</span> <span class="k">for</span> <span class="n">sample</span> <span class="ow">in</span> <span class="n">samples</span><span class="p">]</span>
+    <span class="n">keep</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">rewards</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="o">.</span><span class="n">float</span><span class="p">)</span><span class="o">.</span><span class="n">std</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mf">0.0</span>
+    <span class="k">return</span> <span class="n">DynamicFilterOutput</span><span class="p">(</span>
+        <span class="n">keep</span><span class="o">=</span><span class="n">keep</span><span class="p">,</span>
+        <span class="n">reason</span><span class="o">=</span><span class="kc">None</span> <span class="k">if</span> <span class="n">keep</span> <span class="k">else</span> <span class="sa">f</span><span class="s2">&quot;zero_std_</span><span class="si">{</span><span class="nb">round</span><span class="p">(</span><span class="n">rewards</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="w"> </span><span class="mi">1</span><span class="p">)</span><span class="si">}</span><span class="s2">&quot;</span><span class="p">,</span>
+    <span class="p">)</span>
 </pre></div>
 </div>
 <p>If the filtering function is very strict, causing a large number of prompt groups to be discarded, the system will monitor the number of pending tasks in <code class="docutils literal notranslate"><span class="pre">remaining_batch_size</span></code>. Once the number of pending tasks drops below the target number (32) due to too many being discarded, the system will automatically trigger a new round of oversampling, requesting <code class="docutils literal notranslate"><span class="pre">over_sampling_batch_size</span></code> (64) new prompts again to repeat the above process.</p>