[on-policy distillation] support and related data handling #673

yitianlian merged 5 commits into THUDM:main
Conversation
```shell
#### clear after training
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
```
I think this part could be deleted
```diff
 for key, val in rollout_data.items():
-    if key == "tokens" or key == "loss_masks" or key == "sample_indices":
+    if key == "tokens" or key == "loss_masks" or key == "sample_indices" or key == "teacher_token_ids" or key == "rewards":
```
Why do we exclude `key == "rewards"`?
I'll add the reward back. However, I'm concerned that the reward here is meaningless; it's just the average of the teacher log probabilities.
```python
teacher_log_probs = [t_log_prob.to(device=device) for t_log_prob in teacher_log_probs]
teacher_log_probs = [t_log_prob[-response_length:] for t_log_prob, response_length in zip(teacher_log_probs, response_lengths)]
advantages = [teacher_log_prob - student_log_prob for teacher_log_prob, student_log_prob in zip(teacher_log_probs, student_log_probs)]
```
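For context, here is a minimal self-contained sketch of the per-token advantage computed above (the tensor values and lengths are made up for illustration; this is not the slime implementation):

```python
import torch

# Hypothetical per-sample log probs over response tokens (variable lengths).
student_log_probs = [torch.tensor([-0.5, -1.2, -0.3]), torch.tensor([-0.9, -0.1])]
# Teacher log probs scored over the full sequence (prompt + response);
# only the response tail is relevant for distillation.
teacher_log_probs = [torch.tensor([-2.0, -0.4, -1.0, -0.2]), torch.tensor([-1.5, -0.8, -0.05])]
response_lengths = [3, 2]

# Keep only the response portion of each teacher sequence.
teacher_log_probs = [t[-n:] for t, n in zip(teacher_log_probs, response_lengths)]
# Per-token advantage: teacher log prob minus student log prob for the
# tokens the student actually sampled.
advantages = [t - s for t, s in zip(teacher_log_probs, student_log_probs)]
```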
You should do the slice for the teacher log probs when creating samples.
Remove the teacher token IDs.
Will do. The teacher token IDs are for debugging.
```python
def post_process_rewards(args, samples: list[Sample], **kwargs):
    rewards = [sample.get_reward_value(args) for sample in samples]
    teacher_log_probs = [
        torch.tensor(
            [item[0] for item in reward["meta_info"]["input_token_logprobs"][1:]],
            dtype=torch.float32,
        )
        for reward in rewards
    ]
    teacher_token_ids = [
        torch.tensor(
            [item[1] for item in reward["meta_info"]["input_token_logprobs"][1:]],
            dtype=torch.int32,
        )
        for reward in rewards
    ]
```
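A hedged sketch of this parsing with the response-length slice the reviewer suggests below; the reward payload here is mock data shaped like an sglang response with `return_logprob=True`, not a live server response:

```python
import torch

# Mock payload: each input_token_logprobs entry is a (logprob, token_id)
# pair, and the first entry has no logprob (no preceding context), hence
# the [1:] skip in the snippet above.
reward = {
    "meta_info": {
        "input_token_logprobs": [
            (None, 101),
            (-0.7, 42),
            (-1.3, 7),
            (-0.2, 99),
        ]
    }
}
response_length = 2  # only the response tokens feed the distillation loss

items = reward["meta_info"]["input_token_logprobs"][1:]
teacher_log_probs = torch.tensor([lp for lp, _ in items], dtype=torch.float32)
# Slice to the response tail here, at sample-creation time, rather than
# later in the loss computation.
teacher_log_probs = teacher_log_probs[-response_length:]
```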
Can we do `[-response_length:]` here?
Yes, we can and will do.
```python
for sample, t_log_probs, t_token_ids in zip(samples, teacher_log_probs, teacher_token_ids):
    sample.teacher_log_probs = t_log_probs
    sample.teacher_token_ids = t_token_ids
```
I believe we don’t need teacher token IDs. We should maintain only one token ID list per sample.
```shell
echo "Starting teacher model server..."

# Wait for the server to be ready
until curl -sf http://127.0.0.1:$TEACHER_PORT/health_generate > /dev/null; do
```
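A generic, standalone version of this wait loop; the `wait_for` helper name and the retry budget are illustrative, not part of the PR:

```shell
#!/usr/bin/env bash
# Poll a command until it succeeds or the retry budget is exhausted.
wait_for() {
  local retries=$1
  shift
  local i
  for ((i = 0; i < retries; i++)); do
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Example usage (hypothetical port):
# wait_for 60 curl -sf "http://127.0.0.1:13141/health_generate"
```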
The IP should use the master address or 0.0.0.0.
This example only uses one node, so it should be 127.0.0.1 or localhost.
I think it would be better to use 0.0.0.0, as users might directly copy this script and run it in a multi-node setting.
I think 0.0.0.0 won't work; it should be either the local host (127.0.0.1) or a specific remote host (xxx.xxx.xxx.xx). In this case, the teacher IP is the local host (127.0.0.1).
I'll set it like this:

```shell
teacher_IP="127.0.0.1"  # set to your teacher server's IP
teacher_port="13141"
```
slime/ray/rollout.py (outdated)
```python
if "teacher_token_ids" in samples[0].__dict__:
    train_data["teacher_token_ids"] = [sample.teacher_token_ids for sample in samples]
```
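The same `__dict__` check generalizes to any optional per-sample attribute attached by post-processing; a minimal sketch (the `Sample` class here is a stand-in, not slime's):

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    # Stand-in for slime's Sample; post-processing attaches extra
    # attributes dynamically, so they may or may not be present.
    tokens: list = field(default_factory=list)


samples = [Sample(tokens=[1, 2, 3]), Sample(tokens=[4, 5])]
for s in samples:
    s.teacher_token_ids = [t + 100 for t in s.tokens]  # illustrative values

train_data = {"tokens": [s.tokens for s in samples]}
# Forward the optional field only if the first sample carries it,
# mirroring the __dict__ check in the diff above.
if "teacher_token_ids" in samples[0].__dict__:
    train_data["teacher_token_ids"] = [s.teacher_token_ids for s in samples]
```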
```shell
RM_ARGS=(
   --custom-rm-path examples.on_policy_distillation.on_policy_distillation.reward_func
   --custom-reward-post-process-path examples.on_policy_distillation.on_policy_distillation.post_process_rewards
   --rm-url http://127.0.0.1:$TEACHER_PORT/generate
```
Also change here? I think a better script could be:

```shell
teacher_IP="0.0.0.0"
teacher_port="13141"
...
```
Formatted the code to satisfy CI; let me know if anything else is required.
```python
async def reward_func(args, sample, **kwargs):
    payload = {
        "text": sample.prompt + sample.response,
```
@ahxt we should probably use input_ids here, or there might be a discrepancy.
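Following the input_ids suggestion, a hedged sketch of how the scoring payload could be built from token IDs instead of re-tokenized text; the `StudentSample` class and `build_reward_payload` helper are hypothetical, and the field names (`input_ids`, `return_logprob`, `logprob_start_len`) follow sglang's generate API but should be verified against your sglang version:

```python
class StudentSample:  # stand-in for slime's Sample
    def __init__(self, tokens):
        self.tokens = tokens


def build_reward_payload(sample, max_new_tokens: int = 0) -> dict:
    # Send the exact token IDs the student sampled so the teacher scores
    # the same sequence, avoiding a re-tokenization mismatch that can
    # occur when concatenating prompt + response as text.
    return {
        "input_ids": list(sample.tokens),
        "sampling_params": {"max_new_tokens": max_new_tokens},
        "return_logprob": True,
        "logprob_start_len": 0,  # score every input position
    }
```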
PR: On-Policy Distillation Support

This PR introduces On-Policy Distillation to the slime framework, extending its reinforcement learning (RL) pipeline to support teacher–student distillation directly within on-policy training.

Thanks to the modular design of slime, integrating On-Policy Distillation is straightforward. In this PR, the teacher model acts as a reward model (RM) by providing teacher log probabilities as the supervision signal.

1. Add `on_policy_distillation` example folder
   - `examples/on_policy_distillation/on_policy_distillation.py`: implements `reward_func` and `post_process_rewards`
   - `run-qwen3-8B-opd.sh`: example training script with Qwen3-8B as the student model and Qwen3-32B as the teacher model
2. Advantage estimator extension (`loss.py`)
   - `on_policy_distillation` advantage estimator
3. Data pipeline integration (`rollout.py`, `data.py`)
   - `teacher_log_probs`
   - `teacher_token_ids`
4. Teacher model server