On-policy distillation (OPD) enables a student model to learn from a larger teacher model by training on its own rollouts while matching the teacher's token-level log-probabilities. OPD is orthogonal to advantage estimators — it works as an additive KL penalty on top of any estimator (GRPO, PPO, REINFORCE++, etc.).
## Key Arguments

| Argument | Description |
|----------|-------------|
| `--use-opd` | Enable on-policy distillation. Required flag to use OPD. |
| `--opd-type` | Type of OPD: `sglang` or `megatron`. Required when `--use-opd` is set. |
| `--opd-kl-coef` | OPD KL penalty coefficient (default: 1.0). Controls the weight of the distillation signal relative to the RL advantage. |
| `--opd-teacher-load` | Path to teacher Megatron checkpoint. **Required** when `--opd-type=megatron`, **must not be set** when `--opd-type=sglang`. |
| `--opd-teacher-ckpt-step` | Optional checkpoint step for teacher model. |
## How It Works
OPD modifies the advantage computation by subtracting a KL penalty term that encourages the student to match the teacher's output distribution:

$$A_t^{\text{OPD}} = A_t - \lambda_{\text{opd}} \, D_{\text{KL}}\left(\pi_{\theta} \,\|\, \pi_{\text{teacher}}\right)$$

where $A_t$ is the original advantage from the base estimator (e.g., GRPO), $\lambda_{\text{opd}}$ is `--opd-kl-coef`, and $D_{\text{KL}}$ is the token-level reverse KL divergence between the student policy $\pi_{\theta}$ and the teacher.
This means OPD can be combined with any advantage estimator, including GRPO, PPO, REINFORCE++, and GSPO.
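As a concrete illustration, the penalty can be applied to any estimator's advantages with a few lines of tensor code. This is a minimal sketch (the function name and signature are illustrative, not slime's actual API), using the single-sample estimator $\log \pi_{\theta} - \log \pi_{\text{teacher}}$ at each sampled token as the per-token reverse KL:

```python
import torch

def apply_opd_penalty(advantages, student_log_probs, teacher_log_probs, opd_kl_coef=1.0):
    """Subtract a token-level reverse-KL penalty from the base advantages.

    Uses the per-token single-sample KL estimate
    log pi_student(token) - log pi_teacher(token), evaluated at the tokens
    actually sampled in the rollout. `opd_kl_coef` plays the role of
    --opd-kl-coef; `advantages` come from any base estimator (GRPO, PPO, ...).
    """
    kl = student_log_probs - teacher_log_probs  # reverse KL estimate per token
    return advantages - opd_kl_coef * kl
```

Because the penalty is a simple per-token subtraction, it composes with whatever advantage values the base estimator produced.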
## Two Teacher Modes
### SGLang Mode (`--opd-type sglang`)

The teacher runs on an external SGLang server. Teacher log-probs are obtained during the rollout phase.

**When to use**: The teacher has a different architecture from the student, or the teacher is too large to load alongside the training model.
**How it works**:

1. An external SGLang server runs the teacher model.
2. During rollout, the custom reward function (`slime.rollout.on_policy_distillation.reward_func`) sends each sample to the teacher server to obtain token-level log-probs.
3. The custom post-processing function (`slime.rollout.on_policy_distillation.post_process_rewards`) trims the teacher log-probs to the response span and stores them in `sample.teacher_log_probs`.
4. During training, the KL penalty is computed from the stored teacher log-probs and applied to advantages.
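The trimming in step 3 amounts to slicing the log-probs for the full prompt+response sequence down to just the response tokens. A minimal sketch of that operation (the helper name and arguments are illustrative, not slime's actual implementation):

```python
def trim_teacher_log_probs(teacher_log_probs, prompt_len, response_len):
    """Keep only the log-probs for the response span of the sequence.

    `teacher_log_probs` covers the whole prompt+response sequence returned
    by the teacher server; the KL penalty only needs the response tokens,
    so everything before `prompt_len` (and any padding after the response)
    is dropped.
    """
    return teacher_log_probs[prompt_len : prompt_len + response_len]
```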
### Megatron Mode (`--opd-type megatron`)

The teacher model is loaded directly into Megatron via `--opd-teacher-load`. Teacher log-probs are computed during the training forward pass.
**When to use**: The teacher has the same architecture as the student/reference model and fits in GPU memory.

**How it works**:

1. The teacher model is loaded as an additional Megatron model during initialization.
2. During the training forward pass, the teacher model computes log-probs for each sample.
3. The KL penalty is computed inline and applied to advantages.
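Step 2 reduces to gathering the log-probabilities of the student's sampled tokens from the teacher's output logits. A hedged sketch of that computation (illustrative code, not the actual Megatron forward path):

```python
import torch
import torch.nn.functional as F

def teacher_token_log_probs(teacher_logits, token_ids):
    """Per-token log-probs of the sampled tokens under the teacher.

    teacher_logits: [batch, seq, vocab] raw logits from the teacher forward pass
    token_ids:      [batch, seq] tokens sampled by the student during rollout
    """
    log_probs = F.log_softmax(teacher_logits, dim=-1)
    # Pick out the log-prob of each sampled token along the vocab dimension.
    return log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
```

These per-token teacher log-probs are then compared against the student's own log-probs to form the inline KL penalty.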
**Configuration**:

```bash
--use-opd
--opd-type megatron
--opd-kl-coef 1.0
--opd-teacher-load /path/to/teacher_torch_dist
```
> **Note**: The teacher checkpoint must be in Megatron format (`torch_dist` or `torch`). You can convert from HuggingFace format using `tools/convert_hf_to_torch_dist.py`.
## Running the Examples
Complete example scripts are provided in `examples/on_policy_distillation/`:
Using a Qwen3-8B-Base model SFT-ed on part of the [OpenThoughts3-1.2M](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M) dataset, on-policy distillation with a Qwen3-32B teacher on the remaining data yields: