SDPG: Self-Distilled Policy Gradient

On-policy self-distillation, in which a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. It can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler (KL) divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines standard-deviation-normalized group-relative verifier advantages, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines.

This repository implements the paper "Self-Distilled Policy Gradient" and related privileged-context training methods on top of the verl RLHF framework.

Authors: Yifeng Liu*, Shiyuan Zhang*, Yifan Zhang*, Quanquan Gu

[Webpage] [Huggingface]

Methods

Method	Loss	Teacher	Ref model
GRPO	DAPO dual-clip PPO	None	Optional
SDPG	DAPO clip + full-vocab KL distillation + α-reg	Current π_θ(·|c,x)	Yes (frozen, for α term)
OPSD	PPO-REINFORCE with per-token weight	Frozen π_ref(·|c,x)	Yes
RLSD	DAPO clip with evidence-reweighted advantage	Current π_θ(·|c,x)	No

SDPG is the main contribution. It extends GRPO with an exact per-token, full-vocabulary reverse KL (student-to-teacher) between the actor without privileged context and the same actor conditioned on privileged context c:

$$\mathcal{L} = \underbrace{\ell^{\text{clip}}}_{\text{GRPO}} + \beta \cdot \underbrace{D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_\theta(\cdot \mid c,x)\big)}_{\text{self-distillation}} + \alpha \cdot \underbrace{f(\pi_\theta, \pi_{\text{ref}})}_{\text{KL reg}}$$

The KL is computed on the fly inside update_policy, so no separate teacher log-prob pre-computation is needed. The α-regularization supports four modes (fkl, rkl, ufkl, urkl).
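
To make the objective concrete, here is a minimal pure-Python sketch of the self-distillation term at a single token position and of how the three terms combine. It is an illustration only, not the code in `compute_sdpg_loss`: the function names (`log_softmax`, `token_kl`, `sdpg_loss`) are hypothetical, and averaging the per-token terms is an assumption about loss aggregation that this README does not specify.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(student_logits, teacher_logits):
    """Exact KL(student || teacher) at one position, summed over the full vocabulary."""
    log_p = log_softmax(student_logits)  # pi_theta(. | x), actor without context
    log_q = log_softmax(teacher_logits)  # pi_theta(. | c, x), actor with context
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sdpg_loss(clip_loss, kl_per_token, reg_per_token, beta, alpha):
    """Combine the clipped PG loss, the distillation KL, and the reference regularizer.

    Assumption: per-token terms are averaged with a plain mean.
    """
    mean_kl = sum(kl_per_token) / len(kl_per_token)
    mean_reg = sum(reg_per_token) / len(reg_per_token)
    return clip_loss + beta * mean_kl + alpha * mean_reg
```

The KL is zero when the two views agree on the next-token distribution and grows as the privileged context shifts the teacher's distribution away from the student's.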

Data Format

All methods that use privileged context share the same data format. The first message content encodes both the actor question and teacher context, separated by a special token:

prompt[0].content = "<actor question>[TEACHER_CONTEXT_TOKEN]<teacher context>"

At training time rl_dataset.py splits on this sentinel:

Actor receives everything before [TEACHER_CONTEXT_TOKEN] (plain question)
Teacher receives the full content including the privileged context after the token
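
The split described above can be sketched as follows. This is not the code in `rl_dataset.py`: the function name `split_prompt_content` is hypothetical, and dropping the sentinel and joining question and context with a blank line in the teacher view are assumptions made for illustration.

```python
TEACHER_CONTEXT_TOKEN = "[TEACHER_CONTEXT_TOKEN]"

def split_prompt_content(content, joiner="\n\n"):
    """Return (actor_text, teacher_text) for one first-message content string.

    teacher_text is None when the sentinel is absent (e.g. no-teacher data).
    """
    question, sentinel, context = content.partition(TEACHER_CONTEXT_TOKEN)
    if not sentinel:
        # No privileged context present: only the actor view exists.
        return question, None
    # Assumption: the teacher sees the question followed by the context,
    # with the sentinel itself removed.
    return question, question + joiner + context
```

For example, `split_prompt_content("What is 2+2?[TEACHER_CONTEXT_TOKEN]The correct answer to this problem is: 4")` yields the plain question for the actor and the question plus the answer hint for the teacher.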

Two datasets are provided:

File	Use
`math-dapo-noteacher-shuffled-boxed.parquet`	GRPO baseline (no teacher context)
`math-dapo-teacher-shuffled-boxed.parquet`	SDPG / OPSD / RLSD (includes teacher context)

The teacher context format is:

The correct answer to this problem is: {answer}
Use this to verify your reasoning, but show your full solution process.
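
As a small illustration of how a row's first-message content could be assembled from a question and its ground-truth answer, the sketch below fills the template and joins it to the question with the sentinel. The helper name `build_prompt_content` is hypothetical and is not taken from the repository's data-preparation code.

```python
TEACHER_CONTEXT_TOKEN = "[TEACHER_CONTEXT_TOKEN]"

TEACHER_CONTEXT_TEMPLATE = (
    "The correct answer to this problem is: {answer}\n"
    "Use this to verify your reasoning, but show your full solution process."
)

def build_prompt_content(question, answer):
    """Return '<actor question>[TEACHER_CONTEXT_TOKEN]<teacher context>'."""
    teacher_context = TEACHER_CONTEXT_TEMPLATE.format(answer=answer)
    return question + TEACHER_CONTEXT_TOKEN + teacher_context
```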

Requirements

8× A100/H100/H200 GPUs (scripts default to 1 node, 8 GPUs)
verl dependencies installed
Ray cluster running locally (ray start --head --num-gpus=8 --num-cpus=104)
Model: Qwen/Qwen3-4B (or set MODEL_PATH to a local cache)
Data placed under $RAY_DATA_HOME/data/

Reproducing Qwen3-4B Experiments

All scripts are in examples/rpg2_trainer/. Set environment variables to override defaults.

GRPO Baseline

bash examples/rpg2_trainer/run_qwen3_4b_grpo_original_boxed.sh

Key settings: lr=1e-6, n=8, train_batch_size=128, gpu_memory_utilization=0.6.
Uses noteacher data. No ref model, no teacher.
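
With n=8, each prompt yields a group of eight rollouts whose verifier rewards are turned into group-relative advantages. A minimal sketch of that normalization, assuming the population standard deviation and a small `eps` for numerical safety (both are assumptions; the function name is hypothetical and this is not the verl implementation):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Center rewards on the group mean and scale by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population (biased) std
    return [(r - mean) / (std + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no policy-gradient signal, which is one reason dense auxiliary supervision is attractive in sparse-reward settings.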

SDPG

# Default: BETA=0.001, ALPHA=0.001, KL_MODE=urkl
bash examples/rpg2_trainer/run_qwen3_4b_sdpg_boxed.sh

# Custom hyperparameters
BETA=0.001 ALPHA=0.001 KL_MODE=urkl bash examples/rpg2_trainer/run_qwen3_4b_sdpg_boxed.sh

Key settings: lr=1e-6, n=8, train_batch_size=128, gpu_memory_utilization=0.75, entropy_checkpointing=True.
Uses teacher data. Spawns a frozen ref model worker for the α-regularization term.

KL mode options (KL_MODE):

Mode	α term
`fkl`	$\pi_{\text{ref}} / \pi_\theta$
`rkl`	$\tfrac{1}{2}(\log w + 1)^2$
`ufkl`	$\pi_{\text{ref}}/\pi_\theta + \log w$
`urkl`	$\tfrac{1}{2}(\log w)^2$ (default)
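
The table can be read as a per-token lookup. The sketch below evaluates each entry exactly as written; because this section does not define w, the function takes `log_w` and the ratio π_ref/π_θ as separate inputs instead of deriving one from the other. The function name `alpha_term` is hypothetical, and the sketch ignores any stop-gradient or importance-weighting details the real loss may apply.

```python
def alpha_term(mode, log_w, ref_over_theta):
    """Evaluate the per-token alpha term from the KL_MODE table.

    log_w          -- log of the weight w used in the table (definition not given here)
    ref_over_theta -- the probability ratio pi_ref / pi_theta for the sampled token
    """
    if mode == "fkl":
        return ref_over_theta
    if mode == "rkl":
        return 0.5 * (log_w + 1.0) ** 2
    if mode == "ufkl":
        return ref_over_theta + log_w
    if mode == "urkl":
        return 0.5 * log_w ** 2
    raise ValueError(f"unknown KL mode: {mode!r}")
```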

Beta schedule (optional):

BETA_WARMUP_STEPS=50 BETA_DECAY_STEPS=100 bash examples/rpg2_trainer/run_qwen3_4b_sdpg_boxed.sh
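
This README does not state the shape of the schedule. Purely as an illustration, the sketch below assumes a linear warmup from 0 to β over the warmup steps, followed by a linear decay back to 0 over the decay steps; the function name `beta_at_step` is hypothetical, and the actual behaviour should be checked against the beta schedule fields in `PolicyLossConfig`.

```python
def beta_at_step(step, beta, warmup_steps=0, decay_steps=0):
    """Assumed schedule: linear warmup to beta, then linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return beta * step / warmup_steps
    if decay_steps > 0:
        # Fraction of the decay phase completed, clamped to [0, 1].
        progress = min(max(step - warmup_steps, 0) / decay_steps, 1.0)
        return beta * (1.0 - progress)
    return beta
```

Under this assumption, BETA_WARMUP_STEPS=50 and BETA_DECAY_STEPS=100 would give the full β at step 50 and switch the distillation term off from step 150 onward.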

Distillation gating (restricts the β term to responses with positive advantage):

BETA_POSITIVE_ADV_ONLY=True bash examples/rpg2_trainer/run_qwen3_4b_sdpg_boxed.sh
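
Conceptually, gating masks the distillation term for any response whose advantage is not positive. A minimal sketch at response granularity, with a hypothetical function name and no claim about how the repository applies the mask at the token level:

```python
def gate_distillation(kl_per_response, advantages, positive_only=True):
    """Zero out the distillation KL for responses without positive advantage."""
    return [
        kl if (not positive_only or adv > 0) else 0.0
        for kl, adv in zip(kl_per_response, advantages)
    ]
```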

Memory note: SDPG materializes (B, T, V) actor+teacher logits simultaneously during update_policy. If generate_sequences is slow (vLLM KV-cache preemptions), lower gpu_memory_utilization to 0.6.
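
A rough back-of-the-envelope estimate shows why this matters. The sketch below counts only the raw logits tensors; the micro-batch size, sequence length, and fp32 storage are hypothetical values chosen for illustration, and the vocabulary size of 151,936 is an assumption based on the published Qwen3 model configuration.

```python
def logits_bytes(batch, seq_len, vocab, bytes_per_elem=4, copies=2):
    """Bytes needed to hold `copies` logits tensors of shape (B, T, V)."""
    return batch * seq_len * vocab * bytes_per_elem * copies

# Hypothetical micro-batch: 1 sequence of 4096 tokens, fp32, actor + teacher.
total = logits_bytes(batch=1, seq_len=4096, vocab=151_936)
print(f"{total / 2**30:.2f} GiB")  # prints "4.64 GiB"
```

The estimate scales linearly in each of B, T, and V, and it excludes activations and gradients, so the real footprint during update_policy is larger.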

OPSD

bash examples/rpg2_trainer/run_qwen3_4b_opsd_boxed.sh

Key settings: lr=5e-6, n=8, entropy_coeff=0.01, gpu_memory_utilization=0.6.
Uses teacher data. Teacher = frozen π_ref (initial weights, not the current actor).

RLSD

# Default: lambda=0.5, lambda_decay_steps=50, epsilon_w=0.2
bash examples/rpg2_trainer/run_qwen3_4b_rlsd_boxed.sh

RLSD_LAMBDA=0.5 RLSD_LAMBDA_DECAY_STEPS=50 bash examples/rpg2_trainer/run_qwen3_4b_rlsd_boxed.sh

Key settings: lr=1e-6, n=8, gpu_memory_utilization=0.75.
Uses teacher data. No frozen ref model. Teacher signal only reweights advantage magnitude.

Evaluation

All scripts evaluate on three benchmarks every test_freq=10 steps:

Dataset	File
AMC 2023	`amc-23-boxed.parquet`
AIME 2024	`aime-2024-boxed.parquet`
AIME 2025	`aime25-boxed.parquet`

Validation uses n=32 samples per problem at temperature=1.0.
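
This section does not say how the 32 samples per problem are aggregated. One common choice is the mean accuracy over samples, averaged across problems (often written mean@n); the sketch below shows that computation under this assumption, with a hypothetical function name.

```python
def mean_at_n(correct_by_problem):
    """Average per-problem accuracy, where each problem has n sampled answers.

    correct_by_problem -- list of lists of booleans, one inner list per problem
    """
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return sum(per_problem) / len(per_problem)
```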

Key Files

File	Description
`verl/trainer/ppo/core_algos.py`	All loss functions (`compute_sdpg_loss`, GRPO, OPSD, RLSD)
`verl/workers/actor/dp_actor.py`	`update_policy`: dispatches loss modes, runs on-the-fly KL for SDPG
`verl/trainer/ppo/ray_trainer.py`	Training loop: teacher log-prob computation for RLSD/OPSD
`verl/utils/dataset/rl_dataset.py`	`[TEACHER_CONTEXT_TOKEN]` splitting, `teacher_input_ids` tokenization
`verl/trainer/ppo/utils.py`	`need_reference_policy()`: spawns frozen ref worker for SDPG/OPSD
`verl/workers/config/actor.py`	`PolicyLossConfig`: β, α, `kl_mode`, beta schedule fields

Acknowledgements

volcengine/verl (verl: Volcano Engine Reinforcement Learning for LLMs), which provides the code base this repository builds on.

Citation

If you use SDPG in your research or application, please consider citing it!

@article{liu2026self,
      title={Self-Distilled Policy Gradient}, 
      author={Liu, Yifeng and Zhang, Shiyuan and Zhang, Yifan and Gu, Quanquan},
      journal={arXiv preprint arXiv:2606.04036},
      year={2026}
}
