
experimental: Self-Distillation Zero#5609

Open
LeonEricsson wants to merge 40 commits into huggingface:main from LeonEricsson:feature/sd-zero

Conversation

Collaborator

LeonEricsson commented Apr 20, 2026

Implements *Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision* on top of #5573.

SD-Zero is composed of two stages (stage 1 is sketched after the list):

  1. Self-Revision Training (SRT):
    We sample model responses, evaluate correctness, and prompt the model to revise incorrect outputs. Only traces where the revision succeeds are retained, and the model is fine-tuned on this filtered dataset.

  2. Self-Distillation:
    The reviser is used as a teacher to provide token-level supervision over the generator’s responses, effectively converting outcome-level (binary) rewards into dense token-level supervision.
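
A minimal sketch of the stage-1 loop, assuming hypothetical `generate`, `verify`, and `build_revision_prompt` helpers; the real pipeline lives in `srt_collect.py` and may differ:

```python
# Hypothetical sketch of SRT data collection; generate(), verify(), and
# build_revision_prompt() are placeholders, not the actual srt_collect.py API.
def collect_srt_dataset(model, tokenizer, prompts, answers):
    retained = []
    for prompt, answer in zip(prompts, answers):
        response = generate(model, tokenizer, prompt)
        if verify(response, answer):
            continue  # already correct, nothing to revise
        # Prompt the model to revise its own incorrect output.
        revision = generate(model, tokenizer, build_revision_prompt(prompt, response))
        if verify(revision, answer):
            # Retain only traces where the self-revision succeeds.
            retained.append({"prompt": prompt, "completion": revision})
    return retained  # the model is then fine-tuned (SFT) on this filtered set
```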

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

Medium Risk
Adds new experimental trainers/scripts and significantly refactors SDFTTrainer/SDPOTrainer configuration and loss/reward plumbing; mistakes could change training behavior or silently degrade optimization, although the scope is limited to experimental modules.

Overview
Adds experimental Self-Distillation Zero (SD-Zero) support by introducing SRTTrainer (phase-1 self-revision supervised fine-tuning), SDZeroTrainer (phase-2 on-policy self-distillation with a binary verifier), plus runnable scripts for both phases and an offline dataset collection pipeline (srt_collect.py). Documentation is updated to describe SD-Zero and expected dataset schemas.
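
To make the phase-2 idea concrete, a rough sketch (not SDZeroTrainer's actual loss code) of dense token-level supervision: a per-token forward KL from the frozen reviser-teacher to the student, computed over the generator's sampled completions:

```python
import torch.nn.functional as F

# Rough sketch of dense token-level distillation, not SDZeroTrainer's code:
# forward KL from a frozen teacher (the trained reviser) to the student,
# averaged over completion tokens only.
def token_level_distillation_loss(student_logits, teacher_logits, completion_mask):
    # logits: (batch, seq, vocab); completion_mask: (batch, seq), 1 on completion tokens
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    per_token_kl = (teacher_logprobs.exp() * (teacher_logprobs - student_logprobs)).sum(-1)
    return (per_token_kl * completion_mask).sum() / completion_mask.sum().clamp(min=1)
```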

Refactors the experimental self-distillation stack: SDFTTrainer is rebuilt on BaseSelfDistillationTrainer (new finalize_batch path, updated masking to support num_loss_tokens_to_skip), and SDPO moves to an explicit distillation_mode API (replacing full_logit_distillation) while expanding SDPOConfig with GRPO-style policy settings (beta, epsilon*, reward scaling/weights, token vs sequence importance sampling) and adding a new policy_only mode.
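
As a usage sketch of the new configuration surface; field names are inferred from this PR's description and the reviewed diff and may differ in the final API:

```python
from trl.experimental.sdpo import SDPOConfig  # experimental module; import path may differ

config = SDPOConfig(
    output_dir="sdpo-sdzero",
    distillation_mode="policy_only",  # explicit mode API replacing full_logit_distillation
    beta=0.04,                        # reference-model KL coefficient (see review thread below)
    epsilon=0.2,                      # lower clipping bound
    epsilon_high=0.28,                # asymmetric upper clipping bound
    scale_rewards=True,               # GRPO-style reward scaling
)
```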

Improves coverage and examples: adds a dedicated BaseSelfDistillationTrainer test suite, updates the SDPO/SDFT docs and tests to the new config knobs and prompt-handling callbacks, and tweaks the GLM-4-MoE training chat-template formatting without behavior changes.

Reviewed by Cursor Bugbot for commit 3f93a8b.

LeonEricsson marked this pull request as ready for review April 22, 2026 19:46
Comment thread: trl/experimental/sdft/sdft_trainer.py
self.scale_rewards = args.scale_rewards
self.epsilon_low = args.epsilon
self.epsilon_high = args.epsilon_high
self.beta = args.beta

Unused beta parameter stored but never applied

Medium Severity

SDPOConfig declares a beta parameter documented as "Reference-model KL coefficient for online policy optimization," and SDPOTrainer.__init__ stores it as self.beta. However, _compute_policy_loss never uses self.beta — there is no reference-model KL penalty term in the loss. A user setting beta > 0 would expect a KL regularization effect but get none, leading to silently incorrect training behavior.
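
For reference, the kind of term a user setting beta > 0 would expect; a sketch with illustrative variable names (the k3 KL estimator common in GRPO-style trainers), not SDPOTrainer's internals:

```python
import torch

# Sketch of the missing reference-model KL penalty; variable names are
# illustrative, not SDPOTrainer's internals.
def add_kl_penalty(per_token_loss, logprobs, ref_logprobs, beta):
    log_ratio = ref_logprobs - logprobs
    per_token_kl = torch.exp(log_ratio) - log_ratio - 1  # k3 estimate of KL(pi_theta || pi_ref)
    return per_token_loss + beta * per_token_kl
```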

Additional Locations (1)

Reviewed by Cursor Bugbot for commit ab20ace.

…l-self-distillation

# Conflicts:
#	trl/experimental/sdft/sdft_trainer.py
#	trl/experimental/sdpo/sdpo_trainer.py
#	trl/experimental/self_distillation/base_self_distillation_trainer.py
#	trl/experimental/self_distillation/online_rollout_mixin.py
#	trl/experimental/self_distillation/teacher_context.py

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 unresolved issues in total (including 1 from a previous review).



Reviewed by Cursor Bugbot for commit 3f93a8b.

teacher_prompt_ids = [torch.tensor(ids) for ids in teacher_prompt_ids_list]
teacher_prompt_mask = [torch.ones_like(ids, dtype=torch.long) for ids in teacher_prompt_ids]
teacher_prompt_ids = pad(teacher_prompt_ids, padding_value=self.pad_token_id, padding_side="left").to(
    device=device
)

Missing pad_token_id attribute causes AttributeError

High Severity

SDZeroTrainer.finalize_batch references self.pad_token_id, but this attribute is never defined on the trainer or any of its parent classes. The base class BaseSelfDistillationTrainer consistently uses self._tokenizer.pad_token_id for padding operations. This will raise an AttributeError at runtime when finalize_batch is called during training.
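
The straightforward fix, mirroring the excerpted call above but following the base class convention (a sketch):

```python
# Sketch of the fix: use the tokenizer's pad token id, matching the base class.
teacher_prompt_ids = pad(
    teacher_prompt_ids, padding_value=self._tokenizer.pad_token_id, padding_side="left"
).to(device=device)
```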


Reviewed by Cursor Bugbot for commit 3f93a8b.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

