experimental: Self-Distillation Zero #5609
LeonEricsson wants to merge 40 commits into huggingface:main
Conversation
…onfig parameters moved to sdpoconfig, + other nits
BaseSelfDistillationTrainer was populating _metrics in _log_self_distillation_metric but had no log() override, so those metrics were never forwarded to the Trainer's logging system. The fix merges _metrics into the log dict, prefixes eval keys, and clears after each logging step.
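For context, a minimal sketch of the kind of `log()` override this fix describes. The `_metrics` layout here is an assumption (a dict keyed by `"train"`/`"eval"` whose values map metric names to lists of floats, mirroring how `GRPOTrainer` buffers its metrics); this is not the PR's actual code:

```python
# Sketch of a log() override that forwards buffered metrics to the Trainer.
# Assumes self._metrics = {"train": {name: [floats]}, "eval": {...}}.
def log(self, logs: dict[str, float], start_time: float | None = None) -> None:
    mode = "train" if self.model.training else "eval"
    # Average the buffered values for each metric.
    metrics = {key: sum(val) / len(val) for key, val in self._metrics[mode].items()}
    if mode == "eval":
        metrics = {f"eval_{key}": val for key, val in metrics.items()}
    # Merge into the Trainer log dict so the metrics reach the logging system.
    super().log({**logs, **metrics}, start_time)
    self._metrics[mode].clear()  # reset after each logging step
```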
Force-pushed from 1c4a8f7 to a110ba8
Force-pushed from a110ba8 to ab20ace
```python
self.scale_rewards = args.scale_rewards
self.epsilon_low = args.epsilon
self.epsilon_high = args.epsilon_high
self.beta = args.beta
```
Unused beta parameter stored but never applied
Medium Severity
SDPOConfig declares a beta parameter documented as "Reference-model KL coefficient for online policy optimization," and SDPOTrainer.__init__ stores it as self.beta. However, _compute_policy_loss never uses self.beta — there is no reference-model KL penalty term in the loss. A user setting beta > 0 would expect a KL regularization effect but get none, leading to silently incorrect training behavior.
Reviewed by Cursor Bugbot for commit ab20ace.
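For reference, a per-token reference-model KL penalty is the kind of term `beta` would typically gate. A hypothetical sketch using the GRPO-style k3 estimator of KL(policy || reference); the function name and call site are assumptions, not the PR's code:

```python
import torch

# Sketch of how self.beta would typically enter _compute_policy_loss:
# k3 estimator of KL(policy || reference), computed per token.
def add_kl_penalty(per_token_loss, per_token_logps, ref_per_token_logps, beta):
    per_token_kl = (
        torch.exp(ref_per_token_logps - per_token_logps)
        - (ref_per_token_logps - per_token_logps)
        - 1
    )
    return per_token_loss + beta * per_token_kl
```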
…l-self-distillation

# Conflicts:
#	trl/experimental/sdft/sdft_trainer.py
#	trl/experimental/sdpo/sdpo_trainer.py
#	trl/experimental/self_distillation/base_self_distillation_trainer.py
#	trl/experimental/self_distillation/online_rollout_mixin.py
#	trl/experimental/self_distillation/teacher_context.py
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 3f93a8b.
```python
teacher_prompt_ids = [torch.tensor(ids) for ids in teacher_prompt_ids_list]
teacher_prompt_mask = [torch.ones_like(ids, dtype=torch.long) for ids in teacher_prompt_ids]
teacher_prompt_ids = pad(teacher_prompt_ids, padding_value=self.pad_token_id, padding_side="left").to(
    device=device
)
```
Missing pad_token_id attribute causes AttributeError
High Severity
SDZeroTrainer.finalize_batch references self.pad_token_id, but this attribute is never defined on the trainer or any of its parent classes. The base class BaseSelfDistillationTrainer consistently uses self._tokenizer.pad_token_id for padding operations. This will raise an AttributeError at runtime when finalize_batch is called during training.
Reviewed by Cursor Bugbot for commit 3f93a8b.
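A sketch of the straightforward fix, consistent with the base class convention the comment cites (surrounding context assumed from the excerpt above):

```python
# Use the tokenizer's pad token id, as BaseSelfDistillationTrainer does
# elsewhere, instead of the undefined self.pad_token_id attribute.
teacher_prompt_ids = pad(
    teacher_prompt_ids,
    padding_value=self._tokenizer.pad_token_id,  # was: self.pad_token_id
    padding_side="left",
).to(device=device)
```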
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


Implements *Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision* on top of #5573.
SD-Zero is composed of two stages:
Self-Revision Training (SRT):
We sample model responses, evaluate correctness, and prompt the model to revise incorrect outputs. Only traces where the revision succeeds are retained, and the model is fine-tuned on this filtered dataset.
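A minimal sketch of this collection loop. All helpers (`sample`, `verify`, `make_revision_prompt`) are illustrative placeholders, not the PR's API:

```python
def collect_srt_dataset(prompts, sample, verify, make_revision_prompt):
    """Illustrative SRT (phase-1) collection loop; all helpers are hypothetical.

    sample(prompt) -> str, verify(prompt, response) -> bool,
    make_revision_prompt(prompt, response) -> str.
    """
    kept = []
    for prompt in prompts:
        response = sample(prompt)                    # initial attempt
        if verify(prompt, response):                 # already correct: nothing to revise
            continue
        revision = sample(make_revision_prompt(prompt, response))
        if verify(prompt, revision):                 # retain only successful revisions
            kept.append({"prompt": prompt, "completion": revision})
    return kept
```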
Self-Distillation:
The reviser is used as a teacher to provide token-level supervision over the generator’s responses, effectively converting outcome-level (binary) rewards into dense token-level supervision.
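The dense supervision amounts to a per-token divergence between the teacher's and student's next-token distributions over the generator's own completion tokens. A hedged sketch of one such loss (a plain per-token KL(teacher || student), not necessarily the PR's exact formulation):

```python
import torch
import torch.nn.functional as F

# Illustrative token-level distillation loss over completion tokens.
# logits: (batch, seq, vocab); completion_mask: (batch, seq), 1 on loss tokens.
def token_level_distill_loss(student_logits, teacher_logits, completion_mask):
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(teacher || student) per token, summed over the vocabulary.
    per_token_kl = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(-1)
    return (per_token_kl * completion_mask).sum() / completion_mask.sum()
```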
Note

Medium Risk

Adds new experimental trainers/scripts and significantly refactors `SDFTTrainer`/`SDPOTrainer` configuration and loss/reward plumbing; mistakes could change training behavior or silently degrade optimization, despite the limited scope to experimental modules.

Overview
Adds experimental Self-Distillation Zero (SD-Zero) support by introducing `SRTTrainer` (phase-1 self-revision supervised fine-tuning) and `SDZeroTrainer` (phase-2 on-policy self-distillation with a binary verifier), plus runnable scripts for both phases and an offline dataset collection pipeline (`srt_collect.py`). Documentation is updated to describe SD-Zero and the expected dataset schemas.

Refactors the experimental self-distillation stack:
`SDFTTrainer` is rebuilt on `BaseSelfDistillationTrainer` (new `finalize_batch` path, updated masking to support `num_loss_tokens_to_skip`), and SDPO moves to an explicit `distillation_mode` API (replacing `full_logit_distillation`) while expanding `SDPOConfig` with GRPO-style policy settings (`beta`, `epsilon*`, reward scaling/weights, token vs. sequence importance sampling) and adding a new `policy_only` mode; a hedged config sketch follows at the end of this note.

Improves coverage and examples: adds a dedicated `BaseSelfDistillationTrainer` test suite, updates SDPO/SDFT docs and tests to the new config knobs and prompt-handling callbacks, and tweaks GLM-4-MoE chat template formatting in the training examples without behavior changes.

Reviewed by Cursor Bugbot for commit 3f93a8b. Bugbot is set up for automated code reviews on this repo.
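As referenced above, a hedged sketch of the refactored `SDPOConfig` knobs named in this summary. The import path follows the `trl/experimental/sdpo` layout seen in the merge conflicts, and all values are assumptions, not defaults from the PR:

```python
# Sketch only: knob names come from this PR's summary and the diff excerpt
# above; the import path and the chosen values are assumptions.
from trl.experimental.sdpo import SDPOConfig  # assumed module layout

config = SDPOConfig(
    distillation_mode="policy_only",  # replaces the old full_logit_distillation flag
    beta=0.04,           # reference-model KL coefficient (see the Bugbot issue above)
    epsilon=0.2,         # GRPO-style lower clipping bound
    epsilon_high=0.28,   # asymmetric upper clipping bound
    scale_rewards=True,  # reward scaling
)
```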