
experimental: Self-Distillation Zero#5609

Open
LeonEricsson wants to merge 40 commits into huggingface:main from LeonEricsson:feature/sd-zero

Conversation

Collaborator

LeonEricsson commented Apr 20, 2026

Implements *Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision* on top of #5573.

SD-Zero is composed of two stages (stage 1 is sketched after the list):

  1. Self-Revision Training (SRT):
    We sample model responses, evaluate correctness, and prompt the model to revise incorrect outputs. Only traces where the revision succeeds are retained, and the model is fine-tuned on this filtered dataset.

  2. Self-Distillation:
    The reviser is used as a teacher to provide token-level supervision over the generator’s responses, effectively converting outcome-level (binary) rewards into dense token-level supervision.
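
A minimal sketch of the stage-1 loop, assuming hypothetical `generate`, `verify`, and `build_revision_prompt` helpers; the real pipeline lives in `srt_collect.py` and may differ:

```python
# Hypothetical sketch of SRT data collection; generate(), verify(), and
# build_revision_prompt() are placeholders, not the actual srt_collect.py API.
def collect_srt_dataset(model, tokenizer, prompts, answers):
    retained = []
    for prompt, answer in zip(prompts, answers):
        response = generate(model, tokenizer, prompt)
        if verify(response, answer):
            continue  # already correct, nothing to revise
        # Prompt the model to revise its own incorrect output.
        revision = generate(model, tokenizer, build_revision_prompt(prompt, response))
        if verify(revision, answer):
            # Retain only traces where the self-revision succeeds.
            retained.append({"prompt": prompt, "completion": revision})
    return retained  # the model is then fine-tuned (SFT) on this filtered set
```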

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

AI writing disclosure

We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.

  • No AI usage: the PR was written entirely by a human.
  • AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
  • AI-generated: the PR was mostly or fully generated by an AI tool.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.


Note

Medium Risk
Adds new experimental trainers/scripts and significantly refactors SDFTTrainer/SDPOTrainer configuration and loss/reward plumbing; mistakes could change training behavior or silently degrade optimization, although the scope is limited to experimental modules.

Overview
Adds experimental Self-Distillation Zero (SD-Zero) support by introducing SRTTrainer (phase-1 self-revision supervised fine-tuning), SDZeroTrainer (phase-2 on-policy self-distillation with a binary verifier), plus runnable scripts for both phases and an offline dataset collection pipeline (srt_collect.py). Documentation is updated to describe SD-Zero and expected dataset schemas.
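
To make the phase-2 idea concrete, a rough sketch (not SDZeroTrainer's actual loss code) of dense token-level supervision: a per-token forward KL from the frozen reviser-teacher to the student, computed over the generator's sampled completions:

```python
import torch.nn.functional as F

# Rough sketch of dense token-level distillation, not SDZeroTrainer's code:
# forward KL from a frozen teacher (the trained reviser) to the student,
# averaged over completion tokens only.
def token_level_distillation_loss(student_logits, teacher_logits, completion_mask):
    # logits: (batch, seq, vocab); completion_mask: (batch, seq), 1 on completion tokens
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    per_token_kl = (teacher_logprobs.exp() * (teacher_logprobs - student_logprobs)).sum(-1)
    return (per_token_kl * completion_mask).sum() / completion_mask.sum().clamp(min=1)
```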

Refactors the experimental self-distillation stack: SDFTTrainer is rebuilt on BaseSelfDistillationTrainer (new finalize_batch path, updated masking to support num_loss_tokens_to_skip), and SDPO moves to an explicit distillation_mode API (replacing full_logit_distillation) while expanding SDPOConfig with GRPO-style policy settings (beta, epsilon*, reward scaling/weights, token vs sequence importance sampling) and adding a new policy_only mode.
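
As a usage sketch of the new configuration surface; field names are inferred from this PR's description and the reviewed diff and may differ in the final API:

```python
from trl.experimental.sdpo import SDPOConfig  # experimental module; import path may differ

config = SDPOConfig(
    output_dir="sdpo-sdzero",
    distillation_mode="policy_only",  # explicit mode API replacing full_logit_distillation
    beta=0.04,                        # reference-model KL coefficient (see review thread below)
    epsilon=0.2,                      # lower clipping bound
    epsilon_high=0.28,                # asymmetric upper clipping bound
    scale_rewards=True,               # GRPO-style reward scaling
)
```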

Improves coverage and examples: adds a dedicated BaseSelfDistillationTrainer test suite, updates the SDPO/SDFT docs and tests to the new config knobs and prompt-handling callbacks, and tweaks the GLM-4-MoE training chat-template formatting without behavior changes.

Reviewed by Cursor Bugbot for commit 3f93a8b.

LeonEricsson marked this pull request as ready for review April 22, 2026 19:46
Comment thread: trl/experimental/sdft/sdft_trainer.py
self.scale_rewards = args.scale_rewards
self.epsilon_low = args.epsilon
self.epsilon_high = args.epsilon_high
self.beta = args.beta

Unused beta parameter stored but never applied

Medium Severity

SDPOConfig declares a beta parameter documented as "Reference-model KL coefficient for online policy optimization," and SDPOTrainer.__init__ stores it as self.beta. However, _compute_policy_loss never uses self.beta — there is no reference-model KL penalty term in the loss. A user setting beta > 0 would expect a KL regularization effect but get none, leading to silently incorrect training behavior.
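
For reference, the kind of term a user setting beta > 0 would expect; a sketch with illustrative variable names (the k3 KL estimator common in GRPO-style trainers), not SDPOTrainer's internals:

```python
import torch

# Sketch of the missing reference-model KL penalty; variable names are
# illustrative, not SDPOTrainer's internals.
def add_kl_penalty(per_token_loss, logprobs, ref_logprobs, beta):
    log_ratio = ref_logprobs - logprobs
    per_token_kl = torch.exp(log_ratio) - log_ratio - 1  # k3 estimate of KL(pi_theta || pi_ref)
    return per_token_loss + beta * per_token_kl
```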

Additional Locations (1)

Reviewed by Cursor Bugbot for commit ab20ace.

…l-self-distillation

# Conflicts:
#	trl/experimental/sdft/sdft_trainer.py
#	trl/experimental/sdpo/sdpo_trainer.py
#	trl/experimental/self_distillation/base_self_distillation_trainer.py
#	trl/experimental/self_distillation/online_rollout_mixin.py
#	trl/experimental/self_distillation/teacher_context.py

cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 unresolved issues in total (including 1 from a previous review).



Reviewed by Cursor Bugbot for commit 3f93a8b.

teacher_prompt_ids = [torch.tensor(ids) for ids in teacher_prompt_ids_list]
teacher_prompt_mask = [torch.ones_like(ids, dtype=torch.long) for ids in teacher_prompt_ids]
teacher_prompt_ids = pad(teacher_prompt_ids, padding_value=self.pad_token_id, padding_side="left").to(
    device=device
)

Missing pad_token_id attribute causes AttributeError

High Severity

SDZeroTrainer.finalize_batch references self.pad_token_id, but this attribute is never defined on the trainer or any of its parent classes. The base class BaseSelfDistillationTrainer consistently uses self._tokenizer.pad_token_id for padding operations. This will raise an AttributeError at runtime when finalize_batch is called during training.
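
The straightforward fix, mirroring the excerpted call above but following the base class convention (a sketch):

```python
# Sketch of the fix: use the tokenizer's pad token id, matching the base class.
teacher_prompt_ids = pad(
    teacher_prompt_ids, padding_value=self._tokenizer.pad_token_id, padding_side="left"
).to(device=device)
```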


Reviewed by Cursor Bugbot for commit 3f93a8b.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

