
[GKD] Buffer Implementation for Distillation Trainer #5137

Open
cmpatino wants to merge 36 commits into huggingface:main from cmpatino:kd-buffering

Conversation

cmpatino (Collaborator) commented Feb 20, 2026

Implement Buffer for Distillation Trainer (GOLDTrainer)

Implement generation buffering and multi-generation support for GOLDTrainer

Add a prompt-level generation buffer that decouples generation from the optimization steps. We adopt a buffering strategy similar to GRPO's: all rollouts for all mini-batches within an optimization step are generated up front, leveraging parallel inference engines. Each worker therefore handles a buffer of per_device_train_batch_size * gradient_accumulation_steps samples.

Buffer Details

We allow multiple rollouts per prompt, following Thinking Machines' Tinker example. The number of rollouts per prompt is controlled by the num_generations parameter. To keep the effective batch size constant, we introduce the generation_batch_size parameter, which controls how many unique prompts are passed to the inference engine. We enforce generation_batch_size = per_device_train_batch_size * gradient_accumulation_steps // num_generations so that the effective batch size is invariant across setups.
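The constraint above can be sketched as a small helper (hypothetical code, not the PR's actual implementation; parameter names follow the description above):

```python
def generation_batch_size(per_device_train_batch_size: int,
                          gradient_accumulation_steps: int,
                          num_generations: int) -> int:
    """Number of unique prompts sent to the inference engine per worker
    per optimizer step, chosen so the effective batch size stays constant."""
    buffer_size = per_device_train_batch_size * gradient_accumulation_steps
    if buffer_size % num_generations != 0:
        raise ValueError(
            "per_device_train_batch_size * gradient_accumulation_steps "
            "must be divisible by num_generations"
        )
    return buffer_size // num_generations

# Example: batch 4, grad accumulation 8, 2 rollouts per prompt
# -> 16 unique prompts per worker, 32 rollouts per optimizer step.
print(generation_batch_size(4, 8, 2))  # 16
```

Note that the divisibility check mirrors the "strict validation" mentioned later: any combination that would break the invariance is rejected up front.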

Benchmarks

We can replicate Thinking Machines' results using both the non-Liger and Liger losses, achieving a ~3x speedup on a setup with 8 training nodes in colocate mode.

[Figure: Tinker vs. TRL (Liger) phase timing comparison]
Phase    | Tinker (s) | TRL (s)
---------|------------|--------
Sampling | 329.83     | 130
Loss     | 37.96      | -
Training | 98.69      | 38
Total    | 492.28     | 173

Before submitting

  • Did you read the contributor guideline,
    Pull Request section?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

Medium Risk
Touches core training-loop mechanics (dataloader sampling, buffering, and generation integration), which can affect correctness and throughput across distributed/grad-accumulation setups despite added validation and tests.

Overview
Implements prompt-level rollout buffering in GOLDTrainer so on-policy generations are produced once per optimizer window and reused across gradient_accumulation_steps, including a custom get_train_dataloader() + RepeatSampler strategy and new buffer management (_fill_buffer, slice selection, and logging).
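The RepeatSampler idea mentioned above can be illustrated with a minimal, framework-free sketch (hypothetical; the PR's actual sampler integrates with the PyTorch dataloader and the distributed setup):

```python
class RepeatSampler:
    """Yield each dataset index `num_repeats` times in a row, so that a
    prompt's buffered rollouts can be consumed across consecutive
    gradient-accumulation steps without regenerating them."""

    def __init__(self, dataset_len: int, num_repeats: int):
        self.dataset_len = dataset_len
        self.num_repeats = num_repeats

    def __iter__(self):
        for idx in range(self.dataset_len):
            for _ in range(self.num_repeats):
                yield idx

    def __len__(self):
        return self.dataset_len * self.num_repeats

# 3 prompts, 2 rollouts each -> each index appears twice, back to back.
print(list(RepeatSampler(3, 2)))  # [0, 0, 1, 1, 2, 2]
```

Repeating indices at the sampler level is what lets _fill_buffer generate once per optimizer window and then serve slices of the buffer to each micro-batch.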

Adds multi-generation support via new GOLDConfig knobs (num_generations, generation_batch_size) with strict validation, updates vLLM generation paths to handle n>1 (with prompt deduplication), and refactors completion processing to rebuild sequences/labels consistently (shared _build_sequence_batch).
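The prompt-deduplication step for n>1 can be sketched as follows (hypothetical helper, not the PR's code): collapse repeated prompts before calling the engine with n=num_generations, and keep a mapping back to the original order so completions can be reassigned.

```python
def dedup_prompts(prompts: list[str]) -> tuple[list[str], list[int]]:
    """Collapse repeated prompts to a unique list, recording for each
    original position the index of the unique prompt it came from."""
    unique: list[str] = []
    index_of: dict[str, int] = {}
    mapping: list[int] = []
    for p in prompts:
        if p not in index_of:
            index_of[p] = len(unique)
            unique.append(p)
        mapping.append(index_of[p])
    return unique, mapping

# A buffer with num_generations=2 repeats each prompt twice; only the
# unique prompts go to the inference engine.
unique, mapping = dedup_prompts(["a", "a", "b", "b"])
print(unique)   # ['a', 'b']
print(mapping)  # [0, 0, 1, 1]
```

With the engine returning n completions per unique prompt, the mapping is enough to hand each buffered slot its own rollout.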

Also tightens model/teacher revision handling (teacher_model_revision, revised init kwargs merging), adjusts Liger fused loss invocation (explicit hard/soft weights + ZeRO-3 gather context), updates docs/examples, and adds a targeted unit test ensuring prompt retokenization uses left padding when stitching vLLM completions.

Written by Cursor Bugbot for commit 1cbdc32.

@cmpatino cmpatino requested a review from qgallouedec March 3, 2026 21:24
@cmpatino cmpatino marked this pull request as ready for review March 3, 2026 21:24
Copilot AI (Contributor) left a comment

Pull request overview

Implements prompt-level rollout buffering and multi-generation support for GOLDTrainer, decoupling generation from optimization to improve throughput (similar to GRPO-style buffering).

Changes:

  • Add buffered generation across gradient-accumulation windows, including multi-generation per prompt and vLLM dedup/remapping logic.
  • Introduce new config knobs (num_generations, generation_batch_size) with validation and updated revision handling (student_model_revision vs model_revision).
  • Update docs and the example training script to reflect the new configuration behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File | Description
-----|------------
trl/experimental/gold/gold_trainer.py | Adds buffered dataloader strategy + vLLM multi-generation processing and related training-step changes.
trl/experimental/gold/gold_config.py | Adds num_generations / generation_batch_size and validates optimizer-window batch partitioning.
trl/experimental/gold/gold.py | Aligns model revision handling and teacher init kwargs; updates example wiring.
docs/source/gold_trainer.md | Documents new buffering knobs, revision behavior, and last-batch drop warning.



lewtun commented Mar 4, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e9cb5eb56


@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


gen_prompts = all_prompts_text
gen_n = 1

completion_ids = self.vllm_client.generate(
A reviewer (Member) commented:

Ideally we would like to use trl.generation.VLLMGeneration, but it can be done in a future PR if it's too hard.

@qgallouedec
(Member)

I haven't reviewed it in detail; I have a general idea of what it's about, but I'm leaving the implementation mostly up to you. In future PRs, we can try to align it better with the rest of the codebase, but what matters most right now are the results you're getting.

Make sure to run `make precommit` to make the CI happy.

sampling ratio.
* `num_generations`, `generation_batch_size` – control buffered rollout generation across gradient accumulation windows.
`generation_batch_size` is the number of unique prompts per worker per optimizer step.
* `student_model_revision` and `model_revision` – if `student_model_revision` is unset, GOLD uses `model_revision`.
A reviewer (Member) commented:

`student_model_revision` is removed in this PR, no?

@@ -365,12 +382,6 @@ class GOLDConfig(SFTConfig):
num_completions_to_print: int = field(default=5, metadata={"help": "Number of completions to print."})
A reviewer (Member) commented:

Suggested change
num_completions_to_print: int = field(default=5, metadata={"help": "Number of completions to print."})

duplicated

qgallouedec (Member) left a comment

Just ensure the CI is green before merging.
