
Conversation

@roycho96

What does this PR do?

This PR fixes an IndexError (pop from an empty deque) that occurs when using GKDTrainer with DeepSpeed ZeRO-3.

Problem

When using ZeRO-3, training fails on the second step with:

File "deepspeed/runtime/zero/partitioned_param_coordinator.py", line 217, in record_parameters
    step_id = self.__step_id_module_fetched_for[sub_module.ds_id].popleft()
IndexError: pop from an empty deque

Root Cause

DeepSpeed registers fwd_pre_hook separately from the forward_hooks list. The current remove_hooks() only removes the hooks tracked in forward_hooks, so fwd_pre_hook stays attached and reset_step() keeps being called on every forward pass during generation, corrupting the PartitionedParameterCoordinator's internal trace state.

When add_hooks() later restores the hooks without resetting this corrupted state, the next training forward pass fails with the error above.
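
The underlying pattern is easy to reproduce outside of DeepSpeed. The toy snippet below is not TRL or DeepSpeed code; it only illustrates how a forward pre-hook tracked outside the forward_hooks list survives a removal pass that clears just that list:

import torch
import torch.nn as nn

module = nn.Linear(4, 4)
forward_hooks = []

# post-forward hook, tracked in a list (analogous to DeepSpeed's forward_hooks)
forward_hooks.append(module.register_forward_hook(lambda m, inp, out: print("forward hook fired")))
# pre-forward hook, tracked elsewhere (analogous to DeepSpeed's fwd_pre_hook)
fwd_pre_hook = module.register_forward_pre_hook(lambda m, inp: print("pre-hook still fired")))

# a remove_hooks() that only clears the list leaves the pre-hook attached
for hook in forward_hooks:
    hook.remove()
forward_hooks.clear()

module(torch.randn(1, 4))  # still prints "pre-hook still fired"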

Solution

Call coordinator._invalidate_trace() in add_hooks() before re-registering hooks to reset the coordinator to a clean state.

Testing

  • Tested manually with GKDTrainer + DeepSpeed ZeRO-3
  • Config: lmbda=0.7, seq_kd=True
  • Model: Llama 3.2 1B (student model), Llama 3.2 3B (teacher model)
  • DeepSpeed version: 0.17.6
lmbda   seq_kd   Before fix   After fix
0.0     True     ✅ Pass      ✅ Pass
0.7     True     ❌ Fail      ✅ Pass
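
For reference, a minimal config matching the failing row above (only the relevant arguments are shown; the DeepSpeed config path is a placeholder):

from trl import GKDConfig

training_args = GKDConfig(
    output_dir="gkd-zero3-repro",
    lmbda=0.7,    # proportion of on-policy, student-generated outputs
    seq_kd=True,
    deepspeed="ds_zero3_config.json",  # placeholder path to a ZeRO-3 config
)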

This fix also benefits other trainers that use unwrap_model_for_generation with ZeRO-3.
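
For context, this is the pattern in which the hooks are removed and restored. Illustrative usage only; the exact signature may differ across TRL versions, and model, accelerator, and input_ids are placeholders here:

from trl.models import unwrap_model_for_generation

# under ZeRO-3, the forward hooks are removed on entering the context and re-added on leaving it,
# which is where this fix takes effect
with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
    generated_ids = unwrapped_model.generate(input_ids, max_new_tokens=64)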

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif
Collaborator

kashif commented Dec 15, 2025

thanks @roycho96 checking!

@kashif kashif self-assigned this Dec 15, 2025
Comment on lines 89 to 95
if hasattr(optimizer_offload, "param_coordinator"):
    coordinator = optimizer_offload.param_coordinator
    # Only invalidate if trace is not already invalid
    is_invalid = getattr(coordinator, "is_invalid_trace", lambda: False)()
    if not is_invalid:
        if hasattr(coordinator, "_invalidate_trace"):
            coordinator._invalidate_trace()
Member

@qgallouedec qgallouedec Dec 15, 2025

each time you use a getattr/hasattr, can you please explicitly and clearly mention via a comment in which case the object doesn't have the attribute? Eg:

if not hasattr(model, "optimizer"):  # before the first training step, the model has no optimizer

Author

Thanks for the review! Updated the code.
I removed unnecessary hasattr/getattr checks for is_invalid_trace() and _invalidate_trace() since these methods always exist in PartitionedParameterCoordinator.
Also, I added a comment that param_coordinator only exists in ZeRO stage 3.
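
For reference, a sketch of what the revised hunk might look like based on the description above (the updated diff itself isn't quoted in this thread):

if hasattr(optimizer_offload, "param_coordinator"):  # param_coordinator only exists with ZeRO stage 3
    coordinator = optimizer_offload.param_coordinator
    # is_invalid_trace() and _invalidate_trace() always exist on PartitionedParameterCoordinator,
    # so no getattr/hasattr guards are needed
    if not coordinator.is_invalid_trace():
        coordinator._invalidate_trace()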
