
Conversation

@roycho96

What does this PR do?

This PR fixes an IndexError (pop from an empty deque) that occurs when using GKDTrainer with DeepSpeed ZeRO-3.

Problem

When using ZeRO-3, training fails on the second step with:

File "deepspeed/runtime/zero/partitioned_param_coordinator.py", line 217, in record_parameters
    step_id = self.__step_id_module_fetched_for[sub_module.ds_id].popleft()
IndexError: pop from an empty deque

Root Cause

DeepSpeed registers fwd_pre_hook separately from the forward_hooks list. The current remove_hooks() only removes the hooks tracked in forward_hooks, so fwd_pre_hook stays attached and reset_step() keeps being called on every forward pass during generation, corrupting the PartitionedParameterCoordinator's internal trace state.

When add_hooks() later restores the hooks without resetting this corrupted state, the next training forward pass fails with the error above.
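
The underlying pattern is easy to reproduce outside of DeepSpeed. The toy snippet below is not TRL or DeepSpeed code; it only illustrates how a forward pre-hook tracked outside the forward_hooks list survives a removal pass that clears just that list:

import torch
import torch.nn as nn

module = nn.Linear(4, 4)
forward_hooks = []

# post-forward hook, tracked in a list (analogous to DeepSpeed's forward_hooks)
forward_hooks.append(module.register_forward_hook(lambda m, inp, out: print("forward hook fired")))
# pre-forward hook, tracked elsewhere (analogous to DeepSpeed's fwd_pre_hook)
fwd_pre_hook = module.register_forward_pre_hook(lambda m, inp: print("pre-hook still fired")))

# a remove_hooks() that only clears the list leaves the pre-hook attached
for hook in forward_hooks:
    hook.remove()
forward_hooks.clear()

module(torch.randn(1, 4))  # still prints "pre-hook still fired"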

Solution

Call coordinator._invalidate_trace() in add_hooks() before re-registering hooks to reset the coordinator to a clean state.

Testing

  • Tested manually with GKDTrainer + DeepSpeed ZeRO-3
  • Config: lmbda=0.7, seq_kd=True
  • Model: Llama 3.2 1B (student model), Llama 3.2 3B (teacher model)
  • DeepSpeed version: 0.17.6
lmbda   seq_kd   Before fix   After fix
0.0     True     ✅ Pass      ✅ Pass
0.7     True     ❌ Fail      ✅ Pass
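
For reference, a minimal config matching the failing row above (only the relevant arguments are shown; the DeepSpeed config path is a placeholder):

from trl import GKDConfig

training_args = GKDConfig(
    output_dir="gkd-zero3-repro",
    lmbda=0.7,    # proportion of on-policy, student-generated outputs
    seq_kd=True,
    deepspeed="ds_zero3_config.json",  # placeholder path to a ZeRO-3 config
)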

This fix also benefits other trainers that use unwrap_model_for_generation with ZeRO-3.
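
For context, this is the pattern in which the hooks are removed and restored. Illustrative usage only; the exact signature may differ across TRL versions, and model, accelerator, and input_ids are placeholders here:

from trl.models import unwrap_model_for_generation

# under ZeRO-3, the forward hooks are removed on entering the context and re-added on leaving it,
# which is where this fix takes effect
with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
    generated_ids = unwrapped_model.generate(input_ids, max_new_tokens=64)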

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif
Collaborator

kashif commented Dec 15, 2025

thanks @roycho96 checking!

@kashif kashif self-assigned this Dec 15, 2025
Comment on lines 89 to 95
if hasattr(optimizer_offload, "param_coordinator"):
    coordinator = optimizer_offload.param_coordinator
    # Only invalidate if trace is not already invalid
    is_invalid = getattr(coordinator, "is_invalid_trace", lambda: False)()
    if not is_invalid:
        if hasattr(coordinator, "_invalidate_trace"):
            coordinator._invalidate_trace()
Member

@qgallouedec qgallouedec Dec 15, 2025

each time you use a getattr/hasattr, can you please explicitly and clearly mention via a comment in which case the object doesn't have the attribute? Eg:

if not hasattr(model, "optimizer"):  # before the first training step, the model has no optimizer

Author

Thanks for the review! Updated the code.
I removed unnecessary hasattr/getattr checks for is_invalid_trace() and _invalidate_trace() since these methods always exist in PartitionedParameterCoordinator.
Also, I added a comment that param_coordinator only exists in ZeRO stage 3.
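
For reference, a sketch of what the revised hunk might look like based on the description above (the updated diff itself isn't quoted in this thread):

if hasattr(optimizer_offload, "param_coordinator"):  # param_coordinator only exists with ZeRO stage 3
    coordinator = optimizer_offload.param_coordinator
    # is_invalid_trace() and _invalidate_trace() always exist on PartitionedParameterCoordinator,
    # so no getattr/hasattr guards are needed
    if not coordinator.is_invalid_trace():
        coordinator._invalidate_trace()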
