Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] update results of state_dict loading, embedding resizing to secondary partitions (hpz) #7130

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

cyr0930
Copy link

@cyr0930 cyr0930 commented Mar 11, 2025

After this commit (#4906), secondary partitioned tensors are updated only after optimizer.step().

When loading state_dict or resizing embedding after init, secondary partitioned tensors should be updated.

e.g., https://github.com/huggingface/transformers/blob/1c4b62b219323a31011bac3bd3cece7675d9e4c3/src/transformers/integrations/deepspeed.py#L344

@tjruwase
Copy link
Contributor

@cyr0930, thanks for this PR. Can you provide more context about the failure this fixes? For example, did you encounter convergence issues after checkpoint loading?

@cyr0930
Copy link
Author

cyr0930 commented Mar 12, 2025

Partitioned parameters are updated only when ds_secondary_partition_tensor is None by this commit (#4906).
And ds_secondary_partition_tensors only become None after optimizer.step function is called (that function contains logic that invalidate secondary tensor)
But there's other cases that partitioned parameters should be updated such as loading state_dict or resizing embeddings.
(Above explanation is based on transformers implementation)

For now, after parameter initialization, ds_secondary_partition_tensors are created and existed for each params, so they are not updated when we perform state_dict loading or embedding resizing.
Maybe we can add a function that invalidate ds_secondary_partition and call it after every parameter changes.
But IMO it is quite messy and I decided to revert has_been_updated part of the previous commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants