Description
I have a codebase forked from torchtitan with minor changes. FSDP trains very well with minimal instability, but HSDP on the same codebase exhibits loss spikes.
Can you folks think of any reason for this? Note that I have implemented gradient accumulation in my fork, though without changing any sharding behavior (it only accumulates gradients over a larger effective batch size).
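For context, my accumulation loop follows roughly the standard pattern sketched below. This is a minimal illustration, not my exact code; `train_step`, `data_iter`, and `accum_steps` are placeholder names. Each microbatch loss is scaled by `1 / accum_steps` before `backward()`, sharding and gradient synchronization are left at their defaults, and the optimizer steps once per global batch.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the accumulation loop; `model`, `optimizer`,
# `data_iter`, and `accum_steps` stand in for the real training-loop objects.
def train_step(model, optimizer, data_iter, accum_steps: int):
    optimizer.zero_grad()
    for _ in range(accum_steps):
        inputs, labels = next(data_iter)
        logits = model(inputs)
        loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten(0, 1))
        # Scale each microbatch loss so the accumulated gradient matches
        # a single step on the full (larger) batch.
        (loss / accum_steps).backward()
    # One optimizer step per global batch, after all microbatches.
    optimizer.step()
```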