DeepSpeed 0.15.3
ZeRO stage 3 is used.
For "safe_get_full_grad", does it return the same gradient values on each process/rank?
As for "safe_set_full_grad", should it be called on all processes/ranks, or is calling it on just one of them enough?
If it must be called on all of them, do users need to ensure that the gradient values set on each process/rank are the same?
Also, which float type should be used for "safe_set_full_grad"? Is there any way to check this?