Description
I have been comparing DDP with Fairscale Sharded DDP + OSS, and found the training progress of our model to be very different between the two setups.
After some investigation, I suspected a race condition in the broadcasting of gradients in sharded DDP. To confirm this, I changed `ShardedDataParallel._try_consume_work_handles` to call `_consume_work_handles` instead. If I understand correctly, this simply adds a blocking wait for all pending reduces to finish, and it should also be a safe change in the absence of races, since in that case the async reduces would already be finished by the time `_try_consume_work_handles` is called.
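To make the nature of the change concrete, here is a simplified sketch of the two consumption semantics. This is not fairscale's actual code; `FakeWorkHandle`, `try_consume`, and `consume` are hypothetical stand-ins for the async work handles returned by `torch.distributed` collectives and for the two methods on `ShardedDataParallel`:

```python
import threading


class FakeWorkHandle:
    """Hypothetical stand-in for an async work handle from a
    torch.distributed collective (e.g. an async reduce)."""

    def __init__(self):
        self._event = threading.Event()

    def is_completed(self):
        # Non-blocking check: has the async op finished?
        return self._event.is_set()

    def wait(self):
        # Blocking wait until the async op finishes.
        self._event.wait()

    def finish(self):
        # Simulate the collective completing in the background.
        self._event.set()


def try_consume(handles_and_callbacks):
    """Non-blocking variant: fire callbacks only for handles that have
    already completed; return the rest untouched. If a gradient's reduce
    has not finished, its callback is deferred -- which is where a race
    can appear if the gradient is read before a later consume."""
    remaining = []
    for handle, callback in handles_and_callbacks:
        if handle.is_completed():
            callback()
        else:
            remaining.append((handle, callback))
    return remaining


def consume(handles_and_callbacks):
    """Blocking variant: wait for every pending reduce, then fire its
    callback, so all gradients are guaranteed consistent afterwards."""
    for handle, callback in handles_and_callbacks:
        handle.wait()
        callback()
```

The modification described above amounts to routing every call through the blocking `consume` path: waiting on an already-completed handle is a no-op, so in the absence of races the extra waits change nothing.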
This gave us three conditions to check:
- DDP
- Sharded DDP
- Sharded DDP (extra syncs)
We found that "DDP" and "Sharded DDP (extra syncs)" were exactly reproducible between runs, and the loss values they produced were similar, though not identical, across the two conditions. The plain "Sharded DDP" was not reproducible between runs: the first few steps were identical across repeat runs, after which they diverged. Its loss values were also significantly different from both the baseline "DDP" and "Sharded DDP (extra syncs)".
This raises a few questions that I'd like to get some help with:
- Is the modification I made to add extra syncs correct? If so, it suggests there is a race condition at least in our usage of Sharded DDP, though I don't think we're doing anything unusual.
- Is it expected that "Sharded DDP" and "DDP" produce significantly different training dynamics?