Description
I have been comparing DDP with Fairscale Sharded DDP + OSS, and found the training progress of our model to be very different between the two setups.
After some investigation, I suspected a race condition in the broadcasting of gradients in sharded DDP. To confirm this, I changed `ShardedDataParallel._try_consume_work_handles` to call `_consume_work_handles` instead. If I understand correctly, this simply adds a blocking wait for all pending reduces to finish, and it should also be a safe change in the absence of races, since in that case the async reduces would already be finished by the time `_try_consume_work_handles` is called.
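To make the nature of the change concrete, here is a simplified sketch of the two consumption semantics. This is not fairscale's actual code; `FakeWorkHandle`, `try_consume`, and `consume` are hypothetical stand-ins for the async work handles returned by `torch.distributed` collectives and for the two methods on `ShardedDataParallel`:

```python
import threading


class FakeWorkHandle:
    """Hypothetical stand-in for an async work handle from a
    torch.distributed collective (e.g. an async reduce)."""

    def __init__(self):
        self._event = threading.Event()

    def is_completed(self):
        # Non-blocking check: has the async op finished?
        return self._event.is_set()

    def wait(self):
        # Blocking wait until the async op finishes.
        self._event.wait()

    def finish(self):
        # Simulate the collective completing in the background.
        self._event.set()


def try_consume(handles_and_callbacks):
    """Non-blocking variant: fire callbacks only for handles that have
    already completed; return the rest untouched. If a gradient's reduce
    has not finished, its callback is deferred -- which is where a race
    can appear if the gradient is read before a later consume."""
    remaining = []
    for handle, callback in handles_and_callbacks:
        if handle.is_completed():
            callback()
        else:
            remaining.append((handle, callback))
    return remaining


def consume(handles_and_callbacks):
    """Blocking variant: wait for every pending reduce, then fire its
    callback, so all gradients are guaranteed consistent afterwards."""
    for handle, callback in handles_and_callbacks:
        handle.wait()
        callback()
```

The modification described above amounts to routing every call through the blocking `consume` path: waiting on an already-completed handle is a no-op, so in the absence of races the extra waits change nothing.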
This gave us three conditions to check:
- DDP
- Sharded DDP
- Sharded DDP (extra syncs)
We found that "DDP" and "Sharded DDP (extra syncs)" were exactly reproducible between runs, and the loss values they produced were similar, though not identical, across the two conditions. The plain "Sharded DDP" was not reproducible between runs: the first few steps were identical across repeat runs, after which they diverged. Its loss values were also significantly different from both the baseline "DDP" and "Sharded DDP (extra syncs)".
This raises a few questions that I'd like to get some help with:
- Is the modification I made to add extra syncs correct? If so, it suggests there is a race condition at least in our usage of Sharded DDP, though I don't think we're doing anything unusual.
- Is it expected that "Sharded DDP" and "DDP" produce significantly different training dynamics?