Open
Description
We currently always heal on step 0 to avoid synchronization issues. We want an option to support skipping this sync for users who set the PyTorch seed so all ranks are initialized with the same values.
The option should be named init_sync, matching the name used in pytorch/pytorch#142824.
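As a rough illustration of the intended behavior (not the actual torchft code), the sketch below shows how an init_sync flag could gate the step-0 heal. The `Replica` class, the `replicas_to_heal` function, and the rule that replicas behind the max step always heal are assumptions made for this example; only the step-0 condition itself comes from the snippet referenced under "Relevant code".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """Hypothetical stand-in for a quorum member."""
    replica_id: str
    step: int


def replicas_to_heal(
    replicas: List[Replica], primary_id: str, init_sync: bool = True
) -> List[str]:
    """Return the ids of replicas that should recover state from the primary.

    Assumption: replicas behind the max step always heal. The step-0 rule
    (every non-primary replica heals) is only applied when init_sync is True.
    """
    max_step = max(r.step for r in replicas)
    to_heal = []
    for r in replicas:
        behind = r.step < max_step
        # Step-0 sync: force non-primary replicas to copy the primary's
        # initial weights, unless the user opted out with init_sync=False.
        initial_sync = init_sync and max_step == 0 and r.replica_id != primary_id
        if behind or initial_sync:
            to_heal.append(r.replica_id)
    return to_heal
```

With init_sync=False, a fresh job where every replica is at step 0 would heal nobody, which is only safe if all ranks were seeded to the same initial values.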
A bonus would be to randomly initialize a value in Manager so that we can detect whether ranks are seeded identically, and throw an error on the first quorum if there's a mismatch.
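One possible shape for that check, sketched under assumptions: each rank draws a token from its RNG at startup, and the tokens are compared on the first quorum. The names `draw_init_token` and `check_init_tokens` are hypothetical, and Python's `random` module stands in for the PyTorch RNG so the example stays self-contained.

```python
import random
from typing import Dict, Optional


def draw_init_token(seed: Optional[int] = None) -> int:
    """Draw a 64-bit token from an RNG.

    Ranks that use the same seed get the same token. With seed=None the
    RNG is seeded from OS entropy, so tokens will almost surely differ.
    """
    rng = random.Random(seed)
    return rng.getrandbits(64)


def check_init_tokens(tokens: Dict[str, int]) -> None:
    """Raise if the replicas in the first quorum reported different tokens.

    `tokens` maps a (hypothetical) replica id to the token it drew.
    """
    if len(set(tokens.values())) > 1:
        raise RuntimeError(
            "init_sync is disabled but replicas drew different init tokens; "
            "ranks do not appear to share a seed: %r" % (tokens,)
        )
```

In a real implementation the token would have to come from the same RNG stream the user seeds for model initialization, otherwise matching tokens would say nothing about matching weights.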
Relevant code:
- Manager https://github.com/pytorch/torchft/blob/main/torchft/manager.py
max_step == 0 && primary.replica_id != p.replica_id
Lines 403 to 410 in d427bef