Thanks for the code release!
Heads up for other users who want to resume training from a checkpoint: you will want to
- de-indent DDP_main.py:80 so that all devices can load the checkpoint
- load the optimizer and scheduler states at DDP_main.py:146
- set the index of the dataloader to the correct example before actually training
I'm not totally sure this covers everything (logging, for example), but it might work OK.
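For anyone who wants something concrete, here is a rough sketch of the three steps above. It is not the repo's code: the checkpoint keys (`"model"`, `"optimizer"`, `"scheduler"`, `"epoch"`, `"batches_done"`) and both function names are made up for illustration, so adapt them to whatever DDP_main.py actually saves. It only assumes that each object exposes `load_state_dict`, as PyTorch modules, optimizers and schedulers do.

```python
import itertools


def restore_training_state(checkpoint, model, optimizer, scheduler):
    """Restore model, optimizer and scheduler from an already-loaded checkpoint dict.

    Call this on every rank, not only rank 0. With PyTorch the dict would
    typically come from torch.load(path, map_location=...) on each process.
    All dictionary keys below are hypothetical.
    """
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    scheduler.load_state_dict(checkpoint["scheduler"])
    # Return where training stopped: the epoch and how many batches of it were done.
    return checkpoint["epoch"], checkpoint["batches_done"]


def resume_batches(dataloader, batches_done):
    """Yield (index, batch) pairs, skipping the batches already trained on."""
    return itertools.islice(enumerate(dataloader), batches_done, None)
```

Two caveats: skipping this way still iterates over the skipped batches, so it can be slow for large offsets, and if the sampler shuffles (for example a `DistributedSampler`), you would also need to call `sampler.set_epoch(epoch)` so the order matches the interrupted run.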
Note: There's also a separate issue that your checkpoints might get overwritten between epochs, so be sure you're loading the right thing and saving where you want.
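One possible way around the overwriting problem is to put the epoch number in the checkpoint filename and pick the newest file when resuming. This is only a sketch; the naming scheme and the helper names `checkpoint_path` and `latest_checkpoint` are my own assumptions, not something the repo provides.

```python
from pathlib import Path


def checkpoint_path(save_dir, epoch):
    """Build a per-epoch filename so later epochs do not overwrite earlier ones."""
    # Zero-padding keeps alphabetical order equal to numeric order (up to epoch 9999).
    return Path(save_dir) / f"checkpoint_epoch_{epoch:04d}.pt"


def latest_checkpoint(save_dir):
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    paths = sorted(Path(save_dir).glob("checkpoint_epoch_*.pt"))
    return paths[-1] if paths else None
```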