Resuming training via `--load_step`

Thanks for the code release!

Heads up for other users who want to resume training from a checkpoint. You will want to:

1. de-indent DDP_main.py:80 so that all devices load the checkpoint
2. load the optimizer and scheduler states at DDP_main.py:146
3. advance the dataloader to the correct example before training starts
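Here is a rough sketch of what these steps could look like. It is not the repo's actual code. The checkpoint keys (`"model"`, `"optimizer"`, `"scheduler"`) and all function names are hypothetical, and I'm assuming `--load_step` counts batches from the start of training.

```python
from itertools import islice


def restore_state(path, model, optimizer, scheduler):
    """Steps 1 and 2: call this on every rank, not only rank 0.

    The keys "model", "optimizer" and "scheduler" are assumed here;
    adapt them to whatever your checkpoint actually contains.
    """
    import torch  # imported lazily so the helpers below work without torch

    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])


def resume_position(load_step, batches_per_epoch):
    """Step 3: split a global step into (epoch, batches already done in it)."""
    return divmod(load_step, batches_per_epoch)


def skip_consumed(loader, batches_done):
    """Step 3: iterate over only the batches not yet trained on this epoch."""
    return islice(loader, batches_done, None)


# Example: resuming at step 2500 with 1000 batches per epoch
epoch, batches_done = resume_position(2500, 1000)
print(epoch, batches_done)  # 2 500
```

Skipping batches only lands on the right examples if the data order is reproducible. With a `DistributedSampler`, that means calling `sampler.set_epoch(epoch)` with the resumed epoch before iterating.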

I'm not sure this covers everything (logging, for example), but it may work well enough.

Note: there is also a separate issue where checkpoints might get overwritten between epochs, so make sure you are loading the right file and saving where you intend.
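One possible way around the overwriting is to put the step number in the checkpoint filename. This is a minimal sketch; the `step_XXXXXXXX.pt` naming scheme and both helper names are made up for illustration and are not what the repo currently does.

```python
from pathlib import Path


def checkpoint_path(save_dir, step):
    """Build a filename that is unique per step, so saves never collide."""
    return Path(save_dir) / f"step_{step:08d}.pt"


def latest_step(save_dir):
    """Return the highest step with a saved checkpoint, or None if none exist."""
    steps = [int(p.stem.split("_")[1]) for p in Path(save_dir).glob("step_*.pt")]
    return max(steps) if steps else None
```

Zero-padding the step keeps the files sorted in a directory listing, and `latest_step` gives you a value to sanity-check against what you pass to `--load_step`.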
