Fix dataloader not reloading when resuming from checkpoint#21514
Fix dataloader not reloading when resuming from checkpoint#21514littlebullGit wants to merge 10 commits into
Conversation
When resuming from a checkpoint with reload_dataloaders_every_n_epochs,
the dataloader was not being reloaded at the correct epoch. This was
because setup_data() was overwriting _last_train_dl_reload_epoch with
the current epoch during checkpoint restoration, losing the information
about when the dataloader was actually last reloaded.
The fix:
1. Save _last_train_dl_reload_epoch in checkpoint state
2. Restore _last_train_dl_reload_epoch from checkpoint on load
3. Only update _last_train_dl_reload_epoch when actually reloading
the dataloader or during initial setup (not when resuming)
This ensures _should_reload_train_dl returns the correct value after
resuming from a checkpoint.
Backward compatible: old checkpoints without this key will default to
float('-inf'), which triggers a reload (the safest behavior).
Fixes Lightning-AI#21492
5c24d70 to
6afeb53
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #21514 +/- ##
=========================================
- Coverage 87% 79% -8%
=========================================
Files 270 267 -3
Lines 23888 23846 -42
=========================================
- Hits 20668 18777 -1891
- Misses 3220 5069 +1849 |
|
@SkafteNicki @Borda @deependujha Can you take a look of this PR ? |
|
Hi @littlebullGit, thanks for the fix and for the PR. The fix itself looks neat. I’m a bit mixed on the tests, though: they’re very thorough, but they also feel a bit verbose/heavy for the behavior we’re trying to validate. Do you think we could simplify them without losing coverage? My concern is this test touches too many parts and seems more like |
@deependujha , thank you for the feedback. Good point. I agreed the original test was too integration heavy. I have simplified the test now just records the |
When resuming from a checkpoint with reload_dataloaders_every_n_epochs, the dataloader was not being reloaded at the correct epoch. This was because setup_data() was overwriting _last_train_dl_reload_epoch with the current epoch during checkpoint restoration, losing the information about when the dataloader was actually last reloaded.
The fix:
This ensures _should_reload_train_dl returns the correct value after resuming from a checkpoint.
Backward compatible: old checkpoints without this key will default to float('-inf'), which triggers a reload (the safest behavior).
Fixes #21492
📚 Documentation preview 📚: https://pytorch-lightning--21514.org.readthedocs.build/en/21514/