Fix dataloader not reloading when resuming from checkpoint#21514

Open
littlebullGit wants to merge 10 commits into
Lightning-AI:masterfrom
littlebullGit:fix/21492-dataloader-reload-checkpoint

@littlebullGit
Contributor

@littlebullGit littlebullGit commented Jan 28, 2026

When resuming from a checkpoint with reload_dataloaders_every_n_epochs, the dataloader was not being reloaded at the correct epoch. This was because setup_data() was overwriting _last_train_dl_reload_epoch with the current epoch during checkpoint restoration, losing the information about when the dataloader was actually last reloaded.

The fix:

  1. Save _last_train_dl_reload_epoch in checkpoint state
  2. Restore _last_train_dl_reload_epoch from checkpoint on load
  3. Only update _last_train_dl_reload_epoch when actually reloading the dataloader or during initial setup (not when resuming)

This ensures _should_reload_train_dl returns the correct value after resuming from a checkpoint.
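The three steps above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyFitLoop`, `resume`, and the `last_reload_epoch` state key are hypothetical stand-ins, not Lightning's actual classes or checkpoint layout, and the reload condition is a simplified guess at what `_should_reload_train_dl` checks. The toy also ignores that a resumed process has to build its dataloader once regardless.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFitLoop:
    """Hypothetical stand-in for the fit loop's reload bookkeeping (not Lightning code)."""

    reload_every_n_epochs: int
    last_reload_epoch: float = float("-inf")
    reload_epochs: list = field(default_factory=list)

    def should_reload(self, epoch: int) -> bool:
        # Reload once at least n epochs have passed since the last reload.
        n = self.reload_every_n_epochs
        return bool(n) and epoch - self.last_reload_epoch >= n

    def run(self, start_epoch: int, end_epoch: int) -> None:
        for epoch in range(start_epoch, end_epoch):
            if self.should_reload(epoch):
                self.reload_epochs.append(epoch)
                # Step 3: only updated when a reload actually happens.
                self.last_reload_epoch = epoch

    def state_dict(self) -> dict:
        # Step 1: persist the last reload epoch.
        return {"last_reload_epoch": self.last_reload_epoch}

    def load_state_dict(self, state: dict) -> None:
        # Step 2: restore it on load.
        self.last_reload_epoch = state.get("last_reload_epoch", float("-inf"))


def resume(state, start_epoch, end_epoch, n, overwrite_on_setup):
    loop = ToyFitLoop(n)
    loop.load_state_dict(state)
    if overwrite_on_setup:
        # Behavior described as the bug: setup overwrites the restored value.
        loop.last_reload_epoch = start_epoch
    loop.run(start_epoch, end_epoch)
    return loop.reload_epochs


first = ToyFitLoop(3)
first.run(0, 5)  # epochs 0..4, then "checkpoint"
checkpoint = first.state_dict()

old_behavior = resume(checkpoint, 5, 9, 3, overwrite_on_setup=True)
new_behavior = resume(checkpoint, 5, 9, 3, overwrite_on_setup=False)
```

In this toy run the first segment reloads at epochs 0 and 3. With the overwrite, the resumed segment does not reload until epoch 8; with the restored value it reloads at epoch 6, which matches the schedule an uninterrupted run would follow.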

Backward compatible: old checkpoints without this key will default to float('-inf'), which triggers a reload (the safest behavior).
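A minimal sketch of why the `float('-inf')` default is safe, assuming the value is read from a plain dict and that the reload check compares an epoch difference against the interval. The helper names and the exact key are illustrative assumptions, not the PR's code.

```python
def restore_last_reload_epoch(state: dict) -> float:
    # Old checkpoints lack the key, so fall back to -inf (key name is assumed here).
    return state.get("_last_train_dl_reload_epoch", float("-inf"))


def should_reload(current_epoch: int, last_reload_epoch: float, every_n: int) -> bool:
    # Simplified stand-in for the reload condition.
    return bool(every_n) and current_epoch - last_reload_epoch >= every_n


old_checkpoint = {}  # saved before the key existed
new_checkpoint = {"_last_train_dl_reload_epoch": 5}

reload_from_old = should_reload(7, restore_last_reload_epoch(old_checkpoint), every_n=5)
reload_from_new = should_reload(7, restore_last_reload_epoch(new_checkpoint), every_n=5)
```

Because any finite epoch minus `-inf` is `inf`, the old checkpoint always satisfies the condition and reloads immediately, while the new checkpoint waits until the interval has actually elapsed.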

Fixes #21492


📚 Documentation preview 📚: https://pytorch-lightning--21514.org.readthedocs.build/en/21514/

@github-actions github-actions Bot added the pl Generic label for PyTorch Lightning package label Jan 28, 2026
@littlebullGit littlebullGit force-pushed the fix/21492-dataloader-reload-checkpoint branch from 5c24d70 to 6afeb53 Compare January 28, 2026 02:04
@codecov

codecov Bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (7983ecb) to head (234f734).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (7983ecb) and HEAD (234f734).

HEAD has 546 fewer uploads than BASE.

| Flag | BASE (7983ecb) | HEAD (234f734) |
|---|---|---|
| python3.10 | 12 | 3 |
| cpu | 168 | 42 |
| lightning | 60 | 15 |
| pytest | 84 | 0 |
| lightning_fabric | 54 | 0 |
| python3.12 | 48 | 12 |
| python | 12 | 3 |
| python3.11 | 24 | 6 |
| python3.13 | 36 | 9 |
| python3.12.7 | 36 | 9 |
| pytorch_lightning | 54 | 27 |
| pytorch2.7 | 6 | 3 |
| pytest-full | 84 | 42 |
| pytorch2.3 | 6 | 3 |
| pytorch2.1 | 12 | 6 |
| pytorch2.9 | 12 | 6 |
| pytorch2.8 | 12 | 6 |
| pytorch2.5.1 | 6 | 3 |
| pytorch2.2.2 | 6 | 3 |
| pytorch2.10 | 12 | 6 |
| pytorch2.6 | 6 | 3 |
| pytorch2.4.1 | 6 | 3 |

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21514     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23888    23846     -42     
=========================================
- Hits        20668    18777   -1891     
- Misses       3220     5069   +1849     

@littlebullGit
Contributor Author

@SkafteNicki @Borda @deependujha Can you take a look at this PR?

@deependujha
Collaborator

Hi @littlebullGit, thanks for the fix and for the PR.

The fix itself looks neat. I’m a bit mixed on the tests, though: they’re very thorough, but they also feel a bit verbose/heavy for the behavior we’re trying to validate.

Do you think we could simplify them without losing coverage?

My concern is that this test touches too many parts and reads more like an integration test, which might make it harder to maintain. But if no simpler way to verify the behavior seems suitable, it will be fine.

Comment thread src/lightning/pytorch/CHANGELOG.md Outdated
@littlebullGit
Contributor Author

> Hi @littlebullGit, thanks for the fix and for the PR.
>
> The fix itself looks neat. I’m a bit mixed on the tests, though: they’re very thorough, but they also feel a bit verbose/heavy for the behavior we’re trying to validate.
>
> Do you think we could simplify them without losing coverage?
>
> My concern is that this test touches too many parts and reads more like an integration test, which might make it harder to maintain. But if no simpler way to verify the behavior seems suitable, it will be fine.

@deependujha, thank you for the feedback. Good point. I agree the original test was too integration-heavy. I have simplified it: the test now just records the epochs at which train_dataloader() is called and asserts the expected sequence. Please take a look.
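The recording pattern described here might look roughly like the following. This is a sketch only: `RecordingModule` and `fake_fit` are hypothetical stand-ins so the example runs without Lightning installed, and the PR's actual test presumably uses a real module and `Trainer` instead.

```python
from types import SimpleNamespace


class RecordingModule:
    """Stand-in for a LightningModule that logs when its dataloader is requested."""

    def __init__(self):
        self.trainer = SimpleNamespace(current_epoch=0)
        self.call_epochs = []

    def train_dataloader(self):
        # Record the epoch at which the loader was (re)requested.
        self.call_epochs.append(self.trainer.current_epoch)
        return [0, 1, 2]  # placeholder batches


def fake_fit(module, max_epochs, reload_every_n_epochs):
    """Hypothetical driver standing in for Trainer.fit; not Lightning code."""
    last_reload = float("-inf")
    for epoch in range(max_epochs):
        module.trainer.current_epoch = epoch
        if epoch - last_reload >= reload_every_n_epochs:
            module.train_dataloader()
            last_reload = epoch


module = RecordingModule()
fake_fit(module, max_epochs=6, reload_every_n_epochs=2)
assert module.call_epochs == [0, 2, 4]
```

Asserting on the recorded sequence keeps the test focused on one observable behavior (when the loader is requested) rather than on internal loop state.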

Comment thread src/lightning/pytorch/loops/fit_loop.py
Comment thread src/lightning/pytorch/loops/fit_loop.py Outdated

Labels

pl Generic label for PyTorch Lightning package

Development

Successfully merging this pull request may close these issues.

Dataloader reload bug when loading from checkpoint

2 participants