Fix dataloader not reloading when resuming from checkpoint by littlebullGit · Pull Request #21514 · Lightning-AI/pytorch-lightning

littlebullGit · 2026-01-28T01:33:17Z

When resuming from a checkpoint with reload_dataloaders_every_n_epochs, the dataloader was not being reloaded at the correct epoch. This was because setup_data() was overwriting _last_train_dl_reload_epoch with the current epoch during checkpoint restoration, losing the information about when the dataloader was actually last reloaded.

The fix:

1. Save _last_train_dl_reload_epoch in checkpoint state
2. Restore _last_train_dl_reload_epoch from checkpoint on load
3. Only update _last_train_dl_reload_epoch when actually reloading the dataloader or during initial setup (not when resuming)

This ensures _should_reload_train_dl returns the correct value after resuming from a checkpoint.
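To make the failure mode concrete, here is a minimal, self-contained sketch of the reload bookkeeping. It is a toy model, not Lightning's actual fit loop: `ToyLoop`, `run`, and the `"last_reload_epoch"` key are hypothetical names, and the reload condition (`epoch - last_reload_epoch >= every_n`) is assumed to mirror what `_should_reload_train_dl` checks.

```python
from dataclasses import dataclass


@dataclass
class ToyLoop:
    """Hypothetical stand-in for the fit loop's dataloader-reload bookkeeping."""

    every_n: int  # plays the role of reload_dataloaders_every_n_epochs
    last_reload_epoch: float = float("-inf")

    def should_reload(self, epoch: int) -> bool:
        # Assumed reload condition: enough epochs have passed since the last reload.
        return epoch - self.last_reload_epoch >= self.every_n

    def state_dict(self) -> dict:
        # Step 1 of the fix: persist the last reload epoch.
        return {"last_reload_epoch": self.last_reload_epoch}

    def load_state_dict(self, state: dict) -> None:
        # Step 2 of the fix: restore it instead of recomputing it.
        self.last_reload_epoch = state.get("last_reload_epoch", float("-inf"))


def run(loop: ToyLoop, start: int, stop: int) -> list:
    """Return the epochs in [start, stop) at which the dataloader is reloaded."""
    reloads = []
    for epoch in range(start, stop):
        if loop.should_reload(epoch):
            # Step 3 of the fix: only update on an actual reload.
            loop.last_reload_epoch = epoch
            reloads.append(epoch)
    return reloads


# Train epochs 0..2 with a reload every 2 epochs, then checkpoint.
original = ToyLoop(every_n=2)
first_run = run(original, 0, 3)  # [0, 2]
checkpoint = original.state_dict()

# Old behavior: resuming at epoch 3 overwrote the value with the current epoch.
buggy = ToyLoop(every_n=2)
buggy.last_reload_epoch = 3
buggy_reloads = run(buggy, 3, 7)  # [5] -- the reload due at epoch 4 is skipped

# Fixed behavior: the saved value is restored, so the schedule continues.
fixed = ToyLoop(every_n=2)
fixed.load_state_dict(checkpoint)
fixed_reloads = run(fixed, 3, 7)  # [4, 6]
```

In this model the uninterrupted schedule is 0, 2, 4, 6; only the restored state reproduces it after the resume.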

Backward compatible: old checkpoints without this key will default to float('-inf'), which triggers a reload (the safest behavior).
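The backward-compatibility default can be illustrated with a short sketch. The helper names and the flat dictionary layout are assumptions made for illustration; only the key name and the `float('-inf')` default come from the description above.

```python
def restore_last_reload_epoch(loop_state: dict) -> float:
    # Hypothetical helper: old checkpoints lack the key, so fall back to -inf.
    return loop_state.get("_last_train_dl_reload_epoch", float("-inf"))


def should_reload(current_epoch: int, last_reload_epoch: float, every_n: int) -> bool:
    # Assumed reload condition: at least every_n epochs since the last reload.
    return current_epoch - last_reload_epoch >= every_n


# Legacy checkpoint: the difference is +inf, so a reload is always triggered.
legacy_last = restore_last_reload_epoch({})
legacy_decision = should_reload(7, legacy_last, 3)

# New checkpoint: the stored epoch is used, so no premature reload happens.
new_last = restore_last_reload_epoch({"_last_train_dl_reload_epoch": 6})
new_decision = should_reload(7, new_last, 3)
```

Because any finite epoch minus negative infinity is positive infinity, the comparison holds for every positive reload interval, which is why the missing key always results in a reload.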

Fixes #21492

📚 Documentation preview 📚: https://pytorch-lightning--21514.org.readthedocs.build/en/21514/

codecov · 2026-01-28T02:22:04Z

Codecov Report

❌ Patch coverage is 95.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 79%. Comparing base (7983ecb) to head (234f734).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (7983ecb) and HEAD (234f734). Click for more details.

HEAD has 546 uploads less than BASE

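| Flag | BASE (7983ecb) | HEAD (234f734) |
|---|---|---|
| python3.10 | 12 | 3 |
| cpu | 168 | 42 |
| lightning | 60 | 15 |
| pytest | 84 | 0 |
| lightning_fabric | 54 | 0 |
| python3.12 | 48 | 12 |
| python | 12 | 3 |
| python3.11 | 24 | 6 |
| python3.13 | 36 | 9 |
| python3.12.7 | 36 | 9 |
| pytorch_lightning | 54 | 27 |
| pytorch2.7 | 6 | 3 |
| pytest-full | 84 | 42 |
| pytorch2.3 | 6 | 3 |
| pytorch2.1 | 12 | 6 |
| pytorch2.9 | 12 | 6 |
| pytorch2.8 | 12 | 6 |
| pytorch2.5.1 | 6 | 3 |
| pytorch2.2.2 | 6 | 3 |
| pytorch2.10 | 12 | 6 |
| pytorch2.6 | 6 | 3 |
| pytorch2.4.1 | 6 | 3 |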

Additional details and impacted files

@@            Coverage Diff            @@
##           master   #21514     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23888    23846     -42     
=========================================
- Hits        20668    18777   -1891     
- Misses       3220     5069   +1849

littlebullGit · 2026-03-11T18:06:57Z

@SkafteNicki @Borda @deependujha Could you take a look at this PR?

deependujha · 2026-03-17T10:03:23Z

Hi @littlebullGit, thanks for the fix and for the PR.

The fix itself looks neat. I’m a bit mixed on the tests, though: they’re very thorough, but they also feel a bit verbose/heavy for the behavior we’re trying to validate.

Do you think we could simplify them without losing coverage?

My concern is that this test touches too many parts and reads more like an integration test, which might make it harder to maintain. But if there is no easier way to verify the behavior, it will be fine.

littlebullGit · 2026-03-18T03:01:38Z

@deependujha, thank you for the feedback. Good point: I agree the original test was too integration-heavy. I have simplified it so that it now just records the epochs at which train_dataloader() is called and asserts the expected sequence. Please take a look.

littlebullGit requested review from ethanwharris, justusschock, lantiga and tchaton as code owners January 28, 2026 01:33

github-actions Bot added the pl Generic label for PyTorch Lightning package label Jan 28, 2026

littlebullGit force-pushed the fix/21492-dataloader-reload-checkpoint branch from 5c24d70 to 6afeb53 Compare January 28, 2026 02:04

github-actions Bot added the has conflicts label Jan 28, 2026

Merge branch 'master' into fix/21492-dataloader-reload-checkpoint

267371d

github-actions Bot removed the has conflicts label Jan 30, 2026

Merge branch 'master' into fix/21492-dataloader-reload-checkpoint

c656110

deependujha reviewed Mar 17, 2026

View reviewed changes

Comment thread src/lightning/pytorch/CHANGELOG.md Outdated

deependujha and others added 3 commits March 17, 2026 15:36

Apply suggestion from @deependujha

85005a4

Merge branch 'master' into fix/21492-dataloader-reload-checkpoint

5aca1c2

test: simplify checkpoint resume dataloader regression

3c61710

deependujha reviewed Mar 18, 2026

View reviewed changes

Comment thread src/lightning/pytorch/loops/fit_loop.py

littlebullGit added 2 commits March 19, 2026 11:31

fix: handle legacy checkpoint dataloader reload state

9eee2a9

fix: annotate legacy checkpoint reload state

e23de3b

deependujha reviewed Mar 20, 2026

View reviewed changes

Comment thread src/lightning/pytorch/loops/fit_loop.py Outdated

refactor: rename missing reload state helper

0ce5089

deependujha approved these changes Mar 21, 2026

View reviewed changes

Merge branch 'master' into fix/21492-dataloader-reload-checkpoint

234f734

littlebullGit wants to merge 10 commits into Lightning-AI:master from littlebullGit:fix/21492-dataloader-reload-checkpoint

2 participants
