
Conversation

@littlebullGit (Contributor) commented Nov 24, 2025

What does this PR do?

  • Use `strategy.reduce_boolean_decision` instead of `broadcast` in `ModelCheckpoint.file_exists`.
  • Ensure only global rank 0 touches the filesystem when checking for existing checkpoints.
  • Avoid `broadcast_object_list` for a simple boolean in DDP to reduce memory pressure in the checkpoint path.
  • Add a small DDP test with monitor=None to exercise this path (a sketch appears under Tests below).

Fixes #19674

Motivation and context

In DDP, `strategy.broadcast` is implemented via torch.distributed.broadcast_object_list, which serializes the Python object and can allocate unnecessary GPU memory even for a single boolean. For the "file exists" decision we only need a tiny boolean reduction, so `reduce_boolean_decision` is a better fit and addresses the CUDA OOM reported in #19674 while preserving behavior.
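For context, a minimal before/after sketch of the pattern in `ModelCheckpoint.file_exists` (simplified from the Lightning source; the `all=False` argument and the exact rank-0 guard shown here are assumptions for illustration, not a copy of the PR):

```python
# Before (simplified): rank 0's result is broadcast as a Python object,
# which goes through torch.distributed.broadcast_object_list under DDP.
def file_exists(self, filepath, trainer) -> bool:
    exists = self._fs.exists(filepath)
    return trainer.strategy.broadcast(exists)


# After (sketch): only global rank 0 touches the filesystem; the boolean is
# then combined across ranks with reduce_boolean_decision, which uses a tiny
# tensor all-reduce instead of object serialization.
def file_exists(self, filepath, trainer) -> bool:
    exists = self._fs.exists(filepath) if trainer.is_global_zero else False
    # all=False -> "any rank saw the file", which here is rank 0's answer
    return trainer.strategy.reduce_boolean_decision(exists, all=False)
```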

Dependencies

  • No new runtime dependencies introduced by this PR.
  • Tests rely on pytorch_lightning_enterprise being available, as required by `tests/tests_pytorch/conftest.py`.

Tests

All run inside the project .venv:

  • `python -m pytest tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py`
  • `python -m pytest tests/tests_pytorch/checkpointing -k "not legacy_checkpoints"`
  • `python -m pytest tests/tests_pytorch/callbacks/test_model_checkpoint_*.py tests/tests_pytorch/trainer/test_trainer.py`
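
For reference, a hedged sketch of what the added DDP test with monitor=None could look like (the test name, BoringModel usage, and Trainer arguments are illustrative assumptions, not the exact test in this PR):

```python
import pytest
import torch
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel


@pytest.mark.skipif(not torch.distributed.is_available(), reason="requires torch.distributed")
def test_model_checkpoint_ddp_spawn_monitor_none(tmp_path):
    # monitor=None saves checkpoints unconditionally, so the file_exists
    # check runs on every save across both ranks.
    checkpoint = ModelCheckpoint(dirpath=tmp_path, monitor=None, save_top_k=-1)
    trainer = Trainer(
        default_root_dir=tmp_path,
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=0,
        callbacks=[checkpoint],
        logger=False,
        enable_progress_bar=False,
        enable_model_summary=False,
    )
    trainer.fit(BoringModel())
    # At least one checkpoint should have been written by rank 0.
    assert list(tmp_path.glob("*.ckpt"))
```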

📚 Documentation preview 📚: https://pytorch-lightning--21380.org.readthedocs.build/en/21380/

github-actions bot added the "pl" label (Generic label for PyTorch Lightning package) on Nov 24, 2025
codecov bot commented Nov 24, 2025

Codecov Report

❌ Patch coverage is 50.0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82%. Comparing base (8f702b3) to head (0ac7fcf).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (8f702b3) and HEAD (0ac7fcf): HEAD has 1,186 fewer uploads than BASE.
Flag                BASE (8f702b3)   HEAD (0ac7fcf)
cpu                 296              29
lightning_fabric    74               0
pytest              149              0
python3.12          90               9
python3.12.7        89               8
python3.10          30               3
lightning           149              14
python3.11          60               6
python              27               3
pytorch2.2.2        15               3
pytest-full         147              29
pytorch2.4.1        14               2
pytorch2.3          15               3
pytorch2.1          28               6
pytorch2.9          15               3
pytorch_lightning   73               15
pytorch2.7          15               3
pytorch2.5.1        15               3
pytorch2.8          15               3
pytorch2.6          15               3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21380     +/-   ##
=========================================
- Coverage      89%      82%     -7%     
=========================================
  Files         269      266      -3     
  Lines       22050    21997     -53     
=========================================
- Hits        19727    18065   -1662     
- Misses       2323     3932   +1609     

@justusschock (Member) left a comment


Great job @littlebullGit ,

One minor comment. Could you also please add a changelog entry?


Labels

pl (Generic label for PyTorch Lightning package)


Development

Successfully merging this pull request may close these issues:

CUDA memory increase (caused CUDA OOM) when saving checkpoint at the train_epoch_end (#19674)
