
Conversation

@littlebullGit (Contributor) commented Nov 24, 2025

What does this PR do?

  • Use `strategy.reduce_boolean_decision` instead of `broadcast` in `ModelCheckpoint.file_exists`.
  • Ensure only global rank 0 touches the filesystem when checking for existing checkpoints.
  • Avoid `broadcast_object_list` for a simple boolean in DDP to reduce memory pressure in the checkpoint path.
  • Add a small DDP test with monitor=None to exercise this path (a sketch appears under Tests below).

Fixes #19674

Motivation and context

In DDP, `strategy.broadcast` is implemented via torch.distributed.broadcast_object_list, which serializes the Python object and can allocate unnecessary GPU memory even for a single boolean. For the "file exists" decision we only need a tiny boolean reduction, so `reduce_boolean_decision` is a better fit and addresses the CUDA OOM reported in #19674 while preserving behavior.
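For context, a minimal before/after sketch of the pattern in `ModelCheckpoint.file_exists` (simplified from the Lightning source; the `all=False` argument and the exact rank-0 guard shown here are assumptions for illustration, not a copy of the PR):

```python
# Before (simplified): rank 0's result is broadcast as a Python object,
# which goes through torch.distributed.broadcast_object_list under DDP.
def file_exists(self, filepath, trainer) -> bool:
    exists = self._fs.exists(filepath)
    return trainer.strategy.broadcast(exists)


# After (sketch): only global rank 0 touches the filesystem; the boolean is
# then combined across ranks with reduce_boolean_decision, which uses a tiny
# tensor all-reduce instead of object serialization.
def file_exists(self, filepath, trainer) -> bool:
    exists = self._fs.exists(filepath) if trainer.is_global_zero else False
    # all=False -> "any rank saw the file", which here is rank 0's answer
    return trainer.strategy.reduce_boolean_decision(exists, all=False)
```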

Dependencies

  • No new runtime dependencies introduced by this PR.
  • Tests rely on pytorch_lightning_enterprise being available, as required by `tests/tests_pytorch/conftest.py`.

Tests

All run inside the project .venv:

  • `python -m pytest tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py`
  • `python -m pytest tests/tests_pytorch/checkpointing -k "not legacy_checkpoints"`
  • `python -m pytest tests/tests_pytorch/callbacks/test_model_checkpoint_*.py tests/tests_pytorch/trainer/test_trainer.py`
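
For reference, a hedged sketch of what the added DDP test with monitor=None could look like (the test name, BoringModel usage, and Trainer arguments are illustrative assumptions, not the exact test in this PR):

```python
import pytest
import torch
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.demos.boring_classes import BoringModel


@pytest.mark.skipif(not torch.distributed.is_available(), reason="requires torch.distributed")
def test_model_checkpoint_ddp_spawn_monitor_none(tmp_path):
    # monitor=None saves checkpoints unconditionally, so the file_exists
    # check runs on every save across both ranks.
    checkpoint = ModelCheckpoint(dirpath=tmp_path, monitor=None, save_top_k=-1)
    trainer = Trainer(
        default_root_dir=tmp_path,
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=0,
        callbacks=[checkpoint],
        logger=False,
        enable_progress_bar=False,
        enable_model_summary=False,
    )
    trainer.fit(BoringModel())
    # At least one checkpoint should have been written by rank 0.
    assert list(tmp_path.glob("*.ckpt"))
```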

📚 Documentation preview 📚: https://pytorch-lightning--21380.org.readthedocs.build/en/21380/

github-actions bot added the "pl" label (Generic label for PyTorch Lightning package) on Nov 24, 2025
codecov bot commented Nov 24, 2025

Codecov Report

❌ Patch coverage is 50.0% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 82%. Comparing base (8f702b3) to head (0ac7fcf).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (8f702b3) and HEAD (0ac7fcf): HEAD has 1,186 fewer uploads than BASE.
Flag                BASE (8f702b3)   HEAD (0ac7fcf)
cpu                 296              29
lightning_fabric    74               0
pytest              149              0
python3.12          90               9
python3.12.7        89               8
python3.10          30               3
lightning           149              14
python3.11          60               6
python              27               3
pytorch2.2.2        15               3
pytest-full         147              29
pytorch2.4.1        14               2
pytorch2.3          15               3
pytorch2.1          28               6
pytorch2.9          15               3
pytorch_lightning   73               15
pytorch2.7          15               3
pytorch2.5.1        15               3
pytorch2.8          15               3
pytorch2.6          15               3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21380     +/-   ##
=========================================
- Coverage      89%      82%     -7%     
=========================================
  Files         269      266      -3     
  Lines       22050    21997     -53     
=========================================
- Hits        19727    18065   -1662     
- Misses       2323     3932   +1609     

@justusschock (Member) left a comment


Great job @littlebullGit ,

One minor comment. Could you also please add a changelog entry?


Labels

pl (Generic label for PyTorch Lightning package)


Development

Successfully merging this pull request may close these issues:

CUDA memory increase (caused CUDA OOM) when saving checkpoint at the train_epoch_end (#19674)
