Fix ModelCheckpoint file_exists OOM in DDP #21380
Open
+65
−2
What does this PR do?
- Avoid `broadcast_object_list` for a simple boolean in DDP, to reduce memory pressure in the checkpoint path.
- Add a test with `monitor=None` to exercise this path.

Fixes #19674
Motivation and context
In DDP, `strategy.broadcast` is implemented via `torch.distributed.broadcast_object_list`, which serializes the Python object and can allocate unnecessary GPU memory even for a single boolean. For the "file exists" decision we only need a tiny boolean reduction, so `reduce_boolean_decision` is a better fit: it addresses the CUDA OOM reported in #19674 while preserving behavior.
Dependencies
Tests rely on `pytorch_lightning_enterprise` being available, as required by `tests/tests_pytorch/conftest.py`.
Tests
All run inside the project `.venv`:

- `python -m pytest tests/tests_pytorch/checkpointing/test_checkpoint_callback_frequency.py`
- `python -m pytest tests/tests_pytorch/checkpointing -k "not legacy_checkpoints"`
- `python -m pytest tests/tests_pytorch/callbacks/test_model_checkpoint_*.py tests/tests_pytorch/trainer/test_trainer.py`

📚 Documentation preview 📚: https://pytorch-lightning--21380.org.readthedocs.build/en/21380/