docs: expand DDP metric synchronization guidance #21685
Open
c-pozzi wants to merge 2 commits into Lightning-AI:master
Conversation
Restructure the "Synchronize validation and test logging" section in accelerator_prepare.rst into a problem-framing intro plus three subsections (sync_dist, TorchMetrics, manual all_gather), a decision table, and a common-pitfalls list. Directly addresses the custom-metric case: accumulate per-step outputs, call all_gather at epoch end, and compute the metric. The "my compute runs N times" confusion is called out and resolved — after all_gather every rank holds the same data, so the redundant compute is cheap and correct; only self.log needs the rank_zero_only guard. Refs Lightning-AI#20117
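A minimal sketch of that custom-metric pattern, assuming the standard LightningModule hooks (`LitModel`, the stand-in network, and `compute_f1` are illustrative names, not taken from the diff):

```python
import torch
import lightning.pytorch as pl


class LitModel(pl.LightningModule):  # illustrative name, not from the diff
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)  # stand-in network

    def forward(self, x):
        return self.layer(x)

    def on_validation_epoch_start(self):
        self.val_preds, self.val_targets = [], []

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # Accumulate per-step outputs on each rank.
        self.val_preds.append(self(x).argmax(dim=-1))
        self.val_targets.append(y)

    def on_validation_epoch_end(self):
        # Collective calls: every rank must execute them, so they stay
        # outside any rank-zero guard. Under DDP each call returns a
        # tensor of shape [world_size, *tensor_shape].
        preds = self.all_gather(torch.cat(self.val_preds)).flatten(0, 1)
        targets = self.all_gather(torch.cat(self.val_targets)).flatten(0, 1)
        # After all_gather every rank holds the same full-dataset view,
        # so recomputing the metric on every rank is redundant but cheap
        # and correct.
        score = compute_f1(preds, targets)  # hypothetical metric function
        # Only the logging call needs the rank-zero guard.
        if self.trainer.is_global_zero:
            self.log("val_f1", score, rank_zero_only=True)
```

The guard deliberately wraps only `self.log`; moving `all_gather` inside it would leave the other ranks waiting on a collective that rank 0 never joins.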
Contributor
Author
@deependujha thanks for the green light on this. It ended up being a more opinionated rewrite than I originally thought. I restructured the section into subsections per tool instead of augmenting the flat version, because the three cases answer different questions and readers can land on the wrong one.
deependujha (Collaborator) approved these changes on Apr 29, 2026 and left a comment:
great work. Thanks for the help :)
What does this PR do?
Expands the "Synchronize validation and test logging" section in `docs/source-pytorch/accelerators/accelerator_prepare.rst` to give clearer DDP metric-sync guidance, with focus on the custom-metric case raised in #20117.

The existing section prescribes `sync_dist=True` and shows a terse `all_gather` example, but doesn't explain:

- that under DDP each rank evaluates only its own shard of the data (a `1/world_size` view),
- when `sync_dist=True` is silently wrong for non-averageable metrics (F1, AUC, precision/recall on imbalanced classes),
- the `[world_size, *tensor_shape]` shape returned by `self.all_gather()`,
- that after `all_gather` every rank holds the same data, so the redundant work is cheap and correct,
- that `is_global_zero` must wrap `self.log`, not `all_gather` itself (a collective inside a rank guard → hang).
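To make the second and third bullets concrete, a hedged sketch (the `shared_eval` helper is hypothetical; only `self.log` and `self.all_gather` are Lightning's actual API):

```python
import lightning.pytorch as pl


class LitModel(pl.LightningModule):  # illustrative
    def validation_step(self, batch, batch_idx):
        loss, preds = self.shared_eval(batch)  # hypothetical eval helper
        # Averageable scalar: the default cross-rank mean reduction is
        # exactly what we want, so sync_dist=True is correct here.
        self.log("val_loss", loss, sync_dist=True)
        # Non-averageable metric input: a mean of per-rank F1 scores is
        # not the F1 of the full dataset, so gather the raw tensors.
        gathered = self.all_gather(preds)
        # gathered.shape == (world_size, *preds.shape) under DDP
```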
Restructured the section into:
- A problem-framing intro plus three subsections: `sync_dist=True`, TorchMetrics, and manual `all_gather`.
- The `all_gather` example mirrors the pattern in #20117 ("Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0"): accumulate per-step outputs, gather in `on_validation_epoch_end`, compute the metric, log on rank 0.

Docs-only; no behavior change.
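For the TorchMetrics subsection, the route presumably follows the standard torchmetrics integration (a sketch assuming `MulticlassF1Score` and a stand-in network; the exact example in the docs may differ):

```python
import torch
import lightning.pytorch as pl
from torchmetrics.classification import MulticlassF1Score


class LitModel(pl.LightningModule):  # illustrative
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)  # stand-in network
        self.val_f1 = MulticlassF1Score(num_classes=10)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        # update() accumulates metric state locally on this rank.
        self.val_f1.update(self.layer(x), y)
        # Logging the metric object defers compute() to epoch end and
        # lets torchmetrics sync state across ranks, so no manual
        # all_gather or rank-zero guard is needed.
        self.log("val_f1", self.val_f1, on_step=False, on_epoch=True)
```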
Fixes #20117
Before submitting
📚 Documentation preview 📚: https://pytorch-lightning--21685.org.readthedocs.build/en/21685/