
docs: expand DDP metric synchronization guidance#21685

Open
c-pozzi wants to merge 2 commits into Lightning-AI:master from c-pozzi:docs/ddp-metrics-20117

Conversation

@c-pozzi
Contributor

@c-pozzi c-pozzi commented Apr 24, 2026

What does this PR do?

Expands the "Synchronize validation and test logging" section in docs/source-pytorch/accelerators/accelerator_prepare.rst to give clearer DDP metric-sync guidance, with focus on the custom-metric case raised in #20117.

The existing section prescribes sync_dist=True and shows a terse all_gather example, but doesn't explain:

  • Why the sync is needed (failure mode: checkpoint selection driven by rank 0's 1/world_size view).
  • That sync_dist=True is silently wrong for non-averageable metrics (F1, AUC, precision/recall on imbalanced classes).
  • The [world_size, *tensor_shape] shape returned by self.all_gather().
  • The reporter's "my compute runs N times" confusion — after all_gather every rank holds the same data, so the redundant work is cheap and correct.
  • That is_global_zero must wrap self.log, not all_gather itself — all_gather is a collective op, so skipping it on non-zero ranks deadlocks the job.
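To make the "silently wrong" point concrete, here is a minimal simulation (plain Python, no actual DDP; the two per-rank shards are made-up illustration data) showing that the mean of per-rank F1 scores — which is effectively what sync_dist=True gives you for a custom scalar — differs from the F1 of the pooled predictions:

```python
def f1(preds, targets):
    """Binary F1 from parallel lists of 0/1 labels."""
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, targets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Two hypothetical ranks, each seeing a different shard of the val set.
rank0 = ([1, 1, 0, 0], [1, 0, 0, 0])
rank1 = ([0, 0, 0, 1], [1, 1, 0, 1])

# What averaging per-rank values reports: mean of the local F1 scores.
averaged = (f1(*rank0) + f1(*rank1)) / 2                # ≈ 0.5833

# What the metric should be: F1 over the gathered (pooled) predictions.
pooled = f1(rank0[0] + rank1[0], rank0[1] + rank1[1])   # ≈ 0.5714

print(f"mean of per-rank F1: {averaged:.4f}, pooled F1: {pooled:.4f}")
```

The gap here is small, but it grows with class imbalance and with uneven shard sizes — which is exactly when people reach for F1 in the first place.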

What changed

Restructured the section into:

  1. Problem framing (what goes wrong without sync).
  2. Three subsections, one per tool: sync_dist=True, TorchMetrics, manual all_gather. The all_gather example mirrors the pattern requested in #20117 ("Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0"): accumulate per-step outputs, gather in on_validation_epoch_end, compute the metric, log on rank 0.
  3. "Which one should I use?" decision table.
  4. "Common pitfalls" list.
  5. "See also" link to the TorchMetrics DDP guide.

Docs-only — no behavior change.

Fixes #20117

Before submitting
  • Was this discussed/agreed via a GitHub issue? Yes — #20117 ("Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0"), ack from @deependujha on 2026-04-20.
  • Did you read the contributor guideline's PR section?
  • Did you make sure your PR does only one thing?
  • Did you make sure to update the documentation with your changes? Yes — this IS the doc update.
  • Did you write any new necessary tests? N/A — docs-only.
  • Did you verify tests pass locally? Built docs locally with `sphinx-build -n`; no warnings emitted for this file.
  • Did you list all breaking changes? None.
  • Did you update the CHANGELOG? N/A — docs-only per the template note.

📚 Documentation preview 📚: https://pytorch-lightning--21685.org.readthedocs.build/en/21685/

Restructure the "Synchronize validation and test logging" section in
accelerator_prepare.rst into a problem-framing intro plus three
subsections (sync_dist, TorchMetrics, manual all_gather), a decision
table, and a common-pitfalls list.

Directly addresses the custom-metric case: accumulate per-step outputs,
call all_gather at epoch end, and compute the metric. The "my compute
runs N times" confusion is called out and resolved — after all_gather
every rank holds the same data, so the redundant compute is cheap and
correct; only self.log needs the rank_zero_only guard.

Refs Lightning-AI#20117
@github-actions github-actions Bot added docs Documentation related pl Generic label for PyTorch Lightning package labels Apr 24, 2026
@c-pozzi
Contributor Author

c-pozzi commented Apr 24, 2026

@deependujha thanks for the green light on this. It ended up being a more opinionated rewrite than I originally thought. I restructured the section into subsections per tool instead of augmenting the flat version, because the three cases answer different questions and readers can land on the wrong one.

Collaborator

@deependujha deependujha left a comment


great work. Thanks for the help :)

@deependujha
Collaborator

The linkcheck failures are due to GitHub 502 errors; GitHub is probably also in maintenance mode right now due to another issue.




Development

Successfully merging this pull request may close these issues.

Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0

2 participants