
docs: expand DDP metric synchronization guidance#21685

Open
c-pozzi wants to merge 2 commits into Lightning-AI:master from c-pozzi:docs/ddp-metrics-20117

Conversation

@c-pozzi
Contributor

@c-pozzi c-pozzi commented Apr 24, 2026

What does this PR do?

Expands the "Synchronize validation and test logging" section in docs/source-pytorch/accelerators/accelerator_prepare.rst to give clearer DDP metric-sync guidance, with focus on the custom-metric case raised in #20117.

The existing section prescribes sync_dist=True and shows a terse all_gather example, but doesn't explain:

  • Why the sync is needed (failure mode: checkpoint selection driven by rank 0's 1/world_size view).
  • That sync_dist=True is silently wrong for non-averageable metrics (F1, AUC, precision/recall on imbalanced classes).
  • The [world_size, *tensor_shape] shape returned by self.all_gather().
  • The reporter's "my compute runs N times" confusion — after all_gather every rank holds the same data, so the redundant work is cheap and correct.
  • That is_global_zero must wrap self.log, not all_gather itself — all_gather is a collective op, so skipping it on non-zero ranks deadlocks the job.
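To make the "silently wrong" point concrete, here is a minimal simulation (plain Python, no actual DDP; the two per-rank shards are made-up illustration data) showing that the mean of per-rank F1 scores — which is effectively what sync_dist=True gives you for a custom scalar — differs from the F1 of the pooled predictions:

```python
def f1(preds, targets):
    """Binary F1 from parallel lists of 0/1 labels."""
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, targets))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, targets))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, targets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Two hypothetical ranks, each seeing a different shard of the val set.
rank0 = ([1, 1, 0, 0], [1, 0, 0, 0])
rank1 = ([0, 0, 0, 1], [1, 1, 0, 1])

# What averaging per-rank values reports: mean of the local F1 scores.
averaged = (f1(*rank0) + f1(*rank1)) / 2                # ≈ 0.5833

# What the metric should be: F1 over the gathered (pooled) predictions.
pooled = f1(rank0[0] + rank1[0], rank0[1] + rank1[1])   # ≈ 0.5714

print(f"mean of per-rank F1: {averaged:.4f}, pooled F1: {pooled:.4f}")
```

The gap here is small, but it grows with class imbalance and with uneven shard sizes — which is exactly when people reach for F1 in the first place.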

What changed

Restructured the section into:

  1. Problem framing (what goes wrong without sync).
  2. Three subsections, one per tool: sync_dist=True, TorchMetrics, manual all_gather. The all_gather example mirrors the pattern requested in #20117 ("Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0"): accumulate per-step outputs, gather in on_validation_epoch_end, compute the metric, log on rank 0.
  3. "Which one should I use?" decision table.
  4. "Common pitfalls" list.
  5. "See also" link to the TorchMetrics DDP guide.

Docs-only — no behavior change.

Fixes #20117

Before submitting
  • Was this discussed/agreed via a GitHub issue? Yes — #20117 ("Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0"), ack from @deependujha on 2026-04-20.
  • Did you read the contributor guideline's PR section?
  • Did you make sure your PR does only one thing?
  • Did you make sure to update the documentation with your changes? Yes — this IS the doc update.
  • Did you write any new necessary tests? N/A — docs-only.
  • Did you verify tests pass locally? Built docs locally with `sphinx-build -n`; no warnings emitted for this file.
  • Did you list all breaking changes? None.
  • Did you update the CHANGELOG? N/A — docs-only per the template note.

📚 Documentation preview 📚: https://pytorch-lightning--21685.org.readthedocs.build/en/21685/

Restructure the "Synchronize validation and test logging" section in
accelerator_prepare.rst into a problem-framing intro plus three
subsections (sync_dist, TorchMetrics, manual all_gather), a decision
table, and a common-pitfalls list.

Directly addresses the custom-metric case: accumulate per-step outputs,
call all_gather at epoch end, and compute the metric. The "my compute
runs N times" confusion is called out and resolved — after all_gather
every rank holds the same data, so the redundant compute is cheap and
correct; only self.log needs the rank_zero_only guard.

Refs Lightning-AI#20117
@github-actions github-actions Bot added docs Documentation related pl Generic label for PyTorch Lightning package labels Apr 24, 2026
@c-pozzi
Contributor Author

c-pozzi commented Apr 24, 2026

@deependujha thanks for the green light on this. It ended up being a more opinionated rewrite than I originally thought. I restructured the section into subsections per tool instead of augmenting the flat version, because the three cases answer different questions and readers can land on the wrong one.

Collaborator

@deependujha deependujha left a comment


great work. Thanks for the help :)

@deependujha
Collaborator

The linkcheck failures are due to GitHub 502 errors; GitHub is probably also in maintenance mode right now due to another issue.




Development

Successfully merging this pull request may close these issues.

Have an example of showing explicitly how to calculate metrics in DDP for lightning 2.2.0

2 participants