RecMetricModule: apply reduction function before gloo all gathers #3593

jeffkbkim · 2025-12-05T00:18:31Z

Differential Revision: D88297404

meta-codesync · 2025-12-05T00:18:42Z

@jeffkbkim has exported this pull request. If you are a Meta employee, you can view the originating Diff in D88297404.

…ta-pytorch#3593) Summary: Pull Request resolved: meta-pytorch#3593 Differential Revision: D88297404

…ta-pytorch#3593)

Summary:
metric_module's get_pre_compute_states() provides an API to perform gloo all gathers instead of the default torchmetrics.Metric sync_dist (nccl). However, the mechanism issues a separate gloo all gather for each element in a list of tensors. This can be problematic because:
- Each of AUC's 3 states holds a list of tensors, not a single tensor.
- The size of the tensor list is theoretically unbounded. (In practice, it can grow to the order of 100K elements.)
- gloo all gathers are inherently much slower.

Instead, this patch aims to:
- apply the reduction function prior to the all gather when processing a tensor list
- enforce that the reduction_fn does not rely on ordering

Differential Revision: D88297404
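The difference between the two strategies can be illustrated with a small simulation. This is only a sketch under stated assumptions: plain Python lists stand in for tensors, `FakeGloo` is a hypothetical counter that mimics an all gather across ranks, and `concat` is an assumed example of a reduction function. None of these names come from torchrec or torch.distributed.

```python
from typing import Callable, List

Tensor = List[float]  # stand-in for a 1-D tensor (assumption for this sketch)


class FakeGloo:
    """Hypothetical process group that only counts collective calls."""

    def __init__(self) -> None:
        self.calls = 0

    def all_gather(self, per_rank: List[Tensor]) -> List[Tensor]:
        # One call returns every rank's contribution, in rank order.
        self.calls += 1
        return [list(t) for t in per_rank]


def concat(tensors: List[Tensor]) -> Tensor:
    """Example reduction function: flatten a list of tensors into one."""
    return [x for t in tensors for x in t]


def gather_per_element(group: FakeGloo, states: List[List[Tensor]]) -> Tensor:
    """Previous behaviour: one all gather per element of the tensor list."""
    gathered: List[Tensor] = []
    for i in range(len(states[0])):
        gathered.extend(group.all_gather([rank_state[i] for rank_state in states]))
    return concat(gathered)


def gather_after_reduce(
    group: FakeGloo,
    states: List[List[Tensor]],
    reduction_fn: Callable[[List[Tensor]], Tensor],
) -> Tensor:
    """Patched behaviour: reduce locally first, then a single all gather."""
    locally_reduced = [reduction_fn(rank_state) for rank_state in states]
    return reduction_fn(group.all_gather(locally_reduced))


# Two simulated ranks, each holding a state that is a list of three tensors.
states = [[[1.0], [2.0], [3.0]], [[4.0], [5.0], [6.0]]]

old_group, new_group = FakeGloo(), FakeGloo()
old = gather_per_element(old_group, states)
new = gather_after_reduce(new_group, states, concat)
```

In this sketch the per-element path issues three collective calls and the reduce-first path issues one. The two results hold the same values in a different order, which is why the reduction function must not depend on ordering.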

Summary:
Utilities for CPUOffloadedRecMetricModule and RecMetricModule. Also raise exceptions in the main thread if any of the background threads raises one. Added unit tests.

Simplify the core metric types:
- MetricsResult = Dict[str, MetricValue]: for sync metrics computation
- MetricsFuture = concurrent.futures.Future[MetricsResult]: for async computation
- MetricsOutput = Union[MetricsResult, MetricsFuture]: either a MetricsResult or a MetricsFuture
- The PublishableMetrics variants loosen the constraints on publishing, so that the user can store values other than a Tensor/float.

Introduce a metrics_output_util to handle the logic between futures and dicts. Users can schedule callbacks via `get_metrics_async()`. To get the metrics synchronously, they can use `get_metrics_sync()`.

Introduce a `device` argument to the RecMetricModule constructor. It is a no-op for the standard metric module, but CPUOffloadedRecMetricModule requires it to determine whether to perform GPU to CPU transfers.

Differential Revision: D87110900
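The type aliases and the two helpers described above might look roughly like the following. This is a hedged sketch, not the code in metrics_output_util: the real signatures of `get_metrics_async()` and `get_metrics_sync()` are not shown in this PR, so the parameters below are assumptions, and `MetricValue` is narrowed to `float` to keep the example free of a torch dependency.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Union

MetricValue = float  # assumption: the summary mentions Tensor/float values
MetricsResult = Dict[str, MetricValue]
MetricsFuture = Future[MetricsResult]
MetricsOutput = Union[MetricsResult, MetricsFuture]


def get_metrics_sync(output: MetricsOutput) -> MetricsResult:
    """Return the metrics dict, blocking if the output is a future."""
    if isinstance(output, Future):
        # result() re-raises any exception raised in the background thread.
        return output.result()
    return output


def get_metrics_async(
    output: MetricsOutput, callback: Callable[[MetricsResult], None]
) -> None:
    """Run the callback once the metrics are available."""
    if isinstance(output, Future):
        output.add_done_callback(lambda fut: callback(fut.result()))
    else:
        callback(output)


seen: List[MetricsResult] = []

# A plain dict is handled immediately.
get_metrics_async({"auc": 0.5}, seen.append)

# A future is handled when the background computation completes.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(lambda: {"ne": 0.25})
    get_metrics_async(future, seen.append)
    sync_result = get_metrics_sync(future)
```

Both helpers accept either form of MetricsOutput, so the caller does not need to know whether the module computed the metrics inline or on a background thread.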


meta-cla bot added the CLA Signed label Dec 5, 2025

meta-codesync bot added fb-exported meta-exported labels Dec 5, 2025

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 12, 2026

RecMetricModule: apply reduction function before gloo all gathers (me…

05c1b3d

…ta-pytorch#3593) Summary: Pull Request resolved: meta-pytorch#3593 Differential Revision: D88297404

jeffkbkim force-pushed the export-D88297404 branch from 747258a to 4daa523 Compare January 13, 2026 19:21

jeffkbkim force-pushed the export-D88297404 branch from 4daa523 to 73e575e Compare January 22, 2026 18:48

jeffkbkim force-pushed the export-D88297404 branch from 4daa523 to 73e575e Compare January 22, 2026 18:49

jeffkbkim force-pushed the export-D88297404 branch from 73e575e to a2618d7 Compare January 22, 2026 22:48

jeffkbkim added 2 commits January 23, 2026 11:03

jeffkbkim force-pushed the export-D88297404 branch from a2618d7 to f4a2668 Compare January 23, 2026 19:03
