
Conversation

@jeffkbkim
Contributor
Summary:
CPUOffloadedRecMetricModule currently performs the DtoH (non-blocking) transfers from the main thread. This becomes expensive when the model_out dict grows to thousands of keys, each storing a tensor with 1000+ elements.

Instead of launching the DtoH transfers from the main thread, make the update thread responsible for them. This frees the main thread to continue training.
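The handoff described above can be sketched as follows. This is a minimal illustration, not the actual torchrec implementation: the class and method names are hypothetical, and the real module coordinates more state than shown here.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict

import torch

# Hypothetical sketch: the main thread only records a CUDA event and
# enqueues the GPU-side model_out; a single update thread waits on the
# event and performs the DtoH copies, so training is not blocked on
# thousands of per-key transfers.

class OffloadedMetricUpdater:
    def __init__(self) -> None:
        # One worker thread plays the role of the "update thread".
        self._update_thread = ThreadPoolExecutor(max_workers=1)

    def submit(
        self, model_out: Dict[str, torch.Tensor]
    ) -> "Future[Dict[str, torch.Tensor]]":
        event = None
        if torch.cuda.is_available():
            # The copies must wait for the kernels that produced model_out.
            event = torch.cuda.Event()
            event.record()
        return self._update_thread.submit(self._offload, model_out, event)

    def _offload(self, model_out, event):
        if event is not None:
            event.synchronize()
        # DtoH happens here, on the update thread, not on the main thread.
        return {k: v.to("cpu") for k, v in model_out.items()}
```

The main thread returns from `submit()` immediately; downstream metric updates consume the returned future.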

Differential Revision: D87800947

meta-cla bot added the CLA Signed label Jan 12, 2026
meta-codesync bot commented Jan 12, 2026

@jeffkbkim has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87800947.

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 13, 2026
…#3658)

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 14, 2026
…#3658)

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 22, 2026
…#3658)

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 22, 2026
…#3658)

Summary:

Utilities for CPUOffloadedRecMetricModule and RecMetricModule.

Also raise exceptions in the main thread if any of the background threads fail. Added unit tests.

Simplify the core metric types:
- MetricsResult = Dict[str, MetricValue]: sync metrics computation
- MetricsFuture = concurrent.futures.Future[MetricsResult]: for async computation
- MetricsOutput = Union[MetricsResult, MetricsFuture]: either a synchronous result or a future
- The PublishableMetrics variants loosen the publishing constraints so that users can store values other than a Tensor/float.
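The aliases above can be sketched as below. The `MetricValue` definition is an assumption inferred from the "Tensor/float" wording; the other names follow the summary.

```python
from concurrent.futures import Future
from typing import Dict, Union

import torch

# Sketch of the simplified core metric types described above.
# MetricValue is an assumed alias; torchrec's actual definition may differ.
MetricValue = Union[torch.Tensor, float]
MetricsResult = Dict[str, MetricValue]        # synchronous metrics computation
MetricsFuture = Future                        # holds a MetricsResult when async
MetricsOutput = Union[MetricsResult, MetricsFuture]
```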

Introduce a metrics_output_util to handle the logic between futures and dicts. Users can schedule callbacks via
`get_metrics_async()`. If they want the result synchronously, they can use `get_metrics_sync()`.
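A minimal sketch of what such a utility could look like, assuming the function names from the summary; the bodies here are illustrative, not the actual torchrec implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Union

# Simplified stand-in for the real MetricsResult alias.
MetricsResult = Dict[str, float]
MetricsOutput = Union[MetricsResult, Future]

def get_metrics_async(
    output: MetricsOutput,
    callback: Callable[[MetricsResult], None],
) -> None:
    """Schedule a callback; runs inline if the result is already a dict."""
    if isinstance(output, Future):
        # Invoked on the executor thread once the computation finishes.
        output.add_done_callback(lambda fut: callback(fut.result()))
    else:
        callback(output)

def get_metrics_sync(output: MetricsOutput) -> MetricsResult:
    """Block until the result is available, then return it."""
    return output.result() if isinstance(output, Future) else output
```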

Introduce a `device` argument to the RecMetricModule constructor. It is a no-op for the standard metric module, but CPUOffloadedRecMetricModule requires it to determine whether to perform GPU-to-CPU transfers.

Differential Revision: D87110900
…ta-pytorch#3593)

Summary:

metric_module's get_pre_compute_states() provides an API to perform gloo all-gathers instead of the default torchmetrics.Metric sync_dist (nccl).

However, the mechanism calls a gloo all-gather for each element in a list of tensors. This can be problematic because:
- AUC's 3 state tensors each hold a list of tensors, not a single tensor.
- The size of the tensor list is theoretically unbounded (in practice, it can grow to the order of 100K).
- gloo all-gathers are inherently much slower.

Instead, this patch aims to:
- apply the reduction function prior to the all-gather if we're processing a tensor list
- enforce that the reduction_fn does not rely on ordering
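The two bullets above can be illustrated with a small sketch. Plain floats stand in for tensors, and the gather across ranks is simulated with a Python list; the real code would use a gloo process-group collective, and `gather_reduced_state` is a hypothetical name.

```python
# Hedged sketch of reduction-before-all-gather: when a state is a *list*
# (as with AUC's states), apply the order-insensitive reduction_fn locally
# before gathering, so each rank contributes one reduced value instead of
# an unbounded tensor list.

def gather_reduced_state(per_rank_states, reduction_fn):
    # Local pre-reduction: collapse each rank's list to a single value.
    locally_reduced = [
        reduction_fn(state) if isinstance(state, list) else state
        for state in per_rank_states
    ]
    # Simulated all-gather followed by the final reduction. Because
    # reduction_fn must not depend on ordering, the order in which ranks'
    # contributions arrive is irrelevant.
    return reduction_fn(locally_reduced)
```

With `sum` as the reduction, a rank holding a 100K-element tensor list sends one value over gloo instead of 100K.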

Differential Revision: D88297404
Labels: CLA Signed, fb-exported, meta-exported