
Conversation

@jeffkbkim
Contributor
Summary:
CPUOffloadedRecMetricModule currently performs the DtoH (non-blocking) transfers from the main thread. This becomes expensive when the model_out dict grows to thousands of keys, each storing a tensor with 1000+ elements.

Instead of launching the DtoH transfers from the main thread, make the update thread responsible for them. This frees the main thread to continue training.
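The handoff described above can be sketched as follows. This is a minimal illustration, not the actual torchrec implementation: the class and method names are hypothetical, and the real module coordinates more state than shown here.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict

import torch

# Hypothetical sketch: the main thread only records a CUDA event and
# enqueues the GPU-side model_out; a single update thread waits on the
# event and performs the DtoH copies, so training is not blocked on
# thousands of per-key transfers.

class OffloadedMetricUpdater:
    def __init__(self) -> None:
        # One worker thread plays the role of the "update thread".
        self._update_thread = ThreadPoolExecutor(max_workers=1)

    def submit(
        self, model_out: Dict[str, torch.Tensor]
    ) -> "Future[Dict[str, torch.Tensor]]":
        event = None
        if torch.cuda.is_available():
            # The copies must wait for the kernels that produced model_out.
            event = torch.cuda.Event()
            event.record()
        return self._update_thread.submit(self._offload, model_out, event)

    def _offload(self, model_out, event):
        if event is not None:
            event.synchronize()
        # DtoH happens here, on the update thread, not on the main thread.
        return {k: v.to("cpu") for k, v in model_out.items()}
```

The main thread returns from `submit()` immediately; downstream metric updates consume the returned future.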

Differential Revision: D87800947

meta-cla bot added the CLA Signed label Jan 12, 2026
meta-codesync bot commented Jan 12, 2026

@jeffkbkim has exported this pull request. If you are a Meta employee, you can view the originating Diff in D87800947.

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 13, 2026
…#3658)

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 14, 2026
…#3658)

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 22, 2026
…#3658)

jeffkbkim added a commit to jeffkbkim/torchrec that referenced this pull request Jan 22, 2026
…#3658)

Summary:

Utilities for CPUOffloadedRecMetricModule and RecMetricModule.

Also raise exceptions in the main thread if any of the background threads fail. Added unit tests.

Simplify the core metric types:
- MetricsResult = Dict[str, MetricValue]: sync metrics computation
- MetricsFuture = concurrent.futures.Future[MetricsResult]: for async computation
- MetricsOutput = Union[MetricsResult, MetricsFuture]: either a synchronous result or a future
- The PublishableMetrics variants loosen the publishing constraints so that users can store values other than a Tensor/float.
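The aliases above can be sketched as below. The `MetricValue` definition is an assumption inferred from the "Tensor/float" wording; the other names follow the summary.

```python
from concurrent.futures import Future
from typing import Dict, Union

import torch

# Sketch of the simplified core metric types described above.
# MetricValue is an assumed alias; torchrec's actual definition may differ.
MetricValue = Union[torch.Tensor, float]
MetricsResult = Dict[str, MetricValue]        # synchronous metrics computation
MetricsFuture = Future                        # holds a MetricsResult when async
MetricsOutput = Union[MetricsResult, MetricsFuture]
```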

Introduce a metrics_output_util to handle the logic between futures and dicts. Users can schedule callbacks via
`get_metrics_async()`. If they want the result synchronously, they can use `get_metrics_sync()`.
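A minimal sketch of what such a utility could look like, assuming the function names from the summary; the bodies here are illustrative, not the actual torchrec implementation.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Union

# Simplified stand-in for the real MetricsResult alias.
MetricsResult = Dict[str, float]
MetricsOutput = Union[MetricsResult, Future]

def get_metrics_async(
    output: MetricsOutput,
    callback: Callable[[MetricsResult], None],
) -> None:
    """Schedule a callback; runs inline if the result is already a dict."""
    if isinstance(output, Future):
        # Invoked on the executor thread once the computation finishes.
        output.add_done_callback(lambda fut: callback(fut.result()))
    else:
        callback(output)

def get_metrics_sync(output: MetricsOutput) -> MetricsResult:
    """Block until the result is available, then return it."""
    return output.result() if isinstance(output, Future) else output
```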

Introduce a `device` argument to the RecMetricModule constructor. It is a no-op for the standard metric module, but CPUOffloadedRecMetricModule requires it to determine whether to perform GPU-to-CPU transfers.

Differential Revision: D87110900
…ta-pytorch#3593)

Summary:

metric_module's get_pre_compute_states() provides an API to perform gloo all-gathers instead of the default torchmetrics.Metric sync_dist (nccl).

However, the mechanism calls a gloo all-gather for each element in a list of tensors. This can be problematic because:
- AUC's 3 state tensors each hold a list of tensors, not a single tensor.
- The size of the tensor list is theoretically unbounded (in practice, it can grow to the order of 100K).
- gloo all-gathers are inherently much slower.

Instead, this patch aims to:
- apply the reduction function prior to the all-gather if we're processing a tensor list
- enforce that the reduction_fn does not rely on ordering
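The two bullets above can be illustrated with a small sketch. Plain floats stand in for tensors, and the gather across ranks is simulated with a Python list; the real code would use a gloo process-group collective, and `gather_reduced_state` is a hypothetical name.

```python
# Hedged sketch of reduction-before-all-gather: when a state is a *list*
# (as with AUC's states), apply the order-insensitive reduction_fn locally
# before gathering, so each rank contributes one reduced value instead of
# an unbounded tensor list.

def gather_reduced_state(per_rank_states, reduction_fn):
    # Local pre-reduction: collapse each rank's list to a single value.
    locally_reduced = [
        reduction_fn(state) if isinstance(state, list) else state
        for state in per_rank_states
    ]
    # Simulated all-gather followed by the final reduction. Because
    # reduction_fn must not depend on ordering, the order in which ranks'
    # contributions arrive is irrelevant.
    return reduction_fn(locally_reduced)
```

With `sum` as the reduction, a rank holding a 100K-element tensor list sends one value over gloo instead of 100K.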

Differential Revision: D88297404
Labels: CLA Signed, fb-exported, meta-exported