
Commit 78485d5

styusuf authored and facebook-github-bot committed
Reduce GPU OOM in layer gradient computation by offloading tensors to CPU
Summary: A number of LayerGradientXActivation jobs are failing with GPU OOM errors. The errors stem from how GPU memory is consumed during the forward and backward passes that produce layer evaluations, copies of those evaluations, and layer gradients:

* After the forward pass, all activation tensors are gathered across devices onto [one device](https://www.internalfb.com/code/fbsource/[0579e5aab76b1f89fc82d27913cbb3a6e0160b5f]/fbcode/pytorch/captum/captum/_utils/common.py?lines=785), so that device holds its own layer activations plus a copy of the activations from every other device. This is the peak of memory utilization, on top of the model graph and the original layer activations already held in memory.
* After the backward pass on `saved_layer` (still on GPU), the gradients are collected the same way onto the first device. At that point memory pressure is lower, since the backward pass has already completed, so it is safe to collect all gradients first and offload them to CPU afterwards.

This change offloads the layer activations to CPU during the period of peak GPU utilization: before running the expensive torch.cat, all activation tensors are moved to CPU, so the concatenation happens on CPU and GPU memory is freed. For the gradients, the tensors are offloaded to CPU after they have been gathered. This simple change significantly reduces peak GPU memory usage, and it is gated behind a flag that enables the efficient path.

Adds a `memory_efficient` mode to `LayerGradientXActivation` and an `offload_to_cpu` parameter to `compute_layer_gradients_and_eval` to reduce peak GPU memory usage during multi-layer attribution.

Differential Revision: D94915367
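The offload-before-reduction idea described in the summary can be sketched in plain PyTorch. This is an illustration only, not Captum's actual `_reduce_list`; the name `reduce_with_offload` and the single-tensor-per-device simplification are hypothetical:

```python
import torch

def reduce_with_offload(per_device_outputs, offload_to_cpu=True):
    """Concatenate per-device activation shards.

    Minimal sketch of the commit's idea: detach and move each shard to CPU
    *before* the expensive torch.cat, so the concatenated copy never
    occupies GPU memory. (Not Captum's real implementation.)
    """
    if offload_to_cpu:
        per_device_outputs = [t.detach().cpu() for t in per_device_outputs]
    # torch.cat now runs on CPU when offloading is enabled.
    return torch.cat(per_device_outputs, dim=0)

# Simulated shards; in the real setting each would live on its own GPU.
shards = [torch.randn(4, 8) for _ in range(3)]
merged = reduce_with_offload(shards)
print(merged.shape, merged.device)  # torch.Size([12, 8]) cpu
```

Because the CPU copies are detached, they carry no autograd history, which is what makes the early offload safe.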
1 parent f48fa42 commit 78485d5

File tree

2 files changed: +58 −24 lines changed

captum/_utils/gradient.py

Lines changed: 38 additions & 18 deletions

```diff
@@ -555,6 +555,7 @@ def compute_layer_gradients_and_eval(
     # pyre-fixme[24]: Generic type `Callable` expects 2 type parameters.
     output_fn: Union[None, Callable] = None,
     grad_kwargs: Optional[Dict[str, Any]] = None,
+    offload_to_cpu: bool = False,
 ) -> Tuple[Tuple[Tensor, ...], Tuple[Tensor, ...], Tuple[Tensor, ...]]: ...


@@ -574,6 +575,7 @@ def compute_layer_gradients_and_eval(
     # pyre-fixme[24]: Generic type `Callable` expects 2 type parameters.
     output_fn: Union[None, Callable] = None,
     grad_kwargs: Optional[Dict[str, Any]] = None,
+    offload_to_cpu: bool = False,
 ) -> Tuple[List[Tuple[Tensor, ...]], List[Tuple[Tensor, ...]]]: ...


@@ -593,6 +595,7 @@ def compute_layer_gradients_and_eval(
     # pyre-fixme[24]: Generic type `Callable` expects 2 type parameters.
     output_fn: Union[None, Callable] = None,
     grad_kwargs: Optional[Dict[str, Any]] = None,
+    offload_to_cpu: bool = False,
 ) -> Tuple[Tuple[Tensor, ...], Tuple[Tensor, ...]]: ...


@@ -612,6 +615,7 @@ def compute_layer_gradients_and_eval(
     # pyre-fixme[24]: Generic type `Callable` expects 2 type parameters.
     output_fn: Union[None, Callable] = None,
     grad_kwargs: Optional[Dict[str, Any]] = None,
+    offload_to_cpu: bool = False,
 ) -> Union[
     Tuple[Tuple[Tensor, ...], Tuple[Tensor, ...]],
     Tuple[Tuple[Tensor, ...], Tuple[Tensor, ...], Tuple[Tensor, ...]],
@@ -669,6 +673,7 @@ def compute_layer_gradients_and_eval(
     with torch.autograd.set_grad_enabled(True):
         # saved_layer is a dictionary mapping device to a tuple of
         # layer evaluations on that device.
+        saved_layer: Dict[Module, Dict[device, Tuple[Tensor, ...]]]
         saved_layer, output = _forward_layer_distributed_eval(
             forward_fn,
             inputs,
@@ -693,35 +698,44 @@ def compute_layer_gradients_and_eval(
             list(next(iter(saved_layer.values())).keys()), device_ids
         )
         all_outputs: Union[Tuple[Tensor, ...], List[Tuple[Tensor, ...]]]
+
+        def _get_layer_output(
+            single_layer: Module, device_id: device
+        ) -> Tuple[Tensor, ...]:
+            layer_out = saved_layer[single_layer][device_id]
+            if output_fn is not None:
+                layer_out = output_fn(layer_out)
+            # When offloading to CPU, move tensors before reduction (torch.cat)
+            # to avoid GPU OOM. This is safe because all_outputs is not used by
+            # torch.autograd.grad, which reads from saved_layer directly.
+            if offload_to_cpu:
+                layer_out = tuple(t.detach().cpu() for t in layer_out)
+            return layer_out
+
+        # pyre-fixme[9]: all_layers has type `List[Module]`; used as
+        # `Union[List[Variable[ModuleOrModuleList <: [Module, List[Module]]]],
+        # Variable[ModuleOrModuleList <: [Module, List[Module]]]]`.
+        all_layers: List[Module] = [layer] if isinstance(layer, Module) else layer
+
+        # Build all_outputs before backward pass. _get_layer_output detaches
+        # and moves tensors to CPU when offload_to_cpu is set, so these copies
+        # do not participate in the autograd graph and won't affect GPU memory
+        # during torch.autograd.grad (which reads from saved_layer directly).
         if isinstance(layer, Module):
             all_outputs = _reduce_list(
-                [
-                    (
-                        saved_layer[layer][device_id]
-                        if output_fn is None
-                        else output_fn(saved_layer[layer][device_id])
-                    )
-                    for device_id in key_list
-                ]
+                [_get_layer_output(layer, device_id) for device_id in key_list]
             )
         else:
             all_outputs = [
                 _reduce_list(
                     [
-                        (
-                            saved_layer[single_layer][device_id]
-                            if output_fn is None
-                            else output_fn(saved_layer[single_layer][device_id])
-                        )
+                        _get_layer_output(single_layer, device_id)
                         for device_id in key_list
                     ]
                 )
                 for single_layer in layer
             ]
-        # pyre-fixme[9]: all_layers has type `List[Module]`; used as
-        # `Union[List[Variable[ModuleOrModuleList <: [Module, List[Module]]]],
-        # Variable[ModuleOrModuleList <: [Module, List[Module]]]]`.
-        all_layers: List[Module] = [layer] if isinstance(layer, Module) else layer
+
         grad_inputs = tuple(
             layer_tensor
             for single_layer in all_layers
@@ -750,7 +764,13 @@ def compute_layer_gradients_and_eval(
                     output_fn(curr_saved_grad) for curr_saved_grad in curr_saved_grads
                 ]

-                all_grads.append(_reduce_list(curr_saved_grads))
+                reduced = _reduce_list(curr_saved_grads)
+                # When offloading to CPU, move gradient tensors after reduction
+                # (torch.cat) since reducing on GPU first is slightly more
+                # memory-efficient than moving individual tensors before reduction.
+                if offload_to_cpu:
+                    reduced = tuple(t.cpu() for t in reduced)
+                all_grads.append(reduced)

         layer_grads: Union[Tuple[Tensor, ...], List[Tuple[Tensor, ...]]]
         layer_grads = all_grads
```
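The comment in `_get_layer_output` above rests on one autograd fact: `torch.autograd.grad` reads from the graph-connected tensors in `saved_layer`, not from the detached CPU copies in `all_outputs`. A minimal sketch of that fact (illustrative names, not Captum code):

```python
import torch

# A leaf and an intermediate tensor standing in for a saved layer activation.
x = torch.randn(5, requires_grad=True)
activation = x * 3.0

# The detached CPU copy is what would be returned to the caller; it carries
# no autograd history and does not affect the graph.
offloaded = activation.detach().cpu()

# autograd still differentiates through `activation` itself, so the backward
# pass is unaffected by the offloaded copy.
(grad,) = torch.autograd.grad(activation.sum(), x)
print(grad)  # a tensor of 3.0s, regardless of the offload
```

This is why activations can be moved to CPU before the backward pass, while gradients (which only exist after the backward pass) are moved afterwards.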

captum/attr/_core/layer/layer_gradient_x_activation.py

Lines changed: 20 additions & 6 deletions

```diff
@@ -28,6 +28,7 @@ def __init__(
         layer: ModuleOrModuleList,
         device_ids: Union[None, List[int]] = None,
         multiply_by_inputs: bool = True,
+        memory_efficient: bool = False,
     ) -> None:
         r"""
         Args:
@@ -62,10 +63,18 @@ def __init__(
                         is set to True, final sensitivity scores are being multiplied by
                         layer activations for inputs.

+            memory_efficient (bool, optional): If True, offloads intermediate
+                        activations and gradients to CPU during computation and
+                        uses in-place multiplication to reduce peak GPU memory
+                        usage. Useful when attributing across multiple target
+                        layers simultaneously.
+                        Default: False
+
         """
         LayerAttribution.__init__(self, forward_func, layer, device_ids)
         GradientAttribution.__init__(self, forward_func)
         self._multiply_by_inputs = multiply_by_inputs
+        self._memory_efficient = memory_efficient

     @property
     def multiplies_by_inputs(self) -> bool:
@@ -181,6 +190,7 @@ def attribute(
             device_ids=self.device_ids,
             attribute_to_layer_input=attribute_to_layer_input,
             grad_kwargs=grad_kwargs,
+            offload_to_cpu=self._memory_efficient,
         )
         if isinstance(self.layer, Module):
             return _format_output(
@@ -191,13 +201,17 @@ def attribute(
                 ),
             )
         else:
-            return [
-                _format_output(
-                    len(layer_evals[i]) > 1,
-                    self.multiply_gradient_acts(layer_gradients[i], layer_evals[i]),
+            results = []
+            for i in range(len(self.layer)):
+                grads_i = layer_gradients[i]
+                evals_i = layer_evals[i]
+                results.append(
+                    _format_output(
+                        len(evals_i) > 1,
+                        self.multiply_gradient_acts(grads_i, evals_i),
+                    )
                 )
-                for i in range(len(self.layer))
-            ]
+            return results

     def multiply_gradient_acts(
         self, gradients: Tuple[Tensor, ...], evals: Tuple[Tensor, ...]
```
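The gradient-times-activation computation that `multiply_gradient_acts` performs can be reproduced end to end in plain PyTorch with a forward hook, without the Captum wrapper. This is a hedged sketch on a toy model; the hook-based capture here is illustrative and not Captum's internal mechanism:

```python
import torch
import torch.nn as nn

# Toy model; attribute with respect to the first layer's output.
model = nn.Sequential(nn.Linear(10, 6), nn.ReLU(), nn.Linear(6, 2))

saved = {}
handle = model[0].register_forward_hook(
    lambda mod, inp, out: saved.__setitem__("act", out)
)
x = torch.randn(4, 10)
out = model(x)
handle.remove()

# Gradient of the target output w.r.t. the captured intermediate activation,
# then the elementwise product that gives the attribution.
act = saved["act"]
(grad,) = torch.autograd.grad(out[:, 0].sum(), act)
attribution = grad * act.detach()
# With memory_efficient=True, grad and act would have been offloaded to CPU
# before this multiplication; the result is identical, only peak GPU memory
# during the computation differs.
print(attribution.shape)  # torch.Size([4, 6])
```

This mirrors what `attribute` does per layer in the list branch above: one gradient tuple and one evaluation tuple per target layer, multiplied elementwise.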
