Load Weights Once #1015
base: main
Conversation
Function correctly runs once. Created GraphModule loads weights properly. TODO: get wrapper function to successfully call generated GraphModule.
TODO: clean up before PR
Based on a quick test, this lowers 5th-iteration MNIST batch 1 from 5.29 ms to 3.32 ms. I have not tested BERT, data parallel, or other models.
With this change, BERT batch 8 hits 23.3 sen/sec locally.
TODO: clean up unused code
TODO: address any feedback
I'm sure the code in here could use some additional cleaning, but at this point I'm a bit too deep in the weeds, so I'm sure I'm missing things. Please point out areas where the code could be more readable.
run in one command
Looks good to me, but I would wait for Kevin or Artem to check it!
Update comment for graph module analysis pass
LGTM! I'm glad the analysis pass for determining training/eval works well enough.
i = 0

for node in nodes:
    args = node.args
I know args and kwargs are iterated differently, but the rest of the code seems like it could be combined. Let me know if it's more difficult than it looks.
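One way the two loops could be combined, as the reviewer suggests: a small generator that yields every input site of a node, keyed by position for args and by name for kwargs, so a single alignment loop handles both. This is an illustrative sketch, not the PR's code; `iter_input_sites` is a hypothetical helper, and `SiteType` here is a stand-in for the real enum in the backend.

```python
from enum import Enum


class SiteType(Enum):
    # Stand-in for the backend's real SiteType enum
    ARGS = 1
    KWARGS = 2


def iter_input_sites(node):
    """Yield (site_type, key, value) for every input site of a node.

    Positional args are keyed by index, keyword args by name, so the
    caller can run one alignment loop instead of two near-duplicates.
    """
    for idx, arg in enumerate(node.args):
        yield SiteType.ARGS, idx, arg
    for key, arg in node.kwargs.items():
        yield SiteType.KWARGS, key, arg
```

The caller would then do `for site_type, key, arg in iter_input_sites(node): i += node_input_aligner.align(node, arg, key, site_type, ...)`, assuming `align` already branches on the site type internally.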
I'm not entirely sure what change you're asking for here. Can you clarify?
@@ -270,6 +271,8 @@ def __init__(self, graph, device):
        self.graph = graph
        self.device = device
        self.aligned_node_dict = {}
Can you please leave a comment on what this member is about?
This passes consistently in the
class ModelType(Enum):
    """Enumeration of compiled model inputs.

    It is expected that PARAMETER and BUFFER tensors do not change between inference runs, but ARGUMENT tensors may change.

    :param PARAMETER: Tensor tracked by optimizer during training. Represents weights and biases.
    :param BUFFER: Tensor not updated during training but still part of model state. Normally used for non-trainable data like fixed weights or running statistics in batch norm.
    :param ARGUMENT: Tensors not tracked by model and not persistent between calls. Can be input data or configuration options to module's methods.
    """

    INFERENCE = 1
    TRAIN_FORWARD = 2
    TRAIN_BACKWARD = 3
I think the docs are from another enum? From PrimalTag?
Thanks for catching this!
aten_backward_ops = {
    torch.ops.aten._adaptive_avg_pool2d_backward.default,
    torch.ops.aten.avg_pool2d_backward.default,
    torch.ops.aten.convolution_backward.default,
    torch.ops.aten.embedding_dense_backward.default,
    torch.ops.aten.max_pool2d_with_indices_backward.default,
    torch.ops.aten.native_group_norm_backward.default,
    torch.ops.aten.native_layer_norm_backward.default,
}
Are you sure this list will always determine that a graph is train backward? It feels too small.
I agree that it seems too small, but it's based on the list of backward ops from the aten core IR. I will add a comment with a link
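The detection being discussed can be sketched as a membership check over the graph's `call_function` nodes: if any node targets an op from the backward-op set, the graph is classified as a training backward graph. This is an illustrative sketch, not the PR's exact code; the string targets in the test stand in for the real `torch.ops.aten.*` op objects.

```python
def is_train_backward(gm, backward_ops):
    """Return True if any call_function node in the graph targets one of
    the backward ops (per the aten core IR list referenced above)."""
    return any(
        node.op == "call_function" and node.target in backward_ops
        for node in gm.graph.nodes
    )
```

A caveat the review raises: any backward op missing from the set makes a backward graph silently classify as something else, so the set must track the aten core IR list.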
    return False


def is_train_forward(gm):
Is it documented somewhere in the torch docs that train forward will return inputs? If not, this might change in the future.
I don't know if it's documented anywhere, but inputs are used in the backward pass, so this should be reliable. I can add a comment that notes this assumption, though.
with self.graph.inserting_before(first_node):
    self.marshaled_node_dict[data_move_spec] = self.input_idx
    self.input_idx += 1
Why is this block wrapped with `with self.graph.inserting_before`? You aren't inserting a node here, or am I missing something?
Good point, earlier in the change I was adding the nodes here, but then removed that code. Will update
run_once_count = 0
run_once_ans = tuple()


@torch.fx.wrap
def run_once(*args):
(nit) using closure or @lru_cache might be a bit better than global variable.
I really like the lru_cache here, but it hits an error with the kwargs inputs being unhashable since it's a dict. The closure path seems less readable to me in this case. I agree that the global variable isn't the best, but I'm leaning towards keeping it unless you feel strongly
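The trade-off in this exchange can be demonstrated concretely: `lru_cache` requires hashable arguments, so a dict of kwargs breaks it, while the global-variable run-once pattern side-steps hashing entirely. This sketch omits the `@torch.fx.wrap` decorator and uses a trivial body in place of the real weight loading.

```python
import functools


@functools.lru_cache(maxsize=None)
def cached_load(*args):
    # lru_cache hashes its arguments; passing a dict raises TypeError.
    return args


# The global-variable pattern kept in the PR: run the body once,
# replay the saved answer on every later call. No hashing involved.
run_once_count = 0
run_once_ans = tuple()


def run_once(*args):
    global run_once_count, run_once_ans
    if run_once_count == 0:
        run_once_ans = args  # stand-in for loading weights on device
        run_once_count += 1
    return run_once_ans
```

A closure over `run_once_count`/`run_once_ans` would avoid the module-level globals at the cost of an extra factory function, which matches the readability concern raised above.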
            )
        else:
-           i += node_input_aligner.align(node, arg, key, SiteType.KWARGS, first_node)
+           i += node_input_aligner.align(node, arg, key, SiteType.KWARGS, first_node, ttnn_inputs)

    modified = i > 0
(nit) Maybe use `modified = modified or node_input_aligner.align(...)` for readability purposes? Not sure it's better, but the variable `i` doesn't tell much about its purpose.
Looks good, just a few small comments.
passes in Run Tests
This change loads weights only once by adding a wrapper function with logic to load weights on device and use the cached tensors for future iterations.
Cached tensors should be deallocated when the calling function ends. Since Native Device should have the same behavior, this is ok for now. If we start running into issues with running out of DRAM, we may need to be more strategic about which tensors are cached on device.
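A minimal sketch of the load-once caching described above, with hypothetical names throughout: `load_weights_once` and `to_device` are illustrative stand-ins, not the PR's API, and a plain dict stands in for however the backend actually tracks device tensors.

```python
_device_cache = {}  # parameter name -> tensor already moved to device


def load_weights_once(name, host_tensor, to_device):
    """Move a parameter to device on first use; reuse the cached copy after.

    `to_device` stands in for the real host-to-device transfer. Entries
    live until the cache is cleared, mirroring the note that cached
    tensors are deallocated when the calling function ends.
    """
    if name not in _device_cache:
        _device_cache[name] = to_device(host_tensor)
    return _device_cache[name]
```

On the second and later iterations the transfer is skipped, which is where the measured speedup comes from; the DRAM concern above is exactly that this cache grows with the number of distinct weights.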
Note that this PR does not enable cached model parameters for training runs, only inference. This is accomplished through an additional GraphModuleAnalysisPass. This pass looks for characteristics that are typical of backward computation graphs (calls to backward aten ops) and forward training graphs (outputting one of the inputs unchanged), and marks everything else as forward inference. This seems fragile, but the normal torch methods of determining whether a model is in training or eval mode no longer work by the time the GraphModule reaches our backend.
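The classification order described in this paragraph could be sketched as follows; this is an illustrative outline, not the pass's actual code, and the string op targets in the test stand in for real `torch.ops.aten.*` objects.

```python
from enum import Enum


class ModelType(Enum):
    INFERENCE = 1
    TRAIN_FORWARD = 2
    TRAIN_BACKWARD = 3


def classify(gm, backward_ops):
    """Order matters: backward aten ops are checked first, then the
    returns-an-input heuristic; everything else is forward inference."""
    nodes = list(gm.graph.nodes)
    if any(n.op == "call_function" and n.target in backward_ops for n in nodes):
        return ModelType.TRAIN_BACKWARD
    placeholders = {n for n in nodes if n.op == "placeholder"}
    for n in nodes:
        if n.op == "output" and any(a in placeholders for a in n.args[0]):
            return ModelType.TRAIN_FORWARD
    return ModelType.INFERENCE
```

The fragility noted above is visible here: an unlisted backward op or a forward graph that happens not to return an input would fall through to `INFERENCE` and wrongly enable parameter caching.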
This PR also fixes an issue where inputs for data parallel were not hoisted to the top of the forward function.
This PR also opens up the opportunity for preprocessing model weights. This is needed for convolution ops to be able to run metal trace, and will help speed up several other functions once implemented. Preprocessing will be addressed in a separate PR.
Local tests indicate the following performance improvements:
MNIST Batch 1, 5th iteration changes 5.29 ms -> 3.32 ms
BERT Batch 8, 5th iteration now hits 23.3 sen/sec (batch 1 on main today hits ~1.7 sen/sec, not sure about batch 8)