Flip default tensor retention on graph inputs and outputs from False to True in MLIR runtime #5465
Replies: 4 comments
-
(current) retain=False on graph inputs

Retain is currently default-set to false on inputs, because we want to rely on the ttnn deallocator to auto-deallocate inputs once they are no longer used, to prevent device OOM. tt-xla currently bypasses this logic and sets retain=True on all input tensors to prevent ttnn from eagerly deallocating them. This is because:
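To make the retain=False behaviour concrete, here is a small illustrative model (not the real tt-mlir or ttnn API; `DeviceTensor`, `run_graph`, and the last-use tracking are hypothetical stand-ins) of a runtime that auto-deallocates non-retained inputs right after their last use, which is what keeps the device from OOMing:

```python
class DeviceTensor:
    """Toy stand-in for a device tensor with a retain flag."""
    def __init__(self, name, retain=False):
        self.name = name
        self.retain = retain
        self.allocated = True

    def deallocate(self):
        self.allocated = False

def run_graph(ops, inputs):
    """ops: list of (op_name, [input tensor names]) in execution order."""
    # Find the index of the last op that reads each input.
    last_use = {}
    for i, (_, used) in enumerate(ops):
        for name in used:
            last_use[name] = i
    for i, (_, used) in enumerate(ops):
        # ... execute the op ...
        for name in used:
            t = inputs[name]
            # retain=False: free the input right after its final use.
            if last_use[name] == i and not t.retain:
                t.deallocate()

a = DeviceTensor("a", retain=False)
b = DeviceTensor("b", retain=True)   # e.g. tt-xla forcing retention
run_graph([("matmul", ["a", "b"]), ("relu", ["b"])], {"a": a, "b": b})
assert not a.allocated   # auto-freed after its last use
assert b.allocated       # retained; the frontend must manage it
```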
-
(change) retain=True on graph inputs

This would allow the FE to not have to call
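With retain=True as the default on inputs, the runtime never frees them, so responsibility for deallocation moves entirely to the frontend, typically tied to the lifetime of its buffer wrapper. A hypothetical sketch of that ownership model (`BufferInstance` here is a toy stand-in, not the actual tt-xla class):

```python
class DeviceTensor:
    """Toy device tensor; with retain=True the runtime leaves it alive."""
    def __init__(self):
        self.allocated = True

class BufferInstance:
    """Frontend-side owner of a device tensor (hypothetical)."""
    def __init__(self, tensor):
        self.tensor = tensor

    def destroy(self):
        # Frontend-driven deallocation replaces the runtime's
        # auto-deallocation pass entirely.
        if self.tensor.allocated:
            self.tensor.allocated = False

t = DeviceTensor()
buf = BufferInstance(t)
# ... submit graphs using t; the runtime leaves it alive (retain=True) ...
assert t.allocated
buf.destroy()
assert not t.allocated
```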
-
(current) retain=False on graph outputs

Previously, all FEs immediately deallocated all output tensors after submit. For tt-xla, this was because we eagerly transferred output tensors toHost, which is not performant but was a workaround for a separate issue. This behaviour will be removed in tenstorrent/tt-xla#1657. Now, calls to

This also means that tensor transfers toHost from the frontend are unbuffered on host and repeatedly toHost()'ed on demand. These could be buffered on host, but some mutation-tracking infrastructure may be required in the frontend. Some outputs (like static caches) must participate in compute graphs as both input and output, but should not be returned to host.
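The host-side buffering plus mutation tracking mentioned above can be sketched as follows. This is a toy model under assumptions: `to_host()` stands in for a device-to-host transfer, and the version counter is a hypothetical piece of the frontend infra, not an existing API:

```python
class DeviceTensor:
    def __init__(self, data):
        self._data = list(data)
        self.version = 0       # bumped on every device-side write
        self.transfers = 0     # counts device->host copies

    def write(self, data):
        self._data = list(data)
        self.version += 1      # mutation tracking

    def to_host(self):
        self.transfers += 1    # stands in for an actual transfer
        return list(self._data)

class HostCache:
    """Frontend-side buffer that avoids repeated toHost transfers."""
    def __init__(self, tensor):
        self.tensor = tensor
        self._cached = None
        self._cached_version = -1

    def read(self):
        # Only re-transfer when the device copy has mutated.
        if self._cached_version != self.tensor.version:
            self._cached = self.tensor.to_host()
            self._cached_version = self.tensor.version
        return self._cached

t = DeviceTensor([1, 2, 3])
cache = HostCache(t)
cache.read(); cache.read()
assert t.transfers == 1   # buffered: one transfer for two reads
t.write([4, 5, 6])        # e.g. a static cache updated by a later graph
cache.read()
assert t.transfers == 2   # mutation invalidated the cached host copy
```

Without the cache, every frontend read would repeat the transfer; the version counter is exactly the mutation-tracking piece the comment says would be required.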
-
(change) retain=True on graph outputs

tt-xla has no logic to intelligently deallocate output tensors, except when BufferInstances are destructed. Although we don't set retain=True on graph outputs (so their retention is false), they are also not auto-deallocated by tt-xla, since the deallocator only runs when a tensor is used as an input to another graph. However, if an output tensor is reused in another graph, its retain flag is set to true for the reasons detailed in the previous comment, so it won't get auto-deallocated by tt-xla either.
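The two escape paths described above can be shown in a toy model (hypothetical `submit` and `feed_back` helpers, not the real runtime API): the auto-deallocator only ever inspects graph inputs, so a non-retained output is never freed unless fed back in, and tt-xla flips retain to true on anything it does feed back, so the tensor escapes the deallocator then too:

```python
class DeviceTensor:
    def __init__(self, retain=False):
        self.retain = retain
        self.allocated = True

def submit(inputs, num_outputs):
    """Toy graph submission: outputs are created with retain=False,
    and only non-retained *inputs* are auto-deallocated after use."""
    outputs = [DeviceTensor(retain=False) for _ in range(num_outputs)]
    for t in inputs:
        if not t.retain:
            t.allocated = False
    return outputs

def feed_back(tensor):
    # tt-xla-style: force retention on any tensor reused as an input.
    tensor.retain = True
    return tensor

(out,) = submit([], num_outputs=1)
assert out.allocated        # never an input, so never auto-freed

reused = feed_back(out)
submit([reused], num_outputs=0)
assert reused.allocated     # retained, so not auto-freed either
```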
-
There are 4 possibilities here:

(set retain = true, set retain = false) x (on input tensors, on output tensors)

and I will try to detail the implications for the frontends and the historical reasons for the current default false retention on both, based on expertise from @jnie-TT and @pilkicTT.