Support per-op partitioning in XNNPACK delegate for NHWC ops #8476
- Let's convert this to a discussion.
-
I discussed this a bit with @digantdesai, and his proposed approach is as follows:

I also raised a few other options, such as allowing each partition to flag its desired input and output dim order. Digant's view is that it is much better for partitions to preserve the exact pre-partition semantics, so that the delegated partition is a drop-in replacement. That makes sense to me, but we may still need to give delegates a way to run passes or otherwise mutate the graph outside of partition pre-lowering. @mcr229 Any thoughts or concerns with this approach? Once we have alignment on the approach, I may go ahead and implement this in the near future, as it will be very helpful for memory optimization on models with difficult-to-fix graph breaks.
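To illustrate the "drop-in" property discussed above, here is a minimal numpy sketch (not actual ExecuTorch or XNNPACK code; `run_op_nhwc` and `delegated_partition` are hypothetical names): the partition accepts and returns NCHW tensors exactly like the ops it replaced, while the layout change to NHWC stays an internal detail.

```python
import numpy as np

def run_op_nhwc(x_nhwc):
    # Stand-in for an XNNPACK kernel that computes in NHWC
    # (a trivial elementwise op keeps the sketch self-contained).
    return x_nhwc * 2.0

def delegated_partition(x_nchw):
    # Drop-in semantics: NCHW in, NCHW out, matching the
    # pre-partition graph; the NHWC layout is hidden inside.
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))  # NCHW -> NHWC
    y_nhwc = run_op_nhwc(x_nhwc)
    return np.transpose(y_nhwc, (0, 3, 1, 2))    # NHWC -> NCHW

x = np.arange(24, dtype=np.float32).reshape(1, 2, 3, 4)
y = delegated_partition(x)
assert y.shape == x.shape
np.testing.assert_allclose(y, x * 2.0)
```

The alternative (each partition flagging its preferred boundary dim order) would make the boundary contract depend on the delegate, which is what the drop-in requirement avoids.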
🚀 The feature, motivation and pitch
We currently support per-op partitioning in the XNNPACK delegate, which allows all activation tensor memory to be owned by ExecuTorch and thus overlapped with other ExecuTorch-owned activation memory. However, this isn't currently practical for ops that run in channels-last (NHWC) dim order, because the delegate assumes that partition inputs and outputs are always channels-first (NCHW) and thus inserts dim order conversions around every op. This is a performance issue, but more importantly it means that XNNPACK ends up owning some of the activation memory.
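The overhead described above can be sketched in a few lines of numpy (illustrative only; `op_nhwc` and `single_op_partition` are hypothetical names, not delegate APIs). When every single-op partition converts layout on its own, chaining two NHWC ops performs four transposes where two would suffice, since the NHWC -> NCHW -> NHWC pair between the partitions cancels out:

```python
import numpy as np

def op_nhwc(x):
    # Stand-in for an NHWC kernel.
    return x + 1.0

def single_op_partition(x_nchw):
    # Each per-op partition converts in and out independently.
    x = np.transpose(x_nchw, (0, 2, 3, 1))  # NCHW -> NHWC
    y = op_nhwc(x)
    return np.transpose(y, (0, 3, 1, 2))    # NHWC -> NCHW

x = np.zeros((1, 2, 3, 4), dtype=np.float32)
# Two chained per-op partitions: the back-to-back conversion pair
# between them is pure overhead, and the intermediate NHWC buffers
# live inside the delegate rather than in ExecuTorch-planned memory.
y = single_op_partition(single_op_partition(x))
np.testing.assert_allclose(y, x + 2.0)
```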
Ideally, we can leverage the recent dim order support in the core runtime to let the framework manage the dim order conversion, at least in single-op mode. How this interacts with partitioning is not entirely clear, since the conversion would have to happen after partitioning. Initially, it is likely fine for the dim order conversions not to be delegated. This needs a bit more design discussion, but it is a high-ROI feature and may be necessary for memory parity with LI in some cases, even with workspace sharing.
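For intuition on what a framework-managed dim order conversion does, here is a numpy sketch (not ExecuTorch code): the logical NCHW view of the tensor is preserved while the underlying memory is reordered to channels-last, which is the kind of transformation the framework could own outside the delegated partition.

```python
import numpy as np

x_nchw = np.arange(24, dtype=np.float32).reshape(1, 2, 3, 4)

# Materialize channels-last memory: permute to NHWC, make it
# contiguous, then view it back as NCHW via the inverse permutation.
x_cl = np.ascontiguousarray(np.transpose(x_nchw, (0, 2, 3, 1)))
x_view = np.transpose(x_cl, (0, 3, 1, 2))  # logical NCHW view over NHWC memory

np.testing.assert_array_equal(x_view, x_nchw)  # same logical values
assert x_view.strides != x_nchw.strides        # different physical layout
```

If such conversions stay outside the partition, their buffers remain visible to the ExecuTorch memory planner, which is the point of the proposal.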
Alternatives
No response
Additional context
No response
RFC (Optional)
No response
cc @digantdesai @mcr229