
Add core performance for 4-level tiling pipeline #1177


Closed
wants to merge 2 commits into from

Conversation

newling
Contributor

@newling newling commented Mar 11, 2025

The existing test uses the pack-peel pipeline and results in a function call to a matmul with m=n=64 and k=32, different from the function call produced by the pack-peel-4-level-tiling pipeline, which has m=n=64 and k=64.

Core performance with k=64 is better. However, performance with k=32 plus unroll-and-jam is better still (see #1167). Unfortunately, k=64 with unroll-and-jam results in using the stack / stack overflow -- either very slow execution or a compilation failure.

So I guess the options are

  1. Decompose the linalg.generic with the 64x64x64 matmul just before function outlining into 2 (or maybe 4) smaller linalg.generics, and then outline that smaller generic. Pros: (1) can unroll-and-jam and get the best currently observed performance; (2) even better PM footprint, as the outlined function is smaller. Cons: (1) complexity; (2) potentially dodges the more serious problem of not controlling aie-opt; (3) isn't a ukernel approach with 64x64x64 optimal, and what does that look like?

  2. Don't unroll-and-jam with this pipeline.
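Option 1 relies on the fact that a matmul can be split along its reduction (k) dimension into partial matmuls that accumulate into the same output. A minimal numerical sketch of that decomposition (a NumPy illustration of the math, not the actual linalg transformation):

```python
import numpy as np

# Split one 64x64x64 matmul into two 64x64x32 partial matmuls that
# accumulate into the same output tile, so each outlined function
# would only ever see k=32.
m, n, k = 64, 64, 64
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

# Reference: single k=64 matmul.
ref = a @ b

# Decomposed: two k=32 partial matmuls, accumulated into one tile.
acc = np.zeros((m, n), dtype=np.float32)
for k0 in range(0, k, 32):
    acc += a[:, k0:k0 + 32] @ b[k0:k0 + 32, :]

assert np.allclose(acc, ref, atol=1e-3)
```

Splitting into 4 pieces (k chunks of 16) follows the same accumulation pattern, just with a smaller step.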

@jtuyls
Collaborator

jtuyls commented Mar 12, 2025

Core performance with k=64 is better. However performance with k=32 with jam-and-unroll is even better ( see #1167 ). Unfortunately k=64 with jam-and-unroll results in using the stack / stack overflow -- either very slow or compilation failure.

Hmm, a larger k dimension should basically always be better in terms of core performance because:

  • It can reuse the accumulator registers more.
  • It makes sure that the local output is typically smaller than the inputs, so that the core can overlap moving in inputs with moving out the output.

It might just need some more experimentation/tuning.
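The second point can be checked with back-of-the-envelope element counts (dtype-agnostic, for the 64x64 tile sizes discussed in this thread):

```python
# Why a larger k makes the local output small relative to the inputs:
# the A and B input tiles grow with k, while the C output tile does not.
def tile_elems(m, n, k):
    inputs = m * k + k * n   # A tile + B tile elements
    output = m * n           # C tile elements
    return inputs, output

for k in (32, 64):
    inputs, output = tile_elems(64, 64, k)
    print(f"k={k}: inputs={inputs}, output={output}, "
          f"output/inputs={output / inputs:.2f}")
```

At k=32 the output is as large as both inputs combined (4096 vs 4096 elements); at k=64 it is only half their size (4096 vs 8192), leaving more slack to overlap moving out C with moving in A and B.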

@newling
Contributor Author

newling commented Mar 12, 2025

Core performance with k=64 is better. However, performance with k=32 plus unroll-and-jam is better still (see #1167). Unfortunately, k=64 with unroll-and-jam results in using the stack / stack overflow -- either very slow execution or a compilation failure.

Hmm, a larger k dimension should basically always be better in terms of core performance because:

  • It can reuse the accumulator registers more.
  • It makes sure that the local output is typically smaller than the inputs, so that the core can overlap moving in inputs with moving out the output.

It might just need some more experimentation/tuning.

Yeah, I agree. In my experiments, it's just much more likely to spill registers. I think this is a combination of aie-opt and aie-llc not doing the optimal thing; i.e., despite getting a lot more control over the order of llvm.load / llvm.store and over unrolling in #1167, we don't seem to have full control.

@newling newling closed this Apr 14, 2025