
Add core performance for 4-level tiling pipeline #1177


Closed
wants to merge 2 commits into from

Conversation

newling
Contributor

@newling newling commented Mar 11, 2025

The existing test uses the pack-peel pipeline and results in a function call to a matmul with m=n=64 and k=32, different from the function call produced by the pack-peel-4-level-tiling pipeline, which has m=n=64 and k=64.

Core performance with k=64 is better. However, performance with k=32 plus unroll-and-jam is better still (see #1167). Unfortunately, k=64 with unroll-and-jam results in using the stack / stack overflow -- either very slow execution or a compilation failure.

So I guess the options are

  1. Decompose the linalg.generic with the 64x64x64 matmul just before function outlining into 2 (or maybe 4) smaller linalg.generics, and then outline that smaller generic. Pros: (1) can unroll-and-jam and get the best currently observed performance; (2) even better PM footprint, as the outlined function is smaller. Cons: (1) complexity; (2) potentially dodges the more serious problem of not controlling aie-opt; (3) isn't a ukernel approach with 64x64x64 optimal, and what does that look like?

  2. Don't unroll-and-jam with this pipeline.
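Option 1 relies on the fact that a matmul can be split along its reduction (k) dimension into partial matmuls that accumulate into the same output. A minimal numerical sketch of that decomposition (a NumPy illustration of the math, not the actual linalg transformation):

```python
import numpy as np

# Split one 64x64x64 matmul into two 64x64x32 partial matmuls that
# accumulate into the same output tile, so each outlined function
# would only ever see k=32.
m, n, k = 64, 64, 64
a = np.random.rand(m, k).astype(np.float32)
b = np.random.rand(k, n).astype(np.float32)

# Reference: single k=64 matmul.
ref = a @ b

# Decomposed: two k=32 partial matmuls, accumulated into one tile.
acc = np.zeros((m, n), dtype=np.float32)
for k0 in range(0, k, 32):
    acc += a[:, k0:k0 + 32] @ b[k0:k0 + 32, :]

assert np.allclose(acc, ref, atol=1e-3)
```

Splitting into 4 pieces (k chunks of 16) follows the same accumulation pattern, just with a smaller step.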

@jtuyls
Collaborator

jtuyls commented Mar 12, 2025

Core performance with k=64 is better. However performance with k=32 with jam-and-unroll is even better ( see #1167 ). Unfortunately k=64 with jam-and-unroll results in using the stack / stack overflow -- either very slow or compilation failure.

Hmm, a larger k dimension should basically always be better in terms of core performance because:

  • It can reuse the accumulator registers more.
  • It makes sure that the local output is typically smaller than the inputs, so that the core can overlap moving in inputs with moving out the output.

It might just need some more experimentation/tuning.
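The second point can be checked with back-of-the-envelope element counts (dtype-agnostic, for the 64x64 tile sizes discussed in this thread):

```python
# Why a larger k makes the local output small relative to the inputs:
# the A and B input tiles grow with k, while the C output tile does not.
def tile_elems(m, n, k):
    inputs = m * k + k * n   # A tile + B tile elements
    output = m * n           # C tile elements
    return inputs, output

for k in (32, 64):
    inputs, output = tile_elems(64, 64, k)
    print(f"k={k}: inputs={inputs}, output={output}, "
          f"output/inputs={output / inputs:.2f}")
```

At k=32 the output is as large as both inputs combined (4096 vs 4096 elements); at k=64 it is only half their size (4096 vs 8192), leaving more slack to overlap moving out C with moving in A and B.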

@newling
Contributor Author

newling commented Mar 12, 2025

Core performance with k=64 is better. However, performance with k=32 plus unroll-and-jam is better still (see #1167). Unfortunately, k=64 with unroll-and-jam results in using the stack / stack overflow -- either very slow execution or a compilation failure.

Hmm, a larger k dimension should basically always be better in terms of core performance because:

  • It can reuse the accumulator registers more.
  • It makes sure that the local output is typically smaller than the inputs, so that the core can overlap moving in inputs with moving out the output.

It might just need some more experimentation/tuning.

Yeah, I agree. In my experiments, it's just much more likely to spill registers. I think this is a combination of aie-opt and aie-llc not doing the optimal thing; i.e., despite getting a lot more control over the order of llvm.load / llvm.store and over unrolling in #1167, we don't seem to have full control.

@newling newling closed this Apr 14, 2025