Add core performance for 4-level tiling pipeline #1177
Closed
The existing test uses the pack-peel pipeline and results in a function call to a matmul with m=n=64 and k=32. This differs from the function call produced by the pack-peel-4-level-tiling pipeline, which has m=n=64 and k=64.
Core performance with k=64 is better. However, performance with k=32 combined with unroll-and-jam is better still (see #1167). Unfortunately, k=64 with unroll-and-jam ends up using the stack / overflowing the stack, which means either very slow code or a compilation failure.
So I guess the options are:
1. Decompose the linalg.generic containing the 64x64x64 matmul into 2 (or maybe 4) smaller linalg.generics just before function outlining, and then outline that smaller generic (see the sketch after this list). Pros: (1) can unroll-and-jam and get the best currently observed performance; (2) even better PM footprint, since the outlined function is smaller. Cons: (1) complexity; (2) it potentially dodges the more serious problem of not controlling aie-opt; (3) isn't a ukernel approach with 64x64x64 optimal, and what does that look like?
2. Don't unroll-and-jam with this pipeline.
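For option 1, a minimal sketch of the idea at the linalg level is below. It uses the named `linalg.matmul` op with flat 2-D shapes and assumed element types (bf16 inputs, f32 accumulator) rather than the packed `linalg.generic` the pipeline actually produces, so the shapes and types are illustrative only: the k=64 reduction is split into two k=32 contributions that accumulate into the same output, and only the smaller op would then be outlined.

```mlir
// Before: one m=n=k=64 matmul, which would be outlined as a single function.
// (Shapes and element types are assumptions for illustration.)
%res = linalg.matmul
    ins(%lhs, %rhs : tensor<64x64xbf16>, tensor<64x64xbf16>)
    outs(%acc : tensor<64x64xf32>) -> tensor<64x64xf32>

// After: the k dimension is split into two halves of 32. Each half is a
// smaller matmul that accumulates into the same output; only this smaller
// op shape (m=n=64, k=32) needs to be outlined and unroll-and-jammed.
%lhs0 = tensor.extract_slice %lhs[0, 0]  [64, 32] [1, 1]
    : tensor<64x64xbf16> to tensor<64x32xbf16>
%rhs0 = tensor.extract_slice %rhs[0, 0]  [32, 64] [1, 1]
    : tensor<64x64xbf16> to tensor<32x64xbf16>
%acc0 = linalg.matmul
    ins(%lhs0, %rhs0 : tensor<64x32xbf16>, tensor<32x64xbf16>)
    outs(%acc : tensor<64x64xf32>) -> tensor<64x64xf32>

%lhs1 = tensor.extract_slice %lhs[0, 32] [64, 32] [1, 1]
    : tensor<64x64xbf16> to tensor<64x32xbf16>
%rhs1 = tensor.extract_slice %rhs[32, 0] [32, 64] [1, 1]
    : tensor<64x64xbf16> to tensor<32x64xbf16>
%res  = linalg.matmul
    ins(%lhs1, %rhs1 : tensor<64x32xbf16>, tensor<32x64xbf16>)
    outs(%acc0 : tensor<64x64xf32>) -> tensor<64x64xf32>
```

Splitting into 4 pieces would follow the same pattern with k=16 slices; whether 2 or 4 is better presumably depends on how far unroll-and-jam can go before register/stack pressure shows up again.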