Add matrix address optimization phase for Zba #261

Open
Alexehv77 wants to merge 1 commit into mem-opt from mem-opt2

Conversation

@Alexehv77

This pass identifies addressing patterns in memory-access-intensive kernels. In such kernels, an index is frequently scaled and added to a base pointer to access memory, followed by an immediate index update. The pass hoists the pointer arithmetic out of the loop header, converting indexed addressing into induction variables.
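As a rough source-level illustration of the pattern described above (the pass itself works on RTL, and the function names here are hypothetical, not taken from the patch), the indexed form and the pointer induction-variable form of the same loop might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* Indexed form: each iteration scales the index and adds it to the base,
   then updates the index. */
static int64_t sum_indexed(const int16_t *base, unsigned n)
{
    int64_t sum = 0;
    for (unsigned i = 0; i < n; i++)
        sum += base[i];   /* address = base + ((uint64_t)i << 1) */
    return sum;
}

/* Induction-variable form: the address is carried across iterations and
   bumped by the element size, so no per-iteration scale-and-add is needed. */
static int64_t sum_pointer(const int16_t *base, unsigned n)
{
    int64_t sum = 0;
    const int16_t *end = base + n;
    for (const int16_t *p = base; p != end; p++)
        sum += *p;
    return sum;
}
```

Both functions compute the same result; the second only shows the shape of the address computation that the transformation aims for.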

Comment on lines +1 to +6
/* This pass identifies addressing patterns in memory-access intensive
kernels. In such kernels an index is frequently scaled and added to a
base pointer to access memory followed by an immediate index update.
The pass hoists the pointer arithmetic out of the loop header converting
index addressing into induction variables. */

Contributor

On #241 I asked what the point of this pass was. This looks a lot like something ivopts is for. Why make something specialized for RTL?

Author

For loops with unsigned loop bounds, ivopts cannot do much, because scalar evolution cannot compute the upper iteration bound of the loop: the loop iterator may overflow. That is why I tried to fix things at the RTL level.
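One way to see why a wrapping iterator gets in the way is the sketch below. It assumes a 32-bit unsigned index that is zero-extended and scaled by 2 to form a 64-bit byte offset; the helper names are hypothetical. Unsigned arithmetic wraps by definition, so the offset the loop really uses is not guaranteed to follow the affine sequence a pointer induction variable would produce. Signed overflow is undefined, which lets the compiler assume it does not happen.

```c
#include <assert.h>
#include <stdint.h>

/* Offset the loop actually uses after k steps: the 32-bit index wraps
   modulo 2^32 first, and only then is zero-extended and scaled. */
static uint64_t offset_wrapping(uint32_t start, uint32_t step, uint32_t k)
{
    uint32_t i = start + step * k;   /* well-defined wraparound */
    return (uint64_t)i << 1;
}

/* Offset an affine induction variable would produce: computed in 64 bits,
   so it never wraps at 2^32. */
static uint64_t offset_affine(uint32_t start, uint32_t step, uint32_t k)
{
    return ((uint64_t)start + (uint64_t)step * k) << 1;
}
```

The two agree as long as the index stays below 2^32 and diverge as soon as it wraps, for example with `start = 0xFFFFFFFF`, `step = 1`, `k = 1`.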

While you were trying to make use of loop versioning at the tree level (where ivopts runs), I approached the performance issues from the RTL level, analyzing the generated assembly and comparing it with what LLVM, and GCC with signed loop bounds, generate.

This phase, combined with the patch from #261, reaches the same performance as when the loop bounds are signed int.

Note also that ivopts is not aware of sh1add.uw and addw, so its cost model cannot accurately determine whether hoisting them is cheaper than leaving them inside the loop.
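For readers less familiar with these two instructions, here is a small C model of their RV64 semantics as I understand them from the ISA specifications. The function names are just labels for this sketch, not GCC internals.

```c
#include <assert.h>
#include <stdint.h>

/* sh1add.uw rd, rs1, rs2 (Zba): zero-extend the low 32 bits of rs1,
   shift left by 1, and add rs2. */
static uint64_t sh1add_uw(uint64_t rs1, uint64_t rs2)
{
    return rs2 + ((uint64_t)(uint32_t)rs1 << 1);
}

/* addw rd, rs1, rs2 (RV64I): 32-bit add, result sign-extended to 64 bits.
   The uint32_t -> int32_t conversion is assumed to wrap in two's
   complement, as it does with GCC and Clang. */
static uint64_t addw(uint64_t rs1, uint64_t rs2)
{
    uint32_t low = (uint32_t)rs1 + (uint32_t)rs2;
    return (uint64_t)(int64_t)(int32_t)low;
}
```

In a loop over 16-bit elements with an unsigned 32-bit index, the address would be formed by something like `sh1add_uw(i, base)` and the index updated by `addw(i, 1)`; this is the pair the cost model would have to price.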

In conclusion, this phase is complementary to ivopts: it targets a specific peephole-like (or combine-like) optimization and can reason about it without the issues triggered by unsigned loop bounds.
Note that combine and peephole themselves won't apply here, as they work on windows of 2-4 instructions, while the pattern I recognize is a dependency chain of several instructions that spans basic-block or even loop boundaries.

Another point of interest: I allow the optimization only in leaf functions (this ensures that no function calls happen inside, and therefore that no caller-saved registers need to be spilled) to decrease the chance of introducing spill code. By contrast, as far as I know, ivopts (running at the tree level) does take register pressure into account, but in a simplistic way that can result in a large amount of spill code.

Last but not least, there is a Jira ticket where I proposed another solution for loops with unsigned bounds that doesn't require loop versioning. The approach is rather simple, and I have already implemented a basic tree-level phase for it.

@Alexehv77 Alexehv77 requested a review from MichielDerhaeg June 12, 2026 07:14