This pass identifies addressing patterns in memory-access-intensive kernels. In such kernels, an index is frequently scaled and added to a base pointer to access memory, followed by an immediate index update. The pass hoists the pointer arithmetic out of the loop header, converting index addressing into induction variables.
/* This pass identifies addressing patterns in memory-access intensive
   kernels. In such kernels an index is frequently scaled and added to a
   base pointer to access memory followed by an immediate index update.
   The pass hoists the pointer arithmetic out of the loop header converting
   index addressing into induction variables. */
On #241 I asked what the point of this pass was. This looks a lot like something ivopts is for. Why make something specialized for RTL?
In the context of loops with unsigned loop bounds, ivopts cannot do much because scalar evolution cannot compute the upper iteration bound of the loop, as the loop iterator may overflow. Because of this, I tried to fix things at the RTL level.
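To make the unsigned-bound problem concrete, here is a minimal source-level sketch in C. The function names `sum_indexed` and `sum_hoisted` are hypothetical, and the second function is only a hand-written illustration of the kind of rewrite the pass description talks about, not the RTL pass itself. With `i <= hi` and an unsigned `i`, the loop never terminates when `hi == UINT_MAX`, which is one reason an iteration bound cannot be derived without extra assumptions. The hoisted form is equivalent only under the stated assumption that `hi < UINT_MAX`.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Index-addressed form: on every iteration the 32-bit unsigned index is
   zero-extended, scaled by the element size and added to the base pointer. */
uint32_t sum_indexed(const uint16_t *a, unsigned lo, unsigned hi)
{
  uint32_t sum = 0;
  for (unsigned i = lo; i <= hi; i++)  /* wraps forever if hi == UINT_MAX */
    sum += a[i];
  return sum;
}

/* Hand-hoisted form (illustrative assumption, not the pass output): the
   address is computed once before the loop and then advanced as a pointer
   induction variable. Only valid when the index cannot wrap. */
uint32_t sum_hoisted(const uint16_t *a, unsigned lo, unsigned hi)
{
  assert(hi < UINT_MAX);               /* assumption: no wraparound of i */
  uint32_t sum = 0;
  const uint16_t *p = a + lo;          /* pointer arithmetic done once */
  for (unsigned i = lo; i <= hi; i++, p++)
    sum += *p;                         /* no per-iteration scale-and-add */
  return sum;
}
```

With signed bounds the compiler may assume the index does not overflow, so it can derive this form on its own; with unsigned bounds the no-wrap condition has to be established some other way.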
As you were trying to make use of loop versioning at the tree level (where ivopts runs), I started approaching the performance issues from the RTL level by analyzing the generated assembly and comparing it to what LLVM, and GCC with signed loop bounds, were generating.
This phase, combined with the patch from #261, managed to reach the same performance as if the loop bounds were signed int.
Also note that ivopts is not aware of sh1add.uw and addw, and therefore its cost model cannot accurately calculate whether hoisting them is cheaper than leaving them inside the loop.
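For readers less familiar with these two instructions, the sketch below models their semantics in C: `sh1add.uw` (RISC-V Zba) zero-extends the low 32 bits of `rs1`, shifts left by one and adds `rs2`; `addw` (RV64I) adds and sign-extends the low 32 bits of the result. The helper names `model_sh1add_uw`, `model_addw` and `next_addr_in_loop` are hypothetical, and the in-loop chain shown is an assumed shape of the pattern for a 2-byte element type, not code taken from the patch.

```c
#include <stdint.h>

/* Model of sh1add.uw: rd = rs2 + (zero_extend(rs1[31:0]) << 1). */
static uint64_t model_sh1add_uw(uint64_t rs1, uint64_t rs2)
{
  return rs2 + ((uint64_t)(uint32_t)rs1 << 1);
}

/* Model of addw: rd = sign_extend((rs1 + rs2)[31:0]).
   The uint32_t -> int32_t conversion is implementation-defined in C;
   GCC and Clang reduce modulo 2^32, which is what this model relies on. */
static uint64_t model_addw(uint64_t rs1, uint64_t rs2)
{
  return (uint64_t)(int64_t)(int32_t)(uint32_t)(rs1 + rs2);
}

/* Assumed in-loop chain: update the 32-bit index, then recompute the
   element address from scratch (index scaled by 2, added to the base). */
static uint64_t next_addr_in_loop(uint64_t base, uint64_t *idx)
{
  *idx = model_addw(*idx, 1);
  return model_sh1add_uw(*idx, base);
}
```

As long as the index does not wrap, each call to `next_addr_in_loop` returns the previous address plus 2, which is the per-iteration work a pointer induction variable replaces with a single add.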
In conclusion, this phase is complementary to ivopts: it targets a specific peephole (or combine) type of optimization and can reason about it without the issues triggered by the unsigned loop bounds.
Note that combine and peephole won't apply here, as they work on instruction windows of 2-4 instructions, while the pattern I recognize here is a dependency chain of several instructions that spans BB or even loop boundaries.
Another interesting point is that I allow the optimization only for leaf functions (this ensures no function calls happen inside, and as a result there is no need to spill caller-saved registers) to decrease the chance of introducing spill code. By contrast, my understanding is that although ivopts (running at the tree level) does take register pressure into account, it does so in a simplistic way, which has the potential to result in a large amount of spill code.
Last but not least, there is a Jira ticket where I proposed another solution for dealing with loops with unsigned bounds that doesn't require loop versioning. The approach is rather simple, and I have already implemented a basic tree-level phase that deals with it.