Description
Some popular compilers (cough GCC cough) have rather conservative (stupid?) common subexpression elimination (CSE) passes. For instance, to get decent code for AVX-512 sgemm, I have to explicitly store the base pointers to the windows of C (in DRAM) that I'm loading.
See here: https://gcc.godbolt.org/z/38Y1EPr5n
Without the c_base array, GCC spills all the computed addresses to memory. Clang 13 does okay, but with this change GCC wins overall.
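To make the shape of the workaround concrete, here is a minimal scalar sketch. It is not the code behind the Godbolt link: the function names, the 4x4 tile size, and the acc layout are all made up for illustration, and I'm not claiming GCC spills on this toy version. It only shows the difference between recomputing each window address at every use and computing the row base pointers once into a c_base array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical micro-tile size, chosen only for illustration. */
enum { TILE_M = 4, TILE_N = 4 };

/* Naive form: every access recomputes the address C + i*ldc + j,
 * leaving it to the compiler's CSE pass to share the i*ldc part. */
void tile_add_naive(float *C, size_t ldc, const float *acc) {
    assert(ldc >= TILE_N);
    for (size_t i = 0; i < TILE_M; i++)
        for (size_t j = 0; j < TILE_N; j++)
            C[i * ldc + j] += acc[i * TILE_N + j];
}

/* Hoisted form: the row base pointers of the window into C are
 * computed once and stored in c_base, then reused for every access. */
void tile_add_hoisted(float *C, size_t ldc, const float *acc) {
    assert(ldc >= TILE_N);
    float *c_base[TILE_M];
    for (size_t i = 0; i < TILE_M; i++)
        c_base[i] = C + i * ldc;
    for (size_t i = 0; i < TILE_M; i++)
        for (size_t j = 0; j < TILE_N; j++)
            c_base[i][j] += acc[i * TILE_N + j];
}
```

Both functions compute the same result; the second just does by hand what one would hope the CSE pass does on its own.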
I think a lot of the issues we're encountering with respect to code generation stem from the size of the gap between LoopIR and C. I think we should design a closer-to-C IR (CIR for now) that the compiler and memory system use to construct a final compiled program, and then give users the ability to visit this IR as an "ultimate" escape hatch. We could even expose some scheduling directives on it, such as a CSE pass that isn't worried about the horrors of C semantics.
The particular issue above involves CSE-ing sub-expressions that are generated by the memories, so there is no way to schedule one's way out of it.