Description
(Fleshing out and jotting down for posterity an idea I floated and we collectively refined at a recent Cranelift meeting.)
This peephole pass could run after regalloc, just before encoding instructions. It wouldn't need to be a whole new traversal over the IR, it could run as we walk the vcode for encoding, essentially implemented as an iterator combinator.
This peephole pass would not be DFG-based. It would be a classic sliding-window-of-instructions peephole pass. This both allows us to avoid constructing a more-complete DFG for vcode like what we have for clif, which could be expensive, and it allows us to benefit from opportunities that arise only because of instruction scheduling of two different sequences that are otherwise disconnected in the DFG.
Put another way: if you want DFG-y rewrites, use the mid-end for target-agnostic rules and the clif-to-vcode instruction selection lowering for target-specific rules. We have good solutions for that kind of thing. If however you want non-DFG-y rewrites, we currently have nothing and this issue is proposing we add something for the target-specific dimension, which is where I suspect most of the opportunity exists.
Implementation idea brain dump:
- Treat finding matches as a a string-matching algorithm.
- For simplicity, and to cut down on the size of tables, ignore operands and only consider opcodes.
- So we treat the vcode as a string where the alphabet is its opcodes
- Take all the left-hand sides of our rewrites and treat them as substrings in that same alphabet
- Do an online version of aho-corasick or similar to find all substring matches in the vcode, without needing to repeat work for each index in the vcode or for each rewrite rule
- On each substring match
- run additional predicates to capture preconditions in the associated pattern(s)'s left-hand side that are not accounted for in the string match (i.e. that one of the operands is zero)
- if those additional predicates are satisfied, replace the matched sequence with the rule's right-hand side
- On each substring match
- As far as ISLE integration goes:
- I don't think we could do this with just custom external multi extractors, since we need to process all the rules to get the substring patterns to create the aho-corasick automaton
- This means we would need to add a new kind of mode/backend to ISLE or something?
- We could, I guess, just accept that we are doing redundant work at each index in the vcode "string" and not do the full aho-corasick thing.
- We would still get the trie-y bits via
islec
- But we wouldn't be able to avoid repeatedly looking at the same instructions when we call into ISLE at
vcode[i]
that we've already looked at when we called into ISLE fromvcode[i - 1]
. - pretty sure we could implement this less-ambitious version with ISLE today
- but this is also way less cool
- We would still get the trie-y bits via
Between integrating with an existing traversal over the vcode, not constructing a DFG, and the string-matching sketch just above, this new peephole pass should impose extremely little overhead on compile times.
A few concrete example peephole optimizations we could do, I'm sure we can think of more and will do so if we put this infra in place:
- turn adjacent pairs of loads/stores into
ldp
/stp
on aarch64- similarly: turn a pair of adjacent u32 loads into an unaligned u64 load and split on targets where the cost of an unaligned u64 load is less than the cost of two u32 loads (I think x64, at least)
- sink spills/reloads introduced by regalloc into the actual
add
s or whatever that produces/consumes the value on x64 - on x64 (and others?) avoid unnecessary
cmp
andtest
instructions when we know that the last instruction in the stream happened to set identical flags (egand rax, r10; test rax, r10
orsub rdi, rsi; cmp rdi, rsi
)