-
Notifications
You must be signed in to change notification settings - Fork 1.4k
cranelift: Optimize select+icmp
into {s,u}{min,max}
#5546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cranelift: Optimize select+icmp
into {s,u}{min,max}
#5546
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe these rules can be generalized further, but I think these are worth merging as-is too.
I think there are several ways to generalize these:
- Each of these rules has an equivalent reversed version; for example
icmp sgt x, y
is equivalent toicmp slt y, x
. Perhaps we should take advantage of the egraph ability to represent equivalencies with a(simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x)
rule. - The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.
- Is there an equivalent floating-point rewrite? NaNs make my head hurt.
I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64
but I can't; it looks like my username isn't in the list in performance.yml
.
Yeah, that sounds like a good idea, I'll try that!
👍, I'll add those in this PR if that's OK with you.
I'm not sure, I'll try to look into it.
I don't expect any benefits for x86 since I think right now we lower /bench_x64 |
instantiation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [160490 179150.32 293708] commit.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [202562282 217920659.92 241944548] commit.so instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [186500 203277.04 247336] commit.so compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [293540152 337751307.12 374158094] commit.so compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [6965675788 7114549484.48 7213034856] commit.so instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [488512 504472.32 573802] commit.so execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [8327116 8476234.16 8683322] commit.so execution :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [111521870 115012543.92 117344984] commit.so execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [1024608234 1038058547.20 1067188634] commit.so |
I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance. This is what we get now:
So, I don't think there is an exact match for the full We have some docs for those instructions, but they are better described in the Webassembly PR that introduces them I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal... Sightglass SIMD benchmark results
/bench_x64 |
instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [195254 206675.84 239848] commit.so instantiation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [158588 177232.96 200728] commit.so compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [310637436 342353446.56 390973658] commit.so compilation :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [205285250 220044999.92 262015540] commit.so execution :: cycles :: benchmarks/bz2/benchmark.wasm No difference in performance. [111173460 114115222.96 119495454] commit.so compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [6940653952 7096700981.20 7212066130] commit.so instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [478890 506487.60 662056] commit.so execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm No difference in performance. [1023014568 1034258955.84 1055892690] commit.so execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm No difference in performance. [8352392 8496849.04 8691182] commit.so |
It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward! Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default? (I'm cooking up a PR to flip that, but it's not done yet.) |
It was also really intuitive to get to this solution which is awesome!
Oh! I completely forgot about that! 😄 I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do The |
Egraph based optimizations are now on by default, so I think an rebase should be enough to test if something changes. |
Egraph optimizations are on by default if optimizations are on. Do benchmarks run with optimizations enabled? |
bd14a99
to
25ea16e
Compare
I think optimizations are enabled by default, at least on the wasmtime cli that is the case. Although I'm not entirely sure about sightglass. Rebased this on main and ran sightglass on AArch64, doesen't seem to have much of an impact. I wonder if the results would be any different for something like Scalar Benchmarks AArch64
SIMD Benchmarks AArch64
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afonso360 this looks generally good to me, aside from comment below, and test fixes; happy to merge once those are fixed!
25ea16e
to
0233691
Compare
Looks like s390x doesn't have lowerings for |
Well, we don't have any instructions for |
I think that's what x86 does, as well. |
I've now opened #5762 to add support for these clif instructions on s390x. |
0233691
to
7c999a1
Compare
This transform is wrong for the inputs in the attached testcase. It was introduced in bytecodealliance#5546 and has been sort of unnoticed until bytecodealliance#6843 caused some wasmtime testcode to fire this optimization rule and produce this counterexample.
…odealliance#6859) This transform is wrong for the inputs in the attached testcase. It was introduced in bytecodealliance#5546 and has been sort of unnoticed until bytecodealliance#6843 caused some wasmtime testcode to fire this optimization rule and produce this counterexample.
👋 Hey,
This PR adds some egraph rules to optimize a
select + icmp
into the appropriate version of themin
/max
instructions.We transform the following code:
Into:
This is beneficial at least for the RISC-V backend, where we can emit the
max
instruction in these cases instead of branching.I've been running this in fuzzgen for a while and it hasn't complained so far!