Skip to content

cranelift: Optimize select+icmp into {s,u}{min,max} #5546

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 15, 2023

Conversation

afonso360
Copy link
Contributor

@afonso360 afonso360 commented Jan 7, 2023

👋 Hey,

This PR adds some egraph rules to optimize a select + icmp into the appropriate version of the min/max instructions.

We transform the following code:

function %select_sgt_to_smax(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = icmp sgt v0, v1
    v3 = select v2, v0, v1
    return v3
}

Into:

function %select_sgt_to_smax(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = smax v0, v1
    return v2
}

This is beneficial at least for the RISC-V backend, where we can emit the max instruction in these cases instead of branching.

I've been running this in fuzzgen for a while and it hasn't complained so far!

@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Jan 7, 2023
Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these rules can be generalized further, but I think these are worth merging as-is too.

I think there are several ways to generalize these:

  1. Each of these rules has an equivalent reversed version; for example icmp sgt x, y is equivalent to icmp slt y, x. Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.
  2. The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.
  3. Is there an equivalent floating-point rewrite? NaNs make my head hurt.

I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.

@afonso360
Copy link
Contributor Author

afonso360 commented Jan 9, 2023

Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.

Yeah, that sounds like a good idea, I'll try that!

The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.

👍, I'll add those in this PR if that's OK with you.

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

I'm not sure, I'll try to look into it.

I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.

I don't expect any benefits for x86 since I think right now we lower min/max as cmp+cmov* which is the same as the unoptimized form, but we can try it. Although we might see some benefits on the vselect version (maybe?).

/bench_x64

@jlb6740
Copy link
Contributor

jlb6740 commented Jan 9, 2023

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[160490 179150.32 293708] commit.so
[155924 172967.04 189250] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[202562282 217920659.92 241944548] commit.so
[199991838 212302882.08 235356036] main.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[186500 203277.04 247336] commit.so
[186216 207677.60 249312] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[293540152 337751307.12 374158094] commit.so
[295516870 341031354.56 393946812] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6965675788 7114549484.48 7213034856] commit.so
[6955473648 7075012926.00 7165106004] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[488512 504472.32 573802] commit.so
[488122 501985.20 525900] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8327116 8476234.16 8683322] commit.so
[8339612 8512672.24 8750996] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[111521870 115012543.92 117344984] commit.so
[111540296 114710730.08 122547780] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1024608234 1038058547.20 1067188634] commit.so
[1027357456 1037305556.48 1048046760] main.so

@afonso360
Copy link
Contributor Author

afonso360 commented Jan 15, 2023

I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal icmp eq. I've left it as a separate commit so that its easier to investigate.

I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.

This is what we get now:

test optimize
set use_egraphs=true
target x86_64

function %icmp_eq(i32, i32) -> i8 {
block0(v0: i32, v1: i32):
    v2 = icmp eq v0, v1
    return v2
}

; check: block0(v0: i32, v1: i32):
; check:    v8 = icmp eq v0, v1
; check:    v9 -> v8
; check:    v10 -> v9
; check:    v11 -> v10
; check:    v12 -> v11
; check:    v13 -> v12
; check:    v14 -> v13
; check:    return v8 

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

So, I don't think there is an exact match for the full fcmp space, but the f{min,max}_pseudo instructions are defined in terms of fcmp {lt,gt} + select, so we can at least get those.

We have some docs for those instructions, but they are better described in the Webassembly PR that introduces them


I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...

Sightglass SIMD benchmark results
     Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm
`

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [73345320 95091825.36 287267718] main.so
  [70722628 101732805.98 299985933] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [168256 216817.60 382976] main.so
  [177600 220696.00 344096] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [133184 167869.12 416352] main.so
  [132128 165781.12 341184] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [709243689 815567491.90 1073223272] main.so
  [729598384 821242801.38 1027388168] select.so

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [3855579532 3914533632.01 4071619781] main.so
  [3855359377 3917102361.59 4042283617] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [439936 513683.52 823392] main.so
  [439488 513457.28 862848] select.so

/bench_x64

@jlb6740
Copy link
Contributor

jlb6740 commented Jan 15, 2023

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[195254 206675.84 239848] commit.so
[185986 200724.96 227526] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[158588 177232.96 200728] commit.so
[153838 181267.68 359084] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[310637436 342353446.56 390973658] commit.so
[295484178 336498505.76 366404166] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[205285250 220044999.92 262015540] commit.so
[202251150 217512094.00 237476954] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[111173460 114115222.96 119495454] commit.so
[111448476 114811907.92 115933528] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6940653952 7096700981.20 7212066130] commit.so
[6942954936 7124108036.08 7229159218] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[478890 506487.60 662056] commit.so
[485120 504785.28 608956] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1023014568 1034258955.84 1055892690] commit.so
[1019706902 1030842327.04 1044519376] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8352392 8496849.04 8691182] commit.so
[8360734 8482021.60 8626984] main.so

@cfallin
Copy link
Member

cfallin commented Jan 17, 2023

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default? (I'm cooking up a PR to flip that, but it's not done yet.)

@afonso360
Copy link
Contributor Author

afonso360 commented Jan 18, 2023

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

It was also really intuitive to get to this solution which is awesome!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default?

Oh! I completely forgot about that! 😄

I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do max/min as a single instruction.

The vselect ones I think might make a difference? Unless existing toolchains already perform this optimization when producing wasm which is somewhat likely.

@EdorianDark
Copy link

Egraph based optimizations are now on by default, so I think an rebase should be enough to test if something changes.

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 29, 2023

Egraph optimizations are on by default if optimizations are on. Do benchmarks run with optimizations enabled?

@afonso360 afonso360 force-pushed the select-optimizations branch from bd14a99 to 25ea16e Compare January 29, 2023 18:59
@afonso360
Copy link
Contributor Author

Do benchmarks run with optimizations enabled?

I think optimizations are enabled by default, at least on the wasmtime cli that is the case. Although I'm not entirely sure about sightglass.

Rebased this on main and ran sightglass on AArch64, doesen't seem to have much of an impact. I wonder if the results would be any different for something like cg_clif that doesen't run on already optimized code.
Unfortunately I can't run this on RISCV which is where I would hope to see some positive effect.

Scalar Benchmarks AArch64
ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3282352.64 ± 1066807.74 (confidence = 99%)

  main.so is 1.02x to 1.04x faster than select.so!

  [111047763 118113547.11 123328519] main.so
  [115034752 121395899.75 128445997] select.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 6938.23 ± 1941.66 (confidence = 99%)

  select.so is 1.01x to 1.01x faster than main.so!

  [876198 884956.45 902123] main.so
  [869354 878018.22 901740] select.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [2849 3260.40 12227] main.so
  [2804 3148.16 4268] select.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4381352 5597680.00 6817168] main.so
  [4465781 5715523.09 7002568] select.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2361 2621.85 4044] main.so
  [2284 2575.55 3465] select.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2440354 2586186.45 3163455] main.so
  [2460769 2621895.24 3155162] select.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [69303 75400.87 94495] main.so
  [70033 76346.64 97401] select.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [5533 6360.26 18487] main.so
  [5558 6418.39 19011] select.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [8367754 8415382.47 8499654] main.so
  [8377518 8420745.48 8547947] select.so
SIMD Benchmarks AArch64
ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  Δ = 203985.92 ± 8663.93 (confidence = 99%)

  select.so is 1.00x to 1.00x faster than main.so!

  [45561021 45611295.11 45720621] main.so
  [45363259 45407309.19 45486165] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [3472 4728.82 13547] main.so
  [3477 4261.65 15734] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [1749 2124.65 3341] main.so
  [1724 2101.14 3474] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [5448821 5745023.23 6331575] main.so
  [5487095 5779248.67 6320724] select.so

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [526706 615112.06 1567920] main.so
  [536207 613883.57 1149844] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [2246 2495.36 3522] main.so
  [2126 2497.00 4032] select.so

Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afonso360 this looks generally good to me, aside from comment below, and test fixes; happy to merge once those are fixed!

@afonso360 afonso360 force-pushed the select-optimizations branch from 25ea16e to 0233691 Compare February 9, 2023 20:41
@jameysharp
Copy link
Contributor

Looks like s390x doesn't have lowerings for umin/umax on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand

@uweigand
Copy link
Member

Looks like s390x doesn't have lowerings for umin/umax on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand

Well, we don't have any instructions for umin/umax on scalars, so the only possible implementation would be to turn it right back into (the equivalent of) select + icmp ... Is this what we're supposed to be doing here?

@afonso360
Copy link
Contributor Author

I think that's what x86 does, as well.

@uweigand
Copy link
Member

I've now opened #5762 to add support for these clif instructions on s390x.

@afonso360 afonso360 force-pushed the select-optimizations branch from 0233691 to 7c999a1 Compare February 15, 2023 11:57
@jameysharp jameysharp merged commit 76539ef into bytecodealliance:main Feb 15, 2023
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Aug 18, 2023
This transform is wrong for the inputs in the attached testcase.

It was introduced in bytecodealliance#5546 and has been sort of unnoticed until bytecodealliance#6843 caused
some wasmtime testcode to fire this optimization rule and produce this
counterexample.
github-merge-queue bot pushed a commit that referenced this pull request Aug 21, 2023
This transform is wrong for the inputs in the attached testcase.

It was introduced in #5546 and has been sort of unnoticed until #6843 caused
some wasmtime testcode to fire this optimization rule and produce this
counterexample.
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Aug 22, 2023
…odealliance#6859)

This transform is wrong for the inputs in the attached testcase.

It was introduced in bytecodealliance#5546 and has been sort of unnoticed until bytecodealliance#6843 caused
some wasmtime testcode to fire this optimization rule and produce this
counterexample.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants