cranelift: Optimize `select+icmp` into `{s,u}{min,max}` #5546

afonso360 · 2023-01-07T12:23:33Z

👋 Hey,

This PR adds some egraph rules to optimize a select + icmp into the appropriate version of the min/max instructions.

We transform the following code:

function %select_sgt_to_smax(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = icmp sgt v0, v1
    v3 = select v2, v0, v1
    return v3
}

Into:

function %select_sgt_to_smax(i32, i32) -> i32 {
block0(v0: i32, v1: i32):
    v2 = smax v0, v1
    return v2
}

This is beneficial at least for the RISC-V backend, where we can emit the max instruction in these cases instead of branching.

I've been running this in fuzzgen for a while and it hasn't complained so far!

jameysharp

I believe these rules can be generalized further, but I think these are worth merging as-is too.

I think there are several ways to generalize these:

Each of these rules has an equivalent reversed version; for example icmp sgt x, y is equivalent to icmp slt y, x. Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.
The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.
Is there an equivalent floating-point rewrite? NaNs make my head hurt.

I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.

afonso360 · 2023-01-09T21:48:50Z

Perhaps we should take advantage of the egraph ability to represent equivalencies with a (simplify (icmp ty cc x y) (icmp ty (intcc_reverse cc) y x) rule.

Yeah, that sounds like a good idea, I'll try that!

The same reasoning should work on vectors; I think copying each of these rules for vselect of icmp should work.

👍, I'll add those in this PR if that's OK with you.

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

I'm not sure, I'll try to look into it.

I'd love to see Sightglass benchmark figures for this PR. I think you can do /bench_x64 but I can't; it looks like my username isn't in the list in performance.yml.

I don't expect any benefits for x86 since I think right now we lower min/max as cmp+cmov* which is the same as the unoptimized form, but we can try it. Although we might see some benefits on the vselect version (maybe?).

/bench_x64

jlb6740 · 2023-01-09T21:56:14Z

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[160490 179150.32 293708] commit.so
[155924 172967.04 189250] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[202562282 217920659.92 241944548] commit.so
[199991838 212302882.08 235356036] main.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[186500 203277.04 247336] commit.so
[186216 207677.60 249312] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[293540152 337751307.12 374158094] commit.so
[295516870 341031354.56 393946812] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6965675788 7114549484.48 7213034856] commit.so
[6955473648 7075012926.00 7165106004] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[488512 504472.32 573802] commit.so
[488122 501985.20 525900] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8327116 8476234.16 8683322] commit.so
[8339612 8512672.24 8750996] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[111521870 115012543.92 117344984] commit.so
[111540296 114710730.08 122547780] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1024608234 1038058547.20 1067188634] commit.so
[1027357456 1037305556.48 1048046760] main.so

afonso360 · 2023-01-15T17:59:29Z

I tried adding the icmp reverse rule, and while it works and is correct it seems to be doing something weird. We now get a bunch of aliases after optimizing a normal icmp eq. I've left it as a separate commit so that its easier to investigate.

I'm worried that we might be unintentionally doing a bunch of stuff and worsening compile performance.

This is what we get now:

test optimize
set use_egraphs=true
target x86_64

function %icmp_eq(i32, i32) -> i8 {
block0(v0: i32, v1: i32):
    v2 = icmp eq v0, v1
    return v2
}

; check: block0(v0: i32, v1: i32):
; check:    v8 = icmp eq v0, v1
; check:    v9 -> v8
; check:    v10 -> v9
; check:    v11 -> v10
; check:    v12 -> v11
; check:    v13 -> v12
; check:    v14 -> v13
; check:    return v8

Is there an equivalent floating-point rewrite? NaNs make my head hurt.

So, I don't think there is an exact match for the full fcmp space, but the f{min,max}_pseudo instructions are defined in terms of fcmp {lt,gt} + select, so we can at least get those.

We have some docs for those instructions, but they are better described in the Webassembly PR that introduces them

I ran the Sightglass SIMD benchmarks locally and they measured no difference, but I'm not too confident in my benchmarking setup since its a linux vm inside a windows host, which doesn't seem ideal...

Sightglass SIMD benchmark results

     Running `target/debug/sightglass-cli benchmark --engine ./libwasmtime_main.so --engine ./libwasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm
`

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [73345320 95091825.36 287267718] main.so
  [70722628 101732805.98 299985933] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [168256 216817.60 382976] main.so
  [177600 220696.00 344096] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [133184 167869.12 416352] main.so
  [132128 165781.12 341184] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [709243689 815567491.90 1073223272] main.so
  [729598384 821242801.38 1027388168] select.so

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [3855579532 3914533632.01 4071619781] main.so
  [3855359377 3917102361.59 4042283617] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [439936 513683.52 823392] main.so
  [439488 513457.28 862848] select.so

/bench_x64

jlb6740 · 2023-01-15T18:06:13Z

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[195254 206675.84 239848] commit.so
[185986 200724.96 227526] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[158588 177232.96 200728] commit.so
[153838 181267.68 359084] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[310637436 342353446.56 390973658] commit.so
[295484178 336498505.76 366404166] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[205285250 220044999.92 262015540] commit.so
[202251150 217512094.00 237476954] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

No difference in performance.

[111173460 114115222.96 119495454] commit.so
[111448476 114811907.92 115933528] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[6940653952 7096700981.20 7212066130] commit.so
[6942954936 7124108036.08 7229159218] main.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[478890 506487.60 662056] commit.so
[485120 504785.28 608956] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

No difference in performance.

[1023014568 1034258955.84 1055892690] commit.so
[1019706902 1030842327.04 1044519376] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

No difference in performance.

[8352392 8496849.04 8691182] commit.so
[8360734 8482021.60 8626984] main.so

cfallin · 2023-01-17T19:55:36Z

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default? (I'm cooking up a PR to flip that, but it's not done yet.)

afonso360 · 2023-01-18T15:43:25Z

It makes me quite happy to see the usage of the egraph framework to add a rewrite be so straightforward!

It was also really intuitive to get to this solution which is awesome!

Regarding the benchmarking results: perhaps they are showing no effect because egraph-based optimization is still off by default?

Oh! I completely forgot about that! 😄

I'll try to enable those settings in sightglass in the next run I make. But I'm not expecting great things here, x86 and aarch64 both lower to the same instructions, RISC-V does benefit from it since in can do max/min as a single instruction.

The vselect ones I think might make a difference? Unless existing toolchains already perform this optimization when producing wasm which is somewhat likely.

EdorianDark · 2023-01-29T12:02:03Z

Egraph based optimizations are now on by default, so I think an rebase should be enough to test if something changes.

bjorn3 · 2023-01-29T16:34:58Z

Egraph optimizations are on by default if optimizations are on. Do benchmarks run with optimizations enabled?

afonso360 · 2023-01-29T19:02:59Z

Do benchmarks run with optimizations enabled?

I think optimizations are enabled by default, at least on the wasmtime cli that is the case. Although I'm not entirely sure about sightglass.

Rebased this on main and ran sightglass on AArch64, doesen't seem to have much of an impact. I wonder if the results would be any different for something like cg_clif that doesen't run on already optimized code.
Unfortunately I can't run this on RISCV which is where I would hope to see some positive effect.

Scalar Benchmarks AArch64

ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 3282352.64 ± 1066807.74 (confidence = 99%)

  main.so is 1.02x to 1.04x faster than select.so!

  [111047763 118113547.11 123328519] main.so
  [115034752 121395899.75 128445997] select.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 6938.23 ± 1941.66 (confidence = 99%)

  select.so is 1.01x to 1.01x faster than main.so!

  [876198 884956.45 902123] main.so
  [869354 878018.22 901740] select.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [2849 3260.40 12227] main.so
  [2804 3148.16 4268] select.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [4381352 5597680.00 6817168] main.so
  [4465781 5715523.09 7002568] select.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2361 2621.85 4044] main.so
  [2284 2575.55 3465] select.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [2440354 2586186.45 3163455] main.so
  [2460769 2621895.24 3155162] select.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [69303 75400.87 94495] main.so
  [70033 76346.64 97401] select.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [5533 6360.26 18487] main.so
  [5558 6418.39 19011] select.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [8367754 8415382.47 8499654] main.so
  [8377518 8420745.48 8547947] select.so

SIMD Benchmarks AArch64

ubuntu@instance-20220805-0848:~/git/sightglass$ target/debug/sightglass-cli benchmark --engine /tmp/wasmtime_main.so --engine /tmp/wasmtime_select.so -- benchmarks/blake3-simd/benchmark.wasm benchmarks/intgemm-simd/benchmark.wasm

execution :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  Δ = 203985.92 ± 8663.93 (confidence = 99%)

  select.so is 1.00x to 1.00x faster than main.so!

  [45561021 45611295.11 45720621] main.so
  [45363259 45407309.19 45486165] select.so

execution :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [3472 4728.82 13547] main.so
  [3477 4261.65 15734] select.so

instantiation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [1749 2124.65 3341] main.so
  [1724 2101.14 3474] select.so

compilation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [5448821 5745023.23 6331575] main.so
  [5487095 5779248.67 6320724] select.so

compilation :: cycles :: benchmarks/blake3-simd/benchmark.wasm

  No difference in performance.

  [526706 615112.06 1567920] main.so
  [536207 613883.57 1149844] select.so

instantiation :: cycles :: benchmarks/intgemm-simd/benchmark.wasm

  No difference in performance.

  [2246 2495.36 3522] main.so
  [2126 2497.00 4032] select.so

cfallin

@afonso360 this looks generally good to me, aside from comment below, and test fixes; happy to merge once those are fixed!

cranelift/codegen/src/opts/algebraic.isle

jameysharp · 2023-02-09T21:16:14Z

Looks like s390x doesn't have lowerings for umin/umax on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand

uweigand · 2023-02-10T12:15:52Z

Looks like s390x doesn't have lowerings for umin/umax on scalars, so when this PR rewrites this pattern to those instructions the s390x tests fail. cc: @uweigand

Well, we don't have any instructions for umin/umax on scalars, so the only possible implementation would be to turn it right back into (the equivalent of) select + icmp ... Is this what we're supposed to be doing here?

afonso360 · 2023-02-10T12:24:29Z

I think that's what x86 does, as well.

uweigand · 2023-02-10T15:28:24Z

I've now opened #5762 to add support for these clif instructions on s390x.

This transform is wrong for the inputs in the attached testcase. It was introduced in bytecodealliance#5546 and has been sort of unnoticed until bytecodealliance#6843 caused some wasmtime testcode to fire this optimization rule and produce this counterexample.

This transform is wrong for the inputs in the attached testcase. It was introduced in #5546 and has been sort of unnoticed until #6843 caused some wasmtime testcode to fire this optimization rule and produce this counterexample.

…odealliance#6859) This transform is wrong for the inputs in the attached testcase. It was introduced in bytecodealliance#5546 and has been sort of unnoticed until bytecodealliance#6843 caused some wasmtime testcode to fire this optimization rule and produce this counterexample.

github-actions bot added the cranelift Issues related to the Cranelift code generator label Jan 7, 2023

jameysharp approved these changes Jan 9, 2023

View reviewed changes

afonso360 force-pushed the select-optimizations branch from bd14a99 to 25ea16e Compare January 29, 2023 18:59

cfallin reviewed Feb 9, 2023

View reviewed changes

cranelift/codegen/src/opts/algebraic.isle Outdated Show resolved Hide resolved

afonso360 force-pushed the select-optimizations branch from 25ea16e to 0233691 Compare February 9, 2023 20:41

afonso360 added 5 commits February 15, 2023 11:56

cranelift: Optimize select+icmp into {s,u}{min,max}

08e2a86

cranelift: Add generic egraph icmp reverse rule

1f049e5

cranelift: Optimize vselect+icmp into {s,u}{min,max}

51f6237

cranelift: Optimize some vselect+fcmp into f{min,max}_pseudo

87d1139

cranelift: Add inverted forms of min/max rules

7c999a1

afonso360 force-pushed the select-optimizations branch from 0233691 to 7c999a1 Compare February 15, 2023 11:57

jameysharp merged commit 76539ef into bytecodealliance:main Feb 15, 2023

afonso360 mentioned this pull request Feb 28, 2023

[Cranelift] Implement scalar min/max for AArch64 #4313

Open

jameysharp mentioned this pull request Mar 30, 2023

ISLE: Support commutative matches #6128

Open

afonso360 mentioned this pull request Aug 18, 2023

egraphs: Delete select+fcmp to f{min,max}_pseudo transform #6859

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cranelift: Optimize `select+icmp` into `{s,u}{min,max}` #5546

cranelift: Optimize `select+icmp` into `{s,u}{min,max}` #5546

afonso360 commented Jan 7, 2023 •

edited

Loading

jameysharp left a comment

afonso360 commented Jan 9, 2023 •

edited

Loading

jlb6740 commented Jan 9, 2023

afonso360 commented Jan 15, 2023 •

edited

Loading

jlb6740 commented Jan 15, 2023

cfallin commented Jan 17, 2023

afonso360 commented Jan 18, 2023 •

edited

Loading

EdorianDark commented Jan 29, 2023

bjorn3 commented Jan 29, 2023

afonso360 commented Jan 29, 2023

cfallin left a comment

jameysharp commented Feb 9, 2023

uweigand commented Feb 10, 2023

afonso360 commented Feb 10, 2023

uweigand commented Feb 10, 2023

cranelift: Optimize select+icmp into {s,u}{min,max} #5546

cranelift: Optimize select+icmp into {s,u}{min,max} #5546

Conversation

afonso360 commented Jan 7, 2023 • edited Loading

jameysharp left a comment

Choose a reason for hiding this comment

afonso360 commented Jan 9, 2023 • edited Loading

jlb6740 commented Jan 9, 2023

afonso360 commented Jan 15, 2023 • edited Loading

jlb6740 commented Jan 15, 2023

cfallin commented Jan 17, 2023

afonso360 commented Jan 18, 2023 • edited Loading

EdorianDark commented Jan 29, 2023

bjorn3 commented Jan 29, 2023

afonso360 commented Jan 29, 2023

cfallin left a comment

Choose a reason for hiding this comment

jameysharp commented Feb 9, 2023

uweigand commented Feb 10, 2023

afonso360 commented Feb 10, 2023

uweigand commented Feb 10, 2023

cranelift: Optimize `select+icmp` into `{s,u}{min,max}` #5546

cranelift: Optimize `select+icmp` into `{s,u}{min,max}` #5546

afonso360 commented Jan 7, 2023 •

edited

Loading

afonso360 commented Jan 9, 2023 •

edited

Loading

afonso360 commented Jan 15, 2023 •

edited

Loading

afonso360 commented Jan 18, 2023 •

edited

Loading