From 74f1b1182bedd44baf19cdbe0df7c2e69ba6bbd9 Mon Sep 17 00:00:00 2001
From: CF Bolz-Tereick
Date: Sun, 7 Jul 2024 14:17:09 +0200
Subject: [PATCH 01/16] start working on a z3 inefficiencies blog post

---
 .../2024/07/finding-inefficiencies-with-z3.md | 243 ++++++++++++++++++
 1 file changed, 243 insertions(+)
 create mode 100644 posts/2024/07/finding-inefficiencies-with-z3.md

diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md
new file mode 100644
index 000000000..b71161d18
--- /dev/null
+++ b/posts/2024/07/finding-inefficiencies-with-z3.md
@@ -0,0 +1,243 @@
+
+
+Last week I was at the [PLDI conference](https://pldi24.sigplan.org/) in
+Copenhagen to present a [paper](https://dl.acm.org/doi/10.1145/3652588.3663316)
+I co-authored with [Max Bernstein](https://bernsteinbear.com/). I also finally
+met [John Regehr](https://blog.regehr.org/) in person, after talking with him
+on social media for a long time. John has been working on compiler correctness
+and better techniques for building compilers and optimizers for a very long
+time. The blog post [Finding JIT Optimizer Bugs using SMT Solvers and
+Fuzzing](https://www.pypy.org/posts/2022/12/jit-bug-finding-smt-fuzzing.html)
+was heavily inspired by this work. We talked a lot about his and his group's
+work on using Z3 to find missing optimizations.
+
+In my previous blog post about finding optimizer bugs, I described how to use
+Z3 (an SMT solver) and fuzzing to find bugs in the integer optimization passes
+of PyPy's JIT. In this post I want to talk about early results from an
+experiment that came out of discussions with John (and was also inspired by a
+talk by Nikolai Tillmann about
+[SPUR](https://web.archive.org/web/20130503204413/http://www.cs.iastate.edu/~design/vmil/2010/slides/p03-tillmann-slides.pdf)
+that I saw many years ago). In this experiment I try to use Z3 to find missing
+[peephole optimizations]() for integer operations in the traces of PyPy's JIT.
+
+## Motivation
+
+PyPy's JIT has historically always focused almost exclusively on the typical
+inefficiencies of Python programs, such as removing boxing, run-time type
+feedback, speculation, type-specializing collections, etc. However, in the last
+two years I have applied the JIT also to a very different domain, that of CPU
+emulation, in the experimental [Pydrofoil](https://docs.pydrofoil.org) project
+(supports RISC-V and ARM by now). For Pydrofoil, the performance of integer
+operations is significantly more important than for "normal" Python code, which
+is why I've become much more interested in optimizing integer operations.
+
+Finding missing optimizations is quite hard to do manually, because it involves
+carefully staring at huge concrete traces of programs that we care about. So
+ideally we want to automate the process. To make sure that we care about the
+optimizations that we find, we start from the traces of benchmarks that are
+interesting, such as booting Linux on the ARM and RISC-V emulators.
+
+We could find missing optimizations by using LLVM or GCC if we could translate
+RPython JIT traces to C code. But then we would be relying on the quality of
+GCC optimizations. Instead, we will use Z3, an SMT ('satisfiability modulo
+theories') solver by Microsoft Research, which has an excellent Python API. Z3
+has very good support for reasoning about bitvectors, and we will use that to
+identify non-optimal integer operation sequences.
+
+## Background Z3
+
+Z3 is an SMT solver, which means that it can answer the question whether a
+certain formula is satisfiable. This means that it will try to find concrete
+values for all the variables that occur in the formulas such that the formula
+becomes true. The formulas can be about objects in various theories: Z3 has
+support for (mathematical) integers, strings, arrays, uninterpreted
+functions, etc. The theory we care about in this blog post is the **bitvector**
+theory. Bitvectors are vectors of bits of a fixed size, and they are great for
+expressing operations on machine integers with a fixed bit width. We use Z3
+extensively to fuzz the JIT's optimizer, so we already have a translation from
+JIT integer operations to Z3 formulas. I'm just reusing some of that
+infrastructure for finding missing optimizations.
+
+To give a bit of an idea of what using the Z3 Python API looks like, here's an
+example. Suppose we want to check that `x ^ -1` is the same as `~x` for
+arbitrary 64-bit integers (this is true because xor with a 1 bit flips the bit
+in `x`, and `-1` has all bits set in two's complement).
+
+```pycon
+>>>> import z3
+>>>> a = z3.BitVec('a', 64) # we construct a bitvector variable with width 64
+>>>> a ^ -1 == ~a # the variables have operators overloaded, so we can combine them into larger formulas
+a ^ 18446744073709551615 == ~a
+>>>> z3.prove(a ^ -1 == ~a) # now we can ask Z3 to prove the formula
+proved
+>>>> z3.prove(a ^ -1 == a) # try to prove something wrong
+counterexample
+[a = 0]
+```
+
+Or, an example with two variables: `a * (1 << b)` is the same as `a << b`:
+
+```pycon
+>>>> b = z3.BitVec('b', 64)
+>>>> z3.prove(a * (1 << b) == a << b)
+proved
+```
+
+## General Recipe for Finding Missing Optimizations
+
+Now that we have an idea of how Z3 and bitvectors work, we want to use them to
+find missing integer optimizations in the PyPy JIT. The general approach is to
+first collect a bunch of traces of integer-heavy programs. Then we encode the
+integer operations in the traces into Z3 formulas and use different Z3 queries
+to identify inefficiencies in the traces. Once we have found an inefficiency,
+we try to minimize the trace to the smallest trace that we can get that still
+shows an inefficiency, to make it easier to understand. Then we inspect the
+minimized cases manually and (if possible) try to implement the missing
+optimizations. I'll write about each of these steps in turn.
+
+## Encoding Traces as Z3 formulas
+
+The blog post about [using Z3 to find bugs]() already contained the code to
+encode PyPy trace operations into Z3 formulas, so we don't need to repeat that
+here. But to give an example, XXX
+
+## Identifying constant booleans with Z3
+
+The very simplest thing we can try to find inefficient parts of the traces is
+to first focus on boolean variables. They are represented as word-sized ints
+with values 0 and 1 in PyPy's JIT. For every boolean variable in the trace we
+can ask Z3 to prove that this variable must be always True or always False.
+Most of the time, neither of these proofs will succeed. But if Z3 manages to
+prove one of them, we have found an inefficiency: instead of computing the
+boolean result (e.g. by executing a comparison) we could have replaced the
+operation with the corresponding constant.
+
+Here's an example of an inefficiency found that way: if `x < y` and `y < z` are
+both true, PyPy's JIT isn't able to conclude from that that `x < z` must also
+be true.
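+
+Z3, on the other hand, has no trouble with this kind of query. Here's a small
+sketch of how such a check could look when using the Z3 API directly (just an
+illustration of the idea, not the actual implementation; the two guard
+conditions are assumed to come from earlier in the trace):
+
+```pycon
+>>>> x, y, z = z3.BitVecs('x y z', 64)
+>>>> solver = z3.Solver()
+>>>> solver.add(x < y) # a guard earlier in the trace established x < y
+>>>> solver.add(y < z) # another guard established y < z
+>>>> solver.check(z3.Not(x < z)) # can x < z ever be False?
+unsat
+```
+
+Getting `unsat` here means that `x < z` can never be False under these
+assumptions, so the comparison could have been replaced by the constant True.
+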
+The JIT reasons about the concrete ranges (lower and upper bounds) for
+every integer variable, but it has no way to remember anything about
+relationships between different variables. This kind of reasoning would quite
+often be useful to remove list/string bounds checks. Here's a [talk about how
+LLVM does this](https://www.youtube.com/watch?app=desktop&v=1hm5ZVmBEvo) (but
+it might be too heavyweight for a JIT setting).
+
+Here are some more examples found that way:
+
+- `x - 1 == x` is always False
+- `x - (x == -1) == -1` is always False. The pattern `x - (x == -1)` happens a
+  lot in PyPy's hash computations: To be compatible with the CPython hashes we
+  need to make sure that no object's hash is -1 (CPython uses -1 as an error
+  value on the C level).
+
+## Identifying redundant operations
+
+A more interesting class of redundancy is to try to find two operations in a
+trace that compute the same result. We can do that by asking Z3 to prove, for
+each pair of different operations in the trace, that the result is always the
+same. If a previous operation returns the same result, the JIT could have
+re-used that result instead of re-computing it, saving time. Doing this search
+for equivalent operations with Z3 is quadratic in the number of operations, but
+since traces have a maximum length it is not too bad in practice.
+
+This is the real workhorse of my script so far; it's what finds most of the
+inefficiencies. Here are a few examples:
+
+- The very first and super useful example the script found is `int_eq(b, 1) ==
+  b` if `b` is known to be a boolean (i.e. an integer 0 or 1). I have already
+  implemented this optimization in the JIT.
+- Similarly, `int_and(b, 1) == b` for booleans.
+- `(x << 4) & -0xf == x << 4`
+- `(((x >> 63) << 1) << 2) >> 3 == x >> 63`. In general the JIT is quite bad at
+  optimizing repeated shifts (the infrastructure for doing better with that is
+  already in place, so this will be a relatively easy fix).
+- `(x & 0xffffffff) | ((x >> 32) << 32) == x`. Having the JIT optimize this
+  would maybe require first recognizing that `(x >> 32) << 32` can be expressed
+  as a mask: `(x & 0xffffffff00000000)`, and then using `(x & c1) | (x & c2) ==
+  x & (c1 | c2)`
+- A commonly occurring pattern is variations of this one:
+  `((x & 1345) ^ 2048) - 2048 == x & 1345` (with different constants, of
+  course). xor is add without carry, and `x & 1345` does not have the bit
+  `2048` set. Therefore the `^ 2048` is equivalent to `+ 2048`, which the `-
+  2048` cancels. I don't understand at all why this appears so often in my
+  traces, but I see variations of it a lot.
+
+## Synthesizing more complicated constants with exists-forall
+
+To find out whether some integer operations always return a constant result, we
+can't simply use the same trick as for those operations that return boolean
+results. For bools, enumerating all possible results (True and False) and then
+checking is feasible, but for 64-bit integers it very much is not. Therefore we
+can ask Z3 to come up with a constant for us. To do this, we have to use an
+"exists-forall query". With it, we pose the following query for every
+operation `op` in the trace: "does there exist a constant `c` so that for all
+inputs of the trace the result of the operation `op` is equal to constant `c`?"
+If such a constant exists, we could have removed the operation, and replaced it
+with the constant that Z3 provides.
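+
+Concretely, such a query could look roughly like this in the Z3 Python API
+(again just a sketch to show the idea, using the `(x ^ 1) ^ x` example that is
+discussed below):
+
+```pycon
+>>>> x = z3.BitVec('x', 64)
+>>>> c = z3.BitVec('c', 64) # the constant Z3 should synthesize
+>>>> solver = z3.Solver()
+>>>> solver.add(z3.ForAll([x], (x ^ 1) ^ x == c)) # does such a c exist?
+>>>> solver.check()
+sat
+>>>> solver.model()[c] # ask for the concrete value of c
+1
+```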
+
+(This is a bit of a subtle point, but for synthesizing constants we use Z3 in
+the "opposite" mode than if we ask it to prove stuff. To prove things, we try
+to satisfy their negation. If that returns unsat, we know that the proof worked.
+To synthesize, we use exists-forall to check whether the forall-query is
+satisfiable in order to be able to get the concrete value of the constant `c`.)
+
+Here's an example of an inefficiency found this way: `(x ^ 1) ^ x == 1` (or,
+more generally: `(x ^ y) ^ x == y`).
+
+## Minimization
+
+Analyzing an inefficiency by hand in the context of a larger trace is quite
+tedious. Therefore I've implemented a (super inefficient) script to try to make
+the examples smaller. Here's how that works:
+- We throw out all the operations that occur *after* the inefficient operation
+  in the trace.
+- Then we remove all "dead" operations, i.e. operations that don't have their
+  results used (all the operations that we analyze are without side effects).
+- Now we try to remove every guard in the trace one by one and check
+  afterwards, whether the resulting trace still has an inefficiency.
+- We also try to replace every single operation with a new argument to the
+  trace, to see whether the inefficiency is still present. This process is
+  super inefficient and I should probably be using
+  [shrinkray](https://github.com/DRMacIver/shrinkray) or
+  [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it
+  works super well in practice and the runtime isn't too bad.
+
+## Results
+
+So far I am using the JIT traces of three programs: 1) Booting Linux on the
+Pydrofoil RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3)
+running the PyPy bootstrap process on top of PyPy. The script identifies 94
+inefficiencies in the traces, obviously a lot of them come from repeating
+patterns. My next steps will be to inspect them all, categorize them, and
+implement easy optimizations identified that way. I also want a way to sort the
+examples by execution count in the benchmarks, to get a feeling for which of
+them are most important.
+
+I didn't investigate the full set of [Python
+benchmarks](https://speed.pypy.org) that PyPy uses yet, because I don't expect
+them to contain interesting amounts of integer operations, but maybe I am wrong
+about that? Will have to try eventually.
+
+## Conclusion
+
+This was again much easier to do than I would have expected! Given that I had
+the translation of trace ops to Z3 already in place, it was a matter of about a
+day of programming to use this infrastructure to find the first problems and
+minimize them.
+
+Reusing the results of existing operations or replacing operations by constants
+can be seen as "zero-instruction superoptimization". I'll probably be rather
+busy for a while to add the missing optimizations identified by my simple
+script. But later extensions to actually synthesize one or several operations
+in the attempt to optimize the traces more and find more opportunities is
+possible.
From f138ddd8abcec2f62e85f054ab8a7fc02b5813de Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 12 Jul 2024 11:17:50 +0200 Subject: [PATCH 02/16] add nother example --- posts/2024/07/finding-inefficiencies-with-z3.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index b71161d18..8d4a21b73 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -191,8 +191,10 @@ to satisfy their negation. If that returns unsat, we now that the proof worked. To synthesize, we use exists-forall to check whether the forall-query is satisfiable in order to be able to get the concrete value of the constant `c`.) -Here's an examples of an inefficiency found this way: `(x ^ 1) ^ x == 1` (or, -more generally: `(x ^ y) ^ x == y`). +Here a few examples of inefficiencies found this way: + +- `(x ^ 1) ^ x == 1` (or, more generally: `(x ^ y) ^ x == y`) +- if `x | y == 0`, it follows that `x == 0` and `y == 0` ## Minimization From c6181c644ef1014845dc855963b9b3d7d30b3a40 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 12 Jul 2024 20:01:00 +0200 Subject: [PATCH 03/16] some notes and changes --- .../2024/07/finding-inefficiencies-with-z3.md | 128 ++++++++---------- 1 file changed, 56 insertions(+), 72 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index 8d4a21b73..aa91192c6 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -1,5 +1,5 @@ -Last week I was at the [PLDI conference](https://pldi24.sigplan.org/) in -Copenhagen to present a [paper](https://dl.acm.org/doi/10.1145/3652588.3663316) -I co-authored with [Max Bernstein](https://bernsteinbear.com/). I also finally -met [John Regehr](https://blog.regehr.org/), who I'd been talking on social -media for a long time but had never met. John has been working on compiler -correctness and better techniques for building compilers and optimizers since a -very long time. The blog post [Finding JIT Optimizer Bugs using SMT Solvers and -Fuzzing](https://www.pypy.org/posts/2022/12/jit-bug-finding-smt-fuzzing.html) -was heavily inspired by this work. We talked a lot about his and his groups -work on using Z3 to find missing optimizations. - -In my previous blog post about finding optimizer bugs, I described how to use -Z3 (an SMT solver) and fuzzing to find bugs in the integer optimization passes -of PyPy's JIT. In this post I want to talk about early results from an -experiment that came out of discussions with John (and also inspired by a talk -by Nikolai Tillmann about -[SPUR](https://web.archive.org/web/20130503204413/http://www.cs.iastate.edu/~design/vmil/2010/slides/p03-tillmann-slides.pdf) -that I saw many years ago). In this experiment I try to use Z3 to find missing -[peephole optimizations]() for integer operations in the traces of PyPy's JIT. - -## Motivation +In last weeks post I've described how to use Z3 to find simple local peephole +optimization patterns to for the integer operations in PyPy's JIT. An example +is `int_and(x, 0) -> 0`. The post ended by discussing problems: + +- combinatorial explosion +- non-minimal examples +- how do we know that we even care about the patterns that it finds? + +In this post I want to discuss a different method: + +- start from real programs that we care about (e.g. 
benchmarks) +- take their traces, ignore all the non-integer operations +- translate the traces into big Z3 formulas, operation by operation +- use Z3 to identify inefficiens in those concrete traces +- minimize the inefficient programs by removing as many operations as possible + +that way we don't have to generate all possible sequences of ir operations up +to a certain size, and we also get optimizations that we care about, because +the patterns come from real programs + +post will be more high-level and describe the general approach, but not go +through the implementation in detail. + +## Background PyPy's JIT has historically always focused almost exclusively on the typical inefficiencies of Python programs, such as removing boxing, run-time type @@ -39,7 +42,10 @@ two years I have applied the JIT also to a very different domain, that of CPU emulation, in the experimental [Pydrofoil](https://docs.pydrofoil.org) project (supports RISC-V and ARM by now). For Pydrofoil, the performance of integer operations is significantly more important than for "normal" Python code, which -is why I've become much more interested in optimizing integer operations. +is why I've become much more interested in optimizing integer operations. The +simple integer operation rewrites of the last post were all implemented long +ago, and so were a lot of more complicated ones. But there surely are a lot of +them missing, compared to mature compilers. Finding missing optimizations is quite hard to do manually, because it involves carefully staring at huge concrete traces of programs that we care about. So @@ -47,55 +53,10 @@ ideally we want to automate the problem. To make sure that we care about the optimizations that we find, we start from the traces of benchmarks that are interesting, such as booting Linux on the ARM and RISC-V emulators. -We could find missing optimizations by using LLVM or GCC if we could translate -RPython JIT traces to C code. But then we would be relying on the quality of -GCC optimizations. Instead, we will use Z3, an SMT ('satisfiability modulo -theories') solver by Microsoft Research, which was an excellent Python API. Z3 -has excellent support for reasoning about bitvectors, and we will use that to -identify non-optimal integer operation sequences. - -## Background Z3 - -Z3 is a STM solver, which means that it can answer the question whether a -certain formula is satisfiable. This means that it will try to find concrete -values for all the variables that occur in the formulas such that the formula -becomes true. The formulas can be about objects in various theories, Z3 has -support for (mathematical) integers, strings, arrays, unterinterpreted -functions, etc. The theory we care about in this blog post is the **bitvector** -theory. Bitvectors are vectors of bits of a fixed size, and they are great for -expressing operations on machine integers with a fixed bit width. We use Z3 -extensively to fuzz the JIT's optimizer, so we already have a translation from -JIT integer operations to Z3 formulas. I'm just reusing some of that -infrastructure for finding missing optimizations. - -To give a bit of an idea of what using the Z3 Python API looks like, here's an -example. Suppose we want to check that `x ^ -1` is the same as `~x` for -arbitrary 64-bit integers (this is true because xor with a 1 bit flips the bit -in `x`, and `-1` has all bits set in two's complement). 
- -```pycon ->>>> import z3 ->>>> a = z3.BitVec('a', 64) # we construct a bitvector variable with width 64 ->>>> a ^ -1 == ~a # the variables have operators overloaded, so we can combine them into larger formulas -a ^ 18446744073709551615 == ~a ->>>> z3.prove(a ^ -1 == ~a) # now we can ask Z3 to prove the formula -proved ->>>> z3.prove(a ^ -1 == a) # try to prove something wrong -counterexample -[a = 0] -``` - -Or, an example with two variables: `a * (1 << b)` is the same as `a << b`: - -```pycon ->>>> b = z3.BitVec('b', 64) ->>>> solver.prove(a * (1 << b) == a << b) -proved -``` ## General Recipe for Finding Missing Optimizations -Now that we have an idea how Z3 and bitvectors work, we want to use it to find +We want to use it to find missing integer optimizations in the PyPy JIT. The general approach is to first collect a bunch of traces of integer heavy programs. Then we encode the integer operations in the traces into Z3 formulas and use different Z3 queries to @@ -107,9 +68,32 @@ I'll write about each of these steps in turn. ## Encoding Traces as Z3 formulas -The blog post about [using Z3 to find bugs]() already contained the code to -encode PyPy trace operations into Z3 formulas, so we don't need to repeat that -here. But to give an example, XXX +The last blog post already contained the code to encode individual trace +operations into Z3 formulas, so we don't need to repeat that here. To encode +traces of operations we introduce a Z3 variable for every operation in the +trace and then call the `z3_expression` function for every single one of the +operations in the trace. + +For example, for the following trace: + +``` +[i1] +i2 = uint_rshift(i1, 32) +i3 = int_and(i2, 65535) +i4 = uint_rshift(i1, 48) +i5 = int_lshift(i4, 16) +i6 = int_or(i5, i3) +jump(i6, i2) # equal +``` + +We would get the Z3 formula: + +``` +z3.And(i2 == LShR(i1, 32), + i3 == i2 & 65535, + i4 == LShR(i1, 48), + i5 == i4 << 16) +``` ## Identifying constant booleans with Z3 From 83c62fe089d3d517f8c1e3fbc07b26448c217250 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 14 Jul 2024 15:11:34 +0200 Subject: [PATCH 04/16] some tweaks --- .../2024/07/finding-inefficiencies-with-z3.md | 64 +++++++++---------- 1 file changed, 30 insertions(+), 34 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index aa91192c6..aac102d6d 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -98,8 +98,7 @@ z3.And(i2 == LShR(i1, 32), ## Identifying constant booleans with Z3 The very simplest thing we can try to find inefficient parts of the traces is -to first focus on boolean variables. They are represented as word-sized ints -with values 0 and 1 in PyPy's JIT. For every boolean variable in the trace we +to first focus on boolean variables. For every boolean variable in the trace we can ask Z3 to prove that this variable must be always True or always False. Most of the time, neither of these proofs will succeed. But if Z3 manages to prove one of them, we know have found an ineffiency: instead of computing the @@ -107,13 +106,14 @@ boolean result (eg by executing a comparison) we could have replaced the operation with the corresponding constant. Here's an example of an inefficiency found that way: if `x < y` and `y < z` are -both true, PyPy's JIT isn't able to conclude from that that `x < z` must also -be true. 
The JIT reasons about the concrete ranges (lower and upper bounds) for -every integer variable, but it has no way to remember anything about -relationships between different variables. This kind of reasoning would quite -often be useful to remove list/string bounds checks. Here's a [talk about how -LLVM does this](https://www.youtube.com/watch?app=desktop&v=1hm5ZVmBEvo) (but -it might be too heavyweight for a JIT setting). +both true, PyPy's JIT could conclude that `x < z` must also +be true. However, currently the JIT cannot make that conclusion because the JIT +only reasons about the concrete ranges (lower and upper bounds) for every +integer variable, but it has no way to remember anything about relationships +between different variables. This kind of reasoning would quite often be useful +to remove list/string bounds checks. Here's a [talk about how LLVM does +this](https://www.youtube.com/watch?app=desktop&v=1hm5ZVmBEvo) (but it might be +too heavyweight for a JIT setting). Here are some more examples found that way: @@ -153,27 +153,21 @@ inefficiencies. Here's a few examples: `((x & 1345) ^ 2048) - 2048 == x & 1345` (with different constants, of course). xor is add without carry, and `x & 1345` does not have the bit `2048` set. Therefore the `^ 2048` is equivalent to `+ 2048`, which the `- - 2048` cancels. I don't understand at all why this appears so often in my - traces, but I see variations of it a lot. + 2048` cancels. More generally, if `a & b == 0`, then `a + b == a | b == a ^ + b`. I don't understand at all why this appears so often in my traces, but I + see variations of it a lot. LLVM can optimize this, but [GCC + can't](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115829), thanks to + [Andrew Pinski for filing the + bug](https://hachyderm.io/@pinskia/112752641328799157)! ## Synthesizing more complicated constants with exists-forall To find out whether some integer operations always return a constant result, we can't simply use the same trick as for those operations that return boolean -results. For bools, enumerating all possible results (True and False) and then -checking is feasible, but for 64-bit integers it very much is not. Therefore we -can ask Z3 to come up with a constant for us. To do this, we have to use an -"exists-forall query". To use this, we pose the following query for every -operation `op` in the trace: "does there exist a constant `c` so that for all -inputs of the trace the result of the operation `op` is equal to constant `c`?" -If such a constant exists, we could have removed the operation, and replaced it -with the constant that Z3 provides. - -(This is a bit of a subtle point, but for synthesizing constants we use Z3 in -the "opposite" mode than if we ask it to prove stuff. To prove things, we try -to satisfy their negation. If that returns unsat, we now that the proof worked. -To synthesize, we use exists-forall to check whether the forall-query is -satisfiable in order to be able to get the concrete value of the constant `c`.) +results. Like in the last post, we can use `z3.ForAll` to find out whether Z3 +can synthesize constants for us, for the result of an operation in ints +context. If such a constant exists, we could have removed the operation, and +replaced it with the constant that Z3 provides. Here a few examples of inefficiencies found this way: @@ -188,15 +182,17 @@ the examples smaller. Here's how that works: - We throw out all the operations that occur *after* the inefficient operation in the trace. 
- Then we remove all "dead" operations, ie operations that don't have their - results used (all the operations that we analyze are without side effects). + results used (all the operations that we can analyze with Z3 are without side + effects). - Now we try to remove every guard in the trace one by one and check afterwards, whether the resulting trace still has an inefficiency. - We also try to replace every single operation with a new argument to the - trace, to see whether the inefficiency is still present. This process is - super inefficient and I should probably be using - [shrinkray](https://github.com/DRMacIver/shrinkray) or - [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it - works super well in practice and the runtime isn't too bad. + trace, to see whether the inefficiency is still present. + + The minimization process is super inefficient and I should probably be using + [shrinkray](https://github.com/DRMacIver/shrinkray) or + [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it + seems to work well in practice and the runtime isn't too bad. ## Results @@ -204,7 +200,7 @@ So far I am using the JIT traces of three programs: 1) Booting Linux on the Pydrofoil RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3) running the PyPy bootstrap process on top of PyPy. The script identifies 94 inefficiencies in the traces, obviously a lot of them come from repeating -patterns. My next steps will be to inspect them all, categorize them, and +patterns. My next steps will be to manually inspect them all, categorize them, and implement easy optimizations identified that way. I also want a way to sort the examples by execution count in the benchmarks, to get a feeling for which of them are most important. @@ -225,5 +221,5 @@ Reusing the results of existing operations or replacing operations by constants can be seen as "zero-instruction superoptimization". I'll probably be rather busy for a while to add the missing optimizations identified by my simple script. But later extensions to actually synthesize one or several operations -in the attempt to optimize the traces more and find more opportunities is -possible. +in the attempt to optimize the traces more and find more opportunities should +be possible. From a1495375567b91b446aac9a53d85a88491c5b957 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 14 Jul 2024 15:14:12 +0200 Subject: [PATCH 05/16] add links --- posts/2024/07/finding-inefficiencies-with-z3.md | 8 +++++--- posts/2024/07/finding-simplification-rules-with-z3.md | 2 +- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index aac102d6d..b18f1012c 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -10,9 +10,11 @@ .. author: CF Bolz-Tereick --> -In last weeks post I've described how to use Z3 to find simple local peephole -optimization patterns to for the integer operations in PyPy's JIT. An example -is `int_and(x, 0) -> 0`. The post ended by discussing problems: +In last weeks post I've described [how to use Z3 to find simple local peephole +optimization +patterns](finding-simple-rewrite-rules-jit-z3.html) +to for the integer operations in PyPy's JIT. An example is `int_and(x, 0) -> +0`. 
The post ended by discussing problems: - combinatorial explosion - non-minimal examples diff --git a/posts/2024/07/finding-simplification-rules-with-z3.md b/posts/2024/07/finding-simplification-rules-with-z3.md index d2f08dbf5..f3e21b7d2 100644 --- a/posts/2024/07/finding-simplification-rules-with-z3.md +++ b/posts/2024/07/finding-simplification-rules-with-z3.md @@ -666,7 +666,7 @@ looking at sequences of instructions: that they typically aren't found in realistic programs. It would be much better to somehow focus on the patterns that real benchmarks are using. -In the next blog post I'll discuss an alternative approach to simply generating +In the [next blog post](finding-missing-optimizations-z3.html) I'll discuss an alternative approach to simply generating all possible sequences of instructions, that tries to address these problems. This works by analyzing the real traces of benchmarks and mining those for inefficiencies, which only shows problems that occur in actual programs. From e27ffcd57f6c19aba314dae742df77c13b819f63 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 19 Jul 2024 12:22:04 +0200 Subject: [PATCH 06/16] add pseudo-code, a bunch of other changes --- .../2024/07/finding-inefficiencies-with-z3.md | 297 +++++++++++++----- 1 file changed, 220 insertions(+), 77 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index b18f1012c..7048ac7ca 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -1,6 +1,6 @@ -In last weeks post I've described [how to use Z3 to find simple local peephole -optimization -patterns](finding-simple-rewrite-rules-jit-z3.html) +In my last post I've described [how to use Z3 to find simple local peephole +optimization patterns](finding-simple-rewrite-rules-jit-z3.html) to for the integer operations in PyPy's JIT. An example is `int_and(x, 0) -> -0`. The post ended by discussing problems: - -- combinatorial explosion -- non-minimal examples -- how do we know that we even care about the patterns that it finds? - -In this post I want to discuss a different method: - -- start from real programs that we care about (e.g. benchmarks) -- take their traces, ignore all the non-integer operations -- translate the traces into big Z3 formulas, operation by operation -- use Z3 to identify inefficiens in those concrete traces -- minimize the inefficient programs by removing as many operations as possible - -that way we don't have to generate all possible sequences of ir operations up -to a certain size, and we also get optimizations that we care about, because -the patterns come from real programs - -post will be more high-level and describe the general approach, but not go -through the implementation in detail. - -## Background - -PyPy's JIT has historically always focused almost exclusively on the typical -inefficiencies of Python programs, such as removing boxing, run-time type -feedback, speculation type-specializing collections, etc. However, in the last -two years I have applied the JIT also to a very different domain, that of CPU -emulation, in the experimental [Pydrofoil](https://docs.pydrofoil.org) project -(supports RISC-V and ARM by now). For Pydrofoil, the performance of integer -operations is significantly more important than for "normal" Python code, which -is why I've become much more interested in optimizing integer operations. 
The -simple integer operation rewrites of the last post were all implemented long -ago, and so were a lot of more complicated ones. But there surely are a lot of -them missing, compared to mature compilers. - -Finding missing optimizations is quite hard to do manually, because it involves -carefully staring at huge concrete traces of programs that we care about. So -ideally we want to automate the problem. To make sure that we care about the -optimizations that we find, we start from the traces of benchmarks that are -interesting, such as booting Linux on the ARM and RISC-V emulators. - - -## General Recipe for Finding Missing Optimizations - -We want to use it to find -missing integer optimizations in the PyPy JIT. The general approach is to first -collect a bunch of traces of integer heavy programs. Then we encode the integer -operations in the traces into Z3 formulas and use different Z3 queries to -identify inefficiencies in the traces. Once we found an inefficiency, we try to -minimize the trace to the smallest trace that we can get that still shows an -inefficiency to make it easier to understand. Then we inspect the minimized -cases manually and (if possible) try to implement the missing optimizations. -I'll write about each of these steps in turn. - -## Encoding Traces as Z3 formulas +0`. In this post I want to scale up the problem of identifying possible +optimizations to much bigger sequences, also using Z3. For that, I am starting +with the JIT traces of **real benchmarks**, after they have been optimized by +the optimizer of PyPy's JTI. Then we can ask Z3 to find inefficient integer +operations in those traces. + +Starting from the optimized traces of real programs has a bunch of big +advantages over the "classical" superoptimization approach of generating and +then trying all possible sequences of instructions. It avoids the +combinatorical explosion that happens with the latter approach. Starting from +the traces of benchmarks or (even better) actual programs also makes sure that +the missing optimizations that we actually care about the missing optimizations +that are found in this way. And because the traces are analyzed after they have +been optimized by PyPy's optimizer, we only get reports for *missing* +optimizations, that the JIT isn't able to do (yet). + +The techniques and experiments I describe in this post are again the result of +a bunch of discussions with John Regehr at a conference a few weeks ago, as +well as reading his blog posts and papers. Thanks John! + + +## High-Level Approach + +The approach works as follows: + +- Run benchmarks or other interesting programs and then dump the IR of the JIT + traces into a file. The traces have at that point been already optimized by + the PyPy JIT's optimizer. +- For every trace, ignore all the operations on non-integer variables. +- Translate every integer operation into a Z3 formula. +- For every operation, use Z3 to find out whether the operation is redundant + (how that is done is described below). +- If the operation is redundant, the trace is less efficient than it could have + been, because the optimizer could also have removed the operation. Report the + inefficiency. +- Minimize the inefficient programs by removing as many operations as possible + to make the problem easier to understand. + +In the post I will describe the details and show some pseudocode of the +approach. I'll also make the proper code public eventually (but it needs a +healthy dose of cleanups first). 
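+
+The pseudocode in the following sections uses two small helpers that come from
+the previous post: `z3_expression`, which turns a single trace operation into
+a Z3 formula (plus an optional validity condition), and `prove`, which checks
+whether a formula is always true. As a rough reminder, here's a minimal sketch
+of the two (the details are simplified assumptions for this post, not the real
+implementation):
+
+```python
+import z3
+
+INTEGER_WIDTH = 64
+
+def z3_expression(opname, arg0, arg1=None):
+    # sketch: handle a handful of operations only, the real version
+    # covers all the integer operations of the JIT
+    if opname == "int_and":
+        return arg0 & arg1, None
+    if opname == "int_or":
+        return arg0 | arg1, None
+    if opname == "int_lshift":
+        return arg0 << arg1, None
+    if opname == "uint_rshift":
+        return z3.LShR(arg0, arg1), None
+    raise NotImplementedError(opname)
+
+def prove(solver, cond):
+    # a condition is proven if its negation is unsatisfiable, taking
+    # the assertions already added to the solver into account
+    return solver.check(z3.Not(cond)) == z3.unsat
+```
+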
+ + +## Encoding Traces as Z3 formulas The last blog post already contained the code to encode individual trace operations into Z3 formulas, so we don't need to repeat that here. To encode @@ -97,15 +85,50 @@ z3.And(i2 == LShR(i1, 32), i5 == i4 << 16) ``` +Usually we won't ask for the formula of the whole trace at once. Instead we go +through the trace operation by operation and try to find inefficiencies in the +current one we are looking at. Roughly like this (pseudo-)code: + +```python +def newvar(name): + return z3.BitVec(name, INTEGER_WIDTH) + +def find_inefficiencies(trace): + solver = z3.Solver() + var_to_z3var = {} + for input_argument in trace.inputargs: + var_to_z3var[input_argument] = newz3var(input_argument) + for op in trace: + var_to_z3var[op] = z3resultvar = newz3var(op.resultvarname) + arg0 = op.args[0] + z3arg0 = var_to_z3var[arg0] + if len(op.args) == 2: + arg1 = op.args[1] + z3arg1 = var_to_z3var[arg1] + else: + z3arg1 = None + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # checking for inefficiencies, see the next sections + ... + if ...: + return "inefficient", op + + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` + + + ## Identifying constant booleans with Z3 The very simplest thing we can try to find inefficient parts of the traces is -to first focus on boolean variables. For every boolean variable in the trace we -can ask Z3 to prove that this variable must be always True or always False. -Most of the time, neither of these proofs will succeed. But if Z3 manages to -prove one of them, we know have found an ineffiency: instead of computing the -boolean result (eg by executing a comparison) we could have replaced the -operation with the corresponding constant. +to first focus on boolean variables. For every operation in the trace that +returns a bool we can ask Z3 to prove that this variable must be always True or +always False. Most of the time, neither of these proofs will succeed. But if Z3 +manages to prove one of them, we know have found an ineffiency: instead of +computing the boolean result (eg by executing a comparison) the JIT's optimizer +could have replaced the operation with the corresponding boolean constant. Here's an example of an inefficiency found that way: if `x < y` and `y < z` are both true, PyPy's JIT could conclude that `x < z` must also @@ -119,13 +142,37 @@ too heavyweight for a JIT setting). Here are some more examples found that way: -- `x - 1 == x` is always False +- `x - 1 == x` is always False - `x - (x == -1) == -1` is always False. The pattern `x - (x == -1)` happens a lot in PyPy's hash computations: To be compatible with the CPython hashes we need to make sure that no object's hash is -1 (CPython uses -1 as an error value on the C level). -## Identifying redundant operations +Here's pseudo-code for how to implement this, it would go in the middle of the +loop above: + +```python +def find_inefficiencies(trace): + ... + for op in trace: + ... + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # check for boolean constant result + if op.has_boolean_result(): + if prove(solver, res == 0): + return "inefficient", op, 0 + if prove(solver, res == 1): + return "inefficient", op, 1 + # checking for other inefficiencies, see the next sections + ... 
+ + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` + + +## Identifying redundant operations A more interesting class of redundancy is to try to find two operations in a trace that compute the same result. We can do that by asking Z3 to prove for @@ -162,21 +209,89 @@ inefficiencies. Here's a few examples: [Andrew Pinski for filing the bug](https://hachyderm.io/@pinskia/112752641328799157)! -## Synthesizing more complicated constants with exists-forall + +And here's some implementation pseudo-code again: + +```python +def find_inefficiencies(trace): + ... + for op in trace: + ... + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # check for boolean constant result + ... + # searching for redundant operations + for previous_op in trace: + if previous_op is op: + break # done, reached the current op + previous_op_z3var = var_to_z3var[previous_op] + if prove(solver, previous_op_z3var == res): + return "inefficient", op, previous_op + ... + # more code here later + ... + + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` + +## Synthesizing more complicated constants with exists-forall To find out whether some integer operations always return a constant result, we can't simply use the same trick as for those operations that return boolean -results. Like in the last post, we can use `z3.ForAll` to find out whether Z3 -can synthesize constants for us, for the result of an operation in ints -context. If such a constant exists, we could have removed the operation, and -replaced it with the constant that Z3 provides. +results, because enumerating $2^64$ possible constants and checking them all +would take too long. Like in the last post, we can use `z3.ForAll` to find out +whether Z3 can synthesize constants for us, for the result of an operation in +ints context. If such a constant exists, we could have removed the operation, +and replaced it with the constant that Z3 provides. Here a few examples of inefficiencies found this way: - `(x ^ 1) ^ x == 1` (or, more generally: `(x ^ y) ^ x == y`) - if `x | y == 0`, it follows that `x == 0` and `y == 0` +- if `x != MAXINT`, then `x + 1 > x` + +Implementing this is actually slightly annoying. The `solver.add` calls for +non-inefficient ops add assertions to the solver, which are now confusing the +`z3.ForAll` query. We could remove all assertion from the solver, then do the +`ForAll` query, then add the assertions back. What I ended doing instead was +instantiating a second solver object that I'm using for the `ForAll` queries, +that remains empty the whole time. + +```python +def find_inefficiencies(trace): + solver = z3.Solver() + empty_solver = z3.Solver() + var_to_z3var = {} + ... + for op in trace: + ... + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # check for boolean constant result + ... + # searching for redundant operations + ... 
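+        # the exists-forall query below asks: is there a single value for
+        # constvar that the current operation's result is equal to, for
+        # *all* possible values of the trace variables (given everything
+        # the solver knows)? if yes, the operation always computes a constant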
+ # checking for constant results + constvar = z3.BitVec('find_const', INTEGER_WIDTH) + condition = z3.ForAll( + var_to_z3var.values(), + z3.Implies( + *solver.assertions(), + expr == constvar + ) + ) + if empty_solver.check(condition) == z3.sat: + model = empty_solver.model() + const = model[constvar].as_signed_long() + return "inefficient", op, const + + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` -## Minimization +## Minimization Analyzing an inefficiency by hand in the context of a larger trace is quite tedious. Therefore I've implemented a (super inefficient) script to try to make @@ -196,10 +311,10 @@ the examples smaller. Here's how that works: [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it seems to work well in practice and the runtime isn't too bad. -## Results +## Results So far I am using the JIT traces of three programs: 1) Booting Linux on the -Pydrofoil RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3) +[Pydrofoil](https://docs.pydrofoil.org) RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3) running the PyPy bootstrap process on top of PyPy. The script identifies 94 inefficiencies in the traces, obviously a lot of them come from repeating patterns. My next steps will be to manually inspect them all, categorize them, and @@ -217,7 +332,7 @@ about that? Will have to try eventually. This was again much easier to do than I would have expected! Given that I had the translation of trace ops to Z3 already in place, it was a matter of about a day's of programming to use this infrastructure to find the first problems and -minimizing them. +minimizing them. Reusing the results of existing operations or replacing operations by constants can be seen as "zero-instruction superoptimization". I'll probably be rather @@ -225,3 +340,31 @@ busy for a while to add the missing optimizations identified by my simple script. But later extensions to actually synthesize one or several operations in the attempt to optimize the traces more and find more opportunities should be possible. + +Finding inefficiencies in traces with Z3 is certainly significantly less +annoying and also less error-prone than just manually inspecting traces and +trying to spot optimization opportunities. + + +## Random Notes and Sources + +Again, John's blog posts: + +- [Let’s Work on an LLVM Superoptimizer](https://blog.regehr.org/archives/1109) +- [Early Superoptimizer Results](https://blog.regehr.org/archives/1146) +- [A Few Synthesizing Superoptimizer Results](https://blog.regehr.org/archives/1252) +- [Synthesizing Constants](https://blog.regehr.org/archives/1636) + +and papers: + +- [A Synthesizing Superoptimizer](https://arxiv.org/pdf/1711.04422) +- [Hydra: Generalizing Peephole Optimizations with Program Synthesis](https://dl.acm.org/doi/pdf/10.1145/3649837) + +I remembered recently that I had seen the approach of optimizing the traces of +a tracing JIT with Z3 a long time ago, as part of the (now long dead, I think) +[SPUR +project](https://web.archive.org/web/20160304055149/http://research.microsoft.com/en-us/projects/spur/). +There's a [workshop +paper](https://web.archive.org/web/20161029162737/http://csl.stanford.edu/~christos/pldi2010.fit/tillmann.provers4jit.pdf) +from 2010 about this. In addition to bitvectors, they also use the Z3 support +for arrays to model the C# heap and remove redundant stores. 
From 654fb2eaedd9aa129ff1bd3c726e400ccecf055c Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 19 Jul 2024 12:56:45 +0200 Subject: [PATCH 07/16] some more edits --- .../2024/07/finding-inefficiencies-with-z3.md | 79 ++++++++++++------- 1 file changed, 49 insertions(+), 30 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index 7048ac7ca..40554b6ca 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -1,7 +1,7 @@