From 74f1b1182bedd44baf19cdbe0df7c2e69ba6bbd9 Mon Sep 17 00:00:00 2001
From: CF Bolz-Tereick
Date: Sun, 7 Jul 2024 14:17:09 +0200
Subject: [PATCH 01/16] start working on a z3 inefficiencies blog post

---
 .../2024/07/finding-inefficiencies-with-z3.md | 243 ++++++++++++++++++
 1 file changed, 243 insertions(+)
 create mode 100644 posts/2024/07/finding-inefficiencies-with-z3.md

diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md
new file mode 100644
index 000000000..b71161d18
--- /dev/null
+++ b/posts/2024/07/finding-inefficiencies-with-z3.md
@@ -0,0 +1,243 @@
+
+
+Last week I was at the [PLDI conference](https://pldi24.sigplan.org/) in
+Copenhagen to present a [paper](https://dl.acm.org/doi/10.1145/3652588.3663316)
+I co-authored with [Max Bernstein](https://bernsteinbear.com/). I also finally
+met [John Regehr](https://blog.regehr.org/) in person, after talking with him
+on social media for a long time. John has been working on compiler correctness
+and better techniques for building compilers and optimizers for a very long
+time. The blog post [Finding JIT Optimizer Bugs using SMT Solvers and
+Fuzzing](https://www.pypy.org/posts/2022/12/jit-bug-finding-smt-fuzzing.html)
+was heavily inspired by this work. We talked a lot about his and his group's
+work on using Z3 to find missing optimizations.
+
+In my previous blog post about finding optimizer bugs, I described how to use
+Z3 (an SMT solver) and fuzzing to find bugs in the integer optimization passes
+of PyPy's JIT. In this post I want to talk about early results from an
+experiment that came out of discussions with John (and was also inspired by a
+talk by Nikolai Tillmann about
+[SPUR](https://web.archive.org/web/20130503204413/http://www.cs.iastate.edu/~design/vmil/2010/slides/p03-tillmann-slides.pdf)
+that I saw many years ago). In this experiment I try to use Z3 to find missing
+[peephole optimizations]() for integer operations in the traces of PyPy's JIT.
+
+## Motivation
+
+PyPy's JIT has historically always focused almost exclusively on the typical
+inefficiencies of Python programs, such as removing boxing, run-time type
+feedback, speculation, type-specializing collections, etc. However, in the last
+two years I have applied the JIT also to a very different domain, that of CPU
+emulation, in the experimental [Pydrofoil](https://docs.pydrofoil.org) project
+(supports RISC-V and ARM by now). For Pydrofoil, the performance of integer
+operations is significantly more important than for "normal" Python code, which
+is why I've become much more interested in optimizing integer operations.
+
+Finding missing optimizations is quite hard to do manually, because it involves
+carefully staring at huge concrete traces of programs that we care about. So
+ideally we want to automate the process. To make sure that we care about the
+optimizations that we find, we start from the traces of benchmarks that are
+interesting, such as booting Linux on the ARM and RISC-V emulators.
+
+We could find missing optimizations by using LLVM or GCC if we could translate
+RPython JIT traces to C code. But then we would be relying on the quality of
+GCC optimizations. Instead, we will use Z3, an SMT ('satisfiability modulo
+theories') solver by Microsoft Research, which has an excellent Python API. Z3
+has very good support for reasoning about bitvectors, and we will use that to
+identify non-optimal integer operation sequences.
+
+## Background Z3
+
+Z3 is an SMT solver, which means that it can answer the question whether a
+certain formula is satisfiable. This means that it will try to find concrete
+values for all the variables that occur in the formulas such that the formula
+becomes true. The formulas can be about objects in various theories: Z3 has
+support for (mathematical) integers, strings, arrays, uninterpreted
+functions, etc. The theory we care about in this blog post is the **bitvector**
+theory. Bitvectors are vectors of bits of a fixed size, and they are great for
+expressing operations on machine integers with a fixed bit width. We use Z3
+extensively to fuzz the JIT's optimizer, so we already have a translation from
+JIT integer operations to Z3 formulas. I'm just reusing some of that
+infrastructure for finding missing optimizations.
+
+To give a bit of an idea of what using the Z3 Python API looks like, here's an
+example. Suppose we want to check that `x ^ -1` is the same as `~x` for
+arbitrary 64-bit integers (this is true because xor with a 1 bit flips the bit
+in `x`, and `-1` has all bits set in two's complement).
+
+```pycon
+>>>> import z3
+>>>> a = z3.BitVec('a', 64) # we construct a bitvector variable with width 64
+>>>> a ^ -1 == ~a # the variables have operators overloaded, so we can combine them into larger formulas
+a ^ 18446744073709551615 == ~a
+>>>> z3.prove(a ^ -1 == ~a) # now we can ask Z3 to prove the formula
+proved
+>>>> z3.prove(a ^ -1 == a) # try to prove something wrong
+counterexample
+[a = 0]
+```
+
+Or, an example with two variables: `a * (1 << b)` is the same as `a << b`:
+
+```pycon
+>>>> b = z3.BitVec('b', 64)
+>>>> z3.prove(a * (1 << b) == a << b)
+proved
+```
+
+## General Recipe for Finding Missing Optimizations
+
+Now that we have an idea of how Z3 and bitvectors work, we want to use them to
+find missing integer optimizations in the PyPy JIT. The general approach is to
+first collect a bunch of traces of integer-heavy programs. Then we encode the
+integer operations in the traces into Z3 formulas and use different Z3 queries
+to identify inefficiencies in the traces. Once we have found an inefficiency,
+we try to minimize the trace to the smallest trace that we can get that still
+shows an inefficiency, to make it easier to understand. Then we inspect the
+minimized cases manually and (if possible) try to implement the missing
+optimizations. I'll write about each of these steps in turn.
+
+## Encoding Traces as Z3 formulas
+
+The blog post about [using Z3 to find bugs]() already contained the code to
+encode PyPy trace operations into Z3 formulas, so we don't need to repeat that
+here. But to give an example, XXX
+
+## Identifying constant booleans with Z3
+
+The very simplest thing we can try to find inefficient parts of the traces is
+to first focus on boolean variables. They are represented as word-sized ints
+with values 0 and 1 in PyPy's JIT. For every boolean variable in the trace we
+can ask Z3 to prove that this variable must be always True or always False.
+Most of the time, neither of these proofs will succeed. But if Z3 manages to
+prove one of them, we have found an inefficiency: instead of computing the
+boolean result (e.g. by executing a comparison) we could have replaced the
+operation with the corresponding constant.
+
+Here's an example of an inefficiency found that way: if `x < y` and `y < z` are
+both true, PyPy's JIT isn't able to conclude from that that `x < z` must also
+be true.
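+
+Z3, on the other hand, has no trouble with this kind of query. Here's a small
+sketch of how such a check could look when using the Z3 API directly (just an
+illustration of the idea, not the actual implementation; the two guard
+conditions are assumed to come from earlier in the trace):
+
+```pycon
+>>>> x, y, z = z3.BitVecs('x y z', 64)
+>>>> solver = z3.Solver()
+>>>> solver.add(x < y) # a guard earlier in the trace established x < y
+>>>> solver.add(y < z) # another guard established y < z
+>>>> solver.check(z3.Not(x < z)) # can x < z ever be False?
+unsat
+```
+
+Getting `unsat` here means that `x < z` can never be False under these
+assumptions, so the comparison could have been replaced by the constant True.
+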
+The JIT reasons about the concrete ranges (lower and upper bounds) for
+every integer variable, but it has no way to remember anything about
+relationships between different variables. This kind of reasoning would quite
+often be useful to remove list/string bounds checks. Here's a [talk about how
+LLVM does this](https://www.youtube.com/watch?app=desktop&v=1hm5ZVmBEvo) (but
+it might be too heavyweight for a JIT setting).
+
+Here are some more examples found that way:
+
+- `x - 1 == x` is always False
+- `x - (x == -1) == -1` is always False. The pattern `x - (x == -1)` happens a
+  lot in PyPy's hash computations: To be compatible with the CPython hashes we
+  need to make sure that no object's hash is -1 (CPython uses -1 as an error
+  value on the C level).
+
+## Identifying redundant operations
+
+A more interesting class of redundancy is to try to find two operations in a
+trace that compute the same result. We can do that by asking Z3 to prove, for
+each pair of different operations in the trace, that the result is always the
+same. If a previous operation returns the same result, the JIT could have
+re-used that result instead of re-computing it, saving time. Doing this search
+for equivalent operations with Z3 is quadratic in the number of operations, but
+since traces have a maximum length it is not too bad in practice.
+
+This is the real workhorse of my script so far; it's what finds most of the
+inefficiencies. Here are a few examples:
+
+- The very first and super useful example the script found is `int_eq(b, 1) ==
+  b` if `b` is known to be a boolean (i.e. an integer 0 or 1). I have already
+  implemented this optimization in the JIT.
+- Similarly, `int_and(b, 1) == b` for booleans.
+- `(x << 4) & -0xf == x << 4`
+- `(((x >> 63) << 1) << 2) >> 3 == x >> 63`. In general the JIT is quite bad at
+  optimizing repeated shifts (the infrastructure for doing better with that is
+  already in place, so this will be a relatively easy fix).
+- `(x & 0xffffffff) | ((x >> 32) << 32) == x`. Having the JIT optimize this
+  would maybe require first recognizing that `(x >> 32) << 32` can be expressed
+  as a mask: `(x & 0xffffffff00000000)`, and then using `(x & c1) | (x & c2) ==
+  x & (c1 | c2)`
+- A commonly occurring pattern is variations of this one:
+  `((x & 1345) ^ 2048) - 2048 == x & 1345` (with different constants, of
+  course). xor is add without carry, and `x & 1345` does not have the bit
+  `2048` set. Therefore the `^ 2048` is equivalent to `+ 2048`, which the `-
+  2048` cancels. I don't understand at all why this appears so often in my
+  traces, but I see variations of it a lot.
+
+## Synthesizing more complicated constants with exists-forall
+
+To find out whether some integer operations always return a constant result, we
+can't simply use the same trick as for those operations that return boolean
+results. For bools, enumerating all possible results (True and False) and then
+checking is feasible, but for 64-bit integers it very much is not. Therefore we
+can ask Z3 to come up with a constant for us. To do this, we have to use an
+"exists-forall query". With it, we pose the following query for every
+operation `op` in the trace: "does there exist a constant `c` so that for all
+inputs of the trace the result of the operation `op` is equal to constant `c`?"
+If such a constant exists, we could have removed the operation, and replaced it
+with the constant that Z3 provides.
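+
+Concretely, such a query could look roughly like this in the Z3 Python API
+(again just a sketch to show the idea, using the `(x ^ 1) ^ x` example that is
+discussed below):
+
+```pycon
+>>>> x = z3.BitVec('x', 64)
+>>>> c = z3.BitVec('c', 64) # the constant Z3 should synthesize
+>>>> solver = z3.Solver()
+>>>> solver.add(z3.ForAll([x], (x ^ 1) ^ x == c)) # does such a c exist?
+>>>> solver.check()
+sat
+>>>> solver.model()[c] # ask for the concrete value of c
+1
+```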
+
+(This is a bit of a subtle point, but for synthesizing constants we use Z3 in
+the "opposite" mode than if we ask it to prove stuff. To prove things, we try
+to satisfy their negation. If that returns unsat, we know that the proof worked.
+To synthesize, we use exists-forall to check whether the forall-query is
+satisfiable in order to be able to get the concrete value of the constant `c`.)
+
+Here's an example of an inefficiency found this way: `(x ^ 1) ^ x == 1` (or,
+more generally: `(x ^ y) ^ x == y`).
+
+## Minimization
+
+Analyzing an inefficiency by hand in the context of a larger trace is quite
+tedious. Therefore I've implemented a (super inefficient) script to try to make
+the examples smaller. Here's how that works:
+- We throw out all the operations that occur *after* the inefficient operation
+  in the trace.
+- Then we remove all "dead" operations, i.e. operations that don't have their
+  results used (all the operations that we analyze are without side effects).
+- Now we try to remove every guard in the trace one by one and check
+  afterwards, whether the resulting trace still has an inefficiency.
+- We also try to replace every single operation with a new argument to the
+  trace, to see whether the inefficiency is still present. This process is
+  super inefficient and I should probably be using
+  [shrinkray](https://github.com/DRMacIver/shrinkray) or
+  [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it
+  works super well in practice and the runtime isn't too bad.
+
+## Results
+
+So far I am using the JIT traces of three programs: 1) Booting Linux on the
+Pydrofoil RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3)
+running the PyPy bootstrap process on top of PyPy. The script identifies 94
+inefficiencies in the traces, obviously a lot of them come from repeating
+patterns. My next steps will be to inspect them all, categorize them, and
+implement easy optimizations identified that way. I also want a way to sort the
+examples by execution count in the benchmarks, to get a feeling for which of
+them are most important.
+
+I didn't investigate the full set of [Python
+benchmarks](https://speed.pypy.org) that PyPy uses yet, because I don't expect
+them to contain interesting amounts of integer operations, but maybe I am wrong
+about that? Will have to try eventually.
+
+## Conclusion
+
+This was again much easier to do than I would have expected! Given that I had
+the translation of trace ops to Z3 already in place, it was a matter of about a
+day of programming to use this infrastructure to find the first problems and
+minimize them.
+
+Reusing the results of existing operations or replacing operations by constants
+can be seen as "zero-instruction superoptimization". I'll probably be rather
+busy for a while to add the missing optimizations identified by my simple
+script. But later extensions to actually synthesize one or several operations
+in the attempt to optimize the traces more and find more opportunities is
+possible.
From f138ddd8abcec2f62e85f054ab8a7fc02b5813de Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 12 Jul 2024 11:17:50 +0200 Subject: [PATCH 02/16] add nother example --- posts/2024/07/finding-inefficiencies-with-z3.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index b71161d18..8d4a21b73 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -191,8 +191,10 @@ to satisfy their negation. If that returns unsat, we now that the proof worked. To synthesize, we use exists-forall to check whether the forall-query is satisfiable in order to be able to get the concrete value of the constant `c`.) -Here's an examples of an inefficiency found this way: `(x ^ 1) ^ x == 1` (or, -more generally: `(x ^ y) ^ x == y`). +Here a few examples of inefficiencies found this way: + +- `(x ^ 1) ^ x == 1` (or, more generally: `(x ^ y) ^ x == y`) +- if `x | y == 0`, it follows that `x == 0` and `y == 0` ## Minimization From c6181c644ef1014845dc855963b9b3d7d30b3a40 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 12 Jul 2024 20:01:00 +0200 Subject: [PATCH 03/16] some notes and changes --- .../2024/07/finding-inefficiencies-with-z3.md | 128 ++++++++---------- 1 file changed, 56 insertions(+), 72 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index 8d4a21b73..aa91192c6 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -1,5 +1,5 @@ -Last week I was at the [PLDI conference](https://pldi24.sigplan.org/) in -Copenhagen to present a [paper](https://dl.acm.org/doi/10.1145/3652588.3663316) -I co-authored with [Max Bernstein](https://bernsteinbear.com/). I also finally -met [John Regehr](https://blog.regehr.org/), who I'd been talking on social -media for a long time but had never met. John has been working on compiler -correctness and better techniques for building compilers and optimizers since a -very long time. The blog post [Finding JIT Optimizer Bugs using SMT Solvers and -Fuzzing](https://www.pypy.org/posts/2022/12/jit-bug-finding-smt-fuzzing.html) -was heavily inspired by this work. We talked a lot about his and his groups -work on using Z3 to find missing optimizations. - -In my previous blog post about finding optimizer bugs, I described how to use -Z3 (an SMT solver) and fuzzing to find bugs in the integer optimization passes -of PyPy's JIT. In this post I want to talk about early results from an -experiment that came out of discussions with John (and also inspired by a talk -by Nikolai Tillmann about -[SPUR](https://web.archive.org/web/20130503204413/http://www.cs.iastate.edu/~design/vmil/2010/slides/p03-tillmann-slides.pdf) -that I saw many years ago). In this experiment I try to use Z3 to find missing -[peephole optimizations]() for integer operations in the traces of PyPy's JIT. - -## Motivation +In last weeks post I've described how to use Z3 to find simple local peephole +optimization patterns to for the integer operations in PyPy's JIT. An example +is `int_and(x, 0) -> 0`. The post ended by discussing problems: + +- combinatorial explosion +- non-minimal examples +- how do we know that we even care about the patterns that it finds? + +In this post I want to discuss a different method: + +- start from real programs that we care about (e.g. 
benchmarks) +- take their traces, ignore all the non-integer operations +- translate the traces into big Z3 formulas, operation by operation +- use Z3 to identify inefficiens in those concrete traces +- minimize the inefficient programs by removing as many operations as possible + +that way we don't have to generate all possible sequences of ir operations up +to a certain size, and we also get optimizations that we care about, because +the patterns come from real programs + +post will be more high-level and describe the general approach, but not go +through the implementation in detail. + +## Background PyPy's JIT has historically always focused almost exclusively on the typical inefficiencies of Python programs, such as removing boxing, run-time type @@ -39,7 +42,10 @@ two years I have applied the JIT also to a very different domain, that of CPU emulation, in the experimental [Pydrofoil](https://docs.pydrofoil.org) project (supports RISC-V and ARM by now). For Pydrofoil, the performance of integer operations is significantly more important than for "normal" Python code, which -is why I've become much more interested in optimizing integer operations. +is why I've become much more interested in optimizing integer operations. The +simple integer operation rewrites of the last post were all implemented long +ago, and so were a lot of more complicated ones. But there surely are a lot of +them missing, compared to mature compilers. Finding missing optimizations is quite hard to do manually, because it involves carefully staring at huge concrete traces of programs that we care about. So @@ -47,55 +53,10 @@ ideally we want to automate the problem. To make sure that we care about the optimizations that we find, we start from the traces of benchmarks that are interesting, such as booting Linux on the ARM and RISC-V emulators. -We could find missing optimizations by using LLVM or GCC if we could translate -RPython JIT traces to C code. But then we would be relying on the quality of -GCC optimizations. Instead, we will use Z3, an SMT ('satisfiability modulo -theories') solver by Microsoft Research, which was an excellent Python API. Z3 -has excellent support for reasoning about bitvectors, and we will use that to -identify non-optimal integer operation sequences. - -## Background Z3 - -Z3 is a STM solver, which means that it can answer the question whether a -certain formula is satisfiable. This means that it will try to find concrete -values for all the variables that occur in the formulas such that the formula -becomes true. The formulas can be about objects in various theories, Z3 has -support for (mathematical) integers, strings, arrays, unterinterpreted -functions, etc. The theory we care about in this blog post is the **bitvector** -theory. Bitvectors are vectors of bits of a fixed size, and they are great for -expressing operations on machine integers with a fixed bit width. We use Z3 -extensively to fuzz the JIT's optimizer, so we already have a translation from -JIT integer operations to Z3 formulas. I'm just reusing some of that -infrastructure for finding missing optimizations. - -To give a bit of an idea of what using the Z3 Python API looks like, here's an -example. Suppose we want to check that `x ^ -1` is the same as `~x` for -arbitrary 64-bit integers (this is true because xor with a 1 bit flips the bit -in `x`, and `-1` has all bits set in two's complement). 
- -```pycon ->>>> import z3 ->>>> a = z3.BitVec('a', 64) # we construct a bitvector variable with width 64 ->>>> a ^ -1 == ~a # the variables have operators overloaded, so we can combine them into larger formulas -a ^ 18446744073709551615 == ~a ->>>> z3.prove(a ^ -1 == ~a) # now we can ask Z3 to prove the formula -proved ->>>> z3.prove(a ^ -1 == a) # try to prove something wrong -counterexample -[a = 0] -``` - -Or, an example with two variables: `a * (1 << b)` is the same as `a << b`: - -```pycon ->>>> b = z3.BitVec('b', 64) ->>>> solver.prove(a * (1 << b) == a << b) -proved -``` ## General Recipe for Finding Missing Optimizations -Now that we have an idea how Z3 and bitvectors work, we want to use it to find +We want to use it to find missing integer optimizations in the PyPy JIT. The general approach is to first collect a bunch of traces of integer heavy programs. Then we encode the integer operations in the traces into Z3 formulas and use different Z3 queries to @@ -107,9 +68,32 @@ I'll write about each of these steps in turn. ## Encoding Traces as Z3 formulas -The blog post about [using Z3 to find bugs]() already contained the code to -encode PyPy trace operations into Z3 formulas, so we don't need to repeat that -here. But to give an example, XXX +The last blog post already contained the code to encode individual trace +operations into Z3 formulas, so we don't need to repeat that here. To encode +traces of operations we introduce a Z3 variable for every operation in the +trace and then call the `z3_expression` function for every single one of the +operations in the trace. + +For example, for the following trace: + +``` +[i1] +i2 = uint_rshift(i1, 32) +i3 = int_and(i2, 65535) +i4 = uint_rshift(i1, 48) +i5 = int_lshift(i4, 16) +i6 = int_or(i5, i3) +jump(i6, i2) # equal +``` + +We would get the Z3 formula: + +``` +z3.And(i2 == LShR(i1, 32), + i3 == i2 & 65535, + i4 == LShR(i1, 48), + i5 == i4 << 16) +``` ## Identifying constant booleans with Z3 From 83c62fe089d3d517f8c1e3fbc07b26448c217250 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 14 Jul 2024 15:11:34 +0200 Subject: [PATCH 04/16] some tweaks --- .../2024/07/finding-inefficiencies-with-z3.md | 64 +++++++++---------- 1 file changed, 30 insertions(+), 34 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index aa91192c6..aac102d6d 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -98,8 +98,7 @@ z3.And(i2 == LShR(i1, 32), ## Identifying constant booleans with Z3 The very simplest thing we can try to find inefficient parts of the traces is -to first focus on boolean variables. They are represented as word-sized ints -with values 0 and 1 in PyPy's JIT. For every boolean variable in the trace we +to first focus on boolean variables. For every boolean variable in the trace we can ask Z3 to prove that this variable must be always True or always False. Most of the time, neither of these proofs will succeed. But if Z3 manages to prove one of them, we know have found an ineffiency: instead of computing the @@ -107,13 +106,14 @@ boolean result (eg by executing a comparison) we could have replaced the operation with the corresponding constant. Here's an example of an inefficiency found that way: if `x < y` and `y < z` are -both true, PyPy's JIT isn't able to conclude from that that `x < z` must also -be true. 
The JIT reasons about the concrete ranges (lower and upper bounds) for -every integer variable, but it has no way to remember anything about -relationships between different variables. This kind of reasoning would quite -often be useful to remove list/string bounds checks. Here's a [talk about how -LLVM does this](https://www.youtube.com/watch?app=desktop&v=1hm5ZVmBEvo) (but -it might be too heavyweight for a JIT setting). +both true, PyPy's JIT could conclude that `x < z` must also +be true. However, currently the JIT cannot make that conclusion because the JIT +only reasons about the concrete ranges (lower and upper bounds) for every +integer variable, but it has no way to remember anything about relationships +between different variables. This kind of reasoning would quite often be useful +to remove list/string bounds checks. Here's a [talk about how LLVM does +this](https://www.youtube.com/watch?app=desktop&v=1hm5ZVmBEvo) (but it might be +too heavyweight for a JIT setting). Here are some more examples found that way: @@ -153,27 +153,21 @@ inefficiencies. Here's a few examples: `((x & 1345) ^ 2048) - 2048 == x & 1345` (with different constants, of course). xor is add without carry, and `x & 1345` does not have the bit `2048` set. Therefore the `^ 2048` is equivalent to `+ 2048`, which the `- - 2048` cancels. I don't understand at all why this appears so often in my - traces, but I see variations of it a lot. + 2048` cancels. More generally, if `a & b == 0`, then `a + b == a | b == a ^ + b`. I don't understand at all why this appears so often in my traces, but I + see variations of it a lot. LLVM can optimize this, but [GCC + can't](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115829), thanks to + [Andrew Pinski for filing the + bug](https://hachyderm.io/@pinskia/112752641328799157)! ## Synthesizing more complicated constants with exists-forall To find out whether some integer operations always return a constant result, we can't simply use the same trick as for those operations that return boolean -results. For bools, enumerating all possible results (True and False) and then -checking is feasible, but for 64-bit integers it very much is not. Therefore we -can ask Z3 to come up with a constant for us. To do this, we have to use an -"exists-forall query". To use this, we pose the following query for every -operation `op` in the trace: "does there exist a constant `c` so that for all -inputs of the trace the result of the operation `op` is equal to constant `c`?" -If such a constant exists, we could have removed the operation, and replaced it -with the constant that Z3 provides. - -(This is a bit of a subtle point, but for synthesizing constants we use Z3 in -the "opposite" mode than if we ask it to prove stuff. To prove things, we try -to satisfy their negation. If that returns unsat, we now that the proof worked. -To synthesize, we use exists-forall to check whether the forall-query is -satisfiable in order to be able to get the concrete value of the constant `c`.) +results. Like in the last post, we can use `z3.ForAll` to find out whether Z3 +can synthesize constants for us, for the result of an operation in ints +context. If such a constant exists, we could have removed the operation, and +replaced it with the constant that Z3 provides. Here a few examples of inefficiencies found this way: @@ -188,15 +182,17 @@ the examples smaller. Here's how that works: - We throw out all the operations that occur *after* the inefficient operation in the trace. 
- Then we remove all "dead" operations, ie operations that don't have their - results used (all the operations that we analyze are without side effects). + results used (all the operations that we can analyze with Z3 are without side + effects). - Now we try to remove every guard in the trace one by one and check afterwards, whether the resulting trace still has an inefficiency. - We also try to replace every single operation with a new argument to the - trace, to see whether the inefficiency is still present. This process is - super inefficient and I should probably be using - [shrinkray](https://github.com/DRMacIver/shrinkray) or - [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it - works super well in practice and the runtime isn't too bad. + trace, to see whether the inefficiency is still present. + + The minimization process is super inefficient and I should probably be using + [shrinkray](https://github.com/DRMacIver/shrinkray) or + [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it + seems to work well in practice and the runtime isn't too bad. ## Results @@ -204,7 +200,7 @@ So far I am using the JIT traces of three programs: 1) Booting Linux on the Pydrofoil RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3) running the PyPy bootstrap process on top of PyPy. The script identifies 94 inefficiencies in the traces, obviously a lot of them come from repeating -patterns. My next steps will be to inspect them all, categorize them, and +patterns. My next steps will be to manually inspect them all, categorize them, and implement easy optimizations identified that way. I also want a way to sort the examples by execution count in the benchmarks, to get a feeling for which of them are most important. @@ -225,5 +221,5 @@ Reusing the results of existing operations or replacing operations by constants can be seen as "zero-instruction superoptimization". I'll probably be rather busy for a while to add the missing optimizations identified by my simple script. But later extensions to actually synthesize one or several operations -in the attempt to optimize the traces more and find more opportunities is -possible. +in the attempt to optimize the traces more and find more opportunities should +be possible. From a1495375567b91b446aac9a53d85a88491c5b957 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Sun, 14 Jul 2024 15:14:12 +0200 Subject: [PATCH 05/16] add links --- posts/2024/07/finding-inefficiencies-with-z3.md | 8 +++++--- posts/2024/07/finding-simplification-rules-with-z3.md | 2 +- 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index aac102d6d..b18f1012c 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -10,9 +10,11 @@ .. author: CF Bolz-Tereick --> -In last weeks post I've described how to use Z3 to find simple local peephole -optimization patterns to for the integer operations in PyPy's JIT. An example -is `int_and(x, 0) -> 0`. The post ended by discussing problems: +In last weeks post I've described [how to use Z3 to find simple local peephole +optimization +patterns](finding-simple-rewrite-rules-jit-z3.html) +to for the integer operations in PyPy's JIT. An example is `int_and(x, 0) -> +0`. 
The post ended by discussing problems: - combinatorial explosion - non-minimal examples diff --git a/posts/2024/07/finding-simplification-rules-with-z3.md b/posts/2024/07/finding-simplification-rules-with-z3.md index d2f08dbf5..f3e21b7d2 100644 --- a/posts/2024/07/finding-simplification-rules-with-z3.md +++ b/posts/2024/07/finding-simplification-rules-with-z3.md @@ -666,7 +666,7 @@ looking at sequences of instructions: that they typically aren't found in realistic programs. It would be much better to somehow focus on the patterns that real benchmarks are using. -In the next blog post I'll discuss an alternative approach to simply generating +In the [next blog post](finding-missing-optimizations-z3.html) I'll discuss an alternative approach to simply generating all possible sequences of instructions, that tries to address these problems. This works by analyzing the real traces of benchmarks and mining those for inefficiencies, which only shows problems that occur in actual programs. From e27ffcd57f6c19aba314dae742df77c13b819f63 Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 19 Jul 2024 12:22:04 +0200 Subject: [PATCH 06/16] add pseudo-code, a bunch of other changes --- .../2024/07/finding-inefficiencies-with-z3.md | 297 +++++++++++++----- 1 file changed, 220 insertions(+), 77 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index b18f1012c..7048ac7ca 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -1,6 +1,6 @@ -In last weeks post I've described [how to use Z3 to find simple local peephole -optimization -patterns](finding-simple-rewrite-rules-jit-z3.html) +In my last post I've described [how to use Z3 to find simple local peephole +optimization patterns](finding-simple-rewrite-rules-jit-z3.html) to for the integer operations in PyPy's JIT. An example is `int_and(x, 0) -> -0`. The post ended by discussing problems: - -- combinatorial explosion -- non-minimal examples -- how do we know that we even care about the patterns that it finds? - -In this post I want to discuss a different method: - -- start from real programs that we care about (e.g. benchmarks) -- take their traces, ignore all the non-integer operations -- translate the traces into big Z3 formulas, operation by operation -- use Z3 to identify inefficiens in those concrete traces -- minimize the inefficient programs by removing as many operations as possible - -that way we don't have to generate all possible sequences of ir operations up -to a certain size, and we also get optimizations that we care about, because -the patterns come from real programs - -post will be more high-level and describe the general approach, but not go -through the implementation in detail. - -## Background - -PyPy's JIT has historically always focused almost exclusively on the typical -inefficiencies of Python programs, such as removing boxing, run-time type -feedback, speculation type-specializing collections, etc. However, in the last -two years I have applied the JIT also to a very different domain, that of CPU -emulation, in the experimental [Pydrofoil](https://docs.pydrofoil.org) project -(supports RISC-V and ARM by now). For Pydrofoil, the performance of integer -operations is significantly more important than for "normal" Python code, which -is why I've become much more interested in optimizing integer operations. 
The -simple integer operation rewrites of the last post were all implemented long -ago, and so were a lot of more complicated ones. But there surely are a lot of -them missing, compared to mature compilers. - -Finding missing optimizations is quite hard to do manually, because it involves -carefully staring at huge concrete traces of programs that we care about. So -ideally we want to automate the problem. To make sure that we care about the -optimizations that we find, we start from the traces of benchmarks that are -interesting, such as booting Linux on the ARM and RISC-V emulators. - - -## General Recipe for Finding Missing Optimizations - -We want to use it to find -missing integer optimizations in the PyPy JIT. The general approach is to first -collect a bunch of traces of integer heavy programs. Then we encode the integer -operations in the traces into Z3 formulas and use different Z3 queries to -identify inefficiencies in the traces. Once we found an inefficiency, we try to -minimize the trace to the smallest trace that we can get that still shows an -inefficiency to make it easier to understand. Then we inspect the minimized -cases manually and (if possible) try to implement the missing optimizations. -I'll write about each of these steps in turn. - -## Encoding Traces as Z3 formulas +0`. In this post I want to scale up the problem of identifying possible +optimizations to much bigger sequences, also using Z3. For that, I am starting +with the JIT traces of **real benchmarks**, after they have been optimized by +the optimizer of PyPy's JTI. Then we can ask Z3 to find inefficient integer +operations in those traces. + +Starting from the optimized traces of real programs has a bunch of big +advantages over the "classical" superoptimization approach of generating and +then trying all possible sequences of instructions. It avoids the +combinatorical explosion that happens with the latter approach. Starting from +the traces of benchmarks or (even better) actual programs also makes sure that +the missing optimizations that we actually care about the missing optimizations +that are found in this way. And because the traces are analyzed after they have +been optimized by PyPy's optimizer, we only get reports for *missing* +optimizations, that the JIT isn't able to do (yet). + +The techniques and experiments I describe in this post are again the result of +a bunch of discussions with John Regehr at a conference a few weeks ago, as +well as reading his blog posts and papers. Thanks John! + + +## High-Level Approach + +The approach works as follows: + +- Run benchmarks or other interesting programs and then dump the IR of the JIT + traces into a file. The traces have at that point been already optimized by + the PyPy JIT's optimizer. +- For every trace, ignore all the operations on non-integer variables. +- Translate every integer operation into a Z3 formula. +- For every operation, use Z3 to find out whether the operation is redundant + (how that is done is described below). +- If the operation is redundant, the trace is less efficient than it could have + been, because the optimizer could also have removed the operation. Report the + inefficiency. +- Minimize the inefficient programs by removing as many operations as possible + to make the problem easier to understand. + +In the post I will describe the details and show some pseudocode of the +approach. I'll also make the proper code public eventually (but it needs a +healthy dose of cleanups first). 
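+
+The pseudocode in the following sections uses two small helpers that come from
+the previous post: `z3_expression`, which turns a single trace operation into
+a Z3 formula (plus an optional validity condition), and `prove`, which checks
+whether a formula is always true. As a rough reminder, here's a minimal sketch
+of the two (the details are simplified assumptions for this post, not the real
+implementation):
+
+```python
+import z3
+
+INTEGER_WIDTH = 64
+
+def z3_expression(opname, arg0, arg1=None):
+    # sketch: handle a handful of operations only, the real version
+    # covers all the integer operations of the JIT
+    if opname == "int_and":
+        return arg0 & arg1, None
+    if opname == "int_or":
+        return arg0 | arg1, None
+    if opname == "int_lshift":
+        return arg0 << arg1, None
+    if opname == "uint_rshift":
+        return z3.LShR(arg0, arg1), None
+    raise NotImplementedError(opname)
+
+def prove(solver, cond):
+    # a condition is proven if its negation is unsatisfiable, taking
+    # the assertions already added to the solver into account
+    return solver.check(z3.Not(cond)) == z3.unsat
+```
+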
+ + +## Encoding Traces as Z3 formulas The last blog post already contained the code to encode individual trace operations into Z3 formulas, so we don't need to repeat that here. To encode @@ -97,15 +85,50 @@ z3.And(i2 == LShR(i1, 32), i5 == i4 << 16) ``` +Usually we won't ask for the formula of the whole trace at once. Instead we go +through the trace operation by operation and try to find inefficiencies in the +current one we are looking at. Roughly like this (pseudo-)code: + +```python +def newvar(name): + return z3.BitVec(name, INTEGER_WIDTH) + +def find_inefficiencies(trace): + solver = z3.Solver() + var_to_z3var = {} + for input_argument in trace.inputargs: + var_to_z3var[input_argument] = newz3var(input_argument) + for op in trace: + var_to_z3var[op] = z3resultvar = newz3var(op.resultvarname) + arg0 = op.args[0] + z3arg0 = var_to_z3var[arg0] + if len(op.args) == 2: + arg1 = op.args[1] + z3arg1 = var_to_z3var[arg1] + else: + z3arg1 = None + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # checking for inefficiencies, see the next sections + ... + if ...: + return "inefficient", op + + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` + + + ## Identifying constant booleans with Z3 The very simplest thing we can try to find inefficient parts of the traces is -to first focus on boolean variables. For every boolean variable in the trace we -can ask Z3 to prove that this variable must be always True or always False. -Most of the time, neither of these proofs will succeed. But if Z3 manages to -prove one of them, we know have found an ineffiency: instead of computing the -boolean result (eg by executing a comparison) we could have replaced the -operation with the corresponding constant. +to first focus on boolean variables. For every operation in the trace that +returns a bool we can ask Z3 to prove that this variable must be always True or +always False. Most of the time, neither of these proofs will succeed. But if Z3 +manages to prove one of them, we know have found an ineffiency: instead of +computing the boolean result (eg by executing a comparison) the JIT's optimizer +could have replaced the operation with the corresponding boolean constant. Here's an example of an inefficiency found that way: if `x < y` and `y < z` are both true, PyPy's JIT could conclude that `x < z` must also @@ -119,13 +142,37 @@ too heavyweight for a JIT setting). Here are some more examples found that way: -- `x - 1 == x` is always False +- `x - 1 == x` is always False - `x - (x == -1) == -1` is always False. The pattern `x - (x == -1)` happens a lot in PyPy's hash computations: To be compatible with the CPython hashes we need to make sure that no object's hash is -1 (CPython uses -1 as an error value on the C level). -## Identifying redundant operations +Here's pseudo-code for how to implement this, it would go in the middle of the +loop above: + +```python +def find_inefficiencies(trace): + ... + for op in trace: + ... + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # check for boolean constant result + if op.has_boolean_result(): + if prove(solver, res == 0): + return "inefficient", op, 0 + if prove(solver, res == 1): + return "inefficient", op, 1 + # checking for other inefficiencies, see the next sections + ... 
+ + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` + + +## Identifying redundant operations A more interesting class of redundancy is to try to find two operations in a trace that compute the same result. We can do that by asking Z3 to prove for @@ -162,21 +209,89 @@ inefficiencies. Here's a few examples: [Andrew Pinski for filing the bug](https://hachyderm.io/@pinskia/112752641328799157)! -## Synthesizing more complicated constants with exists-forall + +And here's some implementation pseudo-code again: + +```python +def find_inefficiencies(trace): + ... + for op in trace: + ... + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # check for boolean constant result + ... + # searching for redundant operations + for previous_op in trace: + if previous_op is op: + break # done, reached the current op + previous_op_z3var = var_to_z3var[previous_op] + if prove(solver, previous_op_z3var == res): + return "inefficient", op, previous_op + ... + # more code here later + ... + + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` + +## Synthesizing more complicated constants with exists-forall To find out whether some integer operations always return a constant result, we can't simply use the same trick as for those operations that return boolean -results. Like in the last post, we can use `z3.ForAll` to find out whether Z3 -can synthesize constants for us, for the result of an operation in ints -context. If such a constant exists, we could have removed the operation, and -replaced it with the constant that Z3 provides. +results, because enumerating $2^64$ possible constants and checking them all +would take too long. Like in the last post, we can use `z3.ForAll` to find out +whether Z3 can synthesize constants for us, for the result of an operation in +ints context. If such a constant exists, we could have removed the operation, +and replaced it with the constant that Z3 provides. Here a few examples of inefficiencies found this way: - `(x ^ 1) ^ x == 1` (or, more generally: `(x ^ y) ^ x == y`) - if `x | y == 0`, it follows that `x == 0` and `y == 0` +- if `x != MAXINT`, then `x + 1 > x` + +Implementing this is actually slightly annoying. The `solver.add` calls for +non-inefficient ops add assertions to the solver, which are now confusing the +`z3.ForAll` query. We could remove all assertion from the solver, then do the +`ForAll` query, then add the assertions back. What I ended doing instead was +instantiating a second solver object that I'm using for the `ForAll` queries, +that remains empty the whole time. + +```python +def find_inefficiencies(trace): + solver = z3.Solver() + empty_solver = z3.Solver() + var_to_z3var = {} + ... + for op in trace: + ... + res, valid_if = z3_expression(op.name, z3arg0, z3arg1) + # check for boolean constant result + ... + # searching for redundant operations + ... 
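+        # the exists-forall query below asks: is there a single value for
+        # constvar that the current operation's result is equal to, for
+        # *all* possible values of the trace variables (given everything
+        # the solver knows)? if yes, the operation always computes a constant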
+ # checking for constant results + constvar = z3.BitVec('find_const', INTEGER_WIDTH) + condition = z3.ForAll( + var_to_z3var.values(), + z3.Implies( + *solver.assertions(), + expr == constvar + ) + ) + if empty_solver.check(condition) == z3.sat: + model = empty_solver.model() + const = model[constvar].as_signed_long() + return "inefficient", op, const + + # not inefficient, add op to the solver and continue with the next op + solver.add(z3resultvar == res) + return None # no inefficiency found +``` -## Minimization +## Minimization Analyzing an inefficiency by hand in the context of a larger trace is quite tedious. Therefore I've implemented a (super inefficient) script to try to make @@ -196,10 +311,10 @@ the examples smaller. Here's how that works: [C-Reduce](https://github.com/csmith-project/creduce) instead. However, it seems to work well in practice and the runtime isn't too bad. -## Results +## Results So far I am using the JIT traces of three programs: 1) Booting Linux on the -Pydrofoil RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3) +[Pydrofoil](https://docs.pydrofoil.org) RISC-V emulator, 2) booting Linux on the Pydrofoil ARM emulator 3) running the PyPy bootstrap process on top of PyPy. The script identifies 94 inefficiencies in the traces, obviously a lot of them come from repeating patterns. My next steps will be to manually inspect them all, categorize them, and @@ -217,7 +332,7 @@ about that? Will have to try eventually. This was again much easier to do than I would have expected! Given that I had the translation of trace ops to Z3 already in place, it was a matter of about a day's of programming to use this infrastructure to find the first problems and -minimizing them. +minimizing them. Reusing the results of existing operations or replacing operations by constants can be seen as "zero-instruction superoptimization". I'll probably be rather @@ -225,3 +340,31 @@ busy for a while to add the missing optimizations identified by my simple script. But later extensions to actually synthesize one or several operations in the attempt to optimize the traces more and find more opportunities should be possible. + +Finding inefficiencies in traces with Z3 is certainly significantly less +annoying and also less error-prone than just manually inspecting traces and +trying to spot optimization opportunities. + + +## Random Notes and Sources + +Again, John's blog posts: + +- [Let’s Work on an LLVM Superoptimizer](https://blog.regehr.org/archives/1109) +- [Early Superoptimizer Results](https://blog.regehr.org/archives/1146) +- [A Few Synthesizing Superoptimizer Results](https://blog.regehr.org/archives/1252) +- [Synthesizing Constants](https://blog.regehr.org/archives/1636) + +and papers: + +- [A Synthesizing Superoptimizer](https://arxiv.org/pdf/1711.04422) +- [Hydra: Generalizing Peephole Optimizations with Program Synthesis](https://dl.acm.org/doi/pdf/10.1145/3649837) + +I remembered recently that I had seen the approach of optimizing the traces of +a tracing JIT with Z3 a long time ago, as part of the (now long dead, I think) +[SPUR +project](https://web.archive.org/web/20160304055149/http://research.microsoft.com/en-us/projects/spur/). +There's a [workshop +paper](https://web.archive.org/web/20161029162737/http://csl.stanford.edu/~christos/pldi2010.fit/tillmann.provers4jit.pdf) +from 2010 about this. In addition to bitvectors, they also use the Z3 support +for arrays to model the C# heap and remove redundant stores. 
From 654fb2eaedd9aa129ff1bd3c726e400ccecf055c Mon Sep 17 00:00:00 2001 From: CF Bolz-Tereick Date: Fri, 19 Jul 2024 12:56:45 +0200 Subject: [PATCH 07/16] some more edits --- .../2024/07/finding-inefficiencies-with-z3.md | 79 ++++++++++++------- 1 file changed, 49 insertions(+), 30 deletions(-) diff --git a/posts/2024/07/finding-inefficiencies-with-z3.md b/posts/2024/07/finding-inefficiencies-with-z3.md index 7048ac7ca..40554b6ca 100644 --- a/posts/2024/07/finding-inefficiencies-with-z3.md +++ b/posts/2024/07/finding-inefficiencies-with-z3.md @@ -1,7 +1,7 @@