Description
Note: pie-in-the-sky ideas incoming. I'm not proposing we stop everything and do this immediately, just that we consider it for Cranelift's long-term evolution. Apologies if anything here isn't super clear; I didn't sleep well last night and I'm really tired.
In #10340, we fixed a bug where we were doing code motion of a non-trapping and read-only `load` operation such that its explicit dependencies were satisfied. However, CLIF only represents data dependencies between instructions explicitly, not control or effect dependencies. It turns out code motion was not safe because the `load`'s safety and non-trapping-ness implicitly depended upon control flow and its block's location in the CFG: specifically, the `load` was in a block that was dominated by a `br_if` that performed a bounds check. Our optimizer could not see, and therefore did not abide by, that implicit constraint, so it performed "illegal" code motion leading to a miscompile. This bug could have been a very serious CVE, and basically was not, thanks to a combination of luck and the fact that it was lurking in less-exercised, off-by-default, not-yet-production-ready features that our fuzzers hadn't hammered on.
We addressed this issue with an initial stop-gap solution by adding a new `can_move` flag to `ir::MemFlags`, effectively conservatively disabling code motion for all memory operations by default, and requiring that a memory operation set this flag to opt into dependency-based code motion in the optimizer. I remain concerned that we have similar misoptimization bugs still lurking, that we annotated a load with `can_move` when it should not have been, or that we will make an innocuous change in the future that, from a distance, invalidates the soundness of a downstream instruction's `can_move` annotation. A more-principled solution would be to eliminate implicit dependencies and explicitly represent all dependencies in the IR itself. Taken to its limit, where you no longer have a control-flow graph of basic blocks and instead only have control dependency edges between instructions, this is the "sea of nodes" approach to IR design. I believe that, after some offline discussions and brainstorming with Chris and Alex, I may have found somewhat of a sweet spot for Cranelift: one that doesn't require fully boiling the ocean to completely redefine CLIF, transforming it beyond recognition into that full-blown sea-of-nodes limit (we can keep block arguments, for example, which feel very CLIF-y), but nonetheless makes all dependencies explicit.
Benefits of Explicit Dependencies
First, let me enumerate the benefits we can reap by making all of an IR instruction's dependencies explicit:
- We eliminate the whole class of misoptimization bugs that occur because the optimizer fails to satisfy constraints that it isn't privy to. Implicit constraints that the optimizer is ignorant of no longer exist.
- We do not need to conservatively deny optimizations by default and require opting into, e.g., code motion. Our defaults can be to optimize exactly as much as the dependencies allow and no more than that. Optimization is more uniform. CLIF frontends no longer must consider whether a given memory operation is hot enough that it is worth opting into more aggressive optimizations for it and, if so, taking the time to verify whether doing so would even be sound.
- Implementing the optimizer becomes easier. All IR passes and transformations must already take care to schedule instructions such that their dependencies are satisfied (i.e. that an instruction is dominated by the defs of any values it uses). However, right now we additionally encode an alias region in memory operations, and we must do an alias analysis in order to soundly perform redundant load elimination and store-to-load forwarding optimizations; we also must perform a state-coloring analysis during CLIF-to-MachInst lowering to determine whether fusing a particular memory operation into an instruction's operand (on architectures like `x86_64` where many instructions support both register and memory operands) is sound. By making all dependencies explicit, we can remove alias regions in memory operations and vastly simplify the alias and state-coloring analyses (to the point that, I believe, we would barely even consider them full-blown analyses anymore).
- Explicit side-effect dependencies generalize our small number of fixed alias regions into any number of logical alias regions: just depend on this logical alias region's most-recent side effect.
- We can more easily implement new, more-aggressive optimizations, such as branch-folding and optimizing side-effectful instructions.

  Our current architecture makes it difficult to optimize side-effectful and control-flow instructions in the egraph because the egraph works by rewriting values and organizing values into eclasses; side-effects and control-flow are not represented by values but are instead implicit in their associated instruction. The optimizer lifts everything it can into the egraph and leaves behind the "skeleton" of side-effectful and control-flow instructions in the CFG and its basic blocks. The operands of this skeleton's instructions act to root demanded computations inside the egraph from the outside.

  If we represent control-flow and side-effect dependencies as compile-time-only values, then we can rewrite and optimize these values in the egraph. We don't demand computations from the egraph via a side-effectful skeleton, but instead with a set of control-flow and side-effect value dependencies.
  We can additionally do things like reuse an already-loaded value from the `vmctx` across a `call` to a function that is known not to modify the `vmctx`.
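  To illustrate that last point using the effect-value scheme sketched in the next section (hypothetical syntax; `fn0` is an illustrative callee known not to write to the `vmctx`):

  ```
  v1 = load.i64 effect0, v0+16     ;; load a field out of the vmctx
  effect1 = call effect0, fn0(v1)  ;; callee cannot modify the vmctx
  v2 = load.i64 effect0, v0+16     ;; same pointer, same effect operand
  ;; because the second load depends on effect0, not effect1, GVN can
  ;; reuse v1 for v2 instead of reloading after the call
  ```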
Sketch of the CLIF Changes
With that out of the way, here is a sketch of the changes I am imagining:
- We introduce a new `ir::types::Effect`. Values of this type are used to represent side-effect and control (control-flow being a type of side-effect, if you squint hard enough) dependencies of an instruction at compile time. They are not materialized at run time.
- The function's entry block's first argument will be an effect value. This represents the control flow and state of the world (e.g. memory) upon entering the function.
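  Concretely, a function preamble might look something like this (illustrative syntax):

  ```
  function %f(i64, i32) -> i32 {
  block0(effect0: effect, v0: i64, v1: i32):
      ;; effect0 represents "the state of the world" (control flow,
      ;; memory, etc.) at function entry; any instruction depending
      ;; on that state takes effect0, or an effect derived from it
      ...
  }
  ```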
- Instructions that mutate external state, like stores, will both take and produce an effect value:

  ```
  effect1 = store.i32 effect0, value, pointer+offset
  ```

  This means that the side-effect that `effect0` represents must have already happened before we perform this `store`, and any instruction (such as a `load`) that might depend on this store having been performed can take `effect1` (or another effect value derived from `effect1`) as an operand.
- Instructions that depend on external state but do not themselves have side-effects -- like read-only, non-trapping memory loads -- will only take an effect value as an operand, and will not return a new effect value:

  ```
  val = load.i32 effect, pointer+offset
  ```

  This means that the side-effect that `effect` represents must have already happened before we perform this `load`.
- Similarly, instructions which may trap take and produce effect values:

  ```
  effect1 = trapnz effect0, condition, trap_code
  ```
- Unconditional jumps take an effect value operand. Because effect values are just `ir::Value`s, they may also, but are not required to, pass effect values as arguments to the target block.

  ```
  jump effect, block(v0, v1, ...)
  ```
- Conditional branches take an effect value operand and additionally define a unique effect value result on each of their outgoing control edges. This value-definition-on-control-edge is similar to how `try_call` instructions define values on their outgoing control edges, and we can use the same technique proposed there to pass these values to their target blocks: by syntactically passing the `ctrl` placeholder as an argument to the block call, rather than an already-defined value.

  ```
  block0(v0: effect, v1: i8, v2: i32, v3: i32):
      ...
      br_if v0, v1, block1(ctrl, v2), block2(v3, ctrl)

  block1(v4: effect, v5: i32):
      ...

  block2(v6: i32, v7: effect):
      ...
  ```
- We will probably want an instruction to combine multiple effect values into a single new effect value:

  ```
  effect2 = combine_effects effect0, effect1, ...
  ```

  This can be used to make an instruction depend on both a previous `store` and a `br_if` bounds check.
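  For example, a store that must not move above either an earlier store or a `br_if` bounds check could be written as follows (hypothetical, building on the pieces above; assume `effect1` comes from the earlier store and `effect2` was defined on the in-bounds edge of the `br_if`):

  ```
  effect3 = combine_effects effect1, effect2
  ;; this store now transitively depends on both the earlier store
  ;; and the bounds check, so it cannot be hoisted above either
  effect4 = store.i32 effect3, v0, v1
  ```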
More Implications
And here are some more random things that fall out from this design:
- It still looks like CLIF! It remains familiar. Most instructions are pure, in fact, and therefore unchanged. We still have block args/params instead of phis. We still have a text format that is easy to parse and dump. It isn't just one big spaghetti mess of instructions. Instructions are still organized in blocks; we just have more freedom to do code motion.
- That said, the order of instructions in a block, and the whole of `ir::Layout`, represents only a particular scheduling of instructions and does not encode any implicit constraints about where an instruction may or may not be placed. Placement constraints (i.e. how far and where we can code motion something) are solely defined by the instruction's explicit operands.
- Our existing optimizations that operate on values, uses, and defs, like GVN and unnecessary phi removal, will Just Work with effect values and will therefore automatically be extended from pure instructions to side-effectful instructions (no more `side_effects_idempotent` flag needed for instructions!), and this will improve code generation. Removing unnecessary effect value phis, for example, lets us do even more code motion than we otherwise would have with the phi. GVN deduplicating identical `store` instructions (same pointer, same value, and same effect operand) is equivalent to a basic redundant-store-elimination optimization -- without any additional alias analysis required.
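  For example (hypothetical), two stores with the same value, pointer, and effect operand are identical expressions, so GVN unifies them and the duplicate disappears:

  ```
  effect1 = store.i32 effect0, v0, v1
  ...
  effect2 = store.i32 effect0, v0, v1  ;; same operands, including effect0
  ;; GVN puts both stores in the same eclass: effect2 becomes an
  ;; alias of effect1 and the redundant store is never emitted
  ```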
- We can remove alias regions from `ir::MemFlags` and use effect values instead. Today, the optimizer is free to hoist the load past the store and out of the loop because they are annotated with different alias regions:

  ```
  block0(v0, v1):
      jump block1(v1)

  block1(v2):
      store.i32 table v0, v2
      v3 = load.i32 vmctx, v1
      ...
      jump block1(...)

  ==>

  block0(v0, v1):
      v3 = load.i32 vmctx, v1
      jump block1(v1)

  block1(v2):
      store.i32 table v0, v2
      ...
      jump block1(...)
  ```
  But we only have a fixed number of alias regions, and we want many more.

  With effect values, we don't have any limit on the number of logical alias regions we create. As long as we wire up the instruction dependencies correctly, and don't over-constrain instructions, then everything falls into place naturally:
  ```
  block0(effect0, v0, v1):
      jump effect0, block1(effect0, v1)

  block1(effect1, v2):
      effect2 = store.i32 effect1, v0, v2
      v3 = load.i32 effect0, v1
      ...
      jump effect2, block1(effect2, ...)

  ==>

  block0(effect0, v0, v1):
      v3 = load.i32 effect0, v1
      jump effect0, block1(effect0, v1)

  block1(effect1, v2):
      effect2 = store.i32 effect1, v0, v2
      ...
      jump effect2, block1(effect2, ...)
  ```
Potential Challenges
Of course, any large change to CLIF like this will also face challenges. Here are some that I foresee:
- Although we no longer need to track implicit dependencies in core Cranelift, CLIF frontends must track them and explicitly encode them when producing CLIF. This is additional work that the frontend must perform. That said, frontends are free to over-constrain instructions with false dependencies. That can make initial migration easier, and then they can start refining those dependencies, making them more fine-grained and relaxing over-approximated constraints towards their precise constraints, to unlock more aggressive optimizations as needed.
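  As an illustrative sketch of that migration path, a maximally conservative frontend could simply thread one effect value through every side-effectful or trapping instruction in program order, reproducing today's no-code-motion behavior, and later relax individual edges:

  ```
  ;; over-constrained: a single linear effect chain
  effect1 = store.i32 effect0, v0, v1
  v2 = load.i32 effect1, v3        ;; false dependency on the store
  effect2 = trapnz effect1, v4, heap_oob

  ;; refined later: this load really only depends on function entry,
  ;; so it is now free to move above the store
  v2 = load.i32 effect0, v3
  ```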
- Incrementally implementing these changes in Cranelift might be difficult. I haven't thought too deeply about this yet. We might be able to introduce these dependencies in CLIF, but not take action on them yet, then start using and preserving them in the egraph, then start using them in the alias analysis, etc., until everything is migrated over, at which point we can remove alias regions from `ir::MemFlags` and finalize the migration. It may also be worth defining new copies of all of our relevant instructions that take and return effect values, alongside their current versions that don't, so that we don't have to move over, e.g., all memory operations in all of Cranelift all at once.
Performance. Explicitly representing all control-flow and side-effect dependencies is a lot of new edges in the IR and additional operands and results. That said, we won't generally need to work with the
ir::Layout
or iterate over instructions inside a block any more (really just need that at the end of the mid-end when scheduling instructions and lowering to machine instructions). So in some sense we are "just" shifting some edges from inside their::Layout
to the instructions themselves. Of course, the devil will be in the details and we will definitely need to do a fair amount of profiling and performance work to preserve our current compile times.