Perf Regression in Julia 1.12


## Root Cause 1: Julia 1.11 Array → Memory indirection (primary culprit)

Julia 1.11 restructured `Array{T,N}` internally. Before 1.11, a `Vector{T}` was a single GC object: a header followed immediately by element data. Starting with 1.11, `Vector{T}` became a thin wrapper that holds a pointer to a separate `Memory{T}` object, which holds the actual data. That is **two pointer dereferences** to reach element `i` instead of one.

This matters enormously for Metatheory because `VecExpr` is `struct VecExpr { data::Vector{UInt64} }`. In the ematcher inner loop, every instruction that reads a child ID does `n.data[i]`, and since `n` itself comes from `eclass.nodes` (also a `Vector`), the read chain is:

```
eclass ptr → nodes::Vector ptr → Memory ptr → element i   (1.11+)
eclass ptr → nodes::Vector ptr → element i                (≤ 1.10)
```

The ematcher for a rule with N variables executes this chain O(N × eclasses × iterations) times. For the CAS test — the heaviest theory in the suite, with a ring of commutativity/associativity/distributivity rules running over a medium-sized e-graph — this accumulates into an enormous penalty.

The 75% allocations increase you see in the CI numbers is the smoking gun. Before 1.11, `push!(seg_buf, ...)` / `v_new(arity)` / `n.data` fields each created one GC object. In 1.11+, every `Vector` allocation creates **two** GC objects: the `Array` wrapper and the `Memory` backing store. `OptBuffer` uses a `Vector` internally; `VecExpr` wraps a `Vector`. Multiply by the number of e-nodes created and ematch buffer writes per CAS saturation run, and you get roughly 2× the GC pressure → the collector fires more, pauses accumulate, and total wall time bloats.

## Root Cause 2: `if pc === k` dispatch is not a jump table in Julia 1.12

The generated ematchers use independent `if pc === k; ...; end` blocks (not an `if/elseif` chain):

```julia
if pc === 0x0001; ...; end
if pc === 0x0002; ...; end
if pc === 0x0003; ...; end
```

LLVM can sometimes recognise a run of `if x === k` against a single variable and emit a switch/jump-table. But this is **not guaranteed** — it depends on whether LLVM sees the blocks as exhaustive and non-overlapping, and on whether Julia's lowering collapses them before LLVM sees them. In Julia 1.9, LLVM 14 was more aggressive about this pattern. Julia 1.12 ships with LLVM 18, which changed several optimization passes. If LLVM 18 no longer folds these into a switch, the dispatch becomes O(n_instructions) linear scan per VM step. For a rule with 10 instructions, that is 10× the branch pressure. Multiplied across all rules and all e-class candidates, it compounds badly.

An `if/elseif` chain, on the other hand, is lowered by Julia's front-end into a single `br_table`-friendly IR structure that LLVM reliably turns into a jump table regardless of LLVM version.

## Root Cause 3: `eclass.nodes` inner-loop allocation in `cached_ids` for segment patterns

The segment-pattern fallback in `cached_ids` uses a generator with `any(...)` over `eclass.nodes`:

```julia
(class_key.val for (class_key, eclass) in g.classes
  if any(n -> v_flags(n) == flags_p && ..., eclass.nodes))
```

`any` with a closure over a `Vector` now goes through the `Memory` indirection too. But more importantly, the closure captures `flags_p`, `head_h`, `name_h` — three captured variables in a heap-allocated closure object. In 1.11+, closures over multiple variables allocate slightly differently. This is secondary, but worth noting.

## Why CAS specifically

The CAS test exercises the most rules (commutativity, associativity, distributivity, multiple operators), runs the most iterations before saturation, and builds the largest e-graph. Every percentage-point slowdown in the inner ematcher loop appears at CAS scale as many more absolute nanoseconds than in the smaller unit tests. That is why it goes from 160 s to 739 s while smaller tests are less affected.

## Summary of contributing factors (ordered by impact)

1. **Array → Memory indirection** (Julia 1.11): every `n.data[i]` in the ematcher hot loop now costs one extra pointer chase. 2× GC objects per Vector allocation → 75% allocation spike → more frequent GC → wall-clock inflation.
2. **`if pc === k` blocks not jump-table-compiled by LLVM 18** (Julia 1.12): O(n_instructions) linear dispatch scan instead of O(1) switch, multiplied across all VM steps.
3. **Closure allocation in `cached_ids` segment fallback**: minor, only affects segment rules.


## Proposed fixes

**Fix 1 (high impact, easy):** In `ematch_compiler.jl`, change the generated `if pc === k; ...; end` sequence to a single `if pc === 1 elseif pc === 2 ... end` chain. Julia lowers `if/elseif` into switch-friendly IR that LLVM reliably jump-tables regardless of version.

**Fix 2 (high impact, moderate effort):** On Julia ≥ 1.11, use `Memory{UInt64}` directly inside `OptBuffer` instead of `Vector{UInt64}`. `Memory` is the raw backing store — it is a single GC object with no wrapper indirection. Same change for `VecExpr.data`. This restores the pre-1.11 single-dereference access pattern. The API is essentially identical: `m[i]`, `length(m)`, with manual resizing via `Memory{T}(undef, n)` + copy when needed.

**Fix 3 (medium impact, easy):** Pre-materialise the `cached_ids` segment fallback as a plain `Vector{Id}` scan rather than a generator with a heap closure, to keep it cache-friendly.

Fix 1 alone should recover the jump-table regression. Fix 2 alone should recover most of the allocation/indirection regression. Together they should bring Julia 1.12 performance back to parity with — or better than — Julia 1.9, since LLVM 18 has better vectorisation for tight integer loops once the structural issues are resolved.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Perf Regression in Julia 1.12 #288

Root Cause 1: Julia 1.11 Array → Memory indirection (primary culprit)

Root Cause 2: `if pc === k` dispatch is not a jump table in Julia 1.12

Root Cause 3: `eclass.nodes` inner-loop allocation in `cached_ids` for segment patterns

Why CAS specifically

Summary of contributing factors (ordered by impact)

Proposed fixes

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Perf Regression in Julia 1.12 #288

Description

Root Cause 1: Julia 1.11 Array → Memory indirection (primary culprit)

Root Cause 2: if pc === k dispatch is not a jump table in Julia 1.12

Root Cause 3: eclass.nodes inner-loop allocation in cached_ids for segment patterns

Why CAS specifically

Summary of contributing factors (ordered by impact)

Proposed fixes

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Root Cause 2: `if pc === k` dispatch is not a jump table in Julia 1.12

Root Cause 3: `eclass.nodes` inner-loop allocation in `cached_ids` for segment patterns