Root Cause 1: Julia 1.11 Array → Memory indirection (primary culprit)
Julia 1.11 restructured Array{T,N} internally. Before 1.11, a Vector{T} was a single GC object: a header followed immediately by element data. Starting with 1.11, Vector{T} became a thin wrapper that holds a pointer to a separate Memory{T} object, which holds the actual data. That is two pointer dereferences to reach element i instead of one.
This matters enormously for Metatheory because VecExpr is struct VecExpr { data::Vector{UInt64} }. In the ematcher inner loop, every instruction that reads a child ID does n.data[i], and since n itself comes from eclass.nodes (also a Vector), the read chain is:
eclass ptr → nodes::Vector ptr → Memory ptr → element i (1.11+)
eclass ptr → nodes::Vector ptr → element i (≤ 1.10)
The ematcher for a rule with N variables executes this chain O(N × eclasses × iterations) times. For the CAS test — the heaviest theory in the suite, with a ring of commutativity/associativity/distributivity rules running over a medium-sized e-graph — this accumulates into an enormous penalty.
The 75% allocations increase you see in the CI numbers is the smoking gun. Before 1.11, push!(seg_buf, ...) / v_new(arity) / n.data fields each created one GC object. In 1.11+, every Vector allocation creates two GC objects: the Array wrapper and the Memory backing store. OptBuffer uses a Vector internally; VecExpr wraps a Vector. Multiply by the number of e-nodes created and ematch buffer writes per CAS saturation run, and you get roughly 2× the GC pressure → the collector fires more, pauses accumulate, and total wall time bloats.
Root Cause 2: if pc === k dispatch is not a jump table in Julia 1.12
The generated ematchers use independent if pc === k; ...; end blocks (not an if/elseif chain):
if pc === 0x0001; ...; end
if pc === 0x0002; ...; end
if pc === 0x0003; ...; end
LLVM can sometimes recognise a run of if x === k against a single variable and emit a switch/jump-table. But this is not guaranteed — it depends on whether LLVM sees the blocks as exhaustive and non-overlapping, and on whether Julia's lowering collapses them before LLVM sees them. In Julia 1.9, LLVM 14 was more aggressive about this pattern. Julia 1.12 ships with LLVM 18, which changed several optimization passes. If LLVM 18 no longer folds these into a switch, the dispatch becomes O(n_instructions) linear scan per VM step. For a rule with 10 instructions, that is 10× the branch pressure. Multiplied across all rules and all e-class candidates, it compounds badly.
An if/elseif chain, on the other hand, is lowered by Julia's front-end into a single br_table-friendly IR structure that LLVM reliably turns into a jump table regardless of LLVM version.
Root Cause 3: eclass.nodes inner-loop allocation in cached_ids for segment patterns
The segment-pattern fallback in cached_ids uses a generator with any(...) over eclass.nodes:
(class_key.val for (class_key, eclass) in g.classes
if any(n -> v_flags(n) == flags_p && ..., eclass.nodes))
any with a closure over a Vector now goes through the Memory indirection too. But more importantly, the closure captures flags_p, head_h, name_h — three captured variables in a heap-allocated closure object. In 1.11+, closures over multiple variables allocate slightly differently. This is secondary, but worth noting.
Why CAS specifically
The CAS test exercises the most rules (commutativity, associativity, distributivity, multiple operators), runs the most iterations before saturation, and builds the largest e-graph. Every percentage-point slowdown in the inner ematcher loop appears at CAS scale as many more absolute nanoseconds than in the smaller unit tests. That is why it goes from 160 s to 739 s while smaller tests are less affected.
Summary of contributing factors (ordered by impact)
- Array → Memory indirection (Julia 1.11): every
n.data[i] in the ematcher hot loop now costs one extra pointer chase. 2× GC objects per Vector allocation → 75% allocation spike → more frequent GC → wall-clock inflation.
if pc === k blocks not jump-table-compiled by LLVM 18 (Julia 1.12): O(n_instructions) linear dispatch scan instead of O(1) switch, multiplied across all VM steps.
- Closure allocation in
cached_ids segment fallback: minor, only affects segment rules.
Proposed fixes
Fix 1 (high impact, easy): In ematch_compiler.jl, change the generated if pc === k; ...; end sequence to a single if pc === 1 elseif pc === 2 ... end chain. Julia lowers if/elseif into switch-friendly IR that LLVM reliably jump-tables regardless of version.
Fix 2 (high impact, moderate effort): On Julia ≥ 1.11, use Memory{UInt64} directly inside OptBuffer instead of Vector{UInt64}. Memory is the raw backing store — it is a single GC object with no wrapper indirection. Same change for VecExpr.data. This restores the pre-1.11 single-dereference access pattern. The API is essentially identical: m[i], length(m), with manual resizing via Memory{T}(undef, n) + copy when needed.
Fix 3 (medium impact, easy): Pre-materialise the cached_ids segment fallback as a plain Vector{Id} scan rather than a generator with a heap closure, to keep it cache-friendly.
Fix 1 alone should recover the jump-table regression. Fix 2 alone should recover most of the allocation/indirection regression. Together they should bring Julia 1.12 performance back to parity with — or better than — Julia 1.9, since LLVM 18 has better vectorisation for tight integer loops once the structural issues are resolved.
Root Cause 1: Julia 1.11 Array → Memory indirection (primary culprit)
Julia 1.11 restructured
Array{T,N}internally. Before 1.11, aVector{T}was a single GC object: a header followed immediately by element data. Starting with 1.11,Vector{T}became a thin wrapper that holds a pointer to a separateMemory{T}object, which holds the actual data. That is two pointer dereferences to reach elementiinstead of one.This matters enormously for Metatheory because
VecExprisstruct VecExpr { data::Vector{UInt64} }. In the ematcher inner loop, every instruction that reads a child ID doesn.data[i], and sincenitself comes fromeclass.nodes(also aVector), the read chain is:The ematcher for a rule with N variables executes this chain O(N × eclasses × iterations) times. For the CAS test — the heaviest theory in the suite, with a ring of commutativity/associativity/distributivity rules running over a medium-sized e-graph — this accumulates into an enormous penalty.
The 75% allocations increase you see in the CI numbers is the smoking gun. Before 1.11,
push!(seg_buf, ...)/v_new(arity)/n.datafields each created one GC object. In 1.11+, everyVectorallocation creates two GC objects: theArraywrapper and theMemorybacking store.OptBufferuses aVectorinternally;VecExprwraps aVector. Multiply by the number of e-nodes created and ematch buffer writes per CAS saturation run, and you get roughly 2× the GC pressure → the collector fires more, pauses accumulate, and total wall time bloats.Root Cause 2:
if pc === kdispatch is not a jump table in Julia 1.12The generated ematchers use independent
if pc === k; ...; endblocks (not anif/elseifchain):LLVM can sometimes recognise a run of
if x === kagainst a single variable and emit a switch/jump-table. But this is not guaranteed — it depends on whether LLVM sees the blocks as exhaustive and non-overlapping, and on whether Julia's lowering collapses them before LLVM sees them. In Julia 1.9, LLVM 14 was more aggressive about this pattern. Julia 1.12 ships with LLVM 18, which changed several optimization passes. If LLVM 18 no longer folds these into a switch, the dispatch becomes O(n_instructions) linear scan per VM step. For a rule with 10 instructions, that is 10× the branch pressure. Multiplied across all rules and all e-class candidates, it compounds badly.An
if/elseifchain, on the other hand, is lowered by Julia's front-end into a singlebr_table-friendly IR structure that LLVM reliably turns into a jump table regardless of LLVM version.Root Cause 3:
eclass.nodesinner-loop allocation incached_idsfor segment patternsThe segment-pattern fallback in
cached_idsuses a generator withany(...)overeclass.nodes:anywith a closure over aVectornow goes through theMemoryindirection too. But more importantly, the closure capturesflags_p,head_h,name_h— three captured variables in a heap-allocated closure object. In 1.11+, closures over multiple variables allocate slightly differently. This is secondary, but worth noting.Why CAS specifically
The CAS test exercises the most rules (commutativity, associativity, distributivity, multiple operators), runs the most iterations before saturation, and builds the largest e-graph. Every percentage-point slowdown in the inner ematcher loop appears at CAS scale as many more absolute nanoseconds than in the smaller unit tests. That is why it goes from 160 s to 739 s while smaller tests are less affected.
Summary of contributing factors (ordered by impact)
n.data[i]in the ematcher hot loop now costs one extra pointer chase. 2× GC objects per Vector allocation → 75% allocation spike → more frequent GC → wall-clock inflation.if pc === kblocks not jump-table-compiled by LLVM 18 (Julia 1.12): O(n_instructions) linear dispatch scan instead of O(1) switch, multiplied across all VM steps.cached_idssegment fallback: minor, only affects segment rules.Proposed fixes
Fix 1 (high impact, easy): In
ematch_compiler.jl, change the generatedif pc === k; ...; endsequence to a singleif pc === 1 elseif pc === 2 ... endchain. Julia lowersif/elseifinto switch-friendly IR that LLVM reliably jump-tables regardless of version.Fix 2 (high impact, moderate effort): On Julia ≥ 1.11, use
Memory{UInt64}directly insideOptBufferinstead ofVector{UInt64}.Memoryis the raw backing store — it is a single GC object with no wrapper indirection. Same change forVecExpr.data. This restores the pre-1.11 single-dereference access pattern. The API is essentially identical:m[i],length(m), with manual resizing viaMemory{T}(undef, n)+ copy when needed.Fix 3 (medium impact, easy): Pre-materialise the
cached_idssegment fallback as a plainVector{Id}scan rather than a generator with a heap closure, to keep it cache-friendly.Fix 1 alone should recover the jump-table regression. Fix 2 alone should recover most of the allocation/indirection regression. Together they should bring Julia 1.12 performance back to parity with — or better than — Julia 1.9, since LLVM 18 has better vectorisation for tight integer loops once the structural issues are resolved.