Skip to content

Perf Regression in Julia 1.12 #288

@0x0f0f0f

Description

@0x0f0f0f

Root Cause 1: Julia 1.11 Array → Memory indirection (primary culprit)

Julia 1.11 restructured Array{T,N} internally. Before 1.11, a Vector{T} was a single GC object: a header followed immediately by element data. Starting with 1.11, Vector{T} became a thin wrapper that holds a pointer to a separate Memory{T} object, which holds the actual data. That is two pointer dereferences to reach element i instead of one.

This matters enormously for Metatheory because VecExpr is struct VecExpr { data::Vector{UInt64} }. In the ematcher inner loop, every instruction that reads a child ID does n.data[i], and since n itself comes from eclass.nodes (also a Vector), the read chain is:

eclass ptr → nodes::Vector ptr → Memory ptr → element i   (1.11+)
eclass ptr → nodes::Vector ptr → element i                (≤ 1.10)

The ematcher for a rule with N variables executes this chain O(N × eclasses × iterations) times. For the CAS test — the heaviest theory in the suite, with a ring of commutativity/associativity/distributivity rules running over a medium-sized e-graph — this accumulates into an enormous penalty.

The 75% allocations increase you see in the CI numbers is the smoking gun. Before 1.11, push!(seg_buf, ...) / v_new(arity) / n.data fields each created one GC object. In 1.11+, every Vector allocation creates two GC objects: the Array wrapper and the Memory backing store. OptBuffer uses a Vector internally; VecExpr wraps a Vector. Multiply by the number of e-nodes created and ematch buffer writes per CAS saturation run, and you get roughly 2× the GC pressure → the collector fires more, pauses accumulate, and total wall time bloats.

Root Cause 2: if pc === k dispatch is not a jump table in Julia 1.12

The generated ematchers use independent if pc === k; ...; end blocks (not an if/elseif chain):

if pc === 0x0001; ...; end
if pc === 0x0002; ...; end
if pc === 0x0003; ...; end

LLVM can sometimes recognise a run of if x === k against a single variable and emit a switch/jump-table. But this is not guaranteed — it depends on whether LLVM sees the blocks as exhaustive and non-overlapping, and on whether Julia's lowering collapses them before LLVM sees them. In Julia 1.9, LLVM 14 was more aggressive about this pattern. Julia 1.12 ships with LLVM 18, which changed several optimization passes. If LLVM 18 no longer folds these into a switch, the dispatch becomes O(n_instructions) linear scan per VM step. For a rule with 10 instructions, that is 10× the branch pressure. Multiplied across all rules and all e-class candidates, it compounds badly.

An if/elseif chain, on the other hand, is lowered by Julia's front-end into a single br_table-friendly IR structure that LLVM reliably turns into a jump table regardless of LLVM version.

Root Cause 3: eclass.nodes inner-loop allocation in cached_ids for segment patterns

The segment-pattern fallback in cached_ids uses a generator with any(...) over eclass.nodes:

(class_key.val for (class_key, eclass) in g.classes
  if any(n -> v_flags(n) == flags_p && ..., eclass.nodes))

any with a closure over a Vector now goes through the Memory indirection too. But more importantly, the closure captures flags_p, head_h, name_h — three captured variables in a heap-allocated closure object. In 1.11+, closures over multiple variables allocate slightly differently. This is secondary, but worth noting.

Why CAS specifically

The CAS test exercises the most rules (commutativity, associativity, distributivity, multiple operators), runs the most iterations before saturation, and builds the largest e-graph. Every percentage-point slowdown in the inner ematcher loop appears at CAS scale as many more absolute nanoseconds than in the smaller unit tests. That is why it goes from 160 s to 739 s while smaller tests are less affected.

Summary of contributing factors (ordered by impact)

  1. Array → Memory indirection (Julia 1.11): every n.data[i] in the ematcher hot loop now costs one extra pointer chase. 2× GC objects per Vector allocation → 75% allocation spike → more frequent GC → wall-clock inflation.
  2. if pc === k blocks not jump-table-compiled by LLVM 18 (Julia 1.12): O(n_instructions) linear dispatch scan instead of O(1) switch, multiplied across all VM steps.
  3. Closure allocation in cached_ids segment fallback: minor, only affects segment rules.

Proposed fixes

Fix 1 (high impact, easy): In ematch_compiler.jl, change the generated if pc === k; ...; end sequence to a single if pc === 1 elseif pc === 2 ... end chain. Julia lowers if/elseif into switch-friendly IR that LLVM reliably jump-tables regardless of version.

Fix 2 (high impact, moderate effort): On Julia ≥ 1.11, use Memory{UInt64} directly inside OptBuffer instead of Vector{UInt64}. Memory is the raw backing store — it is a single GC object with no wrapper indirection. Same change for VecExpr.data. This restores the pre-1.11 single-dereference access pattern. The API is essentially identical: m[i], length(m), with manual resizing via Memory{T}(undef, n) + copy when needed.

Fix 3 (medium impact, easy): Pre-materialise the cached_ids segment fallback as a plain Vector{Id} scan rather than a generator with a heap closure, to keep it cache-friendly.

Fix 1 alone should recover the jump-table regression. Fix 2 alone should recover most of the allocation/indirection regression. Together they should bring Julia 1.12 performance back to parity with — or better than — Julia 1.9, since LLVM 18 has better vectorisation for tight integer loops once the structural issues are resolved.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions