
perf(pmt): chunked-leaf packing for basic lists and container_struct #346

Draft
GrapeBaBa wants to merge 46 commits into main from gr/pmt-chunked-leaf

Conversation


GrapeBaBa commented May 5, 2026

Motivation

State transition is dominated by PMT operations on BeaconState's large basic-element lists (Balances, EpochParticipation, InactivityScores, ~1.4M items each) and per-field tree access for struct-shaped containers (Validator). On a mainnet fulu state with 2.18M validators, these account for the bulk of processEpoch and processBlock runtime.

Description

Five PMT-level changes that compose:

  1. u64 payload column — collapses every node kind's payload (branch left/right Ids, chunked_leaf pointer, container_struct vtable pointer, free-list link) into one machine word. State packs [free_bit:1 | kind:3 | ref_count:28] in u32. Cache validity moves out of kind into a 0xFF… sentinel in the root column. Hot tree-walk visits touch only state (1 B) + 4-8 B from payload, fitting one cache line.

  2. chunked_leaf — opt-in via opts.chunked_leaf=true on FixedListType / FixedVectorType. Bottom k_log2 = 10 levels of the chunks subtree fold into one *ChunkedLeaf heap blob holding K=1024 chunks. For 1M-item List<u64> pool metadata drops ~1000× (256K Node.Id → 256 ChunkedLeaves + 256 heap blobs). Bulk read/write get SIMD-batched root recomputation and amortized CoW (one 32 KB memcpy per dirty leaf instead of 10 path clones per dirty chunk).

  3. container_struct (originally PR #232, feat: model phase0 Validator as struct) — a node kind whose payload is *ContainerStructRef (vtable + caller-allocated T). Backs StructContainerType for Validator etc. Field access = O(1) struct read instead of per-field tree walk; hashTreeRoot calls the type's cached get_root directly.

  4. Pool dual-allocator — Pool now keeps two allocators routed by allocation kind: page_allocator for MAL columns + chunked_leaf 32 KB blobs (page-aligned, infrequent), and allocator for per-node small heap blobs (ContainerStructRef + WrappedT). Page-per-alloc on the small lane wastes ~70 GB of virtual address space at 2.18M validators on macOS arm64 and thrashes the TLB; the bucket allocator (default c_allocator) packs them densely. Pool.init switched to an options-struct shape (Pool.init(.{}) for production defaults). This unblocks serializeValidators / getEffectiveBalanceIncrementsZeroInactive / getSingleProof binding tests at mainnet scale (24 s → ~500 ms, 50× speedup, equal to main).

  5. Zero-copy validator access — completes the container_struct value chain. PR #232 (feat: model phase0 Validator as struct) added pool.getStructPtr(node, T) but no list-iteration API was built on top, so callers still cloned the full 263 MB validators slice per epoch transition. This PR adds:

    • StructContainerType.tree.getValuePtr(node, pool) -> *const T — direct typed pointer into the pool's container_struct payload.
    • ListCompositeTreeView.ReadonlyIterator.nextValuePtr() -> *const Element.Type — list iteration that hands out per-element pointers as the depth-iterator walks the tree.
    • BeaconState.validatorsPtrSlice(allocator) -> []*const Validator.Type — random-access pointer slice for callers that need sort / parallel workers / multi-pass.

    The two APIs are complementary: iterator wins for single forward read passes (epoch_transition_cache.init, getEffectiveBalanceIncrementsZeroInactive); pointer slice wins for sort + random index access + parallel workers (epoch_cache.init calling syncPubkeys, slashings_cache.buildFromStateIfNeeded, upgrade_state_to_altair). 8 of 9 hot callers migrated; the last (upgrade_state_to_electra) keeps the value slice because its mutate-then-reread pattern would invalidate pointers.
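The one-word state column from change 1 can be sketched as a Zig packed struct. This is an illustrative reconstruction of the described `[free_bit:1 | kind:3 | ref_count:28]` layout — field names and bit order are assumptions, not the branch's actual source:

```zig
const std = @import("std");

// Hypothetical sketch: all three state fields share a single u32, so a
// hot tree-walk visit reads one machine word for the node's metadata.
const NodeState = packed struct(u32) {
    ref_count: u28 = 0, // 28-bit reference count
    kind: u3 = 0,       // node kind tag (branch, leaf, chunked_leaf, ...)
    free: bool = false, // free-list membership bit
};

test "state packs into one machine word" {
    try std.testing.expectEqual(@as(usize, 4), @sizeOf(NodeState));
    var s = NodeState{ .kind = 5, .ref_count = 1 };
    s.ref_count += 1;
    try std.testing.expectEqual(@as(u28, 2), s.ref_count);
}
```

With `ref_count` in the low bits, the common ref/unref path is a plain add/sub on the word; cache validity lives outside this word entirely, in the root column's 0xFF… sentinel.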

Bench

bench_process_epoch and bench_process_block on mainnet era, fulu fork, slot 13336576 (2.18M validators), ReleaseFast, Apple Silicon. Both branches run with the same bench harness using c_allocator (no DebugAllocator overhead) for apples-to-apples comparison.

Process epoch (segmented breakdown, ms/run averaged over 50 runs)

  step                         main      this branch   speedup
  epoch_total                  418.0     74.9          5.58×
  before_process_epoch         193.8     39.6          4.89×
  inactivity_updates           58.6      4.4           13.3×
  rewards_and_penalties        80.8      14.2          5.69×
  effective_balance_updates    67.2      5.3           12.7×
  proposer_lookahead           16.5      11.0          1.50×

before_process_epoch (EpochTransitionCache.init) drops 4.89×: container_struct gives O(1) per-field reads on validators, and the new nextValuePtr iterator skips the 263 MB clone that validatorsSlice used to do every epoch. inactivity_updates, rewards_and_penalties, effective_balance_updates get 5-13× from chunked_leaf making bulk reads/writes on Balances / InactivityScores / EpochParticipation SIMD-friendly + amortized CoW.

Process block (segmented breakdown, ms/run averaged over 50 runs)

  step                main      this branch   speedup
  block_total         166.1     66.0          2.52×
  operations          162.9     62.6          2.60×
  block_header        0.243     0.246         ~same
  withdrawals         0.021     0.023         ~same
  execution_payload   0.201     0.200         ~same
  randao              1.088     1.100         ~same
  sync_aggregate      1.675     1.827         ~same

operations (bulk of block processing) gets 2.60× — chunked_leaf on the balance writes plus zero-copy validator reads in slashings_cache.buildFromStateIfNeeded. sync_aggregate is unchanged: the chunked_leaf CoW cost from scattered sync-committee balance writes (~290 µs of memcpy at mainnet sync committee size) is dominated by BLS aggregate verification (~1.1 ms fixed cost, identical across branches).

Binding tests at mainnet scale

  test                                            before this PR       after this PR   speedup
  getEffectiveBalanceIncrementsZeroInactive       762 ms               40 ms           19×
  serializeValidators (broken w/ page_allocator)  24097 ms (timeout)   ~500 ms         48× (now passing)
  getSingleProof(169) (broken)                    24956 ms (timeout)   1543 ms         now passing

Linux verification (AMD EPYC 9V74, 16 vCPU codespace, ReleaseFast)

Same fixture (mainnet era, fulu fork, slot 13336576, 2.18M validators), 50 runs/step. Speedup ratios reproduce on Linux/x86; absolute numbers are higher than Apple Silicon due to per-core differences.

Process epoch (segmented breakdown, ms/run)

  step                         main      this branch   speedup
  epoch_total                  666.0     131.7         5.06×
  before_process_epoch         344.8     78.6          4.39×
  inactivity_updates           67.6      4.12          16.4×
  rewards_and_penalties        112.9     23.21         4.87×
  effective_balance_updates    93.0      6.55          14.2×
  proposer_lookahead           43.8      16.07         2.73×

Process block (segmented breakdown, ms/run)

  step          main      this branch   speedup
  block_total   350.8     109.8         3.19×
  operations    346.0     105.4         3.28×

Process block (end-to-end fused, ms/run)

  variant                    main     this branch   speedup
  process_block (with BLS)   49.3     38.9          1.27×
  process_block_no_sig       10.63    2.67          3.98×

process_block_no_sig (BLS bypassed) drops 3.98× — the optimizations land cleanly on the non-BLS portion. The fused 1.27× reflects ~36-46 ms going to BLS aggregate signature verification per block, which is unaffected by PMT changes.

GrapeBaBa added 30 commits May 1, 2026 22:39
- Node is now union(enum) { free, zero, leaf, branch } with branch.root: ?[32]u8
  encoding lazy/computed state instead of a bit-packed State enum.
- Pool storage is MultiArrayList(NodeWithMeta) splitting Node and ref_count
  into separate columns (SoA-friendly for unref scans).
- ref_count moves from 28-bit packed in State to standalone u32, max_ref_count
  raised from 0x1FFFFFFF to maxInt(u32) - 1.
- Free list is encoded via the .free variant carrying next_free: Id, replacing
  the high-bit-of-state hack.
- Public API surface unchanged: Pool methods and Id methods keep their
  signatures; StateView shim preserves predicate-based State semantics for
  existing callers.
- proof.zig, View.zig, gindex.zig, ssz/* unchanged.
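A simplified sketch of the Node shape this commit describes — `Id` stands in for the pool's 32-bit slot index, and the exact payloads are illustrative rather than the project's source:

```zig
const std = @import("std");

const Id = enum(u32) { _ };

// union(enum) { free, zero, leaf, branch } as described above;
// branch.root: ?[32]u8 encodes lazy (null) vs computed state.
const Node = union(enum) {
    free: struct { next_free: Id }, // free-list link, replacing the high-bit-of-state hack
    zero, // all-zero subtree sentinel
    leaf: [32]u8, // chunk bytes
    branch: struct { left: Id, right: Id, root: ?[32]u8 },
};

test "branch root encodes lazy/computed state" {
    const n: Node = .{ .branch = .{
        .left = @enumFromInt(1),
        .right = @enumFromInt(2),
        .root = null, // not yet merkleized
    } };
    try std.testing.expect(n.branch.root == null);
}
```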
… tests

Code review feedback on commit 940373f:
- computeRoot: drop the 32 KB buf_a stack copy; read first-reduction
  pairs directly from slab.chunks. Saves 32 KB stack frame + 32 KB
  memcpy per call.
- Storage: add doc comments on chunks (zero-tail invariant), len
  (semantic), and dirty (cache-coherence role).
- Move slab tests to slab_test.zig matching the project's per-module
  test file convention; node_test.zig stays focused on Node tests.
len and dirty belong inline in the upcoming Node.slab union variant,
not in heap Storage — keeping them in both places would force every CoW
to sync two copies. Storage is now chunks-only (32 KB heap alloc per
slab); per-slab metadata lives Node-side from B2 onward.

Test for the removed dirty bitset assertion is dropped; remaining tests
adjust their no-op .len assignment.
- Node now has a fifth variant: slab { chunks: [*]align(64) [32]u8,
  len: u16, dirty: StaticBitSet(K), root: ?[32]u8 }.
- chunks is a many-pointer to a heap-allocated Slab.Storage; len/dirty/
  root live inline in the union variant per the canonical design (Y is
  the layout we'll bench against in B6).
- getRoot recovers *Slab.Storage via @ptrCast(@alignCast(chunks)) and
  caches the merkleized root in node.slab.root.
- noChild treats slab as terminal (no Id-children).
- StateView gains isSlab predicate.
- Pool.unref does NOT yet free slab heap storage — that lands with
  Pool.createSlab in B3 (TODO marker placed).
- Pool.createSlab(chunks, len) heap-allocates Slab.Storage, copies the
  caller's chunks in, and seeds a slab Node with len/empty-dirty/null-root.
- Pool.getSlabChunks / getSlabLen expose read-only access for downstream
  SSZ paths (Phase C).
- Pool.unref now frees node.slab.chunks via Slab.destroy when refcount
  drops to zero (was a TODO(B3) placeholder); slab lifecycle is now
  fully Pool-managed.
- Replace B2's white-box slot-planting test with three API-driven tests:
  create/get round-trip, unref-leak check, getRoot integration.

After this commit, callers should never construct Node.slab variants
manually — go through Pool.createSlab.
…riant

- getSlabChunks / getSlabLen migrate from Pool methods to Id methods,
  matching Id.getLeft / Id.getRight conventions.
- Asserts replaced with Error.InvalidNode returns: ReleaseFast strips
  std.debug.assert, exposing UB if a non-slab Id were passed; explicit
  errors are safe in all build modes.
- Tests updated to the new call form: slab_id.getSlabChunks(&pool).

Pool.createSlab stays a Pool method (it allocates from the Pool's
allocator); only the read-side getters move to Id.
- Both methods clone the heap Slab.Storage, apply the requested chunk
  writes, and plant a fresh slab Node pointing at the new Storage; the
  receiver slab is unchanged (CoW).
- New slab has root: null (lazy) and dirty bits set on every changed
  index. The Slab.Storage zero-tail invariant is preserved because we
  only overwrite chunks at indices < K (asserted) and never touch
  chunks[len..K] during the copy.
- Both methods return Error.InvalidNode on non-slab Id (sibling-API
  consistency with Id.getSlabChunks/Len).
- Tests: basic CoW invariant, len preservation, batch update, empty
  batch edge, error path.
Verifies that slab Ids appended to FillWithContentsIterator at the
appropriate depth produce a merkleized root identical to a tree built
from individual chunk leaves. Catches regressions in:
- noChild semantics (slab as terminal)
- getRoot integration of slab + branch lazy compute
- Slab.computeRoot vs std merkleize equivalence

40 PMT tests passing (was 39).
…threshold

When isBasicType(Element) and max_chunk_count >= 2 * Slab.K, FixedListType
builds the SSZ tree via Pool.createSlab leaves at depth chunk_depth - k_log2
instead of pool.createLeaf at depth chunk_depth. Each slab packs K=1024
chunks worth of items contiguously; trailing chunks within the partial last
slab remain zero-bytes (Storage zero-tail invariant preserved).

Affected fields when used in BeaconState (limit-driven): balances,
randao_mixes, slashings, participation, inactivity_scores, etc.

- Re-export Slab from persistent_merkle_tree module.
- comptime threshold: max_chunk_count >= 2 * Slab.K (== 2048 chunks).
  Smaller lists keep the existing per-chunk leaf path (regression-tested).
- The slab path manually inlines an iterator-style merging loop because
  FillWithContentsIterator's zero filler hardcodes @enumFromInt(level),
  which gives the depth-level zero hash. With slab leaves each at depth
  k_log2, the correct filler is at depth (level + k_log2). C1.5 will
  factor this out by adding a leaf_offset parameter to the iterator.
- Tests: large packed u64 list (item_count = 2*K*4 = 8192) verifies
  root vs std hashTreeRoot AND slab nodes present at boundary depth.
  Small list (limit=1024) confirms threshold gate keeps leaf path.
When each appended Id represents a non-leaf subtree (e.g., a Slab Id is
a depth-k_log2 subtree of K chunks), the zero filler emitted by finish()
for missing right siblings at iterator level L must be at absolute depth
L + leaf_offset, not L.

- initWithOffset(pool, depth, leaf_offset) is the new explicit constructor.
- init(pool, depth) preserves the legacy depth-0-leaves contract by
  delegating to initWithOffset(..., 0).
- append() unchanged (doesn't depend on leaf_offset).
- finish() uses (level + leaf_offset) wherever it previously used (level).

Tests: full-tree slab build matches per-leaf reference; partial-fill slab
build matches per-leaf reference with 1/4 missing slabs zero-padded.
This validates that finish()'s zero-filler emits zeros at the correct
absolute depth, not iterator-level depth.
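The depth rule the new constructor encodes can be sketched as follows (function and parameter names are hypothetical; the real iterator indexes a precomputed zero-hash table):

```zig
const std = @import("std");

// When each appended Id is itself a subtree of height leaf_offset
// (e.g. a slab = depth-k_log2 subtree of K chunks), the zero filler
// for a missing right sibling at iterator level L must be the zero
// hash at absolute depth L + leaf_offset, not L.
fn zeroFillerDepth(level: u8, leaf_offset: u8) u8 {
    return level + leaf_offset;
}

test "slab leaves shift the filler depth by k_log2" {
    const k_log2: u8 = 10; // K = 1024 chunks per slab
    // Legacy contract: leaf_offset = 0 reproduces the old behavior.
    try std.testing.expectEqual(@as(u8, 2), zeroFillerDepth(2, 0));
    // Slab build: level-2 filler is the depth-12 zero hash.
    try std.testing.expectEqual(@as(u8, 12), zeroFillerDepth(2, k_log2));
}
```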
- Adds toValuePackedFromBytes / fromValuePackedIntoChunk to UintType and
  Bool element-tree APIs. These let slab-backed containers decode/encode
  packed items directly from chunk bytes, no Node.Id round-trip.
- BasicPackedChunks gains a 4th comptime parameter use_slab. When true,
  get / set / getAllInto navigate to slab-boundary depth (chunk_depth -
  Slab.k_log2) and read/write chunks via Id.getSlabChunks / setSlabChunk
  rather than per-chunk Ids. populateAllNodes is a no-op in slab mode.
- Existing callers (list_basic.zig, array_basic.zig) pass use_slab=false
  to preserve current semantics; no behavior change.
- Re-exports Slab as persistent_merkle_tree.Slab so chunks.zig can
  reference Slab.K / Slab.k_log2 from its slab-mode comptime branch.

This lays groundwork for FixedListType/FixedVectorType opts.slab in step 3.
Adds TypeOpts { slab: bool = false } parameter to FixedListType and
FixedVectorType. When opts.slab=true and isBasicType(Element), the
type's tree paths build / walk slab leaves instead of per-chunk leaves.

- tree.fromValue / deserializeFromBytes use FillWithContentsIterator.
  initWithOffset(slab_depth, k_log2). tree.toValue / serializeIntoBytes
  walk slab Ids at slab_depth and read chunks via Id.getSlabChunks.
- TreeView (list_basic / array_basic) propagates ST.opts.slab to
  BasicPackedChunks(.., use_slab=ST.opts.slab).
- list_basic features that traverse at chunk_depth (iteratorReadonly,
  sliceTo) compile-error when ST.opts.slab=true. Use get/set/getAll/
  serialize/deserialize/hashTreeRoot for slab-enabled types.
- All existing FixedListType/FixedVectorType callers updated to pass
  .{} (default slab=false). Mechanical signature update across
  consensus_types/{phase0,altair,bellatrix,capella,deneb,electra,fulu},
  ssz/{type,tree_view}/*, test/{fuzz,spec}/*.
- Round-trip tests for slab-enabled list and vector verify hash equality
  against the leaf-path reference.

Verification:
- PMT 42/42 ✓
- SSZ 198/198 ✓ (was 194 + 4 new slab round-trip tests)
- state_transition 85/85 ✓ (no regression — opt-in is comptime, no
  existing type opts in to slab, leaf path unchanged)

State_transition green confirms the opt-in plumbing has zero leak into
the leaf path. Existing BeaconState fields (balances, randao_mixes,
slashings, etc.) keep their leaf-per-chunk behavior. Per-field slab
opt-in (and bench-driven validation) is future work.
Adds bench/ssz/list_slab.zig comparing the leaf-default and opts.slab=true
instantiations of FixedListType(UintType(64), 1<<20) on representative
balances-scale workloads. Five scenarios:
  fromValue           — build tree from a populated value
  fromValue+getRoot   — build + compute root (cold lazy hashes)
  toValue             — decode all items back from the tree
  bulkSet+getRoot     — full-rewrite shape (epoch-rewards-like)

Run with `zig build run:bench_list_slab -Doptimize=ReleaseFast`.

Apple Silicon ReleaseFast results (1M u64 items):

  workload                   leaf      slab     speedup
  fromValue 1M               15.4 ms    1.8 ms    8.5×
  fromValue+getRoot 1M       39.5 ms   17.6 ms    2.2×
  toValue 1M                 20.2 ms    1.1 ms   18×
  bulkSet+getRoot 1M         40.4 ms   17.5 ms    2.3×

Reading: slab path eliminates 250K leaf-node allocations (8.5× on
fromValue), provides contiguous chunk reads for bulk decode (18× on
toValue), and benefits from the fixed-shape K-leaf merkleization
specialization on root computation (~2× on getRoot-bound workloads).

The bulkSet+getRoot scenario is shaped after balances epoch-rewards
writeback — full overwrite + recompute root — and shows ~2.3× end-to-end
gain. Sufficient signal to consider opting balances and similar 1M-scale
packed primitive lists into opts.slab=true in a follow-up change.
Switches phase0.Balances from FixedListType(Gwei, VALIDATOR_REGISTRY_LIMIT, .{})
to .{ .slab = true }, taking the chunked-leaf path for the most-touched
packed primitive list in BeaconState. Also drops the inline duplicate of
Balances in fulu.BeaconState — it now aliases phase0.Balances correctly,
which was already the case in altair/bellatrix/capella/deneb/electra.

Bench: zig build run:bench_process_epoch -Doptimize=ReleaseFast
       (mainnet state at slot 13336576, fulu fork, 2.18M validators)

Per-step deltas (segmented breakdown, time/run):

  step                          leaf       slab      speedup
  rewards_and_penalties         267.8 ms   37.6 ms    7.1×
  effective_balance_updates     120.5 ms    9.5 ms   12.7×
  inactivity_updates            109.0 ms  105.6 ms    flat
  participation_flags             0.74 ms   0.73 ms   flat
  proposer_lookahead             16.6 ms   16.4 ms    flat

End-to-end epoch (non-segmented):
  leaf:  5.929 s
  slab:  5.504 s     ( -7.2% )

Total epoch time gain is dominated by `before_process_epoch` (cache
rebuild ~5.3 s) which slab doesn't touch. The targeted balance-bound
operations get 7-13× speedups; this commit unlocks that for any caller
that iterates rewards / effective-balance updates against `state.balances`.

Verification:
- state_transition: 85/85 ✓
- PMT 42/42 ✓ + SSZ 198/198 ✓ (unchanged from prior commits)

Other slab candidates (inactivity_scores, participation, randao_mixes,
slashings) are deliberately left at .{} — opt them in only after their
specific consumer paths are verified slab-safe and bench shows wins.
… opts.slab=true

Switches altair.InactivityScores and altair.EpochParticipation (used by
both previous_epoch_participation and current_epoch_participation) to
opts.slab=true. Same idea as phase0.Balances in 304386c.

Also drops fulu.BeaconState's inline duplicates of these list types in
favor of the altair aliases — bellatrix/capella/deneb/electra already
referenced altair correctly; fulu was the only outlier.

Bench: zig build run:bench_process_epoch -Doptimize=ReleaseFast
       (mainnet state at slot 13336576, fulu fork, 2.18M validators)

End-to-end deltas (time/run, non-segmented epoch):

  config                                                epoch
  leaf baseline (no slab)                              5.929 s
  + Balances slab (304386c)                           5.504 s
  + InactivityScores + EpochParticipation slab         0.557 s   ← this commit

  10.6× total speedup over leaf baseline.

Per-step segmented breakdown (this commit vs leaf baseline):

  step                          leaf       all-slab    speedup
  before_process_epoch          5.30 s     0.555 s      9.6×
  inactivity_updates          109.0 ms     5.82 ms     18.8×
  rewards_and_penalties       267.8 ms     18.1 ms     14.9×
  effective_balance_updates   120.5 ms     6.65 ms     18.1×
  participation_flags          0.74 ms    0.001 ms    ~700×
  proposer_lookahead          16.6 ms     15.6 ms       flat

The 9.6× speedup of before_process_epoch is the second-order effect:
cache-rebuild calls getAll() on participation lists, which used to walk
the full leaf-per-chunk tree path. Slab-enabling those lists collapses
that work to contiguous chunk reads — a win that single-field opt-in
(balances only) couldn't unlock.

Verification:
- state_transition: 85/85 ✓
- PMT 42/42 + SSZ 198/198 unchanged

Remaining slab candidates: randao_mixes (FixedVector(Bytes32)) and
slashings (FixedVector(Gwei)). Both are vectors not lists; randao_mixes
also has Bytes32 element type which may or may not satisfy isBasicType
— defer to a follow-up that explicitly verifies before opt-in.
When a slab-opted-in list/vector is initially empty (e.g. genesis state
before any participation flags are set) or sparsely filled, the tree's
content shape is the early-return zero-leaf form (chunk_count==0 →
createBranch(@enumFromInt(chunk_depth), @enumFromInt(0))). Navigating
this tree at slab_depth lands on a zero sentinel — not a slab — and
the slab-path consumers blow up with Error.InvalidNode in getSlabChunks.

A zero subtree at slab boundary is semantically an all-zero slab. Fix
the slab consumers to handle this:

- BasicPackedChunks.get: return std.mem.zeroes(Element).
- BasicPackedChunks.set: materialize a fresh zero-filled slab before
  applying the write; intermediate slab is unrefed after setSlabChunk.
- BasicPackedChunks.getAllInto: @memset the output slice for the slab's
  worth of items.
- list.zig tree.toValue (slab branch): out.items already initialised
  to default_value; skip the slab payload read.
- list.zig tree.serializeIntoBytes (slab branch): @memset the output.
- vector.zig tree.toValue / serializeIntoBytes (slab branches): same
  shape as list.

Repro: minimal-preset altair finality_rule_4 — pre-state's
previous_epoch_participation has length 0 at genesis. tree.fromValue
takes the early-return path. processBlock then attempts to push
attestation flags via epoch_participation.set; navigation lands on
zero sentinel slot 10 (zero hash at depth k_log2=10) instead of a
slab Node, causing getSlabChunks to fail.

Verification (post-fix):
- PMT 42/42 ✓ + SSZ 198/198 ✓ + state_transition 85/85 ✓
- spec_tests minimal (all 9 shards):
    operations 1731/1828 (97 skipped), epoch_processing 727/727,
    sanity 557/557, rewards 380/380, transition 267/267,
    random 561/561, fork 314/314, finality 48/48, merkle_proof 19/19
  — total 4604/4701 pass, 0 failed (97 skipped due to missing fork
  test data).
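A minimal sketch of the fallback rule with toy types (not the project's BasicPackedChunks): a zero node at the slab boundary is read as an all-zero slab instead of erroring.

```zig
const std = @import("std");

const Element = u64;

// Toy stand-in for what navigation finds at slab-boundary depth:
// either a real slab or the zero sentinel from the early-return
// zero-leaf tree shape.
const SlabNode = union(enum) {
    zero, // zero subtree at slab boundary == semantically an all-zero slab
    slab: []const Element,
};

fn get(node: SlabNode, i: usize) Element {
    return switch (node) {
        // The fix this commit adds: synthesize zeros rather than
        // failing with Error.InvalidNode.
        .zero => std.mem.zeroes(Element),
        .slab => |items| items[i],
    };
}

test "zero node reads as zeros" {
    try std.testing.expectEqual(@as(u64, 0), get(.zero, 42));
    const items = [_]u64{ 7, 8, 9 };
    try std.testing.expectEqual(@as(u64, 8), get(.{ .slab = &items }, 1));
}
```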
When BasicPackedChunks.set is called on a slab Id that no parent Branch
references yet (rc==0), the slab is exclusively owned by the TreeView's
children_nodes cache. Mutating the heap chunks in place is safe and
avoids the 32 KB CoW clone per write.

The first write per slab still hits the CoW path (rc>=1, owned by the
persistent tree), publishing a fresh transient slab to the cache. Every
subsequent write to the same slab finds rc==0 and writes a single byte
directly into the heap chunks, accumulating dirty bits across writes
and invalidating the cached root.

For attestation processing in altair+ blocks (~committee_size validators
× many attestations per block, all hitting the same participation
slab), this collapses N × 32 KB writes into 1 × 32 KB clone +
(N-1) × byte writes.

Bench impact (process_block ReleaseSafe, mainnet state slot 13336576):

  step                     leaf      slab CoW   slab+in-place
  operations_no_sig        9.1 ms    210 ms     1.6 ms          (5.7× vs leaf)
  process_block_no_sig    15.5 ms    214 ms     4.8 ms          (3.2× vs leaf)
  operations              39.5 ms    234 ms     27.2 ms         (1.4× vs leaf)
  process_block           42 ms      240 ms     34 ms           (1.2× vs leaf)

Process_epoch (ReleaseFast) per-step deltas remain or improve slightly:

  step                          slab CoW   slab+in-place
  inactivity_updates             6.3 ms    5.4 ms
  rewards_and_penalties          19 ms     19 ms
  effective_balance_updates      7.7 ms    6.7 ms

End-to-end: process_block 14× regression (slab CoW vs leaf) flips to
3.2× speedup; process_epoch keeps its order-of-magnitude advantage on
the bulk-getAll cache-rebuild path.

Verification:
- PMT 42/42 ✓ + SSZ 198/198 ✓ + state_transition 85/85 ✓

Correctness sketch: rc==0 on a slab Node Id means no parent Branch has
ref'd it yet. After Pool.createSlab or Id.setSlabChunk, the returned
Id has rc==0 until Pool.rebind during commit raises it to 1. While in
the TreeView's transient state (children_nodes cache, gindex in
`changed`), the slab is observable only via the current TreeView and
its heap chunks can be mutated without violating shared-tree invariants.
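The rc==0 fast path can be sketched with a toy slab type (names and layout are illustrative, not the branch's code):

```zig
const std = @import("std");

const Slab = struct {
    bytes: [32 * 1024]u8, // K=1024 chunks × 32 B
    root_valid: bool,
};

// rc==0: the slab is owned only by the TreeView's transient cache, so
// mutate in place. rc>=1: the persistent tree still references it, so
// clone (32 KB CoW) before writing.
fn setByte(slab: *Slab, ref_count: u32, i: usize, v: u8, gpa: std.mem.Allocator) !*Slab {
    if (ref_count == 0) {
        slab.bytes[i] = v; // exclusive owner: single-byte write
        slab.root_valid = false; // invalidate cached root
        return slab;
    }
    const fresh = try gpa.create(Slab); // shared: CoW clone
    fresh.* = slab.*;
    fresh.bytes[i] = v;
    fresh.root_valid = false;
    return fresh;
}

test "rc==0 writes in place; rc>=1 clones" {
    const gpa = std.testing.allocator;
    const original = try gpa.create(Slab);
    defer gpa.destroy(original);
    original.* = .{ .bytes = [_]u8{0} ** (32 * 1024), .root_valid = true };

    const shared = try setByte(original, 1, 0, 7, gpa); // CoW path
    defer gpa.destroy(shared);
    try std.testing.expect(shared != original);

    const same = try setByte(shared, 0, 1, 9, gpa); // in-place path
    try std.testing.expect(same == shared);
}
```

This is exactly the N × 32 KB → 1 × 32 KB + (N−1) byte-writes collapse described above: the first write per slab clones, every later write to the same transient slab hits the in-place branch.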
Replaces the slab Node variant's many-pointer + len encoding with a
single `*Slab.Storage` pointer (Lighthouse milhouse's Arc<PackedLeaf>
shape adapted to our Pool-managed ref counting). Storage is now self-
contained: chunks + len in one heap allocation, owned uniquely by the
slab Node Id whose ref count tracks parent Branch references.

Variant before:
  slab: { chunks: [*]align(64) [32]u8, len: u16, root: ?[32]u8 }
Variant after:
  slab: { storage: *Slab.Storage, root: ?[32]u8 }

Side effects:
- Removes every `@ptrCast(@alignCast(node.slab.chunks))` reconstruction
  in getRoot / unref / setSlabChunk / setSlabChunks. Each callsite now
  uses `node.slab.storage` directly.
- Slab.Storage gains a `len: u16` field (was inline in the Node variant).
  allocZero initialises len = 0; createSlab / setSlabChunk* propagate.
- Drops the unused `dirty: StaticBitSet(K)` field that was inflating the
  slab variant by 128 B for no read-side benefit (root invalidation via
  `root = null` already drives lazy recompute, and computeRoot always
  merkleizes the full K-leaf subtree).
- Slab variant size drops from ~171 B (with dirty) to ~48 B, on par with
  the branch variant (41 B). The Node union sizes itself to whichever
  variant is largest.

Verification:
- PMT 42/42 ✓ + SSZ 198/198 ✓ + state_transition 85/85 ✓
- process_block ReleaseSafe still produces correct slab+in-place numbers
  (operations_no_sig 1.6 ms vs leaf 9.1 ms — 5.7×).

Note: bench_process_block still SEGVs in ReleaseFast / ReleaseSmall on
this branch, independent of this refactor (ReleaseSafe's runtime tag
checks mask the UB). Suspected Zig 0.16 codegen issue on tagged union
dispatch in deep recursion paths through getRoot. Filed as a follow-up
to investigate / report upstream; doesn't block correctness or the
ReleaseSafe perf story.
…Depth

`unfinalized_parents_buf` was declared as `undefined` and never initialized.
On iteration 0, the post-rebind cleanup loop reads garbage Optional bytes
and may call `pool.unref` on an arbitrary Id — recursively decrementing
its children's ref counts. When this hits a slab Id still referenced by
a parent branch, the slab is destroyed prematurely (use-after-free).

Manifested as SEGV in `bench_process_block -Doptimize=ReleaseFast` deep
inside `Slab.computeRoot`, with a bogus storage pointer pattern matching
two adjacent u32 values from a freed slot's `next_free` field. ReleaseSafe
hid the bug because its `.free => unreachable` runtime check panics
explicitly; ReleaseFast UB-eliminates the unreachable arm and silently
misroutes `.free` slots into the `.slab` arm.

Fix:
  - Initialize `unfinalized_parents_buf` to all-null so iter 0 sees no
    stale entries.
  - Replace `.free => unreachable` in `Id.getRoot` with `@panic` so any
    future use-after-free surfaces as a clean panic rather than UB-driven
    SEGV under ReleaseFast.

Verified:
  - test:persistent_merkle_tree 42/42
  - test:state_transition 85/85, 0 leaks (was 236)
  - test:ssz 198/198
  - test:spec_tests fork mainnet 276/38 skip (matches pre-fix baseline)
  - bench process_block ReleaseFast: operations_no_sig 1.7ms, process_block
    33ms (matches ReleaseSafe baseline; previously SEGV)
The branch and slab variants stored root as `?[32]u8` so that lazy
(uncomputed) state was represented by `null`. The Optional encoding adds
a 1-byte tag and forces every read site to unwrap via `.?`.

Replace with a plain `[32]u8` field plus a global sentinel constant
`lazy_sentinel = [_]u8{0xFF} ** 32`. SHA-256 outputs are cryptographically
unlikely to equal this value (~1 in 2^256). The unwrap-`.?` pattern is
gone and the code is simpler.

Note: this does NOT shrink `@sizeOf(Node)`. The slab variant's
`*Slab.Storage` pointer forces 8-byte union alignment, which already
pads the union to 48 bytes regardless of the Optional. So this is a
correctness/clarity refactor, not a perf change. The follow-up SoA-split
refactor (decomposing the union into top-level NodeWithMeta fields) is
where the 6× per-node-visit cache footprint reduction will come from.

Verified:
  - test:persistent_merkle_tree 42/42
  - test:state_transition 85/85, 0 leaks
  - bench process_epoch ReleaseFast: epoch_total 328ms (vs 324ms baseline,
    within noise)
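The sentinel encoding described here is small enough to sketch directly (the `lazy_sentinel` name is from the commit; the helper is illustrative):

```zig
const std = @import("std");

// All-0xFF root bytes mean "not yet computed". A SHA-256 output equals
// this value with probability ~2^-256, so the sentinel is safe in
// practice and removes the Optional's 1-byte tag and `.?` unwraps.
const lazy_sentinel: [32]u8 = [_]u8{0xFF} ** 32;

fn rootIsComputed(root: [32]u8) bool {
    return !std.mem.eql(u8, &root, &lazy_sentinel);
}

test "sentinel marks lazy state" {
    try std.testing.expect(!rootIsComputed(lazy_sentinel));
    try std.testing.expect(rootIsComputed([_]u8{0} ** 32));
}
```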
The tagged-union NodeWithMeta forced every navigation visit to load 48 B
(cache density 1.3 nodes/cache-line). Profile (sample, 30s, ReleaseSafe)
identified `Node.Id.getNodesAtDepth` as the dominant hotspot (+42% /
+11.6 percentage points share) of the regression vs the pre-refactor
SoA u32-column layout.

Replace `union(enum) Node` with five top-level fields on `NodeWithMeta`
(left, right, root, cache, kind) plus ref_count, so MultiArrayList yields
six dense columns. Hot navigation (`getNodesAtDepth`, `setNodes`,
`setNodesAtDepth`, `getNode`, `setNode`, `truncateAfterIndex`,
`fillToLength`, `DepthIterator.next`) now reads only `kind` (1 B) and
one child Id (4 B) per visit — beating the pre-fecab80f layout's
footprint and recovering the 2.8x regression plus extra headroom.

The `cache: ?*anyopaque` column holds `*Slab.Storage` for slab nodes;
future variants (validators struct cache, sync committee cache, etc.)
reuse the same column via type-erased pointer cast — no schema changes
needed.

Free-list link reuses the `left` column (free slots have no other use
for it), keeping the column count at 6.

Bench (process_epoch ReleaseFast, fulu, 2.18M validators):
  before_process_epoch:  282ms -> 169ms  (-40%)
  epoch_total:           328ms -> 215ms  (-35%)
  vs upstream main cd45847 (leaf):
    epoch_total:         394ms -> 215ms  (-46%, 1.84x faster)
    before_process_epoch: 190ms -> 169ms (-11%, faster than leaf)
    inactivity_updates:  54.5ms -> 5.3ms (-90%)
    rewards_and_penalties: 74.6ms -> 17.9ms (-76%)
    effective_balance_updates: 58.9ms -> 6.5ms (-89%)

process_block ReleaseFast: equivalent to upstream main within noise
(~33ms standalone, ~214ms segmented). No path regression.

All invariants preserved:
  - unfinalized_parents_buf null-init in setNodes/setNodesAtDepth
    (UAF fix from ca5ac32)
  - .free => @panic in getRoot (defense vs ReleaseFast UB elimination)
  - lazy_sentinel for branch/slab uncomputed roots (from 17ded4a)

Tests: 42/42 PMT, 198/198 SSZ, 85/85 state_transition (0 leaks),
276/38skip fork shard mainnet (matches baseline).
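The six-column SoA split can be sketched with `std.MultiArrayList` — column names follow the commit text, but the exact field types are assumptions:

```zig
const std = @import("std");

const Id = enum(u32) { _ };

// Decomposed NodeWithMeta: MultiArrayList stores each field as its own
// dense column, so hot navigation touching only `kind` + one child Id
// reads ~5 B per visit instead of loading a 48 B tagged union.
const NodeWithMeta = struct {
    left: Id,           // also reused as the free-list link on free slots
    right: Id,
    root: [32]u8,       // lazy_sentinel when uncomputed
    cache: ?*anyopaque, // *Slab.Storage for slab nodes; type-erased for reuse
    kind: u8,
    ref_count: u32,
};

test "columns are dense and independently addressable" {
    var list: std.MultiArrayList(NodeWithMeta) = .{};
    defer list.deinit(std.testing.allocator);
    try list.append(std.testing.allocator, .{
        .left = @enumFromInt(0),
        .right = @enumFromInt(0),
        .root = [_]u8{0} ** 32,
        .cache = null,
        .kind = 0,
        .ref_count = 1,
    });
    const kinds = list.items(.kind); // a hot loop touches only this column
    try std.testing.expectEqual(@as(u8, 0), kinds[0]);
}
```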
Introduces a new persistent_merkle_tree variant `branch_struct` whose
payload is a heap-allocated `BranchStructRef` (vtable + cached
deserialized struct pointer). Storage piggy-backs on the existing SoA
`cache` column (no pointer-packing) and root caching uses the existing
`lazy_sentinel` mechanism — same one-arm dispatch shape as `.slab`.

A new SSZ type `StructContainerType(ST)` and view `StructContainerTreeView(ST)`
expose this layer:
  - tree.fromValue / toValue / serialize round-trip via the branch_struct
    slot (one Pool node per container, regardless of field count)
  - field reads/writes are O(1) struct accesses; commit lazily recomputes
    the root only when needed
  - get_root is dispatched through the BranchStructRef vtable so the Pool
    stays type-erased
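As a hedged sketch of that round-trip (call shapes inferred from the bullets above, not the real API):

```zig
// Sketch only — `Validator`, `pool`, and `value` are assumed to be in scope.
// One Pool node per container, regardless of field count.
const node = try Validator.tree.fromValue(pool, &value);
defer pool.unref(node);

// Round-trip back to a plain value; field data comes from the cached
// struct behind the branch_struct slot, not from walking a merkle subtree.
var out: Validator.Type = undefined;
try Validator.tree.toValue(node, pool, &out);
```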

Switches phase0.Validator from FixedContainerType → StructContainerType.
`validatorsSlice` now reads directly from the cached struct array instead
of walking 2M × 8-leaf merkle subtrees.

Bench (process_epoch ReleaseFast, fulu, 2.18M validators):
  before_process_epoch:    169 ms -> 68 ms   (-60%)
  epoch_total:             215 ms -> 110 ms  (-49%)
  vs upstream main cd45847 (no slab, no struct cache):
    epoch_total:           394 ms -> 110 ms  (-72%, 3.6x faster)
    before_process_epoch:  190 ms -> 68 ms   (-64%, 2.8x faster)
    inactivity_updates:    54.5 ms -> 5.1 ms (10.7x)
    rewards_and_penalties: 74.6 ms -> 16.2 ms (4.6x)
    effective_balance_updates: 58.9 ms -> 6.0 ms (9.8x)

Bench (process_block ReleaseFast):
  block_total (segmented):  214 ms -> 111 ms  (-48%)
  process_block standalone: 33.5 ms -> 31.6 ms (~6% faster within noise)

Tests: 42/42 PMT, 199/199 SSZ (+1 StructContainer test), 85/85
state_transition (0 leaks), 276/38skip fork shard mainnet (matches baseline).

Design notes (vs PR #232 reference):
  - Stores BranchStructRef* in the SoA `cache` column rather than packing
    into the Id columns — cleaner, no bit-packing.
  - Reuses lazy_sentinel for root caching (one-arm dispatch like slab) vs
    PR 232's two-state branch_struct_lazy/computed enum split.
  - Omits `to_tree` materialization vtable slot (no current consumers
    need single-leaf proofs through Validator subfields; trivial to add
    later).
…d-root leak

Two related leaks identified by Codex review of the chunked-leaf branch:

1. `Pool.deinit` only freed the MultiArrayList columns. Slots with `.slab`
   or `.branch_struct` kind own a heap-allocated cache pointer
   (`*Slab.Storage` or `*BranchStructRef`); when callers tear down the
   pool without first unref'ing every root, those payloads were leaked.
   Walk all slots and free the cache payload for any kind that owns one
   before destroying the column buffers.

2. `StructContainerTreeView.getFieldRoot` allocated a fresh PMT slot via
   `ChildST.tree.fromValue(pool, &field_value)` just to read the cached
   root, but never unref'ed it. Each call leaked one Pool slot for the
   pool's lifetime — composite child types would also leak their heap
   payloads. Add a per-view `field_root_cache: [chunk_count][32]u8`
   stable backing store, copy the hash into it, and unref the temporary
   slot immediately so the returned `*const [32]u8` stays valid without
   leaking.

Verified: PMT 42/42, SSZ 199/199, state_transition 85/85 (0 leaks).
Opaque PMT nodes (.branch_struct from StructContainerType, .slab from
chunked-leaf packing) terminate `getLeft`/`getRight` traversal because
their children are not stored as separate Ids. This blocked single- and
compact-multi proofs through any path that crosses validators[i].field
or balances[i] / inactivity_scores[i] / participation[i].

Add `Pool.materializeBranchStruct` and `Pool.materializeSlab`, both of
which lazily build a temporary plain PMT subtree from the cached payload
that proof traversal can navigate. Extend the BranchStructRef vtable
with `to_tree`, implemented for StructContainerType.WrappedT via
`FixedCT.tree.fromValue`. proof.zig gains an `isOpaqueNode` /
`materializeOpaque` pair plus a deferred-unref ArrayList for the
compact-multi recursive walk; createSingleProof keeps a single optional
materialized root and rejects nested opaques with `error.NestedOpaque`.

Verified by two new regression tests (`single proof: validators[0].
withdrawal_credentials` and `single proof: balances[0]`) in
src/fork_types/any_beacon_state.zig — both fail before this commit
with `Error.InvalidNode` from `getRight` on the opaque node and pass
after.

Why spec tests didn't catch this:
  - EF `merkle_proof` shard only covers
    `BeaconBlockBody.blob_kzg_commitment_merkle_proof` (4 cases)
  - EF `light_client` shard has BeaconState proofs (sync_committee,
    finality_root) but lodestar-z has no `light_client` runner yet
    (separate follow-up). Even with a runner, those paths don't descend
    into list elements (`validators[i]`/`balances[i]`), so they wouldn't
    hit our opaque-node bug.
  - The included tests construct the exact `validators.0.X` and
    `balances.0` paths that branch_struct and slab respectively block.

Tests: PMT 42/42, SSZ 199/199, fork_types 9/9 (incl. 2 new), state_transition
85/85 (0 leaks), spec_tests fork mainnet 276/38skip (matches baseline).
Bench process_epoch ReleaseFast: 121ms / 77ms (within current variance,
no regression — opaque materialization only fires inside proof
generation, never on epoch hot paths).
Replace `pub const NodeWithMeta = struct {...}` with file-level fields
and `const Node = @This();`. `MultiArrayList(NodeWithMeta)` becomes
`MultiArrayList(Node)`. Pure syntactic refactor — no behavior change,
no perf change. Aligns with main branch's idiom and removes the awkward
`NodeWithMeta` vs `Node` naming distinction.
Replace separate `kind: NodeKind` (u8) and `ref_count: u32` columns with
a single packed `state: State` u32 column. Layout matches main:
  - bit 31 set   → free slot;  bits 0..30 = next-free Id
  - bit 31 clear → in-use;     bits 28..30 = kind, bits 0..27 = ref_count

Free-list link migrates from the `left` column into `state` itself
(matching the legacy union encoding). Branch-free `kind()` decoder
avoids a conditional on every navigation visit.

Saves 1 byte/node and consolidates the ref/unref hot path: a single
column read replaces touching both kind and ref_count.

External callers (chunks.zig, list.zig, vector.zig, proof.zig) updated
to read `state[idx].kind()` / `state[idx].refCount()` accordingly.

Bench impact: epoch -5%, block -5%, sync_aggregate -30% vs baseline
(plus 1 B/node memory savings).
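The packed layout above can be reconstructed as roughly (kind names and exact method shapes are assumptions; only the bit layout comes from the commit text):

```zig
const NodeKind = enum(u3) { zero, leaf, branch, chunked_leaf, container_struct };

/// Hedged sketch of the packed state word:
///   bit 31 set   -> free slot;  bits 0..30 = next-free Id
///   bit 31 clear -> in-use;     bits 28..30 = kind, bits 0..27 = ref_count
const State = enum(u32) {
    _,

    const free_bit: u32 = 1 << 31;

    pub fn isFree(self: State) bool {
        return @intFromEnum(self) & free_bit != 0;
    }

    /// Branch-free decode — meaningful only for in-use slots.
    pub fn kind(self: State) NodeKind {
        return @enumFromInt(@as(u3, @truncate(@intFromEnum(self) >> 28)));
    }

    pub fn refCount(self: State) u32 {
        return @intFromEnum(self) & 0x0FFF_FFFF;
    }

    pub fn nextFree(self: State) u32 {
        return @intFromEnum(self) & ~free_bit;
    }
};
```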
Collapse three columns (`left: Id`, `right: Id`, `cache: ?*anyopaque`)
into a single `payload: u64` overloaded by kind:
  - branch        : low 32 = left Id,  high 32 = right Id
  - slab          : full 64-bit `*Slab.Storage` pointer
  - branch_struct : full 64-bit `*BranchStructRef` pointer
  - free          : low 32 = next-free Id (high 32 unused)

Final SoA layout: 3 columns total (payload + root + state), down from 6.
Memory savings vs baseline: 9 B/node total (1 B from State pack, 8 B
here from cache + one of left/right).

New helpers `packChildren`/`unpackLeft`/`unpackRight`/`payloadAsPtr`/
`ptrAsPayload` keep the encode/decode local. `slabStorage` and
`branchStructRef` now read from the payload column.

External slab CoW path in `chunks.zig` uses the new public
`Id.getSlabStorageMut` helper instead of poking the column directly.

Bench impact: roughly perf-neutral vs the State-only commit on
process_epoch / process_block; significant wins on slab-heavy paths
(set_nodes_randomly -28%). Validator-iteration (branch_struct heavy)
holds steady because the pointer fits in a single u64 read.

Note: payload-merged design assumes 64-bit usize. lodestar-z (and all
ETH consensus clients) target only 64-bit platforms.
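The encode/decode helpers named above reduce to straightforward bit shuffles; a sketch (signatures assumed, layout from the list above):

```zig
// branch: low 32 bits = left Id, high 32 bits = right Id.
fn packChildren(left: u32, right: u32) u64 {
    return @as(u64, left) | (@as(u64, right) << 32);
}

fn unpackLeft(payload: u64) u32 {
    return @truncate(payload);
}

fn unpackRight(payload: u64) u32 {
    return @truncate(payload >> 32);
}

// Pointer kinds store the full address — hence the 64-bit-usize assumption.
fn ptrAsPayload(ptr: *anyopaque) u64 {
    return @intFromPtr(ptr);
}

fn payloadAsPtr(comptime T: type, payload: u64) *T {
    return @ptrFromInt(payload);
}
```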
`Id.getState(pool)` now returns the packed `State` value (matching main's
API) instead of a wrapper struct. The State enum already exposes all
predicates (`isFree`, `kind`, `refCount`, `nextFree`); the StateView
indirection became redundant after C2 re-introduced packed State.

Add `isZero/isLeaf/isBranch/isSlab/isBranchStruct` to State for parity
with main. `isBranchLazy` / `isBranchComputed` stay on Id because they
read both state and root columns.

Net -24 lines: removes a struct that just forwarded calls to State.
GrapeBaBa added 6 commits May 4, 2026 22:38
The "slab" name was borrowed from kernel slab-allocator terminology and
conflicted with the unrelated `state` field on TreeViewState (`self.state`
in chunks.zig). Rename to `ChunkedLeaf` (matching the existing branch
and design-doc terminology) so the data structure self-documents:
a single Node holding K=1024 packed leaf chunks instead of K individual
leaf Nodes.

Mechanical rename across ~290 sites:
- type Slab → ChunkedLeaf
- NodeKind.slab → .chunked_leaf
- TypeOpts.slab → .chunked_leaf
- methods: createSlab → createChunkedLeaf, setSlabChunk(s) →
  setChunkedLeafChunk(s), getSlabChunks/Len/StorageMut →
  getChunkedLeaf*, materializeSlab → materializeChunkedLeaf,
  isSlab → isChunkedLeaf
- file renames: slab.zig → chunked_leaf.zig, slab_test.zig →
  chunked_leaf_test.zig, bench/ssz/list_slab.zig →
  list_chunked_leaf.zig
- bench target: bench_list_slab → bench_list_chunked_leaf

Verified perf-neutral via 5 paired alternating bench runs
(paired t = 0.26, well below 95% CI threshold).
- Convert ChunkedLeaf to file-as-struct (Node.zig precedent), rename to
  ChunkedLeaf.zig, merge chunked_leaf_test.zig into it. Drop allocZero/
  destroy wrappers; callers use allocator.create directly (production
  paths immediately overwrite the chunks blob, so the explicit zero-init
  was dead work).
- Add Node.truncateAfterIndexWithLeafOffset so the chunked_leaf-mode
  sliceTo zeroes right siblings using ZeroHash[depthi + k_log2] instead
  of ZeroHash[depthi]; original truncateAfterIndex now delegates with
  offset=0. Asserts depth + leaf_offset <= max_depth.
- ListBasicTreeView: implement iteratorReadonly and sliceTo for
  chunked_leaf via two-level (chunked_leaf_idx, intra_chunk) navigation.
  iteratorReadonly caches the current ChunkedLeaf chunks pointer and
  handles .zero-sentinel boundaries; sliceTo fast-paths .zero boundaries
  (no extra alloc) and only materializes a trimmed ChunkedLeaf when the
  cut falls inside a real one. Use getChunkedLeafChunks (const) for the
  read paths in sliceTo and chunks.zig CoW.
- Compile-time guard: chunked_leaf=true requires limit (or length) >=
  K * items_per_chunk so chunk_depth >= ChunkedLeaf.k_log2; otherwise
  chunked_leaf_depth underflows. Update TypeOpts.chunked_leaf doc to
  reflect the new sliceTo/iteratorReadonly support.

Tests: 8 chunked_leaf tests covering happy paths + 2 defensive tests
that synthesize zero-sentinel boundary trees via FillWithContentsIterator
to exercise the .zero branches in next() and sliceTo.

Verified against ListTLeaf reference: every chunked_leaf sliceTo root
matches the equivalent leaf-path tree.
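The two-level (chunked_leaf_idx, intra_chunk) navigation amounts to splitting the flat chunk index on k_log2 = 10; a sketch (helper name hypothetical, constants from the commit text):

```zig
const Index2 = struct { chunked_leaf_idx: usize, intra_chunk: usize };

/// K = 1024 chunks per ChunkedLeaf; items_per_chunk is 4 for List<u64>
/// (32-byte chunk / 8-byte item).
fn locate(item_idx: usize, items_per_chunk: usize) Index2 {
    const chunk_idx = item_idx / items_per_chunk;
    return .{
        .chunked_leaf_idx = chunk_idx >> 10,   // which ChunkedLeaf blob
        .intra_chunk = chunk_idx & (1024 - 1), // chunk slot inside it
    };
}
```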
`branch_struct` named the variant after its tree-shape role (a "branch"
that wraps a struct), but it always backs a Container/StructContainer
SSZ type. `container_struct` matches the actual semantic.

Also tighten two doc comments:

- `lazy_sentinel`: now lists all three variants that use it (branch,
  chunked_leaf, container_struct) and explains why the lazy/computed
  flag lives in the root column rather than in `State.kind` (keeps
  structural type orthogonal to cache validity, so each new kind costs
  one slot in `kind` and `getRoot` toggles laziness without touching
  State).
- `ChunkedLeaf.computeRoot`: explain why we don't reuse
  `hashing.merkleize` (it requires mutable input; `chunks` is *const
  because the heap blob is shared across CoW siblings) and what the
  16 KB scratch `buf` is doing in each round.

No behavior change; PMT/SSZ/consensus_types/fork_types/state_transition
unit tests + zig fmt clean.
…order

Pool was placed below Id (lines 1062..1479) after the SoA refactor; main
keeps the canonical order State → Pool → Id which reads top-down: low
constants → storage → handle/methods. Moves the Pool block (~420 lines)
back to that slot. Also drops the stale "Defense-in-depth" comment block
on Id.getRoot's `.free` arm — the @Panic message itself ("getRoot called
on .free slot — use-after-free") is sufficient.

No behavior change; PMT/SSZ/state_transition tests + zig fmt clean.
…AtDepth

Both `setNodes` and `setNodesAtDepth` re-bound `states` and `payloads`
at the end of each loop iteration with a comment claiming `pool.unref`
or `pool.rebind` "may have grown the pool indirectly via subsequent
calls". Neither path can grow the MAL:

- `pool.rebind` writes existing slots and calls `refUnsafe` (rc++ only).
- `pool.unref` decrements rc, transitions slots to free, and calls
  `allocator.destroy` on chunked_leaf / container_struct heap blobs.
  The `container_struct.deinit` callback signature is `fn(ptr,
  Allocator) void` — it doesn't get a `*Pool` and so can't trigger
  `Pool.alloc`/`preheat`.

The only path that grows the underlying MultiArrayList is
`Pool.alloc` -> `Pool.preheat` (via free-list exhaustion), and that
call site already refreshes the slices when it returns true. The
end-of-iter refresh is dead code; removing it has no behavior change.

Also tighten the initial-bind comment to match: only `pool.alloc`
needs to be followed by a refresh.

PMT/SSZ/state_transition tests + zig fmt clean.
After the file-as-struct refactor `ChunkedLeaf.Storage` no longer
exists — `ChunkedLeaf` itself is the struct. The leftover `Storage`
naming is misleading.

- `Id.getChunkedLeafStorageMut` -> `Id.getChunkedLeafPtr` (also drops
  the Rust-style `Mut` suffix for Zig stdlib's `Ptr` convention; cf.
  `HashMap.getPtr`).
- Private helper `chunkedLeafStorage` -> `chunkedLeafPtr`.
- Doc comments referring to "the Storage invariant" / "Heap Storage
  cloned" updated to "trailing-zero invariant" / "Heap blob cloned".
- Drop two redundant inline comments that explained well-understood
  Zig idioms (slice-after-MAL-bind once, hot-loop bytes-per-visit
  micro-detail).

Local variable names `storage` / `old_storage` / `new_storage` kept —
they describe the role of the pointed-to heap blob, not the type.

PMT/SSZ tests + zig fmt clean.
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant performance improvements to the persistent merkle tree, primarily targeting state transition efficiency. By restructuring node storage and introducing an opt-in chunked-leaf packing mechanism for basic lists, the changes drastically reduce memory overhead and cache misses during deep-tree traversals. Additionally, the PR includes necessary API extensions for these new node types and addresses critical memory safety bugs.

Highlights

  • Performance Optimization: Implemented flat SoA Node columns and chunked-leaf packing for basic lists to significantly reduce cache misses and improve state transition performance.
  • Memory Management: Optimized Node state encoding into a single u32 and introduced chunked-leaf nodes to reduce the number of Node IDs for large lists.
  • API Additions: Added new Pool and Id methods to support chunked_leaf and container_struct node kinds, including proof traversal support.
  • Bug Fixes: Fixed uninitialized memory in parent buffers and addressed use-after-free issues in Node.Id.getRoot.



Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request optimizes the Persistent Merkle Tree and SSZ implementations by introducing chunked leaves and struct-backed containers, supported by a refactor of the Node and Pool structures into a compact SoA layout. Feedback identifies several safety and style guide violations, most notably the use of large stack allocations (up to 32KB) which risk stack overflow. The reviewer also highlighted potential reference count corruption in createBranch and rebind under error conditions, the use of forbidden recursion in getRoot, and suggested panicking on use-after-free in unref to maintain the project's safety standards.

Comment thread src/persistent_merkle_tree/ChunkedLeaf.zig Outdated
Comment thread src/persistent_merkle_tree/Node.zig Outdated
Comment thread src/persistent_merkle_tree/Node.zig Outdated
Comment thread src/ssz/tree_view/chunks.zig Outdated
Comment thread src/ssz/tree_view/list_basic.zig Outdated
Comment thread src/ssz/type/list.zig Outdated
Comment thread src/ssz/type/vector.zig Outdated
Comment thread src/ssz/type/vector.zig Outdated
Comment thread src/persistent_merkle_tree/Node.zig
Comment thread src/persistent_merkle_tree/Node.zig
Addresses Gemini code review on PR #346:

Stack growth (TigerStyle: no large stack frames). 7 sites previously
held 16-32 KB on the stack inside hot or recursive paths. Replaced
with heap-backed alternatives:
- `ChunkedLeaf.computeRoot` now takes a caller-supplied `*[K/2][32]u8`
  scratch (16 KB). New `computeRootAllocating(allocator, out)` wraps
  it with a per-call `allocator.alignedAlloc` + free; OOM panics.
  `Id.getRoot`'s chunked_leaf arm uses this; bench shows ~3% slowdown
  on `fromValue+getRoot` from per-call alloc, acceptable for keeping
  scratch off-stack.
- New `Pool.createChunkedLeafEmpty(len)` returns a zeroed heap blob;
  caller fills via `getChunkedLeafPtr`. Replaces the
  `var chunked_leaf_buf: [K][32]u8 align(64)` 32 KB stack scratch +
  `pool.createChunkedLeaf(&buf, ...)` memcpy pattern at:
  - `chunks.zig` set `.zero` materialization
  - `list_basic.zig` sliceTo boundary (already used the nullable
    + defer ownership pattern)
  - `list.zig` fromValue (byte-stream + item paths)
  - `vector.zig` fromValue (byte-stream + item paths)

Refcount ordering in `createBranch` and `rebind`. Both wrote parent
metadata (kind=.branch, payload=children) BEFORE acquiring child
refs. If `refUnsafe(right)` fails, parent is `.branch` pointing at a
child whose rc was never incremented; caller's `unref(parent)` then
recursively dec's the un-ref'd child, corrupting its rc. Bug exists
on main too but never triggered there (no heap-pointer kinds → no
SEGV from dangling payloads). Fix: ref both children first with
errdefer rollback, then write parent.
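The fixed ordering can be sketched as (the `errdefer` pattern follows the commit text; exact helper names beyond `refUnsafe`/`unref` are assumptions):

```zig
// Acquire child refs first, so any failure unwinds cleanly …
try pool.refUnsafe(left);
errdefer pool.unref(left);
try pool.refUnsafe(right);
errdefer pool.unref(right);

// … and only then write parent metadata (kind = .branch, packed children).
// No failure path can now leave a .branch pointing at un-ref'd children.
```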

`unref(.free)` now panics. Was `silently continue`. The masked bug
was caught by `node_test.test.Node.State predicates` which had a
double-unref pattern (`createBranch(leaf, leaf)` refs leaf twice;
test had both `defer unref(leaf)` and `defer unref(branch)` —
branch's destruction cascade frees leaf, then explicit unref(leaf)
hits .free). Fixed the test by removing the unnecessary `defer
unref(leaf)`; createBranch transitively owns the leaf.

Ownership transfer pattern at 5 fromValue sites and chunks.zig set:
`var id_opt: ?Node.Id = try ...createChunkedLeafEmpty(...)` + `errdefer
if (id_opt) |id| pool.unref(id)` + `id_opt = null` after `it.append`.
Cleanly handles the early-`InvalidLength` failure mode of append.
(append's mid-loop createBranch failure is a pre-existing leak in
`FillWithContentsIterator` itself; out of scope here. A future
`appendAssumeCapacity`-style refactor would eliminate it.)
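Expanded slightly, the ownership-transfer pattern quoted above looks like (a sketch; `len` and the fill step stand in for the per-site details):

```zig
// Nullable id + errdefer + null-out on successful transfer.
var id_opt: ?Node.Id = try pool.createChunkedLeafEmpty(len);
errdefer if (id_opt) |id| pool.unref(id);

// ... fill the blob via pool.getChunkedLeafPtr(id_opt.?) ...

try it.append(id_opt.?); // may fail early with InvalidLength
id_opt = null;           // success: the iterator now owns the slot
```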

`chunks.zig` `.zero` materialization now passes `len = intra_chunk
+ 1` instead of `0` so the chunked_leaf trailing-zero invariant holds.

Verified: PMT/SSZ/state_transition tests + zig fmt clean. Bench
`bench_list_chunked_leaf` shows fromValue+getRoot 17.8 → 18.4 ms
(+3%, per-call alloc cost), fromValue/toValue within noise, no
regression.

Not addressed in this commit (deferred to follow-ups):
- `Id.getRoot` recursion → iterative (Gemini MEDIUM; bounded by
  max_depth=64, real but not urgent).
- `FillWithContentsIterator.append` mid-loop leak (needs API rework
  to appendAssumeCapacity model).
@GrapeBaBa GrapeBaBa changed the title perf(pmt): SoA columns + chunked-leaf packing for basic lists perf(pmt): chunked-leaf packing for basic lists May 5, 2026
@GrapeBaBa GrapeBaBa changed the title perf(pmt): chunked-leaf packing for basic lists perf(pmt): chunked-leaf packing for basic lists and container_struct May 5, 2026
GrapeBaBa added 7 commits May 6, 2026 17:00
Pool now holds two allocators routed by allocation kind:
- `page_allocator` (default `std.heap.page_allocator`): MAL columns +
  chunked_leaf 32 KB blobs + scratch. Page-aligned, infrequent.
- `allocator` (default `std.heap.c_allocator`): per-node small heap blobs
  (ContainerStructRef ~32 B, WrappedT ~150 B). Page-per-alloc on these
  costs ~70 GB virtual address space at 2.18 M validators on macOS arm64;
  bucket allocator packs them densely.

Fixes the binding `serializeValidators` regression at mainnet scale:
24097 ms → 469 ms (51× speedup, matching main).

`Pool.init` switched to options-struct shape with sensible production
defaults; callers can write `.init(.{})` for production or override any
combination of {page_allocator, allocator, pool_size}. `pool_size`
default sized for mainnet fulu BeaconState (~10 M nodes; touches only
the `state` column at init = ~10 MB physical, rest lazily committed).

Bench harness adopts the binding pattern: `DebugAllocator` in Debug,
`c_allocator` otherwise. Removes the previous `DebugAllocator`-everywhere
behavior that artificially inflated absolute numbers and made
`sync_aggregate` look 2.2× regressed.

`bindings/napi/pool.zig` collapses to a single `Pool.init(.{})` call.
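The options-struct shape described above, as a hedged sketch (field names per the commit text, `try` assumed):

```zig
// Production defaults.
var pool = try Pool.init(.{});

// Or override any combination:
var bench_pool = try Pool.init(.{
    .page_allocator = std.heap.page_allocator, // MAL columns + 32 KB chunk blobs
    .allocator = std.heap.c_allocator,         // small per-node heap blobs
    .pool_size = 50_000_000,
});
```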
Two CI failures from c120b5f:

1. `node_bench.zig:91` used `Pool.init(allocator, 50_000_000)` without a
   leading `.` (direct `Pool` import), so the sed pattern `\.Pool\.init(`
   skipped it. Fixed manually to the opts-struct form.

2. `test/spec/writer/{finality,random,sanity}.zig` embed test code as
   multi-line `writer.print` template strings. The previous sed expanded
   `Pool.init(allocator, pool_size)` → `Pool.init(.{ .allocator = ..., .pool_size = ... })`
   inside those templates, but the `{` and `}` braces inside the template
   are interpreted as `writer.print` format placeholders. Escaping them
   to `{{` and `}}` so they render as literal braces in the generated
   spec test files.
Adds two helpers to skip the 263 MB clone in `validatorsSlice`-style
read passes over a `List<StructContainerType<T>>`:

- `StructContainerType.tree.getValuePtr(node, pool) -> *const Type`
  reads the wrapped value directly from a `.container_struct` node's
  payload, no copy.
- `ListCompositeTreeView.ReadonlyIterator.nextValuePtr() -> *const T`
  uses it to hand out per-element pointers during iteration. Compile-
  time guard rejects element types that don't implement `getValuePtr`.

Migrating `EpochTransitionCache.init`'s validator loop (the largest
consumer) drops `before_process_epoch` 51.4 ms → 40.5 ms (-21%) and
`epoch_total` 86.5 ms → 75.4 ms (-13%) at mainnet fulu (2.18 M
validators).

The pointer is valid only while the node's refcount holds and no CoW
mutation replaces it, so this is restricted to transient read passes.
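A transient read pass with the new iterator might look like (a sketch; the optional-returning `nextValuePtr` shape and `view`/`commit` names are assumptions):

```zig
// Readonly iteration requires no staged writes — commit first.
try validators_view.commit();

var it = try validators_view.iteratorReadonly();
var slashed_count: usize = 0;
while (try it.nextValuePtr()) |v| {
    // v: *const Validator.Type — points into the pool payload, no clone.
    if (v.slashed) slashed_count += 1;
}
```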
…tive to ptr iter

Drops the full 263 MB validators clone in this binding hot path. The
function only reads `validator.slashed` per element so the iterator's
per-element pointer access is exactly what's needed.

Mainnet binding test (`getEffectiveBalanceIncrementsZeroInactive`):
762 ms -> 40 ms (19× faster).
…lone)

Adds a pointer-slice variant of `validatorsSlice` that hands out
`*const Validator.Type` into the pool's container_struct payloads —
17 MB pointer slice vs 263 MB value clone at mainnet (2.18 M validators,
~16x lower).

Migrates `EpochCache.init` (state load hot path), `upgrade_state_to_altair`,
and the test utility `generate_state` to the pointer-slice form. The
callers' loops barely change because Zig auto-derefs `*const T` for field
access (`v.effective_balance` works whether `v` is `T` or `*const T`).

`syncPubkeys`/`syncPubkeysParallel` and `sumTargetUnslashedBalanceIncrements`
take `[]const *const Validator` instead of `[]const Validator` so the
parallel worker fan-out keeps random-access semantics on a slice of
pointers.

Pointers stay valid only while the validators list isn't mutated. Mutation
goes through the standard `tree.set(i, &new_val)` path — copy out the
value, edit, write back; the original pointer is then invalid for the
mutated slot.
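The copy-out/edit/write-back discipline described above, sketched (names beyond `tree.set(i, &new_val)` are assumptions):

```zig
const ptr = validators[i];      // *const Validator.Type into the pool payload
var new_val = ptr.*;            // copy out
new_val.slashed = true;         // edit the copy
try validators_view.set(i, &new_val); // CoW write-back
// `ptr` is now stale for slot i; re-fetch after mutation if needed.
```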
`SlashingsCache.initFromValidators` only reads `validator.slashed` per
element — perfect zero-copy candidate. Switch its `validators` parameter
from `[]const Validator` to `[]const *const Validator`, and update
`buildFromStateIfNeeded` to call `state.validatorsPtrSlice` instead of
`getAllReadonlyValues`.

Saves the 263 MB validators clone every time `buildFromStateIfNeeded`
runs (which happens lazily on first slashing-cache lookup per state).
…Slice

`syncPubkeys` parameter changed to `[]const *const Validator` in
9aeab18; the bench harness was the last untouched caller.
@GrapeBaBa
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25d18710ff


const chunked_leaf_count = (chunk_count + ChunkedLeaf.K - 1) / ChunkedLeaf.K;
const chunked_leaf_ids = try self.state.allocator.alloc(Node.Id, chunked_leaf_count);
defer self.state.allocator.free(chunked_leaf_ids);
try self.state.root.getNodesAtDepth(self.state.pool, chunked_leaf_depth, 0, chunked_leaf_ids);

P1: Read chunked list nodes through staged child cache

In the use_chunked_leaf branch, getAllInto reads chunked-leaf IDs directly from self.state.root, so it bypasses TreeViewState.children_nodes/changed where uncommitted set/push updates are staged. This means ListBasicTreeView.getAll/getAllInto can return stale pre-mutation values until commit() is called, unlike the non-chunked path which resolves nodes via state.getChildNode and includes staged edits.


Contributor Author


fixed. @codex review again



Codex Review: Didn't find any major issues. You're on a roll.


GrapeBaBa added 2 commits May 7, 2026 17:03
Reported by codex review on PR #346.

The chunked_leaf branch of `ListBasicTreeView.getAllInto`
(`chunks.zig:165`) walked `self.state.root` directly via
`getNodesAtDepth`, bypassing `TreeViewState.children_nodes` /
`changed`. Uncommitted `set` / `push` writes are staged in
`children_nodes` until `commitNodes()` rolls them into the root,
so callers reading without an intermediate `commit()` got the
pre-mutation values.

Fix: after the bulk root walk, override each chunked_leaf id with
the staged entry from `children_nodes` if present. The non-chunked
path already does the equivalent via `populateAllNodes` +
`getChildNode`.

Adds two regression tests covering staged `set` (boundary across
two chunked_leaves) and staged `push`. Both fail without the fix.
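The override step described above, sketched (the map-like lookup on `children_nodes` is an assumption; the commit only says staged entries win):

```zig
// After the bulk root walk, staged (uncommitted) children take precedence.
for (chunked_leaf_ids, 0..) |*id, i| {
    if (self.state.children_nodes.get(@intCast(i))) |staged| {
        id.* = staged; // uncommitted set/push wins over the root snapshot
    }
}
```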
The readonly iterator walks `state.root` directly via DepthIterator and
bypasses `state.children_nodes`, so uncommitted `set`/`push` would be
invisible. TS's read-only iteration paths (`getAllReadonly`/`forEach`)
respect pending writes via `getReadonly(i)`; this Zig fast-path takes
the opposite trade-off (perf over implicit pending visibility).

Adds a `std.debug.assert(state.changed.count() == 0)` at both
`iteratorReadonly` entry points (list_basic, list_composite) so
violations panic in Debug/ReleaseSafe; ReleaseFast still pays nothing.
Doc comment on the public method makes the contract explicit.

All existing callers already commit before iterating; full module test
suite passes unchanged.
@GrapeBaBa
Contributor Author

gemini review
