perf(pmt): chunked-leaf packing for basic lists and container_struct#346
Conversation
- Node is now union(enum) { free, zero, leaf, branch } with branch.root: ?[32]u8
encoding lazy/computed state instead of a bit-packed State enum.
- Pool storage is MultiArrayList(NodeWithMeta) splitting Node and ref_count
into separate columns (SoA-friendly for unref scans).
- ref_count moves from 28-bit packed in State to standalone u32, max_ref_count
raised from 0x1FFFFFFF to maxInt(u32) - 1.
- Free list is encoded via the .free variant carrying next_free: Id, replacing
the high-bit-of-state hack.
- Public API surface unchanged: Pool methods and Id methods keep their
signatures; StateView shim preserves predicate-based State semantics for
existing callers.
- proof.zig, View.zig, gindex.zig, ssz/* unchanged.
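A minimal sketch of the union shape these bullets describe (`left`/`right` child Ids on `branch` are an assumption; the other names come from the bullets):

```zig
// Sketch only — not the exact source. Id is the Pool's 4-byte handle.
const Id = enum(u32) { _ };

const Node = union(enum) {
    /// Free-list link carried in the variant itself, replacing the
    /// high-bit-of-state hack.
    free: struct { next_free: Id },
    zero,
    leaf: [32]u8,
    /// root == null encodes "lazy" (hash not yet computed).
    branch: struct { left: Id, right: Id, root: ?[32]u8 },
};
```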
… tests
Code review feedback on commit 940373f:
- computeRoot: drop the 32 KB buf_a stack copy; read first-reduction pairs directly from slab.chunks. Saves 32 KB stack frame + 32 KB memcpy per call.
- Storage: add doc comments on chunks (zero-tail invariant), len (semantic), and dirty (cache-coherence role).
- Move slab tests to slab_test.zig, matching the project's per-module test file convention; node_test.zig stays focused on Node tests.
len and dirty belong inline in the upcoming Node.slab union variant, not in heap Storage — keeping them in both places would force every CoW to sync two copies. Storage is now chunks-only (32 KB heap alloc per slab); per-slab metadata lives Node-side from B2 onward. Test for the removed dirty bitset assertion is dropped; remaining tests adjust their no-op .len assignment.
- Node now has a fifth variant: slab { chunks: [*]align(64) [32]u8, len: u16,
dirty: StaticBitSet(K), root: ?[32]u8 }.
- chunks is a many-pointer to a heap-allocated Slab.Storage; len/dirty/
root live inline in the union variant per the canonical design (Y is
the layout we'll bench against in B6).
- getRoot recovers *Slab.Storage via @ptrCast(@aligncast(chunks)) and
caches the merkleized root in node.slab.root.
- noChild treats slab as terminal (no Id-children).
- StateView gains isSlab predicate.
- Pool.unref does NOT yet free slab heap storage — that lands with
Pool.createSlab in B3 (TODO marker placed).
- Pool.createSlab(chunks, len) heap-allocates Slab.Storage, copies the caller's chunks in, and seeds a slab Node with len/empty-dirty/null-root.
- Pool.getSlabChunks / getSlabLen expose read-only access for downstream SSZ paths (Phase C).
- Pool.unref now frees node.slab.chunks via Slab.destroy when the refcount drops to zero (was a TODO(B3) placeholder); slab lifecycle is now fully Pool-managed.
- Replace B2's white-box slot-planting test with three API-driven tests: create/get round-trip, unref-leak check, getRoot integration.
After this commit, callers should never construct Node.slab variants manually — go through Pool.createSlab.
…riant
- getSlabChunks / getSlabLen migrate from Pool methods to Id methods, matching Id.getLeft / Id.getRight conventions.
- Asserts replaced with Error.InvalidNode returns: ReleaseFast strips std.debug.assert, exposing UB if a non-slab Id were passed; explicit errors are safe in all build modes.
- Tests updated to the new call form: slab_id.getSlabChunks(&pool).
Pool.createSlab stays a Pool method (it allocates from the Pool's allocator); only the read-side getters move to Id.
- Both methods clone the heap Slab.Storage, apply the requested chunk writes, and plant a fresh slab Node pointing at the new Storage; the receiver slab is unchanged (CoW).
- The new slab has root: null (lazy) and dirty bits set on every changed index. The Slab.Storage zero-tail invariant is preserved because we only overwrite chunks at indices < K (asserted) and never touch chunks[len..K] during the copy.
- Both methods return Error.InvalidNode on a non-slab Id (sibling-API consistency with Id.getSlabChunks/Len).
- Tests: basic CoW invariant, len preservation, batch update, empty batch edge, error path.
Verifies that slab Ids appended to FillWithContentsIterator at the appropriate depth produce a merkleized root identical to a tree built from individual chunk leaves. Catches regressions in:
- noChild semantics (slab as terminal)
- getRoot integration of slab + branch lazy compute
- Slab.computeRoot vs std merkleize equivalence
40 PMT tests passing (was 39).
…threshold
When isBasicType(Element) and max_chunk_count >= 2 * Slab.K, FixedListType builds the SSZ tree via Pool.createSlab leaves at depth chunk_depth - k_log2 instead of pool.createLeaf at depth chunk_depth. Each slab packs K=1024 chunks worth of items contiguously; trailing chunks within the partial last slab remain zero-bytes (Storage zero-tail invariant preserved). Affected fields when used in BeaconState (limit-driven): balances, randao_mixes, slashings, participation, inactivity_scores, etc.
- Re-export Slab from the persistent_merkle_tree module.
- comptime threshold: max_chunk_count >= 2 * Slab.K (== 2048 chunks); see the gate sketch below. Smaller lists keep the existing per-chunk leaf path (regression-tested).
- The slab path manually inlines an iterator-style merging loop because FillWithContentsIterator's zero filler hardcodes @enumFromInt(level), which gives the depth-level zero hash. With slab leaves each at depth k_log2, the correct filler is at depth (level + k_log2). C1.5 will factor this out by adding a leaf_offset parameter to the iterator.
- Tests: large packed u64 list (item_count = 2*K*4 = 8192) verifies root vs std hashTreeRoot AND slab nodes present at boundary depth. Small list (limit=1024) confirms the threshold gate keeps the leaf path.
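A hedged sketch of the comptime gate described above, with the constants and names taken from the text (the surrounding type plumbing is elided):

```zig
const Slab = struct {
    pub const k_log2: u6 = 10;
    pub const K: usize = 1 << k_log2; // 1024 chunks per slab
};

/// Comptime gate from the text: slab leaves only when the element is basic
/// and the list spans at least 2*K (2048) chunks; the slab leaf depth sits
/// k_log2 levels above chunk depth.
fn slabLeafDepth(comptime is_basic: bool, comptime max_chunk_count: usize, comptime chunk_depth: u6) u6 {
    const use_slab = is_basic and max_chunk_count >= 2 * Slab.K;
    return if (use_slab) chunk_depth - Slab.k_log2 else chunk_depth;
}
```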
…ove 2*K threshold" This reverts commit b4484bb.
When each appended Id represents a non-leaf subtree (e.g., a Slab Id is a depth-k_log2 subtree of K chunks), the zero filler emitted by finish() for missing right siblings at iterator level L must be at absolute depth L + leaf_offset, not L.
- initWithOffset(pool, depth, leaf_offset) is the new explicit constructor.
- init(pool, depth) preserves the legacy depth-0-leaves contract by delegating to initWithOffset(..., 0).
- append() unchanged (doesn't depend on leaf_offset).
- finish() uses (level + leaf_offset) wherever it previously used (level).
Tests: full-tree slab build matches the per-leaf reference; partial-fill slab build matches the per-leaf reference with 1/4 missing slabs zero-padded. This validates that finish()'s zero-filler emits zeros at the correct absolute depth, not iterator-level depth.
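A sketch of the constructor delegation and the filler-depth rule, written as a file-as-struct fragment (the Pool parameter and node plumbing are elided; only the shape follows the text):

```zig
const Self = @This();

depth: u6,
leaf_offset: u6,

/// Legacy contract: appended Ids are depth-0 leaves.
pub fn init(depth: u6) Self {
    return initWithOffset(depth, 0);
}

/// New explicit constructor: each appended Id is the root of a
/// depth-`leaf_offset` subtree (e.g. a Slab Id with leaf_offset = k_log2).
pub fn initWithOffset(depth: u6, leaf_offset: u6) Self {
    return .{ .depth = depth, .leaf_offset = leaf_offset };
}

/// finish() fills a missing right sibling at iterator level `level` with the
/// zero hash at this absolute depth — (level + leaf_offset), not `level`.
fn fillerDepth(self: Self, level: u6) u6 {
    return level + self.leaf_offset;
}
```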
- Adds toValuePackedFromBytes / fromValuePackedIntoChunk to the UintType and Bool element-tree APIs. These let slab-backed containers decode/encode packed items directly from chunk bytes, with no Node.Id round-trip.
- BasicPackedChunks gains a 4th comptime parameter use_slab. When true, get / set / getAllInto navigate to slab-boundary depth (chunk_depth - Slab.k_log2) and read/write chunks via Id.getSlabChunks / setSlabChunk rather than per-chunk Ids. populateAllNodes is a no-op in slab mode.
- Existing callers (list_basic.zig, array_basic.zig) pass use_slab=false to preserve current semantics; no behavior change.
- Re-exports Slab as persistent_merkle_tree.Slab so chunks.zig can reference Slab.K / Slab.k_log2 from its slab-mode comptime branch.
This lays groundwork for FixedListType/FixedVectorType opts.slab in step 3.
Adds TypeOpts { slab: bool = false } parameter to FixedListType and
FixedVectorType. When opts.slab=true and isBasicType(Element), the
type's tree paths build / walk slab leaves instead of per-chunk leaves.
- tree.fromValue / deserializeFromBytes use FillWithContentsIterator.
initWithOffset(slab_depth, k_log2). tree.toValue / serializeIntoBytes
walk slab Ids at slab_depth and read chunks via Id.getSlabChunks.
- TreeView (list_basic / array_basic) propagates ST.opts.slab to
BasicPackedChunks(.., use_slab=ST.opts.slab).
- list_basic features that traverse at chunk_depth (iteratorReadonly,
sliceTo) compile-error when ST.opts.slab=true. Use get/set/getAll/
serialize/deserialize/hashTreeRoot for slab-enabled types.
- All existing FixedListType/FixedVectorType callers updated to pass
.{} (default slab=false). Mechanical signature update across
consensus_types/{phase0,altair,bellatrix,capella,deneb,electra,fulu},
ssz/{type,tree_view}/*, test/{fuzz,spec}/*.
- Round-trip tests for slab-enabled list and vector verify hash equality
against the leaf-path reference.
Verification:
- PMT 42/42 ✓
- SSZ 198/198 ✓ (was 194 + 4 new slab round-trip tests)
- state_transition 85/85 ✓ (no regression — opt-in is comptime, no
existing type opts in to slab, leaf path unchanged)
State_transition green confirms the opt-in plumbing has zero leak into
the leaf path. Existing BeaconState fields (balances, randao_mixes,
slashings, etc.) keep their leaf-per-chunk behavior. Per-field slab
opt-in (and bench-driven validation) is future work.
Adds bench/ssz/list_slab.zig comparing the leaf-default and opts.slab=true instantiations of FixedListType(UintType(64), 1<<20) on representative balances-scale workloads. Five scenarios:
- fromValue — build tree from a populated value
- fromValue+getRoot — build + compute root (cold lazy hashes)
- toValue — decode all items back from the tree
- bulkSet+getRoot — full-rewrite shape (epoch-rewards-like)
Run with `zig build run:bench_list_slab -Doptimize=ReleaseFast`.
Apple silicon ReleaseFast results (1M u64 items):
workload              leaf      slab     speedup
fromValue 1M          15.4 ms   1.8 ms   8.5×
fromValue+getRoot 1M  39.5 ms   17.6 ms  2.2×
toValue 1M            20.2 ms   1.1 ms   18×
bulkSet+getRoot 1M    40.4 ms   17.5 ms  2.3×
Reading: the slab path eliminates 250K leaf-node allocations (8.5× on fromValue), provides contiguous chunk reads for bulk decode (18× on toValue), and benefits from the fixed-shape K-leaf merkleization specialization on root computation (~2× on getRoot-bound workloads). The bulkSet+getRoot scenario is shaped after the balances epoch-rewards writeback — full overwrite + recompute root — and shows ~2.3× end-to-end gain. Sufficient signal to consider opting balances and similar 1M-scale packed primitive lists into opts.slab=true in a follow-up change.
Switches phase0.Balances from FixedListType(Gwei, VALIDATOR_REGISTRY_LIMIT, .{})
to .{ .slab = true }, taking the chunked-leaf path for the most-touched
packed primitive list in BeaconState. Also drops the inline duplicate of
Balances in fulu.BeaconState — it now aliases phase0.Balances correctly,
which was already the case in altair/bellatrix/capella/deneb/electra.
Bench: zig build run:bench_process_epoch -Doptimize=ReleaseFast
(mainnet state at slot 13336576, fulu fork, 2.18M validators)
Per-step deltas (segmented breakdown, time/run):
step leaf slab speedup
rewards_and_penalties 267.8 ms 37.6 ms 7.1×
effective_balance_updates 120.5 ms 9.5 ms 12.7×
inactivity_updates 109.0 ms 105.6 ms flat
participation_flags 0.74 ms 0.73 ms flat
proposer_lookahead 16.6 ms 16.4 ms flat
End-to-end epoch (non-segmented):
leaf: 5.929 s
slab: 5.504 s ( -7.2% )
Total epoch time gain is dominated by `before_process_epoch` (cache
rebuild ~5.3 s) which slab doesn't touch. The targeted balance-bound
operations get 7-13× speedups; this commit unlocks that for any caller
that iterates rewards / effective-balance updates against `state.balances`.
Verification:
- state_transition: 85/85 ✓
- PMT 42/42 ✓ + SSZ 198/198 ✓ (unchanged from prior commits)
Other slab candidates (inactivity_scores, participation, randao_mixes,
slashings) are deliberately left at .{} — opt them in only after their
specific consumer paths are verified slab-safe and bench shows wins.
… opts.slab=true
Switches altair.InactivityScores and altair.EpochParticipation (used by both previous_epoch_participation and current_epoch_participation) to opts.slab=true. Same idea as phase0.Balances in 304386c. Also drops fulu.BeaconState's inline duplicates of these list types in favor of the altair aliases — bellatrix/capella/deneb/electra already referenced altair correctly; fulu was the only outlier.
Bench: zig build run:bench_process_epoch -Doptimize=ReleaseFast
(mainnet state at slot 13336576, fulu fork, 2.18M validators)
End-to-end deltas (time/run, non-segmented epoch):
config                                        epoch
leaf baseline (no slab)                       5.929 s
+ Balances slab (304386c)                     5.504 s
+ InactivityScores + EpochParticipation slab  0.557 s ← this commit
10.6× total speedup over the leaf baseline.
Per-step segmented breakdown (this commit vs leaf baseline):
step                       leaf      all-slab  speedup
before_process_epoch       5.30 s    0.555 s   9.6×
inactivity_updates         109.0 ms  5.82 ms   18.8×
rewards_and_penalties      267.8 ms  18.1 ms   14.9×
effective_balance_updates  120.5 ms  6.65 ms   18.1×
participation_flags        0.74 ms   0.001 ms  ~700×
proposer_lookahead         16.6 ms   15.6 ms   flat
The 9.6× speedup of before_process_epoch is the second-order effect: cache rebuild calls getAll() on the participation lists, which used to walk the full leaf-per-chunk tree path. Slab-enabling those lists collapses that work to contiguous chunk reads — a win that single-field opt-in (balances only) couldn't unlock.
Verification:
- state_transition: 85/85 ✓
- PMT 42/42 + SSZ 198/198 unchanged
Remaining slab candidates: randao_mixes (FixedVector(Bytes32)) and slashings (FixedVector(Gwei)). Both are vectors, not lists; randao_mixes also has a Bytes32 element type which may or may not satisfy isBasicType — defer to a follow-up that explicitly verifies before opt-in.
When a slab-opted-in list/vector is initially empty (e.g. genesis state before any participation flags are set) or sparsely filled, the tree's content shape is the early-return zero-leaf form (chunk_count==0 → createBranch(@enumFromInt(chunk_depth), @enumFromInt(0))). Navigating this tree at slab_depth lands on a zero sentinel — not a slab — and the slab-path consumers blow up with Error.InvalidNode in getSlabChunks. A zero subtree at the slab boundary is semantically an all-zero slab. Fix the slab consumers to handle this:
- BasicPackedChunks.get: return std.mem.zeroes(Element).
- BasicPackedChunks.set: materialize a fresh zero-filled slab before applying the write; the intermediate slab is unrefed after setSlabChunk.
- BasicPackedChunks.getAllInto: @memset the output slice for the slab's worth of items.
- list.zig tree.toValue (slab branch): out.items already initialised to default_value; skip the slab payload read.
- list.zig tree.serializeIntoBytes (slab branch): @memset the output.
- vector.zig tree.toValue / serializeIntoBytes (slab branches): same shape as list.
Repro: minimal-preset altair finality_rule_4 — the pre-state's previous_epoch_participation has length 0 at genesis, so tree.fromValue takes the early-return path. processBlock then attempts to push attestation flags via epoch_participation.set; navigation lands on the zero sentinel (zero hash at depth k_log2=10) instead of a slab Node, causing getSlabChunks to fail.
Verification (post-fix):
- PMT 42/42 ✓ + SSZ 198/198 ✓ + state_transition 85/85 ✓
- spec_tests minimal (all 9 shards): operations 1731/1828 (97 skipped), epoch_processing 727/727, sanity 557/557, rewards 380/380, transition 267/267, random 561/561, fork 314/314, finality 48/48, merkle_proof 19/19 — total 4604/4701 pass, 0 failed (97 skipped due to missing fork test data).
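A hedged sketch of the get-side fix: a `.zero` node at slab depth reads as an all-zero slab (`std.mem.zeroes` is real; `isZero`/`getSlabChunks` follow the text; `decodePacked` is a hypothetical helper):

```zig
/// Sketch of the zero-sentinel read path described above.
fn getFromSlabBoundary(comptime Element: type, pool: *Pool, node_id: Node.Id, index: usize) !Element {
    if (node_id.getState(pool).isZero()) {
        // Zero subtree at the slab boundary == all-zero slab: nothing was written.
        return std.mem.zeroes(Element);
    }
    const chunks = try node_id.getSlabChunks(pool);
    return decodePacked(Element, chunks, index); // hypothetical packed decode
}
```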
When BasicPackedChunks.set is called on a slab Id that no parent Branch references yet (rc==0), the slab is exclusively owned by the TreeView's children_nodes cache. Mutating the heap chunks in place is safe and avoids the 32 KB CoW clone per write. The first write per slab still hits the CoW path (rc>=1, owned by the persistent tree), publishing a fresh transient slab to the cache. Every subsequent write to the same slab finds rc==0 and writes a single byte directly into the heap chunks, accumulating dirty bits across writes and invalidating the cached root. For attestation processing in altair+ blocks (~committee_size validators × many attestations per block, all hitting the same participation slab), this collapses N × 32 KB writes into 1 × 32 KB clone + (N-1) × byte writes.
Bench impact (process_block ReleaseSafe, mainnet state slot 13336576):
step                  leaf     slab CoW  slab+in-place
operations_no_sig     9.1 ms   210 ms    1.6 ms (5.7× vs leaf)
process_block_no_sig  15.5 ms  214 ms    4.8 ms (3.2× vs leaf)
operations            39.5 ms  234 ms    27.2 ms (1.4× vs leaf)
process_block         42 ms    240 ms    34 ms (1.2× vs leaf)
Process_epoch (ReleaseFast) per-step deltas remain or improve slightly:
step                       slab CoW  slab+in-place
inactivity_updates         6.3 ms    5.4 ms
rewards_and_penalties      19 ms     19 ms
effective_balance_updates  7.7 ms    6.7 ms
End-to-end: the process_block 14× regression (slab CoW vs leaf) flips to a 3.2× speedup; process_epoch keeps its order-of-magnitude advantage on the bulk-getAll cache-rebuild path.
Verification:
- PMT 42/42 ✓ + SSZ 198/198 ✓ + state_transition 85/85 ✓
Correctness sketch: rc==0 on a slab Node Id means no parent Branch has ref'd it yet. After Pool.createSlab or Id.setSlabChunk, the returned Id has rc==0 until Pool.rebind during commit raises it to 1. While in the TreeView's transient state (children_nodes cache, gindex in `changed`), the slab is observable only via the current TreeView, so its heap chunks can be mutated without violating shared-tree invariants.
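A sketch of the branch between the two write paths, with hypothetical helper names for the parts the text doesn't spell out (`refCount`, `getSlabStorageMut`, `invalidateRoot`; `setSlabChunk` is from the text):

```zig
fn writeChunkByte(pool: *Pool, slab_id: Node.Id, chunk_idx: usize, byte_idx: usize, b: u8) !Node.Id {
    if (pool.refCount(slab_id) == 0) {
        // Exclusively owned by this TreeView's children_nodes cache:
        // mutate the heap chunks in place, no 32 KB clone.
        const storage = slab_id.getSlabStorageMut(pool);
        storage.chunks[chunk_idx][byte_idx] = b;
        slab_id.invalidateRoot(pool); // root -> lazy, recomputed on next getRoot
        return slab_id;
    }
    // rc >= 1: shared with the persistent tree. CoW clone publishes a fresh
    // transient slab (rc == 0) that subsequent writes hit with the fast path.
    var chunk = (try slab_id.getSlabChunks(pool))[chunk_idx];
    chunk[byte_idx] = b;
    return slab_id.setSlabChunk(pool, chunk_idx, chunk);
}
```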
Replaces the slab Node variant's many-pointer + len encoding with a
single `*Slab.Storage` pointer (Lighthouse milhouse's Arc<PackedLeaf>
shape adapted to our Pool-managed ref counting). Storage is now self-
contained: chunks + len in one heap allocation, owned uniquely by the
slab Node Id whose ref count tracks parent Branch references.
Variant before:
slab: { chunks: [*]align(64) [32]u8, len: u16, root: ?[32]u8 }
Variant after:
slab: { storage: *Slab.Storage, root: ?[32]u8 }
Side effects:
- Removes every `@ptrCast(@aligncast(node.slab.chunks))` reconstruction
in getRoot / unref / setSlabChunk / setSlabChunks. Each callsite now
uses `node.slab.storage` directly.
- Slab.Storage gains a `len: u16` field (was inline in the Node variant).
allocZero initialises len = 0; createSlab / setSlabChunk* propagate.
- Drops the unused `dirty: StaticBitSet(K)` field that was inflating the
slab variant by 128 B for no read-side benefit (root invalidation via
`root = null` already drives lazy recompute, and computeRoot always
merkleizes the full K-leaf subtree).
- Slab variant size drops from ~171 B (with dirty) to ~48 B, on par with
the branch variant (41 B). The Node union sizes itself to whichever
variant is largest.
Verification:
- PMT 42/42 ✓ + SSZ 198/198 ✓ + state_transition 85/85 ✓
- process_block ReleaseSafe still produces correct slab+in-place numbers
(operations_no_sig 1.6 ms vs leaf 9.1 ms — 5.7×).
Note: bench_process_block still SEGVs in ReleaseFast / ReleaseSmall on
this branch, independent of this refactor (ReleaseSafe's runtime tag
checks mask the UB). Suspected Zig 0.16 codegen issue on tagged union
dispatch in deep recursion paths through getRoot. Filed as a follow-up
to investigate / report upstream; doesn't block correctness or the
ReleaseSafe perf story.
…Depth
`unfinalized_parents_buf` was declared as `undefined` and never initialized.
On iteration 0, the post-rebind cleanup loop reads garbage Optional bytes
and may call `pool.unref` on an arbitrary Id — recursively decrementing
its children's ref counts. When this hits a slab Id still referenced by
a parent branch, the slab is destroyed prematurely (use-after-free).
Manifested as SEGV in `bench_process_block -Doptimize=ReleaseFast` deep
inside `Slab.computeRoot`, with a bogus storage pointer pattern matching
two adjacent u32 values from a freed slot's `next_free` field. ReleaseSafe
hid the bug because its `.free => unreachable` runtime check panics
explicitly; ReleaseFast UB-eliminates the unreachable arm and silently
misroutes `.free` slots into the `.slab` arm.
Fix:
- Initialize `unfinalized_parents_buf` to all-null so iter 0 sees no
stale entries.
- Replace `.free => unreachable` in `Id.getRoot` with `@panic` so any
future use-after-free surfaces as a clean panic rather than UB-driven
SEGV under ReleaseFast.
Verified:
- test:persistent_merkle_tree 42/42
- test:state_transition 85/85, 0 leaks (was 236)
- test:ssz 198/198
- test:spec_tests fork mainnet 276/38 skip (matches pre-fix baseline)
- bench process_block ReleaseFast: operations_no_sig 1.7ms, process_block
33ms (matches ReleaseSafe baseline; previously SEGV)
The branch and slab variants stored root as `?[32]u8` so that lazy
(uncomputed) state was represented by `null`. The Optional encoding adds
a 1-byte tag and forces every read site to unwrap via `.?`.
Replace with a plain `[32]u8` field plus a global sentinel constant
`lazy_sentinel = [_]u8{0xFF} ** 32`. SHA-256 outputs are cryptographically
unlikely to equal this value (~1 in 2^256). The unwrap-`.?` pattern is
gone and the code is simpler.
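For illustration, the check the read sites now reduce to (the constant is from the text; `isLazy` is a hypothetical helper name):

```zig
const std = @import("std");

/// Sentinel from the text: a SHA-256 output equals this with probability
/// ~2^-256, so it can safely mean "root not yet computed".
const lazy_sentinel = [_]u8{0xFF} ** 32;

fn isLazy(root: *const [32]u8) bool {
    return std.mem.eql(u8, root, &lazy_sentinel);
}
```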
Note: this does NOT shrink `@sizeOf(Node)`. The slab variant's
`*Slab.Storage` pointer forces 8-byte union alignment, which already
pads the union to 48 bytes regardless of the Optional. So this is a
correctness/clarity refactor, not a perf change. The follow-up SoA-split
refactor (decomposing the union into top-level NodeWithMeta fields) is
where the 6× per-node-visit cache footprint reduction will come from.
Verified:
- test:persistent_merkle_tree 42/42
- test:state_transition 85/85, 0 leaks
- bench process_epoch ReleaseFast: epoch_total 328ms (vs 324ms baseline,
within noise)
The tagged-union NodeWithMeta forced every navigation visit to load 48 B (cache density 1.3 nodes/cache-line). A profile (sample, 30s, ReleaseSafe) identified `Node.Id.getNodesAtDepth` as the dominant hotspot (+42% / +11.6 percentage points share) of the regression vs the pre-refactor SoA u32-column layout.
Replace `union(enum) Node` with five top-level fields on `NodeWithMeta` (left, right, root, cache, kind) plus ref_count, so MultiArrayList yields six dense columns. Hot navigation (`getNodesAtDepth`, `setNodes`, `setNodesAtDepth`, `getNode`, `setNode`, `truncateAfterIndex`, `fillToLength`, `DepthIterator.next`) now reads only `kind` (1 B) and one child Id (4 B) per visit — beating the pre-fecab80f layout's footprint and recovering the 2.8x regression plus extra headroom.
The `cache: ?*anyopaque` column holds `*Slab.Storage` for slab nodes; future variants (validators struct cache, sync committee cache, etc.) reuse the same column via a type-erased pointer cast — no schema changes needed. The free-list link reuses the `left` column (free slots have no other use for it), keeping the column count at 6.
Bench (process_epoch ReleaseFast, fulu, 2.18M validators):
before_process_epoch: 282ms -> 169ms (-40%)
epoch_total: 328ms -> 215ms (-35%)
vs upstream main cd45847 (leaf):
epoch_total: 394ms -> 215ms (-46%, 1.84x faster)
before_process_epoch: 190ms -> 169ms (-11%, faster than leaf)
inactivity_updates: 54.5ms -> 5.3ms (-90%)
rewards_and_penalties: 74.6ms -> 17.9ms (-76%)
effective_balance_updates: 58.9ms -> 6.5ms (-89%)
process_block ReleaseFast: equivalent to upstream main within noise (~33ms standalone, ~214ms segmented). No path regression.
All invariants preserved:
- unfinalized_parents_buf null-init in setNodes/setNodesAtDepth (UAF fix from ca5ac32)
- .free => @panic in getRoot (defense vs ReleaseFast UB elimination)
- lazy_sentinel for branch/slab uncomputed roots (from 17ded4a)
Tests: 42/42 PMT, 198/198 SSZ, 85/85 state_transition (0 leaks), 276/38skip fork shard mainnet (matches baseline).
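A sketch of the six-column layout this describes (field names from the text; exact types are inferred, not copied; `std.MultiArrayList` stores each field as its own dense array):

```zig
const std = @import("std");

const Id = enum(u32) { _ };
const NodeKind = enum(u8) { free, zero, leaf, branch, slab };

const NodeWithMeta = struct {
    left: Id, // child Id; free slots reuse this column as the free-list link
    right: Id,
    root: [32]u8, // hash, or lazy_sentinel when not yet computed
    cache: ?*anyopaque, // *Slab.Storage for slab nodes; type-erased for future kinds
    kind: NodeKind, // 1 B — the only column a navigation visit must always read
    ref_count: u32,
};

// Each field becomes a separate dense column, so a hot navigation visit
// touches kind (1 B) + one child Id (4 B), not the whole 48 B record.
const Columns = std.MultiArrayList(NodeWithMeta);
```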
Introduces a new persistent_merkle_tree variant `branch_struct` whose
payload is a heap-allocated `BranchStructRef` (vtable + cached
deserialized struct pointer). Storage piggy-backs on the existing SoA
`cache` column (no pointer-packing) and root caching uses the existing
`lazy_sentinel` mechanism — same one-arm dispatch shape as `.slab`.
A new SSZ type `StructContainerType(ST)` and view `StructContainerTreeView(ST)`
expose this layer:
- tree.fromValue / toValue / serialize round-trip via the branch_struct
slot (one Pool node per container, regardless of field count)
- field reads/writes are O(1) struct accesses; commit lazily recomputes
the root only when needed
- get_root is dispatched through the BranchStructRef vtable so the Pool
stays type-erased
Switches phase0.Validator from FixedContainerType → StructContainerType.
`validatorsSlice` now reads directly from the cached struct array instead
of walking 2M × 8-leaf merkle subtrees.
Bench (process_epoch ReleaseFast, fulu, 2.18M validators):
before_process_epoch: 169 ms -> 68 ms (-60%)
epoch_total: 215 ms -> 110 ms (-49%)
vs upstream main cd45847 (no slab, no struct cache):
epoch_total: 394 ms -> 110 ms (-72%, 3.6x faster)
before_process_epoch: 190 ms -> 68 ms (-64%, 2.8x faster)
inactivity_updates: 54.5 ms -> 5.1 ms (10.7x)
rewards_and_penalties: 74.6 ms -> 16.2 ms (4.6x)
effective_balance_updates: 58.9 ms -> 6.0 ms (9.8x)
Bench (process_block ReleaseFast):
block_total (segmented): 214 ms -> 111 ms (-48%)
process_block standalone: 33.5 ms -> 31.6 ms (~6% faster within noise)
Tests: 42/42 PMT, 199/199 SSZ (+1 StructContainer test), 85/85
state_transition (0 leaks), 276/38skip fork shard mainnet (matches baseline).
Design notes (vs PR #232 reference):
- Stores BranchStructRef* in the SoA `cache` column rather than packing
into the Id columns — cleaner, no bit-packing.
- Reuses lazy_sentinel for root caching (one-arm dispatch like slab) vs
PR 232's two-state branch_struct_lazy/computed enum split.
- Omits `to_tree` materialization vtable slot (no current consumers
need single-leaf proofs through Validator subfields; trivial to add
later).
…d-root leak
Two related leaks identified by Codex review of the chunked-leaf branch:
1. `Pool.deinit` only freed the MultiArrayList columns. Slots with `.slab` or `.branch_struct` kind own a heap-allocated cache pointer (`*Slab.Storage` or `*BranchStructRef`); when callers tear down the pool without first unref'ing every root, those payloads were leaked. Walk all slots and free the cache payload for any kind that owns one before destroying the column buffers.
2. `StructContainerTreeView.getFieldRoot` allocated a fresh PMT slot via `ChildST.tree.fromValue(pool, &field_value)` just to read the cached root, but never unref'ed it. Each call leaked one Pool slot for the pool's lifetime — composite child types would also leak their heap payloads. Add a per-view `field_root_cache: [chunk_count][32]u8` stable backing store, copy the hash into it, and unref the temporary slot immediately so the returned `*const [32]u8` stays valid without leaking.
Verified: PMT 42/42, SSZ 199/199, state_transition 85/85 (0 leaks).
Opaque PMT nodes (.branch_struct from StructContainerType, .slab from
chunked-leaf packing) terminate `getLeft`/`getRight` traversal because
their children are not stored as separate Ids. This blocked single- and
compact-multi proofs through any path that crosses validators[i].field
or balances[i] / inactivity_scores[i] / participation[i].
Add `Pool.materializeBranchStruct` and `Pool.materializeSlab`, both of
which lazily build a temporary plain PMT subtree from the cached payload
that proof traversal can navigate. Extend the BranchStructRef vtable
with `to_tree`, implemented for StructContainerType.WrappedT via
`FixedCT.tree.fromValue`. proof.zig gains an `isOpaqueNode` /
`materializeOpaque` pair plus a deferred-unref ArrayList for the
compact-multi recursive walk; createSingleProof keeps a single optional
materialized root and rejects nested opaques with `error.NestedOpaque`.
Verified by two new regression tests (`single proof: validators[0].
withdrawal_credentials` and `single proof: balances[0]`) in
src/fork_types/any_beacon_state.zig — both fail before this commit
with `Error.InvalidNode` from `getRight` on the opaque node and pass
after.
Why spec tests didn't catch this:
- EF `merkle_proof` shard only covers
`BeaconBlockBody.blob_kzg_commitment_merkle_proof` (4 cases)
- EF `light_client` shard has BeaconState proofs (sync_committee,
finality_root) but lodestar-z has no `light_client` runner yet
(separate follow-up). Even with a runner, those paths don't descend
into list elements (`validators[i]`/`balances[i]`), so they wouldn't
hit our opaque-node bug.
- The included tests construct the exact `validators.0.X` and
`balances.0` paths that branch_struct and slab respectively block.
Tests: PMT 42/42, SSZ 199/199, fork_types 9/9 (incl. 2 new), state_transition
85/85 (0 leaks), spec_tests fork mainnet 276/38skip (matches baseline).
Bench process_epoch ReleaseFast: 121ms / 77ms (within current variance,
no regression — opaque materialization only fires inside proof
generation, never on epoch hot paths).
Replace `pub const NodeWithMeta = struct {...}` with file-level fields
and `const Node = @This();`. `MultiArrayList(NodeWithMeta)` becomes
`MultiArrayList(Node)`. Pure syntactic refactor — no behavior change,
no perf change. Aligns with the main branch's idiom and removes the awkward
`NodeWithMeta` vs `Node` naming distinction.
Replace the separate `kind: NodeKind` (u8) and `ref_count: u32` columns with a single packed `state: State` u32 column. Layout matches main:
- bit 31 set → free slot; bits 0..30 = next-free Id
- bit 31 clear → in-use; bits 28..30 = kind, bits 0..27 = ref_count
The free-list link migrates from the `left` column into `state` itself (matching the legacy union encoding). A branch-free `kind()` decoder avoids a conditional on every navigation visit. Saves 1 byte/node and consolidates the ref/unref hot path: a single column read replaces touching both kind and ref_count. External callers (chunks.zig, list.zig, vector.zig, proof.zig) updated to read `state[idx].kind()` / `state[idx].refCount()` accordingly.
Bench impact: epoch -5%, block -5%, sync_aggregate -30% vs baseline (plus the 1 B/node memory savings).
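A sketch of the packed layout and branch-free decoders (masks derived from the bit layout in the bullets above; method names follow the text):

```zig
/// bit 31 set   -> free slot, bits 0..30 = next-free Id
/// bit 31 clear -> in-use, bits 28..30 = kind, bits 0..27 = ref_count
const State = enum(u32) {
    _,

    const free_bit: u32 = 1 << 31;

    fn isFree(s: State) bool {
        return @intFromEnum(s) & free_bit != 0;
    }
    /// Branch-free decode: no conditional on the navigation hot path.
    fn kind(s: State) u3 {
        return @truncate(@intFromEnum(s) >> 28);
    }
    fn refCount(s: State) u32 {
        return @intFromEnum(s) & 0x0FFF_FFFF;
    }
    fn nextFree(s: State) u32 {
        return @intFromEnum(s) & 0x7FFF_FFFF;
    }
};
```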
Collapse three columns (`left: Id`, `right: Id`, `cache: ?*anyopaque`) into a single `payload: u64` overloaded by kind:
- branch: low 32 = left Id, high 32 = right Id
- slab: full 64-bit `*Slab.Storage` pointer
- branch_struct: full 64-bit `*BranchStructRef` pointer
- free: low 32 = next-free Id (high 32 unused)
Final SoA layout: 3 columns total (payload + root + state), down from 6. Memory savings vs baseline: 9 B/node total (1 B from the State pack, 8 B here from cache + one of left/right).
New helpers `packChildren`/`unpackLeft`/`unpackRight`/`payloadAsPtr`/`ptrAsPayload` keep the encode/decode local. `slabStorage` and `branchStructRef` now read from the payload column. The external slab CoW path in `chunks.zig` uses the new public `Id.getSlabStorageMut` helper instead of poking the column directly.
Bench impact: roughly perf-neutral vs the State-only commit on process_epoch / process_block; significant wins on slab-heavy paths (set_nodes_randomly -28%). Validator iteration (branch_struct heavy) holds steady because the pointer fits in a single u64 read.
Note: the payload-merged design assumes 64-bit usize. lodestar-z (and all ETH consensus clients) target only 64-bit platforms.
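A sketch of the helpers named above, under the stated 64-bit-usize assumption:

```zig
const Id = enum(u32) { _ };

fn packChildren(left: Id, right: Id) u64 {
    return @as(u64, @intFromEnum(left)) | (@as(u64, @intFromEnum(right)) << 32);
}

fn unpackLeft(payload: u64) Id {
    return @enumFromInt(@as(u32, @truncate(payload)));
}

fn unpackRight(payload: u64) Id {
    return @enumFromInt(@as(u32, @truncate(payload >> 32)));
}

fn ptrAsPayload(ptr: *anyopaque) u64 {
    return @intFromPtr(ptr);
}

fn payloadAsPtr(comptime T: type, payload: u64) *T {
    return @ptrFromInt(@as(usize, @intCast(payload)));
}
```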
`Id.getState(pool)` now returns the packed `State` value (matching main's API) instead of a wrapper struct. The State enum already exposes all predicates (`isFree`, `kind`, `refCount`, `nextFree`); the StateView indirection became redundant after C2 re-introduced packed State. Add `isZero/isLeaf/isBranch/isSlab/isBranchStruct` to State for parity with main. `isBranchLazy` / `isBranchComputed` stay on Id because they read both state and root columns. Net -24 lines: removes a struct that just forwarded calls to State.
The "slab" name was borrowed from kernel slab-allocator terminology and conflicted with the unrelated `state` field on TreeViewState (`self.state` in chunks.zig). Rename to `ChunkedLeaf` (matching the existing branch and design-doc terminology) so the data structure self-documents: a single Node holding K=1024 packed leaf chunks instead of K individual leaf Nodes. Mechanical rename across ~290 sites: - type Slab → ChunkedLeaf - NodeKind.slab → .chunked_leaf - TypeOpts.slab → .chunked_leaf - methods: createSlab → createChunkedLeaf, setSlabChunk(s) → setChunkedLeafChunk(s), getSlabChunks/Len/StorageMut → getChunkedLeaf*, materializeSlab → materializeChunkedLeaf, isSlab → isChunkedLeaf - file renames: slab.zig → chunked_leaf.zig, slab_test.zig → chunked_leaf_test.zig, bench/ssz/list_slab.zig → list_chunked_leaf.zig - bench target: bench_list_slab → bench_list_chunked_leaf Verified perf-neutral via 5 paired alternating bench runs (paired t = 0.26, well below 95% CI threshold).
- Convert ChunkedLeaf to file-as-struct (Node.zig precedent), rename to ChunkedLeaf.zig, and merge chunked_leaf_test.zig into it. Drop the allocZero/destroy wrappers; callers use allocator.create directly (production paths immediately overwrite the chunks blob, so the explicit zero-init was dead work).
- Add Node.truncateAfterIndexWithLeafOffset so the chunked_leaf-mode sliceTo zeroes right siblings using ZeroHash[depthi + k_log2] instead of ZeroHash[depthi]; the original truncateAfterIndex now delegates with offset=0. Asserts depth + leaf_offset <= max_depth.
- ListBasicTreeView: implement iteratorReadonly and sliceTo for chunked_leaf via two-level (chunked_leaf_idx, intra_chunk) navigation; see the index-split sketch below. iteratorReadonly caches the current ChunkedLeaf chunks pointer and handles .zero-sentinel boundaries; sliceTo fast-paths .zero boundaries (no extra alloc) and only materializes a trimmed ChunkedLeaf when the cut falls inside a real one. Use getChunkedLeafChunks (const) for the read paths in sliceTo and chunks.zig CoW.
- Compile-time guard: chunked_leaf=true requires limit (or length) >= K * items_per_chunk so chunk_depth >= ChunkedLeaf.k_log2; otherwise chunked_leaf_depth underflows. Update the TypeOpts.chunked_leaf doc to reflect the new sliceTo/iteratorReadonly support.
Tests: 8 chunked_leaf tests covering happy paths + 2 defensive tests that synthesize zero-sentinel boundary trees via FillWithContentsIterator to exercise the .zero branches in next() and sliceTo. Verified against the ListTLeaf reference: every chunked_leaf sliceTo root matches the equivalent leaf-path tree.
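A sketch of the two-level index split the iterator and sliceTo paths use (`items_per_chunk` depends on the element type; the struct shape is illustrative):

```zig
const ChunkedLeaf = struct {
    pub const k_log2: u6 = 10;
    pub const K: usize = 1 << k_log2; // 1024 chunks per ChunkedLeaf
};

/// Item index -> (which chunked_leaf node, which chunk inside its blob).
fn splitIndex(item_idx: usize, items_per_chunk: usize) struct { leaf: usize, intra_chunk: usize } {
    const chunk_idx = item_idx / items_per_chunk;
    return .{
        .leaf = chunk_idx / ChunkedLeaf.K,
        .intra_chunk = chunk_idx % ChunkedLeaf.K,
    };
}
```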
`branch_struct` named the variant after its tree-shape role (a "branch" that wraps a struct), but it always backs a Container/StructContainer SSZ type. `container_struct` matches the actual semantic.
Also tighten two doc comments:
- `lazy_sentinel`: now lists all three variants that use it (branch, chunked_leaf, container_struct) and explains why the lazy/computed flag lives in the root column rather than in `State.kind` (keeps structural type orthogonal to cache validity, so each new kind costs one slot in `kind` and `getRoot` toggles laziness without touching State).
- `ChunkedLeaf.computeRoot`: explain why we don't reuse `hashing.merkleize` (it requires mutable input; `chunks` is *const because the heap blob is shared across CoW siblings) and what the 16 KB scratch `buf` is doing in each round.
No behavior change; PMT/SSZ/consensus_types/fork_types/state_transition unit tests + zig fmt clean.
…order
Pool was placed below Id (lines 1062..1479) after the SoA refactor; main keeps the canonical order State → Pool → Id, which reads top-down: low constants → storage → handle/methods. Moves the Pool block (~420 lines) back to that slot. Also drops the stale "Defense-in-depth" comment block on Id.getRoot's `.free` arm — the `@panic` message itself ("getRoot called on .free slot — use-after-free") is sufficient. No behavior change; PMT/SSZ/state_transition tests + zig fmt clean.
…AtDepth
Both `setNodes` and `setNodesAtDepth` re-bound `states` and `payloads` at the end of each loop iteration, with a comment claiming `pool.unref` or `pool.rebind` "may have grown the pool indirectly via subsequent calls". Neither path can grow the MAL:
- `pool.rebind` writes existing slots and calls `refUnsafe` (rc++ only).
- `pool.unref` decrements rc, transitions slots to free, and calls `allocator.destroy` on chunked_leaf / container_struct heap blobs. The `container_struct.deinit` callback signature is `fn(ptr, Allocator) void` — it doesn't get a `*Pool` and so can't trigger `Pool.alloc`/`preheat`.
The only path that grows the underlying MultiArrayList is `Pool.alloc` -> `Pool.preheat` (via free-list exhaustion), and that call site already refreshes the slices when it returns true. The end-of-iter refresh is dead code; removing it has no behavior change. Also tighten the initial-bind comment to match: only `pool.alloc` needs to be followed by a refresh.
PMT/SSZ/state_transition tests + zig fmt clean.
After the file-as-struct refactor `ChunkedLeaf.Storage` no longer exists — `ChunkedLeaf` itself is the struct. The leftover `Storage` naming is misleading.
- `Id.getChunkedLeafStorageMut` -> `Id.getChunkedLeafPtr` (also drops the Rust-style `Mut` suffix for the Zig stdlib's `Ptr` convention; cf. `HashMap.getPtr`).
- Private helper `chunkedLeafStorage` -> `chunkedLeafPtr`.
- Doc comments referring to "the Storage invariant" / "Heap Storage cloned" updated to "trailing-zero invariant" / "Heap blob cloned".
- Drop two redundant inline comments that explained well-understood Zig idioms (slice-after-MAL-bind once, hot-loop bytes-per-visit micro-detail).
Local variable names `storage` / `old_storage` / `new_storage` are kept — they describe the role of the pointed-to heap blob, not the type.
PMT/SSZ tests + zig fmt clean.
Summary of Changes (Gemini Code Assist)
This pull request introduces significant performance improvements to the persistent merkle tree, primarily targeting state transition efficiency. By restructuring node storage and introducing an opt-in chunked-leaf packing mechanism for basic lists, the changes drastically reduce memory overhead and cache misses during deep-tree traversals. Additionally, the PR includes necessary API extensions for these new node types and addresses critical memory safety bugs.
Code Review
This pull request optimizes the Persistent Merkle Tree and SSZ implementations by introducing chunked leaves and struct-backed containers, supported by a refactor of the Node and Pool structures into a compact SoA layout. Feedback identifies several safety and style guide violations, most notably the use of large stack allocations (up to 32KB) which risk stack overflow. The reviewer also highlighted potential reference count corruption in createBranch and rebind under error conditions, the use of forbidden recursion in getRoot, and suggested panicking on use-after-free in unref to maintain the project's safety standards.
Addresses Gemini code review on PR #346:
Stack growth (TigerStyle: no large stack frames). 7 sites previously held 16-32 KB on the stack inside hot or recursive paths. Replaced with heap-backed alternatives:
- `ChunkedLeaf.computeRoot` now takes a caller-supplied `*[K/2][32]u8` scratch (16 KB). New `computeRootAllocating(allocator, out)` wraps it with a per-call `allocator.alignedAlloc` + free; OOM panics. `Id.getRoot`'s chunked_leaf arm uses this; bench shows ~3% slowdown on `fromValue+getRoot` from the per-call alloc, acceptable for keeping scratch off-stack.
- New `Pool.createChunkedLeafEmpty(len)` returns a zeroed heap blob; the caller fills via `getChunkedLeafPtr`. Replaces the `var chunked_leaf_buf: [K][32]u8 align(64)` 32 KB stack scratch + `pool.createChunkedLeaf(&buf, ...)` memcpy pattern at:
  - `chunks.zig` set `.zero` materialization
  - `list_basic.zig` sliceTo boundary (already used the nullable + defer ownership pattern)
  - `list.zig` fromValue (byte-stream + item paths)
  - `vector.zig` fromValue (byte-stream + item paths)
Refcount ordering in `createBranch` and `rebind`. Both wrote parent metadata (kind=.branch, payload=children) BEFORE acquiring child refs. If `refUnsafe(right)` fails, the parent is a `.branch` pointing at a child whose rc was never incremented; the caller's `unref(parent)` then recursively dec's the un-ref'd child, corrupting its rc. The bug exists on main too but never triggered there (no heap-pointer kinds → no SEGV from dangling payloads). Fix: ref both children first with errdefer rollback, then write the parent.
`unref(.free)` now panics. Was a silent continue. The masked bug was caught by `node_test.test.Node.State predicates`, which had a double-unref pattern (`createBranch(leaf, leaf)` refs leaf twice; the test had both `defer unref(leaf)` and `defer unref(branch)` — branch's destruction cascade frees leaf, then the explicit unref(leaf) hits .free). Fixed the test by removing the unnecessary `defer unref(leaf)`; createBranch transitively owns the leaf.
Ownership transfer pattern at 5 fromValue sites and chunks.zig set: `var id_opt: ?Node.Id = try ...createChunkedLeafEmpty(...)` + `errdefer if (id_opt) |id| pool.unref(id)` + `id_opt = null` after `it.append` (sketched below). Cleanly handles the early-`InvalidLength` failure mode of append. (append's mid-loop createBranch failure is a pre-existing leak in `FillWithContentsIterator` itself; out of scope here. A future `appendAssumeCapacity`-style refactor would eliminate it.)
`chunks.zig` `.zero` materialization now passes `len = intra_chunk + 1` instead of `0` so the chunked_leaf trailing-zero invariant holds.
Verified: PMT/SSZ/state_transition tests + zig fmt clean. Bench `bench_list_chunked_leaf` shows fromValue+getRoot 17.8 → 18.4 ms (+3%, per-call alloc cost), fromValue/toValue within noise, no regression.
Not addressed in this commit (deferred to follow-ups):
- `Id.getRoot` recursion → iterative (Gemini MEDIUM; bounded by max_depth=64, real but not urgent).
- `FillWithContentsIterator.append` mid-loop leak (needs API rework to an appendAssumeCapacity model).
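The ownership-transfer pattern, sketched as a helper (only `createChunkedLeafEmpty`, `getChunkedLeafPtr`, `unref`, and `append` are names from the text; the fill step is elided):

```zig
fn appendOneChunkedLeaf(pool: *Pool, it: *FillWithContentsIterator, len: u16) !void {
    var id_opt: ?Node.Id = try pool.createChunkedLeafEmpty(len);
    // Release the slot if anything below fails before ownership transfers.
    errdefer if (id_opt) |id| pool.unref(id);

    const blob = try id_opt.?.getChunkedLeafPtr(pool);
    _ = blob; // ... fill chunks here ...

    try it.append(id_opt.?); // early-InvalidLength failure lands on the errdefer
    id_opt = null; // transferred: the errdefer above becomes a no-op
}
```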
Pool now holds two allocators routed by allocation kind:
- `page_allocator` (default `std.heap.page_allocator`): MAL columns +
chunked_leaf 32 KB blobs + scratch. Page-aligned, infrequent.
- `allocator` (default `std.heap.c_allocator`): per-node small heap blobs
(ContainerStructRef ~32 B, WrappedT ~150 B). Page-per-alloc on these
costs ~70 GB virtual address space at 2.18 M validators on macOS arm64;
bucket allocator packs them densely.
This fixes the `serializeValidators` binding regression at mainnet scale:
24097 ms → 469 ms (51× speedup, equal to main).
`Pool.init` switched to options-struct shape with sensible production
defaults; callers can write `.init(.{})` for production or override any
combination of {page_allocator, allocator, pool_size}. `pool_size`
default sized for mainnet fulu BeaconState (~10 M nodes; touches only
the `state` column at init = ~10 MB physical, rest lazily committed).
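Hypothetical call shapes for the options-struct init (field names come from the text; the error set, defaults, and `Pool` import are assumptions):

```zig
const std = @import("std");

test "Pool.init option shapes" {
    var pool = try Pool.init(.{}); // production defaults
    defer pool.deinit();

    var custom = try Pool.init(.{
        .page_allocator = std.heap.page_allocator, // MAL columns + 32 KB chunked_leaf blobs
        .allocator = std.heap.c_allocator, // small per-node heap blobs
        .pool_size = 1 << 20, // override any combination
    });
    defer custom.deinit();
}
```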
Bench harness adopts the binding pattern: `DebugAllocator` in Debug,
`c_allocator` otherwise. Removes the previous `DebugAllocator`-everywhere
behavior that artificially inflated absolute numbers and made
`sync_aggregate` look 2.2× regressed.
`bindings/napi/pool.zig` collapses to a single `Pool.init(.{})` call.
Two CI failures from c120b5f:
1. `node_bench.zig:91` used `Pool.init(allocator, 50_000_000)` without a leading `.` (direct `Pool` import), so the sed pattern `\.Pool\.init(` skipped it. Fixed manually to the opts-struct form.
2. `test/spec/writer/{finality,random,sanity}.zig` embed test code as multi-line `writer.print` template strings. The previous sed expanded `Pool.init(allocator, pool_size)` → `Pool.init(.{ .allocator = ..., .pool_size = ... })` inside those templates, but the `{` and `}` braces inside the template are interpreted as `writer.print` format placeholders. Escape them to `{{` and `}}` so they render as literal braces in the generated spec test files.
Adds two helpers to skip the 263 MB clone in `validatorsSlice`-style read passes over a `List<StructContainerType<T>>`:
- `StructContainerType.tree.getValuePtr(node, pool) -> *const Type` reads the wrapped value directly from a `.container_struct` node's payload, no copy.
- `ListCompositeTreeView.ReadonlyIterator.nextValuePtr() -> *const T` uses it to hand out per-element pointers during iteration. A compile-time guard rejects element types that don't implement `getValuePtr`.
Migrating `EpochTransitionCache.init`'s validator loop (the largest consumer) drops `before_process_epoch` 51.4 ms → 40.5 ms (-21%) and `epoch_total` 86.5 ms → 75.4 ms (-13%) at mainnet fulu (2.18 M validators). The pointer is valid only while the node's refcount holds and no CoW mutation replaces it, so this is restricted to transient read passes.
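A hypothetical forward read pass over the new API (`nextValuePtr` is from the text; the iterator accessor name and loop body are assumptions):

```zig
fn countSlashed(view: *ListCompositeTreeView) !usize {
    var it = view.readonlyIterator(); // hypothetical accessor name
    var count: usize = 0;
    while (try it.nextValuePtr()) |v| {
        // v: *const Validator.Type — points into the pool's container_struct
        // payload; valid only for this transient read pass (no CoW mutation).
        if (v.slashed) count += 1;
    }
    return count;
}
```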
…tive to ptr iter Drops the full 263 MB validators clone in this binding hot path. The function only reads `validator.slashed` per element so the iterator's per-element pointer access is exactly what's needed. Mainnet binding test (`getEffectiveBalanceIncrementsZeroInactive`): 762 ms -> 40 ms (19× faster).
…lone) Adds a pointer-slice variant of `validatorsSlice` that hands out `*const Validator.Type` into the pool's container_struct payloads — 17 MB pointer slice vs 263 MB value clone at mainnet (2.18 M validators, ~16x lower). Migrates `EpochCache.init` (state load hot path), `upgrade_state_to_altair`, and the test utility `generate_state` to the pointer-slice form. The callers' loops barely change because Zig auto-derefs `*const T` for field access (`v.effective_balance` works whether `v` is `T` or `*const T`). `syncPubkeys`/`syncPubkeysParallel` and `sumTargetUnslashedBalanceIncrements` take `[]const *const Validator` instead of `[]const Validator` so the parallel worker fan-out keeps random-access semantics on a slice of pointers. Pointers stay valid only while the validators list isn't mutated. Mutation goes through the standard `tree.set(i, &new_val)` path — copy out the value, edit, write back; the original pointer is then invalid for the mutated slot.
`SlashingsCache.initFromValidators` only reads `validator.slashed` per element — perfect zero-copy candidate. Switch its `validators` parameter from `[]const Validator` to `[]const *const Validator`, and update `buildFromStateIfNeeded` to call `state.validatorsPtrSlice` instead of `getAllReadonlyValues`. Saves the 263 MB validators clone every time `buildFromStateIfNeeded` runs (which happens lazily on first slashing-cache lookup per state).
…Slice `syncPubkeys` parameter changed to `[]const *const Validator` in 9aeab18; the bench harness was the last untouched caller.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25d18710ff
```zig
const chunked_leaf_count = (chunk_count + ChunkedLeaf.K - 1) / ChunkedLeaf.K;
const chunked_leaf_ids = try self.state.allocator.alloc(Node.Id, chunked_leaf_count);
defer self.state.allocator.free(chunked_leaf_ids);
try self.state.root.getNodesAtDepth(self.state.pool, chunked_leaf_depth, 0, chunked_leaf_ids);
```
Read chunked list nodes through staged child cache
In the use_chunked_leaf branch, getAllInto reads chunked-leaf IDs directly from self.state.root, so it bypasses TreeViewState.children_nodes/changed where uncommitted set/push updates are staged. This means ListBasicTreeView.getAll/getAllInto can return stale pre-mutation values until commit() is called, unlike the non-chunked path which resolves nodes via state.getChildNode and includes staged edits.
Codex Review: Didn't find any major issues. You're on a roll.
Reported by codex review on PR #346. The chunked_leaf branch of `ListBasicTreeView.getAllInto` (`chunks.zig:165`) walked `self.state.root` directly via `getNodesAtDepth`, bypassing `TreeViewState.children_nodes` / `changed`. Uncommitted `set` / `push` writes are staged in `children_nodes` until `commitNodes()` rolls them into the root, so callers reading without an intermediate `commit()` got the pre-mutation values.
Fix: after the bulk root walk, override each chunked_leaf id with the staged entry from `children_nodes` if present. The non-chunked path already does the equivalent via `populateAllNodes` + `getChildNode`.
Adds two regression tests covering staged `set` (boundary across two chunked_leaves) and staged `push`. Both fail without the fix.
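A sketch of the overlay step (the `gindexAt` helper is hypothetical; the other names follow the text):

```zig
fn readChunkedLeafIds(self: *Self, ids: []Node.Id) !void {
    // Bulk walk of the committed root...
    try self.state.root.getNodesAtDepth(self.state.pool, chunked_leaf_depth, 0, ids);
    // ...then overlay staged, uncommitted children so getAll sees pending writes.
    for (ids, 0..) |*id, i| {
        const gindex = gindexAt(chunked_leaf_depth, i); // hypothetical
        if (self.state.children_nodes.get(gindex)) |staged| id.* = staged;
    }
}
```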
The readonly iterator walks `state.root` directly via DepthIterator and bypasses `state.children_nodes`, so uncommitted `set`/`push` would be invisible. TS's read-only iteration paths (`getAllReadonly`/`forEach`) respect pending writes via `getReadonly(i)`; this Zig fast-path takes the opposite trade-off (perf over implicit pending visibility).
Adds a `std.debug.assert(state.changed.count() == 0)` at both `iteratorReadonly` entry points (list_basic, list_composite) so violations panic in Debug/ReleaseSafe; ReleaseFast still pays nothing. A doc comment on the public method makes the contract explicit. All existing callers already commit before iterating; the full module test suite passes unchanged.
Motivation
State transition is dominated by PMT operations on BeaconState's large basic-element lists (`Balances`, `EpochParticipation`, `InactivityScores`, ~1.4M items each) and per-field tree access for struct-shaped containers (`Validator`). On a mainnet fulu state with 2.18M validators, these account for the bulk of `processEpoch` and `processBlock` runtime.
Description
Five PMT-level changes that compose:
1. `u64` payload column — collapses every node kind's payload (branch left/right Ids, chunked_leaf pointer, container_struct vtable pointer, free-list link) into one machine word. State packs `[free_bit:1 | kind:3 | ref_count:28]` in a `u32`. Cache validity moves out of `kind` into a `0xFF…` sentinel in the `root` column. Hot tree-walk visits touch only `state` (1 B) + 4-8 B from `payload`, fitting one cache line.
2. `chunked_leaf` — opt-in via `opts.chunked_leaf=true` on `FixedListType`/`FixedVectorType`. The bottom `k_log2 = 10` levels of the chunks subtree fold into one `*ChunkedLeaf` heap blob holding K=1024 chunks. For a 1M-item `List<u64>`, pool metadata drops ~1000× (256K Node.Id → 256 ChunkedLeaves + 256 heap blobs). Bulk read/write get SIMD-batched root recomputation and amortized CoW (one 32 KB memcpy per dirty leaf instead of 10 path clones per dirty chunk).
3. `container_struct` (originally feat: model phase0 Validator as struct #232) — a node kind whose payload is `*ContainerStructRef` (vtable + caller-allocated `T`). Backs `StructContainerType` for `Validator` etc. Field access = O(1) struct read instead of per-field tree walk; `hashTreeRoot` calls the type's cached `get_root` directly.
4. Pool dual-allocator —
`Pool` now keeps two allocators routed by allocation kind: `page_allocator` for MAL columns + chunked_leaf 32 KB blobs (page-aligned, infrequent), and `allocator` for per-node small heap blobs (`ContainerStructRef` + `WrappedT`). Page-per-alloc on the small lane wastes ~70 GB of virtual address space at 2.18M validators on macOS arm64 and thrashes the TLB; the bucket allocator (default `c_allocator`) packs them densely. `Pool.init` switched to options-struct shape (`Pool.init(.{})` for production defaults). This unblocks the `serializeValidators`/`getEffectiveBalanceIncrementsZeroInactive`/`getSingleProof` binding tests at mainnet scale (24 s → ~500 ms, 50× speedup, equal to main).
5. Zero-copy validator access — completes the container_struct value chain. PR feat: model phase0 Validator as struct #232 added
`pool.getStructPtr(node, T)` but no list-iteration API was built on top, so callers still cloned the full 263 MB validators slice per epoch transition. This PR adds:
- `StructContainerType.tree.getValuePtr(node, pool) -> *const T` — direct typed pointer into the pool's container_struct payload.
- `ListCompositeTreeView.ReadonlyIterator.nextValuePtr() -> *const Element.Type` — list iteration that hands out per-element pointers as the depth-iterator walks the tree.
- `BeaconState.validatorsPtrSlice(allocator) -> []*const Validator.Type` — random-access pointer slice for callers that need sort / parallel workers / multi-pass.
The two APIs are complementary: the iterator wins for single forward read passes (
`epoch_transition_cache.init`, `getEffectiveBalanceIncrementsZeroInactive`); the pointer slice wins for sort + random index access + parallel workers (`epoch_cache.init` calling `syncPubkeys`, `slashings_cache.buildFromStateIfNeeded`, `upgrade_state_to_altair`). 8 of 9 hot callers migrated; the last (`upgrade_state_to_electra`) keeps the value slice because its mutate-then-reread pattern would invalidate pointers.
Bench
`bench_process_epoch` and `bench_process_block` on mainnet era, fulu fork, slot 13336576 (2.18M validators), ReleaseFast, Apple Silicon. Both branches run with the same bench harness using `c_allocator` (no DebugAllocator overhead) for an apples-to-apples comparison.
Process epoch (segmented breakdown, ms/run averaged over 50 runs)
[table: before_process_epoch / inactivity_updates / rewards_and_penalties / effective_balance_updates / proposer_lookahead]
- `before_process_epoch` (EpochTransitionCache.init) drops 4.89×: container_struct gives O(1) per-field reads on validators, and the new `nextValuePtr` iterator skips the 263 MB clone that `validatorsSlice` used to do every epoch.
- `inactivity_updates`, `rewards_and_penalties`, `effective_balance_updates` get 5-13× from chunked_leaf making bulk reads/writes on `Balances`/`InactivityScores`/`EpochParticipation` SIMD-friendly + amortized CoW.
Process block (segmented breakdown, ms/run averaged over 50 runs)
[table: operations / block_header / withdrawals / execution_payload / randao / sync_aggregate]
- `operations` (the bulk of block processing) gets 2.60× — chunked_leaf on the balance writes plus zero-copy validator reads in `slashings_cache.buildFromStateIfNeeded`.
- `sync_aggregate` is unchanged: the chunked_leaf CoW cost from scattered sync-committee balance writes (~290 µs of memcpy at mainnet sync committee size) is dominated by BLS aggregate verification (~1.1 ms fixed cost, identical across branches).
Binding tests at mainnet scale
[table: getEffectiveBalanceIncrementsZeroInactive / serializeValidators (broken w/ page_allocator) / getSingleProof(169) (broken)]
Linux verification (AMD EPYC 9V74, 16 vCPU codespace, ReleaseFast)
Same fixture (mainnet era, fulu fork, slot 13336576, 2.18M validators), 50 runs/step. Speedup ratios reproduce on Linux/x86; absolute numbers are higher than Apple Silicon due to per-core differences.
Process epoch (segmented breakdown, ms/run)
[table: before_process_epoch / inactivity_updates / rewards_and_penalties / effective_balance_updates / proposer_lookahead]
Process block (segmented breakdown, ms/run)
[table: operations]
Process block (end-to-end fused, ms/run)
[table: process_block (with BLS) / process_block_no_sig]
`process_block_no_sig` (BLS bypassed) drops 3.98× — the optimizations land cleanly on the non-BLS portion. The fused 1.27× reflects ~36-46 ms going to BLS aggregate signature verification per block, which is unaffected by PMT changes.