feat(FlatDB): long finality support with persisted snapshots by asdacap · Pull Request #11663 · NethermindEth/nethermind

asdacap · 2026-05-18T23:51:13Z

Summary

Adds a long-finality storage tier to Nethermind.State.Flat so the in-memory
snapshot ring (today bounded by MaxInMemoryReorgDepth, default 256) can be
extended to ~90,000 blocks (configurable via LongFinalityReorgDepth) without
the per-snapshot RAM cost growing linearly. Snapshots that fall off the
in-memory tail are converted into immutable, mmap-backed persisted
snapshots stored as HSST blobs in arena files, then logarithmically merged
("Linked" compaction) so the snapshot count stays bounded.

The new tier is opt-in (FlatDb.EnableLongFinality = false by default) and
sits underneath the existing in-memory snapshot ring; the persistence
pipeline, RPC/sync surface and consensus paths are unchanged.

Motivation

Today's flat DB caps reorg depth at MaxInMemoryReorgDepth (~256). Every
in-memory snapshot keeps its diff in RAM, so raising that ceiling is
proportionally expensive in heap.
Use cases that need to serve historical-tip queries or replay across a
large reorg window (long-finality CL designs, archive-lite serving)
need a way to keep many more snapshots cheap to retain and cheap to
read.
The persisted tier is an mmap-backed columnar format: paged access on
the read path means cold snapshots cost only file-descriptor + index
metadata until queried, and a PageResidencyTracker per arena drives
madvise(DONTNEED) so the resident set is bounded by a configured
page-cache budget rather than by total snapshot bytes on disk.

High-level design

Mirrors the in-memory `Snapshot` pipeline

A PersistedSnapshot is the on-disk twin of an in-memory Snapshot:
same (From, To) semantics, same StateId-keyed lookups, same
hierarchical compaction policy. The in-memory compactor in
SnapshotCompactor keys merge decisions off the lowest-set-bit of the
block number (blockNumber & -blockNumber), giving 1×, 2×, 4×, …,
CompactSize× layers; the persisted-tier compactor uses the same
lowest-set-bit rule (PersistenceManager.cs:118-128) so the
persisted-snapshot stack is just a continuation of the in-memory
snapshot stack across the tier boundary. As a result the number of
live persisted-snapshot layers is logarithmic in the configured
reorg depth rather than linear, with the same merge bookkeeping the
in-memory pipeline already implements. Code paths for reading
(GetFinalizedSnapshotAtBlockNumber falls through from in-memory
Snapshot to PersistedSnapshot at the same block/root if the
in-memory window has rotated past it), eviction, and compaction all
follow this mirror.

In-memory bloom filters skip cold persisted snapshots

Each persisted snapshot ships with a tiny in-RAM bloom filter
(PersistedSnapshotBloom, PersistedSnapshotBloomBitsPerKey
default 14.0 bpk). A single filter per snapshot covers
address/slot/self-destruct keys plus state-trie and storage-trie node
paths. Lookups iterate the snapshot stack from newest to oldest;
snapshots whose bloom says "definitely not here" are skipped without
touching the mmap'd HSST at all. This keeps the cost of walking a
deep snapshot stack proportional to the number of snapshots that
actually hold the key, not the total stack depth — important because
the long-finality stack is, by design, much deeper than today's
in-memory ring.

Custom directly-indexable on-disk format (HSST)

Persisted snapshots are not RocksDB-shaped. They use a purpose-built
columnar format (HSST)
designed so that a single HSST blob can be point-queried by key
without loading the whole blob: readers walk an embedded B-tree
index from the tail and pull only the pages they need. The format
also supports nesting (a value can itself be an HSST), which is what
makes the per-address / per-addressHash inner tables possible inside
a single outer snapshot file. Multiple specialised index variants
(B-tree, packed-array, dense-byte-index, two-byte-slot-value, …) are
picked per column by the writer to minimise blob size and seek cost;
the index-type byte at the blob tail selects the variant. Combined
with mmap + PageResidencyTracker, a cold snapshot costs only
file-descriptor + a few trailer pages of memory until something
actually queries it.

Trie RLP split out into separate blob arenas

The single biggest size win comes from not storing trie-node RLP
bytes inside the persisted-snapshot HSST. Each persisted snapshot's
HSST only stores 6-byte NodeRefs (BlobArenaId: u16,
RlpDataOffset: i32) where the actual RLP would have lived; the RLP
bytes themselves live in append-only blob arena files addressed by
those refs. In practice this shrinks the per-snapshot HSST by roughly
20× vs. storing RLP inline, because the trie-node payload
dominates the snapshot otherwise.

This split has knock-on benefits:

The HSST is small enough that compaction can rewrite it cheaply
without ever rewriting the RLP. Compacted snapshots' NodeRefs
still point at the original blob arena files.
Blob arena files are refcounted per file: every HSST that
references them holds a lease, and a file is only deleted when no
live snapshot still references its BlobArenaId. The
NodeRef-based deduplication that comes for free lets long-lived
trie nodes survive arbitrarily many snapshot compactions without
being copied.
16-bit BlobArenaId × 2 GiB-per-file (32-bit offset) caps each tier
at ~128 TiB of trie RLP.

Two-tier persisted snapshots

Small tier — short-range snapshots whose (To - From) < CompactSize.
This is the tier whose HSSTs are the small, mmap-backed, seekable
index files driven by PageResidencyTracker. Page-cache budget:
PersistedSnapshotSmallArenaPageCacheBytes (default 1 GiB resident).
Large tier — snapshots merged from ≥ CompactSize source
snapshots. Same HSST shape, larger files; page-cache budget:
PersistedSnapshotLargeArenaPageCacheBytes (default 8 GiB resident).
Hierarchical compactor merges adjacent snapshots into larger ones
with (To - From) doubling at each level, bounded by
PersistedSnapshotMaxCompactSize (default 1024) and capped at
PersistedSnapshotMaxCompactedSourceBytes source bytes (default 2
GiB) to keep merged HSSTs below int.MaxValue and the arena
ceiling. Compaction only rewrites HSSTs; blob arenas are untouched
and their RLP is inherited through the NodeRef indirection.
HSST arena files (ArenaFileSizeBytes, default 1 GiB) are
append-only; reservations are tagged for per-tag count/byte metrics.

HSST — Hierarchical Static Sorted Table

A new self-describing immutable on-disk format (Nethermind.State.Flat/Hsst/).
Top-level layout and rationale are documented in
Hsst/FORMAT.md;
the columnar persisted-snapshot vocabulary lives in
PersistedSnapshotTags.cs.

Index variants implemented (the trailing index-type byte selects):

0x01 BTree — variable-length values, value-first entries, tail B-tree
index. Optional 4 KiB leading padding lets the merger 4 KiB-align hot
blobs without rebuilding the entry.
0x02 PackedArray — fixed key/value array with a recursive summary
index and an optional hash table; SIMD floor-scan over summary levels.
0x04 DenseByteIndex — single-byte-keyed map indexed directly by tag.
Used for per-address sub-tag walks (one pinned trailer read resolves
any sub-tag).
0x05/0x06 TwoByteSlotValue / TwoByteSlotValueLarge — fixed
2-byte key map for slot-suffix layer (u16 / u24 offsets).
0x07 BTreeKeyFirst — key-first entry layout for outer columns
whose values are large nested HSSTs (parallel to the inner HSST's
keys-first index).

Columnar layout (outer HSST → per-key inner HSSTs):

0x00 Metadata
0x01 Address (raw 20 bytes) → {SlotSubTag, AccountSubTag, SelfDestructSubTag}
0x02 AddressHash → {StorageTopSubTag, StorageCompactSubTag, StorageFallbackSubTag}
0x03 StateNode (compact path, 6-15 bytes)
0x05 StateTopNode (path 0-5)
0x06 StateNodeFallback (path 16+)

Read-path engineering

WholeReadSession over mmap'd HSST arenas, routed through a
tracker-aware ArenaByteReader so every read participates in
PageResidencyTracker clock-eviction accounting.
Shared bloom-filter manager across tiers (filters described in the
bloom subsection above); bloom memory exposed via gauge metrics.
Direct-mapped per-snapshot address-bound cache (Vector512<long>
inline set) cuts repeated outer-HSST descents on hot addresses.

Lifecycle / GC

On Snapshot → PersistedSnapshot conversion, trie nodes are marked
persisted and pruned from the in-memory trie.
PageResidencyTracker per arena drives madvise(DONTNEED) and
(opt-in via PersistedSnapshotFadviseOnPageEviction)
posix_fadvise(POSIX_FADV_DONTNEED). Memory is reported to the
GC and exposed as per-tier gauge metrics.
LongFinalityReorgDepth is the hard tail — once exceeded, the
oldest HSST snapshot is force-persisted into the RocksDB tier.

Config surface

All new keys on IFlatDbConfig (default-off feature, conservative defaults):

Key	Default	Purpose
`EnableLongFinality`	`false`	Feature gate.
`LongFinalityReorgDepth`	`90000`	Total in-memory + persisted snapshot count.
`PersistedSnapshotPath`	`"snapshots"`	Arena directory under data dir.
`ArenaFileSizeBytes`	`1 GiB`	Arena file ceiling.
`PersistedSnapshotSmallArenaPageCacheBytes`	`1 GiB`	Resident-page budget, small tier.
`PersistedSnapshotLargeArenaPageCacheBytes`	`8 GiB`	Resident-page budget, large tier.
`PersistedSnapshotFadviseOnPageEviction`	`false`	Benchmark-only: also fadvise on tracker eviction.
`PersistedSnapshotMaxCompactSize`	`1024`	Hierarchical-compactor ceiling (`(To-From)` doubling stops here).
`PersistedSnapshotMaxCompactedSourceBytes`	`2 GiB`	Per-merge source-byte cap (keeps merged HSST below `int.MaxValue`).
`PersistedSnapshotBloomBitsPerKey`	`14.0`	Per-snapshot bloom precision (0 = disabled).
`ValidatePersistedSnapshot`	`false`	Debug-only: validate persisted vs in-memory after conversion.

Observability

Per-tier gauges: arena-file count, mapped bytes, page-tracker memory,
bloom memory, per-tag reservation count/bytes.
Per-tier PageResidencyTracker eviction counters.

Performance

Graph is default, min reorg 300, 4k, 16k, 90k. 16k and 90k have longer run so that the reorg depth is reached.
Memory systemd limited to 48GB.
Note, for 16K around 4.5GB is neded for the snapshots and 25GB for the 90K. This means they are largely in RAM except for the trie rlps.

Notes:

The 300-block point is intentionally inside today's in-memory
reorg window, so the persisted tier is effectively idle — confirms
the feature flag adds no measurable overhead when not exercised.
The 4k / 16k plateau (~20% on both metrics) reflects the steady
cost of walking the deeper snapshot stack for reads plus the
background work of bloom construction, HSST writes, and
hierarchical compaction.
The 90k step drops blocks/sec more than MGas/sec, suggesting the
fixed per-block overhead (lookups, bloom checks, HSST descent) is
the dominant cost at very long stacks rather than per-gas state
access cost.
The practical ceiling on LongFinalityReorgDepth is more likely
to be the blob-arena address space than CPU: per tier the
NodeRef encoding caps total trie-RLP storage at
ushort.MaxValue files × 2 GiB ≈ 128 TiB, and the small-tier
blob arena fills proportionally to the snapshot count rather than
to compaction.

Compatibility

Feature is disabled by default; with EnableLongFinality=false
none of the persisted-snapshot code runs. RocksDB column-family
layout is preserved (two retained empty column ordinals documented
in FlatDbColumns.cs).
On-disk format version (PersistedSnapshotTags.MetadataFormatVersion)
is 0x02. The SnapshotCatalog v2→v3 schema bump trips a
"wipe and resync" error if pointed at an older directory.
No protocol/JSON-RPC surface changes. No consensus-path changes.

Scope

423 commits, 166 files (+29,595 / -341).
Touched projects: Nethermind.State.Flat (104), Nethermind.State.Flat.Test (42),
Nethermind.Core (4 — NativeMemoryList, Leb128), Nethermind.Benchmark (4),
Nethermind.Trie/Db/Init/Runner (small wiring).
Commit-prefix breakdown: 163 refactor, 132 perf, 38 feat,
36 fix, 13 test, 4 docs, 3 revert, 3 config.

Review aids

Suggested reading order if reviewing top-down:

Hsst/FORMAT.md — wire format.
NodeRef.cs +
Storage/BlobArenaManager.cs —
the trie-RLP indirection that makes the HSSTs small.
PersistedSnapshotTags.cs — columnar vocabulary.
PersistedSnapshot.cs, PersistedSnapshotReader.cs — read path.
PersistedSnapshotBuilder.cs, PersistedSnapshotMerger.cs,
PersistedSnapshotCompactor.cs — write/compaction path.
Storage/ArenaManager.cs, Storage/PageResidencyTracker.cs,
Storage/WholeReadSession.cs — HSST arena lifecycle.
PersistenceManager.cs, FlatWorldStateModule.cs — wiring into
the existing flat persistence pipeline.

Types of changes

Testing

Requires testing

Yes

If yes, did you write tests?

Yes

Notes on testing

42 new test files in Nethermind.State.Flat.Test cover HSST round-trips
(all index variants), persisted-snapshot build/read/merge/compact paths,
page-tracker eviction, bloom-filter correctness, and N-way merge cursor.
ValidatePersistedSnapshot=true cross-checks every persisted snapshot
against the source in-memory snapshot at conversion time; intended for
diagnostic runs but available for review.
Benchmarks: HsstReaderBenchmark, PersistedSnapshotBenchmark,
PersistedSnapshotCompactBenchmark in Nethermind.Benchmark/State/.
Recommended local validation: enable EnableLongFinality=true,
ValidatePersistedSnapshot=true, sync mainnet from snapshot for at
least a window > MaxInMemoryReorgDepth and confirm:
- persisted-snapshot count grows then plateaus once compaction kicks in,
- PageResidencyTracker resident bytes track the configured budget,
- no Invalid Block / Exception lines in logs.

Documentation

Requires documentation update

Yes
No (new format spec ships in-tree as Hsst/FORMAT.md; user-facing
docs page can come in a follow-up once the feature flips default)

Requires explanation in Release Notes

Yes — "FlatDB: opt-in long-finality persisted-snapshot tier
(FlatDb.EnableLongFinality). See config keys above."

Remarks

This PR is intentionally large because the persisted-snapshot tier
is meaningful only end-to-end (format + write path + read path +
compactor + lifecycle + observability). Splitting it further would
introduce dead, unwired code into master between PRs.
All commits are on origin (flat/long-finality) and the branch is
fast-forward-clean against master.

claude · 2026-05-18T23:55:32Z

Claude finished @asdacap's task in 8m 56s —— View job

PR Review: feat(FlatDB): long finality support with persisted snapshots

Todo List:

Read repository rules and guidelines
Analyze core HSST format and data structures
Review write path (builder, merger, compactor)
Review read path and memory management
Review persistence manager and wiring
Review safety, security, and robustness
Review tests coverage
Compile findings and post summary

Overview

This is an impressively engineered PR: a custom mmap-backed columnar format (HSST), a logarithmic compaction hierarchy, per-snapshot bloom filters, a clock-based page-residency tracker, and a blob-arena indirection layer to keep HSST sizes small. The HSST format spec (FORMAT.md) is thorough and the unsafe code is carefully justified throughout. Test coverage (42 new files, ValidatePersistedSnapshot cross-check mode) is solid.

The three findings below block merge in my assessment. The rest are improvements that should be made before or shortly after landing.

Critical / High

1. `EnableLongFinality = false` does not actually disable persisted-snapshot code [High]

EnableLongFinality is declared in IFlatDbConfig and set to false in FlatDbConfig, but it is never read in any logic path — PersistenceManager, FlatDbManager, and FlatWorldStateModule do not reference it.

Consequences:

FlatWorldStateModule.PerTierState always constructs real ArenaManager / BlobArenaManager / PersistedSnapshotRepository instances, regardless of the flag. Arena directories are created on disk unconditionally.
PersistenceManager.DetermineSnapshotAction can return a snapshotLevelToConvert and trigger ConvertSnapshotToPersistedSnapshot (writing HSST files to disk) whenever the in-memory snapshot window exceeds MaxInMemoryReorgDepth + CompactSize blocks with stalled finality — even with EnableLongFinality = false.

The PR description states: "with EnableLongFinality=false none of the persisted-snapshot code runs" — this is not currently true. Either wire NullPersistedSnapshotRepository when the flag is false, or gate DetermineSnapshotAction's conversion path on the flag, or update the documentation to reflect actual behaviour. See inline comment on IFlatDbConfig.cs:53

2. `.Wait()` on async channel write in `ProcessCompactBatch` [High]

_boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait() (PersistenceManager.cs:137) synchronously blocks a thread-pool thread while the bounded channel (capacity 16) is full. This violates the robustness rule against .Wait() inside async contexts and can cause thread-pool starvation. ProcessCompactBatch should be made async Task and the write should be awaited with the existing CancellationToken. See inline comment

3. No runtime guard in `ReadBlobArenaRlp` when `totalLength > bytesRead` [High]

PersistedSnapshot.ReadBlobArenaRlp (PersistedSnapshot.cs:524-535) does buf[..totalLength].CopyTo(result) without checking that totalLength <= bytesRead. If the RLP header yields a length larger than the 568-byte read buffer (or if a short read occurs), this either throws ArgumentOutOfRangeException or silently copies uninitialized bytes into the returned node RLP — both are worse outcomes than a clear error. The 568-byte estimate is sound for standard MPT nodes, but should be guarded at runtime. See inline comment

Medium

M1. No-op LINQ .Select() in PersistSnapshot hot path — PersistenceManager.cs:545 and :571 both call .Select(v => ((Hash256, TreePath))v) on a List<(Hash256, TreePath)>, which is an identity cast that allocates an iterator object per persistence call. Replace with a plain foreach. Inline comments at L545 and L571

M2. Parallel.ForEach ignores CancellationToken — PersistenceManager.cs:382. The snapshot-conversion Parallel.ForEach does not pass a CancellationToken, so it cannot be interrupted during shutdown. Pass new ParallelOptions { CancellationToken = _cancelTokenSource.Token }. Inline comment

M3. NodeRef.RlpDataOffset signed int — no overflow guard in writer — The signed 32-bit offset caps blob-arena files at 2 GiB by design, but if ArenaFileSizeBytes is misconfigured above 2 GiB (or a future refactor changes the cap), the offset silently wraps to negative, silently corrupting all subsequent trie-node reads from that file. Add an explicit guard in BlobArenaWriter before the offset is committed. Inline comment

M4. NativeMemoryList<byte> for a 568-byte scratch buffer — PersistedSnapshot.ReadBlobArenaRlp allocates an off-heap buffer via NativeMemory.Alloc + NativeMemory.Free on every trie-node RLP read. A stackalloc byte[MaxTrieNodeRlpBytes] eliminates the two native allocation calls for this fixed-size case. Inline comment

Low

L1. LongFinalityReorgDepth config description missing unit — "Total max reorg depth" should state "in blocks". Inline comment

L2. Commented-out code block in TryGetAddressBound — Lines 284–291 of PersistedSnapshot.cs are a commented-out warm-up block. Either add a // TODO: with a tracking issue or remove. Inline comment

L3. TryGetSnapshotFrom is O(n) over all persisted snapshots — PersistedSnapshotRepository.cs:300-315 iterates both ConcurrentDictionarys to find by From key. Only reached in the exceptional force-persist path (TryGetForcePersistedSnapshot), so the practical impact is bounded, but keying a secondary index by From would be cleaner at this scale.

L4. PersistedSnapshotBloom.CreateAlwaysTrue direct field write — sentinel._leases.Value = long.MaxValue / 2 bypasses the normal lease API to make a sentinel immortal. A dedicated constructor parameter or a static factory flag would be safer and clearer.

Positive notes

The HSST format is well-designed for point-queries on cold mmapped files — value-first layout, B-tree tail index, per-index-type dispatch, all coherent.
The PageResidencyTracker 8-way set-associative clock cache with lock-free hot path and 1-bit spinlock on the miss path is a textbook-quality implementation.
The per-snapshot inline address-bound cache (Vector512<long> reinterpreted as 8 long slots) is clever and saves the outer HSST descent for repeated hot-address lookups.
The refcount-based ArenaFile / BlobArenaFile lifecycle, PersistOnShutdown flag, and SweepUnreferenced orphan cleanup collectively make the shutdown/restart correctness story solid.
42 test files with round-trip, merger, compactor, and bloom correctness coverage is excellent for a format this new.

claude · 2026-05-19T00:02:15Z

+            Parallel.ForEach(kv.Value, state => _smallCompactor.DoCompactSnapshot(state));
+
+        foreach (StateId boundary in boundaries)
+            _boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait();


High: Blocking .Wait() on async channel write — potential thread-pool stall

WriteAsync on a bounded Channel (capacity 16) returns an incomplete ValueTask when the channel is full. .AsTask().Wait() then synchronously blocks the current thread-pool thread — exactly the pattern the robustness rules prohibit (Never call .Wait() inside an async method).

ProcessCompactBatch runs inside RunPersistedCompactor's await foreach, so the current thread is a pool thread. If more than 16 boundary-block states accumulate in one batch, this thread stalls until a boundary compactor drains a slot, preventing other pool tasks from running.

Suggested change

_boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait();

foreach (StateId boundary in boundaries)

await _boundaryCompactJobs.Writer.WriteAsync(boundary, cancellationToken);

ProcessCompactBatch needs to become async Task ProcessCompactBatch(ArrayPoolList<StateId> batch, CancellationToken ct) and the call site in RunPersistedCompactor needs the corresponding await.

claude · 2026-05-19T00:02:26Z

+        using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);
+        Span<byte> buf = rented.AsSpan();
+        int bytesRead = file.RandomRead(offset, buf);
+        Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]);


High: No runtime guard on totalLength > MaxTrieNodeRlpBytes

ctx.PeekNextRlpLength() returns the full RLP length decoded from the header. If totalLength > 568 (or totalLength > bytesRead on a short read), then buf[..totalLength] throws ArgumentOutOfRangeException, or copies zero-filled tail bytes into result when totalLength fits in the buffer but the read was short.

The 568-byte cap is a reasonable estimate for Ethereum's worst-case branch node (~532 bytes), but there is no runtime assertion guarding the invariant. A defensive check here costs one compare on the cold path:

Suggested change

Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]);

int totalLength = ctx.PeekNextRlpLength();

if (totalLength > bytesRead)

{

// Re-read with an exact-size buffer. Should never happen for valid trie nodes,

// but guards against unexpected node formats or partial reads at arena boundaries.

byte[] full = new byte[totalLength];

buf[..bytesRead].CopyTo(full);

file.RandomRead(offset + bytesRead, full.AsSpan(bytesRead));

return full;

}

byte[] result = new byte[totalLength];

buf[..totalLength].CopyTo(result);

return result;

claude · 2026-05-19T00:02:38Z

    bool VerifyWithTrie { get; set; }
+
+    [ConfigItem(Description = "Enable long finality support with persisted snapshots", DefaultValue = "false")]
+    bool EnableLongFinality { get; set; }


High: EnableLongFinality flag is never read in any logic path

grep -rn "EnableLongFinality" across the Nethermind.State.Flat and Nethermind.Init source trees returns only the interface definition, the FlatDbConfig default, and a test asserting the default. The flag is never referenced in PersistenceManager, FlatDbManager, FlatWorldStateModule, or any compactor/repository.

As a result:

Real PersistedSnapshotRepository/ArenaManager instances are always constructed, regardless of the flag (see FlatWorldStateModule.PerTierState factory).

PersistenceManager.DetermineSnapshotAction can return snapshotLevelToConvert (triggering ConvertSnapshotToPersistedSnapshot) once the in-memory window exceeds MaxInMemoryReorgDepth + CompactSize when finality is stalled, even with EnableLongFinality = false.

The PR description states "with EnableLongFinality=false none of the persisted-snapshot code runs" — this is not true. Please either:

Add a guard in FlatWorldStateModule that wires NullPersistedSnapshotRepository when EnableLongFinality = false, OR

Add the flag check in PersistenceManager.DetermineSnapshotAction to skip the conversion path, OR

Correct the documentation to reflect the actual behaviour.

claude · 2026-05-19T00:02:45Z

@@ -268,7 +542,7 @@ internal void PersistSnapshot(Snapshot snapshot)

            long stateNodesSize = 0;
            // foreach (var tn in snapshot.TrieNodes)
-            foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)
+            foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v))


Medium: Pointless LINQ .Select() in the persistence hot path

_trieNodesSortBuffer is List<(Hash256, TreePath)> and the Select just recasts each element to the same type it already is. This allocates an iterator object on every call through the persistence path.

Suggested change

foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v))

foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)

claude · 2026-05-19T00:02:51Z

@@ -296,7 +571,7 @@ internal void PersistSnapshot(Snapshot snapshot)



Medium: Same no-op LINQ .Select() on the second sort buffer

Same pattern as line 545 — the cast is a no-op and the LINQ iterator is a needless allocation.

Suggested change

foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)

claude · 2026-05-19T00:03:00Z

+            addressBound = new Bound(lebOffset - valueLength, valueLength);
+            useSpanReader = addressBound.Length <= AddressBoundWarmupBytes;
+            // if (useSpanReader)
+            // {


Low: Commented-out code block — dead code left in

Lines 284–291 (the if (useSpanReader) warm-up block) are commented out with a note that REF bits need to be re-armed on cache hits. If this is intentionally deferred, a // TODO: with a tracking issue is clearer. If the behaviour was decided against, the block should be removed to keep the reader from reasoning about it.

claude · 2026-05-19T00:03:09Z

+    private byte[] ReadBlobArenaRlp(ushort blobArenaId, int offset)
+    {
+        BlobArenaFile file = _blobManager.GetFile(blobArenaId);
+        using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);


Medium: NativeMemoryList<byte> for a fixed 568-byte scratch buffer — prefer stackalloc

NativeMemoryList<byte> calls NativeMemory.Alloc + NativeMemory.Free on every trie-node RLP resolve. For the 568-byte case, a plain stackalloc avoids the two P/Invoke syscalls and keeps the buffer on the stack:

Suggested change

using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);

Span<byte> buf = stackalloc byte[MaxTrieNodeRlpBytes];

The using scope and its Dispose can be removed. The buffer is stack-allocated and the file.RandomRead call fills it in-place just as before. Since PersistedSnapshot is a class (not a ref struct), stackalloc is legal here.

claude · 2026-05-19T00:03:17Z

+    /// 32-bit caps a single blob arena file at 2 GiB. <see cref="BlobArenaWriter"/>
+    /// enforces this on append; <see cref="BlobArenaManager.CreateWriter"/> picks
+    /// a fresh file when the estimate exceeds the current file's headroom.
+    /// </summary>


Medium: Signed int offset silently wraps for files > 2 GiB

RlpDataOffset is int (signed), so the maximum addressable offset is int.MaxValue = 2,147,483,647 bytes ≈ 2 GiB. The comment correctly documents this cap, but BlobArenaWriter should enforce it explicitly before writing. If a blob arena file grows past 2 GiB (e.g., due to a misconfigured ArenaFileSizeBytes), the offset silently wraps to a negative value, producing silent data corruption rather than a clear error.

Consider adding an explicit guard in BlobArenaWriter.Complete / BlobArenaWriter.Write:

if (offset > int.MaxValue) throw new InvalidOperationException( $"BlobArena offset {offset} exceeds int.MaxValue; per-file cap is 2 GiB.");

claude · 2026-05-19T00:03:26Z

+                }
+
+                // Parallel base conversion across the whole batch
+                Parallel.ForEach(allStateIds, state =>


Medium: Parallel.ForEach has no CancellationToken — does not honour shutdown

The Parallel.ForEach call uses the default ParallelOptions, so there is no way to cancel it when _cancelTokenSource fires on node shutdown. If the loop is in the middle of a large batch during shutdown, it will run to completion before DisposeAsync can progress.

Suggested change

Parallel.ForEach(allStateIds, state =>

Parallel.ForEach(allStateIds,

new ParallelOptions { CancellationToken = _cancelTokenSource.Token },

state =>

{

if (_snapshotRepository.TryLeaseState(state, out Snapshot? snapshot))

{

long sw = Stopwatch.GetTimestamp();

_smallRepo.ConvertSnapshotToPersistedSnapshot(snapshot);

_persistedSnapshotConvertTime.WithLabels("base").Observe(Stopwatch.GetTimestamp() - sw);

snapshot.Dispose();

}

});

claude · 2026-05-19T00:03:34Z

+    bool EnableLongFinality { get; set; }
+
+    [ConfigItem(Description = "Total max reorg depth (in-memory + persisted). When exceeded, force-persist oldest HSST snapshot to RocksDB.", DefaultValue = "90000")]
+    int LongFinalityReorgDepth { get; set; }


Low: Config description missing units

The description "Total max reorg depth (in-memory + persisted)" doesn't state the unit. Per the coding style rules, config items must document their units. Should be "Total max reorg depth in blocks (in-memory + persisted)."

asdacap · 2026-05-19T00:16:51Z

                Assembly[] releaseAssemblies = additionalJobAssemblies
                    .Union(simpleJobAssemblies)
-                    .Append(typeof(KeccakBenchmark).Assembly)
+                    // Precompile benchmark assembly disabled: PrecompileBenchmarkBase.Inputs


asdacap · 2026-05-19T00:17:11Z

+/// N inputs — exercising the hot paths the optimisation targets.
+/// </summary>
+[MemoryDiagnoser]
+public class PersistedSnapshotCompactBenchmark : IDisposable


asdacap · 2026-05-19T00:17:33Z

+  <ItemGroup>
+    <!-- Pre-existing breakage from c8a5fbcff3 (long-finality refactor); excluded so the
+         project still builds for runnable benchmarks. -->
+    <Compile Remove="State/PersistedSnapshotBenchmark.cs" />


Remove clause, reset to master

asdacap · 2026-05-19T00:17:56Z

+    /// empty tree even if the blocknumber is not -1. This does not work with flat. We will ignore it for now.
+    /// </summary>
+    /// <param name="flatDbManager"></param>
+    private class FlatDbManagerTestCompat(IFlatDbManager flatDbManager) : IFlatDbManager


This is probably not needed anymore.

asdacap · 2026-05-19T00:20:51Z

        "resolved": "1.8.5",
        "contentHash": "EaCgmntbH1sOzemRTqyXSqYjB6pLH7VCYHhhDYZ59guHSD5qPwhIYa7kfy0QUlmTRt9IXhaXdFhNuBUArp70Ng=="
      },
-      "prometheus-net": {


Reset whole file to master.

Bundles the PR-review fixes for the persisted-snapshot tier: - Gate persisted-snapshot conversion paths on EnableLongFinality. When the flag is false, DetermineSnapshotAction skips both long-finality branches and falls through to the existing finalized-snapshot-to-RocksDB flow, matching pre-long-finality behaviour. FlatWorldStateModule now wires NullPersistedSnapshotRepository/Compactor when the flag is false so no arena/blob directories are created on startup. - async ProcessCompactBatch: replace .AsTask().Wait() on a bounded channel write with an awaited WriteAsync that honours _cancelTokenSource.Token. - ReadBlobArenaRlp: stackalloc the 568-byte scratch (was NativeMemoryList) and throw InvalidDataException if the RLP header declares more bytes than the pread returned, instead of slicing past the read window. - Parallel.ForEach in AddToPersistence now takes ParallelOptions with the shutdown CancellationToken. - Drop two identity-cast LINQ Select calls on _trieNodesSortBuffer in PersistSnapshot (per-snapshot iterator allocation). - Add unit ("in blocks") to LongFinalityReorgDepth's ConfigItem description. - Delete a dead commented-out warm-up branch in TryGetAddressBound. - PersistedSnapshotBloom.CreateAlwaysTrue routes the sentinel lease count through a private constructor overload instead of a post-construction field write. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

All six 'ArrayPoolList<X> list = new(N, N); X[] arr = list.UnsafeGetInternalArray();' sites in PersistedSnapshotMerger.cs simplify to: ArrayPoolList<X> list = new(N, N); Span<X> arr = list.AsSpan(); The downstream consumers were already in span form (cursor ctor takes Span<TSource>, helpers take Span<TSource>, etc.) and the redundant .AsSpan(0, N) calls collapse because list.AsSpan() already returns a span of length Count == N. The innerSources path keeps the [..innerN] slice since innerN varies per outer iteration. No need to access the raw rented array — UnsafeGetInternalArray was just a way to get a Span back, which AsSpan() already provides without the GuardDispose checks accumulating (we cache the span once at the top of each method). Verified: 0/0 warnings/errors prod + test; 876/876 + 7 skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…or, doc trim - Inline trivial single-call-site shims in HsstBTreeBuilder (MinBytesFor, WriteUInt64LE, KeySliceLength, FlushPendingAsEntries) - Make BTreeNodeWriter.HeaderSize a const - Consolidate Add / TryAddAligned via a shared private AddImpl - Switch BTreeNodeReader to a primary constructor; extract NodeMetadata to its own file (was nested) and inline DecodeValueSize into its sole caller - Trim 8 restate-only XML docstrings on private helpers; keep the ones that document invariants / triggers / spec-derived layout No behaviour change. Verified: full Nethermind.slnx builds clean; Nethermind.State.Flat.Test passes (869 / 7 pre-existing skips). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

asdacap · 2026-06-17T04:18:29Z

@claude re-review

claude · 2026-06-17T04:18:45Z

Claude finished @asdacap's task in 7m 32s —— View job

Claude finished @asdacap's re-review —

PR Re-Review: feat(FlatDB): long finality support with persisted snapshots

Todo List:

Read repository rules and guidelines
Inspect current branch state vs prior review (what changed)
Re-examine prior High findings — check if they were addressed
Review new/changed files in detail
Post inline comments
Compile updated findings and post summary

Status of Prior Findings

All three prior High findings are addressed:

#	Finding	Status
H1	`EnableLongFinality = false` did not disable persisted code	✅ Fixed — `NullSnapshotCatalog` / `NullPersistedSnapshotLoader` / `NullPersistedSnapshotCompactor` wired in `FlatWorldStateModule`; `PersistenceManager.DetermineSnapshotAction` returns `(null,null,null)` early when flag is off.
H2	`.Wait()` on async channel write in `ProcessCompactBatch`	✅ Fixed — `_boundaryCompactJobs.Writer.WriteAsync` now uses `await` with a `CancellationToken` (line 157). However, a new `.Wait()` on `_compactPersistedJobs.Writer.WriteAsync` has appeared in `PersistedSnapshotCompactor.Enqueue` — see new High below.
H3	No runtime guard in `ReadBlobArenaRlp` when `totalLength > bytesRead`	✅ Fixed — throws `InvalidDataException`; `stackalloc` replaces `NativeMemoryList`.

Medium findings:

#	Finding	Status
M1	No-op LINQ `.Select()` in hot path	✅ Fixed — code refactored
M2	`Parallel.ForEach` ignores `CancellationToken`	⚠️ Still present in `PersistenceManager.cs:238` — carry-over from prior review
M3	`NodeRef.RlpDataOffset` overflow guard missing	✅ Fixed — guard in `BlobArenaWriter.cs:107`
M4	`NativeMemoryList<byte>` for scratch buffer	✅ Fixed — `stackalloc` used

New Findings

High — `PersistedSnapshotCompactor.Enqueue` line 65 → Fix this

_compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait() blocks a thread-pool thread when the bounded channel (capacity 16, line 52) is full. The call chain is:

FlatDbManager.RunPersistence → NotifyWhenSlow → Task.Run → PersistIfNeeded → AddToPersistence → ConvertCompactedRange / ConvertSingleBase → compactor.Enqueue()

This is the same robustness violation as prior High finding H2, now at a different call site. If the compactor falls behind during a burst (e.g., catch-up sync producing many conversion candidates), the Task.Run thread stalls and thread-pool saturation follows. See inline comment on PersistedSnapshotCompactor.cs:65.

Medium — `SnapshotCatalog.ReadEntry` line 134

(SnapshotTier)span[100] casts without validating that the byte is a persisted-tier value (2–5). An on-disk value of 0, 1, or ≥6 silently produces wrong bucket assignment (IsPersisted() = false → entry ignored, or MetricTierLabel throws ArgumentOutOfRangeException at the next metric site). The format-version check guards schema changes but not per-entry corruption. One range check surfaces corruption immediately. See inline comment on SnapshotCatalog.cs:134.

Medium — `Parallel.ForEach` in `PersistedSnapshotCompactor.cs:145`

New instance of the same carry-over M2 pattern: the compactor's own _cancelTokenSource.Token is not passed to ParallelOptions. A large compaction batch running at shutdown time will block DisposeAsync until it completes. See inline comment on PersistedSnapshotCompactor.cs:145.

Positive Notes on New Work

The NullSnapshotCatalog / NullPersistedSnapshotLoader Null-object wiring is clean and correctly placed after the concrete registrations so the last AddSingleton wins.
The SnapshotCatalog key layout (48-byte big-endian key for RocksDB lexicographic ordering, 101-byte little-endian value) is consistent and well-documented.
The PersistedLargeCompacted tier split and the accompanying bucket in SnapshotRepository are well-structured; the tier ordering invariant (tier >= PersistedBase = persisted) is maintained.
The BlobArenaWriter overflow guard (_written + rlp.Length > int.MaxValue → throw) and the ReadBlobArenaRlp guard (totalLength > bytesRead → throw) are both exactly right.
The ValidatePersistedSnapshot + Environment.Exit for git-bisect compatibility is a pragmatic debug tool and is intentional per the commit message and code comment.

Verdict: not yet mergeable — one new High finding (Enqueue().Wait()) and two Medium findings (Parallel.ForEach without cancellation ×2, SnapshotCatalog enum cast) remain unaddressed.

claude · 2026-06-17T04:25:25Z

+    public void Enqueue(ArrayPoolList<StateId> batch)
+    {
+        EnsureStarted();
+        _compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait();


High: .Wait() on bounded-channel write — same anti-pattern as the prior review's High finding, now in a different location

_compactPersistedJobs is a BoundedChannel of capacity 16 (line 52). When it is full, WriteAsync returns an incomplete ValueTask; .AsTask().Wait() then synchronously blocks the current thread-pool thread until a slot opens.

Enqueue is called from PersistenceManager.ConvertCompactedRange / ConvertSingleBase, which runs inside FlatDbManager.NotifyWhenSlow → Task.Run. If the compactor falls behind and the channel fills (e.g., during a rapid burst of block conversions), this stalls one or more thread-pool threads indefinitely — exactly the thread-pool starvation scenario that the robustness rules prohibit.

The callers already have no CancellationToken to pass, but the real fix is to avoid blocking at all — either make Enqueue fire-and-forget by using TryWrite and logging on drop, or restructure the caller to await WriteAsync by propagating the CancellationToken:

Suggested change

_compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait();

public void Enqueue(ArrayPoolList<StateId> batch)

{

EnsureStarted();

// TryWrite is non-blocking; drop the batch (with a warning) if the compactor is behind

// rather than stalling the thread-pool thread that processes blocks.

if (!_compactPersistedJobs.Writer.TryWrite(batch))

{

if (_logger.IsWarn) _logger.Warn($"Persisted compactor job queue full — dropping batch of {batch.Count} states.");

batch.Dispose();

}

}

Alternatively, make the channel unbounded so the write always succeeds immediately and backpressure is handled by memory pressure instead.

claude · 2026-06-17T04:25:35Z

+        int arenaId = BinaryPrimitives.ReadInt32LittleEndian(span[80..]);
+        long offset = BinaryPrimitives.ReadInt64LittleEndian(span[84..]);
+        long size = BinaryPrimitives.ReadInt64LittleEndian(span[92..]);
+        SnapshotTier tier = (SnapshotTier)span[100];


Medium: Unchecked enum cast from on-disk byte — silent misclassification on corruption

(SnapshotTier)span[100] performs a direct cast without checking that the byte falls within the set of persisted-tier values {2, 3, 4, 5}. The catalog comment says it only ever stores persisted tiers, but a single corrupted byte (e.g., value 0 = InMemoryBase or 6 = undefined) silently produces a wrong classification:

InMemoryBase/InMemoryCompacted (0 or 1): IsPersisted() returns false, so the entry is added to the wrong bucket and later operations that expect a persisted snapshot get an in-memory tier handle.

A value ≥ 6: MetricTierLabel throws ArgumentOutOfRangeException at the next metric callsite, which is confusing to diagnose.

The format-version check (line 71) guards against schema changes but not per-entry field corruption. A small guard costs nothing and surfaces corruption immediately:

Suggested change

SnapshotTier tier = (SnapshotTier)span[100];

SnapshotTier tier = (SnapshotTier)span[100];

if (!tier.IsPersisted())

throw new InvalidOperationException(

$"Persisted snapshot catalog contains non-persisted tier value {(byte)span[100]} at entry. " +

"The persisted_snapshot/ directory may be corrupted — wipe and resync.");

claude · 2026-06-17T04:25:42Z


-        Snapshot? snapshotToPersist;
+            Parallel.ForEach(


Medium: Parallel.ForEach ignores shutdown cancellation (carry-over from prior review M2)

This Parallel.ForEach has no ParallelOptions and therefore ignores _cancelTokenSource (or whichever token guards node shutdown). If this loop is processing a large CompactSize-wide batch during shutdown, DisposeAsync in the enclosing FlatDbManager / compactor will block until every item in allStateIds has been converted.

Suggested change

Parallel.ForEach(

Parallel.ForEach(

allStateIds,

new ParallelOptions { CancellationToken = loader.CancellationToken },

state =>

{

(Assuming IPersistedSnapshotLoader exposes the token, or thread it from PersistenceManager's own CTS.) If the token isn't easily available here, at minimum document that this loop is intentionally uncooperative with cancellation and that shutdown may be delayed by up to one full CompactSize-wide batch.

claude · 2026-06-17T04:25:45Z

+        // Ascending bucket order: each sub-CompactSize layer's inputs (the previous layer's
+        // outputs) exist before it runs.
+        foreach (KeyValuePair<int, List<StateId>> kv in buckets)
+            Parallel.ForEach(kv.Value, state => DoCompactSnapshot(state));


Medium: Parallel.ForEach ignores shutdown cancellation

Same pattern as PersistenceManager.cs line 238 (prior review M2): the compactor's own _cancelTokenSource is not passed as ParallelOptions.CancellationToken. If shutdown fires while a bucket's state list is large (e.g., at a high compaction level), the loop runs to completion before DisposeAsync can drain.

Suggested change

Parallel.ForEach(kv.Value, state => DoCompactSnapshot(state));

Parallel.ForEach(kv.Value,

new ParallelOptions { CancellationToken = _cancelTokenSource.Token },

state => DoCompactSnapshot(state));

Address review feedback on the long-finality concurrency paths: - SnapshotCatalog.ReadEntry now rejects a non-persisted tier byte with the file's standard "wipe and resync" error instead of silently misclassifying a corrupt entry. - IPersistedSnapshotCompactor.Enqueue becomes EnqueueAsync: it awaits a free bounded-queue slot (backpressure without blocking a thread) and takes the producer's cancellation token rather than owning a CancellationTokenSource. Its background workers and in-flight Parallel.ForEach observe process-exit directly; graceful disposal completes and drains the channels in stage order. - PersistenceManager's persistence drain is now async (SemaphoreSlim mutex so it can await Enqueue under the lock) and its conversion Parallel.ForEach observes a process-exit-linked token; FlushToPersistence stays synchronous. - FlatDbManager.PersistIfNeeded awaits the now-async AddToPersistence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the test-only IArenaManager wrapper with the actual ArenaManager over a temporary directory, built via a new TestFixtureHelpers.CreateArenaManager factory that applies the same test config (page tracker disabled, arena size floored to one OS page). Fixtures own the temp dir and clean it up in TearDown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Halve PersistedSnapshotArenaPageCacheBytes from 8 GiB to 4 GiB (and the matching ConfigItem DefaultValue), bounding the mmap'd arena page cache and the tracker's own metadata (~17 MiB -> ~8.5 MiB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Resolve the conflict from #11967 (force-persist the head-reachable fork): our branch replaced DetermineSnapshotToPersist with the two-phase DetermineSnapshotAction, so the fix is ported onto our path rather than master's methods: - Keep master's SnapshotRepository committed-head infra (SetLastCommittedStateId/ GetLastCommittedStateId) and the FlatDbManager.AddSnapshot caller; drop master's now-superseded DetermineSnapshotToPersist/GetHeadAncestorAtBlockNumber/ GetFinalizedSnapshotAtBlockNumber/TryFindAncestorStateAtBlock. - DetermineSnapshotAction backstop and FlushToPersistence now seed from the committed head (GetLastCommittedStateId), which our ancestral FindSnapshotToPersist walk turns into "persist the head-reachable fork" — so a forced persist no longer follows a longer non-canonical fork. - Port master's two DetermineSnapshotToPersist fork tests to DetermineSnapshotAction; adapt master's two FlushToPersistence fork tests to our repo-owned snapshot model. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-17T08:40:13Z

Claude finished @asdacap's task in 5m 6s —— View job

Todo List:

Load rules files and gather context
Inspect current state of prior High finding: PersistedSnapshotCompactor.Enqueue .Wait()
Inspect carry-over Mediums: Parallel.ForEach without CancellationToken (×2)
Inspect new Medium: SnapshotCatalog.ReadEntry unchecked enum cast
Scan recently changed files for new issues
Post findings and inline comments

PR Re-Review (3rd pass): feat(FlatDB): long finality support with persisted snapshots

All findings from the prior two reviews are resolved. The cooperative-shutdown refactor (4cc76142) is particularly thorough — it addresses the High .Wait() finding, both Parallel.ForEach cancellation findings, and the catalog enum-cast issue in a single coherent commit.

Status of All Prior Findings

Review	#	Finding	Status
1st	H1	`EnableLongFinality=false` did not disable persisted code	✅ Fixed — `NullSnapshotCatalog` / `NullPersistedSnapshotLoader` / `NullPersistedSnapshotCompactor` wired in `FlatWorldStateModule`; `DetermineSnapshotAction` also gates Phase 2 on `_enableLongFinality`.
1st	H2	`.Wait()` on async channel write in `ProcessCompactBatch`	✅ Fixed — `await _boundaryCompactJobs.Writer.WriteAsync(boundary, cancellationToken)` at line 173.
1st	H3	No runtime guard in `ReadBlobArenaRlp`	✅ Fixed (prior re-review).
1st	M1	No-op LINQ `.Select()` in persistence hot path	✅ Fixed — plain `foreach` over `_trieNodesSortBuffer`.
1st	M2	`Parallel.ForEach` ignores `CancellationToken` in `PersistenceManager`	✅ Fixed — `new ParallelOptions { CancellationToken = _cts.Token }` at line 258.
1st	M3	`NodeRef.RlpDataOffset` signed overflow guard	✅ Fixed (prior re-review).
1st	M4	`NativeMemoryList<byte>` for scratch buffer	✅ Fixed — `stackalloc` (prior re-review).
2nd	H	`PersistedSnapshotCompactor.Enqueue` `.Wait()`	✅ Fixed — method is now `EnqueueAsync(ValueTask)` with proper `await` and caller-supplied `CancellationToken`.
2nd	M	`SnapshotCatalog.ReadEntry` unchecked enum cast	✅ Fixed — `SnapshotCatalog.cs:135-138` now throws `InvalidOperationException` with "wipe and resync" message on non-persisted tier byte.
2nd	M	`Parallel.ForEach` without cancellation in `PersistedSnapshotCompactor`	✅ Fixed — `new ParallelOptions { CancellationToken = cancellationToken }` at line 161.
2nd	M	`Parallel.ForEach` without cancellation in `PersistenceManager:238`	✅ Fixed (same commit as M2 above).
1st	L2	Commented-out warm-up block in `TryGetAddressBound`	✅ Fixed — properly implemented and documented.
1st	L3	`TryGetSnapshotFrom` O(n) iteration	✅ No longer applicable — method removed.
1st	L4	`BloomFilter.AlwaysTrue` direct `_leases` field write	✅ Replaced by a clean `new(capacity:1, bitsPerKey:1.0)` + `Fill(0xFF)` that doesn't bypass the API.

New Observations (No New Blockers)

The bb3bf8d5 commit (force-persist head-reachable fork) correctly replaces GetFirstSnapshotAtBlockNumber with a BFS-backed GetHeadAncestorAtBlockNumber, properly guarded by SetLastCommittedStateId / GetLastCommittedStateId (both under a Lock.Scope, no race). No issues found there.

The two remaining .Wait() calls in FlatDbManager (RunCompactJobSync().Wait() and _compactorJobs.Writer.WriteAsync().AsTask().Wait()) are both in AddSnapshot — a synchronous method called from the dedicated block-processing thread, not the thread pool. The AsTask().Wait() is an intentional backpressure mechanism with a timeout loop and a warn log, which is documented in the comment alongside it. Neither is a problem.

Verdict

Mergeable. All Critical/High/Medium findings from both prior reviews are addressed. The Null-object wiring, cooperative cancellation, and catalog corruption guard are all clean.

github-actions · 2026-06-17T09:03:00Z

EXPB Benchmark Comparison

Run: View workflow run

superblocks

Scenario: nethermind-flat-superblocks-flat-long-finality-delay0s

Client Processing (SSE)

Metric	PR	Master (cached)	Delta
AVG (ms)	1175.24	1171.09	+0.35%
MEDIAN (ms)	1056.0	1045.7	+0.98%
P90 (ms)	1727.1	1670.0	+3.42%
P95 (ms)	2053.2	1973.7	+4.03%
P99 (ms)	4426.4	4667.3	-5.16%
MIN (ms)	755.3	734.0	+2.90%
MAX (ms)	4426.4	4667.3	-5.16%

K6 TTFB

Metric	PR	Master (cached)	Delta
AVG (ms)	1483.83	1386.10	+7.05%
MEDIAN (ms)	1241.67	1241.88	-0.02%
P90 (ms)	1913.58	1700.56	+12.53%
P95 (ms)	2342.88	2125.31	+10.24%
P99 (ms)	5197.89	4789.64	+8.52%
MIN (ms)	875.56	848.04	+3.25%
MAX (ms)	9375.28	5406.95	+73.39%

realblocks

Scenario: nethermind-flat-realblocks-flat-long-finality-delay0s

Client Processing (SSE)

Metric	PR	Master (cached)	Delta
AVG (ms)	28.24	26.52	+6.49%
MEDIAN (ms)	24.3	23.2	+4.74%
P90 (ms)	46.3	43.6	+6.19%
P95 (ms)	53.9	50.3	+7.16%
P99 (ms)	106.8	101.2	+5.53%
MIN (ms)	0.4	0.4	+0.00%
MAX (ms)	201.2	224.9	-10.54%

K6 TTFB

Metric	PR	Master (cached)	Delta
AVG (ms)	32.39	30.36	+6.69%
MEDIAN (ms)	28.60	27.23	+5.03%
P90 (ms)	50.65	47.67	+6.25%
P95 (ms)	62.16	56.35	+10.31%
P99 (ms)	110.58	105.69	+4.63%
MIN (ms)	1.01	1.16	-12.93%
MAX (ms)	205.26	229.00	-10.37%

kamilchodola · 2026-06-17T09:14:10Z

@asdacap need to confirm above numbers but seems like this comes with a big performance hit?

kamilchodola · 2026-06-17T09:14:42Z

@claude any idea why expb shows such a per degradation? can it be expb changes or this PR have some inefficiencies?

claude · 2026-06-17T09:15:06Z

Claude finished @kamilchodola's task in 8m 34s —— View job

Analyzing EXPB performance regression

Todo List:

Load tools
Examine hot-path changes in FlatDbManager.cs, PersistenceManager.cs, SnapshotRepository.cs
Check what overhead exists when EnableLongFinality=true but persisted tier is idle
Examine ReadOnlySnapshotBundle, PersistedSnapshotStack, PersistedSnapshotBucket
Check benchmark scenario configuration (delay0s, what it compares)
Analyze and post findings

Performance Regression Analysis

Is this EXPB tooling or PR overhead?

PR overhead. The scenario flat-long-finality-delay0s ran on both the PR branch and master. On master, the config key EnableLongFinality doesn't exist, so Nethermind runs the standard flat-DB path. On this branch, EnableLongFinality=true is active — the persisted-snapshot tier is fully initialized (4 PersistedSnapshotBuckets, catalog, arena/blob managers) but no persisted snapshots have been written yet (the benchmark doesn't run enough blocks to fill MaxInMemoryBaseSnapshotCount = LongFinalityReorgDepth / CompactSize ≈ 2812).

The ~1% overhead on superblocks vs ~6% on realblocks is the diagnostic signal. If the overhead were proportional to state-access volume (trie reads, etc.), superblocks — which do far more per block — would show more overhead. Instead, the inverse relationship points to a fixed per-block or per-persist overhead that's amortized away by the ~1000ms superblock processing time but visible against the ~24ms realblock baseline.

Identified overhead sources

1. HasPersistedForkAt always runs on every persist (most likely main cause)

In SnapshotRepository.RemoveSiblingAndDescendents:

// Master:
if (!HasForkAt(canonicalBlock)) return;

// Branch:
if (!HasForkAt(canonicalBlock) && !HasPersistedForkAt(canonicalStateId)) return;

In normal operation (no reorgs), HasForkAt always returns false, which makes HasPersistedForkAt always execute. That method:

Allocates a HashSet<StateId> (heap allocation)
Calls CollectRange on all 4 persisted buckets, each acquiring a Lock and doing a SortedSet.GetViewBetween
Allocates an ArrayPoolList

This runs every ~CompactSize blocks (every persistence event, triggered by Phase 1). Similarly, GetLastSnapshotId (also called inside RemoveSiblingAndDescendents) now acquires 4 extra Locks (one per PersistedSnapshotBucket.Max) even when all buckets are empty.

Quick fix: short-circuit in HasPersistedForkAt when no persisted snapshots exist:

private bool HasPersistedForkAt(in StateId canonicalStateId)
{
    if (PersistedSnapshotCount == 0) return false; // nothing to check
    ...
}

2. PersistedSnapshotStack.TryLoadStateRlp / TryLoadStorageRlp compute bloom keys unconditionally

// TryLoadStateRlp — no Count guard:
ulong statePathBloomKey = PersistedSnapshotBloomBuilder.StatePathKey(in path);  // always runs
for (int i = _snapshots.Count - 1; i >= 0; i--) { ... }  // never executes when empty

// TryLoadStorageRlp — no Count guard:
ValueHash256 addressHash = address.ValueHash256;
ulong storageBloomKey = PersistedSnapshotBloomBuilder.StorageNodeKey(in addressHash, in path);  // always runs

Compare with TryGetAccount / TryGetSlot which correctly guard with if (_snapshots.Count > 0). StatePathKey does a stackalloc byte[33] + path encoding + 4 XOR ops on every trie-node RLP load from the flat DB. Unlike TryGetAccount, this runs even when Count == 0.

Quick fix: add an early return at the top of both methods:

if (_snapshots.Count == 0) { rlp = null; return false; }

3. RemoveAndReleaseInMemoryKnownState (base tier) reads SortedSet.Max on every removal

// Branch (every base-snapshot removal):
StateId? newMax;
using (_sortedSnapshotStateIds.EnterWriteLock(out SortedSet<StateId> sortedSnapshots))
{
    sortedSnapshots.Remove(stateId);
    newMax = sortedSnapshots.Count == 0 ? null : sortedSnapshots.Max;  // O(log n) — always
}
using (_lastRegisteredLock.EnterScope())
    if (_lastRegisteredState == stateId) _lastRegisteredState = newMax;

Master's RemoveAndReleaseKnownState just called sortedSnapshots.Remove(stateId) with no .Max query. Each persistence event removes CompactSize snapshots, so this runs 32 × O(log 256) ≈ 200 extra tree ops per persist cycle, plus 32 extra _lastRegisteredLock acquisitions.

4. WalkAndAssemble BFS tries 6 tier-edges per node instead of 2

The new shared BFS tries all 6 tiers (PersistedLargeCompacted, PersistedCompactSized, InMemoryCompacted, InMemoryBase, PersistedSmallCompacted, PersistedBase) per node, each doing a ConcurrentDictionary.TryGetValue. When the persisted tier is empty, 4 of those 6 calls return immediately, but at 1800 bundle builds/sec the extra calls add up. Impact is bounded by the bundle cache (cache hits skip the BFS entirely), but after ClearReadOnlyBundleCache calls the BFS runs more.

Is the overhead acceptable?

If the EXPB scenario is intentionally testing EnableLongFinality=true (i.e., the overhead is the cost of running the feature even before any persisted snapshots are written), then the ~6% realblock overhead from fixed bookkeeping work on empty data structures is likely reducible to near-zero with the guards above. Items 1 and 2 are zero-work paths that can be made truly free with a single branch.

The author's own benchmark showed no overhead "at 300 blocks" vs the feature disabled, but that compared PR-enabled vs PR-disabled. The EXPB comparison is PR-enabled vs master's simpler code — which is why the overhead appears here but wasn't visible in the in-PR benchmark.

Recommended before merge: apply the two guard fixes (items 1 and 2) and re-run the EXPB benchmark to confirm the realblock regression closes. Items 3 and 4 are lower priority but addressable.

…mpty ReadOnlySnapshotBundle caches the persisted-snapshot count once (immutable for the bundle's life) and gates every persisted-tier probe — GetAccount, GetSlot, DetermineSelfDestructSnapshotIdx, TryLoadStateRlp, TryLoadStorageRlp — on it being > 0. When no persisted snapshots exist (long finality disabled, or none persisted yet) the read path skips the persisted query entirely instead of calling into an empty PersistedSnapshotStack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Split the force-persist backstop depth into two configs: MaxReorgDepth (restored to the non-long-finality default of 256) and a new LongFinalityMaxReorgDepth (default 90000). PersistenceManager selects the backstop at construction — LongFinalityMaxReorgDepth when EnableLongFinality is on (the persisted tier serves deep reorgs), otherwise MaxReorgDepth — so a non-long-finality node force-persists much sooner instead of holding up to 90000 in-memory snapshots when finality stalls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Default the hierarchical persisted-layer compaction ceiling from 8192 to 1048576 (1Mi) blocks; keep the ConfigItem DefaultValue in sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Log a Warn when the reorg-depth backstop (LongFinalityMaxReorgDepth / MaxReorgDepth) is exceeded and a persist is actually forced — restoring visibility lost when DetermineSnapshotToPersist was replaced by DetermineSnapshotAction. The warning is gated on the persist actually happening, so it does not fire when the backstop seed finds no candidate and the call falls through to a Phase 2 persisted-snapshot conversion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fallback Remove the flat/long-finality-only LastRegisteredState mechanism (field, lock, property, AddStateId setter, tip-removal reset, interface member, and its test) and align the DetermineSnapshotAction backstop fallback with master: GetLastCommittedStateId() ?? GetLastSnapshotId() ?? latestSnapshot. The committed-head primary handles the common case, so only the rarely-hit fallback changes — from call-order (LastRegisteredState) to the longest-chain tip (GetLastSnapshotId). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Compute the backstop depth as Max(configured backstop, MinReorgDepth + CompactSize) so MinReorgDepth can be configured at or above MaxReorgDepth/LongFinalityMaxReorgDepth without the finalized-persistence trigger and the backstop colliding at the same depth. The CompactSize margin gives the finalized trigger room to persist before the backstop force-persists. Normal configs (MinReorgDepth far below the backstop) are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot added database dependencies Pull requests that update a dependency file docs new feature performance is good refactoring state+storage sync trie labels May 18, 2026

claude Bot reviewed May 19, 2026

View reviewed changes

asdacap commented May 19, 2026

View reviewed changes

github-actions Bot removed the sync label May 27, 2026

asdacap and others added 3 commits May 27, 2026 15:48

Merge branch 'reduce-btree-code-size' into flat/long-finality

8696843

claude Bot reviewed Jun 17, 2026

View reviewed changes

asdacap and others added 4 commits June 17, 2026 13:33

asdacap marked this pull request as ready for review June 17, 2026 08:39

asdacap requested review from LukaszRozmej, MarekM25, benaadams, damian-orzechowski, flcl42 and rubo as code owners June 17, 2026 08:39

kamilchodola added performance is good and removed performance is good labels Jun 17, 2026

asdacap and others added 6 commits June 17, 2026 17:42

config(flat): raise PersistedSnapshotMaxCompactSize to 1Mi blocks

a55e0c5

Default the hierarchical persisted-layer compaction ceiling from 8192 to 1048576 (1Mi) blocks; keep the ConfigItem DefaultValue in sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

	_boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait();
	foreach (StateId boundary in boundaries)
	await _boundaryCompactJobs.Writer.WriteAsync(boundary, cancellationToken);

-        Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]);
+        int totalLength = ctx.PeekNextRlpLength();
+        if (totalLength > bytesRead)
+        {
+            // Re-read with an exact-size buffer. Should never happen for valid trie nodes,
+            // but guards against unexpected node formats or partial reads at arena boundaries.
+            byte[] full = new byte[totalLength];
+            buf[..bytesRead].CopyTo(full);
+            file.RandomRead(offset + bytesRead, full.AsSpan(bytesRead));
+            return full;
+        }
+        byte[] result = new byte[totalLength];
+        buf[..totalLength].CopyTo(result);
+        return result;

	foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v))
	foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)

		@@ -296,7 +571,7 @@ internal void PersistSnapshot(Snapshot snapshot)

	using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);
	Span<byte> buf = stackalloc byte[MaxTrieNodeRlpBytes];

-                Parallel.ForEach(allStateIds, state =>
+                Parallel.ForEach(allStateIds,
+                    new ParallelOptions { CancellationToken = _cancelTokenSource.Token },
+                    state =>
+                    {
+                        if (_snapshotRepository.TryLeaseState(state, out Snapshot? snapshot))
+                        {
+                            long sw = Stopwatch.GetTimestamp();
+                            _smallRepo.ConvertSnapshotToPersistedSnapshot(snapshot);
+                            _persistedSnapshotConvertTime.WithLabels("base").Observe(Stopwatch.GetTimestamp() - sw);
+                            snapshot.Dispose();
+                        }
+                    });

-        _compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait();
+    public void Enqueue(ArrayPoolList<StateId> batch)
+    {
+        EnsureStarted();
+        // TryWrite is non-blocking; drop the batch (with a warning) if the compactor is behind
+        // rather than stalling the thread-pool thread that processes blocks.
+        if (!_compactPersistedJobs.Writer.TryWrite(batch))
+        {
+            if (_logger.IsWarn) _logger.Warn($"Persisted compactor job queue full — dropping batch of {batch.Count} states.");
+            batch.Dispose();
+        }
+    }

-        SnapshotTier tier = (SnapshotTier)span[100];
+        SnapshotTier tier = (SnapshotTier)span[100];
+        if (!tier.IsPersisted())
+            throw new InvalidOperationException(
+                $"Persisted snapshot catalog contains non-persisted tier value {(byte)span[100]} at entry. " +
+                "The persisted_snapshot/ directory may be corrupted — wipe and resync.");

Conversation

asdacap commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

High-level design

Mirrors the in-memory Snapshot pipeline

In-memory bloom filters skip cold persisted snapshots

Custom directly-indexable on-disk format (HSST)

Trie RLP split out into separate blob arenas

Two-tier persisted snapshots

HSST — Hierarchical Static Sorted Table

Read-path engineering

Lifecycle / GC

Config surface

Observability

Performance

Compatibility

Scope

Review aids

Types of changes

Testing

Requires testing

If yes, did you write tests?

Notes on testing

Documentation

Requires documentation update

Requires explanation in Release Notes

Remarks

Uh oh!

claude Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: feat(FlatDB): long finality support with persisted snapshots

Overview

Critical / High

1. EnableLongFinality = false does not actually disable persisted-snapshot code [High]

2. .Wait() on async channel write in ProcessCompactBatch [High]

3. No runtime guard in ReadBlobArenaRlp when totalLength > bytesRead [High]

Medium

Low

Positive notes

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

asdacap May 19, 2026

Choose a reason for hiding this comment

Uh oh!

asdacap May 19, 2026

Choose a reason for hiding this comment

Uh oh!

asdacap May 19, 2026

Choose a reason for hiding this comment

asdacap commented May 18, 2026 •

edited

Loading

Mirrors the in-memory `Snapshot` pipeline

claude Bot commented May 18, 2026 •

edited

Loading

1. `EnableLongFinality = false` does not actually disable persisted-snapshot code [High]

2. `.Wait()` on async channel write in `ProcessCompactBatch` [High]

3. No runtime guard in `ReadBlobArenaRlp` when `totalLength > bytesRead` [High]

claude Bot commented Jun 17, 2026 •

edited

Loading

High — `PersistedSnapshotCompactor.Enqueue` line 65 → Fix this

Medium — `SnapshotCatalog.ReadEntry` line 134

Medium — `Parallel.ForEach` in `PersistedSnapshotCompactor.cs:145`

claude Bot commented Jun 17, 2026 •

edited

Loading

claude Bot commented Jun 17, 2026 •

edited

Loading