Skip to content

feat(FlatDB): long finality support with persisted snapshots#11663

Open
asdacap wants to merge 758 commits into
masterfrom
flat/long-finality
Open

feat(FlatDB): long finality support with persisted snapshots#11663
asdacap wants to merge 758 commits into
masterfrom
flat/long-finality

Conversation

@asdacap

@asdacap asdacap commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a long-finality storage tier to Nethermind.State.Flat so the in-memory
snapshot ring (today bounded by MaxInMemoryReorgDepth, default 256) can be
extended to ~90,000 blocks (configurable via LongFinalityReorgDepth) without
the per-snapshot RAM cost growing linearly. Snapshots that fall off the
in-memory tail are converted into immutable, mmap-backed persisted
snapshots
stored as HSST blobs in arena files, then logarithmically merged
("Linked" compaction) so the snapshot count stays bounded.

The new tier is opt-in (FlatDb.EnableLongFinality = false by default) and
sits underneath the existing in-memory snapshot ring; the persistence
pipeline, RPC/sync surface and consensus paths are unchanged.

Motivation

  • Today's flat DB caps reorg depth at MaxInMemoryReorgDepth (~256). Every
    in-memory snapshot keeps its diff in RAM, so raising that ceiling is
    proportionally expensive in heap.
  • Use cases that need to serve historical-tip queries or replay across a
    large reorg window (long-finality CL designs, archive-lite serving)
    need a way to keep many more snapshots cheap to retain and cheap to
    read.
  • The persisted tier is an mmap-backed columnar format: paged access on
    the read path means cold snapshots cost only file-descriptor + index
    metadata until queried, and a PageResidencyTracker per arena drives
    madvise(DONTNEED) so the resident set is bounded by a configured
    page-cache budget rather than by total snapshot bytes on disk.

High-level design

Mirrors the in-memory Snapshot pipeline

A PersistedSnapshot is the on-disk twin of an in-memory Snapshot:
same (From, To) semantics, same StateId-keyed lookups, same
hierarchical compaction policy. The in-memory compactor in
SnapshotCompactor keys merge decisions off the lowest-set-bit of the
block number (blockNumber & -blockNumber), giving 1×, 2×, 4×, …,
CompactSize× layers; the persisted-tier compactor uses the same
lowest-set-bit rule (PersistenceManager.cs:118-128) so the
persisted-snapshot stack is just a continuation of the in-memory
snapshot stack across the tier boundary. As a result the number of
live persisted-snapshot layers is logarithmic in the configured
reorg depth
rather than linear, with the same merge bookkeeping the
in-memory pipeline already implements. Code paths for reading
(GetFinalizedSnapshotAtBlockNumber falls through from in-memory
Snapshot to PersistedSnapshot at the same block/root if the
in-memory window has rotated past it), eviction, and compaction all
follow this mirror.

In-memory bloom filters skip cold persisted snapshots

Each persisted snapshot ships with a tiny in-RAM bloom filter
(PersistedSnapshotBloom, PersistedSnapshotBloomBitsPerKey
default 14.0 bpk). A single filter per snapshot covers
address/slot/self-destruct keys plus state-trie and storage-trie node
paths. Lookups iterate the snapshot stack from newest to oldest;
snapshots whose bloom says "definitely not here" are skipped without
touching the mmap'd HSST at all. This keeps the cost of walking a
deep snapshot stack proportional to the number of snapshots that
actually hold the key, not the total stack depth — important because
the long-finality stack is, by design, much deeper than today's
in-memory ring.

Custom directly-indexable on-disk format (HSST)

Persisted snapshots are not RocksDB-shaped. They use a purpose-built
columnar format (HSST)
designed so that a single HSST blob can be point-queried by key
without loading the whole blob
: readers walk an embedded B-tree
index from the tail and pull only the pages they need. The format
also supports nesting (a value can itself be an HSST), which is what
makes the per-address / per-addressHash inner tables possible inside
a single outer snapshot file. Multiple specialised index variants
(B-tree, packed-array, dense-byte-index, two-byte-slot-value, …) are
picked per column by the writer to minimise blob size and seek cost;
the index-type byte at the blob tail selects the variant. Combined
with mmap + PageResidencyTracker, a cold snapshot costs only
file-descriptor + a few trailer pages of memory until something
actually queries it.

Trie RLP split out into separate blob arenas

The single biggest size win comes from not storing trie-node RLP
bytes inside the persisted-snapshot HSST. Each persisted snapshot's
HSST only stores 6-byte NodeRefs (BlobArenaId: u16,
RlpDataOffset: i32) where the actual RLP would have lived; the RLP
bytes themselves live in append-only blob arena files addressed by
those refs. In practice this shrinks the per-snapshot HSST by roughly
20× vs. storing RLP inline, because the trie-node payload
dominates the snapshot otherwise.

This split has knock-on benefits:

  • The HSST is small enough that compaction can rewrite it cheaply
    without ever rewriting the RLP. Compacted snapshots' NodeRefs
    still point at the original blob arena files.
  • Blob arena files are refcounted per file: every HSST that
    references them holds a lease, and a file is only deleted when no
    live snapshot still references its BlobArenaId. The
    NodeRef-based deduplication that comes for free lets long-lived
    trie nodes survive arbitrarily many snapshot compactions without
    being copied.
  • 16-bit BlobArenaId × 2 GiB-per-file (32-bit offset) caps each tier
    at ~128 TiB of trie RLP.

Two-tier persisted snapshots

  • Small tier — short-range snapshots whose (To - From) < CompactSize.
    This is the tier whose HSSTs are the small, mmap-backed, seekable
    index files driven by PageResidencyTracker. Page-cache budget:
    PersistedSnapshotSmallArenaPageCacheBytes (default 1 GiB resident).
  • Large tier — snapshots merged from ≥ CompactSize source
    snapshots. Same HSST shape, larger files; page-cache budget:
    PersistedSnapshotLargeArenaPageCacheBytes (default 8 GiB resident).
  • Hierarchical compactor merges adjacent snapshots into larger ones
    with (To - From) doubling at each level, bounded by
    PersistedSnapshotMaxCompactSize (default 1024) and capped at
    PersistedSnapshotMaxCompactedSourceBytes source bytes (default 2
    GiB) to keep merged HSSTs below int.MaxValue and the arena
    ceiling. Compaction only rewrites HSSTs; blob arenas are untouched
    and their RLP is inherited through the NodeRef indirection.
  • HSST arena files (ArenaFileSizeBytes, default 1 GiB) are
    append-only; reservations are tagged for per-tag count/byte metrics.

HSST — Hierarchical Static Sorted Table

A new self-describing immutable on-disk format (Nethermind.State.Flat/Hsst/).
Top-level layout and rationale are documented in
Hsst/FORMAT.md;
the columnar persisted-snapshot vocabulary lives in
PersistedSnapshotTags.cs.

Index variants implemented (the trailing index-type byte selects):

  • 0x01 BTree — variable-length values, value-first entries, tail B-tree
    index. Optional 4 KiB leading padding lets the merger 4 KiB-align hot
    blobs without rebuilding the entry.
  • 0x02 PackedArray — fixed key/value array with a recursive summary
    index and an optional hash table; SIMD floor-scan over summary levels.
  • 0x04 DenseByteIndex — single-byte-keyed map indexed directly by tag.
    Used for per-address sub-tag walks (one pinned trailer read resolves
    any sub-tag).
  • 0x05/0x06 TwoByteSlotValue / TwoByteSlotValueLarge — fixed
    2-byte key map for slot-suffix layer (u16 / u24 offsets).
  • 0x07 BTreeKeyFirst — key-first entry layout for outer columns
    whose values are large nested HSSTs (parallel to the inner HSST's
    keys-first index).

Columnar layout (outer HSST → per-key inner HSSTs):

  • 0x00 Metadata
  • 0x01 Address (raw 20 bytes) → {SlotSubTag, AccountSubTag, SelfDestructSubTag}
  • 0x02 AddressHash → {StorageTopSubTag, StorageCompactSubTag, StorageFallbackSubTag}
  • 0x03 StateNode (compact path, 6-15 bytes)
  • 0x05 StateTopNode (path 0-5)
  • 0x06 StateNodeFallback (path 16+)

Read-path engineering

  • WholeReadSession over mmap'd HSST arenas, routed through a
    tracker-aware ArenaByteReader so every read participates in
    PageResidencyTracker clock-eviction accounting.
  • Shared bloom-filter manager across tiers (filters described in the
    bloom subsection above); bloom memory exposed via gauge metrics.
  • Direct-mapped per-snapshot address-bound cache (Vector512<long>
    inline set) cuts repeated outer-HSST descents on hot addresses.

Lifecycle / GC

  • On Snapshot → PersistedSnapshot conversion, trie nodes are marked
    persisted and pruned from the in-memory trie.
  • PageResidencyTracker per arena drives madvise(DONTNEED) and
    (opt-in via PersistedSnapshotFadviseOnPageEviction)
    posix_fadvise(POSIX_FADV_DONTNEED). Memory is reported to the
    GC and exposed as per-tier gauge metrics.
  • LongFinalityReorgDepth is the hard tail — once exceeded, the
    oldest HSST snapshot is force-persisted into the RocksDB tier.

Config surface

All new keys on IFlatDbConfig (default-off feature, conservative defaults):

Key Default Purpose
EnableLongFinality false Feature gate.
LongFinalityReorgDepth 90000 Total in-memory + persisted snapshot count.
PersistedSnapshotPath "snapshots" Arena directory under data dir.
ArenaFileSizeBytes 1 GiB Arena file ceiling.
PersistedSnapshotSmallArenaPageCacheBytes 1 GiB Resident-page budget, small tier.
PersistedSnapshotLargeArenaPageCacheBytes 8 GiB Resident-page budget, large tier.
PersistedSnapshotFadviseOnPageEviction false Benchmark-only: also fadvise on tracker eviction.
PersistedSnapshotMaxCompactSize 1024 Hierarchical-compactor ceiling ((To-From) doubling stops here).
PersistedSnapshotMaxCompactedSourceBytes 2 GiB Per-merge source-byte cap (keeps merged HSST below int.MaxValue).
PersistedSnapshotBloomBitsPerKey 14.0 Per-snapshot bloom precision (0 = disabled).
ValidatePersistedSnapshot false Debug-only: validate persisted vs in-memory after conversion.

Observability

  • Per-tier gauges: arena-file count, mapped bytes, page-tracker memory,
    bloom memory, per-tag reservation count/bytes.
  • Per-tier PageResidencyTracker eviction counters.

Performance

image
  • Graph is default, min reorg 300, 4k, 16k, 90k. 16k and 90k have longer run so that the reorg depth is reached.
  • Memory systemd limited to 48GB.
  • Note, for 16K around 4.5GB is neded for the snapshots and 25GB for the 90K. This means they are largely in RAM except for the trie rlps.

Notes:

  • The 300-block point is intentionally inside today's in-memory
    reorg window, so the persisted tier is effectively idle — confirms
    the feature flag adds no measurable overhead when not exercised.
  • The 4k / 16k plateau (~20% on both metrics) reflects the steady
    cost of walking the deeper snapshot stack for reads plus the
    background work of bloom construction, HSST writes, and
    hierarchical compaction.
  • The 90k step drops blocks/sec more than MGas/sec, suggesting the
    fixed per-block overhead (lookups, bloom checks, HSST descent) is
    the dominant cost at very long stacks rather than per-gas state
    access cost.
  • The practical ceiling on LongFinalityReorgDepth is more likely
    to be the blob-arena address space than CPU: per tier the
    NodeRef encoding caps total trie-RLP storage at
    ushort.MaxValue files × 2 GiB ≈ 128 TiB, and the small-tier
    blob arena fills proportionally to the snapshot count rather than
    to compaction.

Compatibility

  • Feature is disabled by default; with EnableLongFinality=false
    none of the persisted-snapshot code runs. RocksDB column-family
    layout is preserved (two retained empty column ordinals documented
    in FlatDbColumns.cs).
  • On-disk format version (PersistedSnapshotTags.MetadataFormatVersion)
    is 0x02. The SnapshotCatalog v2→v3 schema bump trips a
    "wipe and resync" error if pointed at an older directory.
  • No protocol/JSON-RPC surface changes. No consensus-path changes.

Scope

  • 423 commits, 166 files (+29,595 / -341).
  • Touched projects: Nethermind.State.Flat (104), Nethermind.State.Flat.Test (42),
    Nethermind.Core (4 — NativeMemoryList, Leb128), Nethermind.Benchmark (4),
    Nethermind.Trie/Db/Init/Runner (small wiring).
  • Commit-prefix breakdown: 163 refactor, 132 perf, 38 feat,
    36 fix, 13 test, 4 docs, 3 revert, 3 config.

Review aids

Suggested reading order if reviewing top-down:

  1. Hsst/FORMAT.md — wire format.
  2. NodeRef.cs +
    Storage/BlobArenaManager.cs
    the trie-RLP indirection that makes the HSSTs small.
  3. PersistedSnapshotTags.cs — columnar vocabulary.
  4. PersistedSnapshot.cs, PersistedSnapshotReader.cs — read path.
  5. PersistedSnapshotBuilder.cs, PersistedSnapshotMerger.cs,
    PersistedSnapshotCompactor.cs — write/compaction path.
  6. Storage/ArenaManager.cs, Storage/PageResidencyTracker.cs,
    Storage/WholeReadSession.cs — HSST arena lifecycle.
  7. PersistenceManager.cs, FlatWorldStateModule.cs — wiring into
    the existing flat persistence pipeline.

Types of changes

  • Bugfix
  • New feature
  • Breaking change
  • Optimization
  • Refactoring
  • Documentation update (HSST format spec, columnar vocabulary docs)
  • Build-related changes
  • Other

Testing

Requires testing

  • Yes

If yes, did you write tests?

  • Yes

Notes on testing

  • 42 new test files in Nethermind.State.Flat.Test cover HSST round-trips
    (all index variants), persisted-snapshot build/read/merge/compact paths,
    page-tracker eviction, bloom-filter correctness, and N-way merge cursor.
  • ValidatePersistedSnapshot=true cross-checks every persisted snapshot
    against the source in-memory snapshot at conversion time; intended for
    diagnostic runs but available for review.
  • Benchmarks: HsstReaderBenchmark, PersistedSnapshotBenchmark,
    PersistedSnapshotCompactBenchmark in Nethermind.Benchmark/State/.
  • Recommended local validation: enable EnableLongFinality=true,
    ValidatePersistedSnapshot=true, sync mainnet from snapshot for at
    least a window > MaxInMemoryReorgDepth and confirm:
    • persisted-snapshot count grows then plateaus once compaction kicks in,
    • PageResidencyTracker resident bytes track the configured budget,
    • no Invalid Block / Exception lines in logs.

Documentation

Requires documentation update

  • Yes
  • No (new format spec ships in-tree as Hsst/FORMAT.md; user-facing
    docs page can come in a follow-up once the feature flips default)

Requires explanation in Release Notes

  • Yes — "FlatDB: opt-in long-finality persisted-snapshot tier
    (FlatDb.EnableLongFinality). See config keys above."

Remarks

  • This PR is intentionally large because the persisted-snapshot tier
    is meaningful only end-to-end (format + write path + read path +
    compactor + lifecycle + observability). Splitting it further would
    introduce dead, unwired code into master between PRs.
  • All commits are on origin (flat/long-finality) and the branch is
    fast-forward-clean against master.

@claude

claude Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Claude finished @asdacap's task in 8m 56s —— View job


PR Review: feat(FlatDB): long finality support with persisted snapshots

Todo List:

  • Read repository rules and guidelines
  • Analyze core HSST format and data structures
  • Review write path (builder, merger, compactor)
  • Review read path and memory management
  • Review persistence manager and wiring
  • Review safety, security, and robustness
  • Review tests coverage
  • Compile findings and post summary

Overview

This is an impressively engineered PR: a custom mmap-backed columnar format (HSST), a logarithmic compaction hierarchy, per-snapshot bloom filters, a clock-based page-residency tracker, and a blob-arena indirection layer to keep HSST sizes small. The HSST format spec (FORMAT.md) is thorough and the unsafe code is carefully justified throughout. Test coverage (42 new files, ValidatePersistedSnapshot cross-check mode) is solid.

The three findings below block merge in my assessment. The rest are improvements that should be made before or shortly after landing.


Critical / High

1. EnableLongFinality = false does not actually disable persisted-snapshot code [High]

EnableLongFinality is declared in IFlatDbConfig and set to false in FlatDbConfig, but it is never read in any logic pathPersistenceManager, FlatDbManager, and FlatWorldStateModule do not reference it.

Consequences:

  • FlatWorldStateModule.PerTierState always constructs real ArenaManager / BlobArenaManager / PersistedSnapshotRepository instances, regardless of the flag. Arena directories are created on disk unconditionally.
  • PersistenceManager.DetermineSnapshotAction can return a snapshotLevelToConvert and trigger ConvertSnapshotToPersistedSnapshot (writing HSST files to disk) whenever the in-memory snapshot window exceeds MaxInMemoryReorgDepth + CompactSize blocks with stalled finality — even with EnableLongFinality = false.

The PR description states: "with EnableLongFinality=false none of the persisted-snapshot code runs" — this is not currently true. Either wire NullPersistedSnapshotRepository when the flag is false, or gate DetermineSnapshotAction's conversion path on the flag, or update the documentation to reflect actual behaviour. See inline comment on IFlatDbConfig.cs:53

2. .Wait() on async channel write in ProcessCompactBatch [High]

_boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait() (PersistenceManager.cs:137) synchronously blocks a thread-pool thread while the bounded channel (capacity 16) is full. This violates the robustness rule against .Wait() inside async contexts and can cause thread-pool starvation. ProcessCompactBatch should be made async Task and the write should be awaited with the existing CancellationToken. See inline comment

3. No runtime guard in ReadBlobArenaRlp when totalLength > bytesRead [High]

PersistedSnapshot.ReadBlobArenaRlp (PersistedSnapshot.cs:524-535) does buf[..totalLength].CopyTo(result) without checking that totalLength <= bytesRead. If the RLP header yields a length larger than the 568-byte read buffer (or if a short read occurs), this either throws ArgumentOutOfRangeException or silently copies uninitialized bytes into the returned node RLP — both are worse outcomes than a clear error. The 568-byte estimate is sound for standard MPT nodes, but should be guarded at runtime. See inline comment


Medium

M1. No-op LINQ .Select() in PersistSnapshot hot pathPersistenceManager.cs:545 and :571 both call .Select(v => ((Hash256, TreePath))v) on a List<(Hash256, TreePath)>, which is an identity cast that allocates an iterator object per persistence call. Replace with a plain foreach. Inline comments at L545 and L571

M2. Parallel.ForEach ignores CancellationTokenPersistenceManager.cs:382. The snapshot-conversion Parallel.ForEach does not pass a CancellationToken, so it cannot be interrupted during shutdown. Pass new ParallelOptions { CancellationToken = _cancelTokenSource.Token }. Inline comment

M3. NodeRef.RlpDataOffset signed int — no overflow guard in writer — The signed 32-bit offset caps blob-arena files at 2 GiB by design, but if ArenaFileSizeBytes is misconfigured above 2 GiB (or a future refactor changes the cap), the offset silently wraps to negative, silently corrupting all subsequent trie-node reads from that file. Add an explicit guard in BlobArenaWriter before the offset is committed. Inline comment

M4. NativeMemoryList<byte> for a 568-byte scratch bufferPersistedSnapshot.ReadBlobArenaRlp allocates an off-heap buffer via NativeMemory.Alloc + NativeMemory.Free on every trie-node RLP read. A stackalloc byte[MaxTrieNodeRlpBytes] eliminates the two native allocation calls for this fixed-size case. Inline comment


Low

L1. LongFinalityReorgDepth config description missing unit — "Total max reorg depth" should state "in blocks". Inline comment

L2. Commented-out code block in TryGetAddressBound — Lines 284–291 of PersistedSnapshot.cs are a commented-out warm-up block. Either add a // TODO: with a tracking issue or remove. Inline comment

L3. TryGetSnapshotFrom is O(n) over all persisted snapshotsPersistedSnapshotRepository.cs:300-315 iterates both ConcurrentDictionarys to find by From key. Only reached in the exceptional force-persist path (TryGetForcePersistedSnapshot), so the practical impact is bounded, but keying a secondary index by From would be cleaner at this scale.

L4. PersistedSnapshotBloom.CreateAlwaysTrue direct field writesentinel._leases.Value = long.MaxValue / 2 bypasses the normal lease API to make a sentinel immortal. A dedicated constructor parameter or a static factory flag would be safer and clearer.


Positive notes

  • The HSST format is well-designed for point-queries on cold mmapped files — value-first layout, B-tree tail index, per-index-type dispatch, all coherent.
  • The PageResidencyTracker 8-way set-associative clock cache with lock-free hot path and 1-bit spinlock on the miss path is a textbook-quality implementation.
  • The per-snapshot inline address-bound cache (Vector512<long> reinterpreted as 8 long slots) is clever and saves the outer HSST descent for repeated hot-address lookups.
  • The refcount-based ArenaFile / BlobArenaFile lifecycle, PersistOnShutdown flag, and SweepUnreferenced orphan cleanup collectively make the shutdown/restart correctness story solid.
  • 42 test files with round-trip, merger, compactor, and bloom correctness coverage is excellent for a format this new.

Parallel.ForEach(kv.Value, state => _smallCompactor.DoCompactSnapshot(state));

foreach (StateId boundary in boundaries)
_boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: Blocking .Wait() on async channel write — potential thread-pool stall

WriteAsync on a bounded Channel (capacity 16) returns an incomplete ValueTask when the channel is full. .AsTask().Wait() then synchronously blocks the current thread-pool thread — exactly the pattern the robustness rules prohibit (Never call .Wait() inside an async method).

ProcessCompactBatch runs inside RunPersistedCompactor's await foreach, so the current thread is a pool thread. If more than 16 boundary-block states accumulate in one batch, this thread stalls until a boundary compactor drains a slot, preventing other pool tasks from running.

Suggested change
_boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait();
foreach (StateId boundary in boundaries)
await _boundaryCompactJobs.Writer.WriteAsync(boundary, cancellationToken);

ProcessCompactBatch needs to become async Task ProcessCompactBatch(ArrayPoolList<StateId> batch, CancellationToken ct) and the call site in RunPersistedCompactor needs the corresponding await.

using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);
Span<byte> buf = rented.AsSpan();
int bytesRead = file.RandomRead(offset, buf);
Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: No runtime guard on totalLength > MaxTrieNodeRlpBytes

ctx.PeekNextRlpLength() returns the full RLP length decoded from the header. If totalLength > 568 (or totalLength > bytesRead on a short read), then buf[..totalLength] throws ArgumentOutOfRangeException, or copies zero-filled tail bytes into result when totalLength fits in the buffer but the read was short.

The 568-byte cap is a reasonable estimate for Ethereum's worst-case branch node (~532 bytes), but there is no runtime assertion guarding the invariant. A defensive check here costs one compare on the cold path:

Suggested change
Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]);
int totalLength = ctx.PeekNextRlpLength();
if (totalLength > bytesRead)
{
// Re-read with an exact-size buffer. Should never happen for valid trie nodes,
// but guards against unexpected node formats or partial reads at arena boundaries.
byte[] full = new byte[totalLength];
buf[..bytesRead].CopyTo(full);
file.RandomRead(offset + bytesRead, full.AsSpan(bytesRead));
return full;
}
byte[] result = new byte[totalLength];
buf[..totalLength].CopyTo(result);
return result;

bool VerifyWithTrie { get; set; }

[ConfigItem(Description = "Enable long finality support with persisted snapshots", DefaultValue = "false")]
bool EnableLongFinality { get; set; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: EnableLongFinality flag is never read in any logic path

grep -rn "EnableLongFinality" across the Nethermind.State.Flat and Nethermind.Init source trees returns only the interface definition, the FlatDbConfig default, and a test asserting the default. The flag is never referenced in PersistenceManager, FlatDbManager, FlatWorldStateModule, or any compactor/repository.

As a result:

  • Real PersistedSnapshotRepository/ArenaManager instances are always constructed, regardless of the flag (see FlatWorldStateModule.PerTierState factory).
  • PersistenceManager.DetermineSnapshotAction can return snapshotLevelToConvert (triggering ConvertSnapshotToPersistedSnapshot) once the in-memory window exceeds MaxInMemoryReorgDepth + CompactSize when finality is stalled, even with EnableLongFinality = false.

The PR description states "with EnableLongFinality=false none of the persisted-snapshot code runs" — this is not true. Please either:

  1. Add a guard in FlatWorldStateModule that wires NullPersistedSnapshotRepository when EnableLongFinality = false, OR
  2. Add the flag check in PersistenceManager.DetermineSnapshotAction to skip the conversion path, OR
  3. Correct the documentation to reflect the actual behaviour.

@@ -268,7 +542,7 @@ internal void PersistSnapshot(Snapshot snapshot)

long stateNodesSize = 0;
// foreach (var tn in snapshot.TrieNodes)
foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)
foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Pointless LINQ .Select() in the persistence hot path

_trieNodesSortBuffer is List<(Hash256, TreePath)> and the Select just recasts each element to the same type it already is. This allocates an iterator object on every call through the persistence path.

Suggested change
foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v))
foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)

@@ -296,7 +571,7 @@ internal void PersistSnapshot(Snapshot snapshot)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Same no-op LINQ .Select() on the second sort buffer

Same pattern as line 545 — the cast is a no-op and the LINQ iterator is a needless allocation.

Suggested change
foreach ((Hash256, TreePath) k in _trieNodesSortBuffer)

addressBound = new Bound(lebOffset - valueLength, valueLength);
useSpanReader = addressBound.Length <= AddressBoundWarmupBytes;
// if (useSpanReader)
// {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: Commented-out code block — dead code left in

Lines 284–291 (the if (useSpanReader) warm-up block) are commented out with a note that REF bits need to be re-armed on cache hits. If this is intentionally deferred, a // TODO: with a tracking issue is clearer. If the behaviour was decided against, the block should be removed to keep the reader from reasoning about it.

private byte[] ReadBlobArenaRlp(ushort blobArenaId, int offset)
{
BlobArenaFile file = _blobManager.GetFile(blobArenaId);
using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: NativeMemoryList<byte> for a fixed 568-byte scratch buffer — prefer stackalloc

NativeMemoryList<byte> calls NativeMemory.Alloc + NativeMemory.Free on every trie-node RLP resolve. For the 568-byte case, a plain stackalloc avoids the two P/Invoke syscalls and keeps the buffer on the stack:

Suggested change
using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes);
Span<byte> buf = stackalloc byte[MaxTrieNodeRlpBytes];

The using scope and its Dispose can be removed. The buffer is stack-allocated and the file.RandomRead call fills it in-place just as before. Since PersistedSnapshot is a class (not a ref struct), stackalloc is legal here.

/// 32-bit caps a single blob arena file at 2 GiB. <see cref="BlobArenaWriter"/>
/// enforces this on append; <see cref="BlobArenaManager.CreateWriter"/> picks
/// a fresh file when the estimate exceeds the current file's headroom.
/// </summary>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Signed int offset silently wraps for files > 2 GiB

RlpDataOffset is int (signed), so the maximum addressable offset is int.MaxValue = 2,147,483,647 bytes ≈ 2 GiB. The comment correctly documents this cap, but BlobArenaWriter should enforce it explicitly before writing. If a blob arena file grows past 2 GiB (e.g., due to a misconfigured ArenaFileSizeBytes), the offset silently wraps to a negative value, producing silent data corruption rather than a clear error.

Consider adding an explicit guard in BlobArenaWriter.Complete / BlobArenaWriter.Write:

if (offset > int.MaxValue)
    throw new InvalidOperationException(
        $"BlobArena offset {offset} exceeds int.MaxValue; per-file cap is 2 GiB.");

}

// Parallel base conversion across the whole batch
Parallel.ForEach(allStateIds, state =>

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Parallel.ForEach has no CancellationToken — does not honour shutdown

The Parallel.ForEach call uses the default ParallelOptions, so there is no way to cancel it when _cancelTokenSource fires on node shutdown. If the loop is in the middle of a large batch during shutdown, it will run to completion before DisposeAsync can progress.

Suggested change
Parallel.ForEach(allStateIds, state =>
Parallel.ForEach(allStateIds,
new ParallelOptions { CancellationToken = _cancelTokenSource.Token },
state =>
{
if (_snapshotRepository.TryLeaseState(state, out Snapshot? snapshot))
{
long sw = Stopwatch.GetTimestamp();
_smallRepo.ConvertSnapshotToPersistedSnapshot(snapshot);
_persistedSnapshotConvertTime.WithLabels("base").Observe(Stopwatch.GetTimestamp() - sw);
snapshot.Dispose();
}
});

bool EnableLongFinality { get; set; }

[ConfigItem(Description = "Total max reorg depth (in-memory + persisted). When exceeded, force-persist oldest HSST snapshot to RocksDB.", DefaultValue = "90000")]
int LongFinalityReorgDepth { get; set; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: Config description missing units

The description "Total max reorg depth (in-memory + persisted)" doesn't state the unit. Per the coding style rules, config items must document their units. Should be "Total max reorg depth in blocks (in-memory + persisted)."

Assembly[] releaseAssemblies = additionalJobAssemblies
.Union(simpleJobAssemblies)
.Append(typeof(KeccakBenchmark).Assembly)
// Precompile benchmark assembly disabled: PrecompileBenchmarkBase.Inputs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revert

/// N inputs — exercising the hot paths the optimisation targets.
/// </summary>
[MemoryDiagnoser]
public class PersistedSnapshotCompactBenchmark : IDisposable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

<ItemGroup>
<!-- Pre-existing breakage from c8a5fbcff3 (long-finality refactor); excluded so the
project still builds for runnable benchmarks. -->
<Compile Remove="State/PersistedSnapshotBenchmark.cs" />

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove clause, reset to master

/// empty tree even if the blocknumber is not -1. This does not work with flat. We will ignore it for now.
/// </summary>
/// <param name="flatDbManager"></param>
private class FlatDbManagerTestCompat(IFlatDbManager flatDbManager) : IFlatDbManager

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not needed anymore.

"resolved": "1.8.5",
"contentHash": "EaCgmntbH1sOzemRTqyXSqYjB6pLH7VCYHhhDYZ59guHSD5qPwhIYa7kfy0QUlmTRt9IXhaXdFhNuBUArp70Ng=="
},
"prometheus-net": {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reset whole file to master.

asdacap added a commit that referenced this pull request May 19, 2026
Bundles the PR-review fixes for the persisted-snapshot tier:

- Gate persisted-snapshot conversion paths on EnableLongFinality. When the
  flag is false, DetermineSnapshotAction skips both long-finality branches
  and falls through to the existing finalized-snapshot-to-RocksDB flow,
  matching pre-long-finality behaviour. FlatWorldStateModule now wires
  NullPersistedSnapshotRepository/Compactor when the flag is false so no
  arena/blob directories are created on startup.
- async ProcessCompactBatch: replace .AsTask().Wait() on a bounded channel
  write with an awaited WriteAsync that honours _cancelTokenSource.Token.
- ReadBlobArenaRlp: stackalloc the 568-byte scratch (was NativeMemoryList)
  and throw InvalidDataException if the RLP header declares more bytes
  than the pread returned, instead of slicing past the read window.
- Parallel.ForEach in AddToPersistence now takes ParallelOptions with the
  shutdown CancellationToken.
- Drop two identity-cast LINQ Select calls on _trieNodesSortBuffer in
  PersistSnapshot (per-snapshot iterator allocation).
- Add unit ("in blocks") to LongFinalityReorgDepth's ConfigItem description.
- Delete a dead commented-out warm-up branch in TryGetAddressBound.
- PersistedSnapshotBloom.CreateAlwaysTrue routes the sentinel lease count
  through a private constructor overload instead of a post-construction
  field write.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot removed the sync label May 27, 2026
asdacap and others added 3 commits May 27, 2026 15:48
All six 'ArrayPoolList<X> list = new(N, N); X[] arr = list.UnsafeGetInternalArray();'
sites in PersistedSnapshotMerger.cs simplify to:
  ArrayPoolList<X> list = new(N, N);
  Span<X> arr = list.AsSpan();

The downstream consumers were already in span form (cursor ctor takes
Span<TSource>, helpers take Span<TSource>, etc.) and the redundant
.AsSpan(0, N) calls collapse because list.AsSpan() already returns a
span of length Count == N. The innerSources path keeps the [..innerN]
slice since innerN varies per outer iteration.

No need to access the raw rented array — UnsafeGetInternalArray was
just a way to get a Span back, which AsSpan() already provides
without the GuardDispose checks accumulating (we cache the span once
at the top of each method).

Verified: 0/0 warnings/errors prod + test; 876/876 + 7 skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…or, doc trim

- Inline trivial single-call-site shims in HsstBTreeBuilder
  (MinBytesFor, WriteUInt64LE, KeySliceLength, FlushPendingAsEntries)
- Make BTreeNodeWriter.HeaderSize a const
- Consolidate Add / TryAddAligned via a shared private AddImpl
- Switch BTreeNodeReader to a primary constructor; extract
  NodeMetadata to its own file (was nested) and inline DecodeValueSize
  into its sole caller
- Trim 8 restate-only XML docstrings on private helpers; keep the ones
  that document invariants / triggers / spec-derived layout

No behaviour change. Verified: full Nethermind.slnx builds clean;
Nethermind.State.Flat.Test passes (869 / 7 pre-existing skips).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@asdacap

asdacap commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

@claude re-review

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @asdacap's task in 7m 32s —— View job


Claude finished @asdacap's re-review


PR Re-Review: feat(FlatDB): long finality support with persisted snapshots

Todo List:

  • Read repository rules and guidelines
  • Inspect current branch state vs prior review (what changed)
  • Re-examine prior High findings — check if they were addressed
  • Review new/changed files in detail
  • Post inline comments
  • Compile updated findings and post summary

Status of Prior Findings

All three prior High findings are addressed:

# Finding Status
H1 EnableLongFinality = false did not disable persisted code FixedNullSnapshotCatalog / NullPersistedSnapshotLoader / NullPersistedSnapshotCompactor wired in FlatWorldStateModule; PersistenceManager.DetermineSnapshotAction returns (null,null,null) early when flag is off.
H2 .Wait() on async channel write in ProcessCompactBatch Fixed_boundaryCompactJobs.Writer.WriteAsync now uses await with a CancellationToken (line 157). However, a new .Wait() on _compactPersistedJobs.Writer.WriteAsync has appeared in PersistedSnapshotCompactor.Enqueue — see new High below.
H3 No runtime guard in ReadBlobArenaRlp when totalLength > bytesRead Fixed — throws InvalidDataException; stackalloc replaces NativeMemoryList.

Medium findings:

# Finding Status
M1 No-op LINQ .Select() in hot path Fixed — code refactored
M2 Parallel.ForEach ignores CancellationToken ⚠️ Still present in PersistenceManager.cs:238 — carry-over from prior review
M3 NodeRef.RlpDataOffset overflow guard missing Fixed — guard in BlobArenaWriter.cs:107
M4 NativeMemoryList<byte> for scratch buffer Fixedstackalloc used

New Findings

High — PersistedSnapshotCompactor.Enqueue line 65 → Fix this

_compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait() blocks a thread-pool thread when the bounded channel (capacity 16, line 52) is full. The call chain is:

FlatDbManager.RunPersistenceNotifyWhenSlowTask.RunPersistIfNeededAddToPersistenceConvertCompactedRange / ConvertSingleBasecompactor.Enqueue()

This is the same robustness violation as prior High finding H2, now at a different call site. If the compactor falls behind during a burst (e.g., catch-up sync producing many conversion candidates), the Task.Run thread stalls and thread-pool saturation follows. See inline comment on PersistedSnapshotCompactor.cs:65.

Medium — SnapshotCatalog.ReadEntry line 134

(SnapshotTier)span[100] casts without validating that the byte is a persisted-tier value (2–5). An on-disk value of 0, 1, or ≥6 silently produces wrong bucket assignment (IsPersisted() = false → entry ignored, or MetricTierLabel throws ArgumentOutOfRangeException at the next metric site). The format-version check guards schema changes but not per-entry corruption. One range check surfaces corruption immediately. See inline comment on SnapshotCatalog.cs:134.

Medium — Parallel.ForEach in PersistedSnapshotCompactor.cs:145

New instance of the same carry-over M2 pattern: the compactor's own _cancelTokenSource.Token is not passed to ParallelOptions. A large compaction batch running at shutdown time will block DisposeAsync until it completes. See inline comment on PersistedSnapshotCompactor.cs:145.


Positive Notes on New Work

  • The NullSnapshotCatalog / NullPersistedSnapshotLoader Null-object wiring is clean and correctly placed after the concrete registrations so the last AddSingleton wins.
  • The SnapshotCatalog key layout (48-byte big-endian key for RocksDB lexicographic ordering, 101-byte little-endian value) is consistent and well-documented.
  • The PersistedLargeCompacted tier split and the accompanying bucket in SnapshotRepository are well-structured; the tier ordering invariant (tier >= PersistedBase = persisted) is maintained.
  • The BlobArenaWriter overflow guard (_written + rlp.Length > int.MaxValue → throw) and the ReadBlobArenaRlp guard (totalLength > bytesRead → throw) are both exactly right.
  • The ValidatePersistedSnapshot + Environment.Exit for git-bisect compatibility is a pragmatic debug tool and is intentional per the commit message and code comment.

Verdict: not yet mergeable — one new High finding (Enqueue().Wait()) and two Medium findings (Parallel.ForEach without cancellation ×2, SnapshotCatalog enum cast) remain unaddressed.

public void Enqueue(ArrayPoolList<StateId> batch)
{
EnsureStarted();
_compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High: .Wait() on bounded-channel write — same anti-pattern as the prior review's High finding, now in a different location

_compactPersistedJobs is a BoundedChannel of capacity 16 (line 52). When it is full, WriteAsync returns an incomplete ValueTask; .AsTask().Wait() then synchronously blocks the current thread-pool thread until a slot opens.

Enqueue is called from PersistenceManager.ConvertCompactedRange / ConvertSingleBase, which runs inside FlatDbManager.NotifyWhenSlow → Task.Run. If the compactor falls behind and the channel fills (e.g., during a rapid burst of block conversions), this stalls one or more thread-pool threads indefinitely — exactly the thread-pool starvation scenario that the robustness rules prohibit.

The callers already have no CancellationToken to pass, but the real fix is to avoid blocking at all — either make Enqueue fire-and-forget by using TryWrite and logging on drop, or restructure the caller to await WriteAsync by propagating the CancellationToken:

Suggested change
_compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait();
public void Enqueue(ArrayPoolList<StateId> batch)
{
EnsureStarted();
// TryWrite is non-blocking; drop the batch (with a warning) if the compactor is behind
// rather than stalling the thread-pool thread that processes blocks.
if (!_compactPersistedJobs.Writer.TryWrite(batch))
{
if (_logger.IsWarn) _logger.Warn($"Persisted compactor job queue full — dropping batch of {batch.Count} states.");
batch.Dispose();
}
}

Alternatively, make the channel unbounded so the write always succeeds immediately and backpressure is handled by memory pressure instead.

int arenaId = BinaryPrimitives.ReadInt32LittleEndian(span[80..]);
long offset = BinaryPrimitives.ReadInt64LittleEndian(span[84..]);
long size = BinaryPrimitives.ReadInt64LittleEndian(span[92..]);
SnapshotTier tier = (SnapshotTier)span[100];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Unchecked enum cast from on-disk byte — silent misclassification on corruption

(SnapshotTier)span[100] performs a direct cast without checking that the byte falls within the set of persisted-tier values {2, 3, 4, 5}. The catalog comment says it only ever stores persisted tiers, but a single corrupted byte (e.g., value 0 = InMemoryBase or 6 = undefined) silently produces a wrong classification:

  • InMemoryBase/InMemoryCompacted (0 or 1): IsPersisted() returns false, so the entry is added to the wrong bucket and later operations that expect a persisted snapshot get an in-memory tier handle.
  • A value ≥ 6: MetricTierLabel throws ArgumentOutOfRangeException at the next metric callsite, which is confusing to diagnose.

The format-version check (line 71) guards against schema changes but not per-entry field corruption. A small guard costs nothing and surfaces corruption immediately:

Suggested change
SnapshotTier tier = (SnapshotTier)span[100];
SnapshotTier tier = (SnapshotTier)span[100];
if (!tier.IsPersisted())
throw new InvalidOperationException(
$"Persisted snapshot catalog contains non-persisted tier value {(byte)span[100]} at entry. " +
"The persisted_snapshot/ directory may be corrupted — wipe and resync.");


Snapshot? snapshotToPersist;
Parallel.ForEach(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Parallel.ForEach ignores shutdown cancellation (carry-over from prior review M2)

This Parallel.ForEach has no ParallelOptions and therefore ignores _cancelTokenSource (or whichever token guards node shutdown). If this loop is processing a large CompactSize-wide batch during shutdown, DisposeAsync in the enclosing FlatDbManager / compactor will block until every item in allStateIds has been converted.

Suggested change
Parallel.ForEach(
Parallel.ForEach(
allStateIds,
new ParallelOptions { CancellationToken = loader.CancellationToken },
state =>
{

(Assuming IPersistedSnapshotLoader exposes the token, or thread it from PersistenceManager's own CTS.) If the token isn't easily available here, at minimum document that this loop is intentionally uncooperative with cancellation and that shutdown may be delayed by up to one full CompactSize-wide batch.

// Ascending bucket order: each sub-CompactSize layer's inputs (the previous layer's
// outputs) exist before it runs.
foreach (KeyValuePair<int, List<StateId>> kv in buckets)
Parallel.ForEach(kv.Value, state => DoCompactSnapshot(state));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Parallel.ForEach ignores shutdown cancellation

Same pattern as PersistenceManager.cs line 238 (prior review M2): the compactor's own _cancelTokenSource is not passed as ParallelOptions.CancellationToken. If shutdown fires while a bucket's state list is large (e.g., at a high compaction level), the loop runs to completion before DisposeAsync can drain.

Suggested change
Parallel.ForEach(kv.Value, state => DoCompactSnapshot(state));
Parallel.ForEach(kv.Value,
new ParallelOptions { CancellationToken = _cancelTokenSource.Token },
state => DoCompactSnapshot(state));

asdacap and others added 4 commits June 17, 2026 13:33
Address review feedback on the long-finality concurrency paths:

- SnapshotCatalog.ReadEntry now rejects a non-persisted tier byte with the
  file's standard "wipe and resync" error instead of silently misclassifying a
  corrupt entry.
- IPersistedSnapshotCompactor.Enqueue becomes EnqueueAsync: it awaits a free
  bounded-queue slot (backpressure without blocking a thread) and takes the
  producer's cancellation token rather than owning a CancellationTokenSource.
  Its background workers and in-flight Parallel.ForEach observe process-exit
  directly; graceful disposal completes and drains the channels in stage order.
- PersistenceManager's persistence drain is now async (SemaphoreSlim mutex so it
  can await Enqueue under the lock) and its conversion Parallel.ForEach observes
  a process-exit-linked token; FlushToPersistence stays synchronous.
- FlatDbManager.PersistIfNeeded awaits the now-async AddToPersistence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the test-only IArenaManager wrapper with the actual ArenaManager over a
temporary directory, built via a new TestFixtureHelpers.CreateArenaManager factory
that applies the same test config (page tracker disabled, arena size floored to one
OS page). Fixtures own the temp dir and clean it up in TearDown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Halve PersistedSnapshotArenaPageCacheBytes from 8 GiB to 4 GiB (and the matching
ConfigItem DefaultValue), bounding the mmap'd arena page cache and the tracker's
own metadata (~17 MiB -> ~8.5 MiB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve the conflict from #11967 (force-persist the head-reachable fork): our
branch replaced DetermineSnapshotToPersist with the two-phase DetermineSnapshotAction,
so the fix is ported onto our path rather than master's methods:

- Keep master's SnapshotRepository committed-head infra (SetLastCommittedStateId/
  GetLastCommittedStateId) and the FlatDbManager.AddSnapshot caller; drop master's
  now-superseded DetermineSnapshotToPersist/GetHeadAncestorAtBlockNumber/
  GetFinalizedSnapshotAtBlockNumber/TryFindAncestorStateAtBlock.
- DetermineSnapshotAction backstop and FlushToPersistence now seed from the committed
  head (GetLastCommittedStateId), which our ancestral FindSnapshotToPersist walk turns
  into "persist the head-reachable fork" — so a forced persist no longer follows a
  longer non-canonical fork.
- Port master's two DetermineSnapshotToPersist fork tests to DetermineSnapshotAction;
  adapt master's two FlushToPersistence fork tests to our repo-owned snapshot model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@asdacap asdacap marked this pull request as ready for review June 17, 2026 08:39
@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @asdacap's task in 5m 6s —— View job


Todo List:

  • Load rules files and gather context
  • Inspect current state of prior High finding: PersistedSnapshotCompactor.Enqueue .Wait()
  • Inspect carry-over Mediums: Parallel.ForEach without CancellationToken (×2)
  • Inspect new Medium: SnapshotCatalog.ReadEntry unchecked enum cast
  • Scan recently changed files for new issues
  • Post findings and inline comments

PR Re-Review (3rd pass): feat(FlatDB): long finality support with persisted snapshots

All findings from the prior two reviews are resolved. The cooperative-shutdown refactor (4cc76142) is particularly thorough — it addresses the High .Wait() finding, both Parallel.ForEach cancellation findings, and the catalog enum-cast issue in a single coherent commit.


Status of All Prior Findings

Review # Finding Status
1st H1 EnableLongFinality=false did not disable persisted code ✅ Fixed — NullSnapshotCatalog / NullPersistedSnapshotLoader / NullPersistedSnapshotCompactor wired in FlatWorldStateModule; DetermineSnapshotAction also gates Phase 2 on _enableLongFinality.
1st H2 .Wait() on async channel write in ProcessCompactBatch ✅ Fixed — await _boundaryCompactJobs.Writer.WriteAsync(boundary, cancellationToken) at line 173.
1st H3 No runtime guard in ReadBlobArenaRlp ✅ Fixed (prior re-review).
1st M1 No-op LINQ .Select() in persistence hot path ✅ Fixed — plain foreach over _trieNodesSortBuffer.
1st M2 Parallel.ForEach ignores CancellationToken in PersistenceManager ✅ Fixed — new ParallelOptions { CancellationToken = _cts.Token } at line 258.
1st M3 NodeRef.RlpDataOffset signed overflow guard ✅ Fixed (prior re-review).
1st M4 NativeMemoryList<byte> for scratch buffer ✅ Fixed — stackalloc (prior re-review).
2nd H PersistedSnapshotCompactor.Enqueue .Wait() ✅ Fixed — method is now EnqueueAsync(ValueTask) with proper await and caller-supplied CancellationToken.
2nd M SnapshotCatalog.ReadEntry unchecked enum cast ✅ Fixed — SnapshotCatalog.cs:135-138 now throws InvalidOperationException with "wipe and resync" message on non-persisted tier byte.
2nd M Parallel.ForEach without cancellation in PersistedSnapshotCompactor ✅ Fixed — new ParallelOptions { CancellationToken = cancellationToken } at line 161.
2nd M Parallel.ForEach without cancellation in PersistenceManager:238 ✅ Fixed (same commit as M2 above).
1st L2 Commented-out warm-up block in TryGetAddressBound ✅ Fixed — properly implemented and documented.
1st L3 TryGetSnapshotFrom O(n) iteration ✅ No longer applicable — method removed.
1st L4 BloomFilter.AlwaysTrue direct _leases field write ✅ Replaced by a clean new(capacity:1, bitsPerKey:1.0) + Fill(0xFF) that doesn't bypass the API.

New Observations (No New Blockers)

The bb3bf8d5 commit (force-persist head-reachable fork) correctly replaces GetFirstSnapshotAtBlockNumber with a BFS-backed GetHeadAncestorAtBlockNumber, properly guarded by SetLastCommittedStateId / GetLastCommittedStateId (both under a Lock.Scope, no race). No issues found there.

The two remaining .Wait() calls in FlatDbManager (RunCompactJobSync().Wait() and _compactorJobs.Writer.WriteAsync().AsTask().Wait()) are both in AddSnapshot — a synchronous method called from the dedicated block-processing thread, not the thread pool. The AsTask().Wait() is an intentional backpressure mechanism with a timeout loop and a warn log, which is documented in the comment alongside it. Neither is a problem.


Verdict

Mergeable. All Critical/High/Medium findings from both prior reviews are addressed. The Null-object wiring, cooperative cancellation, and catalog corruption guard are all clean.

@github-actions

Copy link
Copy Markdown
Contributor

EXPB Benchmark Comparison

Run: View workflow run

superblocks

Scenario: nethermind-flat-superblocks-flat-long-finality-delay0s

Client Processing (SSE)

Metric PR Master (cached) Delta
AVG (ms) 1175.24 1171.09 +0.35%
MEDIAN (ms) 1056.0 1045.7 +0.98%
P90 (ms) 1727.1 1670.0 +3.42%
P95 (ms) 2053.2 1973.7 +4.03%
P99 (ms) 4426.4 4667.3 -5.16%
MIN (ms) 755.3 734.0 +2.90%
MAX (ms) 4426.4 4667.3 -5.16%
K6 TTFB
Metric PR Master (cached) Delta
AVG (ms) 1483.83 1386.10 +7.05%
MEDIAN (ms) 1241.67 1241.88 -0.02%
P90 (ms) 1913.58 1700.56 +12.53%
P95 (ms) 2342.88 2125.31 +10.24%
P99 (ms) 5197.89 4789.64 +8.52%
MIN (ms) 875.56 848.04 +3.25%
MAX (ms) 9375.28 5406.95 +73.39%

realblocks

Scenario: nethermind-flat-realblocks-flat-long-finality-delay0s

Client Processing (SSE)

Metric PR Master (cached) Delta
AVG (ms) 28.24 26.52 +6.49%
MEDIAN (ms) 24.3 23.2 +4.74%
P90 (ms) 46.3 43.6 +6.19%
P95 (ms) 53.9 50.3 +7.16%
P99 (ms) 106.8 101.2 +5.53%
MIN (ms) 0.4 0.4 +0.00%
MAX (ms) 201.2 224.9 -10.54%
K6 TTFB
Metric PR Master (cached) Delta
AVG (ms) 32.39 30.36 +6.69%
MEDIAN (ms) 28.60 27.23 +5.03%
P90 (ms) 50.65 47.67 +6.25%
P95 (ms) 62.16 56.35 +10.31%
P99 (ms) 110.58 105.69 +4.63%
MIN (ms) 1.01 1.16 -12.93%
MAX (ms) 205.26 229.00 -10.37%

@kamilchodola

Copy link
Copy Markdown
Contributor

@asdacap need to confirm above numbers but seems like this comes with a big performance hit?

@kamilchodola

Copy link
Copy Markdown
Contributor

@claude any idea why expb shows such a per degradation? can it be expb changes or this PR have some inefficiencies?

@claude

claude Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Claude finished @kamilchodola's task in 8m 34s —— View job


Analyzing EXPB performance regression

Todo List:

  • Load tools
  • Examine hot-path changes in FlatDbManager.cs, PersistenceManager.cs, SnapshotRepository.cs
  • Check what overhead exists when EnableLongFinality=true but persisted tier is idle
  • Examine ReadOnlySnapshotBundle, PersistedSnapshotStack, PersistedSnapshotBucket
  • Check benchmark scenario configuration (delay0s, what it compares)
  • Analyze and post findings

Performance Regression Analysis

Is this EXPB tooling or PR overhead?

PR overhead. The scenario flat-long-finality-delay0s ran on both the PR branch and master. On master, the config key EnableLongFinality doesn't exist, so Nethermind runs the standard flat-DB path. On this branch, EnableLongFinality=true is active — the persisted-snapshot tier is fully initialized (4 PersistedSnapshotBuckets, catalog, arena/blob managers) but no persisted snapshots have been written yet (the benchmark doesn't run enough blocks to fill MaxInMemoryBaseSnapshotCount = LongFinalityReorgDepth / CompactSize ≈ 2812).

The ~1% overhead on superblocks vs ~6% on realblocks is the diagnostic signal. If the overhead were proportional to state-access volume (trie reads, etc.), superblocks — which do far more per block — would show more overhead. Instead, the inverse relationship points to a fixed per-block or per-persist overhead that's amortized away by the ~1000ms superblock processing time but visible against the ~24ms realblock baseline.


Identified overhead sources

1. HasPersistedForkAt always runs on every persist (most likely main cause)

In SnapshotRepository.RemoveSiblingAndDescendents:

// Master:
if (!HasForkAt(canonicalBlock)) return;

// Branch:
if (!HasForkAt(canonicalBlock) && !HasPersistedForkAt(canonicalStateId)) return;

In normal operation (no reorgs), HasForkAt always returns false, which makes HasPersistedForkAt always execute. That method:

  • Allocates a HashSet<StateId> (heap allocation)
  • Calls CollectRange on all 4 persisted buckets, each acquiring a Lock and doing a SortedSet.GetViewBetween
  • Allocates an ArrayPoolList

This runs every ~CompactSize blocks (every persistence event, triggered by Phase 1). Similarly, GetLastSnapshotId (also called inside RemoveSiblingAndDescendents) now acquires 4 extra Locks (one per PersistedSnapshotBucket.Max) even when all buckets are empty.

Quick fix: short-circuit in HasPersistedForkAt when no persisted snapshots exist:

private bool HasPersistedForkAt(in StateId canonicalStateId)
{
    if (PersistedSnapshotCount == 0) return false; // nothing to check
    ...
}

2. PersistedSnapshotStack.TryLoadStateRlp / TryLoadStorageRlp compute bloom keys unconditionally

// TryLoadStateRlp — no Count guard:
ulong statePathBloomKey = PersistedSnapshotBloomBuilder.StatePathKey(in path);  // always runs
for (int i = _snapshots.Count - 1; i >= 0; i--) { ... }  // never executes when empty

// TryLoadStorageRlp — no Count guard:
ValueHash256 addressHash = address.ValueHash256;
ulong storageBloomKey = PersistedSnapshotBloomBuilder.StorageNodeKey(in addressHash, in path);  // always runs

Compare with TryGetAccount / TryGetSlot which correctly guard with if (_snapshots.Count > 0). StatePathKey does a stackalloc byte[33] + path encoding + 4 XOR ops on every trie-node RLP load from the flat DB. Unlike TryGetAccount, this runs even when Count == 0.

Quick fix: add an early return at the top of both methods:

if (_snapshots.Count == 0) { rlp = null; return false; }

3. RemoveAndReleaseInMemoryKnownState (base tier) reads SortedSet.Max on every removal

// Branch (every base-snapshot removal):
StateId? newMax;
using (_sortedSnapshotStateIds.EnterWriteLock(out SortedSet<StateId> sortedSnapshots))
{
    sortedSnapshots.Remove(stateId);
    newMax = sortedSnapshots.Count == 0 ? null : sortedSnapshots.Max;  // O(log n) — always
}
using (_lastRegisteredLock.EnterScope())
    if (_lastRegisteredState == stateId) _lastRegisteredState = newMax;

Master's RemoveAndReleaseKnownState just called sortedSnapshots.Remove(stateId) with no .Max query. Each persistence event removes CompactSize snapshots, so this runs 32 × O(log 256) ≈ 200 extra tree ops per persist cycle, plus 32 extra _lastRegisteredLock acquisitions.

4. WalkAndAssemble BFS tries 6 tier-edges per node instead of 2

The new shared BFS tries all 6 tiers (PersistedLargeCompacted, PersistedCompactSized, InMemoryCompacted, InMemoryBase, PersistedSmallCompacted, PersistedBase) per node, each doing a ConcurrentDictionary.TryGetValue. When the persisted tier is empty, 4 of those 6 calls return immediately, but at 1800 bundle builds/sec the extra calls add up. Impact is bounded by the bundle cache (cache hits skip the BFS entirely), but after ClearReadOnlyBundleCache calls the BFS runs more.


Is the overhead acceptable?

If the EXPB scenario is intentionally testing EnableLongFinality=true (i.e., the overhead is the cost of running the feature even before any persisted snapshots are written), then the ~6% realblock overhead from fixed bookkeeping work on empty data structures is likely reducible to near-zero with the guards above. Items 1 and 2 are zero-work paths that can be made truly free with a single branch.

The author's own benchmark showed no overhead "at 300 blocks" vs the feature disabled, but that compared PR-enabled vs PR-disabled. The EXPB comparison is PR-enabled vs master's simpler code — which is why the overhead appears here but wasn't visible in the in-PR benchmark.

Recommended before merge: apply the two guard fixes (items 1 and 2) and re-run the EXPB benchmark to confirm the realblock regression closes. Items 3 and 4 are lower priority but addressable.

asdacap and others added 6 commits June 17, 2026 17:42
…mpty

ReadOnlySnapshotBundle caches the persisted-snapshot count once (immutable for the
bundle's life) and gates every persisted-tier probe — GetAccount, GetSlot,
DetermineSelfDestructSnapshotIdx, TryLoadStateRlp, TryLoadStorageRlp — on it being
> 0. When no persisted snapshots exist (long finality disabled, or none persisted
yet) the read path skips the persisted query entirely instead of calling into an
empty PersistedSnapshotStack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Split the force-persist backstop depth into two configs: MaxReorgDepth (restored to
the non-long-finality default of 256) and a new LongFinalityMaxReorgDepth (default
90000). PersistenceManager selects the backstop at construction —
LongFinalityMaxReorgDepth when EnableLongFinality is on (the persisted tier serves
deep reorgs), otherwise MaxReorgDepth — so a non-long-finality node force-persists
much sooner instead of holding up to 90000 in-memory snapshots when finality stalls.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default the hierarchical persisted-layer compaction ceiling from 8192 to
1048576 (1Mi) blocks; keep the ConfigItem DefaultValue in sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Log a Warn when the reorg-depth backstop (LongFinalityMaxReorgDepth / MaxReorgDepth)
is exceeded and a persist is actually forced — restoring visibility lost when
DetermineSnapshotToPersist was replaced by DetermineSnapshotAction. The warning is
gated on the persist actually happening, so it does not fire when the backstop seed
finds no candidate and the call falls through to a Phase 2 persisted-snapshot
conversion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fallback

Remove the flat/long-finality-only LastRegisteredState mechanism (field, lock,
property, AddStateId setter, tip-removal reset, interface member, and its test) and
align the DetermineSnapshotAction backstop fallback with master:
GetLastCommittedStateId() ?? GetLastSnapshotId() ?? latestSnapshot. The committed-head
primary handles the common case, so only the rarely-hit fallback changes — from
call-order (LastRegisteredState) to the longest-chain tip (GetLastSnapshotId).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compute the backstop depth as Max(configured backstop, MinReorgDepth + CompactSize)
so MinReorgDepth can be configured at or above MaxReorgDepth/LongFinalityMaxReorgDepth
without the finalized-persistence trigger and the backstop colliding at the same depth.
The CompactSize margin gives the finalized trigger room to persist before the backstop
force-persists. Normal configs (MinReorgDepth far below the backstop) are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants