feat(FlatDB): long finality support with persisted snapshots #11663
asdacap wants to merge 758 commits
Claude finished @asdacap's task in 8m 56s
PR Review: feat(FlatDB): long finality support with persisted snapshots
Overview
This is an impressively engineered PR: a custom mmap-backed columnar format (HSST), a logarithmic compaction hierarchy, per-snapshot bloom filters, a clock-based page-residency tracker, and a blob-arena indirection layer to keep HSST sizes small. The HSST format spec ( The three findings below block merge in my assessment. The rest are improvements that should be made before or shortly after landing.
Critical / High
| Parallel.ForEach(kv.Value, state => _smallCompactor.DoCompactSnapshot(state)); | ||
| foreach (StateId boundary in boundaries) | ||
| _boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait(); |
High: Blocking .Wait() on async channel write — potential thread-pool stall
WriteAsync on a bounded Channel (capacity 16) returns an incomplete ValueTask when the channel is full. .AsTask().Wait() then synchronously blocks the current thread-pool thread — exactly the pattern the robustness rules prohibit (Never call .Wait() inside an async method).
ProcessCompactBatch runs inside RunPersistedCompactor's await foreach, so the current thread is a pool thread. If more than 16 boundary-block states accumulate in one batch, this thread stalls until a boundary compactor drains a slot, preventing other pool tasks from running.
| _boundaryCompactJobs.Writer.WriteAsync(boundary).AsTask().Wait(); | |
| foreach (StateId boundary in boundaries) | |
| await _boundaryCompactJobs.Writer.WriteAsync(boundary, cancellationToken); |
ProcessCompactBatch needs to become async Task ProcessCompactBatch(ArrayPoolList<StateId> batch, CancellationToken ct) and the call site in RunPersistedCompactor needs the corresponding await.
| using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes); | ||
| Span<byte> buf = rented.AsSpan(); | ||
| int bytesRead = file.RandomRead(offset, buf); | ||
| Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]); |
High: No runtime guard on totalLength > MaxTrieNodeRlpBytes
ctx.PeekNextRlpLength() returns the full RLP length decoded from the header. If totalLength > 568 (or totalLength > bytesRead on a short read), then buf[..totalLength] throws ArgumentOutOfRangeException, or copies zero-filled tail bytes into result when totalLength fits in the buffer but the read was short.
The 568-byte cap is a reasonable estimate for Ethereum's worst-case branch node (~532 bytes), but there is no runtime assertion guarding the invariant. A defensive check here costs one compare on the cold path:
| Rlp.ValueDecoderContext ctx = new(buf[..bytesRead]); | |
| int totalLength = ctx.PeekNextRlpLength(); | |
| if (totalLength > bytesRead) | |
| { | |
| // Re-read with an exact-size buffer. Should never happen for valid trie nodes, | |
| // but guards against unexpected node formats or partial reads at arena boundaries. | |
| byte[] full = new byte[totalLength]; | |
| buf[..bytesRead].CopyTo(full); | |
| file.RandomRead(offset + bytesRead, full.AsSpan(bytesRead)); | |
| return full; | |
| } | |
| byte[] result = new byte[totalLength]; | |
| buf[..totalLength].CopyTo(result); | |
| return result; |
| bool VerifyWithTrie { get; set; } | ||
| [ConfigItem(Description = "Enable long finality support with persisted snapshots", DefaultValue = "false")] | ||
| bool EnableLongFinality { get; set; } |
High: EnableLongFinality flag is never read in any logic path
grep -rn "EnableLongFinality" across the Nethermind.State.Flat and Nethermind.Init source trees returns only the interface definition, the FlatDbConfig default, and a test asserting the default. The flag is never referenced in PersistenceManager, FlatDbManager, FlatWorldStateModule, or any compactor/repository.
As a result:
- Real PersistedSnapshotRepository/ArenaManager instances are always constructed, regardless of the flag (see FlatWorldStateModule.PerTierState factory).
- PersistenceManager.DetermineSnapshotAction can return snapshotLevelToConvert (triggering ConvertSnapshotToPersistedSnapshot) once the in-memory window exceeds MaxInMemoryReorgDepth + CompactSize when finality is stalled, even with EnableLongFinality = false.
The PR description states "with EnableLongFinality=false none of the persisted-snapshot code runs". This is not true. Please either:
- Add a guard in FlatWorldStateModule that wires NullPersistedSnapshotRepository when EnableLongFinality = false, OR
- Add the flag check in PersistenceManager.DetermineSnapshotAction to skip the conversion path, OR
- Correct the documentation to reflect the actual behaviour.
| @@ -268,7 +542,7 @@ internal void PersistSnapshot(Snapshot snapshot) | |||
| long stateNodesSize = 0; | |||
| // foreach (var tn in snapshot.TrieNodes) | |||
| foreach ((Hash256, TreePath) k in _trieNodesSortBuffer) | |||
| foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v)) | |||
Medium: Pointless LINQ .Select() in the persistence hot path
_trieNodesSortBuffer is List<(Hash256, TreePath)> and the Select just recasts each element to the same type it already is. This allocates an iterator object on every call through the persistence path.
| foreach ((Hash256, TreePath) k in _trieNodesSortBuffer.Select(v => ((Hash256, TreePath))v)) | |
| foreach ((Hash256, TreePath) k in _trieNodesSortBuffer) |
| @@ -296,7 +571,7 @@ internal void PersistSnapshot(Snapshot snapshot) | |||
Medium: Same no-op LINQ .Select() on the second sort buffer
Same pattern as line 545 — the cast is a no-op and the LINQ iterator is a needless allocation.
| foreach ((Hash256, TreePath) k in _trieNodesSortBuffer) |
| addressBound = new Bound(lebOffset - valueLength, valueLength); | ||
| useSpanReader = addressBound.Length <= AddressBoundWarmupBytes; | ||
| // if (useSpanReader) | ||
| // { |
Low: Commented-out code block — dead code left in
Lines 284–291 (the if (useSpanReader) warm-up block) are commented out with a note that REF bits need to be re-armed on cache hits. If this is intentionally deferred, a // TODO: with a tracking issue is clearer. If the behaviour was decided against, the block should be removed to keep the reader from reasoning about it.
| private byte[] ReadBlobArenaRlp(ushort blobArenaId, int offset) | ||
| { | ||
| BlobArenaFile file = _blobManager.GetFile(blobArenaId); | ||
| using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes); |
Medium: NativeMemoryList<byte> for a fixed 568-byte scratch buffer — prefer stackalloc
NativeMemoryList<byte> calls NativeMemory.Alloc + NativeMemory.Free on every trie-node RLP resolve. For the 568-byte case, a plain stackalloc avoids both native allocator calls and keeps the buffer on the stack:
| using NativeMemoryList<byte> rented = new(MaxTrieNodeRlpBytes, MaxTrieNodeRlpBytes); | |
| Span<byte> buf = stackalloc byte[MaxTrieNodeRlpBytes]; |
The using scope and its Dispose can be removed. The buffer is stack-allocated and the file.RandomRead call fills it in-place just as before. stackalloc into a Span<byte> is legal here because ReadBlobArenaRlp is a synchronous, non-iterator method.
| /// 32-bit caps a single blob arena file at 2 GiB. <see cref="BlobArenaWriter"/> | ||
| /// enforces this on append; <see cref="BlobArenaManager.CreateWriter"/> picks | ||
| /// a fresh file when the estimate exceeds the current file's headroom. | ||
| /// </summary> |
Medium: Signed int offset silently wraps for files > 2 GiB
RlpDataOffset is int (signed), so the maximum addressable offset is int.MaxValue = 2,147,483,647 bytes ≈ 2 GiB. The comment correctly documents this cap, but BlobArenaWriter should enforce it explicitly before writing. If a blob arena file grows past 2 GiB (e.g., due to a misconfigured ArenaFileSizeBytes), the offset silently wraps to a negative value, producing silent data corruption rather than a clear error.
Consider adding an explicit guard in BlobArenaWriter.Complete / BlobArenaWriter.Write:
if (offset > int.MaxValue)
    throw new InvalidOperationException(
        $"BlobArena offset {offset} exceeds int.MaxValue; per-file cap is 2 GiB.");
| } | ||
| // Parallel base conversion across the whole batch | ||
| Parallel.ForEach(allStateIds, state => |
Medium: Parallel.ForEach has no CancellationToken — does not honour shutdown
The Parallel.ForEach call uses the default ParallelOptions, so there is no way to cancel it when _cancelTokenSource fires on node shutdown. If the loop is in the middle of a large batch during shutdown, it will run to completion before DisposeAsync can progress.
| Parallel.ForEach(allStateIds, state => | |
| Parallel.ForEach(allStateIds, | |
| new ParallelOptions { CancellationToken = _cancelTokenSource.Token }, | |
| state => | |
| { | |
| if (_snapshotRepository.TryLeaseState(state, out Snapshot? snapshot)) | |
| { | |
| long sw = Stopwatch.GetTimestamp(); | |
| _smallRepo.ConvertSnapshotToPersistedSnapshot(snapshot); | |
| _persistedSnapshotConvertTime.WithLabels("base").Observe(Stopwatch.GetTimestamp() - sw); | |
| snapshot.Dispose(); | |
| } | |
| }); |
| bool EnableLongFinality { get; set; } | ||
| [ConfigItem(Description = "Total max reorg depth (in-memory + persisted). When exceeded, force-persist oldest HSST snapshot to RocksDB.", DefaultValue = "90000")] | ||
| int LongFinalityReorgDepth { get; set; } |
Low: Config description missing units
The description "Total max reorg depth (in-memory + persisted)" doesn't state the unit. Per the coding style rules, config items must document their units. Should be "Total max reorg depth in blocks (in-memory + persisted)."
| Assembly[] releaseAssemblies = additionalJobAssemblies | ||
| .Union(simpleJobAssemblies) | ||
| .Append(typeof(KeccakBenchmark).Assembly) | ||
| // Precompile benchmark assembly disabled: PrecompileBenchmarkBase.Inputs |
| /// N inputs — exercising the hot paths the optimisation targets. | ||
| /// </summary> | ||
| [MemoryDiagnoser] | ||
| public class PersistedSnapshotCompactBenchmark : IDisposable |
| <ItemGroup> | ||
| <!-- Pre-existing breakage from c8a5fbcff3 (long-finality refactor); excluded so the | ||
| project still builds for runnable benchmarks. --> | ||
| <Compile Remove="State/PersistedSnapshotBenchmark.cs" /> |
Remove clause, reset to master
| /// empty tree even if the blocknumber is not -1. This does not work with flat. We will ignore it for now. | ||
| /// </summary> | ||
| /// <param name="flatDbManager"></param> | ||
| private class FlatDbManagerTestCompat(IFlatDbManager flatDbManager) : IFlatDbManager |
This is probably not needed anymore.
| "resolved": "1.8.5", | ||
| "contentHash": "EaCgmntbH1sOzemRTqyXSqYjB6pLH7VCYHhhDYZ59guHSD5qPwhIYa7kfy0QUlmTRt9IXhaXdFhNuBUArp70Ng==" | ||
| }, | ||
| "prometheus-net": { |
Reset whole file to master.
Bundles the PR-review fixes for the persisted-snapshot tier:
- Gate persisted-snapshot conversion paths on EnableLongFinality. When the
flag is false, DetermineSnapshotAction skips both long-finality branches
and falls through to the existing finalized-snapshot-to-RocksDB flow,
matching pre-long-finality behaviour. FlatWorldStateModule now wires
NullPersistedSnapshotRepository/Compactor when the flag is false so no
arena/blob directories are created on startup.
- async ProcessCompactBatch: replace .AsTask().Wait() on a bounded channel
write with an awaited WriteAsync that honours _cancelTokenSource.Token.
- ReadBlobArenaRlp: stackalloc the 568-byte scratch (was NativeMemoryList)
and throw InvalidDataException if the RLP header declares more bytes
than the pread returned, instead of slicing past the read window.
- Parallel.ForEach in AddToPersistence now takes ParallelOptions with the
shutdown CancellationToken.
- Drop two identity-cast LINQ Select calls on _trieNodesSortBuffer in
PersistSnapshot (per-snapshot iterator allocation).
- Add unit ("in blocks") to LongFinalityReorgDepth's ConfigItem description.
- Delete a dead commented-out warm-up branch in TryGetAddressBound.
- PersistedSnapshotBloom.CreateAlwaysTrue routes the sentinel lease count
through a private constructor overload instead of a post-construction
field write.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All six 'ArrayPoolList<X> list = new(N, N); X[] arr = list.UnsafeGetInternalArray();' sites in PersistedSnapshotMerger.cs simplify to:
ArrayPoolList<X> list = new(N, N); Span<X> arr = list.AsSpan();
The downstream consumers were already in span form (cursor ctor takes Span<TSource>, helpers take Span<TSource>, etc.) and the redundant .AsSpan(0, N) calls collapse because list.AsSpan() already returns a span of length Count == N. The innerSources path keeps the [..innerN] slice since innerN varies per outer iteration.
No need to access the raw rented array: UnsafeGetInternalArray was just a way to get a Span back, which AsSpan() already provides without the GuardDispose checks accumulating (we cache the span once at the top of each method).
Verified: 0/0 warnings/errors prod + test; 876/876 + 7 skips.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…or, doc trim
- Inline trivial single-call-site shims in HsstBTreeBuilder (MinBytesFor, WriteUInt64LE, KeySliceLength, FlushPendingAsEntries)
- Make BTreeNodeWriter.HeaderSize a const
- Consolidate Add / TryAddAligned via a shared private AddImpl
- Switch BTreeNodeReader to a primary constructor; extract NodeMetadata to its own file (was nested) and inline DecodeValueSize into its sole caller
- Trim 8 restate-only XML docstrings on private helpers; keep the ones that document invariants / triggers / spec-derived layout
No behaviour change. Verified: full Nethermind.slnx builds clean; Nethermind.State.Flat.Test passes (869 / 7 pre-existing skips).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@claude re-review
Claude finished @asdacap's task in 7m 32s
PR Re-Review: feat(FlatDB): long finality support with persisted snapshots
Status of Prior Findings
All three prior High findings are addressed:
Medium findings:
New Findings
High
| public void Enqueue(ArrayPoolList<StateId> batch) | ||
| { | ||
| EnsureStarted(); | ||
| _compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait(); |
High: .Wait() on bounded-channel write — same anti-pattern as the prior review's High finding, now in a different location
_compactPersistedJobs is a BoundedChannel of capacity 16 (line 52). When it is full, WriteAsync returns an incomplete ValueTask; .AsTask().Wait() then synchronously blocks the current thread-pool thread until a slot opens.
Enqueue is called from PersistenceManager.ConvertCompactedRange / ConvertSingleBase, which runs inside FlatDbManager.NotifyWhenSlow → Task.Run. If the compactor falls behind and the channel fills (e.g., during a rapid burst of block conversions), this stalls one or more thread-pool threads indefinitely — exactly the thread-pool starvation scenario that the robustness rules prohibit.
The callers currently have no CancellationToken to pass, but the real fix is to avoid blocking at all: either make Enqueue fire-and-forget by using TryWrite and logging on drop, or restructure the caller to await WriteAsync and propagate a CancellationToken:
| _compactPersistedJobs.Writer.WriteAsync(batch).AsTask().Wait(); | |
| public void Enqueue(ArrayPoolList<StateId> batch) | |
| { | |
| EnsureStarted(); | |
| // TryWrite is non-blocking; drop the batch (with a warning) if the compactor is behind | |
| // rather than stalling the thread-pool thread that processes blocks. | |
| if (!_compactPersistedJobs.Writer.TryWrite(batch)) | |
| { | |
| if (_logger.IsWarn) _logger.Warn($"Persisted compactor job queue full — dropping batch of {batch.Count} states."); | |
| batch.Dispose(); | |
| } | |
| } |
Alternatively, make the channel unbounded so the write always succeeds immediately; the trade-off is that a lagging compactor then shows up as growing memory use rather than as backpressure on the producer.
| int arenaId = BinaryPrimitives.ReadInt32LittleEndian(span[80..]); | ||
| long offset = BinaryPrimitives.ReadInt64LittleEndian(span[84..]); | ||
| long size = BinaryPrimitives.ReadInt64LittleEndian(span[92..]); | ||
| SnapshotTier tier = (SnapshotTier)span[100]; |
Medium: Unchecked enum cast from on-disk byte — silent misclassification on corruption
(SnapshotTier)span[100] performs a direct cast without checking that the byte falls within the set of persisted-tier values {2, 3, 4, 5}. The catalog comment says it only ever stores persisted tiers, but a single corrupted byte (e.g., value 0 = InMemoryBase or 6 = undefined) silently produces a wrong classification:
- InMemoryBase/InMemoryCompacted (0 or 1): IsPersisted() returns false, so the entry is added to the wrong bucket and later operations that expect a persisted snapshot get an in-memory tier handle.
- A value ≥ 6: MetricTierLabel throws ArgumentOutOfRangeException at the next metric call site, which is confusing to diagnose.
The format-version check (line 71) guards against schema changes but not per-entry field corruption. A small guard costs nothing and surfaces corruption immediately:
| SnapshotTier tier = (SnapshotTier)span[100]; | |
| SnapshotTier tier = (SnapshotTier)span[100]; | |
| if (!tier.IsPersisted()) | |
| throw new InvalidOperationException( | |
| $"Persisted snapshot catalog contains non-persisted tier value {(byte)span[100]} at entry. " + | |
| "The persisted_snapshot/ directory may be corrupted — wipe and resync."); |
| Snapshot? snapshotToPersist; | ||
| Parallel.ForEach( |
Medium: Parallel.ForEach ignores shutdown cancellation (carry-over from prior review M2)
This Parallel.ForEach has no ParallelOptions and therefore ignores _cancelTokenSource (or whichever token guards node shutdown). If this loop is processing a large CompactSize-wide batch during shutdown, DisposeAsync in the enclosing FlatDbManager / compactor will block until every item in allStateIds has been converted.
| Parallel.ForEach( | |
| Parallel.ForEach( | |
| allStateIds, | |
| new ParallelOptions { CancellationToken = loader.CancellationToken }, | |
| state => | |
| { |
(Assuming IPersistedSnapshotLoader exposes the token, or thread it from PersistenceManager's own CTS.) If the token isn't easily available here, at minimum document that this loop is intentionally uncooperative with cancellation and that shutdown may be delayed by up to one full CompactSize-wide batch.
| // Ascending bucket order: each sub-CompactSize layer's inputs (the previous layer's | ||
| // outputs) exist before it runs. | ||
| foreach (KeyValuePair<int, List<StateId>> kv in buckets) | ||
| Parallel.ForEach(kv.Value, state => DoCompactSnapshot(state)); |
Medium: Parallel.ForEach ignores shutdown cancellation
Same pattern as PersistenceManager.cs line 238 (prior review M2): the compactor's own _cancelTokenSource is not passed as ParallelOptions.CancellationToken. If shutdown fires while a bucket's state list is large (e.g., at a high compaction level), the loop runs to completion before DisposeAsync can drain.
| Parallel.ForEach(kv.Value, state => DoCompactSnapshot(state)); | |
| Parallel.ForEach(kv.Value, | |
| new ParallelOptions { CancellationToken = _cancelTokenSource.Token }, | |
| state => DoCompactSnapshot(state)); |
Address review feedback on the long-finality concurrency paths:
- SnapshotCatalog.ReadEntry now rejects a non-persisted tier byte with the file's standard "wipe and resync" error instead of silently misclassifying a corrupt entry.
- IPersistedSnapshotCompactor.Enqueue becomes EnqueueAsync: it awaits a free bounded-queue slot (backpressure without blocking a thread) and takes the producer's cancellation token rather than owning a CancellationTokenSource. Its background workers and in-flight Parallel.ForEach observe process-exit directly; graceful disposal completes and drains the channels in stage order.
- PersistenceManager's persistence drain is now async (SemaphoreSlim mutex so it can await Enqueue under the lock) and its conversion Parallel.ForEach observes a process-exit-linked token; FlushToPersistence stays synchronous.
- FlatDbManager.PersistIfNeeded awaits the now-async AddToPersistence.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the test-only IArenaManager wrapper with the actual ArenaManager over a temporary directory, built via a new TestFixtureHelpers.CreateArenaManager factory that applies the same test config (page tracker disabled, arena size floored to one OS page). Fixtures own the temp dir and clean it up in TearDown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Halve PersistedSnapshotArenaPageCacheBytes from 8 GiB to 4 GiB (and the matching ConfigItem DefaultValue), bounding the mmap'd arena page cache and the tracker's own metadata (~17 MiB -> ~8.5 MiB). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve the conflict from #11967 (force-persist the head-reachable fork): our branch replaced DetermineSnapshotToPersist with the two-phase DetermineSnapshotAction, so the fix is ported onto our path rather than master's methods:
- Keep master's SnapshotRepository committed-head infra (SetLastCommittedStateId/GetLastCommittedStateId) and the FlatDbManager.AddSnapshot caller; drop master's now-superseded DetermineSnapshotToPersist/GetHeadAncestorAtBlockNumber/GetFinalizedSnapshotAtBlockNumber/TryFindAncestorStateAtBlock.
- DetermineSnapshotAction backstop and FlushToPersistence now seed from the committed head (GetLastCommittedStateId), which our ancestral FindSnapshotToPersist walk turns into "persist the head-reachable fork", so a forced persist no longer follows a longer non-canonical fork.
- Port master's two DetermineSnapshotToPersist fork tests to DetermineSnapshotAction; adapt master's two FlushToPersistence fork tests to our repo-owned snapshot model.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude finished @asdacap's task in 5m 6s
PR Re-Review (3rd pass): feat(FlatDB): long finality support with persisted snapshots
All findings from the prior two reviews are resolved. The cooperative-shutdown refactor (
Status of All Prior Findings
New Observations (No New Blockers)
The The two remaining
Verdict
Mergeable. All Critical/High/Medium findings from both prior reviews are addressed. The Null-object wiring, cooperative cancellation, and catalog corruption guard are all clean.
EXPB Benchmark Comparison
superblocks
Scenario: Client Processing (SSE)
K6 TTFB
realblocks
Scenario: Client Processing (SSE)
K6 TTFB
@asdacap need to confirm above numbers but seems like this comes with a big performance hit?
@claude any idea why expb shows such a perf degradation? Can it be expb changes, or does this PR have some inefficiencies?
Claude finished @kamilchodola's task in 8m 34s
Analyzing EXPB performance regression
Performance Regression Analysis
Is this EXPB tooling or PR overhead?
PR overhead. The scenario The ~1% overhead on superblocks vs ~6% on realblocks is the diagnostic signal. If the overhead were proportional to state-access volume (trie reads, etc.), superblocks, which do far more per block, would show more overhead. Instead, the inverse relationship points to a fixed per-block or per-persist overhead that's amortized away by the ~1000ms superblock processing time but visible against the ~24ms realblock baseline.
Identified overhead sources
1. In
// Master:
if (!HasForkAt(canonicalBlock)) return;
// Branch:
if (!HasForkAt(canonicalBlock) && !HasPersistedForkAt(canonicalStateId)) return;
In normal operation (no reorgs),
This runs every
Quick fix: short-circuit in
private bool HasPersistedForkAt(in StateId canonicalStateId)
{
    if (PersistedSnapshotCount == 0) return false; // nothing to check
    ...
}
2.
// TryLoadStateRlp: no Count guard
ulong statePathBloomKey = PersistedSnapshotBloomBuilder.StatePathKey(in path); // always runs
for (int i = _snapshots.Count - 1; i >= 0; i--) { ... } // never executes when empty
// TryLoadStorageRlp: no Count guard
ValueHash256 addressHash = address.ValueHash256;
ulong storageBloomKey = PersistedSnapshotBloomBuilder.StorageNodeKey(in addressHash, in path); // always runs
Compare with
Quick fix: add an early return at the top of both methods:
if (_snapshots.Count == 0) { rlp = null; return false; }
3.
// Branch (every base-snapshot removal):
StateId? newMax;
using (_sortedSnapshotStateIds.EnterWriteLock(out SortedSet<StateId> sortedSnapshots))
{
    sortedSnapshots.Remove(stateId);
    newMax = sortedSnapshots.Count == 0 ? null : sortedSnapshots.Max; // O(log n), always
}
using (_lastRegisteredLock.EnterScope())
    if (_lastRegisteredState == stateId) _lastRegisteredState = newMax;
Master's
4. The new shared BFS tries all 6 tiers (
Is the overhead acceptable?
If the EXPB scenario is intentionally testing
The author's own benchmark showed no overhead "at 300 blocks" vs the feature disabled, but that compared PR-enabled vs PR-disabled. The EXPB comparison is PR-enabled vs master's simpler code, which is why the overhead appears here but wasn't visible in the in-PR benchmark.
Recommended before merge: apply the two guard fixes (items 1 and 2) and re-run the EXPB benchmark to confirm the realblock regression closes. Items 3 and 4 are lower priority but addressable.
…mpty
ReadOnlySnapshotBundle caches the persisted-snapshot count once (immutable for the bundle's life) and gates every persisted-tier probe (GetAccount, GetSlot, DetermineSelfDestructSnapshotIdx, TryLoadStateRlp, TryLoadStorageRlp) on it being > 0. When no persisted snapshots exist (long finality disabled, or none persisted yet) the read path skips the persisted query entirely instead of calling into an empty PersistedSnapshotStack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Split the force-persist backstop depth into two configs: MaxReorgDepth (restored to the non-long-finality default of 256) and a new LongFinalityMaxReorgDepth (default 90000). PersistenceManager selects the backstop at construction — LongFinalityMaxReorgDepth when EnableLongFinality is on (the persisted tier serves deep reorgs), otherwise MaxReorgDepth — so a non-long-finality node force-persists much sooner instead of holding up to 90000 in-memory snapshots when finality stalls. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Default the hierarchical persisted-layer compaction ceiling from 8192 to 1048576 (1Mi) blocks; keep the ConfigItem DefaultValue in sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Log a Warn when the reorg-depth backstop (LongFinalityMaxReorgDepth / MaxReorgDepth) is exceeded and a persist is actually forced — restoring visibility lost when DetermineSnapshotToPersist was replaced by DetermineSnapshotAction. The warning is gated on the persist actually happening, so it does not fire when the backstop seed finds no candidate and the call falls through to a Phase 2 persisted-snapshot conversion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fallback
Remove the flat/long-finality-only LastRegisteredState mechanism (field, lock, property, AddStateId setter, tip-removal reset, interface member, and its test) and align the DetermineSnapshotAction backstop fallback with master: GetLastCommittedStateId() ?? GetLastSnapshotId() ?? latestSnapshot. The committed-head primary handles the common case, so only the rarely-hit fallback changes, from call-order (LastRegisteredState) to the longest-chain tip (GetLastSnapshotId).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compute the backstop depth as Max(configured backstop, MinReorgDepth + CompactSize) so MinReorgDepth can be configured at or above MaxReorgDepth/LongFinalityMaxReorgDepth without the finalized-persistence trigger and the backstop colliding at the same depth. The CompactSize margin gives the finalized trigger room to persist before the backstop force-persists. Normal configs (MinReorgDepth far below the backstop) are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Summary
Adds a long-finality storage tier to Nethermind.State.Flat so the in-memory
snapshot ring (today bounded by MaxInMemoryReorgDepth, default 256) can be
extended to ~90,000 blocks (configurable via LongFinalityReorgDepth) without
the per-snapshot RAM cost growing linearly. Snapshots that fall off the
in-memory tail are converted into immutable, mmap-backed persisted
snapshots stored as HSST blobs in arena files, then logarithmically merged
("Linked" compaction) so the snapshot count stays bounded.
The new tier is opt-in (FlatDb.EnableLongFinality = false by default) and
sits underneath the existing in-memory snapshot ring; the persistence
pipeline, RPC/sync surface and consensus paths are unchanged.
Motivation
MaxInMemoryReorgDepth (~256). Every
in-memory snapshot keeps its diff in RAM, so raising that ceiling is
proportionally expensive in heap.
large reorg window (long-finality CL designs, archive-lite serving)
need a way to keep many more snapshots cheap to retain and cheap to
read.
the read path means cold snapshots cost only file-descriptor + index
metadata until queried, and a PageResidencyTracker per arena drives
madvise(DONTNEED) so the resident set is bounded by a configured
page-cache budget rather than by total snapshot bytes on disk.
High-level design
Mirrors the in-memory Snapshot pipeline
A PersistedSnapshot is the on-disk twin of an in-memory Snapshot:
same (From, To) semantics, same StateId-keyed lookups, same
hierarchical compaction policy. The in-memory compactor in
SnapshotCompactor keys merge decisions off the lowest-set-bit of the
block number (blockNumber & -blockNumber), giving 1×, 2×, 4×, …,
CompactSize× layers; the persisted-tier compactor uses the same
lowest-set-bit rule (PersistenceManager.cs:118-128) so the
persisted-snapshot stack is just a continuation of the in-memory
snapshot stack across the tier boundary. As a result the number of
live persisted-snapshot layers is logarithmic in the configured
reorg depth rather than linear, with the same merge bookkeeping the
in-memory pipeline already implements. Code paths for reading
(GetFinalizedSnapshotAtBlockNumber falls through from in-memory
Snapshot to PersistedSnapshot at the same block/root if the
in-memory window has rotated past it), eviction, and compaction all
follow this mirror.
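As a rough illustration of the lowest-set-bit rule described above, the sketch below computes how wide a layer ending at a given block number would be. `CompactionLayerSketch` and `LayerSpan` are hypothetical names introduced for this example, and the cap at `compactSize` (assumed to be a power of two) is an assumption about how the 1×, 2×, 4×, …, CompactSize× layers are bounded, not code taken from the PR.

```csharp
using System;

// Hypothetical helper, not taken from the PR: it only illustrates the
// lowest-set-bit arithmetic that the description refers to.
public static class CompactionLayerSketch
{
    // Returns the assumed width (in blocks) of the widest layer that can end
    // at blockNumber, capped at compactSize (assumed to be a power of two).
    public static long LayerSpan(long blockNumber, long compactSize)
    {
        if (blockNumber <= 0) throw new ArgumentOutOfRangeException(nameof(blockNumber));

        // x & -x isolates the lowest set bit of x in two's complement.
        long lowestSetBit = blockNumber & -blockNumber;
        return Math.Min(lowestSetBit, compactSize);
    }
}
```

Under these assumptions, with a compact size of 32, block 6 closes a 2-block layer, block 8 an 8-block layer, and block 64 is capped at 32 blocks, which is why the number of live layers grows logarithmically rather than linearly.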
In-memory bloom filters skip cold persisted snapshots
Each persisted snapshot ships with a tiny in-RAM bloom filter
(PersistedSnapshotBloom, PersistedSnapshotBloomBitsPerKey
default 14.0 bpk). A single filter per snapshot covers
address/slot/self-destruct keys plus state-trie and storage-trie node
paths. Lookups iterate the snapshot stack from newest to oldest;
snapshots whose bloom says "definitely not here" are skipped without
touching the mmap'd HSST at all. This keeps the cost of walking a
deep snapshot stack proportional to the number of snapshots that
actually hold the key, not the total stack depth — important because
the long-finality stack is, by design, much deeper than today's
in-memory ring.
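The newest-to-oldest walk with bloom-based skipping can be sketched roughly as follows. All types here (`BloomSketch`, `SnapshotSketch`, `StackLookupSketch`) are hypothetical stand-ins written for this example; the hashing scheme, the 1024-bit filter size, and the dictionary standing in for the mmap'd HSST are assumptions and do not reflect the PR's PersistedSnapshotBloom implementation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical bloom filter; the PR's PersistedSnapshotBloom is not reproduced here.
public sealed class BloomSketch
{
    private readonly ulong[] _bits;
    private readonly int _hashCount;

    public BloomSketch(int bitCount, int hashCount)
    {
        _bits = new ulong[(bitCount + 63) / 64];
        _hashCount = hashCount;
    }

    // Derives the i-th probe position from one 64-bit key (double hashing).
    private int Position(ulong key, int i)
    {
        unchecked
        {
            ulong h1 = key * 0x9E3779B97F4A7C15UL;
            ulong h2 = ((key ^ (key >> 29)) * 0xBF58476D1CE4E5B9UL) | 1UL;
            return (int)((h1 + (ulong)i * h2) % ((ulong)_bits.Length * 64));
        }
    }

    public void Add(ulong key)
    {
        for (int i = 0; i < _hashCount; i++)
        {
            int p = Position(key, i);
            _bits[p >> 6] |= 1UL << (p & 63);
        }
    }

    public bool MightContain(ulong key)
    {
        for (int i = 0; i < _hashCount; i++)
        {
            int p = Position(key, i);
            if ((_bits[p >> 6] & (1UL << (p & 63))) == 0) return false; // definitely absent
        }
        return true; // possibly present (false positives are allowed)
    }
}

// Hypothetical snapshot: a dictionary stands in for the mmap'd HSST.
public sealed class SnapshotSketch
{
    public BloomSketch Bloom { get; } = new BloomSketch(1024, 4);
    private readonly Dictionary<ulong, byte[]> _data = new();
    public int Probes { get; private set; } // counts simulated HSST reads

    public void Put(ulong key, byte[] value) { Bloom.Add(key); _data[key] = value; }

    public bool TryGet(ulong key, out byte[]? value)
    {
        Probes++;
        return _data.TryGetValue(key, out value);
    }
}

public static class StackLookupSketch
{
    // The list is ordered oldest first, so the walk starts from the end.
    public static byte[]? Find(IReadOnlyList<SnapshotSketch> stack, ulong key)
    {
        for (int i = stack.Count - 1; i >= 0; i--)
        {
            SnapshotSketch snapshot = stack[i];
            if (!snapshot.Bloom.MightContain(key)) continue; // skip without touching the data
            if (snapshot.TryGet(key, out byte[]? value)) return value;
        }
        return null;
    }
}
```

In this sketch a snapshot that never stored the key is usually skipped after a few in-memory bit tests, so the cost of a lookup tracks the number of snapshots that might hold the key rather than the stack depth.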
Custom directly-indexable on-disk format (HSST)
Persisted snapshots are not RocksDB-shaped. They use a purpose-built
columnar format (HSST)
designed so that a single HSST blob can be point-queried by key
without loading the whole blob: readers walk an embedded B-tree
index from the tail and pull only the pages they need. The format
also supports nesting (a value can itself be an HSST), which is what
makes the per-address / per-addressHash inner tables possible inside
a single outer snapshot file. Multiple specialised index variants
(B-tree, packed-array, dense-byte-index, two-byte-slot-value, …) are
picked per column by the writer to minimise blob size and seek cost;
the index-type byte at the blob tail selects the variant. Combined
with `mmap` + `PageResidencyTracker`, a cold snapshot costs only a
file descriptor plus a few trailer pages of memory until something
actually queries it.
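
To make the "walk the index from the tail" idea concrete, here is a deliberately simplified sketch. The byte layout below is a toy invented for illustration and is not the HSST wire format (see `Hsst/FORMAT.md` for that): it uses fixed 4-byte keys, a flat sorted index instead of a B-tree, and a hypothetical type byte `0x7F`.

```python
import struct

FOOTER = struct.Struct("<IIB")   # index_offset, entry_count, index_type (toy layout)
ENTRY = struct.Struct("<4sII")   # fixed 4-byte key, value offset, value length
TOY_SORTED_ARRAY = 0x7F          # hypothetical type byte, not an HSST variant


def build(items: dict) -> bytes:
    """Write values first, then a sorted index, then a fixed-size footer."""
    body = bytearray()
    index = []
    for key in sorted(items):
        value = items[key]
        index.append(ENTRY.pack(key, len(body), len(value)))
        body += value
    index_offset = len(body)
    for entry in index:
        body += entry
    body += FOOTER.pack(index_offset, len(index), TOY_SORTED_ARRAY)
    return bytes(body)


def get(blob, key: bytes):
    """Point query: read the footer, binary-search the index, slice one value."""
    index_offset, count, index_type = FOOTER.unpack_from(blob, len(blob) - FOOTER.size)
    if index_type != TOY_SORTED_ARRAY:
        raise ValueError("unknown index type byte")
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        k, offset, length = ENTRY.unpack_from(blob, index_offset + mid * ENTRY.size)
        if k == key:
            return bytes(blob[offset:offset + length])
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

`get` touches only the footer, a logarithmic number of index entries, and one value range, so the same code works unchanged on an `mmap` object and faults in only the pages it reads.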
Trie RLP split out into separate blob arenas
The single biggest size win comes from not storing trie-node RLP
bytes inside the persisted-snapshot HSST. Each persisted snapshot's
HSST only stores 6-byte `NodeRef`s (`BlobArenaId: u16`,
`RlpDataOffset: i32`) where the actual RLP would have lived; the RLP
bytes themselves live in append-only blob arena files addressed by
those refs. In practice this shrinks the per-snapshot HSST by roughly
20× vs. storing RLP inline, because the trie-node payload
dominates the snapshot otherwise.
This split has knock-on benefits:
- … without ever rewriting the RLP. Compacted snapshots' `NodeRef`s
  still point at the original blob arena files.
- … references them holds a lease, and a file is only deleted when no
  live snapshot still references its `BlobArenaId`. The `NodeRef`-based
  deduplication that comes for free lets long-lived trie nodes survive
  arbitrarily many snapshot compactions without being copied.
- … `BlobArenaId` × 2 GiB-per-file (32-bit offset) caps each tier
  at ~128 TiB of trie RLP.
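
The 6-byte reference and the ~128 TiB ceiling can be checked with a short sketch. The field widths (`u16` + `i32`) come from the description; the little-endian byte order and the helper names are assumptions for illustration only.

```python
import struct

# u16 BlobArenaId followed by i32 RlpDataOffset; "<" means little-endian with
# no padding. The byte order is an assumption, not taken from NodeRef.cs.
NODE_REF = struct.Struct("<Hi")


def pack_node_ref(blob_arena_id: int, rlp_data_offset: int) -> bytes:
    """Encode a reference to RLP bytes stored in a blob arena file."""
    return NODE_REF.pack(blob_arena_id, rlp_data_offset)


def unpack_node_ref(raw: bytes):
    """Decode (blob_arena_id, rlp_data_offset) from a 6-byte reference."""
    return NODE_REF.unpack(raw)


# Address-space arithmetic: ushort.MaxValue files, each addressable up to 2 GiB
# because the offset is a signed 32-bit integer.
MAX_FILES = 0xFFFF
MAX_FILE_BYTES = 2 ** 31
capacity_tib = MAX_FILES * MAX_FILE_BYTES / 2 ** 40  # just under 128 TiB
```

Compaction can copy these 6 bytes from a source HSST into the merged HSST verbatim, which is why the RLP itself never needs rewriting.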
Two-tier persisted snapshots
- … `(To - From) < CompactSize`. This is the tier whose HSSTs are the
  small, mmap-backed, seekable index files driven by
  `PageResidencyTracker`. Page-cache budget:
  `PersistedSnapshotSmallArenaPageCacheBytes` (default 1 GiB resident).
- … ≥ `CompactSize` source snapshots. Same HSST shape, larger files;
  page-cache budget: `PersistedSnapshotLargeArenaPageCacheBytes`
  (default 8 GiB resident).
- … with `(To - From)` doubling at each level, bounded by
  `PersistedSnapshotMaxCompactSize` (default 1024) and capped at
  `PersistedSnapshotMaxCompactedSourceBytes` source bytes (default
  2 GiB) to keep merged HSSTs below `int.MaxValue` and the arena
  ceiling. Compaction only rewrites HSSTs; blob arenas are untouched
  and their RLP is inherited through the `NodeRef` indirection.
- … (`ArenaFileSizeBytes`, default 1 GiB) are append-only;
  reservations are tagged for per-tag count/byte metrics.
HSST — Hierarchical Static Sorted Table
A new self-describing immutable on-disk format
(`Nethermind.State.Flat/Hsst/`). Top-level layout and rationale are
documented in `Hsst/FORMAT.md`; the columnar persisted-snapshot
vocabulary lives in `PersistedSnapshotTags.cs`.

Index variants implemented (the trailing index-type byte selects):
- `0x01` `BTree`: variable-length values, value-first entries, tail
  B-tree index. Optional 4 KiB leading padding lets the merger
  4 KiB-align hot blobs without rebuilding the entry.
- `0x02` `PackedArray`: fixed key/value array with a recursive summary
  index and an optional hash table; SIMD floor-scan over summary levels.
- `0x04` `DenseByteIndex`: single-byte-keyed map indexed directly by
  tag. Used for per-address sub-tag walks (one pinned trailer read
  resolves any sub-tag).
- `0x05` / `0x06` `TwoByteSlotValue` / `TwoByteSlotValueLarge`: fixed
  2-byte key map for slot-suffix layer (`u16` / `u24` offsets).
- `0x07` `BTreeKeyFirst`: key-first entry layout for outer columns
  whose values are large nested HSSTs (parallel to the inner HSST's
  keys-first index).
Columnar layout (outer HSST → per-key inner HSSTs):

| Tag | Column |
|---|---|
| `0x00` | Metadata |
| `0x01` | Address (raw 20 bytes) → {`SlotSubTag`, `AccountSubTag`, `SelfDestructSubTag`} |
| `0x02` | AddressHash → {`StorageTopSubTag`, `StorageCompactSubTag`, `StorageFallbackSubTag`} |
| `0x03` | StateNode (compact path, 6-15 bytes) |
| `0x05` | StateTopNode (path 0-5) |
| `0x06` | StateNodeFallback (path 16+) |

Read-path engineering
- `WholeReadSession` over `mmap`'d HSST arenas, routed through a
  tracker-aware `ArenaByteReader` so every read participates in
  `PageResidencyTracker` clock-eviction accounting.
- … bloom subsection above); bloom memory exposed via gauge metrics.
- … `Vector512<long>` inline set) cuts repeated outer-HSST descents on
  hot addresses.
Lifecycle / GC
- … `Snapshot → PersistedSnapshot` conversion, trie nodes are marked
  persisted and pruned from the in-memory trie.
- `PageResidencyTracker` per arena drives `madvise(DONTNEED)` and
  (opt-in via `PersistedSnapshotFadviseOnPageEviction`)
  `posix_fadvise(POSIX_FADV_DONTNEED)`. Memory is reported to the
  GC and exposed as per-tier gauge metrics.
- `LongFinalityReorgDepth` is the hard tail: once exceeded, the
  oldest HSST snapshot is force-persisted into the RocksDB tier.
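
The clock-eviction accounting mentioned for `PageResidencyTracker` can be illustrated with a classic second-chance clock. This is a generic textbook sketch under assumed names, not the tracker's actual implementation; the point where a real tracker would call `madvise` is marked in a comment.

```python
class ClockTracker:
    """Toy second-chance (clock) tracker bounding resident pages to a budget."""

    def __init__(self, budget_pages: int):
        if budget_pages < 1:
            raise ValueError("budget_pages must be at least 1")
        self.budget = budget_pages
        self.slots = []     # each slot is [page_id, referenced_bit]
        self.index = {}     # page_id -> slot position
        self.hand = 0
        self.evicted = []   # eviction order, kept only for inspection

    def touch(self, page_id) -> None:
        """Record a read of page_id, evicting one page if the budget is full."""
        pos = self.index.get(page_id)
        if pos is not None:
            self.slots[pos][1] = True  # already resident: grant a second chance
            return
        if len(self.slots) < self.budget:
            self.index[page_id] = len(self.slots)
            self.slots.append([page_id, True])
            return
        # Sweep: clear referenced bits until an unreferenced victim is found.
        while self.slots[self.hand][1]:
            self.slots[self.hand][1] = False
            self.hand = (self.hand + 1) % self.budget
        victim = self.slots[self.hand][0]
        del self.index[victim]
        self.evicted.append(victim)    # a real tracker would madvise here
        self.slots[self.hand] = [page_id, True]
        self.index[page_id] = self.hand
        self.hand = (self.hand + 1) % self.budget
```

The resident set never exceeds `budget_pages`, regardless of how many distinct pages are read, which mirrors the claim that memory is bounded by the configured budget rather than by bytes on disk.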
Config surface
All new keys on `IFlatDbConfig` (default-off feature, conservative defaults):

| Key | Default | Notes |
|---|---|---|
| `EnableLongFinality` | `false` | |
| `LongFinalityReorgDepth` | `90000` | |
| `PersistedSnapshotPath` | `"snapshots"` | |
| `ArenaFileSizeBytes` | 1 GiB | |
| `PersistedSnapshotSmallArenaPageCacheBytes` | 1 GiB | |
| `PersistedSnapshotLargeArenaPageCacheBytes` | 8 GiB | |
| `PersistedSnapshotFadviseOnPageEviction` | `false` | |
| `PersistedSnapshotMaxCompactSize` | `1024` | … `(To-From)` doubling stops here). |
| `PersistedSnapshotMaxCompactedSourceBytes` | 2 GiB | … `int.MaxValue`). |
| `PersistedSnapshotBloomBitsPerKey` | `14.0` | |
| `ValidatePersistedSnapshot` | `false` | |

Observability
- … bloom memory, per-tag reservation count/bytes.
- `PageResidencyTracker` eviction counters.

Performance
Notes:
- … reorg window, so the persisted tier is effectively idle, which
  confirms the feature flag adds no measurable overhead when not
  exercised.
- … cost of walking the deeper snapshot stack for reads plus the
  background work of bloom construction, HSST writes, and
  hierarchical compaction.
- … fixed per-block overhead (lookups, bloom checks, HSST descent) is
  the dominant cost at very long stacks rather than per-gas state
  access cost.
- … `LongFinalityReorgDepth` is more likely to be the blob-arena
  address space than CPU: per tier the `NodeRef` encoding caps total
  trie-RLP storage at `ushort.MaxValue` files × 2 GiB ≈ 128 TiB, and
  the small-tier blob arena fills proportionally to the snapshot count
  rather than to compaction.
Compatibility
- … `EnableLongFinality=false` none of the persisted-snapshot code
  runs. RocksDB column-family layout is preserved (two retained empty
  column ordinals documented in `FlatDbColumns.cs`).
- … (`PersistedSnapshotTags.MetadataFormatVersion`) is `0x02`. The
  `SnapshotCatalog` v2→v3 schema bump trips a "wipe and resync" error
  if pointed at an older directory.
Scope
- … `+29,595 / -341`).
- … `Nethermind.State.Flat` (104), `Nethermind.State.Flat.Test` (42),
  `Nethermind.Core` (4: `NativeMemoryList`, `Leb128`),
  `Nethermind.Benchmark` (4), `Nethermind.Trie` / `Db` / `Init` /
  `Runner` (small wiring).
- … `refactor`, 132 `perf`, 38 `feat`, 36 `fix`, 13 `test`, 4 `docs`,
  3 `revert`, 3 `config`.

Review aids
Suggested reading order if reviewing top-down:
1. `Hsst/FORMAT.md`: wire format.
2. `NodeRef.cs` + `Storage/BlobArenaManager.cs`: the trie-RLP
   indirection that makes the HSSTs small.
3. `PersistedSnapshotTags.cs`: columnar vocabulary.
4. `PersistedSnapshot.cs`, `PersistedSnapshotReader.cs`: read path.
5. `PersistedSnapshotBuilder.cs`, `PersistedSnapshotMerger.cs`,
   `PersistedSnapshotCompactor.cs`: write/compaction path.
6. `Storage/ArenaManager.cs`, `Storage/PageResidencyTracker.cs`,
   `Storage/WholeReadSession.cs`: HSST arena lifecycle.
7. `PersistenceManager.cs`, `FlatWorldStateModule.cs`: wiring into
   the existing flat persistence pipeline.
Types of changes
Testing
Requires testing
If yes, did you write tests?
Notes on testing
- … `Nethermind.State.Flat.Test` cover HSST round-trips
  (all index variants), persisted-snapshot build/read/merge/compact paths,
  page-tracker eviction, bloom-filter correctness, and N-way merge cursor.
- `ValidatePersistedSnapshot=true` cross-checks every persisted snapshot
  against the source in-memory snapshot at conversion time; intended for
  diagnostic runs but available for review.
- … `HsstReaderBenchmark`, `PersistedSnapshotBenchmark`,
  `PersistedSnapshotCompactBenchmark` in `Nethermind.Benchmark/State/`.
- … `EnableLongFinality=true`, `ValidatePersistedSnapshot=true`, sync
  mainnet from snapshot for at least a window > `MaxInMemoryReorgDepth`
  and confirm:
  - … `PageResidencyTracker` resident bytes track the configured budget,
  - … `Invalid Block` / `Exception` lines in logs.

Documentation
Requires documentation update
… `Hsst/FORMAT.md`; user-facing docs page can come in a follow-up once the feature flips default)
Requires explanation in Release Notes
… (`FlatDb.EnableLongFinality`). See config keys above."

Remarks
- … is meaningful only end-to-end (format + write path + read path +
  compactor + lifecycle + observability). Splitting it further would
  introduce dead, unwired code into `master` between PRs.
- … `flat/long-finality`) and the branch is fast-forward-clean against
  `master`.