Refresh snap-sync account storage root via verified GetAccountRange#12082
Refresh snap-sync account storage root via verified GetAccountRange#12082asdacap wants to merge 2 commits into
Conversation
|
Claude finished @asdacap's task in 7m 23s —— View job Code Review2 Medium · 2 Low · 0 Critical · 0 High SummaryThe design is sound. Replacing the blind-trust Two issues need attention before merge. Medium1. Exception catch is too narrow in 2. Low3. Misleading comment in SnapSyncDownloader.cs:46 4. Other observations (no action needed)
|
| account = _accountDecoder.Decode(ref ctx); | ||
| return account is null ? RefreshVerifyResult.NotFound : RefreshVerifyResult.Verified; | ||
| } | ||
| catch (Exception e) when (e is TrieException or TrieNodeException or RlpException) |
There was a problem hiding this comment.
Medium — exception scope too narrow for untrusted P2P data
The filter catches TrieException, TrieNodeException, and RlpException, but malformed proof bytes from a malicious peer could also trigger NullReferenceException, ArgumentException, or IndexOutOfRangeException inside trie traversal or RLP decode — none of which are caught here. These would propagate through RefreshAccounts into SnapSyncFeed.HandleResponse (which has no catch block) and crash the sync task.
Per robustness.md: "Validate data from untrusted sources (P2P peers, …) before use." The safest fix is to broaden the catch to cover all exceptions from this throwaway trie:
| catch (Exception e) when (e is TrieException or TrieNodeException or RlpException) | |
| catch (Exception) | |
| { | |
| return RefreshVerifyResult.InvalidProof; | |
| } |
There was a problem hiding this comment.
Done in 920ce83 — broadened to catch (Exception). The verification trie is a throwaway over untrusted P2P proof data, so any failure assembling/decoding it now returns InvalidProof instead of propagating into HandleResponse.
(agent comment — Claude Opus 4.8 (1M context))
|
|
||
| Rlp.ValueDecoderContext ctx = new(accountRlp); | ||
| account = _accountDecoder.Decode(ref ctx); | ||
| return account is null ? RefreshVerifyResult.NotFound : RefreshVerifyResult.Verified; |
There was a problem hiding this comment.
Medium — null decode from non-empty RLP should be InvalidProof, not NotFound
accountRlp.IsEmpty was already checked on line 331 (returns early with NotFound), so by this point the trie returned non-empty RLP for this path. If _accountDecoder.Decode returns null from non-empty data, the account entry exists in the verified trie but is not a well-formed Account. That is a tampered/corrupt proof, not a "deleted account" scenario.
The current NotFound result is a "terminal success" that silently drops the storage sync for this account. If this ever happens, it is safer to return InvalidProof so the peer is scored and the sync retries. Suggest:
| return account is null ? RefreshVerifyResult.NotFound : RefreshVerifyResult.Verified; | |
| account = _accountDecoder.Decode(ref ctx); | |
| return account is null ? RefreshVerifyResult.InvalidProof : RefreshVerifyResult.Verified; |
There was a problem hiding this comment.
Resolved by the current approach: refresh no longer hashes or decodes a proof node itself. The account is taken from the verified GetAccountRange response (accounts[i].Account) only after VerifyAccountRange confirms the range reconstructs the state root; a malformed/null account makes that reconstruction fail and is caught as InvalidProof. So the non-empty-RLP→null-decode path no longer exists.
(agent comment — Claude Opus 4.8 (1M context))
| batch.AccountsToRefreshResponse = await handler.GetTrieNodes(batch.AccountsToRefreshRequest, cancellationToken); | ||
| // Refresh a single account via GetAccountRange so its storage root is verified | ||
| // against the state root. Limit = path + 1 (the protocol returns at least one | ||
| // account past the limit, so start == limit must be avoided). |
There was a problem hiding this comment.
Low — misleading comment
"the protocol returns at least one account past the limit" reads as if peers are obligated to overshoot the limit. The actual concern is that some peers treat start == limit as an empty range and return nothing. Suggest:
| // account past the limit, so start == limit must be avoided). | |
| // Use limit = path + 1 to avoid start == limit, which some peer | |
| // implementations interpret as an empty range and return nothing. |
There was a problem hiding this comment.
Done in 920ce83 — reworded to: "Use limit = path + 1 to avoid start == limit, which some peers treat as an empty range."
(agent comment — Claude Opus 4.8 (1M context))
| // account past the limit, so start == limit must be avoided). | ||
| AccountWithStorageStartingHash account = batch.AccountsToRefreshRequest.Paths[0]; | ||
| ValueHash256 path = account.PathAndAccount.Path; | ||
| AccountRange range = new(batch.AccountsToRefreshRequest.RootHash, path, path.IncrementPath()); |
There was a problem hiding this comment.
Low — IncrementPath overflow when path == Keccak.MaxValue
If path is Keccak.MaxValue (all 0xFF bytes), path.IncrementPath() overflows and returns Keccak.MaxValue again, producing startingHash == limitHash == MaxValue. This is astronomically unlikely for a keccak path, but worth a guard or comment so the behaviour is explicit:
| AccountRange range = new(batch.AccountsToRefreshRequest.RootHash, path, path.IncrementPath()); | |
| AccountRange range = new(batch.AccountsToRefreshRequest.RootHash, path, path.IncrementPath()); // IncrementPath is a no-op only for the astronomically-unlikely MaxValue path |
There was a problem hiding this comment.
Done in 920ce83 — added a note that IncrementPath is a no-op only for the unreachable MaxValue path.
(agent comment — Claude Opus 4.8 (1M context))
When a storage root mismatches during snap sync, the account was refreshed by fetching its storage-trie root node via GetTrieNodes and trusting the returned node's hash blindly. That refresh path is being removed in snaps/2 and never verified the new storage root against the pivot state root. Refresh now sends a single-account GetAccountRange (limit = path + 1) and verifies the returned account against the pivot state root by reconstructing the account range the same way account-range sync does: fill the boundary proof nodes, set the returned account(s), then check the recomputed root. The reconstruction runs in an isolated, empty-backed trie (no writes to the client state DB, so it cannot race the partition sync), and because the leaf is set from the response it does not rely on the proof containing it. Outcomes: verified (adopt storage root, re-queue storage), not found (deleted account, terminal), expired (stale pivot), or invalid proof (retry and score the peer). Refreshes are rare, so one account per request is fine. The shared verification is factored into SnapProviderHelper.BuildAndVerifyRoot, used by both CommitRange and the new no-commit VerifyAccountRange. The now-unused GetTrieNodes(AccountsToRefreshRequest) overload is left for the snaps/2 cleanup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a6072fc to
bea732f
Compare
| account = null; | ||
| IReadOnlyList<PathWithAccount> accounts = response.PathAndAccounts; | ||
| if (accounts.Count == 0) | ||
| return response.Proofs.Count == 0 ? RefreshVerifyResult.Expired : RefreshVerifyResult.NotFound; |
There was a problem hiding this comment.
Should NotFound require verification?
There was a problem hiding this comment.
That is true, notfound still need proof.
There was a problem hiding this comment.
Fixed in 920ce83. NotFound is now only returned after VerifyAccountRange succeeds — i.e. the verified range proves the account is absent at the state root. An empty response (no accounts to verify) returns Expired and retries instead of concluding deletion without proof. Added TestRefreshAccount_VerifiedNotFound covering the proven-absent path.
(agent comment — Claude Opus 4.8 (1M context), on behalf of @asdacap)
- Require proof for NotFound: an empty GetAccountRange response no longer concludes the account was deleted; it retries (Expired) instead, so a terminal NotFound is only reached via a verified range proof. - Fail closed on malformed proofs: VerifyRefreshedAccount now catches all exceptions from the throwaway verification trie (untrusted P2P data) and returns InvalidProof rather than letting them crash the sync task. - Clarify the limit = path + 1 comment (avoids start == limit, which some peers treat as an empty range) and note the MaxValue no-op. - Add a regression test for the verified-NotFound (proven-absent) path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Changes
When a storage root mismatches during snap sync, the account was refreshed by fetching its storage-trie root node via
GetTrieNodesand trusting the returned node's hash blindly. That refresh path is being removed in snaps/2 and never verified the new storage root against the pivot state root.GetAccountRange(startingHash = path,limitHash = path + 1) and adopts the storage root only after verifying the account against the pivot state root.MemDb-backedPatriciaSnapTrieFactory) and does not commit, so it writes nothing to the client state DB and cannot race the partition account-range sync. An incomplete proof fails verification instead of being completed from live client data.SnapProviderHelper.BuildAndVerifyRoot, used by both the existingCommitRangeand the new no-commitVerifyAccountRange, so refresh verification is identical to account-range verification.SnapProvider.RefreshAccountsnow returnsAddRangeResult, so refresh responses feedAnalyzeResponsePerPeer): verified → adopt the storage root and re-queue storage; not found (deleted account) → terminal, no retry; expired (peer lacks the root) → retry; invalid proof → retry and score the peer.ProgressTracker.DequeAccountToRefreshdequeues one account per request (one refresh perGetAccountRange). Refreshes are rare, so this is fine.GetTrieNodes(AccountsToRefreshRequest)overload is intentionally left in place for the snaps/2 cleanup; the genericGetTrieNodes(GetTrieNodesRequest)used by state healing is untouched.Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
Added parameterized regression tests in
SnapServerTest:TestRefreshAccount(correct root verifies and propagates the real storage root; wrong root is rejected withDifferentRootHash) andTestRefreshAccount_LeafMissingFromProof(proof with leaf nodes stripped still verifies, since the account is set from the response — this fails under a proof-walk that assumes the leaf is present). FullNethermind.Synchronization.TestSnapSync + state-sync-feed suites pass (940 passed, 8 fixture-conditional skips), including the end-to-endSnapSync_*integration tests and the account/storage range tests that exercise the refactoredCommitRange.Documentation
Requires documentation update
Requires explanation in Release Notes