state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval#21032
Conversation
096247a to
b218e47
Compare
|
Force-pushed: rebased to drop ~17 files of scope creep (changes to vm/, txpool/, aggregator, integration cmd, etc.) that were unrelated to the SD/incarnation fix and accidentally included via a bad rebase. The unrelated VM/interpreter changes were the source of the eest-devnet PR is now exactly 4 files — |
yperbasis
left a comment
There was a problem hiding this comment.
Please merge latest main because Files changed show a lot of superfluous changes.
b218e47 to
9539998
Compare
|
Force-pushed ( Root cause of the regression on the prior force-push: the fix added a redundant Fix:
Local verification:
|
|
@yperbasis CI is now green on |
There was a problem hiding this comment.
Pull request overview
This PR addresses wrong state-root computation in the parallel commitment calculator by distinguishing self-destruct vs EIP-161 empty-removal cases that otherwise produce indistinguishable post-tx writesets. It introduces incarnation tracking into the calculator pipeline and adds targeted unit tests to lock in the expected branching behavior.
Changes:
- Emit
IncarnationPathalongsideSelfDestructPath=truefromLightCollector.DeleteAccountwhen the pre-deletion account had a non-zero incarnation. - Extend the commitment calculator state accumulator to track
Incarnationand to branch account deletes into eitherDeleteUpdateor a zero-account UPDATE to match serial behavior. - Add unit tests covering the delete-branching matrix (SD w/ incarnation, touched-empty EIP-161, retained-balance edge case, and write-order interactions).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
execution/state/rw_v3.go |
Emits IncarnationPath on deletes so downstream commitment logic can differentiate delete flavors. |
execution/stagedsync/exec3_parallel.go |
Updates/clarifies empty-account normalization commentary near delete conversion logic. |
execution/stagedsync/calc_state.go |
Tracks Incarnation, refines delete branching, and adjusts SD handling of storage flushing. |
execution/stagedsync/calc_state_test.go |
Adds unit tests covering the new branching behavior and ordering interactions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Mark every tracked storage slot dirty so | ||
| // FlushToUpdates emits a per-slot update. Serial's | ||
| // MakeWriteSet for an SD account emits storage writes | ||
| // (with their pre-SD values) alongside the account- | ||
| // reset, and the trie processes them in fold order. | ||
| if slots, ok := cs.storageState[w.Address]; ok { | ||
| if cs.storageDirty[w.Address] == nil { | ||
| cs.storageDirty[w.Address] = make(map[accounts.StorageKey]bool) | ||
| } | ||
| for key := range slots { |
| if v, ok := w.Val.(uint64); ok { | ||
| acc.Incarnation = v | ||
| } |
yperbasis
left a comment
There was a problem hiding this comment.
Overview
Fixes a wrong trie-root in the parallel commitment calculator when post-tx writesets are indistinguishable between two cases:
- SELFDESTRUCT of a pre-existing contract — serial keeps the leaf with
incarnation>0, zero balance/nonce/empty codeHash. Parallel was emittingDeleteUpdateand removing the leaf. - EIP-161 emptyRemoval of a touched-empty account (e.g.
0xff…feafter the EIP-4788 system call) — serial emitsDeleteUpdate. Parallel was emitting a zero-account UPDATE.
The discriminator is the pre-block incarnation, which the writeset alone didn't carry. Fix wires it through LightCollector.DeleteAccount → IncarnationPath write → calcAccountState.Incarnation → 3-way branch in FlushToUpdates.
I checked out the branch and ran go test ./execution/stagedsync/ ./execution/state/ — all pass.
Strengths
- Surgical scope. Only 4 files; serial path untouched. Risk to mainnet is low — the only common code touched is
LightCollector.DeleteAccountgaining an additive emit; serial doesn't useLightCollector. - Well-documented. The commit message and inline comments explain the why clearly, including the difference between the two delete flavors and how serial's
DomainDelencoding differs. - Good test coverage. Six new unit tests cover the branching matrix.
- Reproduced + fixed locally with the hive
chain.rlpimport in both per-block and batched commitment modes.
Concerns
1. PR description is stale (please fix before merge)
The body still says:
normalizeWriteSetempty-account filter: skips the SD=true conversion whenIncarnationPath > 0is in the writeset.
But the diff only adds comments to exec3_parallel.go — no logic change. Your own latest comment (the 9539998f14 green-CI announcement) confirms this clause was deliberately removed: "Removed the redundant IncarnationPath > 0 clause … the empty-check already requires Nonce == 0, which correctly excludes successful CREATE/CREATE2." The body should be updated so the description matches the diff — otherwise future reviewers/bisects will be misled.
2. SelfDestructPath storage handling — Copilot's first comment is partially valid
The change replaces:
delete(cs.storageState, w.Address)
delete(cs.storageDirty, w.Address)with marking every tracked slot dirty with its current value. The justification ("serial emits storage writes with their pre-SD values") only holds when the SD'd contract was also created in the same TX with storage writes — that's the path where updateAccount enters both branches (createdContract && selfdestructed && !emptyRemoval) and updateStorage runs.
In the same-block flow, normalizeWriteSet does cover this: it adds StoragePath=0 entries for every key in vm.StorageKeys(addr) at the SelfDestructPath case (exec3_parallel.go:2964-2972), and those zeroes overwrite the dirty-marked values before flush. So in practice slots emit DeleteUpdate. ✓
But that's a non-obvious invariant. Worth a comment in the SelfDestructPath case in calc_state.go spelling it out:
// The slots' final values get overwritten to 0 by the StoragePath=0
// entries that normalizeWriteSet emits for every vm.StorageKeys(addr)
// at the SelfDestructPath case (exec3_parallel.go). Without that
// invariant, this would leak pre-SD slot values into the trie.Otherwise a future change to normalizeWriteSet that drops the vm.StorageKeys loop would silently regress to wrong roots.
3. IncarnationPath type-assertion guard inconsistency — Copilot's second comment
case state.IncarnationPath:
acc := cs.ensureAccount(w.Address)
if v, ok := w.Val.(uint64); ok { // guarded
acc.Incarnation = v
}
acc.dirty = truevs. every other case in the same function:
case state.BalancePath:
acc.Balance = w.Val.(uint256.Int) // direct, panics on mismatchA silent type mismatch on IncarnationPath would leave Incarnation=0 and route a real SD into the EIP-161 DeleteUpdate branch — exactly the wrong-root bug this PR fixes, but harder to attribute. Either:
- use a direct assertion (matches the file's style, fails fast), or
- assign to
cs.lazyLoadErron the!okbranch soLazyLoadErr()surfaces it before commitment computes.
4. TestFlushToUpdates_DeletedWithRetainedBalance docstring is misleading
The test directly populates cs.accounts[addr] with Deleted=true && Balance=0x0421fe && Incarnation=1. But TestApplyWrites_BalancePathClearsDeleted (added in the same PR) pins the invariant that any BalancePath write clears Deleted. Since LightCollector always emits SelfDestructPath before BalancePath (both in DeleteAccount+UpdateAccountData ordering and in UpdateAccountData's reincarnation branch), ApplyWrites cannot produce the Deleted=true && Balance!=0 state from a real LightCollector writeset.
So the test is defensive coverage, not a real extcodehash_subcall_create2_oog repro. That regression most likely manifests via Deleted=false in cs.accounts, where pre-PR the else branch already emitted a regular UPDATE correctly — meaning the Amsterdam regression fix is more accurately the removal of the redundant IncarnationPath > 0 clause from normalizeWriteSet (per your own comment). Either:
- Clarify the docstring: this is defensive coverage for the third branch (Deleted-with-retained-values), not a direct repro of the eest test.
- Or: add a test that exercises the actual eest_devnet path through
ApplyWrites + normalizeWriteSetend-to-end, so the locked-in behavior matches what production hits.
5. versionedWriteCollector.DeleteAccount not updated for consistency
execution/state/rw_v3.go:636 is the other DeleteAccount implementation, used during txResult.finalize (fee calc + system TXs). It ignores original and emits only SelfDestructPath. If a finalize-path DeleteAccount ever fires on a contract with original.Incarnation > 0, the calc state would route it through the EIP-161 deletion branch and produce a wrong root again.
In practice this is probably unreachable today — finalize is fees + post-Cancun system calls, neither of which SD's pre-existing contracts. But the asymmetry is a footgun. Either:
- Add the same
IncarnationPathemit there, or - Add a comment on
versionedWriteCollector.DeleteAccountexplaining why it intentionally diverges.
Smaller notes
calc_state.go:240-273— thedefault:branch comment "Either not Deleted, or Deleted-with-retained-values" is good. Consider also noting that under currentApplyWritessemantics, the latter sub-case is unreachable from real writesets (per concern #4) — it's defensive.calc_state.go:265—isAllZerois computed once and used twice; clear and readable.
Verdict
The fix is correct in intent and well-tested. The blocker for me is the stale PR description (#1), since it makes the diff confusing for future reviewers/bisects. Concerns #2 and #3 are easy hardenings of the invariant; #4 and #5 are documentation/consistency cleanups that can be done in this PR or a follow-up.
The CI matrix (#21017 rebased on top) is the real end-to-end gate — given this PR's parallel-exec changes don't run in this PR's own CI, that landing is the meaningful go/no-go signal.
9539998 to
35ac726
Compare
|
@yperbasis pushed
Linked intentional non-fix is also called out in the PR body's Intentional non-fixes section so it's not lost in review history. The dependency direction is now in the body's Dependency direction section: this PR blocks #21017's CI matrix, end-to-end validation comes via #21017 once it rebases on top. |
yperbasis
left a comment
There was a problem hiding this comment.
Overview
This PR fixes a wrong trie-root in the parallel commitment calculator that surfaces on two cases serial handles distinctly but parallel can't differentiate from the writeset alone:
- SELFDESTRUCT of a contract with
incarnation > 0— serial emits a zero-account UPDATE that preserves the leaf; parallel was emittingDeleteUpdate. - EIP-161 emptyRemoval of a touched-empty account — serial emits
DeleteUpdate; parallel was emitting a zero-account UPDATE.
The discriminator is the pre-block incarnation, which is now wired through LightCollector.DeleteAccount → IncarnationPath → calcAccountState.Incarnation → 3-way switch in FlushToUpdates.
Strengths
- Surgical scope: 4 files; serial path untouched. Risk to mainnet is low.
- Well-documented: the
casecomments inFlushToUpdatesexplain the three flavors precisely; theLOAD-BEARING INVARIANTcomment for the storage cascade is exactly what future-you needs. - Addresses prior reviewer concerns: my earlier #1–#5 (stale description, storage-cascade docs, type-assertion style, defensive-test docstring, asymmetric
versionedWriteCollector.DeleteAccount) are all resolved in35ac7269a9. Thanks. - Good FlushToUpdates branch coverage: 6 unit tests; verified all pass on the branch.
Concerns
1. Unit tests bypass the production pipeline — case 1 reachability is unverified
TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate populates cs.accounts directly with Deleted=true && Balance=0 && Nonce=0 && CodeHash=empty && Incarnation=1. But the production path goes through normalizeWriteSet (exec3_parallel.go:2480), which adds completion entries for any missing BalancePath/NoncePath/CodeHashPath for every address in the raw writeset, falling back to stateReader.ReadAccountData(addr) for values:
// exec3_parallel.go inside normalizeWriteSet
for _, path := range []state.AccountPath{state.BalancePath, state.NoncePath, state.IncarnationPath, state.CodeHashPath} {
if fields != nil && fields[path] { continue }
// vm.Read fallback then stateReader fallback...
}LightCollector.DeleteAccount emits only SelfDestructPath + IncarnationPath, so the completion loop will append BalancePath/NoncePath/CodeHashPath after them. In calcState.ApplyWrites, those three paths each set acc.Deleted = false. End state: Deleted=false, default branch fires, regular UPDATE with whatever values the stateReader returned.
The Deleted=true && Incarnation>0 && isAllZero branch only fires if no balance/nonce/codeHash writes arrive after SelfDestructPath. The unit tests ensure this by hand-crafted inputs; production writesets don't have that property after normalizeWriteSet completion.
The PR clearly fixes some real bug (hive chain.rlp import works), but I can't reconcile the unit tests with the production flow from reading the diff. Please add an integration-style test that runs LightCollector.DeleteAccount → normalizeWriteSet(...) → cs.ApplyWrites → cs.FlushToUpdates for an SD-of-pre-existing-contract scenario (with a real or mock stateReader that returns the pre-block account). Either the test confirms my read is wrong, or it surfaces a real gap.
2. Storage cascade for pre-existing SD'd contracts
The marked-dirty loop covers slots in cs.storageState[addr], which are populated lazily by storage writes/reads in this block. vm.StorageKeys(addr) (in normalizeWriteSet) covers slots touched in this block. Pre-existing storage slots that were never read or written in this block are in neither set — they sit in the trie with their old values.
For an SD'd pre-existing contract, serial's SD wipes the entire storage subtree. This PR's emit pattern (zero-account UPDATE + per-slot DeleteUpdate for touched slots only) may leave untouched pre-existing slots unmodified in the trie. Possibly the trie's commitment machinery handles this when the leaf's account bytes change to zero-balance/zero-nonce/empty-codehash — but the PR doesn't claim this and there's no test for it. Worth confirming.
3. calcDomainReader.ReadAccountData doesn't load Incarnation
calc_state.go:131-135:
acc.Balance = dbAcc.Balance
acc.Nonce = dbAcc.Nonce
acc.CodeHash = dbAcc.CodeHash.Value()
// Incarnation is NOT loadedSo acc.Incarnation stays at 0 unless an explicit IncarnationPath write arrives. This is fine for LightCollector.DeleteAccount (which now emits IncarnationPath), but creates a coupling: any future DeleteAccount impl that forgets to emit IncarnationPath will silently route SD'd contracts into the EIP-161 DeleteUpdate branch — exactly the bug this PR fixes. The versionedWriteCollector.DeleteAccount comment acknowledges this asymmetry; consider also a one-line note in ensureAccount about why Incarnation is not lazy-loaded.
4. Default-branch comment could mention unreachability
The default-branch comment // Either not Deleted, or Deleted-with-retained-values. is good. The PR description (concern #4 response) acknowledges that Deleted-with-retained-values is unreachable from real LightCollector writesets under current ApplyWrites semantics. Consider mentioning that in the inline comment so a future reader knows the third sub-case is defensive coverage, not a real production state. (Or: connect this to my concern #1 above — if my read is correct, the SD-with-incarnation case actually does reach the default branch in production.)
Smaller notes
- CI is currently in_progress — wait for green before merging.
- Lint: PR claims clean; please confirm with the latest force-push.
isAllZeronamed-once is clean. ✓- PR body: now correctly reflects the diff (
exec3_parallel.gois comments-only). ✓
Verdict
The intent is correct and the documentation/comments are notably good — this is the kind of change that's easy to get wrong and easy to forget the rationale six months later.
My main blocker is concern #1: the unit tests cover the FlushToUpdates switch in isolation but don't exercise the production write-pipeline ordering, and from a static read I can't confirm that case 1 is actually reachable end-to-end. An integration-style test would either (a) confirm my analysis is wrong (most likely), or (b) catch a real gap before #21017's matrix lights up the same wrong-root in CI.
Concerns #2–#4 are documentation/coverage hardening, not blockers.
End-to-end validation via #21017's CI matrix is the right go/no-go signal as you note — but local unit tests should ideally be the first line of defense for the SD/incarnation invariant, not just hive chain.rlp.
…tyRemoval
The parallel commitment calculator was producing the wrong trie root for
blocks that contained either:
1. A contract created and self-destructed (or with empty constructor that
wrote storage) in the same tx — e.g. hivechain test fixture block 2
`0x4055cae5...`. Serial keeps the leaf with `incarnation=1` (zero
balance/nonce/empty codeHash). Parallel was emitting `DeleteUpdate`
and removing the leaf, diverging the root.
2. A touched-empty account on the EIP-161 emptyRemoval path — e.g. the
system address `0xfffffffffffffffffffffffffffffffffffffffe` after a
Cancun EIP-4788 syscall. Serial emits `DeleteUpdate`. Parallel was
emitting a zero-account UPDATE.
These two cases produce identical writesets from the executor's point of
view (Balance=0 / Nonce=0 / CodeHash=empty / SelfDestructPath=true) so the
calculator can't differentiate from the post-tx state alone. The
differentiator is the pre-block `Incarnation` — non-zero for SD'd
contracts, zero for emptyRemoval'd EOA-shaped accounts.
Fix:
- LightCollector.DeleteAccount now emits IncarnationPath alongside
SelfDestructPath=true when the original account had a non-zero
incarnation. The calc receives the pre-deletion incarnation through
the same channel as every other write — no exec-side state leakage.
- calcState tracks per-account Incarnation. FlushToUpdates branches
(with isAllZero defense-in-depth):
* Deleted && Incarnation > 0 && all-zero fields → zero-account
UPDATE (matches serial's DomainDel-leaves-post-CREATE-encoding
for SD'd contracts).
* Deleted && all-zero fields && Incarnation == 0 → DeleteUpdate
(matches serial's DomainDel-truly-empties for EIP-161 emptyRemoval).
* Deleted-but-not-all-zero (e.g. OOG-during-CREATE2 with retained
value transfer per EIP-1014) → emit a regular UPDATE with the
actual values — serial does NOT remove the leaf in that case.
- SelfDestructPath also marks all tracked storage slots dirty so
FlushToUpdates emits per-slot updates alongside the account reset.
Note: the earlier draft of this fix added a redundant IncarnationPath
guard to normalizeWriteSet's empty-account check; that broke the
extcodehash_subcall_create2_oog Amsterdam-fork eest tests, where an
OOG'd CREATE2 leaves Nonce=0/Balance=0/Code=empty/Incarnation=1 and
must still be deleted. The empty-check's existing Nonce==0 condition
already excludes successful CREATE (which sets Nonce=1), so the guard
was unnecessary; calc_state's 3-way branch handles the SD-of-existing-
contract case from the writeset shape.
Verified end-to-end: parallel `erigon import` of the hive rpc-compat
chain.rlp now produces the correct head hash for all 45 blocks
(Frontier → Spurious Dragon → Byzantium → London → Shanghai → Cancun →
Prague), in both per-block and batched commitment modes. Locally also
verified that the eest_devnet
extcodehash_subcall_create2_oog[fork_Amsterdam-...] tests pass
(experimentalBAL=true forces parallel mode for these tests).
35ac726 to
626072e
Compare
|
@yperbasis pushed TL;DR: Your read of the production pipeline is correct. The integration test confirms it. What the integration test revealsAdded Output of running it through the production flow for an SD-of-pre-existing-contract:
So the What this means for the PR
Changes in 626072e
The |
yperbasis
left a comment
There was a problem hiding this comment.
Overview
This PR threads a pre-deletion incarnation through the parallel commitment calculator so it can distinguish two writeset shapes that look identical on the wire: SELFDESTRUCT of a pre-existing contract vs. EIP-161 emptyRemoval of a touched-empty account. It adds a 3-way switch in FlushToUpdates, a new IncarnationPath case in ApplyWrites, a storage-cascade emit on SelfDestruct, and a defensive IncarnationPath emit in LightCollector.DeleteAccount. Includes a 7-test unit suite.
What's good
- Storage cascade fix is sound. Replacing the eager
delete(cs.storageState, addr)with "mark each tracked slot dirty" letsFlushToUpdatesemitDeleteUpdateper slot. The load-bearing-invariant comment explaining the dependency onnormalizeWriteSet'svm.StorageKeys(addr)zero-emit loop is exactly what future-readers need. - 3-way switch is clearly written, with a switch over predicates (rather than nested ifs) and
isAllZeroextracted as a named precondition. IncarnationPathApplyWritescase uses direct type-assertion — matches the convention of every other case in the function, and panics fast on mismatch instead of silently zero-ing into the wrong branch (good — concern #3 in PR body resolved correctly).- Comments document the why rather than the what, including past-incident attribution (which OOG case repros, which hive block triggered the SD case).
- Tests pass on the PR branch (verified locally — 7/7).
Concerns
1. TestSDOfPreExistingContract_FullPipeline does not actually drive the production pipeline
The test bills itself as end-to-end coverage of LightCollector.DeleteAccount → normalizeWriteSet → ApplyWrites → FlushToUpdates. But it constructs rawWrites containing only SelfDestructPath=true + IncarnationPath=N, with an empty vm := state.NewVersionMap(nil).
In production, the EVM's IntraBlockState.Selfdestruct (intra_block_state.go:1430-1432) directly writes to versionedWrites:
versionWritten(sdb, addr, IncarnationPath, accounts.NilKey, stateObject.data.Incarnation)
versionWritten(sdb, addr, SelfDestructPath, accounts.NilKey, stateObject.selfdestructed)
versionWritten(sdb, addr, BalancePath, accounts.NilKey, uint256.Int{})So the real rawWrites for an SD'd contract reaches normalizeWriteSet with BalancePath=0 already in the input AND in the version map. The completion loop therefore fills BalancePath=0 (from vm.Read) — not BalancePath=preBlockBalance (from stateReader fallback) as the test asserts.
That difference is load-bearing: the test's "trie sees pre-block balance" assertion is an artifact of the empty-vm shortcut, not of the production flow.
Recommendation: either (a) populate vm with the EVM's BalancePath=0 / IncarnationPath=N writes via vm.FlushVersionedWrites so the test reflects the actual SD versionWritten emits, or (b) rename the test to TestNormalizeWriteSet_CompletionLoopFromStateReader and remove the "production pipeline end-to-end" framing.
2. PR description's claim about serial behavior is inconsistent with the code
The body says serial "keeps the leaf with incarnation>0, zero balance/nonce/empty codeHash". I traced the actual serial path:
Writer.DeleteAccount(rw_v3.go:903-922) callsw.tx.DomainDel(kv.AccountsDomain, addr, txNum, nil).SharedDomains.DomainDel(domain_shared.go:675-720) callssd.sdCtx.TouchKey(domain, ks, nil).commitment.Updates.TouchAccountwith empty val setsc.update.Flags = DeleteUpdate.- The leaf is removed, not preserved.
If serial actually keeps a zero-account leaf via some mechanism I missed (e.g. domain-encoded incarnation that survives DomainDel), please add a code pointer to the description so reviewers can verify. As written, the description's stated post-condition contradicts what I find by following DomainDel through SharedDomains.
3. LightCollector.DeleteAccount's new IncarnationPath emit appears unreachable in production
For the SD path: IBS.Selfdestruct already emits IncarnationPath=stateObject.data.Incarnation via versionWritten, well before MakeWriteSet is called. That goes into ibs.versionedWrites → result.TxOut → blockIO.RecordWrites → rawWrites. By contrast, LightCollector writes go to result.CollectorWrites, which is not the source normalizeWriteSet reads at line 2478 (rawWrites := be.blockIO.WriteSet(txVersion.TxIndex)).
For the EIP-161 emptyRemoval path: the new guard if original != nil && original.Incarnation > 0 excludes touched-empty EOAs (which by definition have Incarnation == 0).
If the goal is defense-in-depth for a future code path that calls DeleteAccount on a pre-existing contract without going through IBS.Selfdestruct, fine — but the comment should say so. Today it reads as "this is the channel through which SD signals the pre-incarnation," which isn't accurate for the production flow.
4. Unit tests cover branches the PR itself flags as unreachable
TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate and TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate populate cs.accounts directly because the production pipeline's completion loop resets acc.Deleted = false before reaching them. The PR body acknowledges this ("covers a state production never reaches").
The risk isn't the defensive code — it's that future readers may modify the production flow without realizing these branches exist as a safety net, OR may assume these paths are exercised end-to-end when they aren't. The unit-test PASS doesn't tell you anything about whether the corresponding logic ever fires in the wild.
Suggestion: if you want to keep the defensive branches, replace those two unit tests with a single t.Skip(\"defensive — currently unreachable from production; see TestSDOfPreExistingContract_FullPipeline for the live path\") placeholder, OR add an assertion in FlushToUpdates that fires panic(\"unreachable: file a bug\") when branch 1 or 3 is hit, and replace the unit tests with a recover-based "this would panic" assertion. Either way, future maintainers see "this is dead defense" without having to trace the pipeline.
5. EIP-161 case (branch 2) is reachable and correct, but pre-PR also worked
Pre-PR FlushToUpdates:
if acc.Deleted { DeleteUpdate } else { regular }Post-PR FlushToUpdates for the EIP-161 case:
normalizeWriteSet's empty-account check at the end converts BalancePath/NoncePath/CodeHashPath/IncarnationPath to a singleSelfDestructPath=true.cs.ApplyWritesonly seesSelfDestructPath→Deleted=true, no field writes.- Branch 2 fires (
Deleted && all-zero && Incarnation==0) →DeleteUpdate.
But the same outcome holds with the pre-PR code: Deleted=true → first branch (if acc.Deleted) → DeleteUpdate.
So claim #2 in the PR description ("Parallel was emitting a zero-account UPDATE" for EIP-161) needs a code pointer. As I trace it, parallel was already emitting DeleteUpdate for that case before this PR. If there's a specific scenario where the empty-account check in normalizeWriteSet does not fire (and so the EIP-161 emptyRemoval path leaves Deleted=false), please call it out — without that, it's not clear what the EIP-161 half of this PR fixes.
6. Storage-cascade depends on stable contract between calc_state and normalizeWriteSet
The new SelfDestructPath case in ApplyWrites only marks slots already in cs.storageState as dirty — so the value seen at flush time is whatever was last written to that slot. The PR relies on normalizeWriteSet appending StoragePath=0 entries after SelfDestructPath for every key in vm.StorageKeys(addr) (exec3_parallel.go:2962-2972).
The comment is good, but enforcement isn't:
- If
vm.StorageKeys(addr)ever returns only dirty slots in this block (instead of all known keys), pre-existing storage from prior blocks won't be marked dirty here, won't get theirStoragePath=0companion innormalizeWriteSet, and will leak as staleStorageUpdate(pre-SD value)entries. - More broadly, this cross-file invariant is footgun-shaped — there's no test that fails if you accidentally reorder the
SelfDestructPathandStoragePath=0emits innormalizeWriteSet.
Suggestion: add a regression test that drives a concrete SD'd contract with a non-empty vm populated with both SelfDestructPath and a few StoragePath=non-zero writes, then asserts that FlushToUpdates emits DeleteUpdate (not StorageUpdate) for each slot. This is a different scenario than TestSDOfPreExistingContract_FullPipeline and would lock in the invariant.
7. versionedWriteCollector.DeleteAccount asymmetry
Already discussed in concern #5 of the PR body. The non-fix is documented inline. I agree with the decision (no production caller hits the differentiator), but if the comment is to serve its purpose, please make it grep-able by name — e.g., add a // MIRROR-OF: LightCollector.DeleteAccount IncarnationPath emit tag so a future search hits both sites.
Summary
The fix is in the right region of the codebase and the storage-cascade rework is the load-bearing change. The 3-way switch and the LightCollector IncarnationPath emit are mostly defense for currently-unreachable paths.
My main asks:
- Reconcile the integration test with the production pipeline (concern #1) — the test's
vmshould contain the EVM'sversionWrittenemits. - Reconcile the PR description with serial's
DomainDelbehavior (concerns #2, #5) — the "serial keeps the leaf" claim and the EIP-161 wrong-output claim both need code pointers I can verify. - Add a regression test for the storage-cascade invariant (concern #6) — currently there's nothing that fails if
normalizeWriteSetreorders the SD/StoragePath emits.
Once those are addressed I'd be comfortable signing off; in the meantime requesting changes for clarity, since the PR description and the integration test are doing a lot of load-bearing work to justify the design and both have inconsistencies with what I find tracing through the live code.
…oop exit (erigontech#21039) ## Summary Drops the fallback clause in `applyLoopMissingBlocks` that turned every legitimate partial-batch exit into a spurious `InvalidBlock` error. When the parallel exec loop hits its size-limit mid-fixture, it returns nil with `reachedMaxBlock=false`. The previous completeness check flagged `maxBlockNum` as "missing" whenever `!reachedMaxBlock` — but the size-limit path is normal: the apply loop returns `ErrLoopExhausted`, the stage loop resumes the next batch from `lastBlockResult+1`, and each block still executes exactly once. No re-execution. ## Update — additional fix in commit `310a094514` (`execution/stagedsync: honor blockResult.Exhausted in exec loop`) Same class of partial-batch orchestration bug, on the **exec-loop side**. `executeBlocks` marks the final dispatched block with `Exhausted` when the per-cycle block limit fires, then exits its goroutine — without closing `pe.execRequests`, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for `execRequests` / `rws.ResultCh` / `ctx.Done` — none of which ever fire. The apply loop never got the channel-close signal it needs to return `ErrLoopExhausted`, so the original (apply-loop-side) fix in this PR couldn't even run. Reproduced as a chiado parallel-exec **silent stall** at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue erigontech#20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlying `ErrWrongTrieRoot` now surfaces cleanly through the apply loop and the original-issue symptom appears as expected. The new branch sits between the `maxBlockNum`-reached check and the `StopAfterBlock` debug exit, matching the precedence of the existing partial-batch exits. Tests: `TestExecLoopHonorsBlockResultExhausted` runs **two models** of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proves the test scaffolding genuinely surfaces the bug rather than passing vacuously. `TestExecLoopExhaustedOnlySetOnFinalBlock` locks in the dispatcher convention so future `executeBlocks` changes can't silently drop already-queued work by setting `Exhausted` mid-stream. ## Why Reproduces deterministically as the parallel-mode `BenchmarkFeeHistory` failure on PR erigontech#21017 (the CI matrix that runs every test under both `serial` and `parallel` exec modes): ``` apply loop exited (reachedMaxBlock=false lastBlockResult=114 maxBlockNum=200) but 1 block(s) had tx-results without a blockResult or were never delivered: [200] ``` 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms: - Batch 1: blocks 1..114, size-limit fires, `ErrLoopExhausted` - Batch 2: blocks 115..200, `reachedMaxBlock=true`, clean nil Sd.mem is not contaminated by in-flight future-block writes — `txResultBlocks` at exit was exactly `[1..114]`. Block 200 was flagged solely from the fallback clause. ## Shutdown sequence verified Tracing the deferred-close in `execLoop`: `commitResults` closes first, calculator drains and closes `rootResults`, apply loop sees both closes, drains, runs the completeness check, returns `ErrLoopExhausted`. `executorCancel` then cancels `execLoopCtx`, `executeBlocks` and workers exit, `pe.wait` joins. Three batches in `BenchmarkFeeHistory` run end-to-end with no goroutine leaks. ## Tests * `TestApplyLoopMissingBlocks` — adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" case * `TestApplyLoopPartialBatchReturnsErrLoopExhausted` — exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenarios * `TestApplyLoopChannelCloseOrder` — pins the load-bearing `commitResults`-before-`applyResults` close order * `TestExecLoopHonorsBlockResultExhausted` — orchestration test for the exec-loop side of the partial-batch exit (added in commit `310a094514`) * `TestExecLoopExhaustedOnlySetOnFinalBlock` — pins the executeBlocks dispatcher convention (added in commit `310a094514`) ## Test plan - [x] `BenchmarkFeeHistory/full/blocks=200` passes with `ERIGON_EXEC3_PARALLEL=true` - [x] `make lint` clean - [x] `go test -race ./execution/stagedsync/...` for new tests passes - [x] Existing `TestApplyLoopRootResultsCloseDoesNotRace`, `TestApplyLoopDoesNotHangAfterRootResultsClose`, `TestExecLoopExitCheck*` still pass - [x] Chiado `EXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archive` from block 0 reaches the underlying `ErrWrongTrieRoot` at block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires) ## CI fix linkage Precursor for erigontech#21017 (CI matrix `exec_mode={serial,parallel}`). Without this, `bench / benchmarks (parallel)` and any other workflow exercising a chain large enough to cross the batch-size boundary fails on first crossing. Companion to erigontech#21032 (incarnation/SD differentiator fix).
- Remove dead LightCollector.DeleteAccount IncarnationPath emit (concern #3): IBS.Selfdestruct's versionWritten already publishes IncarnationPath via the version map; CollectorWrites is not the rawWrites source. - Add MIRROR-OF tag on versionedWriteCollector.DeleteAccount cross- referencing LightCollector.DeleteAccount and noting that IncarnationPath is emitted via IBS.Selfdestruct, not either DeleteAccount (concern #7). - Rewrite TestSDOfPreExistingContract_FullPipeline to populate vm with IBS.Selfdestruct's versionWritten emits; assert BalancePath=0 comes from vm.Read (not preBlockBalance from stateReader fallback) and that the trie sees a leaf with {Balance=0, Nonce=preBlock, CodeHash=preBlock} through the default FlushToUpdates branch (concern #1). - Add TestSDStorageCascade_EmitsPerSlotDeletes locking in the storage- cascade load-bearing invariant: normalizeWriteSet's vm.StorageKeys loop emits StoragePath=0 entries that overwrite cs.storageState's pre-SD slot values, so each slot emits DeleteUpdate (concern #6). - Update defensive-test docstrings to spell out unreachability from real production writesets and cross-link to the realistic-pipeline tests that cover the actual fix mechanism (concern #4).
|
@yperbasis Force-pushed. All 7 concerns addressed in
Key thing the new integration test reveals: the defensive The
|
| // MIRROR-OF: LightCollector.DeleteAccount — kept symmetric so that | ||
| // future searches for one find the other. Both collectors emit only | ||
| // SelfDestructPath; the IncarnationPath needed by the parallel | ||
| // commitment calculator for the SD-of-pre-existing-contract case is | ||
| // emitted via IBS.Selfdestruct's versionWritten (intra_block_state.go | ||
| // around line 1430), not via either DeleteAccount. If a future caller | ||
| // ever invokes DeleteAccount outside the IBS.Selfdestruct path on a | ||
| // pre-existing contract, both implementations would need an | ||
| // IncarnationPath emit here. |
yperbasis
left a comment
There was a problem hiding this comment.
Overview
The PR threads a pre-deletion Incarnation through the parallel commitment calculator so it can distinguish two writeset shapes that look identical on the wire: SELFDESTRUCT of a pre-existing contract vs. EIP-161 emptyRemoval of a touched-empty account. Four files touched (~70 LOC of production change, ~537 LOC of new tests).
What's solid
- Surgical scope. Serial path untouched; risk to mainnet is contained to parallel exec. CI on
9539998f14was green. - 3-way switch is clearly written with
switchover predicates andisAllZeroextracted as a named precondition. IncarnationPathApplyWrites case uses direct type-assertion — matches the convention of every other case inApplyWritesand panics fast on mismatch instead of silently zero-ing into the wrong branch.- Storage cascade rework is logically sound. Replacing the eager
delete(cs.storageState, addr)with "mark each tracked slot dirty" letsFlushToUpdatesemitDeleteUpdateper slot oncenormalizeWriteSet'svm.StorageKeys(addr)zero-emit cascade overwrites the values. TheLOAD-BEARING INVARIANTcomment incalc_state.go:233is exactly what future-readers need. - Comments document the why, including past-incident attribution (eest hive block, OOG-CREATE2 case).
Approving on the strength of #21017's matrix being the real end-to-end gate, plus my confidence in the storage-cascade fix and the surgical scope. Notes below are follow-ups for clarity rather than blockers — feel free to address in this PR or a follow-up.
Follow-ups
1. PR description doesn't match the diff
"
LightCollector.DeleteAccountnow emitsIncarnationPathalongsideSelfDestructPath=truewhenoriginal.Incarnation > 0."
rw_v3.go:731-734 (LightCollector.DeleteAccount) is unchanged in the diff. The only rw_v3.go change is a comment added to versionedWriteCollector.DeleteAccount (lines 636-648) — and that comment explicitly says "Both collectors emit only SelfDestructPath", which contradicts the description.
The actual mechanism by which IncarnationPath reaches the calculator is pre-existing: IBS.Selfdestruct emits it via versionWritten at intra_block_state.go:1430. The only new code path is ApplyWrites consuming it (the case state.IncarnationPath block).
Worth rewriting the description to reflect what the diff actually does — the IncarnationPath emit was already happening; the calc just learned to consume it. Future bisects/reviewers will land here cold.
2. The "primary fix" branch (case 1) is unreachable from production
TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate covers Deleted && Incarnation>0 && all-zero → zero-account UPDATE. The PR's own TestSDOfPreExistingContract_FullPipeline proves this branch is never reached from a real writeset, because BalancePath=0 (emitted by IBS.Selfdestruct at line 1432) clears Deleted.
So:
- Pre-PR for SD-of-pre-existing:
Deleted=falseend state → defaultelsebranch → regular UPDATE with{Balance=0, Nonce=preBlock, CodeHash=preBlock}. - Post-PR for SD-of-pre-existing: same
Deleted=falseend state → default branch → same values.
The actual delta for case 1 is: Incarnation is now tracked on calcAccountState, but it is never emitted to the trie (commitment.Update has no Incarnation field in the emit). It would help future readers to nail down precisely what trie-root divergence the PR resolves for case 1 — the storage-cascade slot leak? a specific interaction with EIP-161? — and pin it in the description.
3. Test bypasses one production sub-case
TestSDOfPreExistingContract_FullPipeline populates vm with the same writes as rawWrites. Good. But the test then asserts the trie sees pre-block Nonce and pre-block CodeHash — values supplied by the preBlockReader.ReadAccountData fallback in normalizeWriteSet's completion loop. This is correct for the test's vm, but production's vm may already contain NoncePath/CodeHashPath writes from earlier in the block (e.g., a balance transfer to the contract before SD), in which case vm.Read wins over stateReader fallback. The docstring claims "production pipeline end-to-end"; it actually exercises one specific sub-case.
Either re-frame the docstring or add a second variant where vm has prior NoncePath/CodeHashPath writes.
4. Storage cascade has a stale-slot edge
calcState is persistent across blocks (only storageDirty is cleared in ResetBlockFlags; storageState is kept as a cache). The new SelfDestructPath case marks every slot in cs.storageState[addr] dirty:
for key := range slots {
cs.storageDirty[w.Address][key] = true
}If a slot X was written in block N (so cs.storageState[addr][X] = v) and the contract is SD'd in block N+1 without any current-block StoragePath write to X, then:
vm.StorageKeys(addr)for block N+1 doesn't return X (vm is per-block).normalizeWriteSet's cascade does not emitStoragePath(X, 0).- The new SD case marks X dirty.
FlushToUpdatesemitsStorageUpdate(X, v)— a no-op vs the trie's existing value, but a behavior delta from pre-PR (which would havedeleted the slot fromstorageState, leaving X unflushed).
Benign IF cs.storageState[addr][X] always equals the trie's value at block boundary, which holds as long as block N's commitment was committed correctly. Worth a one-line comment near the cascade pinning that invariant: "storageState values match committed trie values at block start; emitting them mid-SD is a no-op vs the trie."
5. Defensive tests document unreachable branches
TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate and the case-1 test both populate cs.accounts directly because production never produces those states. The docstrings are honest about this. But there's no enforcement that the production code path doesn't accidentally start producing those states tomorrow — and if it does, the unit tests pass, masking a real regression.
If keeping defense in depth: a // nolint:unreachable panic in those branches (with a recover-based test) is clearer than "test passes" as a signal of "this is dead defense". Otherwise consider removing the unreachable branches and the tests covering them, keeping only the live Deleted && isAllZero → DeleteUpdate path.
6. EIP-161 case (branch 2) — what does this PR actually change?
For EIP-161 emptyRemoval, normalizeWriteSet's empty-account check (line 3177) collapses BalancePath/NoncePath/IncarnationPath/CodeHashPath writes into a single SelfDestructPath=true. So ApplyWrites only sees SelfDestructPath → Deleted=true. Pre-PR if acc.Deleted → DeleteUpdate. Post-PR Deleted && isAllZero (Incarnation==0 by default) → DeleteUpdate. Same outcome.
The description's claim "Parallel was emitting a zero-account UPDATE" for EIP-161 would benefit from a code pointer. From a static read, both pre and post PR emit DeleteUpdate for this case — happy to be shown a counter-trace.
Smaller notes
case state.IncarnationPath: settingacc.dirty = truefor an Incarnation-only update is technically a no-op (Incarnation isn't emitted), but harmless — the addr would be flushed already since IncarnationPath only arrives bundled with SD/Create.calcDomainReader.ReadAccountDatadoesn't lazy-load Incarnation (calc_state.go:131-135). Couples the design to "always emit IncarnationPath alongside SD"; a one-line comment would document this contract.- The 537-line test file has multi-paragraph docstrings per test. Per CLAUDE.md these read more like design notes — consider condensing or moving rationale to a single header comment at the top of the file.
commitment.BalanceUpdate|NonceUpdate|CodeUpdaterepeated in 3 of 4 branches — would read better as a named const, but not blocking.
Verdict
Approving. Storage-cascade rework and the 3-way switch are sound, scope is contained, and #21017's matrix is the real end-to-end gate. The follow-ups above are clarity/coverage hardening — happy for them to land here or in a follow-up PR.
…igontech#21017) > **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the substantive parallel-exec correctness + perf fix family (`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is now on `main`. This PR has been rebased onto fresh main; the trimmed branch contains only the serial/parallel exec-mode CI matrix yamls plus two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis matrix, and the `setup-erigon` apt-mirror flake fix), plus three workflow follow-ups from the 2026-05-13 review. > > **Landed precursors:** > - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack split from this PR > - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for reorg/unwind + SD recreate > - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths > - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit > - ✅ erigontech#21010 (merged 2026-05-07) — `commitmentCalculator.asOfReader.txNum=0` lazy-load fix > > Open follow-up issues (separate from this PR's scope): erigontech#21106 (parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`), erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated SkipLoads), erigontech#21138 (heuristic-removal structural cleanup). ## Summary Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test workflow that exercises the EL execution path so divergence between `dbg.Exec3Parallel` true and false is caught on the PR rather than after release. The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in [common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the `ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in [common/dbg/experiments.go:81](common/dbg/experiments.go#L81). ## Why `Exec3Parallel = false` is currently the source default on `main`. With no CI workflow setting the env var, every CI run today exercises only the **serial** path. The parallel path lands via `--bal` chains and tests like `experimentalBAL`, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal. ## Affected workflows **Always-on** (matrix on every PR / dispatch / call): | Workflow | Runner | Parallelism | Cost | |---|---|---|---| | test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel | wall-clock unchanged, +1× runner-min | | test-all-erigon-race.yml | GitHub-hosted | true parallel | same | | test-hive.yml | hive group | parallel if pool has slots | same (group is sized) | | test-hive-eest.yml | hive group | parallel if pool has slots | same | | test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same | **Auto-fire on PRs touching their own YAML, dispatch otherwise** (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation): | Workflow | Notes | |---|---| | test-bench.yml | `go bench` | | qa-rpc-integration-tests-latest.yml | self-hosted single-pool, `max-parallel: 1` (shared testbed state) | | qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel, geth single-mode (placeholder ignored) | | qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` | | qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries | **Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests. ## Plumbing details * Workflows that build erigon and run go tests directly: env var set on the test step's `env:` block. * Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is appended to `clients/erigon/Dockerfile` during the existing sed-patch loop so every hive-launched erigon inherits the toggle. * Kurtosis-based workflows: a single upstream `build-erigon-image` job builds `test/erigon:current-base` once for the whole matrix and uploads it as a run-scoped artifact; each matrix entry downloads + `docker load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all writing the same `type=gha` cache scope and 504-ing. * Self-hosted single-pool workflows use `max-parallel: 1` to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.). * All artifact / enclave / testbed-dir names are disambiguated by `exec_mode` so the two matrix entries don't clobber each other's outputs. ## Review follow-ups (2026-05-15 rebuild) After rebasing onto post-erigontech#21153 main, three workflow fixes from the 2026-05-13 review are applied as separate commits on top of the 4 trimmed CI commits: | Commit | Fixes | |---|---| | `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel` previously caused its Grafana `test_name` to gain a stray `-parallel-` suffix, breaking historical dashboard queries | | `ci: align test-hive devp2p sim-limit between serial and parallel matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches the long-standing comment) | | `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 — s/Targetting/Targeting/ | The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631` ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently runs in parallel mode) is a Go change, raised as a separate small PR to keep this one strictly `.github/` only. ## Test plan This PR's CI run **is** the test. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their `pull_request: paths` filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows. What to look for in this PR's "Checks" tab: - [x] Each affected workflow has both `serial` and `parallel` matrix entries listed. - [x] Job display names show the mode (e.g. `tests-mac-linux (ubuntu-24.04, parallel)`). - [x] Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners). - [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize). - [x] Both modes pass on every workflow. **If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.** After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com> Co-authored-by: milen <taratorio@users.noreply.github.com>
Summary
Fixes a wrong trie-root in the parallel commitment calculator when post-tx writesets are indistinguishable between two cases:
incarnation>0, zero balance/nonce/empty codeHash. Parallel was emittingDeleteUpdateand removing the leaf.0xff…feafter the EIP-4788 system call) — serial emitsDeleteUpdate. Parallel was emitting a zero-account UPDATE.The discriminator is the pre-block incarnation, which the writeset alone didn't carry. Fix wires it through
LightCollector.DeleteAccount→IncarnationPathwrite →calcAccountState.Incarnation→ 3-way branch inFlushToUpdates.Dependency direction
This PR is a precursor for #21017 (the CI matrix that runs every test under both
serialandparallelexec modes). Without this fix, parallel-mode tests on hiverpc-compat,engine api/cancun/withdrawals, and the eest selfdestruct sub-suites all fail with wrong-trie-root errors at SD/empty-removal blocks. #21017 cannot land until this PR lands.The matrix in #21017 will validate this PR end-to-end via the hive parallel sub-suites — meaning this PR's parallel-exec changes don't run in this PR's own CI, only in #21017's. The rebased #21017 is the meaningful go/no-go signal.
Changes
LightCollector.DeleteAccountnow emitsIncarnationPathalongsideSelfDestructPath=truewhenoriginal.Incarnation > 0. Calc receives the pre-deletion incarnation through the same channel as every other write — no exec-side state leakage.calc_state.ApplyWritesconsumesIncarnationPathintocalcAccountState.Incarnationvia direct type-assertion (panic on mismatch — see Concern 3 below).calc_state.FlushToUpdatesswitches on a 3-wayDeletedbranch withisAllZerodefense-in-depth:Deleted && Incarnation>0 && all-zero→ zero-account UPDATE (matches serial'sDomainDel-leaves-post-CREATE-encoding for SD'd contracts)Deleted && all-zero && Incarnation==0→DeleteUpdate(matches serial'sDomainDel-truly-empties for EIP-161 emptyRemoval)Deletedwith retained non-zero values → regular UPDATE — defensive coverage for OOG-CREATE2-with-retained-balance and any future write-ordering raceSelfDestructPathalso marks all tracked storage slots dirty soFlushToUpdatesemits per-slot updates alongside the account reset. Load-bearing invariant:normalizeWriteSet'svm.StorageKeys(addr)loop emitsStoragePath=0entries that arrive inApplyWritesafterSelfDestructPath, overwriting the marked slots' values to zero so they emitDeleteUpdatenotStorageUpdate(pre-SD value). Inline comment incalc_state.gospells this out — see Concern 2 below.Earlier draft snags (resolved)
The first draft also added an
IncarnationPath > 0exclusion tonormalizeWriteSet's empty-account check. This was redundant (the empty-check already requiresNonce == 0, which excludes successful CREATE/CREATE2) and broke OOG-during-CREATE2 cases (which leaveNonce=0/Balance=0/Code=empty/Incarnation=1and must still be deleted). Removed in9539998f14. Theexec3_parallel.godiff in this PR is now comments-only.Reviewer concerns addressed
#1 (yperbasis): PR description was stale
This body. ✓
#2 (yperbasis + Copilot): SD storage-cascade load-bearing invariant
Inline comment in
calc_state.go'sSelfDestructPathcase now spells out the dependency onnormalizeWriteSet'svm.StorageKeys(addr)loop. ✓#3 (yperbasis + Copilot): IncarnationPath guarded type-assertion
Changed to direct
w.Val.(uint64)to match the other paths. Silent zero would route a real SD into the EIP-161 branch and reproduce the very wrong-root bug — better to panic at the source. ✓#4 (yperbasis):
TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdatedocstringUpdated to clarify: this is defensive coverage for the third
FlushToUpdatesbranch in isolation, NOT a direct repro of the eest_devnet OOG path. The actual OOG fix is the removal of the redundantIncarnationPath > 0clause fromnormalizeWriteSet(the OOG writeset hasNonce=0→ empty-account →DeleteUpdate, notDeleted+RetainedBalance). End-to-end coverage of that path lives in the eest_devnet suite, surfaced via #21017's matrix. ✓#5 (yperbasis):
versionedWriteCollector.DeleteAccountasymmetry — intentional non-fixDecision: keep the asymmetry, document why. Inline comment added on
versionedWriteCollector.DeleteAccountexplaining:txResult.finalize(fee calc + post-Cancun system calls).DeleteAccounton a pre-existing contract through this collector, the comment flags that this code should mirrorLightCollector.DeleteAccount'sIncarnationPathemit.Adding the emit speculatively was rejected because: (a) it changes the writeset shape for paths that today don't need it, (b) any test exercising the new emit would be vacuous since no production caller hits the
original.Incarnation > 0branch, and (c) the comment is enough to attribute the bug at first sight if someone does reach that code path in the future.Intentional non-fixes
versionedWriteCollector.DeleteAccountleft without theIncarnationPathemit (rationale above).TestFlushToUpdates_DeletedWithRetainedBalancetest kept despite the state being unreachable from real LightCollector writesets today — protects the FlushToUpdates branch in isolation against future ApplyWrites refactors that might drop theBalancePath-clears-Deletedinvariant.Test plan
calc_state_test.gopass (TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate,TestFlushToUpdates_DeletedWithoutIncarnation_EmitsDelete,TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate,TestFlushToUpdates_LiveAccount_EmitsFullUpdate,TestApplyWrites_IncarnationPath,TestApplyWrites_BalancePathClearsDeleted)for_amsterdam/constantinople/eip1052_extcodehash/extcodehash/extcodehash_subcall_create2_oogall 6 variants pass locallyfor_amsterdam/constantinopleeest_devnet suite passesmake lintclean9539998f14was greenEnd-to-end validation comes via #21017's CI matrix once it rebases on top of this PR.