Skip to content

state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval#21032

Merged
mh0lt merged 3 commits into
mainfrom
fix/parallel-exec-prague-setcode
May 8, 2026
Merged

state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval#21032
mh0lt merged 3 commits into
mainfrom
fix/parallel-exec-prague-setcode

Conversation

@mh0lt

@mh0lt mh0lt commented May 7, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a wrong trie-root in the parallel commitment calculator when post-tx writesets are indistinguishable between two cases:

  1. SELFDESTRUCT of a pre-existing contract — serial keeps the leaf with incarnation>0, zero balance/nonce/empty codeHash. Parallel was emitting DeleteUpdate and removing the leaf.
  2. EIP-161 emptyRemoval of a touched-empty account (e.g. 0xff…fe after the EIP-4788 system call) — serial emits DeleteUpdate. Parallel was emitting a zero-account UPDATE.

The discriminator is the pre-block incarnation, which the writeset alone didn't carry. Fix wires it through LightCollector.DeleteAccountIncarnationPath write → calcAccountState.Incarnation → 3-way branch in FlushToUpdates.

Dependency direction

This PR is a precursor for #21017 (the CI matrix that runs every test under both serial and parallel exec modes). Without this fix, parallel-mode tests on hive rpc-compat, engine api/cancun/withdrawals, and the eest selfdestruct sub-suites all fail with wrong-trie-root errors at SD/empty-removal blocks. #21017 cannot land until this PR lands.

The matrix in #21017 will validate this PR end-to-end via the hive parallel sub-suites — meaning this PR's parallel-exec changes don't run in this PR's own CI, only in #21017's. The rebased #21017 is the meaningful go/no-go signal.

Changes

  • LightCollector.DeleteAccount now emits IncarnationPath alongside SelfDestructPath=true when original.Incarnation > 0. Calc receives the pre-deletion incarnation through the same channel as every other write — no exec-side state leakage.
  • calc_state.ApplyWrites consumes IncarnationPath into calcAccountState.Incarnation via direct type-assertion (panic on mismatch — see Concern 3 below).
  • calc_state.FlushToUpdates switches on a 3-way Deleted branch with isAllZero defense-in-depth:
    • Deleted && Incarnation>0 && all-zero → zero-account UPDATE (matches serial's DomainDel-leaves-post-CREATE-encoding for SD'd contracts)
    • Deleted && all-zero && Incarnation==0DeleteUpdate (matches serial's DomainDel-truly-empties for EIP-161 emptyRemoval)
    • Deleted with retained non-zero values → regular UPDATE — defensive coverage for OOG-CREATE2-with-retained-balance and any future write-ordering race
  • SelfDestructPath also marks all tracked storage slots dirty so FlushToUpdates emits per-slot updates alongside the account reset. Load-bearing invariant: normalizeWriteSet's vm.StorageKeys(addr) loop emits StoragePath=0 entries that arrive in ApplyWrites after SelfDestructPath, overwriting the marked slots' values to zero so they emit DeleteUpdate not StorageUpdate(pre-SD value). Inline comment in calc_state.go spells this out — see Concern 2 below.

Earlier draft snags (resolved)

The first draft also added an IncarnationPath > 0 exclusion to normalizeWriteSet's empty-account check. This was redundant (the empty-check already requires Nonce == 0, which excludes successful CREATE/CREATE2) and broke OOG-during-CREATE2 cases (which leave Nonce=0/Balance=0/Code=empty/Incarnation=1 and must still be deleted). Removed in 9539998f14. The exec3_parallel.go diff in this PR is now comments-only.

Reviewer concerns addressed

#1 (yperbasis): PR description was stale

This body. ✓

#2 (yperbasis + Copilot): SD storage-cascade load-bearing invariant

Inline comment in calc_state.go's SelfDestructPath case now spells out the dependency on normalizeWriteSet's vm.StorageKeys(addr) loop. ✓

#3 (yperbasis + Copilot): IncarnationPath guarded type-assertion

Changed to direct w.Val.(uint64) to match the other paths. Silent zero would route a real SD into the EIP-161 branch and reproduce the very wrong-root bug — better to panic at the source. ✓

#4 (yperbasis): TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate docstring

Updated to clarify: this is defensive coverage for the third FlushToUpdates branch in isolation, NOT a direct repro of the eest_devnet OOG path. The actual OOG fix is the removal of the redundant IncarnationPath > 0 clause from normalizeWriteSet (the OOG writeset has Nonce=0 → empty-account → DeleteUpdate, not Deleted+RetainedBalance). End-to-end coverage of that path lives in the eest_devnet suite, surfaced via #21017's matrix. ✓

#5 (yperbasis): versionedWriteCollector.DeleteAccount asymmetry — intentional non-fix

Decision: keep the asymmetry, document why. Inline comment added on versionedWriteCollector.DeleteAccount explaining:

  • It's wired only into txResult.finalize (fee calc + post-Cancun system calls).
  • Neither path SDs a pre-existing contract today, so the SD-with-incarnation differentiator is unreachable from here.
  • If a future caller ever does emit DeleteAccount on a pre-existing contract through this collector, the comment flags that this code should mirror LightCollector.DeleteAccount's IncarnationPath emit.

Adding the emit speculatively was rejected because: (a) it changes the writeset shape for paths that today don't need it, (b) any test exercising the new emit would be vacuous since no production caller hits the original.Incarnation > 0 branch, and (c) the comment is enough to attribute the bug at first sight if someone does reach that code path in the future.

Intentional non-fixes

  • Concern Fix for the p2p deadlock when remapping the database #5 aboveversionedWriteCollector.DeleteAccount left without the IncarnationPath emit (rationale above).
  • Defensive TestFlushToUpdates_DeletedWithRetainedBalance test kept despite the state being unreachable from real LightCollector writesets today — protects the FlushToUpdates branch in isolation against future ApplyWrites refactors that might drop the BalancePath-clears-Deleted invariant.

Test plan

  • All 6 unit tests in calc_state_test.go pass (TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate, TestFlushToUpdates_DeletedWithoutIncarnation_EmitsDelete, TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate, TestFlushToUpdates_LiveAccount_EmitsFullUpdate, TestApplyWrites_IncarnationPath, TestApplyWrites_BalancePathClearsDeleted)
  • eest_devnet for_amsterdam/constantinople/eip1052_extcodehash/extcodehash/extcodehash_subcall_create2_oog all 6 variants pass locally
  • Full for_amsterdam/constantinople eest_devnet suite passes
  • make lint clean
  • CI on 9539998f14 was green

End-to-end validation comes via #21017's CI matrix once it rebases on top of this PR.

@mh0lt

mh0lt commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Cross-ref: #21017 is the CI matrix PR that surfaced these bugs. Once this lands, #21017 will rebase on top to validate end-to-end via the hive parallel sub-suites.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is red

@mh0lt

mh0lt commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Force-pushed: rebased to drop ~17 files of scope creep (changes to vm/, txpool/, aggregator, integration cmd, etc.) that were unrelated to the SD/incarnation fix and accidentally included via a bad rebase. The unrelated VM/interpreter changes were the source of the eest-devnet extcodehash_subcall_create2_oog failures (gas-tracking change in interpreter affected EXTCODEHASH OOG paths).

PR is now exactly 4 files — rw_v3.go LightCollector.DeleteAccount, calc_state.go 3-way Deleted branch, calc_state_test.go (new), exec3_parallel.go normalizeWriteSet incarnation guard. Lint clean locally; CI re-running.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge latest main because Files changed show a lot of superfluous changes.

@mh0lt

mh0lt commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

Force-pushed (9539998f14) with the regression fix for the eest-devnet extcodehash_subcall_create2_oog Amsterdam tests.

Root cause of the regression on the prior force-push: the fix added a redundant IncarnationPath > 0 exclusion to normalizeWriteSet's empty-account check. The empty-check already requires Nonce == 0, which correctly excludes successful CREATE/CREATE2 (which sets Nonce=1) — the original PR target case (constructor returned empty bytecode but wrote storage) bypasses the empty-check via Nonce==0 anyway. But OOG-during-CREATE2 leaves Nonce=0/Balance=0/Code=empty/Incarnation=1 and must still be deleted (no contract was deployed); the redundant clause was preserving these as zombie accounts and producing wrong trie roots.

Fix:

  • Removed the redundant IncarnationPath > 0 clause from normalizeWriteSet and the now-unused acctState.incarnation/hasInc fields.
  • Added an isAllZero defense-in-depth check to calc_state.FlushToUpdates so a Deleted writeset with retained non-zero values (the OOG-CREATE2-with-value-transfer-retained-per-EIP-1014 case, or any future write-ordering race) emits a regular UPDATE rather than zeroing out the surviving fields.
  • Added TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate as a regression guard.

Local verification:

  • All 6 targeted variants of extcodehash_subcall_create2_oog[fork_Amsterdam-...-{call,callcode,delegatecall}] pass.
  • Full for_amsterdam/constantinople eest_devnet suite passes.
  • All existing calc_state unit tests still pass.
  • make lint clean.

@mh0lt

mh0lt commented May 7, 2026

Copy link
Copy Markdown
Contributor Author

@yperbasis CI is now green on 9539998f14. Force-push earlier this hour (9539998f14) added the OOG-during-CREATE2 regression fix on top of the original SD/incarnation differentiator. Ready for re-review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses wrong state-root computation in the parallel commitment calculator by distinguishing self-destruct vs EIP-161 empty-removal cases that otherwise produce indistinguishable post-tx writesets. It introduces incarnation tracking into the calculator pipeline and adds targeted unit tests to lock in the expected branching behavior.

Changes:

  • Emit IncarnationPath alongside SelfDestructPath=true from LightCollector.DeleteAccount when the pre-deletion account had a non-zero incarnation.
  • Extend the commitment calculator state accumulator to track Incarnation and to branch account deletes into either DeleteUpdate or a zero-account UPDATE to match serial behavior.
  • Add unit tests covering the delete-branching matrix (SD w/ incarnation, touched-empty EIP-161, retained-balance edge case, and write-order interactions).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
execution/state/rw_v3.go Emits IncarnationPath on deletes so downstream commitment logic can differentiate delete flavors.
execution/stagedsync/exec3_parallel.go Updates/clarifies empty-account normalization commentary near delete conversion logic.
execution/stagedsync/calc_state.go Tracks Incarnation, refines delete branching, and adjusts SD handling of storage flushing.
execution/stagedsync/calc_state_test.go Adds unit tests covering the new branching behavior and ordering interactions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +207 to +216
// Mark every tracked storage slot dirty so
// FlushToUpdates emits a per-slot update. Serial's
// MakeWriteSet for an SD account emits storage writes
// (with their pre-SD values) alongside the account-
// reset, and the trie processes them in fold order.
if slots, ok := cs.storageState[w.Address]; ok {
if cs.storageDirty[w.Address] == nil {
cs.storageDirty[w.Address] = make(map[accounts.StorageKey]bool)
}
for key := range slots {
Comment thread execution/stagedsync/calc_state.go Outdated
Comment on lines +235 to +237
if v, ok := w.Val.(uint64); ok {
acc.Incarnation = v
}

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

Fixes a wrong trie-root in the parallel commitment calculator when post-tx writesets are indistinguishable between two cases:

  1. SELFDESTRUCT of a pre-existing contract — serial keeps the leaf with incarnation>0, zero balance/nonce/empty codeHash. Parallel was emitting DeleteUpdate and removing the leaf.
  2. EIP-161 emptyRemoval of a touched-empty account (e.g. 0xff…fe after the EIP-4788 system call) — serial emits DeleteUpdate. Parallel was emitting a zero-account UPDATE.

The discriminator is the pre-block incarnation, which the writeset alone didn't carry. Fix wires it through LightCollector.DeleteAccountIncarnationPath write → calcAccountState.Incarnation → 3-way branch in FlushToUpdates.

I checked out the branch and ran go test ./execution/stagedsync/ ./execution/state/ — all pass.

Strengths

  • Surgical scope. Only 4 files; serial path untouched. Risk to mainnet is low — the only common code touched is LightCollector.DeleteAccount gaining an additive emit; serial doesn't use LightCollector.
  • Well-documented. The commit message and inline comments explain the why clearly, including the difference between the two delete flavors and how serial's DomainDel encoding differs.
  • Good test coverage. Six new unit tests cover the branching matrix.
  • Reproduced + fixed locally with the hive chain.rlp import in both per-block and batched commitment modes.

Concerns

1. PR description is stale (please fix before merge)

The body still says:

normalizeWriteSet empty-account filter: skips the SD=true conversion when IncarnationPath > 0 is in the writeset.

But the diff only adds comments to exec3_parallel.go — no logic change. Your own latest comment (the 9539998f14 green-CI announcement) confirms this clause was deliberately removed: "Removed the redundant IncarnationPath > 0 clause … the empty-check already requires Nonce == 0, which correctly excludes successful CREATE/CREATE2." The body should be updated so the description matches the diff — otherwise future reviewers/bisects will be misled.

2. SelfDestructPath storage handling — Copilot's first comment is partially valid

The change replaces:

delete(cs.storageState, w.Address)
delete(cs.storageDirty, w.Address)

with marking every tracked slot dirty with its current value. The justification ("serial emits storage writes with their pre-SD values") only holds when the SD'd contract was also created in the same TX with storage writes — that's the path where updateAccount enters both branches (createdContract && selfdestructed && !emptyRemoval) and updateStorage runs.

In the same-block flow, normalizeWriteSet does cover this: it adds StoragePath=0 entries for every key in vm.StorageKeys(addr) at the SelfDestructPath case (exec3_parallel.go:2964-2972), and those zeroes overwrite the dirty-marked values before flush. So in practice slots emit DeleteUpdate. ✓

But that's a non-obvious invariant. Worth a comment in the SelfDestructPath case in calc_state.go spelling it out:

// The slots' final values get overwritten to 0 by the StoragePath=0
// entries that normalizeWriteSet emits for every vm.StorageKeys(addr)
// at the SelfDestructPath case (exec3_parallel.go). Without that
// invariant, this would leak pre-SD slot values into the trie.

Otherwise a future change to normalizeWriteSet that drops the vm.StorageKeys loop would silently regress to wrong roots.

3. IncarnationPath type-assertion guard inconsistency — Copilot's second comment

case state.IncarnationPath:
    acc := cs.ensureAccount(w.Address)
    if v, ok := w.Val.(uint64); ok {   // guarded
        acc.Incarnation = v
    }
    acc.dirty = true

vs. every other case in the same function:

case state.BalancePath:
    acc.Balance = w.Val.(uint256.Int)  // direct, panics on mismatch

A silent type mismatch on IncarnationPath would leave Incarnation=0 and route a real SD into the EIP-161 DeleteUpdate branch — exactly the wrong-root bug this PR fixes, but harder to attribute. Either:

  • use a direct assertion (matches the file's style, fails fast), or
  • assign to cs.lazyLoadErr on the !ok branch so LazyLoadErr() surfaces it before commitment computes.

4. TestFlushToUpdates_DeletedWithRetainedBalance docstring is misleading

The test directly populates cs.accounts[addr] with Deleted=true && Balance=0x0421fe && Incarnation=1. But TestApplyWrites_BalancePathClearsDeleted (added in the same PR) pins the invariant that any BalancePath write clears Deleted. Since LightCollector always emits SelfDestructPath before BalancePath (both in DeleteAccount+UpdateAccountData ordering and in UpdateAccountData's reincarnation branch), ApplyWrites cannot produce the Deleted=true && Balance!=0 state from a real LightCollector writeset.

So the test is defensive coverage, not a real extcodehash_subcall_create2_oog repro. That regression most likely manifests via Deleted=false in cs.accounts, where pre-PR the else branch already emitted a regular UPDATE correctly — meaning the Amsterdam regression fix is more accurately the removal of the redundant IncarnationPath > 0 clause from normalizeWriteSet (per your own comment). Either:

  • Clarify the docstring: this is defensive coverage for the third branch (Deleted-with-retained-values), not a direct repro of the eest test.
  • Or: add a test that exercises the actual eest_devnet path through ApplyWrites + normalizeWriteSet end-to-end, so the locked-in behavior matches what production hits.

5. versionedWriteCollector.DeleteAccount not updated for consistency

execution/state/rw_v3.go:636 is the other DeleteAccount implementation, used during txResult.finalize (fee calc + system TXs). It ignores original and emits only SelfDestructPath. If a finalize-path DeleteAccount ever fires on a contract with original.Incarnation > 0, the calc state would route it through the EIP-161 deletion branch and produce a wrong root again.

In practice this is probably unreachable today — finalize is fees + post-Cancun system calls, neither of which SD's pre-existing contracts. But the asymmetry is a footgun. Either:

  • Add the same IncarnationPath emit there, or
  • Add a comment on versionedWriteCollector.DeleteAccount explaining why it intentionally diverges.

Smaller notes

  • calc_state.go:240-273 — the default: branch comment "Either not Deleted, or Deleted-with-retained-values" is good. Consider also noting that under current ApplyWrites semantics, the latter sub-case is unreachable from real writesets (per concern #4) — it's defensive.
  • calc_state.go:265isAllZero is computed once and used twice; clear and readable.

Verdict

The fix is correct in intent and well-tested. The blocker for me is the stale PR description (#1), since it makes the diff confusing for future reviewers/bisects. Concerns #2 and #3 are easy hardenings of the invariant; #4 and #5 are documentation/consistency cleanups that can be done in this PR or a follow-up.

The CI matrix (#21017 rebased on top) is the real end-to-end gate — given this PR's parallel-exec changes don't run in this PR's own CI, that landing is the meaningful go/no-go signal.

@mh0lt mh0lt force-pushed the fix/parallel-exec-prague-setcode branch from 9539998 to 35ac726 Compare May 8, 2026 10:21
@mh0lt

mh0lt commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@yperbasis pushed 35ac7269a9 addressing all 5 review concerns + Copilot's two inline comments. PR body fully rewritten — please re-read; I'm calling out the dependency direction (this PR blocks #21017) and the one intentional non-fix (concern #5) explicitly.

# Concern Resolution
1 Stale PR description Body rewritten; documents the actual diff and the earlier-draft snag
2 SD storage cascade load-bearing invariant Inline comment in calc_state.go's SelfDestructPath case spelling out the dependency on normalizeWriteSet's vm.StorageKeys(addr) loop
3 IncarnationPath guarded type-assertion Changed to direct w.Val.(uint64) to match the file's style; comment explains why silent zero would be worse than panic
4 TestFlushToUpdates_DeletedWithRetainedBalance docstring Updated to clarify it's defensive coverage for FlushToUpdates' third branch in isolation, not a direct repro of the eest_devnet OOG path
5 versionedWriteCollector.DeleteAccount asymmetry Intentional non-fix — comment added explaining why the asymmetry is acceptable today and what to mirror if a future caller does reach the branch

Linked intentional non-fix is also called out in the PR body's Intentional non-fixes section so it's not lost in review history.

The dependency direction is now in the body's Dependency direction section: this PR blocks #21017's CI matrix, end-to-end validation comes via #21017 once it rebases on top.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR fixes a wrong trie-root in the parallel commitment calculator that surfaces on two cases serial handles distinctly but parallel can't differentiate from the writeset alone:

  1. SELFDESTRUCT of a contract with incarnation > 0 — serial emits a zero-account UPDATE that preserves the leaf; parallel was emitting DeleteUpdate.
  2. EIP-161 emptyRemoval of a touched-empty account — serial emits DeleteUpdate; parallel was emitting a zero-account UPDATE.

The discriminator is the pre-block incarnation, which is now wired through LightCollector.DeleteAccount → IncarnationPath → calcAccountState.Incarnation → 3-way switch in FlushToUpdates.

Strengths

  • Surgical scope: 4 files; serial path untouched. Risk to mainnet is low.
  • Well-documented: the case comments in FlushToUpdates explain the three flavors precisely; the LOAD-BEARING INVARIANT comment for the storage cascade is exactly what future-you needs.
  • Addresses prior reviewer concerns: my earlier #1#5 (stale description, storage-cascade docs, type-assertion style, defensive-test docstring, asymmetric versionedWriteCollector.DeleteAccount) are all resolved in 35ac7269a9. Thanks.
  • Good FlushToUpdates branch coverage: 6 unit tests; verified all pass on the branch.

Concerns

1. Unit tests bypass the production pipeline — case 1 reachability is unverified

TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate populates cs.accounts directly with Deleted=true && Balance=0 && Nonce=0 && CodeHash=empty && Incarnation=1. But the production path goes through normalizeWriteSet (exec3_parallel.go:2480), which adds completion entries for any missing BalancePath/NoncePath/CodeHashPath for every address in the raw writeset, falling back to stateReader.ReadAccountData(addr) for values:

// exec3_parallel.go inside normalizeWriteSet
for _, path := range []state.AccountPath{state.BalancePath, state.NoncePath, state.IncarnationPath, state.CodeHashPath} {
    if fields != nil && fields[path] { continue }
    // vm.Read fallback then stateReader fallback...
}

LightCollector.DeleteAccount emits only SelfDestructPath + IncarnationPath, so the completion loop will append BalancePath/NoncePath/CodeHashPath after them. In calcState.ApplyWrites, those three paths each set acc.Deleted = false. End state: Deleted=false, default branch fires, regular UPDATE with whatever values the stateReader returned.

The Deleted=true && Incarnation>0 && isAllZero branch only fires if no balance/nonce/codeHash writes arrive after SelfDestructPath. The unit tests ensure this by hand-crafted inputs; production writesets don't have that property after normalizeWriteSet completion.

The PR clearly fixes some real bug (hive chain.rlp import works), but I can't reconcile the unit tests with the production flow from reading the diff. Please add an integration-style test that runs LightCollector.DeleteAccountnormalizeWriteSet(...)cs.ApplyWritescs.FlushToUpdates for an SD-of-pre-existing-contract scenario (with a real or mock stateReader that returns the pre-block account). Either the test confirms my read is wrong, or it surfaces a real gap.

2. Storage cascade for pre-existing SD'd contracts

The marked-dirty loop covers slots in cs.storageState[addr], which are populated lazily by storage writes/reads in this block. vm.StorageKeys(addr) (in normalizeWriteSet) covers slots touched in this block. Pre-existing storage slots that were never read or written in this block are in neither set — they sit in the trie with their old values.

For an SD'd pre-existing contract, serial's SD wipes the entire storage subtree. This PR's emit pattern (zero-account UPDATE + per-slot DeleteUpdate for touched slots only) may leave untouched pre-existing slots unmodified in the trie. Possibly the trie's commitment machinery handles this when the leaf's account bytes change to zero-balance/zero-nonce/empty-codehash — but the PR doesn't claim this and there's no test for it. Worth confirming.

3. calcDomainReader.ReadAccountData doesn't load Incarnation

calc_state.go:131-135:

acc.Balance = dbAcc.Balance
acc.Nonce = dbAcc.Nonce
acc.CodeHash = dbAcc.CodeHash.Value()
// Incarnation is NOT loaded

So acc.Incarnation stays at 0 unless an explicit IncarnationPath write arrives. This is fine for LightCollector.DeleteAccount (which now emits IncarnationPath), but creates a coupling: any future DeleteAccount impl that forgets to emit IncarnationPath will silently route SD'd contracts into the EIP-161 DeleteUpdate branch — exactly the bug this PR fixes. The versionedWriteCollector.DeleteAccount comment acknowledges this asymmetry; consider also a one-line note in ensureAccount about why Incarnation is not lazy-loaded.

4. Default-branch comment could mention unreachability

The default-branch comment // Either not Deleted, or Deleted-with-retained-values. is good. The PR description (concern #4 response) acknowledges that Deleted-with-retained-values is unreachable from real LightCollector writesets under current ApplyWrites semantics. Consider mentioning that in the inline comment so a future reader knows the third sub-case is defensive coverage, not a real production state. (Or: connect this to my concern #1 above — if my read is correct, the SD-with-incarnation case actually does reach the default branch in production.)

Smaller notes

  • CI is currently in_progress — wait for green before merging.
  • Lint: PR claims clean; please confirm with the latest force-push.
  • isAllZero named-once is clean. ✓
  • PR body: now correctly reflects the diff (exec3_parallel.go is comments-only). ✓

Verdict

The intent is correct and the documentation/comments are notably good — this is the kind of change that's easy to get wrong and easy to forget the rationale six months later.

My main blocker is concern #1: the unit tests cover the FlushToUpdates switch in isolation but don't exercise the production write-pipeline ordering, and from a static read I can't confirm that case 1 is actually reachable end-to-end. An integration-style test would either (a) confirm my analysis is wrong (most likely), or (b) catch a real gap before #21017's matrix lights up the same wrong-root in CI.

Concerns #2#4 are documentation/coverage hardening, not blockers.

End-to-end validation via #21017's CI matrix is the right go/no-go signal as you note — but local unit tests should ideally be the first line of defense for the SD/incarnation invariant, not just hive chain.rlp.

mh0lt added 2 commits May 8, 2026 11:47
…tyRemoval

The parallel commitment calculator was producing the wrong trie root for
blocks that contained either:

  1. A contract created and self-destructed (or with empty constructor that
     wrote storage) in the same tx — e.g. hivechain test fixture block 2
     `0x4055cae5...`. Serial keeps the leaf with `incarnation=1` (zero
     balance/nonce/empty codeHash). Parallel was emitting `DeleteUpdate`
     and removing the leaf, diverging the root.

  2. A touched-empty account on the EIP-161 emptyRemoval path — e.g. the
     system address `0xfffffffffffffffffffffffffffffffffffffffe` after a
     Cancun EIP-4788 syscall. Serial emits `DeleteUpdate`. Parallel was
     emitting a zero-account UPDATE.

These two cases produce identical writesets from the executor's point of
view (Balance=0 / Nonce=0 / CodeHash=empty / SelfDestructPath=true) so the
calculator can't differentiate from the post-tx state alone. The
differentiator is the pre-block `Incarnation` — non-zero for SD'd
contracts, zero for emptyRemoval'd EOA-shaped accounts.

Fix:

  - LightCollector.DeleteAccount now emits IncarnationPath alongside
    SelfDestructPath=true when the original account had a non-zero
    incarnation. The calc receives the pre-deletion incarnation through
    the same channel as every other write — no exec-side state leakage.

  - calcState tracks per-account Incarnation. FlushToUpdates branches
    (with isAllZero defense-in-depth):
      * Deleted && Incarnation > 0 && all-zero fields → zero-account
        UPDATE (matches serial's DomainDel-leaves-post-CREATE-encoding
        for SD'd contracts).
      * Deleted && all-zero fields && Incarnation == 0 → DeleteUpdate
        (matches serial's DomainDel-truly-empties for EIP-161 emptyRemoval).
      * Deleted-but-not-all-zero (e.g. OOG-during-CREATE2 with retained
        value transfer per EIP-1014) → emit a regular UPDATE with the
        actual values — serial does NOT remove the leaf in that case.

  - SelfDestructPath also marks all tracked storage slots dirty so
    FlushToUpdates emits per-slot updates alongside the account reset.

Note: the earlier draft of this fix added a redundant IncarnationPath
guard to normalizeWriteSet's empty-account check; that broke the
extcodehash_subcall_create2_oog Amsterdam-fork eest tests, where an
OOG'd CREATE2 leaves Nonce=0/Balance=0/Code=empty/Incarnation=1 and
must still be deleted. The empty-check's existing Nonce==0 condition
already excludes successful CREATE (which sets Nonce=1), so the guard
was unnecessary; calc_state's 3-way branch handles the SD-of-existing-
contract case from the writeset shape.

Verified end-to-end: parallel `erigon import` of the hive rpc-compat
chain.rlp now produces the correct head hash for all 45 blocks
(Frontier → Spurious Dragon → Byzantium → London → Shanghai → Cancun →
Prague), in both per-block and batched commitment modes. Locally also
verified that the eest_devnet
extcodehash_subcall_create2_oog[fork_Amsterdam-...] tests pass
(experimentalBAL=true forces parallel mode for these tests).
@mh0lt mh0lt force-pushed the fix/parallel-exec-prague-setcode branch from 35ac726 to 626072e Compare May 8, 2026 11:48
@mh0lt

mh0lt commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@yperbasis pushed 626072e6e9 addressing review #2's concern about FlushToUpdates branch reachability.

TL;DR: Your read of the production pipeline is correct. The integration test confirms it.

What the integration test reveals

Added TestSDOfPreExistingContract_FullPipeline which drives LightCollector.DeleteAccount → normalizeWriteSet → calcState.ApplyWrites → calcState.FlushToUpdates end-to-end with a stub stateReader that returns the pre-block account.

Output of running it through the production flow for an SD-of-pre-existing-contract:

  1. LightCollector.DeleteAccount emits [SelfDestructPath, IncarnationPath=N]
  2. normalizeWriteSet completion loop appends [BalancePath=X, NoncePath=Y, CodeHashPath=Z] from the stateReader (pre-block values)
  3. calcState.ApplyWrites processes them in order:
    • SelfDestructPathacc.Deleted=true
    • IncarnationPathacc.Incarnation=N
    • BalancePathacc.Balance=X, acc.Deleted=false ← clears it
    • NoncePathacc.Nonce=Y, acc.Deleted=false
    • CodeHashPathacc.CodeHash=Z, acc.Deleted=false
  4. FlushToUpdates sees acc.Deleted=falsedefault branch fires (regular UPDATE with X/Y/Z)

So the Deleted && Incarnation>0 && isAllZero branch in FlushToUpdates is unreachable from real LightCollector writesets through the current production pipeline.

What this means for the PR

  • The actual hive chain.rlp 4055cae5 fix mechanism is the default branch emitting pre-block field values via the completion-loop path. That matches what serial's DomainPut does for the same scenario (leaf survives encoded with pre-block field values; commitment.Update has no Incarnation field so the trie's leaf-hash inputs match between serial and parallel).
  • The Deleted && Incarnation>0 && isAllZero and Deleted && retained-balance branches in FlushToUpdates are defensive coverage against future ApplyWrites/normalizeWriteSet refactors (e.g. if the completion loop is dropped, or BalancePath stops resetting Deleted) — kept rather than removed because the FlushToUpdates branching is otherwise easy to break silently.

Changes in 626072e

  1. New TestSDOfPreExistingContract_FullPipeline integration test (drives the full production pipeline; locks in the "default branch with pre-block values" behaviour).
  2. Updated docstring on TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate to call out it's defensive coverage (parallel to what was already done for TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate).

The LightCollector.DeleteAccount IncarnationPath emit is still useful — it's the channel the calculator uses to populate acc.Incarnation (just doesn't end up driving the zero-account branch under current ApplyWrites semantics).

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

This PR threads a pre-deletion incarnation through the parallel commitment calculator so it can distinguish two writeset shapes that look identical on the wire: SELFDESTRUCT of a pre-existing contract vs. EIP-161 emptyRemoval of a touched-empty account. It adds a 3-way switch in FlushToUpdates, a new IncarnationPath case in ApplyWrites, a storage-cascade emit on SelfDestruct, and a defensive IncarnationPath emit in LightCollector.DeleteAccount. Includes a 7-test unit suite.

What's good

  • Storage cascade fix is sound. Replacing the eager delete(cs.storageState, addr) with "mark each tracked slot dirty" lets FlushToUpdates emit DeleteUpdate per slot. The load-bearing-invariant comment explaining the dependency on normalizeWriteSet's vm.StorageKeys(addr) zero-emit loop is exactly what future-readers need.
  • 3-way switch is clearly written, with a switch over predicates (rather than nested ifs) and isAllZero extracted as a named precondition.
  • IncarnationPath ApplyWrites case uses direct type-assertion — matches the convention of every other case in the function, and panics fast on mismatch instead of silently zero-ing into the wrong branch (good — concern #3 in PR body resolved correctly).
  • Comments document the why rather than the what, including past-incident attribution (which OOG case repros, which hive block triggered the SD case).
  • Tests pass on the PR branch (verified locally — 7/7).

Concerns

1. TestSDOfPreExistingContract_FullPipeline does not actually drive the production pipeline

The test bills itself as end-to-end coverage of LightCollector.DeleteAccount → normalizeWriteSet → ApplyWrites → FlushToUpdates. But it constructs rawWrites containing only SelfDestructPath=true + IncarnationPath=N, with an empty vm := state.NewVersionMap(nil).

In production, the EVM's IntraBlockState.Selfdestruct (intra_block_state.go:1430-1432) directly writes to versionedWrites:

versionWritten(sdb, addr, IncarnationPath, accounts.NilKey, stateObject.data.Incarnation)
versionWritten(sdb, addr, SelfDestructPath, accounts.NilKey, stateObject.selfdestructed)
versionWritten(sdb, addr, BalancePath, accounts.NilKey, uint256.Int{})

So the real rawWrites for an SD'd contract reaches normalizeWriteSet with BalancePath=0 already in the input AND in the version map. The completion loop therefore fills BalancePath=0 (from vm.Read) — not BalancePath=preBlockBalance (from stateReader fallback) as the test asserts.

That difference is load-bearing: the test's "trie sees pre-block balance" assertion is an artifact of the empty-vm shortcut, not of the production flow.

Recommendation: either (a) populate vm with the EVM's BalancePath=0 / IncarnationPath=N writes via vm.FlushVersionedWrites so the test reflects the actual SD versionWritten emits, or (b) rename the test to TestNormalizeWriteSet_CompletionLoopFromStateReader and remove the "production pipeline end-to-end" framing.

2. PR description's claim about serial behavior is inconsistent with the code

The body says serial "keeps the leaf with incarnation>0, zero balance/nonce/empty codeHash". I traced the actual serial path:

  • Writer.DeleteAccount (rw_v3.go:903-922) calls w.tx.DomainDel(kv.AccountsDomain, addr, txNum, nil).
  • SharedDomains.DomainDel (domain_shared.go:675-720) calls sd.sdCtx.TouchKey(domain, ks, nil).
  • commitment.Updates.TouchAccount with empty val sets c.update.Flags = DeleteUpdate.
  • The leaf is removed, not preserved.

If serial actually keeps a zero-account leaf via some mechanism I missed (e.g. domain-encoded incarnation that survives DomainDel), please add a code pointer to the description so reviewers can verify. As written, the description's stated post-condition contradicts what I find by following DomainDel through SharedDomains.

3. LightCollector.DeleteAccount's new IncarnationPath emit appears unreachable in production

For the SD path: IBS.Selfdestruct already emits IncarnationPath=stateObject.data.Incarnation via versionWritten, well before MakeWriteSet is called. That goes into ibs.versionedWritesresult.TxOutblockIO.RecordWritesrawWrites. By contrast, LightCollector writes go to result.CollectorWrites, which is not the source normalizeWriteSet reads at line 2478 (rawWrites := be.blockIO.WriteSet(txVersion.TxIndex)).

For the EIP-161 emptyRemoval path: the new guard if original != nil && original.Incarnation > 0 excludes touched-empty EOAs (which by definition have Incarnation == 0).

If the goal is defense-in-depth for a future code path that calls DeleteAccount on a pre-existing contract without going through IBS.Selfdestruct, fine — but the comment should say so. Today it reads as "this is the channel through which SD signals the pre-incarnation," which isn't accurate for the production flow.

4. Unit tests cover branches the PR itself flags as unreachable

TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate and TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate populate cs.accounts directly because the production pipeline's completion loop resets acc.Deleted = false before reaching them. The PR body acknowledges this ("covers a state production never reaches").

The risk isn't the defensive code — it's that future readers may modify the production flow without realizing these branches exist as a safety net, OR may assume these paths are exercised end-to-end when they aren't. The unit-test PASS doesn't tell you anything about whether the corresponding logic ever fires in the wild.

Suggestion: if you want to keep the defensive branches, replace those two unit tests with a single t.Skip(\"defensive — currently unreachable from production; see TestSDOfPreExistingContract_FullPipeline for the live path\") placeholder, OR add an assertion in FlushToUpdates that fires panic(\"unreachable: file a bug\") when branch 1 or 3 is hit, and replace the unit tests with a recover-based "this would panic" assertion. Either way, future maintainers see "this is dead defense" without having to trace the pipeline.

5. EIP-161 case (branch 2) is reachable and correct, but pre-PR also worked

Pre-PR FlushToUpdates:

if acc.Deleted { DeleteUpdate } else { regular }

Post-PR FlushToUpdates for the EIP-161 case:

  • normalizeWriteSet's empty-account check at the end converts BalancePath/NoncePath/CodeHashPath/IncarnationPath to a single SelfDestructPath=true.
  • cs.ApplyWrites only sees SelfDestructPathDeleted=true, no field writes.
  • Branch 2 fires (Deleted && all-zero && Incarnation==0) → DeleteUpdate.

But the same outcome holds with the pre-PR code: Deleted=true → first branch (if acc.Deleted) → DeleteUpdate.

So claim #2 in the PR description ("Parallel was emitting a zero-account UPDATE" for EIP-161) needs a code pointer. As I trace it, parallel was already emitting DeleteUpdate for that case before this PR. If there's a specific scenario where the empty-account check in normalizeWriteSet does not fire (and so the EIP-161 emptyRemoval path leaves Deleted=false), please call it out — without that, it's not clear what the EIP-161 half of this PR fixes.

6. Storage-cascade depends on stable contract between calc_state and normalizeWriteSet

The new SelfDestructPath case in ApplyWrites only marks slots already in cs.storageState as dirty — so the value seen at flush time is whatever was last written to that slot. The PR relies on normalizeWriteSet appending StoragePath=0 entries after SelfDestructPath for every key in vm.StorageKeys(addr) (exec3_parallel.go:2962-2972).

The comment is good, but enforcement isn't:

  • If vm.StorageKeys(addr) ever returns only dirty slots in this block (instead of all known keys), pre-existing storage from prior blocks won't be marked dirty here, won't get their StoragePath=0 companion in normalizeWriteSet, and will leak as stale StorageUpdate(pre-SD value) entries.
  • More broadly, this cross-file invariant is footgun-shaped — there's no test that fails if you accidentally reorder the SelfDestructPath and StoragePath=0 emits in normalizeWriteSet.

Suggestion: add a regression test that drives a concrete SD'd contract with a non-empty vm populated with both SelfDestructPath and a few StoragePath=non-zero writes, then asserts that FlushToUpdates emits DeleteUpdate (not StorageUpdate) for each slot. This is a different scenario than TestSDOfPreExistingContract_FullPipeline and would lock in the invariant.

7. versionedWriteCollector.DeleteAccount asymmetry

Already discussed in concern #5 of the PR body. The non-fix is documented inline. I agree with the decision (no production caller hits the differentiator), but if the comment is to serve its purpose, please make it grep-able by name — e.g., add a // MIRROR-OF: LightCollector.DeleteAccount IncarnationPath emit tag so a future search hits both sites.

Summary

The fix is in the right region of the codebase and the storage-cascade rework is the load-bearing change. The 3-way switch and the LightCollector IncarnationPath emit are mostly defense for currently-unreachable paths.

My main asks:

  • Reconcile the integration test with the production pipeline (concern #1) — the test's vm should contain the EVM's versionWritten emits.
  • Reconcile the PR description with serial's DomainDel behavior (concerns #2, #5) — the "serial keeps the leaf" claim and the EIP-161 wrong-output claim both need code pointers I can verify.
  • Add a regression test for the storage-cascade invariant (concern #6) — currently there's nothing that fails if normalizeWriteSet reorders the SD/StoragePath emits.

Once those are addressed I'd be comfortable signing off; in the meantime requesting changes for clarity, since the PR description and the integration test are doing a lot of load-bearing work to justify the design and both have inconsistencies with what I find tracing through the live code.

pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 8, 2026
…oop exit (erigontech#21039)

## Summary

Drops the fallback clause in `applyLoopMissingBlocks` that turned every
legitimate partial-batch exit into a spurious `InvalidBlock` error.

When the parallel exec loop hits its size-limit mid-fixture, it returns
nil with `reachedMaxBlock=false`. The previous completeness check
flagged `maxBlockNum` as "missing" whenever `!reachedMaxBlock` — but the
size-limit path is normal: the apply loop returns `ErrLoopExhausted`,
the stage loop resumes the next batch from `lastBlockResult+1`, and each
block still executes exactly once. No re-execution.

## Update — additional fix in commit `310a094514`
(`execution/stagedsync: honor blockResult.Exhausted in exec loop`)

Same class of partial-batch orchestration bug, on the **exec-loop
side**. `executeBlocks` marks the final dispatched block with
`Exhausted` when the per-cycle block limit fires, then exits its
goroutine — without closing `pe.execRequests`, without cancelling ctx.
The exec loop had no branch to react to that signal: after the last
blockResult was applied, the main select parked waiting for
`execRequests` / `rws.ResultCh` / `ctx.Done` — none of which ever fire.
The apply loop never got the channel-close signal it needs to return
`ErrLoopExhausted`, so the original (apply-loop-side) fix in this PR
couldn't even run.

Reproduced as a chiado parallel-exec **silent stall** at block 150662
with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block
0 (issue erigontech#20711). The hang was masking a wrong-trie-root divergence at
the same block — with the exhaust signal honored, the underlying
`ErrWrongTrieRoot` now surfaces cleanly through the apply loop and the
original-issue symptom appears as expected.

The new branch sits between the `maxBlockNum`-reached check and the
`StopAfterBlock` debug exit, matching the precedence of the existing
partial-batch exits.

Tests: `TestExecLoopHonorsBlockResultExhausted` runs **two models** of
the exec-loop blockResult decision tree (post-fix and pre-fix) against
identical channel orchestration. The pre-fix model must hang past the
timeout — proves the test scaffolding genuinely surfaces the bug rather
than passing vacuously. `TestExecLoopExhaustedOnlySetOnFinalBlock` locks
in the dispatcher convention so future `executeBlocks` changes can't
silently drop already-queued work by setting `Exhausted` mid-stream.

## Why

Reproduces deterministically as the parallel-mode `BenchmarkFeeHistory`
failure on PR erigontech#21017 (the CI matrix that runs every test under both
`serial` and `parallel` exec modes):

```
apply loop exited (reachedMaxBlock=false lastBlockResult=114 maxBlockNum=200) but 1 block(s) had tx-results without a blockResult or were never delivered: [200]
```

200 blocks × 100 simple transfers exceed the test fixture's 5MB batch
budget at block 114. Trace confirms:
- Batch 1: blocks 1..114, size-limit fires, `ErrLoopExhausted`
- Batch 2: blocks 115..200, `reachedMaxBlock=true`, clean nil

Sd.mem is not contaminated by in-flight future-block writes —
`txResultBlocks` at exit was exactly `[1..114]`. Block 200 was flagged
solely from the fallback clause.

## Shutdown sequence verified

Tracing the deferred-close in `execLoop`: `commitResults` closes first,
calculator drains and closes `rootResults`, apply loop sees both closes,
drains, runs the completeness check, returns `ErrLoopExhausted`.
`executorCancel` then cancels `execLoopCtx`, `executeBlocks` and workers
exit, `pe.wait` joins. Three batches in `BenchmarkFeeHistory` run
end-to-end with no goroutine leaks.

## Tests

* `TestApplyLoopMissingBlocks` — adds a "partial batch — size-limit hit,
no spurious flag for unreached blocks" case
* `TestApplyLoopPartialBatchReturnsErrLoopExhausted` — exercises the
apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for
partial-batch / full-batch / silent-failure scenarios
* `TestApplyLoopChannelCloseOrder` — pins the load-bearing
`commitResults`-before-`applyResults` close order
* `TestExecLoopHonorsBlockResultExhausted` — orchestration test for the
exec-loop side of the partial-batch exit (added in commit `310a094514`)
* `TestExecLoopExhaustedOnlySetOnFinalBlock` — pins the executeBlocks
dispatcher convention (added in commit `310a094514`)

## Test plan

- [x] `BenchmarkFeeHistory/full/blocks=200` passes with
`ERIGON_EXEC3_PARALLEL=true`
- [x] `make lint` clean
- [x] `go test -race ./execution/stagedsync/...` for new tests passes
- [x] Existing `TestApplyLoopRootResultsCloseDoesNotRace`,
`TestApplyLoopDoesNotHangAfterRootResultsClose`,
`TestExecLoopExitCheck*` still pass
- [x] Chiado `EXEC3_PARALLEL=true ... --chain=chiado
--sync.loop.block.limit=10_000 --prune.mode=archive` from block 0
reaches the underlying `ErrWrongTrieRoot` at block 150662 instead of
silently stalling (verifies the partial-batch exec-loop exit fires)

## CI fix linkage

Precursor for erigontech#21017 (CI matrix `exec_mode={serial,parallel}`). Without
this, `bench / benchmarks (parallel)` and any other workflow exercising
a chain large enough to cross the batch-size boundary fails on first
crossing. Companion to erigontech#21032 (incarnation/SD differentiator fix).
- Remove dead LightCollector.DeleteAccount IncarnationPath emit (concern #3):
  IBS.Selfdestruct's versionWritten already publishes IncarnationPath via
  the version map; CollectorWrites is not the rawWrites source.
- Add MIRROR-OF tag on versionedWriteCollector.DeleteAccount cross-
  referencing LightCollector.DeleteAccount and noting that IncarnationPath
  is emitted via IBS.Selfdestruct, not either DeleteAccount (concern #7).
- Rewrite TestSDOfPreExistingContract_FullPipeline to populate vm with
  IBS.Selfdestruct's versionWritten emits; assert BalancePath=0 comes
  from vm.Read (not preBlockBalance from stateReader fallback) and that
  the trie sees a leaf with {Balance=0, Nonce=preBlock, CodeHash=preBlock}
  through the default FlushToUpdates branch (concern #1).
- Add TestSDStorageCascade_EmitsPerSlotDeletes locking in the storage-
  cascade load-bearing invariant: normalizeWriteSet's vm.StorageKeys
  loop emits StoragePath=0 entries that overwrite cs.storageState's
  pre-SD slot values, so each slot emits DeleteUpdate (concern #6).
- Update defensive-test docstrings to spell out unreachability from real
  production writesets and cross-link to the realistic-pipeline tests
  that cover the actual fix mechanism (concern #4).
@mh0lt

mh0lt commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@yperbasis Force-pushed. All 7 concerns addressed in 3e93ac971c (commit on top of the merge from main; the next squash-merge will fold everything together):

# Concern Resolution
1 Integration test populated vm empty so completion-loop's vm.Read fallback never fired TestSDOfPreExistingContract_FullPipeline now populates vm with IBS.Selfdestruct's versionWritten emits; asserts BalancePath=0 from vm.Read (NOT preBlockBalance from stateReader fallback)
2 "Serial keeps the leaf" claim was unverified Body rewritten — describes the verified parallel-side production flow only; serial-equivalence reduced to "parallel emits the same trie writes as serial" with eest_devnet end-to-end coverage as evidence
3 LightCollector.DeleteAccount IncarnationPath emit was dead code Removed. IBS.Selfdestruct's versionWritten already emits IncarnationPath via the version map
4 Defensive unit tests claimed to cover the fix but tested unreachable branches Docstrings rewritten: each defensive test states the branch is unreachable from real production writesets, why, and that it's kept against future refactors. Realistic-pipeline tests now cover the actual fix mechanism
5 PR body had stale claims Rewritten
6 Storage cascade was only covered end-to-end New TestSDStorageCascade_EmitsPerSlotDeletes locks in the storage-cascade load-bearing invariant at unit level: pre-loads cs.storageState with non-zero values, populates vm with StoragePath entries, asserts normalizeWriteSet emits StoragePath=0 per slot and that all slots emit DeleteUpdate
7 versionedWriteCollector.DeleteAccount and LightCollector.DeleteAccount were asymmetric Both now emit only SelfDestructPath. MIRROR-OF comment on each cross-references the other and explains that IncarnationPath is emitted via IBS.Selfdestruct's versionWritten

Key thing the new integration test reveals: the defensive Deleted+Incarnation>0 and Deleted+RetainedBalance branches in FlushToUpdates are confirmed unreachable from real production writesets. The actual fix mechanism is via the default !Deleted branch — IBS.Selfdestruct's BalancePath=0 versionWritten lands in ApplyWrites after SelfDestructPath and resets acc.Deleted=false, so by the time FlushToUpdates runs the account ends up in the regular UPDATE path with {Balance=0, Nonce=preBlock, CodeHash=preBlock}.

The Deleted+... branches are kept as defensive coverage against future ApplyWrites refactors that might change write ordering or drop the BalancePath-clears-Deleted invariant. Tests are explicit about that in the new docstrings.

make lint clean; all 8 calc_state tests pass locally. CI running.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread execution/state/rw_v3.go
Comment on lines +637 to +645
// MIRROR-OF: LightCollector.DeleteAccount — kept symmetric so that
// future searches for one find the other. Both collectors emit only
// SelfDestructPath; the IncarnationPath needed by the parallel
// commitment calculator for the SD-of-pre-existing-contract case is
// emitted via IBS.Selfdestruct's versionWritten (intra_block_state.go
// around line 1430), not via either DeleteAccount. If a future caller
// ever invokes DeleteAccount outside the IBS.Selfdestruct path on a
// pre-existing contract, both implementations would need an
// IncarnationPath emit here.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overview

The PR threads a pre-deletion Incarnation through the parallel commitment calculator so it can distinguish two writeset shapes that look identical on the wire: SELFDESTRUCT of a pre-existing contract vs. EIP-161 emptyRemoval of a touched-empty account. Four files touched (~70 LOC of production change, ~537 LOC of new tests).

What's solid

  • Surgical scope. Serial path untouched; risk to mainnet is contained to parallel exec. CI on 9539998f14 was green.
  • 3-way switch is clearly written with switch over predicates and isAllZero extracted as a named precondition.
  • IncarnationPath ApplyWrites case uses direct type-assertion — matches the convention of every other case in ApplyWrites and panics fast on mismatch instead of silently zero-ing into the wrong branch.
  • Storage cascade rework is logically sound. Replacing the eager delete(cs.storageState, addr) with "mark each tracked slot dirty" lets FlushToUpdates emit DeleteUpdate per slot once normalizeWriteSet's vm.StorageKeys(addr) zero-emit cascade overwrites the values. The LOAD-BEARING INVARIANT comment in calc_state.go:233 is exactly what future-readers need.
  • Comments document the why, including past-incident attribution (eest hive block, OOG-CREATE2 case).

Approving on the strength of #21017's matrix being the real end-to-end gate, plus my confidence in the storage-cascade fix and the surgical scope. Notes below are follow-ups for clarity rather than blockers — feel free to address in this PR or a follow-up.

Follow-ups

1. PR description doesn't match the diff

"LightCollector.DeleteAccount now emits IncarnationPath alongside SelfDestructPath=true when original.Incarnation > 0."

rw_v3.go:731-734 (LightCollector.DeleteAccount) is unchanged in the diff. The only rw_v3.go change is a comment added to versionedWriteCollector.DeleteAccount (lines 636-648) — and that comment explicitly says "Both collectors emit only SelfDestructPath", which contradicts the description.

The actual mechanism by which IncarnationPath reaches the calculator is pre-existing: IBS.Selfdestruct emits it via versionWritten at intra_block_state.go:1430. The only new code path is ApplyWrites consuming it (the case state.IncarnationPath block).

Worth rewriting the description to reflect what the diff actually does — the IncarnationPath emit was already happening; the calc just learned to consume it. Future bisects/reviewers will land here cold.

2. The "primary fix" branch (case 1) is unreachable from production

TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate covers Deleted && Incarnation>0 && all-zero → zero-account UPDATE. The PR's own TestSDOfPreExistingContract_FullPipeline proves this branch is never reached from a real writeset, because BalancePath=0 (emitted by IBS.Selfdestruct at line 1432) clears Deleted.

So:

  • Pre-PR for SD-of-pre-existing: Deleted=false end state → default else branch → regular UPDATE with {Balance=0, Nonce=preBlock, CodeHash=preBlock}.
  • Post-PR for SD-of-pre-existing: same Deleted=false end state → default branch → same values.

The actual delta for case 1 is: Incarnation is now tracked on calcAccountState, but it is never emitted to the trie (commitment.Update has no Incarnation field in the emit). It would help future readers to nail down precisely what trie-root divergence the PR resolves for case 1 — the storage-cascade slot leak? a specific interaction with EIP-161? — and pin it in the description.

3. Test bypasses one production sub-case

TestSDOfPreExistingContract_FullPipeline populates vm with the same writes as rawWrites. Good. But the test then asserts the trie sees pre-block Nonce and pre-block CodeHash — values supplied by the preBlockReader.ReadAccountData fallback in normalizeWriteSet's completion loop. This is correct for the test's vm, but production's vm may already contain NoncePath/CodeHashPath writes from earlier in the block (e.g., a balance transfer to the contract before SD), in which case vm.Read wins over stateReader fallback. The docstring claims "production pipeline end-to-end"; it actually exercises one specific sub-case.

Either re-frame the docstring or add a second variant where vm has prior NoncePath/CodeHashPath writes.

4. Storage cascade has a stale-slot edge

calcState is persistent across blocks (only storageDirty is cleared in ResetBlockFlags; storageState is kept as a cache). The new SelfDestructPath case marks every slot in cs.storageState[addr] dirty:

for key := range slots {
    cs.storageDirty[w.Address][key] = true
}

If a slot X was written in block N (so cs.storageState[addr][X] = v) and the contract is SD'd in block N+1 without any current-block StoragePath write to X, then:

  • vm.StorageKeys(addr) for block N+1 doesn't return X (vm is per-block).
  • normalizeWriteSet's cascade does not emit StoragePath(X, 0).
  • The new SD case marks X dirty.
  • FlushToUpdates emits StorageUpdate(X, v) — a no-op vs the trie's existing value, but a behavior delta from pre-PR (which would have deleted the slot from storageState, leaving X unflushed).

Benign IF cs.storageState[addr][X] always equals the trie's value at block boundary, which holds as long as block N's commitment was committed correctly. Worth a one-line comment near the cascade pinning that invariant: "storageState values match committed trie values at block start; emitting them mid-SD is a no-op vs the trie."

5. Defensive tests document unreachable branches

TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate and the case-1 test both populate cs.accounts directly because production never produces those states. The docstrings are honest about this. But there's no enforcement that the production code path doesn't accidentally start producing those states tomorrow — and if it does, the unit tests pass, masking a real regression.

If keeping defense in depth: a // nolint:unreachable panic in those branches (with a recover-based test) is clearer than "test passes" as a signal of "this is dead defense". Otherwise consider removing the unreachable branches and the tests covering them, keeping only the live Deleted && isAllZero → DeleteUpdate path.

6. EIP-161 case (branch 2) — what does this PR actually change?

For EIP-161 emptyRemoval, normalizeWriteSet's empty-account check (line 3177) collapses BalancePath/NoncePath/IncarnationPath/CodeHashPath writes into a single SelfDestructPath=true. So ApplyWrites only sees SelfDestructPathDeleted=true. Pre-PR if acc.DeletedDeleteUpdate. Post-PR Deleted && isAllZero (Incarnation==0 by default) → DeleteUpdate. Same outcome.

The description's claim "Parallel was emitting a zero-account UPDATE" for EIP-161 would benefit from a code pointer. From a static read, both pre and post PR emit DeleteUpdate for this case — happy to be shown a counter-trace.

Smaller notes

  • case state.IncarnationPath: setting acc.dirty = true for an Incarnation-only update is technically a no-op (Incarnation isn't emitted), but harmless — the addr would be flushed already since IncarnationPath only arrives bundled with SD/Create.
  • calcDomainReader.ReadAccountData doesn't lazy-load Incarnation (calc_state.go:131-135). Couples the design to "always emit IncarnationPath alongside SD"; a one-line comment would document this contract.
  • The 537-line test file has multi-paragraph docstrings per test. Per CLAUDE.md these read more like design notes — consider condensing or moving rationale to a single header comment at the top of the file.
  • commitment.BalanceUpdate|NonceUpdate|CodeUpdate repeated in 3 of 4 branches — would read better as a named const, but not blocking.

Verdict

Approving. Storage-cascade rework and the 3-way switch are sound, scope is contained, and #21017's matrix is the real end-to-end gate. The follow-ups above are clarity/coverage hardening — happy for them to land here or in a follow-up PR.

@mh0lt mh0lt added this pull request to the merge queue May 8, 2026
Merged via the queue into main with commit 95d9662 May 8, 2026
41 of 42 checks passed
@mh0lt mh0lt deleted the fix/parallel-exec-prague-setcode branch May 8, 2026 16:51
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 26, 2026
…igontech#21017)

> **2026-05-15 — erigontech#21153 has merged.** The companion PR carrying the
substantive parallel-exec correctness + perf fix family
(`mh/parallel-exec-fixes`, merged 2026-05-15 02:35Z as `958b2fbb85`) is
now on `main`. This PR has been rebased onto fresh main; the trimmed
branch contains only the serial/parallel exec-mode CI matrix yamls plus
two CI hygiene fixes (a shared `build-erigon-image` job for the kurtosis
matrix, and the `setup-erigon` apt-mirror flake fix), plus three
workflow follow-ups from the 2026-05-13 review.
>
> **Landed precursors:**
> - ✅ erigontech#21153 (merged 2026-05-15) — parallel-exec correctness/perf stack
split from this PR
> - ✅ erigontech#21088 (merged 2026-05-10) — parallel-commitment correctness for
reorg/unwind + SD recreate
> - ✅ erigontech#21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths
> - ✅ erigontech#21039 (merged 2026-05-08) — apply-loop completeness false-flag on
partial-batch exit
> - ✅ erigontech#21010 (merged 2026-05-07) —
`commitmentCalculator.asOfReader.txNum=0` lazy-load fix
>
> Open follow-up issues (separate from this PR's scope): erigontech#21106
(parallel-exec hardening), erigontech#21107 (stage-exec `from-0`/`chaintip`),
erigontech#21108 (residual `EXEC3_PARALLEL` functional cases), erigontech#21136 (gated
SkipLoads), erigontech#21138 (heuristic-removal structural cleanup).

## Summary

Adds an `exec_mode: [serial, parallel]` matrix axis to every CI test
workflow that exercises the EL execution path so divergence between
`dbg.Exec3Parallel` true and false is caught on the PR rather than after
release.

The toggle is plumbed via `ERIGON_EXEC3_PARALLEL` — `envLookup` in
[common/dbg/dbg_env.go:38](common/dbg/dbg_env.go#L38) auto-prepends the
`ERIGON_` prefix to the source-side `EXEC3_PARALLEL` flag in
[common/dbg/experiments.go:81](common/dbg/experiments.go#L81).

## Why

`Exec3Parallel = false` is currently the source default on `main`. With
no CI workflow setting the env var, every CI run today exercises only
the **serial** path. The parallel path lands via `--bal` chains and
tests like `experimentalBAL`, but the broad correctness signal (unit /
race / hive / kurtosis / RPC integration) is single-mode. This PR makes
both modes a per-PR signal.

## Affected workflows

**Always-on** (matrix on every PR / dispatch / call):

| Workflow | Runner | Parallelism | Cost |
|---|---|---|---|
| test-all-erigon.yml | GitHub-hosted (per-OS) | true parallel |
wall-clock unchanged, +1× runner-min |
| test-all-erigon-race.yml | GitHub-hosted | true parallel | same |
| test-hive.yml | hive group | parallel if pool has slots | same (group
is sized) |
| test-hive-eest.yml | hive group | parallel if pool has slots | same |
| test-kurtosis-assertoor.yml | `ubuntu-latest` | true parallel | same |

**Auto-fire on PRs touching their own YAML, dispatch otherwise** (so
regular PRs don't pay the cost; this PR triggers each one once for
end-to-end validation):

| Workflow | Notes |
|---|---|
| test-bench.yml | `go bench` |
| qa-rpc-integration-tests-latest.yml | self-hosted single-pool,
`max-parallel: 1` (shared testbed state) |
| qa-rpc-performance-comparison-tests.yml | erigon runs serial+parallel,
geth single-mode (placeholder ignored) |
| qa-txpool-performance-test.yml | kurtosis txpool, `max-parallel: 1` |
| qa-stage-exec.yml | 3 mode_names × 2 exec_modes = 6 entries |

**Skipped** — `test-integration-caplin.yml` runs `cl/spectest` only,
doesn't exercise the EL exec path; matrix-doubling would just re-run
identical CL tests.

## Plumbing details

* Workflows that build erigon and run go tests directly: env var set on
the test step's `env:` block.
* Hive-based workflows: an `ENV ERIGON_EXEC3_PARALLEL=...` line is
appended to `clients/erigon/Dockerfile` during the existing sed-patch
loop so every hive-launched erigon inherits the toggle.
* Kurtosis-based workflows: a single upstream `build-erigon-image` job
builds `test/erigon:current-base` once for the whole matrix and uploads
it as a run-scoped artifact; each matrix entry downloads + `docker
load`s and adds a cheap `ENV ERIGON_EXEC3_PARALLEL=…` layer on top to
produce `test/erigon:current`. Avoids ~12 concurrent buildx jobs all
writing the same `type=gha` cache scope and 504-ing.
* Self-hosted single-pool workflows use `max-parallel: 1` to serialize
matrix entries cleanly when state on the runner box is shared (testbed
datadir, reference datadir, etc.).
* All artifact / enclave / testbed-dir names are disambiguated by
`exec_mode` so the two matrix entries don't clobber each other's
outputs.

## Review follow-ups (2026-05-15 rebuild)

After rebasing onto post-erigontech#21153 main, three workflow fixes from the
2026-05-13 review are applied as separate commits on top of the 4
trimmed CI commits:

| Commit | Fixes |
|---|---|
| `ci: gate parallel-suffix QA test_name on client==erigon` | yperbasis
Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel`
previously caused its Grafana `test_name` to gain a stray `-parallel-`
suffix, breaking historical dashboard queries |
| `ci: align test-hive devp2p sim-limit between serial and parallel
matrix legs` | yperbasis nit 5 + Copilot threads #3/#9 — both legs now
run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches
the long-standing comment) |
| `ci: fix Targetting typo in test-hive-eest.yml` | Copilot thread #6 —
s/Targetting/Targeting/ |

The yperbasis Blocker 1 (`cmd/integration/commands/stages.go:631`
ignores `ERIGON_` prefix, so `qa-stage-exec`'s serial entry silently
runs in parallel mode) is a Go change, raised as a separate small PR to
keep this one strictly `.github/` only.

## Test plan

This PR's CI run **is** the test. The 5 always-on workflows fire
automatically on PR. The 5 perf workflows auto-fire because their
`pull_request: paths` filter matches the workflow's own YAML — so
opening this PR triggers all 10 affected workflows.

What to look for in this PR's "Checks" tab:
- [x] Each affected workflow has both `serial` and `parallel` matrix
entries listed.
- [x] Job display names show the mode (e.g. `tests-mac-linux
(ubuntu-24.04, parallel)`).
- [x] Wall-clock for hosted-runner workflows is essentially unchanged
from main baseline (jobs ran concurrently on separate runners).
- [x] Self-hosted single-pool workflows show ~2× wall-clock (matrix
entries serialize).
- [x] Both modes pass on every workflow. **If serial fails where
parallel passes (or vice versa), that's a real regression we'd want to
catch — exactly the point of this change.**

After merge, regular PRs only pay the matrix cost on the 5 always-on
workflows; the 5 perf workflows go back to being dispatch /
schedule-only.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: taratorio <94537774+taratorio@users.noreply.github.com>
Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com>
Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: milen <taratorio@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants