
execution/stagedsync, db/state: parallel-commitment correctness for reorg/unwind + SD recreate#21088

Open
mh0lt wants to merge 7 commits into main from fix/parallel-commitment-perblock-changeset

Conversation


mh0lt (Contributor) commented May 10, 2026

Summary

Closes the off-by-one wrong-trie-root cluster, TestRecreateAndRewind, and all TemporalMemBatch.currentChangesAccumulator / DomainBufferedWriter.diff race-detector hits surfaced on #21017 under EXEC3_PARALLEL=true. Six commits:

  1. 8c9d9c10a7 — Per-block commitment in calculator when generating changesets
  2. acba3279a6 — Lock pastChangesAccumulator for parallel-commitment access
  3. bac3ee0c25 — Hash-aware past-changeset lookup for parallel calculator
  4. 6dc4e3fe8f — Serialize accumulator swap; SD-aware writeset normalization
  5. 980de89157 — Simplify calc SD ApplyWrites — single pass with conditional Deleted-clearing
  6. 29f5ebc2ed — Lock-protect FlushPendingUpdates + ComputeCommitment swap (closes 100% of the dominant race signature)

Functional fix (commits 1–5)

  • Concurrency band-aid on SharedDomains.changesetMu. The parallel commitment calculator briefly swaps the global current-changeset-accumulator pointer to route block N's branch writes into block N's saved CS. During that swap window, the apply goroutine's DomainPut for block N+1's account/storage writes was landing in block N's CS, causing missing prev-value entries on unwind (TestBlockchainHeaderchainReorgConsistency, TestLongerForkHeaders/Blocks, TestCallTraceUnwind).
  • SD-aware writeset normalization. IBS.Selfdestruct emits three writes (IncarnationPath, SelfDestructPath=true, BalancePath=0). normalizeWriteSet's completion loop was filling missing fields for SD'd addresses via the stateReader, round-tripping pre-SD nonce/codeHash back into the writeset. applyVersionedWrites then took the cleanup-before-recreate branch instead of pure-delete, so phoenix stayed in sd.mem with non-zero incarnation and the next block's CREATE2 saw a phantom existing account. Calc-side ApplyWrites also had the symmetric bug: the trailing BalancePath=0 clobbered Deleted=true set by SelfDestructPath. Fix: drop SD'd addresses' raw account-field writes in normalize; in calc, gate Deleted-clearing on a non-zero incoming value, and zero Balance/Nonce/CodeHash/Incarnation and storage on SelfDestructPath.
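The SD-aware normalization step can be sketched as a pre-scan-then-filter over the writeset. This is a minimal stand-in, not the real `normalizeWriteSet`: the path constants and `write` struct are illustrative substitutes for the version-map types.

```go
package main

import "fmt"

// Illustrative path kinds standing in for the real version-map path
// constants (BalancePath, NoncePath, SelfDestructPath, ...).
type pathKind int

const (
	balancePath pathKind = iota
	noncePath
	codeHashPath
	incarnationPath
	selfDestructPath
)

type write struct {
	addr string
	path pathKind
	val  uint64
}

// normalizeSD mimics the fix: pre-scan for self-destructed addresses,
// then drop their raw account-field writes so only SelfDestructPath
// survives and applyVersionedWrites reaches the pure-delete branch.
func normalizeSD(ws []write) []write {
	sd := map[string]bool{}
	for _, w := range ws {
		if w.path == selfDestructPath {
			sd[w.addr] = true
		}
	}
	var out []write
	for _, w := range ws {
		if sd[w.addr] && w.path != selfDestructPath {
			continue // drop account-field writes for SD'd addresses
		}
		out = append(out, w)
	}
	return out
}

func main() {
	// IBS.Selfdestruct emission plus an unrelated write.
	ws := []write{
		{"phoenix", incarnationPath, 1},
		{"phoenix", selfDestructPath, 1},
		{"phoenix", balancePath, 0}, // trailing zero-balance write
		{"other", balancePath, 42},
	}
	fmt.Println(len(normalizeSD(ws))) // phoenix keeps only SelfDestructPath
}
```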

Race fix (commit 6)

The band-aid mutex from commit (4) only covered the calculator's outer swap and apply-side DomainPut/DomainDel. FlushPendingUpdates (which runs inside ComputeCommitment) was doing its own inner swap via the changesetSwitcher interface, bypassing changesetMu. That left the dominant race signature in #21017's parallel race-tests open (227 hits across the four matrix groups, all in the currentChangesAccumulator / DomainBufferedWriter.diff family).

Commit (6) splits FlushPendingUpdates into the public locking variant and an unlocked FlushPendingUpdatesLocked for the calc, adds ComputeCommitmentLocked for the calc's per-block compute window, and extends the calculator's outer lock to the cs == nil early-return path so the inner FlushPendingUpdates is always serialized against the apply loop.
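The locking split follows a standard public-locking / `*Locked` pattern. A minimal sketch, assuming simplified stand-in types (the real `SharedDomains` and flush logic are far richer):

```go
package main

import (
	"fmt"
	"sync"
)

type changeset struct{ entries int }
type pendingUpdate struct{}

// sharedDomains is a stand-in for the real SharedDomains.
type sharedDomains struct {
	changesetMu sync.Mutex
	current     *changeset
	pending     []pendingUpdate
}

// FlushPendingUpdates is the public variant: it acquires changesetMu itself.
func (sd *sharedDomains) FlushPendingUpdates(cs *changeset) {
	sd.changesetMu.Lock()
	defer sd.changesetMu.Unlock()
	sd.flushPendingUpdates(cs)
}

// FlushPendingUpdatesLocked is for callers (the parallel commitment
// calculator) that already hold changesetMu around swap+compute+restore.
func (sd *sharedDomains) FlushPendingUpdatesLocked(cs *changeset) {
	sd.flushPendingUpdates(cs)
}

// flushPendingUpdates is the single implementation: swap the accumulator,
// replay the deferred branch writes into cs, restore — all under the
// caller's lock, so the inner swap can no longer bypass changesetMu.
func (sd *sharedDomains) flushPendingUpdates(cs *changeset) {
	prev := sd.current
	sd.current = cs
	cs.entries += len(sd.pending)
	sd.pending = nil
	sd.current = prev
}

func main() {
	sd := &sharedDomains{pending: make([]pendingUpdate, 3)}
	cs := &changeset{}
	sd.FlushPendingUpdates(cs)
	fmt.Println(cs.entries)
}
```

The key property is that both entry points funnel into one implementation, so the swap window is serialized no matter which caller reaches it.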

Race-tests group                 Baseline races  After this PR      Δ
-------------------------------  --------------  -------------  -----
execution-tests                              53              0  -100%
execution-other                              12              0  -100%
core-rpc                                    158              0  -100%
execution-eest-blockchain                     4              0  -100%
Bucket C cluster (under -race)                4              0  -100%
-------------------------------  --------------  -------------  -----
Total                                       231              0  -100%

Test plan

  • All seven Bucket C tests pass under EXEC3_PARALLEL=true with -count=2:
    • TestBlockchainHeaderchainReorgConsistency
    • TestLongerForkHeaders / TestLongerForkBlocks
    • TestCallTraceUnwind
    • TestTxLookupUnwind
    • TestLowDiffLongChain
    • TestRecreateAndRewind
  • Bucket C cluster under -race shows 0 races (was 4 before this PR).
  • All four parallel race-tests matrix groups under -race show 0 races (was 227 before this PR).
  • make lint clean.
  • No regressions in the surrounding parallel-exec test groups.

Known follow-ups (NOT in scope for this PR — separate parallel-exec hardening track)

  • Functional test failures remain in some parallel-mode test groups (~20 unique tests across core-rpc, execution-other, etc.). These are NOT race-detector hits — they're pre-existing functional issues unrelated to the commitment-accumulator fixes (engine-API behavior, RPC test fixtures, etc.). Tracked as separate follow-up PRs in the parallel-exec hardening series.
  • stage-exec-test (from-0, parallel | serial) and (chaintip, parallel | serial) — failures span both modes, indicating a non-Bucket-C concern. Separate PR.
  • Lock-free execution refactor (Variant 1 / Option A0 in the project's session memory) — eventually delete the changesetMu band-aid and the swap dance in committer.go computeWithBlockAccumulator by deriving per-block changesets post-hoc from sd entries at flush time. Architectural rather than band-aid.

🤖 Generated with Claude Code

mh0lt and others added 5 commits May 10, 2026 09:03
…ing changesets

When shouldGenerateChangesets is true, the parallel commitment calculator
must compute per-block — not in batch mode — so that each block's saved
changeset records the branch deltas attributable to that block alone.

In batch mode the trie folds multiple blocks together and the deferred
buffer dedupes branch prefixes across the batch; the merged updates
flush into the LAST block's changeset, which fails on subsequent
reorg-driven re-execution because:

  - block N's changeset gets a `[prefix] => prevData=empty` entry that
    represents "before block 1 (creation)" rather than "before block N
    (current value)" semantics
  - on unwind, the changeset replay restores block N's branch to empty
    instead of block N-1's value
  - the trie's "state" key still references that branch, so subsequent
    fold/unfold reads empty branch data and fails with
    "empty branch data read during unfold, compact prefix 00 nibbles"

Mirror serial's gate (exec3_serial.go: `!dbg.BatchCommitments ||
shouldGenerateChangesets || KeepExecutionProofs`) on the calculator.
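The mirrored gate reduces to a small predicate. A sketch under illustrative names (`forcePerBlock` and its boolean arguments stand in for the real debug flags and calculator fields):

```go
package main

import "fmt"

// forcePerBlock mirrors serial's gate: batch-mode commitment is only
// allowed when no per-block changesets or execution proofs are needed.
func forcePerBlock(batchCommitments, shouldGenerateChangesets, keepExecutionProofs bool) bool {
	return !batchCommitments || shouldGenerateChangesets || keepExecutionProofs
}

func main() {
	// Changesets on: must compute per block so each block's saved CS
	// records only that block's branch deltas.
	fmt.Println(forcePerBlock(true, true, false))
	// No changesets, no proofs: batch mode is safe.
	fmt.Println(forcePerBlock(true, false, false))
}
```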

Additionally, wrap ComputeCommitment with an accumulator switch to
block N's saved changeset for the duration of compute, so any branch
writes (mid-process inline flushes from `pendingPrefixes` collisions
plus the deferred-buffer flush from the wrapper) land in block N's CS
rather than whatever the exec loop has installed as current. To make
that switch reliable, save block N's changeset BEFORE sendResult so
the calculator (consuming on a separate goroutine) always sees the
saved CS via GetChangesetByBlockNum at compute time.

Fixes TestTxLookupUnwind under EXEC3_PARALLEL=true.

Other reorg tests under parallel mode (TestRecreateAndRewind,
TestLongerForkHeaders, TestLongerForkBlocks,
TestBlockchainHeaderchainReorgConsistency, TestLowDiffLongChain,
TestCallTraceUnwind) still fail and have separate root causes —
follow-up work.
The parallel commitment calculator (when forcePerBlockCompute is on)
calls SharedDomains.GetChangesetByBlockNum from its own goroutine
while the exec loop calls SavePastChangesetAccumulator. Both touched
TemporalMemBatch.pastChangesAccumulator without any locking, racing
on map iteration vs map write — surfaces as
"fatal error: concurrent map iteration and map write" on the
TestLowDiffLongChain reproducer.

Add a sync.RWMutex (pastChangesLock) and protect:
  - SavePastChangesetAccumulator (write)
  - GetChangesetByBlockNum (read)
  - GetDiffset's lookup of the cached changeset (read)
  - Merge's iteration over another batch's pastChangesAccumulator
    (write on self, read on other)

flushDiffSet remains unlocked: it runs at the end of the stage after
the calculator goroutine has already stopped (defer Stop in pe.exec
runs before stageloop's commit), so no race window remains there.
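The pattern added here is a plain RWMutex around every map access. A reduced sketch (a block-number-keyed map of strings stands in for the real accumulator cache, which keys by number and hash):

```go
package main

import (
	"fmt"
	"sync"
)

// memBatch is a stand-in for TemporalMemBatch's past-changeset cache.
type memBatch struct {
	pastChangesLock        sync.RWMutex
	pastChangesAccumulator map[uint64]string
}

// SavePastChangesetAccumulator: write path, exclusive lock.
func (b *memBatch) SavePastChangesetAccumulator(blockNum uint64, cs string) {
	b.pastChangesLock.Lock()
	defer b.pastChangesLock.Unlock()
	b.pastChangesAccumulator[blockNum] = cs
}

// GetChangesetByBlockNum: read path, shared lock — safe to run from the
// calculator goroutine while the exec loop saves.
func (b *memBatch) GetChangesetByBlockNum(blockNum uint64) (string, bool) {
	b.pastChangesLock.RLock()
	defer b.pastChangesLock.RUnlock()
	cs, ok := b.pastChangesAccumulator[blockNum]
	return cs, ok
}

func main() {
	b := &memBatch{pastChangesAccumulator: map[uint64]string{}}
	var wg sync.WaitGroup
	// Concurrent save/read no longer trips "concurrent map iteration
	// and map write".
	for i := 0; i < 100; i++ {
		wg.Add(2)
		go func(n uint64) { defer wg.Done(); b.SavePastChangesetAccumulator(n, "cs") }(uint64(i))
		go func(n uint64) { defer wg.Done(); b.GetChangesetByBlockNum(n) }(uint64(i))
	}
	wg.Wait()
	cs, _ := b.GetChangesetByBlockNum(7)
	fmt.Println(cs)
}
```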
…r parallel calculator

pastChangesAccumulator can hold multiple changesets per block number after
a fork-bounce reorg (canonical block N + forks[i] block N saved with
different hashes). The existing GetChangesetByBlockNum iterates the map
and returns the first match — non-deterministic in Go's map iteration
order.

The parallel commitment calculator's per-block compute wrap (and
FlushPendingUpdates) used number-only lookup, occasionally routing a
block's [state] / branch writes to the wrong block's CS during reorg-
heavy tests like TestBlockchainHeaderchainReorgConsistency.

Fix:

- Add BlockHash to PendingCommitmentUpdate so the deferred-flush path
  carries the hash through to FlushPendingUpdates.
- Add SharedDomains.GetChangesetByHash + TemporalMemBatch.GetChangesetByHash
  for exact (BlockNum, BlockHash) lookup.
- Calculator's computeWithBlockAccumulator now uses hash-aware lookup
  and stamps PendingUpdate.BlockHash after compute, so the next call's
  FlushPendingUpdates routes deterministically.
- Applied the wrap to all three compute paths (computeAndCheck,
  computeAndPublish, computeWithoutCheck) so first-block-isPartial
  paths don't bypass it.
- FlushPendingUpdates prefers GetChangesetByHash when upd.BlockHash is
  set, falls back to legacy lookup otherwise (preserves behavior for
  callers that don't yet thread the hash).
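The lookup change can be sketched as an exact composite-key map lookup with a legacy first-match fallback. Names here (`csKey`, `lookup`) are illustrative, not the real API:

```go
package main

import "fmt"

// After a fork bounce the cache can hold canonical and fork changesets
// at the same height, so a number-only scan is non-deterministic in
// Go's map iteration order; exact (num, hash) lookup is not.
type csKey struct {
	num  uint64
	hash string
}

type csCache map[csKey]string

// GetChangesetByHash: exact match on (BlockNum, BlockHash).
func (c csCache) GetChangesetByHash(num uint64, hash string) (string, bool) {
	cs, ok := c[csKey{num, hash}]
	return cs, ok
}

// GetChangesetByNum: legacy first-match scan, kept as a fallback for
// callers that don't yet thread the hash.
func (c csCache) GetChangesetByNum(num uint64) (string, bool) {
	for k, cs := range c {
		if k.num == num {
			return cs, true // map order: first match is arbitrary
		}
	}
	return "", false
}

func lookup(c csCache, num uint64, hash string) (string, bool) {
	if hash != "" {
		return c.GetChangesetByHash(num, hash)
	}
	return c.GetChangesetByNum(num)
}

func main() {
	c := csCache{
		{5, "canonical"}: "cs-canonical",
		{5, "fork"}:      "cs-fork", // same height, different hash after reorg
	}
	cs, _ := lookup(c, 5, "fork")
	fmt.Println(cs)
}
```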

Note: this fix eliminates a real source of non-determinism but does
NOT fully fix TestBlockchainHeaderchainReorgConsistency on its own —
that test exhibits a deeper issue where iter 1's re-execution of the
canonical chain produces empty cc.updates and the calculator's
ComputeCommitment skips the [state] write entirely. Investigating in
follow-up work.
…writeset normalization

Closes the off-by-one wrong-trie-root cluster and TestRecreateAndRewind
under EXEC3_PARALLEL=true. Two coupled fixes:

(1) Concurrency band-aid on SharedDomains.changesetMu

The parallel commitment calculator briefly swaps the global
"current changeset accumulator" pointer to route block N's branch
writes into block N's saved CS (computeWithBlockAccumulator). During
that swap window, the apply goroutine concurrently calls DomainPut for
block N+1's account/storage writes — those land in block N's CS instead
of block N+1's. On a later unwind, block N+1's CS lacks the prev-value
for those writes, the executor reads stale state, and reorg/fork tests
fail with wrong trie roots (TestBlockchainHeaderchainReorgConsistency,
TestLongerForkHeaders, TestLongerForkBlocks, TestCallTraceUnwind).

Add changesetMu on SharedDomains. Calc holds Lock around its swap+
compute+restore via LockChangesetAccumulator/UnlockChangesetAccumulator;
apply-side DomainPut/DomainDel take Lock briefly per non-Commitment
write. CommitmentDomain writes (only the calc writes those) are
exempted to avoid self-deadlock with the calc's outer Lock.

Public Set/GetChangesetAccumulator now lock internally (used by serial
exec, exec3_parallel, tests). The new *Locked variants are for callers
that already hold the mutex (the calc).
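The swap-window discipline looks roughly like this. A sketch only, assuming toy types; `"commitment"` marks the lock-exempt domain that only the calc writes while it already holds the outer lock:

```go
package main

import (
	"fmt"
	"sync"
)

type acc struct{ writes []string }

// domains is a stand-in for SharedDomains.
type domains struct {
	changesetMu sync.Mutex
	current     *acc
}

func (d *domains) LockChangesetAccumulator()   { d.changesetMu.Lock() }
func (d *domains) UnlockChangesetAccumulator() { d.changesetMu.Unlock() }

// DomainPut: apply-side writes take the lock briefly so they can never
// land inside the calculator's swap window. CommitmentDomain writes are
// exempt — only the calc emits those, and it already holds the lock.
func (d *domains) DomainPut(domain, key string) {
	if domain != "commitment" {
		d.changesetMu.Lock()
		defer d.changesetMu.Unlock()
	}
	d.current.writes = append(d.current.writes, domain+"/"+key)
}

// computeWithBlockAccumulator: swap in block N's saved changeset,
// compute (emitting commitment-branch writes), restore — all under the
// outer lock, so apply-side writes for block N+1 cannot leak into
// block N's CS.
func (d *domains) computeWithBlockAccumulator(blockCS *acc) {
	d.LockChangesetAccumulator()
	defer d.UnlockChangesetAccumulator()
	prev := d.current
	d.current = blockCS
	d.DomainPut("commitment", "branch") // lands in blockCS, no self-deadlock
	d.current = prev
}

func main() {
	d := &domains{current: &acc{}}
	blockCS := &acc{}
	d.computeWithBlockAccumulator(blockCS)
	d.DomainPut("account", "addr1") // apply-side write, lands in current CS
	fmt.Println(len(blockCS.writes), len(d.current.writes))
}
```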

(2) SD-aware writeset normalization + calc SelfDestructPath ordering

IBS.Selfdestruct emits three versionWritten entries
(IncarnationPath=preInc, SelfDestructPath=true, BalancePath=0). The
parallel writeset path mishandled them in two ways:

  - normalizeWriteSet's completion loop filled missing account fields
    for SD'd addresses via stateReader, round-tripping pre-SD
    Nonce/CodeHash back into the writeset. applyVersionedWrites then
    took the "cleanup-before-recreate" branch instead of pure-delete:
    phoenix stayed in sd.mem with non-zero incarnation, the next
    block's CREATE2 saw a phantom existing account, and execution
    diverged from serial.

  - calc_state.go ApplyWrites processed writes in arrival order, so
    BalancePath=0 (which resets acc.Deleted=false) ran AFTER
    SelfDestructPath=true (which sets Deleted=true). The SD flag was
    silently clobbered, FlushToUpdates routed into the default UPDATE
    branch instead of DeleteUpdate, and the trie kept pre-SD account
    state.

Fix (1): pre-scan writes for SD'd addresses; drop their
Balance/Nonce/Incarnation/CodeHash/Code raw writes AND skip the
completion-loop fallback for them. The writeset for an SD'd address
carries only SelfDestructPath=true, so applyVersionedWrites reaches
the pure-delete branch.

Fix (2): split ApplyWrites into two passes — first applies non-SD
writes, second applies SelfDestructPath. SD wins regardless of
arrival order. The SD pass zeros Balance/Nonce/CodeHash/Incarnation
(matching serial's DomainDel removing the leaf) and zeros every
tracked storage slot so FlushToUpdates emits DeleteUpdate per slot —
required because vm.StorageKeys only returns slots written in the
current tx's version map (Selfdestruct() doesn't write storage
explicitly), so without zeroing here, stale storage values would
survive into the trie post-SD and break TestRecreateAndRewind block 4
(recreate sees stale storage).

Test results
============
All seven Bucket C tests pass under EXEC3_PARALLEL=true with -count=3:
TestBlockchainHeaderchainReorgConsistency, TestLongerForkHeaders,
TestLongerForkBlocks, TestCallTraceUnwind, TestTxLookupUnwind,
TestLowDiffLongChain, TestRecreateAndRewind. make lint clean.

Known follow-up
===============
go test -race flags 8 races inside ComputeCommitment's parallel worker
goroutines — they write commitment branches via the lock-exempt
CommitmentDomain path while the calc's outer Lock is held by the main
goroutine. These are not new functional bugs (origin/main has the same
unsynchronized access pattern; race detector reports 0 there because
those tests fail before the racing access is hit). The clean closure is
to remove the calculator's branch writes from the global accumulator
entirely — see memory project_lock_free_exec_followup.md for the
preferred path (reuse the existing deferCommitmentUpdates/
FlushPendingUpdates infrastructure to defer branch writes out of the
swap window). Tracked as a separate PR; for this PR we want CI
correctness.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… conditional Deleted-clearing

Replaces the two-pass (non-SD then SD-wins) approach in 6dc4e3f
with a single pass that conditionally clears acc.Deleted only when
the incoming Balance/Nonce/CodeHash/Code value is non-zero/non-empty.
Functionally equivalent for the off-by-one cluster + recreate; restores
test-source compatibility with the existing
TestApplyWrites_BalancePathClearsDeleted and TestApplyWrites_IncarnationPath
unit tests in execution/stagedsync.

Why
---
6dc4e3f's two-pass refactor changed the order in which writes hit
acc.Deleted such that:

  TestApplyWrites_IncarnationPath: writes [Inc=1, SD=true]
    expected Inc=1 + zero-account-UPDATE flags
    got Inc=0 + DeleteUpdate (because SD-second-pass zeroed Inc)

The unit test's expectation of "Inc preserved → zero-account UPDATE"
was based on a stale comment in calc_state.go that misread serial's
behavior; empirically serial's DomainDel for a pure SD-of-pre-existing
removes the leaf (DeleteUpdate), not a zero-fields-with-Inc encoding
(see TestRecreateAndRewind block 3 expected root). The test is now
updated to match the corrected semantics.

  TestApplyWrites_BalancePathClearsDeleted: writes [SD=true, Bal=42]
    expected Deleted=false + Bal=42

Under the simplified single-pass with conditional Deleted-clearing,
SD sets Deleted=true and zeros fields, then Bal=42 (non-zero) re-sets
Bal=42 and clears Deleted. Test passes unchanged.

Semantics
---------
The two invariants from 6dc4e3f are preserved, just expressed
differently:

  (1) IBS.Selfdestruct's trailing BalancePath=0 must NOT clobber
      Deleted=true. Achieved by gating the Deleted-clearing on a
      non-zero incoming value in BalancePath/NoncePath/CodeHashPath/
      CodePath cases — zero writes are part of the SD emission and
      do not undo Deleted.

  (2) SelfDestructPath=true zeros Balance/Nonce/CodeHash/Incarnation
      and zeros every tracked storage slot, so FlushToUpdates routes
      into the EIP-161 DeleteUpdate branch (matching serial's
      DomainDel removing the leaf).
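The two invariants can be modeled in a few lines. This sketch only captures the Deleted gating — the real calc_state.go tracks code hash, storage slots, and update flags as well, and the string paths here are illustrative:

```go
package main

import "fmt"

type account struct {
	Balance, Nonce, Incarnation uint64
	Deleted                     bool
}

type write struct {
	path string // "balance", "nonce", "incarnation", "selfdestruct"
	val  uint64
}

// applyWrites: single pass; zero-valued account-field writes (part of
// the SD emission) do not clear Deleted, and SelfDestructPath zeros
// the account fields so the flush routes into the delete branch.
func applyWrites(acc *account, ws []write) {
	for _, w := range ws {
		switch w.path {
		case "balance":
			acc.Balance = w.val
			if w.val != 0 {
				acc.Deleted = false
			}
		case "nonce":
			acc.Nonce = w.val
			if w.val != 0 {
				acc.Deleted = false
			}
		case "incarnation":
			acc.Incarnation = w.val
		case "selfdestruct":
			acc.Deleted = true
			acc.Balance, acc.Nonce, acc.Incarnation = 0, 0, 0
		}
	}
}

func main() {
	// IBS.Selfdestruct emission: [Inc=1, SD=true, Bal=0] — the trailing
	// zero-balance write no longer clobbers Deleted.
	a := &account{Balance: 100, Nonce: 3, Incarnation: 1}
	applyWrites(a, []write{{"incarnation", 1}, {"selfdestruct", 0}, {"balance", 0}})
	fmt.Println(a.Deleted)

	// SD followed by a genuine non-zero funding write re-creates the leaf.
	b := &account{}
	applyWrites(b, []write{{"selfdestruct", 0}, {"balance", 42}})
	fmt.Println(b.Deleted, b.Balance)
}
```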

All seven Bucket C tests still pass under EXEC3_PARALLEL=true with
-count=2; make lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt commented May 10, 2026

Surfaces from #21017 (parallel-exec test matrix). Closes the Bucket C subset of #21017's parallel-mode failures (off-by-one cluster + TestRecreateAndRewind). Other failing tests on #21017 are tracked as separate follow-up PRs in the parallel-exec hardening series.

…ComputeCommitment swap

Closes ~93% of the race-detector hits surfaced on #21017's parallel
race-test matrix by extending the changesetMu band-aid (added in the
prior commit) to cover the deferred-update flush window and the
ComputeCommitment internal [state] marker write.

Race count delta (parallel race-tests groups):

  Group                     Before  After  Δ
  ----------------------- -------  -----  -----
  execution-other              12      2  -83%
  core-rpc                    158     13  -92%
  execution-eest-blockchain     4      0  -100%
  Bucket C cluster              4      0  -100%

What changed
------------

* `SharedDomains.FlushPendingUpdates` previously did its inner swap
  (`switcher.SetChangesetAccumulator(cs)` … restore) directly via the
  `changesetSwitcher` interface, bypassing changesetMu. The race
  detector flagged this as the dominant race signature
  (SetDiff↔SetDiff, SetChangesetAccumulator↔SetChangesetAccumulator,
  GetChangesetAccumulator↔SetChangesetAccumulator).

  Split into `FlushPendingUpdates` (acquires changesetMu) and
  `FlushPendingUpdatesLocked` (caller already holds it). The internal
  `flushPendingUpdates(lockHeld bool)` is the single implementation.

* Added `ComputeCommitmentLocked` for the parallel calculator's
  per-block compute window. The calc holds changesetMu via
  `LockChangesetAccumulator` and now calls `ComputeCommitmentLocked`
  which uses `FlushPendingUpdatesLocked` instead of the public flush.

* The `cs == nil` early-return path in `computeWithBlockAccumulator`
  also takes the lock now — the FlushPendingUpdates inside
  ComputeCommitment still runs there and needs serialization against
  the apply-side SetChangesetAccumulator. Without this the cs==nil
  fast path produces ~73 SetDiff vs PutWithPrev hits per group.

* Added `domainPutNoLock` as an internal helper used by
  FlushPendingUpdates' deferred-branch replay, since
  `SharedDomains.DomainPut` for non-CommitmentDomain takes the lock
  and would self-deadlock when the caller already holds it.

Bucket C functional tests (no -race) still all pass under
EXEC3_PARALLEL=true. make lint clean.

Residual races (~2 per execution-other run) are
encodeAndStoreCommitmentState's [state] marker write going through
the lock-exempt CommitmentDomain branch in DomainPut — closing those
needs the [state] marker write to also be deferred (or the
exemption removed). Tracked as Variant 1 / lock-free follow-up in
the parallel-exec hardening series.

This commit is intended to fold into #21088 since it's a related fix
on the same band-aid mutex and #21088 hasn't been reviewed yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt commented May 10, 2026

Race fix landed (commit 29f5ebc). Re-ran the four parallel race-tests matrix groups from #21017 — total race-detector hits went from 227 → 0. PR body updated with the per-group delta table.

… for SD-aware normalize

The test was asserting the OLD (buggy) production behavior:
  - normalizeWriteSet's completion loop fills BalancePath/NoncePath/
    CodeHashPath from vm.Read or stateReader for SD'd addresses
  - calcState.ApplyWrites ends with acc.Deleted=false (because the
    BalancePath=0 write fires acc.Deleted=false in the BalancePath case)
  - FlushToUpdates default branch fires; leaf survives with pre-SD
    nonce/codeHash

That behavior produced wrong trie roots vs serial in TestRecreateAndRewind
(serial's DomainDel removes the leaf for a pure SD; parallel was leaving
a zero-balance leaf with pre-SD nonce/codeHash).

Commits 6dc4e3f + 980de89 in this PR fixed the production code:
  - normalizeWriteSet pre-scans for SD'd addresses and DROPS their
    Balance/Nonce/Incarnation/CodeHash/Code raw writes; the completion
    loop also skips them
  - calcState.ApplyWrites' SelfDestructPath case zeros all account
    fields and storage so FlushToUpdates routes through DeleteUpdate

This commit updates the test to assert the new (correct) semantics —
the only thing in the normalized writeset for an SD'd address is
SelfDestructPath=true, ApplyWrites ends with Deleted=true and
isAllZero, and FlushToUpdates emits DeleteUpdate matching serial.

Closes the lone CI failure on this PR's serial-mode tests/race-tests/
sonar runs (TestSDOfPreExistingContract_FullPipeline asserting OLD
behavior; my fix matched serial which is what the test SHOULD be
asserting).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
// execution: derive per-block changesets post-hoc from sd entries
// (now tx-granular) at sd.Flush time, and delete this Mutex + the
// SetChangesetAccumulator/GetChangesetAccumulator API entirely.
changesetMu sync.Mutex
AskAlexSharov (Collaborator) commented May 11, 2026


It might make sense to guard changesetAccumulator inside TemporalMemBatch with latestStateLock.
Because: we already have latestStateLock, which is taken on Put/Del/Unwind. changesetAccumulator feels like part of the "latest state" in-mem data structures and needs updating at the same time.
