Skip to content

Commit 348041b

Browse files
taratorioyperbasis
andauthored
execution: fix sporadic bal mismatches due to phantom accesses in system txns (erigontech#21654)
Fixes erigontech#21651 — non-deterministic parallel-exec BAL computation rejecting canonical blocks as invalid and wedging the CL. ## Root cause BAL address-access recording is started by `ibs.Prepare` (allocates a fresh `addressAccess` map, sets `recordAccess = true`) and ended by the `ibs.AccessedAddresses()` harvest at the end of `TxTask.Execute`. The bug is that **only regular user transactions call `Prepare`** (reached via `ApplyMessage` → `execution/protocol/txn_executor.go:423/590`), while **system tasks don't** — yet system tasks still perform the harvest: - **block-init** (`TxIndex == -1`) runs `engine.Initialize` + syscalls via `SysCallContract` — no `ApplyMessage`, no `Prepare`; - **block-end** (`IsBlockEnd()`) executes no EVM code at all. Both end in `TxTask.Execute`'s success-path harvest (`result.AccessedAddresses = ibs.AccessedAddresses()`, `execution/exec/txtask.go:624`). On a clean worker IBS that harvest returns nil — `recordAccess` is false, so even the Initialize syscalls record nothing, which is the spec-correct behavior (system-call touches don't belong in the BAL). The system tasks therefore implicitly relied on the worker IBS being pristine. That assumption broke in combination with two other facts: 1. A conflict-aborted incarnation never reaches the harvest — `TxTask.Execute` skips it when `result.Err != nil` (`execution/exec/txtask.go:618`) — so the aborted tx's touched-address set **and** `recordAccess = true` survive on the worker's long-lived IBS. `IntraBlockState.Reset()`, called before every task, cleared all other per-task state (versioned reads/writes, journal, EIP-2929 access list, transient storage) but **not** `addressAccess`/`recordAccess`. Regular txs were still immune because their `Prepare` overwrites the residue; system tasks were not. 2. Workers are shared by all in-flight block executors of a batch (single `pe.in` queue), so the system task that lands on the residue-laden worker usually belongs to a **different block** than the aborted tx. It scoops the residue as its own accesses, and `nextResult` records them into its block's `blockIO` (`execution/stagedsync/exec3_parallel.go:2301`). `AsBlockAccessList` then creates an entry for every accessed address (`execution/state/versionedio.go:1417`) → phantom **empty, address-only** entries in the computed BAL. Which worker holds residue and which block's system task lands on it is schedule-dependent → the computed BAL hash differs across re-execution attempts of the same block. This explains all field symptoms: extra empty accounts, precompile-range extras, extras belonging to neighboring blocks' canonical BALs, and computed BALs with more accounts than the stored sidecar. ## Reproduction (before the fix) Local glamsterdam-devnet-5, erigon @ `1ca634d4` (`ERIGON_EXEC3_PARALLEL=true`) + prysm `glamsterdam-devnet-5`, full resync from genesis. During catch-up batch execution the node rejected canonical block **6638**: - computed BAL: 340 accounts; canonical: 339 - the extra entry `0xd4093C4A57D6F952849D6bf47e9a1F6CDCa79b7a` is address-only (no reads, no changes), absent from 6638's canonical BAL, and legitimately active in a dozen neighboring blocks of the same batch (6600–6619) - the post-unwind re-execution of 6638 computed 339 accounts and passed — non-determinism confirmed on the same block within one run Same signature in 9 earlier mismatch dumps (blocks 1405–1430): every extra address was an empty entry appearing canonically in *other* blocks of the same batch. ## Fix Clear `recordAccess`/`addressAccess` in `Reset()` — i.e. make `Reset()` guarantee for every task what `Prepare` only guaranteed for user transactions, so a task's `AccessedAddresses` can only contain addresses touched during that task's own execution. This also covers any future Prepare-less task type, not just block-init/block-end. Audited the full `IntraBlockState` struct: this was the only accumulating field `Reset()` missed; no caller relies on recording state surviving `Reset()`. ## Testing - TDD: `TestAddressAccessResetInIBSReset` written first — red on old code, green with the fix. - `./execution/state/...` (incl. `-race`) and `./execution/exec/...` pass. - All engine-api BAL tests pass, incl. `TestEngineApiBALParallelConsistencyStress` ×50 with `-race` (the canary for this bug class). - Devnet validation: full glamsterdam-devnet-5 resync from genesis with the fixed binary — genesis → tip 11431 (delta 0 vs network), **0 BAL mismatches, 0 invalid blocks**, through every previously-failing range (960, 1405–1430, 5545–6320, 6638); at tip NewPayload/FCU return VALID and prysm imports live blocks normally. - `make lint` clean (×2). --------- Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
1 parent aef33a8 commit 348041b

2 files changed

Lines changed: 23 additions & 0 deletions

File tree

execution/state/intra_block_state.go

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -372,6 +372,8 @@ func (sdb *IntraBlockState) Reset() {
372372
// references are dropped, while the next execution lazily allocates fresh ones.
373373
sdb.versionedReads = nil
374374
sdb.versionedWrites = nil
375+
sdb.recordAccess = false
376+
sdb.addressAccess = nil
375377
sdb.accountReadDuration = 0
376378
sdb.accountReadCount = 0
377379
sdb.storageReadDuration = 0

execution/state/parallel_fixes_test.go

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ import (
77
"github.com/stretchr/testify/assert"
88
"github.com/stretchr/testify/require"
99

10+
"github.com/erigontech/erigon/execution/chain"
1011
"github.com/erigontech/erigon/execution/commitment"
1112
"github.com/erigontech/erigon/execution/types/accounts"
1213
)
@@ -170,6 +171,26 @@ func TestAccessListResetInIBSReset(t *testing.T) {
170171
assert.False(t, ibs.AddressInAccessList(testAddr), "Address should be cold after Reset")
171172
}
172173

174+
// TestAddressAccessResetInIBSReset verifies that IBS.Reset() clears BAL
175+
// address-access recording. An aborted incarnation never harvests
176+
// AccessedAddresses, and only regular txs call Prepare (which re-inits
177+
// recording) — a worker next assigned a system block-start/block-end
178+
// transaction never calls Prepare, so it would harvest the leftovers into
179+
// its own block's access list as phantom entries.
180+
func TestAddressAccessResetInIBSReset(t *testing.T) {
181+
ibs := New(nil)
182+
sender := accounts.InternAddress([20]byte{0x01})
183+
coinbase := accounts.InternAddress([20]byte{0x02})
184+
leaked := accounts.InternAddress([20]byte{0x42})
185+
// Prepare enables access recording at tx start.
186+
require.NoError(t, ibs.Prepare(&chain.Rules{}, sender, coinbase, accounts.NilAddress, nil, nil, nil))
187+
ibs.MarkAddressAccess(leaked, false)
188+
// Tx aborts: AccessedAddresses is never harvested. The worker resets
189+
// the shared IBS before the next task.
190+
ibs.Reset()
191+
assert.Empty(t, ibs.AccessedAddresses(), "no recorded accesses should survive Reset")
192+
}
193+
173194
// TestTransientStorageResetInIBSReset verifies that IBS.Reset() clears
174195
// transient storage (EIP-1153).
175196
func TestTransientStorageResetInIBSReset(t *testing.T) {

0 commit comments

Comments
 (0)