Commit 348041b
execution: fix sporadic bal mismatches due to phantom accesses in system txns (erigontech#21654)
Fixes erigontech#21651 — non-deterministic parallel-exec BAL computation rejecting
canonical blocks as invalid and wedging the CL.
## Root cause
BAL address-access recording is started by `ibs.Prepare` (allocates a
fresh `addressAccess` map, sets `recordAccess = true`) and ended by the
`ibs.AccessedAddresses()` harvest at the end of `TxTask.Execute`. The
bug is that **only regular user transactions call `Prepare`** (reached
via `ApplyMessage` → `execution/protocol/txn_executor.go:423/590`),
while **system tasks don't** — yet system tasks still perform the
harvest:
- **block-init** (`TxIndex == -1`) runs `engine.Initialize` + syscalls
via `SysCallContract` — no `ApplyMessage`, no `Prepare`;
- **block-end** (`IsBlockEnd()`) executes no EVM code at all.
Both end in `TxTask.Execute`'s success-path harvest
(`result.AccessedAddresses = ibs.AccessedAddresses()`,
`execution/exec/txtask.go:624`). On a clean worker IBS that harvest
returns nil — `recordAccess` is false, so even the Initialize syscalls
record nothing, which is the spec-correct behavior (system-call touches
don't belong in the BAL). The system tasks therefore implicitly relied
on the worker IBS being pristine.
That assumption broke in combination with two other facts:
1. A conflict-aborted incarnation never reaches the harvest —
`TxTask.Execute` skips it when `result.Err != nil`
(`execution/exec/txtask.go:618`) — so the aborted tx's touched-address
set **and** `recordAccess = true` survive on the worker's long-lived
IBS. `IntraBlockState.Reset()`, called before every task, cleared all
other per-task state (versioned reads/writes, journal, EIP-2929 access
list, transient storage) but **not** `addressAccess`/`recordAccess`.
Regular txs were still immune because their `Prepare` overwrites the
residue; system tasks were not.
2. Workers are shared by all in-flight block executors of a batch
(single `pe.in` queue), so the system task that lands on the
residue-laden worker usually belongs to a **different block** than the
aborted tx. It scoops the residue as its own accesses, and `nextResult`
records them into its block's `blockIO`
(`execution/stagedsync/exec3_parallel.go:2301`).
`AsBlockAccessList` then creates an entry for every accessed address
(`execution/state/versionedio.go:1417`) → phantom **empty,
address-only** entries in the computed BAL. Which worker holds residue
and which block's system task lands on it is schedule-dependent → the
computed BAL hash differs across re-execution attempts of the same
block. This explains all field symptoms: extra empty accounts,
precompile-range extras, extras belonging to neighboring blocks'
canonical BALs, and computed BALs with more accounts than the stored
sidecar.
## Reproduction (before the fix)
Local glamsterdam-devnet-5, erigon @ `1ca634d4`
(`ERIGON_EXEC3_PARALLEL=true`) + prysm `glamsterdam-devnet-5`, full
resync from genesis. During catch-up batch execution the node rejected
canonical block **6638**:
- computed BAL: 340 accounts; canonical: 339
- the extra entry `0xd4093C4A57D6F952849D6bf47e9a1F6CDCa79b7a` is
address-only (no reads, no changes), absent from 6638's canonical BAL,
and legitimately active in a dozen neighboring blocks of the same batch
(6600–6619)
- the post-unwind re-execution of 6638 computed 339 accounts and passed
— non-determinism confirmed on the same block within one run
Same signature in 9 earlier mismatch dumps (blocks 1405–1430): every
extra address was an empty entry appearing canonically in *other* blocks
of the same batch.
## Fix
Clear `recordAccess`/`addressAccess` in `Reset()` — i.e. make `Reset()`
guarantee for every task what `Prepare` only guaranteed for user
transactions, so a task's `AccessedAddresses` can only contain addresses
touched during that task's own execution. This also covers any future
Prepare-less task type, not just block-init/block-end. Audited the full
`IntraBlockState` struct: this was the only accumulating field `Reset()`
missed; no caller relies on recording state surviving `Reset()`.
## Testing
- TDD: `TestAddressAccessResetInIBSReset` written first — red on old
code, green with the fix.
- `./execution/state/...` (incl. `-race`) and `./execution/exec/...`
pass.
- All engine-api BAL tests pass, incl.
`TestEngineApiBALParallelConsistencyStress` ×50 with `-race` (the canary
for this bug class).
- Devnet validation: full glamsterdam-devnet-5 resync from genesis with
the fixed binary — genesis → tip 11431 (delta 0 vs network), **0 BAL
mismatches, 0 invalid blocks**, through every previously-failing range
(960, 1405–1430, 5545–6320, 6638); at tip NewPayload/FCU return VALID
and prysm imports live blocks normally.
- `make lint` clean (×2).
---------
Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>1 parent aef33a8 commit 348041b
2 files changed
Lines changed: 23 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
372 | 372 | | |
373 | 373 | | |
374 | 374 | | |
| 375 | + | |
| 376 | + | |
375 | 377 | | |
376 | 378 | | |
377 | 379 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
7 | 7 | | |
8 | 8 | | |
9 | 9 | | |
| 10 | + | |
10 | 11 | | |
11 | 12 | | |
12 | 13 | | |
| |||
170 | 171 | | |
171 | 172 | | |
172 | 173 | | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
173 | 194 | | |
174 | 195 | | |
175 | 196 | | |
| |||
0 commit comments