ci: matrix-test serial vs parallel exec across the test workflows by mh0lt · Pull Request #21017 · erigontech/erigon

mh0lt · 2026-05-06T17:39:42Z

⚠️ Outstanding prerequisite: #21088 — closes the off-by-one cluster (TestBlockchainHeaderchainReorgConsistency, TestLongerForkHeaders, TestLongerForkBlocks, TestCallTraceUnwind) and TestRecreateAndRewind on the parallel matrix.

Landed precursors:

✅ state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval #21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths

✅ execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit #21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit

✅ stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains #21010 (merged 2026-05-07) — commitmentCalculator.asOfReader.txNum=0 lazy-load fix

Once #21088 lands, this PR rebases and the matrix should run clean across the workflows that exercise the EL exec path. Pipeline-level parallel-mode races (forkchoice / senders / canonical-txnums) and from-0/chaintip stage-exec failures are tracked as separate follow-up PRs in the parallel-exec hardening series.

Summary

Adds an exec_mode: [serial, parallel] matrix axis to every CI test workflow that exercises the EL execution path so divergence between dbg.Exec3Parallel true and false is caught on the PR rather than after release.

The toggle is plumbed via ERIGON_EXEC3_PARALLEL — envLookup in common/dbg/dbg_env.go:38 auto-prepends the ERIGON_ prefix to the source-side EXEC3_PARALLEL flag in common/dbg/experiments.go:80.

Why

Exec3Parallel = false is currently the source default on main. With no CI workflow setting the env var, every CI run today exercises only the serial path. The parallel path lands via --bal chains and tests like experimentalBAL, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal.

Affected workflows

Always-on (matrix on every PR / dispatch / call):

Workflow	Runner	Parallelism	Cost
test-all-erigon.yml	GitHub-hosted (per-OS)	true parallel	wall-clock unchanged, +1× runner-min
test-all-erigon-race.yml	GitHub-hosted	true parallel	same
test-hive.yml	hive group	parallel if pool has slots	same (group is sized)
test-hive-eest.yml	hive group	parallel if pool has slots	same
test-kurtosis-assertoor.yml	`ubuntu-latest`	true parallel	same

Auto-fire on PRs touching their own YAML, dispatch otherwise (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation):

Workflow	Notes
test-bench.yml	`go bench`
qa-rpc-integration-tests-latest.yml	self-hosted single-pool, `max-parallel: 1` (shared testbed state)
qa-rpc-performance-comparison-tests.yml	erigon runs serial+parallel, geth single-mode
qa-txpool-performance-test.yml	kurtosis txpool, `max-parallel: 1`
qa-stage-exec.yml	3 mode_names × 2 exec_modes = 6 entries

Skipped — test-integration-caplin.yml runs cl/spectest only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests.

Plumbing details

Workflows that build erigon and run go tests directly: env var set on the test step's env: block.
Hive-based workflows: an ENV ERIGON_EXEC3_PARALLEL=... line is appended to clients/erigon/Dockerfile during the existing sed-patch loop so every hive-launched erigon inherits the toggle.
Kurtosis-based workflows: a small follow-up docker build adds an ENV layer on top of test/erigon:current-base to produce test/erigon:current with the toggle baked in. Cheap (single layer) and doesn't invalidate earlier build-cache layers.
Self-hosted single-pool workflows use max-parallel: 1 to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.).
All artifact / enclave / testbed-dir names are disambiguated by exec_mode so the two matrix entries don't clobber each other's outputs.

Test plan

This PR's CI run is the test, once #21088 has landed and this branch rebases on top. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their pull_request: paths filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows.

What to look for in this PR's "Checks" tab (after rebasing on the precursors):

Each affected workflow has both serial and parallel matrix entries listed.
Job display names show the mode (e.g. tests-mac-linux (ubuntu-24.04, parallel)).
Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners).
Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize).
Both modes pass on every workflow. If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.

After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only.

Adds an exec_mode: [serial, parallel] axis to every test workflow that exercises the EL execution path so divergence between dbg.Exec3Parallel true/false is caught on the PR rather than after release. The matrix entries spawn separate runners and run concurrently — wall-clock unchanged, runner-minutes ~2x for affected workflows. The toggle is plumbed via ERIGON_EXEC3_PARALLEL — envLookup in common/dbg/dbg_env.go auto-prepends the ERIGON_ prefix to the source- side EXEC3_PARALLEL flag in common/dbg/experiments.go. Affected workflows: Always-on (matrix on every PR / dispatch / call): test-all-erigon.yml — unit tests, hosted, true parallel test-all-erigon-race.yml — race detector, hosted, true parallel test-hive.yml — engine API + devp2p, hive group runners test-hive-eest.yml — EEST consume-engine/-rlp, hive group test-kurtosis-assertoor.yml — kurtosis devnet suites, hosted Auto-fire on PRs touching their own YAML, dispatch otherwise: test-bench.yml — go bench qa-rpc-integration-tests-latest.yml — self-hosted, max-parallel=1 qa-rpc-performance-comparison-tests.yml — erigon-vs-geth, geth one-mode qa-txpool-performance-test.yml — kurtosis txpool, max-parallel=1 qa-stage-exec.yml — 3 modes × 2 exec_modes = 6 Skipped: test-integration-caplin.yml — runs cl/spectest only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests. Plumbing notes: * For workflows that build erigon and run go tests directly, the env var is set on the test step's env: block. * For hive-based workflows (test-hive, test-hive-eest), an ENV ERIGON_EXEC3_PARALLEL=... line is appended to clients/erigon/Dockerfile during the existing sed patch loop so every hive-launched erigon inherits the toggle. * For kurtosis-based workflows (test-kurtosis-assertoor, qa-txpool-performance-test), a small follow-up docker build adds an ENV layer on top of test/erigon:current-base to produce test/erigon:current with the toggle baked in. Cheap (single layer) and doesn't invalidate earlier build-cache layers. * Self-hosted single-pool workflows use max-parallel=1 to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.). * All artifact / enclave / testbed-dir names are disambiguated by exec_mode so the two matrix entries don't clobber each other's outputs. The pull_request: paths: filter on the perf workflows means they auto-fire only on PRs that touch their own YAML — i.e. this PR triggers them once each so the change can be observed end-to-end, and future regular PRs that don't touch them stay free of the cost.

mriccobene · 2026-05-07T09:23:54Z

Here, we need to change the test name to distinguish between different results in our Hive UI.
For sequential execution, it is preferable to leave the name as it is, and change it for parallel execution by adding a suffix. This allows us to use a simple regular expression to filter results; for example, 'rpc-integration-tests-latest-parallel'.

Fixed in 406cb45 — test_name now suffixes -parallel only for the parallel matrix entry, serial keeps the existing name unchanged. Renders as rpc-integration-tests-latest (serial) / rpc-integration-tests-latest-parallel (parallel).

mriccobene · 2026-05-07T09:30:20Z

Here, again, we need to change the test name to distinguish between different results in our Grafana dashboard
For sequential execution, it is preferable to leave the name as it is, and change it for parallel execution by adding a suffix. This allows us to use a simple regular expression to filter results; for example, 'rpc-performance-test-latest-parallel-$method'.

Fixed in 406cb45 — --test_name for upload_test_results.py now expands to rpc-performance-test-latest-$method (serial) / rpc-performance-test-latest-parallel-$method (parallel). The HDR-analysis report's local test_name (line 340) stays as $client-$method since that only labels the PDF, which is already disambiguated by the artifact-name's -${{ matrix.exec_mode }} suffix — let me know if you want that one renamed too for consistency.

mh0lt · 2026-05-07T11:17:20Z

Cross-ref to #21032 — the precursor that fixes the parallel-exec wrong-trie-root bugs this matrix surfaces. Once #21032 lands, rebase this on top and the parallel hive sub-suites should go green.

@mriccobene

Addresses review feedback from @mriccobene on PR #21017: the QA dashboard filters results by test_name regex. Without a suffix the parallel and serial entries land under the same test_name and can't be distinguished in the Hive UI / Grafana dashboard. Per the reviewer's preferred convention, leave the serial entry's name unchanged and add `-parallel` only for the parallel matrix entry. Both `upload_test_results.py` invocations are updated: - qa-rpc-integration-tests-latest.yml:266 → rpc-integration-tests-latest[-parallel] - qa-rpc-performance-comparison-tests.yml:374 → rpc-performance-test-latest[-parallel]-$method The HDR-analysis PDF test_name (line 340 of the perf-comparison file) is intentionally left as `$client-$method` — that name only labels a local report file, which is already disambiguated by the artifact name's `-${{ matrix.exec_mode }}` suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt · 2026-05-07T12:58:18Z

Second precursor open: #21039 — fixes the parallel-mode BenchmarkFeeHistory failure (Bucket B). Once #21032 and #21039 merge I'll rebase. Bucket C (reorg state-visibility) investigation continuing in a third precursor.

…oop exit (erigontech#21039) ## Summary Drops the fallback clause in `applyLoopMissingBlocks` that turned every legitimate partial-batch exit into a spurious `InvalidBlock` error. When the parallel exec loop hits its size-limit mid-fixture, it returns nil with `reachedMaxBlock=false`. The previous completeness check flagged `maxBlockNum` as "missing" whenever `!reachedMaxBlock` — but the size-limit path is normal: the apply loop returns `ErrLoopExhausted`, the stage loop resumes the next batch from `lastBlockResult+1`, and each block still executes exactly once. No re-execution. ## Update — additional fix in commit `310a094514` (`execution/stagedsync: honor blockResult.Exhausted in exec loop`) Same class of partial-batch orchestration bug, on the **exec-loop side**. `executeBlocks` marks the final dispatched block with `Exhausted` when the per-cycle block limit fires, then exits its goroutine — without closing `pe.execRequests`, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for `execRequests` / `rws.ResultCh` / `ctx.Done` — none of which ever fire. The apply loop never got the channel-close signal it needs to return `ErrLoopExhausted`, so the original (apply-loop-side) fix in this PR couldn't even run. Reproduced as a chiado parallel-exec **silent stall** at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue erigontech#20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlying `ErrWrongTrieRoot` now surfaces cleanly through the apply loop and the original-issue symptom appears as expected. The new branch sits between the `maxBlockNum`-reached check and the `StopAfterBlock` debug exit, matching the precedence of the existing partial-batch exits. Tests: `TestExecLoopHonorsBlockResultExhausted` runs **two models** of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proves the test scaffolding genuinely surfaces the bug rather than passing vacuously. `TestExecLoopExhaustedOnlySetOnFinalBlock` locks in the dispatcher convention so future `executeBlocks` changes can't silently drop already-queued work by setting `Exhausted` mid-stream. ## Why Reproduces deterministically as the parallel-mode `BenchmarkFeeHistory` failure on PR erigontech#21017 (the CI matrix that runs every test under both `serial` and `parallel` exec modes): ``` apply loop exited (reachedMaxBlock=false lastBlockResult=114 maxBlockNum=200) but 1 block(s) had tx-results without a blockResult or were never delivered: [200] ``` 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms: - Batch 1: blocks 1..114, size-limit fires, `ErrLoopExhausted` - Batch 2: blocks 115..200, `reachedMaxBlock=true`, clean nil Sd.mem is not contaminated by in-flight future-block writes — `txResultBlocks` at exit was exactly `[1..114]`. Block 200 was flagged solely from the fallback clause. ## Shutdown sequence verified Tracing the deferred-close in `execLoop`: `commitResults` closes first, calculator drains and closes `rootResults`, apply loop sees both closes, drains, runs the completeness check, returns `ErrLoopExhausted`. `executorCancel` then cancels `execLoopCtx`, `executeBlocks` and workers exit, `pe.wait` joins. Three batches in `BenchmarkFeeHistory` run end-to-end with no goroutine leaks. ## Tests * `TestApplyLoopMissingBlocks` — adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" case * `TestApplyLoopPartialBatchReturnsErrLoopExhausted` — exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenarios * `TestApplyLoopChannelCloseOrder` — pins the load-bearing `commitResults`-before-`applyResults` close order * `TestExecLoopHonorsBlockResultExhausted` — orchestration test for the exec-loop side of the partial-batch exit (added in commit `310a094514`) * `TestExecLoopExhaustedOnlySetOnFinalBlock` — pins the executeBlocks dispatcher convention (added in commit `310a094514`) ## Test plan - [x] `BenchmarkFeeHistory/full/blocks=200` passes with `ERIGON_EXEC3_PARALLEL=true` - [x] `make lint` clean - [x] `go test -race ./execution/stagedsync/...` for new tests passes - [x] Existing `TestApplyLoopRootResultsCloseDoesNotRace`, `TestApplyLoopDoesNotHangAfterRootResultsClose`, `TestExecLoopExitCheck*` still pass - [x] Chiado `EXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archive` from block 0 reaches the underlying `ErrWrongTrieRoot` at block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires) ## CI fix linkage Precursor for erigontech#21017 (CI matrix `exec_mode={serial,parallel}`). Without this, `bench / benchmarks (parallel)` and any other workflow exercising a chain large enough to cross the batch-size boundary fails on first crossing. Companion to erigontech#21032 (incarnation/SD differentiator fix).

# Conflicts: # .github/workflows/test-hive.yml

mh0lt · 2026-05-08T15:21:55Z

Merged origin/main into the branch (commit 7f15ca7cea) to pick up the parallel-exec precursors that landed today: #21039 (apply-loop), #21062 (Set*Workers race), #21042 (stateCache wiring), #21010 (asOfReader.txNum=0 fix), and #21065 (devp2p discv5).

One conflict: .github/workflows/test-hive.yml — main bumped devp2p sim-limit from eth to eth|discv5 (#21065) at the same time this branch was expanding the row into serial+parallel exec_mode variants. Resolved by keeping both serial and parallel devp2p rows and applying eth|discv5 to both.

Still missing: #21032 (SD vs EIP-161 emptyRemoval wrong-root). That's the only remaining precursor; once it lands, this branch will rebase to pick it up and we'll have the full set. Pushing now to see how the parallel-mode hive sub-suites land with the today-precursors but without #21032 — expect SD-related sub-suites (rpc-compat selfdestruct, eest selfdestruct, cancun system-call address) to fail on the exec_mode: parallel rows; the rest should give a useful signal.

mh0lt · 2026-05-08T16:57:35Z

Merged origin/main (now including #21032 — landed at 16:51Z) into the branch as commit 1075b33acd. The full parallel-exec precursor stack is now in: #21039, #21042, #21046, #21062, #21010, #21065, #21032.

No conflicts on the merge — #21032's diff (calc_state.go, calc_state_test.go, exec3_parallel.go, rw_v3.go) is orthogonal to #21017's workflow-yaml-only diff.

Lint clean. CI now exercises the parallel-exec hive sub-suites with the SD/EIP-161 emptyRemoval fix in place — this is the meaningful go/no-go signal. Watching for parallel-mode rows on:

hive / test-hive (ethereum/engine, cancun)
hive / test-hive (ethereum/engine, withdrawals)
hive / test-hive (ethereum/rpc-compat, .*)
race-tests / tests-linux (ubuntu-latest, execution-eest-devnet)

…tyRemoval (erigontech#21032) ## Summary Fixes a wrong trie-root in the parallel commitment calculator when post-tx writesets are indistinguishable between two cases: 1. **SELFDESTRUCT of a pre-existing contract** — serial keeps the leaf with `incarnation>0`, zero balance/nonce/empty codeHash. Parallel was emitting `DeleteUpdate` and removing the leaf. 2. **EIP-161 emptyRemoval** of a touched-empty account (e.g. `0xff…fe` after the EIP-4788 system call) — serial emits `DeleteUpdate`. Parallel was emitting a zero-account UPDATE. The discriminator is the **pre-block incarnation**, which the writeset alone didn't carry. Fix wires it through `LightCollector.DeleteAccount` → `IncarnationPath` write → `calcAccountState.Incarnation` → 3-way branch in `FlushToUpdates`. ## Dependency direction This PR is a **precursor for erigontech#21017** (the CI matrix that runs every test under both `serial` and `parallel` exec modes). Without this fix, parallel-mode tests on hive `rpc-compat`, `engine api/cancun/withdrawals`, and the eest selfdestruct sub-suites all fail with wrong-trie-root errors at SD/empty-removal blocks. **erigontech#21017 cannot land until this PR lands.** The matrix in erigontech#21017 will validate this PR end-to-end via the hive parallel sub-suites — meaning this PR's parallel-exec changes don't run in *this* PR's own CI, only in erigontech#21017's. The rebased erigontech#21017 is the meaningful go/no-go signal. ## Changes * `LightCollector.DeleteAccount` now emits `IncarnationPath` alongside `SelfDestructPath=true` when `original.Incarnation > 0`. Calc receives the pre-deletion incarnation through the same channel as every other write — no exec-side state leakage. * `calc_state.ApplyWrites` consumes `IncarnationPath` into `calcAccountState.Incarnation` via direct type-assertion (panic on mismatch — see *Concern 3* below). * `calc_state.FlushToUpdates` switches on a 3-way `Deleted` branch with `isAllZero` defense-in-depth: * `Deleted && Incarnation>0 && all-zero` → zero-account UPDATE (matches serial's `DomainDel`-leaves-post-CREATE-encoding for SD'd contracts) * `Deleted && all-zero && Incarnation==0` → `DeleteUpdate` (matches serial's `DomainDel`-truly-empties for EIP-161 emptyRemoval) * `Deleted` with retained non-zero values → regular UPDATE — defensive coverage for OOG-CREATE2-with-retained-balance and any future write-ordering race * `SelfDestructPath` also marks all tracked storage slots dirty so `FlushToUpdates` emits per-slot updates alongside the account reset. Load-bearing invariant: `normalizeWriteSet`'s `vm.StorageKeys(addr)` loop emits `StoragePath=0` entries that arrive in `ApplyWrites` after `SelfDestructPath`, overwriting the marked slots' values to zero so they emit `DeleteUpdate` not `StorageUpdate(pre-SD value)`. Inline comment in `calc_state.go` spells this out — see *Concern 2* below. ## Earlier draft snags (resolved) The first draft also added an `IncarnationPath > 0` exclusion to `normalizeWriteSet`'s empty-account check. This was **redundant** (the empty-check already requires `Nonce == 0`, which excludes successful CREATE/CREATE2) and **broke OOG-during-CREATE2 cases** (which leave `Nonce=0/Balance=0/Code=empty/Incarnation=1` and *must* still be deleted). Removed in `9539998f14`. The `exec3_parallel.go` diff in this PR is now comments-only. ## Reviewer concerns addressed ### #1 (yperbasis): PR description was stale This body. ✓ ### #2 (yperbasis + Copilot): SD storage-cascade load-bearing invariant Inline comment in `calc_state.go`'s `SelfDestructPath` case now spells out the dependency on `normalizeWriteSet`'s `vm.StorageKeys(addr)` loop. ✓ ### #3 (yperbasis + Copilot): IncarnationPath guarded type-assertion Changed to direct `w.Val.(uint64)` to match the other paths. Silent zero would route a real SD into the EIP-161 branch and reproduce the very wrong-root bug — better to panic at the source. ✓ ### #4 (yperbasis): `TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate` docstring Updated to clarify: this is **defensive coverage** for the third `FlushToUpdates` branch in isolation, NOT a direct repro of the eest_devnet OOG path. The actual OOG fix is the removal of the redundant `IncarnationPath > 0` clause from `normalizeWriteSet` (the OOG writeset has `Nonce=0` → empty-account → `DeleteUpdate`, not `Deleted+RetainedBalance`). End-to-end coverage of that path lives in the eest_devnet suite, surfaced via erigontech#21017's matrix. ✓ ### #5 (yperbasis): `versionedWriteCollector.DeleteAccount` asymmetry — *intentional non-fix* Decision: **keep the asymmetry, document why.** Inline comment added on `versionedWriteCollector.DeleteAccount` explaining: * It's wired only into `txResult.finalize` (fee calc + post-Cancun system calls). * Neither path SDs a pre-existing contract today, so the SD-with-incarnation differentiator is unreachable from here. * If a future caller ever does emit `DeleteAccount` on a pre-existing contract through this collector, the comment flags that this code should mirror `LightCollector.DeleteAccount`'s `IncarnationPath` emit. Adding the emit speculatively was rejected because: (a) it changes the writeset shape for paths that today don't need it, (b) any test exercising the new emit would be vacuous since no production caller hits the `original.Incarnation > 0` branch, and (c) the comment is enough to attribute the bug at first sight if someone *does* reach that code path in the future. ## Intentional non-fixes * **Concern #5 above** — `versionedWriteCollector.DeleteAccount` left without the `IncarnationPath` emit (rationale above). * **Defensive `TestFlushToUpdates_DeletedWithRetainedBalance` test kept** despite the state being unreachable from real LightCollector writesets today — protects the FlushToUpdates branch in isolation against future ApplyWrites refactors that might drop the `BalancePath`-clears-`Deleted` invariant. ## Test plan - [x] All 6 unit tests in `calc_state_test.go` pass (`TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate`, `TestFlushToUpdates_DeletedWithoutIncarnation_EmitsDelete`, `TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate`, `TestFlushToUpdates_LiveAccount_EmitsFullUpdate`, `TestApplyWrites_IncarnationPath`, `TestApplyWrites_BalancePathClearsDeleted`) - [x] eest_devnet `for_amsterdam/constantinople/eip1052_extcodehash/extcodehash/extcodehash_subcall_create2_oog` all 6 variants pass locally - [x] Full `for_amsterdam/constantinople` eest_devnet suite passes - [x] `make lint` clean - [x] CI on `9539998f14` was green End-to-end validation comes via erigontech#21017's CI matrix once it rebases on top of this PR.

mh0lt · 2026-05-10T16:07:45Z

Bucket C parallel-commitment correctness fix is up in #21088 — closes the off-by-one cluster (TestBlockchainHeaderchainReorgConsistency, TestLongerForkHeaders, TestLongerForkBlocks, TestCallTraceUnwind) and TestRecreateAndRewind on the parallel-exec matrix. Remaining parallel-mode failures (pipeline-level races in forkchoice/senders/canonical-txnums and from-0 stage-exec) are tracked as separate follow-up PRs.

…ComputeCommitment swap Closes ~93% of the race-detector hits surfaced on #21017's parallel race-test matrix by extending the changesetMu band-aid (added in the prior commit) to cover the deferred-update flush window and the ComputeCommitment internal [state] marker write. Race count delta (parallel race-tests groups): Group Before After Δ ----------------------- ------- ----- ----- execution-other 12 2 -83% core-rpc 158 13 -92% execution-eest-blockchain 4 0 -100% Bucket C cluster 4 0 -100% What changed ------------ * `SharedDomains.FlushPendingUpdates` previously did its inner swap (`switcher.SetChangesetAccumulator(cs)` … restore) directly via the `changesetSwitcher` interface, bypassing changesetMu. The race detector flagged this as the dominant race signature (SetDiff↔SetDiff, SetChangesetAccumulator↔SetChangesetAccumulator, GetChangesetAccumulator↔SetChangesetAccumulator). Split into `FlushPendingUpdates` (acquires changesetMu) and `FlushPendingUpdatesLocked` (caller already holds it). The internal `flushPendingUpdates(lockHeld bool)` is the single implementation. * Added `ComputeCommitmentLocked` for the parallel calculator's per-block compute window. The calc holds changesetMu via `LockChangesetAccumulator` and now calls `ComputeCommitmentLocked` which uses `FlushPendingUpdatesLocked` instead of the public flush. * The `cs == nil` early-return path in `computeWithBlockAccumulator` also takes the lock now — the FlushPendingUpdates inside ComputeCommitment still runs there and needs serialization against the apply-side SetChangesetAccumulator. Without this the cs==nil fast path produces ~73 SetDiff vs PutWithPrev hits per group. * Added `domainPutNoLock` as an internal helper used by FlushPendingUpdates' deferred-branch replay, since `SharedDomains.DomainPut` for non-CommitmentDomain takes the lock and would self-deadlock when the caller already holds it. Bucket C functional tests (no -race) still all pass under EXEC3_PARALLEL=true. make lint clean. Residual races (~2 per execution-other run) are encodeAndStoreCommitmentState's [state] marker write going through the lock-exempt CommitmentDomain branch in DomainPut — closing those needs the [state] marker write to also be deferred (or the exemption removed). Tracked as Variant 1 / lock-free follow-up in the parallel-exec hardening series. This commit is intended to fold into #21088 since it's a related fix on the same band-aid mutex and #21088 hasn't been reviewed yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt requested review from AskAlexSharov and mriccobene as code owners May 6, 2026 17:39

mh0lt requested a review from yperbasis May 6, 2026 17:41

mh0lt mentioned this pull request May 7, 2026

state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval #21032

Merged

5 tasks

mriccobene reviewed May 7, 2026

View reviewed changes

mh0lt marked this pull request as draft May 7, 2026 11:14

mh0lt marked this pull request as ready for review May 7, 2026 11:16

mh0lt mentioned this pull request May 7, 2026

execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit #21039

Merged

5 tasks

mriccobene approved these changes May 7, 2026

View reviewed changes

mh0lt mentioned this pull request May 7, 2026

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains #21010

Merged

3 tasks

Merge remote-tracking branch 'origin/main' into mh/ci-exec-mode-matrix

7f15ca7

# Conflicts: # .github/workflows/test-hive.yml

Merge remote-tracking branch 'origin/main' into mh/ci-exec-mode-matrix

1075b33

Merge remote-tracking branch 'origin/main' into mh/ci-exec-mode-matrix

a4e5dc5

mh0lt mentioned this pull request May 10, 2026

execution/stagedsync, db/state: parallel-commitment correctness for reorg/unwind + SD recreate #21088

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: matrix-test serial vs parallel exec across the test workflows#21017

ci: matrix-test serial vs parallel exec across the test workflows#21017
mh0lt wants to merge 5 commits intomainfrom
mh/ci-exec-mode-matrix

mh0lt commented May 6, 2026 •

edited

Loading

Uh oh!

mriccobene May 7, 2026

Uh oh!

mh0lt May 7, 2026

Uh oh!

mriccobene May 7, 2026

Uh oh!

mh0lt May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mh0lt commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Affected workflows

Plumbing details

Test plan

Uh oh!

mriccobene May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mh0lt May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mriccobene May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mh0lt May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mh0lt commented May 6, 2026 •

edited

Loading