cmd/integration, execution/stagedsync: fix from-0 flow (env-prefix + reset-progress-delete)#21210
Conversation
|
2026-05-16 update: added a second commit ( After the original The engine-API InsertChain + UpdateForkChoice path never hits this because it doesn't call ResetExec. Per @yperbasis's review on #21017, this PR is the integration-tool-side fix that makes both Includes a unit test ( |
2ff4c6b to
c22d33d
Compare
… default
stageExec was checking only the unprefixed EXEC3_PARALLEL env name when
deciding whether to apply its default of dbg.Exec3Parallel=true. CI
workflows set the ERIGON_-prefixed form because dbg.envLookup auto-
prepends ERIGON_, so for the serial matrix entry:
1. Package init reads ERIGON_EXEC3_PARALLEL=false via envLookup
→ dbg.Exec3Parallel = false.
2. stageExec sees no unprefixed EXEC3_PARALLEL → flips it back to true.
Both modes ended up running in parallel, with the matrix passing through
silent equivalence rather than real CI signal.
Fix per AskAlexSharov: move the env parse out of the command runtime and
into a package init() that assigns dbg.Exec3Parallel directly via
dbg.EnvBool, matching the dbg/experiments.go pattern. EnvBool checks both
the bare and ERIGON_-prefixed forms via envLookup, so a workflow setting
ERIGON_EXEC3_PARALLEL=false now correctly suppresses the integration
tool's default; the default is true so callers that set nothing (typical
local debugging) still get parallel mode.
stageExec entry loses the 3-line runtime env-check entirely.
Reported by @yperbasis on #21017 review.
…et, don't overwrite with 0 ResetExec → clearStageProgress was overwriting the Execution stage progress with 8 zero bytes via SaveStageProgress(tx, stage, 0). That conflated two distinct states: (a) stage never started — entry absent (len(bnBytes) == 0) (b) stage executed up to block 0 — entry present, value 0 SeekCommitment in commitmentdb/commitment_context.go uses this entry as a fallback when no commitment state is in the domain. Per its own comment at lines 651-660: blockNum=0 means "genesis executed", so it returns txNum = TxNums.Max(0) = 1 (not txNum=0) so the next exec cycle does not re-run the genesis init task. After ResetExec the actual state is (a) — domain tables wiped, no commitment at all. But because clearStageProgress wrote 0 instead of deleting, SeekCommitment saw state (b) and returned (txNum=1, blockNum=0). ExecV3's exec loop then started at txNum=1, skipping block 0's init task — the only place that re-runs the genesis allocation through the worker pool's LightCollector → applyVersionedWrites pipeline. Result: genesis-allocated addresses that no subsequent block touches end up with balance=0 in the parallel-exec view, producing wrong-trie-root mismatches on `qa-stage-exec (from-0, parallel)` (#21017 / #21138 / mainnet block 46147 / 0xA1E4380A3B1f749673E270229993eE55F35663b4). The engine-API InsertChain + FCU path doesn't hit this because it doesn't call ResetExec; it starts with no entry at all and falls into the (a) branch correctly. Fix: clearStageProgress now deletes the SyncStageProgress entries rather than writing 0. This is the integration-tool-side fix the user's diagnosis pointed at — make the integration path's reset state consistent with the FCU path's "fresh DB" state, rather than papering over the inconsistency in exec3.go / SeekCommitment. Adds a unit test that exercises the failing path in <100ms: TestFromZero_GenesisAllocPreservedAfterResetReExec Sets up a custom genesis allocating a dormant address, syncs 5 empty blocks via the engine-API InsertChain (passes pre-fix), then resets state via rawdbreset.ResetExec and drives stage execution via direct SpawnExecuteBlocksStage (the integration-tool path that fails pre-fix). Asserts the genesis-allocated balance is preserved across the reset + re-exec cycle. Pre-fix the test fails with `wrong trie root, block=5` (state diverges because genesis init never runs). Diagnosis credit: Mark Holt.
c22d33d to
f18e28a
Compare
There was a problem hiding this comment.
Pull request overview
Fixes cmd/integration stage_exec “from-0 after reset” behavior so a reset truly returns the DB to the “never executed” state (avoiding block-0 initialization being skipped), and ensures the integration tool’s parallel-exec default can be overridden via the ERIGON_-prefixed env var. Adds a regression test reproducing the genesis-allocation drop/wrong-trie-root failure after reset + re-exec.
Changes:
- Reset flow: delete
SyncStageProgressentries (and theirprune_counterparts) instead of writing progress0, preserving the “absent vs 0” semantic needed bySeekCommitment. - Integration tool: move
dbg.Exec3Paralleldefaulting todbg.EnvBool("EXEC3_PARALLEL", true)so bothEXEC3_PARALLELandERIGON_EXEC3_PARALLELare honored. - Add an internal execmodule tester unit test to cover reset + integration-path re-execution preserving genesis allocations.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
execution/stagedsync/rawdbreset/reset_stages.go |
Deletes stage progress keys on reset to preserve “never executed” semantics for SeekCommitment. |
execution/execmodule/execmoduletester/from0_genesis_internal_test.go |
Adds regression test for genesis allocation preservation after ResetExec + integration-style re-exec. |
cmd/integration/commands/stages.go |
Removes per-command env handling for EXEC3_PARALLEL (now set in package init). |
cmd/integration/commands/flags.go |
Sets integration-tool default for dbg.Exec3Parallel via dbg.EnvBool, honoring ERIGON_ prefix. |
Comments suppressed due to low confidence (3)
execution/execmodule/execmoduletester/from0_genesis_internal_test.go:110
checkBalancewraps work inemt.DB.ViewTemporal(...)but then opens a separateBeginTemporalRotransaction inside the callback and ignores thetxpassed byViewTemporal. This is redundant and can lead to extra aggregator/file txs; use the providedtxdirectly (or dropViewTemporaland manage a single Ro tx explicitly).
require.NoError(t, emt.DB.ViewTemporal(ctx, func(tx kv.TemporalTx) error {
rTx, err := emt.DB.BeginTemporalRo(ctx)
if err != nil {
return err
}
defer rTx.Rollback()
doms, err := execctx.NewSharedDomains(ctx, rTx, logger)
execution/execmodule/execmoduletester/from0_genesis_internal_test.go:187
defer tx.Rollback()captures only the initial transaction; aftertxis reassigned inside the loop, errors will return without rolling back the latest open tx. Use a deferred closure that rolls back the currenttx(likedefer func(){ tx.Rollback() }()), or explicitly rollback/close each new tx on error paths.
tx, err := emt.DB.BeginTemporalRw(ctx)
if err != nil {
return err
}
defer tx.Rollback()
execution/execmodule/execmoduletester/from0_genesis_internal_test.go:203
- The integration tool’s loop explicitly treats
*stagedsync.ErrLoopExhaustedas non-fatal and continues iterating, but this helper returns any error directly. To accurately mirror the integration path (and avoid flaky failures when the stage hits its loop-iteration limit), handleErrLoopExhaustedthe same way (ignore/continue).
for {
if err := stagedsync.SpawnExecuteBlocksStage(s, emt.Sync, doms, tx, toBlock, ctx, cfg, logger); err != nil {
return err
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if testing.Short() { | ||
| t.Skip() | ||
| } |
yperbasis
left a comment
There was a problem hiding this comment.
LGTM — correct root-cause fix with a faithful regression test. A few optional suggestions:
-
Redundant transaction in
checkBalance(from0_genesis_internal_test.go:104-123):require.NoError(t, emt.DB.ViewTemporal(ctx, func(tx kv.TemporalTx) error { rTx, err := emt.DB.BeginTemporalRo(ctx) ... }))
The outer
ViewTemporalopenstxbut it's never used; the body opens a secondrTx. Drop theViewTemporalwrapper and callBeginTemporalRoonce. -
Hardcoded
"prune_"prefix (reset_stages.go:196): the original code went throughSaveStagePruneProgress, which encapsulates the prefix. Consider exposingstages.PruneProgressKey(stage)(or similar) so the"prune_"literal isn't duplicated in two places (here andstages.go:113/105). -
Duplicate
[env]warning at startup:dbg's package init andflags.go's init both callEnvBool("EXEC3_PARALLEL", …), so when the env var is set the warning is logged twice in the integration tool's startup. Cosmetic only. -
Stray blank line at
from0_genesis_internal_test.go:71(right after thefuncsignature).
Verified locally that the new test passes on both serial and parallel, and reverting only the clearStageProgress change reproduces the failure — so the test catches the regression cleanly.
Summary
Two-commit fix for
cmd/integration'sstage_exec --reset; stage_exec --batchSize=10mbfrom-0 flow. Both issues surface together as theqa-stage-exec (from-0, *)CI failures on #21017 and as the chiado-from-0 / mainnet-block-46147 wrong-trie-root failures discussed in #21138.Commit 1 —
stages.goenv-prefix bug (yperbasis Blocker 1)stageExecincmd/integration/commands/stages.go:631checked only the unprefixedEXEC3_PARALLELenv name when deciding whether to keep its default ofdbg.Exec3Parallel = true. CI workflows set theERIGON_-prefixed form becausedbg.envLookupauto-prependsERIGON_. Result: for any caller settingERIGON_EXEC3_PARALLEL=false, the integration tool silently overrode it back totrue.Fix: also check the
ERIGON_-prefixed name (usingdbg.ErigonEnvPrefixto matchenvLookup).Commit 2 —
ResetExecwriting 0 instead of deletingResetExec → clearStageProgressoverwrote the Execution stage progress with 8 zero bytes viaSaveStageProgress(tx, stage, 0). That conflated two distinct states:len(bnBytes) == 0)SeekCommitmentincommitmentdb/commitment_context.go:644-665uses this entry as a fallback when no commitment state is in the domain. Per its own comment:blockNum=0means "genesis executed", so it returnstxNum = TxNums.Max(0) = 1so the next exec cycle does not re-run the genesis init task.After
ResetExecthe actual state is (a) (domain tables wiped, no commitment at all). But becauseclearStageProgresswrote 0 instead of deleting,SeekCommitmentsaw state (b) and returned(txNum=1, blockNum=0).ExecV3's exec loop then started attxNum=1, skipping block 0's init task — the only place that re-runs the genesis allocation through the worker pool'sLightCollector → applyVersionedWritespipeline.Result: genesis-allocated addresses that no subsequent block touches end up with
balance=0in the parallel-exec view, producing wrong-trie-root mismatches onqa-stage-exec (from-0, parallel)at mainnet block 46147 (address0xA1E4380A3B1f749673E270229993eE55F35663b4).The engine-API
InsertChain + UpdateForkChoicepath doesn't hit this because it doesn't callResetExec— it starts with no entry at all and falls into the (a) branch correctly.Fix:
clearStageProgressnow deletes theSyncStageProgressentries rather than writing 0. Makes the integration path's "after reset" state consistent with the FCU path's "fresh DB" state.Test
Adds an internal-package unit test
TestFromZero_GenesisAllocPreservedAfterResetReExecinexecution/execmodule/execmoduletester/that exercises the failing path in <100 ms:InsertChain(passes pre-fix — engine-API path is correct).rawdbreset.ResetExec.SpawnExecuteBlocksStagein aFlush/ClearRam/Commitloop (mirrorscmd/integration/commands/stages.go:802, the integration path that fails pre-fix).Pre-fix the test fails with
wrong trie root, block=5. Post-fix it passes both phases.Verified locally
Under
ERIGON_EXEC3_PARALLEL=trueon freshmain+ this PR:make lintcleanmake test-shortgreen onexecution/stagedsync,execution/state,execution/execmodule,execution/tests,execution/engineapi,rpc/jsonrpcTestFromZero_GenesisAllocPreservedAfterResetReExecpassesqa-stage-exec (from-0, serial)and(from-0, parallel)legs of ci: matrix-test serial vs parallel exec across the test workflows #21017 go green after this landsDiagnosis credit
Mark Holt's #21138 comment thread; iterative narrowing through CI traces + local unit-test repro.
Related