cmd/integration, execution/stagedsync: fix from-0 flow (env-prefix + reset-progress-delete) #21210

Merged: mh0lt merged 2 commits into main from mh/integration-exec3-parallel-env-prefix on May 18, 2026

@mh0lt mh0lt commented May 15, 2026

Summary

Two-commit fix for cmd/integration's stage_exec --reset; stage_exec --batchSize=10mb from-0 flow. Both issues surface together as the qa-stage-exec (from-0, *) CI failures on #21017 and as the chiado-from-0 / mainnet-block-46147 wrong-trie-root failures discussed in #21138.

Commit 1 — stages.go env-prefix bug (yperbasis Blocker 1)

stageExec in cmd/integration/commands/stages.go:631 checked only the unprefixed EXEC3_PARALLEL env name when deciding whether to keep its default of dbg.Exec3Parallel = true. CI workflows set the ERIGON_-prefixed form because dbg.envLookup auto-prepends ERIGON_. Result: for any caller setting ERIGON_EXEC3_PARALLEL=false, the integration tool silently overrode it back to true.

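Fix: also honor the ERIGON_-prefixed name. This was first done with an explicit check (using dbg.ErigonEnvPrefix to match envLookup); per the commit message, the merged version moves the default into a package init() that sets dbg.Exec3Parallel via dbg.EnvBool, which checks both forms through envLookup.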
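
To make the difference concrete, here is a minimal, self-contained sketch of the two lookup behaviors. It is not the erigon code: `envPrefix`, `lookupFunc`, `unprefixedOnly`, `envBool` and `mapEnv` are hypothetical names, the environment is a plain map rather than the process environment, and the choice to consult the prefixed name first is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// envPrefix stands in for the ERIGON_ prefix (hypothetical constant).
const envPrefix = "ERIGON_"

// lookupFunc abstracts os.LookupEnv so the sketch does not touch the real environment.
type lookupFunc func(name string) (string, bool)

// unprefixedOnly mirrors the pre-fix check: only the bare name counts as "set".
func unprefixedOnly(lookup lookupFunc, name string) bool {
	_, ok := lookup(name)
	return ok
}

// envBool returns the parsed value of ERIGON_<name> or <name> if either is set
// to a valid boolean, and def otherwise. Checking the prefixed form first is an
// assumption of this sketch.
func envBool(lookup lookupFunc, name string, def bool) bool {
	for _, key := range []string{envPrefix + name, name} {
		if v, ok := lookup(key); ok {
			if b, err := strconv.ParseBool(v); err == nil {
				return b
			}
		}
	}
	return def
}

// mapEnv builds a lookupFunc over a fixed map.
func mapEnv(m map[string]string) lookupFunc {
	return func(name string) (string, bool) {
		v, ok := m[name]
		return v, ok
	}
}

func main() {
	env := mapEnv(map[string]string{"ERIGON_EXEC3_PARALLEL": "false"})

	// Pre-fix view: the bare name is unset, so the default of true would be applied.
	fmt.Println("bare name set:", unprefixedOnly(env, "EXEC3_PARALLEL"))

	// Post-fix view: the prefixed value is honored and overrides the default.
	fmt.Println("parallel:", envBool(env, "EXEC3_PARALLEL", true))
}
```

With only the prefixed variable set, the first check reports the bare name as absent (which is what let the old code flip the flag back to true), while the second returns the caller's `false`.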

Commit 2 — ResetExec writing 0 instead of deleting

ResetExec → clearStageProgress overwrote the Execution stage progress with 8 zero bytes via SaveStageProgress(tx, stage, 0). That conflated two distinct states:

  • (a) stage never started — entry absent (len(bnBytes) == 0)
  • (b) stage executed up to block 0 — entry present, value 0

SeekCommitment in commitmentdb/commitment_context.go:644-665 uses this entry as a fallback when no commitment state is in the domain. Per its own comment: blockNum=0 means "genesis executed", so it returns txNum = TxNums.Max(0) = 1 so the next exec cycle does not re-run the genesis init task.

After ResetExec the actual state is (a) (domain tables wiped, no commitment at all). But because clearStageProgress wrote 0 instead of deleting, SeekCommitment saw state (b) and returned (txNum=1, blockNum=0). ExecV3's exec loop then started at txNum=1, skipping block 0's init task — the only place that re-runs the genesis allocation through the worker pool's LightCollector → applyVersionedWrites pipeline.

Result: genesis-allocated addresses that no subsequent block touches end up with balance=0 in the parallel-exec view, producing wrong-trie-root mismatches on qa-stage-exec (from-0, parallel) at mainnet block 46147 (address 0xA1E4380A3B1f749673E270229993eE55F35663b4).

The engine-API InsertChain + UpdateForkChoice path doesn't hit this because it doesn't call ResetExec — it starts with no entry at all and falls into the (a) branch correctly.

Fix: clearStageProgress now deletes the SyncStageProgress entries rather than writing 0. Makes the integration path's "after reset" state consistent with the FCU path's "fresh DB" state.
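
The "absent vs 0" distinction can be illustrated with a toy model. This is a sketch under stated assumptions, not the SeekCommitment implementation: the progress table is a plain map, `executionKey`, `maxTxNum`, `seekStart`, `resetByWritingZero` and `resetByDeleting` are hypothetical names, and `maxTxNum` assumes every block is empty and occupies two txNum slots, which reproduces the `TxNums.Max(0) = 1` value quoted above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// executionKey is a hypothetical key for the Execution stage progress entry.
const executionKey = "Execution"

// maxTxNum is a toy stand-in for TxNums.Max: it assumes every block is empty
// and spans two txNums, so block 0 ends at txNum 1.
func maxTxNum(blockNum uint64) uint64 { return 2*blockNum + 1 }

// seekStart models the fallback described above: an absent entry means the
// stage never started; a present entry (even with value 0) means that block
// was already executed.
func seekStart(progress map[string][]byte) (txNum, blockNum uint64) {
	bn := progress[executionKey]
	if len(bn) < 8 {
		// State (a): no entry (short values are treated the same way in this sketch).
		return 0, 0
	}
	// State (b): entry present, resume after the recorded block.
	blockNum = binary.BigEndian.Uint64(bn)
	return maxTxNum(blockNum), blockNum
}

// resetByWritingZero models the pre-fix reset: 8 zero bytes are stored.
func resetByWritingZero(progress map[string][]byte) {
	progress[executionKey] = make([]byte, 8)
}

// resetByDeleting models the post-fix reset: the entry is removed.
func resetByDeleting(progress map[string][]byte) {
	delete(progress, executionKey)
}

func main() {
	progress := map[string][]byte{}

	resetByWritingZero(progress)
	tx, bn := seekStart(progress)
	fmt.Printf("after write-0 reset: txNum=%d blockNum=%d\n", tx, bn)

	resetByDeleting(progress)
	tx, bn = seekStart(progress)
	fmt.Printf("after delete reset:  txNum=%d blockNum=%d\n", tx, bn)
}
```

In this model the write-0 reset resumes at txNum=1 and so skips the genesis task, while the delete reset starts from txNum=0.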

Test

Adds an internal-package unit test TestFromZero_GenesisAllocPreservedAfterResetReExec in execution/execmodule/execmoduletester/ that exercises the failing path in <100 ms:

  1. Set up custom genesis allocating a dormant address.
  2. Sync 5 empty blocks via engine-API InsertChain (passes pre-fix — engine-API path is correct).
  3. Reset state via rawdbreset.ResetExec.
  4. Drive stage execution via direct SpawnExecuteBlocksStage in a Flush/ClearRam/Commit loop (mirrors cmd/integration/commands/stages.go:802, the integration path that fails pre-fix).
  5. Assert the genesis-allocated balance is preserved.

Pre-fix the test fails with wrong trie root, block=5. Post-fix it passes both phases.
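
The shape of the failure the test catches can be sketched without any erigon types. The following is a toy model only: `state`, `task`, `buildTasks` and `reExec` are hypothetical, state is a map of balances, and the genesis init task is assumed to sit at txNum 0 followed by no-op tasks for empty blocks.

```go
package main

import "fmt"

// state is a toy account-balance view.
type state map[string]uint64

// task is one unit of execution, indexed by txNum in the task list.
type task func(s state)

// buildTasks returns the genesis init task (txNum 0) followed by no-op tasks
// standing in for the transactions of empty blocks.
func buildTasks(alloc map[string]uint64, emptyTxs int) []task {
	tasks := []task{func(s state) {
		for addr, bal := range alloc {
			s[addr] = bal
		}
	}}
	for i := 0; i < emptyTxs; i++ {
		tasks = append(tasks, func(state) {})
	}
	return tasks
}

// reExec models re-execution after a reset: state starts empty and only the
// tasks from startTxNum onward are run.
func reExec(tasks []task, startTxNum int) state {
	s := state{}
	for _, t := range tasks[startTxNum:] {
		t(s)
	}
	return s
}

func main() {
	const dormant = "0xdormant" // an address no later block touches
	tasks := buildTasks(map[string]uint64{dormant: 1000}, 10)

	// Starting at txNum=1 skips the genesis task: the dormant balance is lost.
	fmt.Println("start at 1:", reExec(tasks, 1)[dormant])

	// Starting at txNum=0 re-applies the allocation.
	fmt.Println("start at 0:", reExec(tasks, 0)[dormant])
}
```

Because no later task writes the dormant address, nothing repairs its balance once the genesis task is skipped, which is why the mismatch only shows up as a wrong state root rather than an execution error.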

Verified locally

Under ERIGON_EXEC3_PARALLEL=true on fresh main + this PR:

  • make lint clean
  • make test-short green on execution/stagedsync, execution/state, execution/execmodule, execution/tests, execution/engineapi, rpc/jsonrpc
  • New TestFromZero_GenesisAllocPreservedAfterResetReExec passes
  • CI: the qa-stage-exec (from-0, serial) and (from-0, parallel) legs of #21017 ("ci: matrix-test serial vs parallel exec across the test workflows") go green after this lands

Diagnosis credit

Mark Holt's #21138 comment thread; iterative narrowing through CI traces + local unit-test repro.

Related

@mh0lt mh0lt changed the title from "cmd/integration: respect ERIGON_-prefixed EXEC3_PARALLEL in stages.go default" to "cmd/integration, execution/stagedsync: fix from-0 flow (env-prefix + reset-progress-delete)" May 16, 2026

mh0lt commented May 16, 2026

2026-05-16 update: added a second commit (42ac3828b9) fixing the ResetExec → clearStageProgress inconsistency that was independently causing the same family of from-0 failures.

After the original stages.go env-prefix fix, the matrix would correctly run serial/parallel modes, but both would still fail at chiado-block-21 / mainnet-block-46147 because ResetExec was writing 8 zero bytes to SyncStageProgress.Execution instead of deleting the entry. SeekCommitment interpreted that as "block 0 executed" and returned (txNum=1, blockNum=0), making ExecV3 skip the block-0 init task — the only place that re-runs the genesis allocation through the worker pool. Dormant genesis-allocated addresses ended up with balance=0.

The engine-API InsertChain + UpdateForkChoice path never hits this because it doesn't call ResetExec.

Per @yperbasis's review on #21017, this PR is the integration-tool-side fix that makes both stage_exec --reset semantics and stage_exec --batchSize=10mb from-0 actually work. Once this lands, #21017's qa-stage-exec (from-0, serial/parallel) legs should go green.

Includes a unit test (TestFromZero_GenesisAllocPreservedAfterResetReExec) that reproduces the failure in <100 ms on the pre-fix code and passes post-fix. The PR title and body are updated accordingly.

@mh0lt mh0lt force-pushed the mh/integration-exec3-parallel-env-prefix branch 2 times, most recently from 2ff4c6b to c22d33d Compare May 16, 2026 10:22
mh0lt added 2 commits May 16, 2026 10:26
… default

stageExec was checking only the unprefixed EXEC3_PARALLEL env name when
deciding whether to apply its default of dbg.Exec3Parallel=true. CI
workflows set the ERIGON_-prefixed form because dbg.envLookup auto-
prepends ERIGON_, so for the serial matrix entry:

  1. Package init reads ERIGON_EXEC3_PARALLEL=false via envLookup
     → dbg.Exec3Parallel = false.
  2. stageExec sees no unprefixed EXEC3_PARALLEL → flips it back to true.

Both modes ended up running in parallel, with the matrix passing through
silent equivalence rather than real CI signal.

Fix per AskAlexSharov: move the env parse out of the command runtime and
into a package init() that assigns dbg.Exec3Parallel directly via
dbg.EnvBool, matching the dbg/experiments.go pattern. EnvBool checks both
the bare and ERIGON_-prefixed forms via envLookup, so a workflow setting
ERIGON_EXEC3_PARALLEL=false now correctly suppresses the integration
tool's default; the default is true so callers that set nothing (typical
local debugging) still get parallel mode.

stageExec entry loses the 3-line runtime env-check entirely.

Reported by @yperbasis on #21017 review.
…et, don't overwrite with 0

ResetExec → clearStageProgress was overwriting the Execution stage
progress with 8 zero bytes via SaveStageProgress(tx, stage, 0). That
conflated two distinct states:

  (a) stage never started — entry absent (len(bnBytes) == 0)
  (b) stage executed up to block 0 — entry present, value 0

SeekCommitment in commitmentdb/commitment_context.go uses this entry
as a fallback when no commitment state is in the domain. Per its own
comment at lines 651-660: blockNum=0 means "genesis executed", so it
returns txNum = TxNums.Max(0) = 1 (not txNum=0) so the next exec cycle
does not re-run the genesis init task.

After ResetExec the actual state is (a) — domain tables wiped, no
commitment at all. But because clearStageProgress wrote 0 instead of
deleting, SeekCommitment saw state (b) and returned (txNum=1, blockNum=0).
ExecV3's exec loop then started at txNum=1, skipping block 0's init
task — the only place that re-runs the genesis allocation through the
worker pool's LightCollector → applyVersionedWrites pipeline. Result:
genesis-allocated addresses that no subsequent block touches end up
with balance=0 in the parallel-exec view, producing wrong-trie-root
mismatches on `qa-stage-exec (from-0, parallel)` (#21017 / #21138 /
mainnet block 46147 / 0xA1E4380A3B1f749673E270229993eE55F35663b4).

The engine-API InsertChain + FCU path doesn't hit this because it
doesn't call ResetExec; it starts with no entry at all and falls into
the (a) branch correctly.

Fix: clearStageProgress now deletes the SyncStageProgress entries
rather than writing 0. This is the integration-tool-side fix the
user's diagnosis pointed at — make the integration path's reset state
consistent with the FCU path's "fresh DB" state, rather than papering
over the inconsistency in exec3.go / SeekCommitment.

Adds a unit test that exercises the failing path in <100ms:

  TestFromZero_GenesisAllocPreservedAfterResetReExec

Sets up a custom genesis allocating a dormant address, syncs 5 empty
blocks via the engine-API InsertChain (passes pre-fix), then resets
state via rawdbreset.ResetExec and drives stage execution via direct
SpawnExecuteBlocksStage (the integration-tool path that fails pre-fix).
Asserts the genesis-allocated balance is preserved across the reset +
re-exec cycle.

Pre-fix the test fails with `wrong trie root, block=5` (state diverges
because genesis init never runs).

Diagnosis credit: Mark Holt.
@mh0lt mh0lt force-pushed the mh/integration-exec3-parallel-env-prefix branch from c22d33d to f18e28a Compare May 16, 2026 10:27
@mh0lt mh0lt requested a review from AskAlexSharov May 16, 2026 10:27
@mh0lt mh0lt enabled auto-merge May 16, 2026 14:31
@yperbasis yperbasis requested a review from Copilot May 18, 2026 08:22

Copilot AI left a comment

Pull request overview

Fixes cmd/integration stage_exec “from-0 after reset” behavior so a reset truly returns the DB to the “never executed” state (avoiding block-0 initialization being skipped), and ensures the integration tool’s parallel-exec default can be overridden via the ERIGON_-prefixed env var. Adds a regression test reproducing the genesis-allocation drop/wrong-trie-root failure after reset + re-exec.

Changes:

  • Reset flow: delete SyncStageProgress entries (and their prune_ counterparts) instead of writing progress 0, preserving the “absent vs 0” semantic needed by SeekCommitment.
  • Integration tool: move dbg.Exec3Parallel defaulting to dbg.EnvBool("EXEC3_PARALLEL", true) so both EXEC3_PARALLEL and ERIGON_EXEC3_PARALLEL are honored.
  • Add an internal execmodule tester unit test to cover reset + integration-path re-execution preserving genesis allocations.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| execution/stagedsync/rawdbreset/reset_stages.go | Deletes stage progress keys on reset to preserve “never executed” semantics for SeekCommitment. |
| execution/execmodule/execmoduletester/from0_genesis_internal_test.go | Adds regression test for genesis allocation preservation after ResetExec + integration-style re-exec. |
| cmd/integration/commands/stages.go | Removes per-command env handling for EXEC3_PARALLEL (now set in package init). |
| cmd/integration/commands/flags.go | Sets integration-tool default for dbg.Exec3Parallel via dbg.EnvBool, honoring ERIGON_ prefix. |
Comments suppressed due to low confidence (3)

execution/execmodule/execmoduletester/from0_genesis_internal_test.go:110

  • checkBalance wraps work in emt.DB.ViewTemporal(...) but then opens a separate BeginTemporalRo transaction inside the callback and ignores the tx passed by ViewTemporal. This is redundant and can lead to extra aggregator/file txs; use the provided tx directly (or drop ViewTemporal and manage a single Ro tx explicitly).
		require.NoError(t, emt.DB.ViewTemporal(ctx, func(tx kv.TemporalTx) error {
			rTx, err := emt.DB.BeginTemporalRo(ctx)
			if err != nil {
				return err
			}
			defer rTx.Rollback()
			doms, err := execctx.NewSharedDomains(ctx, rTx, logger)

execution/execmodule/execmoduletester/from0_genesis_internal_test.go:187

  • defer tx.Rollback() captures only the initial transaction; after tx is reassigned inside the loop, errors will return without rolling back the latest open tx. Use a deferred closure that rolls back the current tx (like defer func(){ tx.Rollback() }()), or explicitly rollback/close each new tx on error paths.
	tx, err := emt.DB.BeginTemporalRw(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback()
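
The underlying Go behavior is that the receiver of a deferred method call is evaluated when the `defer` statement runs, not when the function returns. A minimal sketch with a hypothetical `fakeTx` type (not the erigon transaction type) shows the difference between the two forms:

```go
package main

import "fmt"

// fakeTx is a hypothetical transaction that records whether it was rolled back.
type fakeTx struct {
	id         int
	rolledBack bool
}

func (tx *fakeTx) Rollback() { tx.rolledBack = true }

// directDefer uses `defer tx.Rollback()`: the receiver is bound at the defer
// statement, so only the first transaction is ever rolled back.
func directDefer() (first, second *fakeTx) {
	first = &fakeTx{id: 1}
	tx := first
	defer tx.Rollback()

	second = &fakeTx{id: 2}
	tx = second // reassigned, as in a commit-and-reopen loop
	return first, second
}

// closureDefer wraps the call in a closure: tx is read when the closure runs,
// so the transaction that is current at return time is rolled back. The
// earlier transaction must be closed explicitly before reassignment.
func closureDefer() (first, second *fakeTx) {
	first = &fakeTx{id: 1}
	tx := first
	defer func() { tx.Rollback() }()

	second = &fakeTx{id: 2}
	tx = second
	return first, second
}

func main() {
	a, b := directDefer()
	fmt.Printf("direct:  tx%d rolledBack=%v, tx%d rolledBack=%v\n", a.id, a.rolledBack, b.id, b.rolledBack)

	a, b = closureDefer()
	fmt.Printf("closure: tx%d rolledBack=%v, tx%d rolledBack=%v\n", a.id, a.rolledBack, b.id, b.rolledBack)
}
```

In the direct form the second transaction is left open on return; in the closure form it is the one that gets rolled back.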

execution/execmodule/execmoduletester/from0_genesis_internal_test.go:203

  • The integration tool’s loop explicitly treats *stagedsync.ErrLoopExhausted as non-fatal and continues iterating, but this helper returns any error directly. To accurately mirror the integration path (and avoid flaky failures when the stage hits its loop-iteration limit), handle ErrLoopExhausted the same way (ignore/continue).
	for {
		if err := stagedsync.SpawnExecuteBlocksStage(s, emt.Sync, doms, tx, toBlock, ctx, cfg, logger); err != nil {
			return err
		}
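
One way to express "treat loop exhaustion as non-fatal" is an `errors.As` check inside the loop. The sketch below is illustrative only: `errLoopExhausted` is a hypothetical stand-in for `*stagedsync.ErrLoopExhausted`, `runUntilDone` and `errFatal` are made-up names, and the loop treats a nil error as "done", which is a simplification of the real integration loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errLoopExhausted is a hypothetical stand-in for *stagedsync.ErrLoopExhausted.
type errLoopExhausted struct{ from, to uint64 }

func (e *errLoopExhausted) Error() string {
	return fmt.Sprintf("loop exhausted: %d-%d", e.from, e.to)
}

// errFatal represents any other stage error in this sketch.
var errFatal = errors.New("fatal stage error")

// runUntilDone calls step repeatedly. A loop-exhausted error means "made
// progress, call again"; nil means finished; any other error is returned.
// It reports how many times step was invoked.
func runUntilDone(step func() error) (int, error) {
	iterations := 0
	for {
		iterations++
		err := step()
		var exhausted *errLoopExhausted
		if errors.As(err, &exhausted) {
			continue // non-fatal: iterate again
		}
		return iterations, err
	}
}

func main() {
	calls := 0
	n, err := runUntilDone(func() error {
		calls++
		if calls < 3 {
			return &errLoopExhausted{from: 0, to: uint64(calls)}
		}
		return nil
	})
	fmt.Println(n, err)

	n, err = runUntilDone(func() error { return errFatal })
	fmt.Println(n, err)
}
```

`errors.As` also matches wrapped errors, so the check keeps working if a caller adds context with `fmt.Errorf("...: %w", err)`.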

Comment on lines +51 to +53
	if testing.Short() {
		t.Skip()
	}

@yperbasis yperbasis left a comment

LGTM — correct root-cause fix with a faithful regression test. A few optional suggestions:

  1. Redundant transaction in checkBalance (from0_genesis_internal_test.go:104-123):

    require.NoError(t, emt.DB.ViewTemporal(ctx, func(tx kv.TemporalTx) error {
        rTx, err := emt.DB.BeginTemporalRo(ctx)
        ...
    }))

    The outer ViewTemporal opens tx but it's never used; the body opens a second rTx. Drop the ViewTemporal wrapper and call BeginTemporalRo once.

  2. Hardcoded "prune_" prefix (reset_stages.go:196): the original code went through SaveStagePruneProgress, which encapsulates the prefix. Consider exposing stages.PruneProgressKey(stage) (or similar) so the "prune_" literal isn't duplicated in two places (here and stages.go:113/105).

  3. Duplicate [env] warning at startup: dbg's package init and flags.go's init both call EnvBool("EXEC3_PARALLEL", …), so when the env var is set the warning is logged twice in the integration tool's startup. Cosmetic only.

  4. Stray blank line at from0_genesis_internal_test.go:71 (right after the func signature).
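
As an illustration of suggestion 2, a key helper could look roughly like the following. This is a sketch, not the erigon API: `SyncStage`, `pruneProgressKey` and this `clearStageProgress` are hypothetical, and the progress table is modeled as a map instead of a database bucket.

```go
package main

import "fmt"

// SyncStage is a toy stage identifier.
type SyncStage string

const Execution SyncStage = "Execution"

// pruneProgressKey is a hypothetical helper that owns the "prune_" prefix so
// callers do not repeat the literal.
func pruneProgressKey(stage SyncStage) []byte {
	return []byte("prune_" + string(stage))
}

// clearStageProgress deletes both the progress entry and its prune counterpart
// for each stage, leaving the table in the "never started" state.
func clearStageProgress(table map[string][]byte, stages ...SyncStage) {
	for _, st := range stages {
		delete(table, string(st))
		delete(table, string(pruneProgressKey(st)))
	}
}

func main() {
	table := map[string][]byte{
		"Execution":       make([]byte, 8),
		"prune_Execution": make([]byte, 8),
		"Senders":         make([]byte, 8),
	}
	clearStageProgress(table, Execution)

	_, hasExec := table["Execution"]
	_, hasPrune := table["prune_Execution"]
	fmt.Println(hasExec, hasPrune, len(table))
}
```

Keeping the prefix in one function means the save, read and delete paths cannot drift apart if the key format changes.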

Verified locally that the new test passes on both serial and parallel, and reverting only the clearStageProgress change reproduces the failure — so the test catches the regression cleanly.

@mh0lt mh0lt added this pull request to the merge queue May 18, 2026
Merged via the queue into main with commit 7569867 May 18, 2026
76 checks passed
@mh0lt mh0lt deleted the mh/integration-exec3-parallel-env-prefix branch May 18, 2026 10:51