[don't merge yet] Pool QueueWithRetry and cache filterDirtyFiles regexp #19544
Open
AlexeyAkhunov wants to merge 9 commits into main from
Conversation
AskAlexSharov approved these changes on Mar 1, 2026
Force-pushed 2272f03 to aabd358
QueueWithRetry (79.5GB, 18.8% of allocs): pool via sync.Pool with Release() that drains the 100K-element channel without closing it, preserving the 1.6MB buffer across reuses. parallelExecutor.run uses Release() instead of Close(); workers exit via context cancellation when the exec loop goroutine defers execLoopCtxCancel(). Cleanup ordering ensures stopWorkers() completes before Release().

filterDirtyFiles regexp (14.6GB, 3.3% of allocs): cache compiled regexps in a sync.Map keyed by pattern string. Each unique (filenameBase, ext) pair compiles once instead of per-call.

Combined expected savings: ~94GB (22% of total allocations).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed aabd358 to b6d591f
…imeout and leak detection

ExecModuleTester.Close() now dumps all goroutine stacks to stderr if bgComponentsEg.Wait() takes longer than 30s, helping identify which goroutines are blocked during test cleanup.

TestExecutionSpecBlockchainDevnet now tracks goroutine counts before/after each subtest and logs a warning when the delta exceeds 5, identifying which subtests leak goroutines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor (Author): Looks like changes in this PR make the flaky test reproducible. I am using it to debug that flaky test.
…k noise

Cancel mock.Ctx 5 seconds before the test binary deadline so that background goroutines (sentry pump loops, exec workers) exit on their own even when the test function is stuck in UpdateForkChoice. This makes the timeout goroutine dump show only the truly deadlocked goroutines instead of dozens of sentry pump loops.

Remove per-subtest goroutine leak detection from block_test.go — it confirmed that every test leaks temporarily (all clean up within 30s) and the 8,810 warnings were noise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a test starts near the binary deadline (e.g. after a stuck test is unblocked by auto-cancel), time.Until(deadline)-5s is zero or negative, causing time.NewTimer to fire immediately. This cancels the brand-new context before RecvMessageLoop can establish the sentry stream, leaving StreamWg.Wait() stuck forever.

Skip the auto-cancel when remaining time is not positive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Execution skips all pending tasks

Two new diagnostic mechanisms:
1. scheduleExecution stall detector: logs task state when all pending tasks are skipped by the speculative check and none are sent to workers
2. execLoop stall timer: dumps full blockExecutor state if no results or requests arrive for 30 seconds

Both log maxValidated, maxExecComplete, pending/inProgress/complete counts, and per-task incarnation/abort/fail details to identify the exact deadlock scenario.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rce-schedule on stall

Two fixes for the scheduleExecution stall that causes execution/test hangs:
1. Clear phantom inProgress state: when the speculative check skips a task, call clearInProgress to undo the takeNextPending move. Previously, skipped tasks remained in both inProgress AND pending, which could prevent removeDependency from correctly re-scheduling them.
2. Force-schedule on stall: when all pending tasks are skipped by the speculative check (scheduled==0 && skipped>0), force-schedule the first pending task with skipCheck=true. Re-executing a task is always correct — the spec check is purely a performance optimization. A wasted re-execution is far better than a deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When asyncTx.Apply/asyncRwTx.Apply/ApplyRw is called, it creates an unbuffered result channel (rc), sends a request to the mdbx thread, and waits in a select for either the result or ctx.Done(). If the context is cancelled while the mdbx thread is executing the function, the caller takes the ctx.Done() path and abandons rc. The mdbx thread then tries to send the result to the unbuffered rc, but nobody is reading — blocking the mdbx-locked goroutine forever.

This manifests as exec3_parallel.go:180 stuck on "chan send" for the entire test timeout (56+ minutes), with the mdbx thread permanently locked.

Fix: make rc buffered with capacity 1 so the mdbx thread can always complete its send even if the caller has abandoned the channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The force-schedule fallback (stall prevention) was setting skipCheck[nextTx] = true, which tells the validator to accept the result without checking state consistency. This is only correct for the task at maxValidated+1 (the validation frontier), where all predecessors are guaranteed validated. For a non-frontier task, this can accept results computed against stale state, producing wrong trie roots.

Remove the skipCheck flag from the force-schedule path. The task is still dispatched and executed; its result goes through normal validation and gets re-executed if invalid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
QueueWithRetry pooling (execution/exec/txtask.go): Add a sync.Pool for QueueWithRetry with a Release() method that drains the 100K-element channel without closing it, preserving the ~1.6MB buffer across reuses. parallelExecutor.run calls Release() instead of Close(); workers exit via ctx.Done(). Cleanup ordering ensures stopWorkers() completes before Release() to avoid races.

filterDirtyFiles regexp cache (db/state/dirty_files.go): Cache compiled regexps in a sync.Map keyed by pattern string. Each unique (filenameBase, ext) pair compiles once instead of on every call.

Profiling testlab test runs showed QueueWithRetry allocation at 79.5GB (18.8% of total allocs) and filterDirtyFiles regexp at 14.6GB (3.3%). Combined expected savings: ~94GB (22% of total allocations).
Test plan
- go build ./... passes
- go test ./execution/exec/..., go test ./execution/stagedsync/..., and go test ./db/state/...

🤖 Generated with Claude Code