[don't merge yet] Pool QueueWithRetry and cache filterDirtyFiles regexp#19544

Open
AlexeyAkhunov wants to merge 9 commits into main from pool-queue-cache-regexp

Conversation

@AlexeyAkhunov (Contributor)

Summary

  • QueueWithRetry pooling (execution/exec/txtask.go): Add sync.Pool for QueueWithRetry with a Release() method that drains the 100K-element channel without closing it, preserving the ~1.6MB buffer across reuses. parallelExecutor.run calls Release() instead of Close(); workers exit via ctx.Done(). Cleanup ordering ensures stopWorkers() completes before Release() to avoid races.
  • filterDirtyFiles regexp cache (db/state/dirty_files.go): Cache compiled regexps in a sync.Map keyed by pattern string. Each unique (filenameBase, ext) pair compiles once instead of on every call.

Profiling testlab test runs showed QueueWithRetry allocation at 79.5GB (18.8% of total allocs) and filterDirtyFiles regexp at 14.6GB (3.3%). Combined expected savings: ~94GB (22% of total allocations).

Test plan

  • Verify go build ./... passes
  • Run go test ./execution/exec/... and go test ./execution/stagedsync/... and go test ./db/state/...
  • Run full CI suite to check for regressions

🤖 Generated with Claude Code

@AskAlexSharov enabled auto-merge (squash) on March 1, 2026 09:50
@AlexeyAkhunov force-pushed the pool-queue-cache-regexp branch from 2272f03 to aabd358 on March 1, 2026 10:33
QueueWithRetry (79.5GB, 18.8% of allocs): pool via sync.Pool with
Release() that drains the 100K-element channel without closing it,
preserving the 1.6MB buffer across reuses. parallelExecutor.run uses
Release() instead of Close(); workers exit via context cancellation
when the exec loop goroutine defers execLoopCtxCancel(). Cleanup
ordering ensures stopWorkers() completes before Release().

filterDirtyFiles regexp (14.6GB, 3.3% of allocs): cache compiled
regexps in sync.Map keyed by pattern string. Each unique
(filenameBase, ext) pair compiles once instead of per-call.

Combined expected savings: ~94GB (22% of total allocations).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexeyAkhunov force-pushed the pool-queue-cache-regexp branch from aabd358 to b6d591f on March 1, 2026 12:10
…imeout and leak detection

ExecModuleTester.Close() now dumps all goroutine stacks to stderr if
bgComponentsEg.Wait() takes longer than 30s, helping identify which
goroutines are blocked during test cleanup.

TestExecutionSpecBlockchainDevnet now tracks goroutine counts before/after
each subtest and logs a warning when the delta exceeds 5, identifying
which subtests leak goroutines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexeyAkhunov (Contributor, Author)

Looks like the changes in this PR make the flaky test reproducible. I am using it to debug that flaky test.

Alexey Sharp and others added 5 commits March 1, 2026 15:36
…k noise

Cancel mock.Ctx 5 seconds before the test binary deadline so that
background goroutines (sentry pump loops, exec workers) exit on their
own even when the test function is stuck in UpdateForkChoice. This
makes the timeout goroutine dump show only the truly deadlocked
goroutines instead of dozens of sentry pump loops.

Remove per-subtest goroutine leak detection from block_test.go — it
confirmed that every test leaks temporarily (all clean up within 30s)
and the 8,810 warnings were noise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a test starts near the binary deadline (e.g. after a stuck test
is unblocked by auto-cancel), time.Until(deadline)-5s is zero or
negative, causing time.NewTimer to fire immediately. This cancels the
brand-new context before RecvMessageLoop can establish the sentry
stream, leaving StreamWg.Wait() stuck forever.

Skip the auto-cancel when remaining time is not positive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Execution skips all pending tasks

Two new diagnostic mechanisms:
1. scheduleExecution stall detector: logs task state when all pending tasks
   are skipped by the speculative check and none are sent to workers
2. execLoop stall timer: dumps full blockExecutor state if no results or
   requests arrive for 30 seconds

Both log maxValidated, maxExecComplete, pending/inProgress/complete counts,
and per-task incarnation/abort/fail details to identify the exact deadlock
scenario.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rce-schedule on stall

Two fixes for the scheduleExecution stall that causes execution/tests hangs:

1. Clear phantom inProgress state: when the speculative check skips a
   task, call clearInProgress to undo the takeNextPending move. Previously,
   skipped tasks remained in both inProgress AND pending, which could
   prevent removeDependency from correctly re-scheduling them.

2. Force-schedule on stall: when all pending tasks are skipped by the
   speculative check (scheduled==0 && skipped>0), force-schedule the
   first pending task with skipCheck=true. Re-executing a task is always
   correct — the spec check is purely a performance optimization.
   A wasted re-execution is infinitely better than a deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexeyAkhunov changed the title from "Pool QueueWithRetry and cache filterDirtyFiles regexp" to "[don't merge yet] Pool QueueWithRetry and cache filterDirtyFiles regexp" on Mar 1, 2026
Alexey Sharp and others added 2 commits March 2, 2026 06:40
When asyncTx.Apply/asyncRwTx.Apply/ApplyRw is called, it creates an
unbuffered result channel (rc), sends a request to the mdbx thread, and
waits in a select for either the result or ctx.Done(). If the context is
cancelled while the mdbx thread is executing the function, the caller
takes the ctx.Done() path and abandons rc. The mdbx thread then tries to
send the result to the unbuffered rc, but nobody is reading — blocking
the mdbx-locked goroutine forever.

This manifests as exec3_parallel.go:180 stuck on "chan send" for the
entire test timeout (56+ minutes), with the mdbx thread permanently
locked.

Fix: make rc buffered with capacity 1 so the mdbx thread can always
complete its send even if the caller has abandoned the channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The force-schedule fallback (stall prevention) was setting
skipCheck[nextTx] = true, which tells the validator to accept the
result without checking state consistency. This is only correct for
the task at maxValidated+1 (the validation frontier), where all
predecessors are guaranteed validated. For a non-frontier task, this
can accept results computed against stale state, producing wrong
trie roots.

Remove the skipCheck flag from the force-schedule path. The task is
still dispatched and executed; its result goes through normal
validation and gets re-executed if invalid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>