[don't merge yet] Pool QueueWithRetry and cache filterDirtyFiles regexp#19544

Open
AlexeyAkhunov wants to merge 9 commits into main from pool-queue-cache-regexp

Conversation

@AlexeyAkhunov (Contributor)

Summary

  • QueueWithRetry pooling (execution/exec/txtask.go): Add sync.Pool for QueueWithRetry with a Release() method that drains the 100K-element channel without closing it, preserving the ~1.6MB buffer across reuses. parallelExecutor.run calls Release() instead of Close(); workers exit via ctx.Done(). Cleanup ordering ensures stopWorkers() completes before Release() to avoid races.
  • filterDirtyFiles regexp cache (db/state/dirty_files.go): Cache compiled regexps in a sync.Map keyed by pattern string. Each unique (filenameBase, ext) pair compiles once instead of on every call.

Profiling testlab test runs showed QueueWithRetry allocation at 79.5GB (18.8% of total allocs) and filterDirtyFiles regexp at 14.6GB (3.3%). Combined expected savings: ~94GB (22% of total allocations).

Test plan

  • Verify go build ./... passes
  • Run go test ./execution/exec/... and go test ./execution/stagedsync/... and go test ./db/state/...
  • Run full CI suite to check for regressions

🤖 Generated with Claude Code

@AskAlexSharov enabled auto-merge (squash) on March 1, 2026 09:50
@AlexeyAkhunov force-pushed the pool-queue-cache-regexp branch from 2272f03 to aabd358 on March 1, 2026 10:33
QueueWithRetry (79.5GB, 18.8% of allocs): pool via sync.Pool with
Release() that drains the 100K-element channel without closing it,
preserving the 1.6MB buffer across reuses. parallelExecutor.run uses
Release() instead of Close(); workers exit via context cancellation
when the exec loop goroutine defers execLoopCtxCancel(). Cleanup
ordering ensures stopWorkers() completes before Release().

filterDirtyFiles regexp (14.6GB, 3.3% of allocs): cache compiled
regexps in sync.Map keyed by pattern string. Each unique
(filenameBase, ext) pair compiles once instead of per-call.

Combined expected savings: ~94GB (22% of total allocations).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexeyAkhunov force-pushed the pool-queue-cache-regexp branch from aabd358 to b6d591f on March 1, 2026 12:10
…imeout and leak detection

ExecModuleTester.Close() now dumps all goroutine stacks to stderr if
bgComponentsEg.Wait() takes longer than 30s, helping identify which
goroutines are blocked during test cleanup.

TestExecutionSpecBlockchainDevnet now tracks goroutine counts before/after
each subtest and logs a warning when the delta exceeds 5, identifying
which subtests leak goroutines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexeyAkhunov (Contributor, Author)

Looks like the changes in this PR make the flaky test reproducible. I am using it to debug that flaky test.

Alexey Sharp and others added 5 commits March 1, 2026 15:36
…k noise

Cancel mock.Ctx 5 seconds before the test binary deadline so that
background goroutines (sentry pump loops, exec workers) exit on their
own even when the test function is stuck in UpdateForkChoice. This
makes the timeout goroutine dump show only the truly deadlocked
goroutines instead of dozens of sentry pump loops.

Remove per-subtest goroutine leak detection from block_test.go — it
confirmed that every test leaks temporarily (all clean up within 30s)
and the 8,810 warnings were noise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a test starts near the binary deadline (e.g. after a stuck test
is unblocked by auto-cancel), time.Until(deadline)-5s is zero or
negative, causing time.NewTimer to fire immediately. This cancels the
brand-new context before RecvMessageLoop can establish the sentry
stream, leaving StreamWg.Wait() stuck forever.

Skip the auto-cancel when remaining time is not positive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Execution skips all pending tasks

Two new diagnostic mechanisms:
1. scheduleExecution stall detector: logs task state when all pending tasks
   are skipped by the speculative check and none are sent to workers
2. execLoop stall timer: dumps full blockExecutor state if no results or
   requests arrive for 30 seconds

Both log maxValidated, maxExecComplete, pending/inProgress/complete counts,
and per-task incarnation/abort/fail details to identify the exact deadlock
scenario.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rce-schedule on stall

Two fixes for the scheduleExecution stall that causes execution/tests hangs:

1. Clear phantom inProgress state: when the speculative check skips a
   task, call clearInProgress to undo the takeNextPending move. Previously,
   skipped tasks remained in both inProgress AND pending, which could
   prevent removeDependency from correctly re-scheduling them.

2. Force-schedule on stall: when all pending tasks are skipped by the
   speculative check (scheduled==0 && skipped>0), force-schedule the
   first pending task with skipCheck=true. Re-executing a task is always
   correct — the spec check is purely a performance optimization.
   A wasted re-execution is infinitely better than a deadlock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@AlexeyAkhunov changed the title from "Pool QueueWithRetry and cache filterDirtyFiles regexp" to "[don't merge yet] Pool QueueWithRetry and cache filterDirtyFiles regexp" on Mar 1, 2026
Alexey Sharp and others added 2 commits March 2, 2026 06:40
When asyncTx.Apply/asyncRwTx.Apply/ApplyRw is called, it creates an
unbuffered result channel (rc), sends a request to the mdbx thread, and
waits in a select for either the result or ctx.Done(). If the context is
cancelled while the mdbx thread is executing the function, the caller
takes the ctx.Done() path and abandons rc. The mdbx thread then tries to
send the result to the unbuffered rc, but nobody is reading — blocking
the mdbx-locked goroutine forever.

This manifests as exec3_parallel.go:180 stuck on "chan send" for the
entire test timeout (56+ minutes), with the mdbx thread permanently
locked.

Fix: make rc buffered with capacity 1 so the mdbx thread can always
complete its send even if the caller has abandoned the channel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The force-schedule fallback (stall prevention) was setting
skipCheck[nextTx] = true, which tells the validator to accept the
result without checking state consistency. This is only correct for
the task at maxValidated+1 (the validation frontier), where all
predecessors are guaranteed validated. For a non-frontier task, this
can accept results computed against stale state, producing wrong
trie roots.

Remove the skipCheck flag from the force-schedule path. The task is
still dispatched and executed; its result goes through normal
validation and gets re-executed if invalid.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>