Skip to content

Improve block proposal cancellation to (best effort) avoid concurrency issues#10219

Open
fab-10 wants to merge 2 commits intobesu-eth:mainfrom
fab-10:fix-cancelled-block-creation-exception
Open

Improve block proposal cancellation to (best effort) avoid concurrency issues#10219
fab-10 wants to merge 2 commits intobesu-eth:mainfrom
fab-10:fix-cancelled-block-creation-exception

Conversation

@fab-10
Copy link
Copy Markdown
Contributor

@fab-10 fab-10 commented Apr 10, 2026

Summary

Today I spent some time reviewing the exception reported below, and it ended up to be an interesting investigation, that led to an improvement of the cancellation path to avoid (as much as possible) this concurrency issue.
Usually, like in the reported case, the exception is not an issue, since there were previous proposals available to return, but in the case the cancellation happens on the very first block creation iteration, then it could result in an empty proposal returned, instead of something better.

"@timestamp":"2026-03-12T06:10:01,044","level":"INFO","thread":"vert.x-worker-thread-0","class":"MergeCoordinator","message":"Start building proposals for block 2401377 identified by 0x1c0bb588326184d4","throwable":""}
{"@timestamp":"2026-03-12T06:10:01,049","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineForkchoiceUpdated","message":"FCU(VALID) | head: 650a0.....c9b9d | safe: 82091.....26b9b | finalized: 3b6d9.....8994>
{"@timestamp":"2026-03-12T06:10:01,576","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x86faf7360d4e8783d0b9f575>
{"@timestamp":"2026-03-12T06:10:02,055","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x19c96da5d3bb50ce2c7c9fe0>
{"@timestamp":"2026-03-12T06:10:03,061","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x83a7f9a28e62a77c4e6a6ee3>
{"@timestamp":"2026-03-12T06:10:03,740","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0xde4f0918573c99276894e3e8>
{"@timestamp":"2026-03-12T06:10:04,520","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0xeca9b4dbd0ef06a5a7952d39>
{"@timestamp":"2026-03-12T06:10:05,410","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x5fe621c357a9c541fb3da0f6>
{"@timestamp":"2026-03-12T06:10:06,267","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0xde4cae0d767ec02060b6ae2c>
{"@timestamp":"2026-03-12T06:10:07,089","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x6ed9e875c8d875c73f2d48aa>
{"@timestamp":"2026-03-12T06:10:08,031","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0xee609340f1e4e6228ac660ed>
{"@timestamp":"2026-03-12T06:10:09,203","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x0f6bae3b7d0e19a3f0d98b60>
{"@timestamp":"2026-03-12T06:10:10,343","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x83adaef273d73dece22c1f65>
{"@timestamp":"2026-03-12T06:10:11,429","level":"INFO","thread":"EthScheduler-BlockCreation-0-2401377","class":"PostMergeContext","message":"New proposal for payloadId 0x1c0bb588326184d4 block 2401377 (0x4befc6807ed116dc658fe0f1>
{"@timestamp":"2026-03-12T06:10:12,029","level":"INFO","thread":"vert.x-worker-thread-0","class":"AbstractEngineGetPayload","message":"Produced #2,401,377  (4befc.....66005)|   37 tx | 16 ws | 50,704,142 (84.5%) gas in 1.585s | >
{"@timestamp":"2026-03-12T06:10:12,029","level":"WARN","thread":"EthScheduler-BlockCreation-0-2401377","class":"MergeCoordinator","message":"Something went wrong creating block for payload id 0x1c0bb588326184d4, error java.lang.>
        at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:362)
        at org.hyperledger.besu.consensus.merge.blockcreation.MergeBlockCreator.createBlock(MergeBlockCreator.java:87)
        at org.hyperledger.besu.consensus.merge.blockcreation.MergeCoordinator.lambda$tryToBuildBetterBlock$5(MergeCoordinator.java:434)
        at org.hyperledger.besu.consensus.merge.blockcreation.MergeCoordinator.recoverableBlockCreation(MergeCoordinator.java:529)
        at org.hyperledger.besu.consensus.merge.blockcreation.MergeCoordinator.retryBlockCreationUntilUseful(MergeCoordinator.java:496)
        at org.hyperledger.besu.consensus.merge.blockcreation.MergeCoordinator.lambda$tryToBuildBetterBlock$6(MergeCoordinator.java:461)
        at org.hyperledger.besu.ethereum.eth.manager.EthScheduler.lambda$scheduleBlockCreationTask$11(EthScheduler.java:234)
        at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.util.ConcurrentModificationException
        at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1605)
        at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1628)
        at org.hyperledger.besu.ethereum.trie.pathbased.common.worldview.accumulator.PathBasedWorldStateUpdateAccumulator.commit(PathBasedWorldStateUpdateAccumulator.java:354)
        at org.hyperledger.besu.ethereum.mainnet.WithdrawalsProcessor.processWithdrawals(WithdrawalsProcessor.java:45)
        at org.hyperledger.besu.ethereum.blockcreation.AbstractBlockCreator.createBlock(AbstractBlockCreator.java:281)
        ... 10 more
","throwable":""}

When block creation is cancelled or times out, the selection thread may still be running briefly after FutureTask.get() returns. Without proper synchronisation, post-selection steps (withdrawals, EL requests, rewards processing) can race against the still-running thread on the shared world state.
Exactly what happened in the above exception, the race condition between the withdrawals processing and the cancelled but still running tx selection thread.

  • Add CountDownLatch to internal tx selection — mirrors the existing mechanism in plugin selection; the finally block counts down after the selection loop completes, giving a reliable signal that the thread has stopped.
  • Extract waitForCancellationToBeProcessed — shared helper called after both plugin and internal selection phases when the latch is non-zero. Correctly handles a negative remaining-time value (logs and returns immediately instead of silently passing a negative timeout to CountDownLatch.await) and logs the outcome of the wait.
  • Split exception handling by typerollback() is now only called for ExecutionException, where the callable has already thrown and the finally block is guaranteed to have run (latch at 0, selection thread stopped). CancellationException and InterruptedException no longer trigger a rollback, removing a potential race where the selection thread could still be mutating selectionPendingActions and the world state updaters.
  • Demote post-cancellation exceptions in MergeCoordinator — exceptions thrown after isBlockCreationCancelled is true are now logged at INFO with a message guiding the user to report if unexpected, rather than at WARN, reducing noise from the expected concurrency edge cases during block proposal cancellation.

Test plan

  • Existing unit and integration tests for BlockTransactionSelector and MergeCoordinator pass
  • Manually verify no WARN-level noise during normal block proposal cancellation (e.g. engine_forkchoiceUpdated followed by a new payload)
  • Verify INFO log appears and includes stack trace when an exception is thrown after cancellation

🤖 Generated with Claude Code

…ection

When block creation is cancelled or times out, the selection thread may still
be running briefly. This change adds a CountDownLatch to internal tx selection
(mirroring the existing plugin selection mechanism) and extracts a shared
waitForCancellationToBeProcessed method that correctly handles negative
remaining-time values and logs the outcome of the wait.

Exception handling in both selection phases is split by type so that
rollback() is only called for ExecutionException, where the selection thread
is guaranteed to have finished. CancellationException and InterruptedException
no longer trigger a rollback, removing a potential race on shared world state.

In MergeCoordinator, exceptions thrown after a cancellation are now logged at
INFO with guidance to report if unexpected, rather than at WARN, reducing noise
from the expected concurrency edge cases during block proposal cancellation.

Signed-off-by: Fabio Di Fabio <fabio.difabio@consensys.net>
@fab-10 fab-10 changed the title Fix concurrency issues when block proposal is cancelled during tx selection Improve block proposal cancellation to (best effort) avoid concurrency issues Apr 10, 2026
- Add test verifying CancellationException during plugin selection is
  handled gracefully (no exception propagated to caller)
- Add test verifying internal selection CountDownLatch causes
  buildTransactionListForBlock() to wait for the selection thread
- Add test verifying Throwable thrown after block creation cancellation
  is handled gracefully (logged at INFO, not propagated)
- Remove early return from timeLimitedSelection when isCancelled is
  true: the guard was causing validPendingTransactionIsNotIncludedIf
  SelectionCancelled to fail because evaluatePendingTransaction (which
  marks each tx as SELECTION_CANCELLED) was never reached; the check
  is unnecessary since evaluatePendingTransaction already handles
  isCancelled on every iteration without touching world state

Signed-off-by: Fabio Di Fabio <fabio.difabio@consensys.net>
@fab-10 fab-10 force-pushed the fix-cancelled-block-creation-exception branch from f391044 to cf708bc Compare April 10, 2026 16:31
@fab-10 fab-10 marked this pull request as ready for review April 10, 2026 17:00
Copilot AI review requested due to automatic review settings April 10, 2026 17:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improves block proposal cancellation behavior to reduce race conditions between transaction selection threads and post-selection world state mutations, and reduces log noise for expected cancellation-related exceptions.

Changes:

  • Add a CountDownLatch to internal tx selection and a shared waitForCancellationToBeProcessed(...) helper to best-effort wait for selection threads to stop.
  • Refine exception handling to avoid rollback on cancellation/interruption paths and to handle plugin-selection cancellation gracefully.
  • Demote post-cancellation block creation exceptions in MergeCoordinator from WARN to INFO and add tests covering these scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
ethereum/blockcreation/src/main/java/org/hyperledger/besu/ethereum/blockcreation/txselection/BlockTransactionSelector.java Adds latches + wait helper and refines cancellation/interrupt/exception handling during selection.
ethereum/blockcreation/src/test/java/org/hyperledger/besu/ethereum/blockcreation/AbstractBlockTransactionSelectorTest.java Adds tests for external cancellation during plugin selection and waiting for internal selection thread completion.
consensus/merge/src/main/java/org/hyperledger/besu/consensus/merge/blockcreation/MergeCoordinator.java Logs post-cancellation exceptions at INFO with guidance; keeps WARN for non-cancelled failures.
consensus/merge/src/test/java/org/hyperledger/besu/consensus/merge/blockcreation/MergeCoordinatorTest.java Adds a regression test ensuring post-cancellation exceptions don’t escape the background task.

context,
nanosToMillis(maxWaitTimeNanos),
ex);
throw new RuntimeException(ex);
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This converts InterruptedException into a RuntimeException without restoring the thread interrupt flag. If an interrupt happens here, best practice is to call Thread.currentThread().interrupt() and then return (best-effort) or rethrow InterruptedException so upper layers can handle cancellation correctly. Throwing a new runtime exception here can also defeat the goal of 'best effort' completion during cancellation paths.

Suggested change
throw new RuntimeException(ex);
Thread.currentThread().interrupt();
return;

Copilot uses AI. Check for mistakes.
} catch (InterruptedException e) {
LOG.debug(
"Transaction selection interrupted during execution, finalizing with current progress",
e);
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both cancellation and interruption are swallowed here. For InterruptedException, consider restoring the interrupt status (Thread.currentThread().interrupt()) so higher-level code can observe the interruption. Right now the interrupt is cleared by catching the exception, which can cause the rest of block creation to continue even though the thread was interrupted.

Suggested change
e);
e);
Thread.currentThread().interrupt();

Copilot uses AI. Check for mistakes.
Comment on lines +1267 to +1280
CompletableFuture.runAsync(
() -> {
try {
pluginTaskStarted.await();
selectorRef.get().cancel();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});

// must complete without throwing; the old code propagated CancellationException out of
// pluginTimeLimitedSelection, which would bubble up through buildTransactionListForBlock()
final var results = selectorRef.get().buildTransactionListForBlock();
assertThat(results).isNotNull();
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This async task is never awaited and pluginTaskStarted.await() is unbounded. If the latch is never counted down (e.g., due to a failure before plugin selection starts), this can leak a blocked task into the common pool and make the test suite flaky. Consider keeping the returned CompletableFuture, using a bounded await (timeout), and ensuring the task completes (e.g., join with timeout) before the test ends.

Suggested change
CompletableFuture.runAsync(
() -> {
try {
pluginTaskStarted.await();
selectorRef.get().cancel();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});
// must complete without throwing; the old code propagated CancellationException out of
// pluginTimeLimitedSelection, which would bubble up through buildTransactionListForBlock()
final var results = selectorRef.get().buildTransactionListForBlock();
assertThat(results).isNotNull();
final CompletableFuture<Void> cancellationTask =
CompletableFuture.runAsync(
() -> {
try {
if (pluginTaskStarted.await(5, java.util.concurrent.TimeUnit.SECONDS)) {
selectorRef.get().cancel();
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
});
// must complete without throwing; the old code propagated CancellationException out of
// pluginTimeLimitedSelection, which would bubble up through buildTransactionListForBlock()
final var results = selectorRef.get().buildTransactionListForBlock();
assertThat(results).isNotNull();
cancellationTask.orTimeout(5, java.util.concurrent.TimeUnit.SECONDS).join();

Copilot uses AI. Check for mistakes.
Comment on lines +1310 to +1324
// On processing: cancel the selector (from within the selection thread), sleep to simulate
// work that continues after the interrupt, then mark as finished.
final Answer<TransactionProcessingResult> slowCancelAnswer =
invocation -> {
selectorRef.get().cancel(); // triggers FutureTask.cancel(true) on current task
try {
Thread.sleep(300); // wakes immediately via InterruptedException
} catch (InterruptedException e) {
try {
Thread.sleep(200); // simulate cleanup still running after interrupt
} catch (InterruptedException ie) {
Thread.currentThread().interrupt();
}
}
selectionThreadFinished.set(true);
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test relies on fixed Thread.sleep(...) timing to model post-interrupt work, which is prone to flakiness under load (different schedulers/CI environments). Prefer coordination primitives (e.g., latches/barriers) to deterministically block/unblock the selection thread and to signal 'cleanup finished' without depending on wall-clock sleeps.

Copilot uses AI. Check for mistakes.
// concurrency issues, so inform the user how to interpret that possibility
LOG.info(
"Got an exception after cancellation of block creation for payload id {}. "
+ "This is expected if previous log alerted about that, otherwise please report",
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The INFO message is hard to interpret (“previous log alerted about that” is ambiguous and grammatically awkward). Consider making it actionable by referencing the actual condition the user should look for (e.g., timeout/cancellation log line name) and clearly stating when to report (e.g., 'If you do not see a prior cancellation/timeout log for this payload, please report this stack trace').

Suggested change
+ "This is expected if previous log alerted about that, otherwise please report",
+ "This is expected if you already saw the earlier "
+ "\"Block creation for payload id {} has been cancelled, reason {}\" log for "
+ "this payload. If you do not see that earlier cancellation log for this "
+ "payload, please report this stack trace.",

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants