RepeatedTaskQueue: add interrupt_on_shutdown to abort in-flight tasks by sokolikp · Pull Request #3803 · cashapp/misk

sokolikp · 2026-05-06T13:41:26Z

Summary

Adds an opt-in RepeatedTaskQueueConfig.interrupt_on_shutdown flag that makes RepeatedTaskQueue.triggerShutdown() interrupt in-flight task lambdas via taskExecutor.shutdownNow(), so awaitTerminated() actually waits for the lambda to unwind rather than just for the dispatch loop.

Motivation

RepeatedTaskQueue extends AbstractExecutionThreadService, which only manages the single dispatch thread that drives run(). When stopAsync() triggers shutdown:

1. running is flipped to false and a wakeup task is added to the DelayQueue.
2. The dispatch loop wakes, sees running == false, and returns.
3. The service is marked TERMINATED and awaitTerminated() returns.

The taskExecutor and any worker thread currently executing a previously-submitted task lambda are never touched. For tasks that perform long-blocking I/O (e.g. an SQS receiveMessage long-poll, which can park inside an HTTP call for up to 20s), this means the caller's awaitTerminated() returns essentially immediately while the lambda is still mid-I/O.
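To make the gap concrete, here is a minimal JDK-only sketch. It is not the misk implementation: a hypothetical dispatch thread stands in for run(), a single-thread executor stands in for taskExecutor, and a latch stands in for the long-poll. All names are illustrative assumptions.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Returns true if the task was still in flight after the dispatch thread had exited.
fun workerOutlivesDispatchLoop(): Boolean {
  val running = AtomicBoolean(true)
  val taskExecutor = Executors.newSingleThreadExecutor()
  val taskStarted = CountDownLatch(1)
  val releaseTask = CountDownLatch(1)
  val taskFinished = AtomicBoolean(false)

  // Stand-in for the dispatch loop: hands one task to the executor,
  // then idles until it is told to stop.
  val dispatch = Thread {
    taskExecutor.execute {
      taskStarted.countDown()
      releaseTask.await() // stand-in for a call parked in blocking I/O
      taskFinished.set(true)
    }
    while (running.get()) Thread.sleep(10)
  }
  dispatch.start()
  taskStarted.await()

  // "Shutdown" that only stops and joins the dispatch thread.
  running.set(false)
  dispatch.join()
  val stillInFlight = !taskFinished.get()

  // Clean up so the sketch does not leak a thread.
  releaseTask.countDown()
  taskExecutor.shutdown()
  taskExecutor.awaitTermination(5, TimeUnit.SECONDS)
  return stillInFlight
}
```

Calling workerOutlivesDispatchLoop() returns true: joining the dispatch thread says nothing about the worker, which is the window in which a caller could close a resource the task is still using.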

If the caller then closes a downstream resource (HTTP connection pool, AWS SDK client, DB connection, etc.) right after awaitTerminated() — which is the textbook lifecycle — it races the still-in-flight call. We've been chasing exactly this race in cash-server's SQS consumer (IllegalStateException: Connection pool shut down from SqsConsumer.processMessages at pod shutdown).

The current options are all unsatisfying:

- Don't close the downstream resource at all and rely on JVM/OS shutdown to reclaim it. Works, but leaves a leak across restarts and obscures the lifecycle contract.
- Catch the resulting exception inside the task lambda. Treats the symptom; the call is still aborted mid-flight.
- Track in-flight count manually with a ReentrantLock + Condition and have the caller block on it before close. Works, but it is ~150 lines of bespoke concurrency code that every consumer of RepeatedTaskQueue would need to re-derive.

The standard JDK pattern for "stop the workers now" is ExecutorService.shutdownNow(), which calls Thread.interrupt() on each active worker. AWS SDK v2, Apache HTTP Client, and most blocking JDK calls (Thread.sleep, Object.wait, BlockingQueue.poll, etc.) respond to interrupts. This change wires that pattern into RepeatedTaskQueue behind an opt-in flag.
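As a small illustration of that JDK pattern on its own (independent of misk, with a hypothetical function name), the sketch below parks a worker in Thread.sleep and checks that shutdownNow() unwinds it via InterruptedException.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Returns true if the worker observed InterruptedException and the
// executor terminated well before the 60s sleep would have finished.
fun shutdownNowInterruptsSleepingWorker(): Boolean {
  val executor = Executors.newSingleThreadExecutor()
  val started = CountDownLatch(1)
  val sawInterrupt = AtomicBoolean(false)

  executor.execute {
    started.countDown()
    try {
      Thread.sleep(60_000) // interruptible blocking call
    } catch (e: InterruptedException) {
      sawInterrupt.set(true)
      Thread.currentThread().interrupt() // restore the flag for callers up the stack
    }
  }

  started.await()
  executor.shutdownNow() // interrupts the active worker
  val terminated = executor.awaitTermination(5, TimeUnit.SECONDS)
  return terminated && sawInterrupt.get()
}
```

Code that blocks in a call which ignores interrupts would not unwind this way, which is part of why the flag is opt-in.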

Change

`RepeatedTaskQueueConfig`

New field:

val interrupt_on_shutdown: Boolean = false

`RepeatedTaskQueue.triggerShutdown()`

After the existing running.set(false) + wakeup-task path, if the flag is set, call taskExecutor.shutdownNow(). This sends Thread.interrupt() to every active worker.
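A rough sketch of the shape of this change, using a toy class rather than the real RepeatedTaskQueue (ToyTaskQueue, interruptOnShutdown, and taskWasInterrupted are hypothetical names, and the wakeup-task step is only noted in a comment):

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

class ToyTaskQueue(private val interruptOnShutdown: Boolean) {
  private val running = AtomicBoolean(true)
  val taskExecutor: ExecutorService = Executors.newSingleThreadExecutor()

  fun triggerShutdown() {
    running.set(false)
    // The real queue also enqueues a wakeup task for the dispatch loop here.
    if (interruptOnShutdown) taskExecutor.shutdownNow()
  }
}

// Runs one blocking task, triggers shutdown, and reports whether the
// task saw an InterruptedException.
fun taskWasInterrupted(interruptOnShutdown: Boolean): Boolean {
  val queue = ToyTaskQueue(interruptOnShutdown)
  val started = CountDownLatch(1)
  val release = CountDownLatch(1)
  val done = CountDownLatch(1)
  val interrupted = AtomicBoolean(false)

  queue.taskExecutor.execute {
    started.countDown()
    try {
      release.await() // stand-in for blocking I/O
    } catch (e: InterruptedException) {
      interrupted.set(true)
    }
    done.countDown()
  }

  started.await()
  queue.triggerShutdown()
  // Without the flag nothing wakes the task, so release it after a short wait.
  if (!done.await(200, TimeUnit.MILLISECONDS)) release.countDown()
  done.await()
  queue.taskExecutor.shutdown()
  return interrupted.get()
}
```

In this sketch taskWasInterrupted(true) returns true and taskWasInterrupted(false) returns false, mirroring the opt-in and default behaviors described here.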

`RepeatedTaskQueue.schedule()` and `scheduleWithBackoff()`

Both catch (Throwable) blocks now check whether the throwable indicates a shutdown interrupt (direct InterruptedException, an InterruptedException somewhere in the cause chain, or Thread.currentThread().isInterrupted == true). If so, restore the interrupt flag and return Status.NO_RESCHEDULE instead of Status.FAILED. This prevents an interrupted shutdown from silently re-enqueueing the task and immediately re-running it on the still-shutting-down executor.
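The detection logic might look roughly like the following. This is a sketch under assumptions, not the code in the PR: the Status enum is a local stand-in, isShutdownInterrupt and runGuarded are hypothetical helper names, and the depth cap on the cause chain is an added safeguard.

```kotlin
// Local stand-in for the queue's status type; only these values matter here.
enum class Status { OK, FAILED, NO_RESCHEDULE }

// True if the throwable is, or wraps, an InterruptedException, or if the
// current thread's interrupt flag is set. The chain walk is capped at 32
// links to guard against a cyclic cause chain.
fun isShutdownInterrupt(t: Throwable): Boolean =
  Thread.currentThread().isInterrupted ||
    generateSequence(t) { it.cause }.take(32).any { it is InterruptedException }

// Mirrors the catch-block behavior: interrupts map to NO_RESCHEDULE,
// anything else to FAILED.
fun runGuarded(task: () -> Status): Status =
  try {
    task()
  } catch (t: Throwable) {
    if (isShutdownInterrupt(t)) {
      Thread.currentThread().interrupt() // restore the flag for the executor
      Status.NO_RESCHEDULE
    } else {
      Status.FAILED
    }
  }
```

For example, runGuarded { throw RuntimeException(InterruptedException()) } yields NO_RESCHEDULE and leaves the interrupt flag set, while an unrelated IllegalStateException yields FAILED.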

`RepeatedTaskQueue.run()`

Wrap taskExecutor.submit { ... } in a try/catch (RejectedExecutionException). With shutdownNow() enabled, there's a tiny window where the dispatch loop has passed its running.get() check but the executor is already shut down; submit would then throw and tear down the dispatch thread as FAILED rather than TERMINATED. We swallow the RejectedExecutionException only when running == false (i.e. clearly during shutdown).
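One way to express that guard, as a standalone sketch (submitOrSkipOnShutdown is a hypothetical helper, not a function in the PR):

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.atomic.AtomicBoolean

// Returns true if the task was accepted, false if it was rejected while
// shutting down. A rejection while still running is rethrown, since that
// indicates a real failure rather than a shutdown race.
fun submitOrSkipOnShutdown(
  running: AtomicBoolean,
  executor: ExecutorService,
  task: Runnable
): Boolean =
  try {
    executor.submit(task)
    true
  } catch (e: RejectedExecutionException) {
    if (running.get()) throw e
    false
  }
```

Rethrowing when running is still true keeps genuine rejections (for example a misconfigured executor) visible instead of hiding them behind the shutdown path.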

Why opt-in

Tasks that swallow InterruptedException without surfacing it, or perform genuinely uninterruptible work that must complete (e.g. a critical-section DB write), would have their work silently abandoned at shutdown. The default of false preserves existing behavior exactly. Only callers that explicitly opt in get the new semantics.

Tests

New RepeatedTaskQueueShutdownInterruptTest (uses real, not direct, executors so interrupt behavior actually exercises the worker thread):

- `awaitTerminated returns promptly when interrupt_on_shutdown is enabled`: schedules a task that sleeps for 60s, calls stopAsync() + awaitTerminated(), asserts the call returns in well under 5s and the task observed InterruptedException.
- `interrupted task is not rescheduled`: verifies the catch block's NO_RESCHEDULE path: an interrupted task runs exactly once and is never re-enqueued.
- `awaitTerminated may not wait for in-flight task when interrupt_on_shutdown is disabled`: pins down the historical default: the in-flight task is not interrupted by us, and the service still terminates.

All 20 existing RepeatedTaskQueueTest tests still pass.

Public API

misk/api/misk.api updated. The change is the canonical Kotlin "data class + new param with default" diff: a new (JJIZ)V ctor and a copy(JJIZ) replacing copy(JJI). Source-compat is preserved for callers using named arguments; the only callers affected are those that called RepeatedTaskQueueConfig.copy(...) positionally with all three existing args, which appears to be no one in this repo.
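For readers less familiar with this kind of diff, a simplified sketch follows. The class name and the first three field names and defaults are hypothetical placeholders chosen only to match the (JJI) shape (two Longs and an Int); only interrupt_on_shutdown comes from this PR.

```kotlin
// Hypothetical stand-in for the config class after the change.
data class RepeatedTaskQueueConfigSketch(
  val first_long_setting: Long = 50,
  val second_long_setting: Long = 60,
  val int_setting: Int = 1,
  val interrupt_on_shutdown: Boolean = false // new trailing parameter with a default
)

// Existing call sites written against the three-field version still compile
// from source, because the new parameter has a default.
fun existingStyleCall(): RepeatedTaskQueueConfigSketch =
  RepeatedTaskQueueConfigSketch(first_long_setting = 100, int_setting = 2)

// Opting in with a named argument, via the constructor or copy().
fun optedIn(): RepeatedTaskQueueConfigSketch =
  existingStyleCall().copy(interrupt_on_shutdown = true)
```

In JVM descriptor terms, J is long, I is int, and Z is boolean, so adding the Boolean is what turns (JJI) into (JJIZ) in the API dump.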

Verification

bin/gradle :misk:compileKotlin --warn ✅
bin/gradle :misk:test --tests "misk.tasks.RepeatedTaskQueueShutdownInterruptTest" ✅ (3/3)
bin/gradle :misk:test --tests "misk.tasks.RepeatedTaskQueueTest" ✅ (20/20)
bin/gradle :misk:apiCheck --warn ✅
bin/gradle :misk:detekt --warn ✅

RepeatedTaskQueue.awaitTerminated() only joins the dispatch loop thread, not the underlying taskExecutor's worker carrying the task lambda. For tasks that perform long-blocking I/O (e.g. an SQS receiveMessage long-poll) this means awaitTerminated returns essentially immediately on stopAsync(), while the task lambda is still parked inside its I/O call. Callers that want to release downstream resources (HTTP client pools, etc.) right after awaitTerminated then race with the still-in-flight call, often producing 'connection pool shut down' style errors.

This change adds an opt-in RepeatedTaskQueueConfig.interrupt_on_shutdown flag. When true:

- triggerShutdown() additionally calls taskExecutor.shutdownNow(), which Thread.interrupt()s any worker currently running a task lambda. Blocking I/O calls that respect interrupts (Apache HTTP Client, AWS SDK v2, Thread.sleep, etc.) unwind immediately.
- The catch (Throwable) blocks in schedule() and scheduleWithBackoff() detect InterruptedException (directly, via the thread's interrupt flag, or as a wrapped cause), restore the interrupt flag, and return Status.NO_RESCHEDULE instead of marking the task FAILED and rescheduling it.
- The dispatch loop's submit() call is wrapped in a try/catch for RejectedExecutionException, since the executor may be shut down between the running.get() check and the submit. This avoids the dispatch thread terminating with a failure state at shutdown.

Defaults to false. Existing callers see no behavior change.

Add RepeatedTaskQueueShutdownInterruptTest covering: interrupt-enabled shutdown completes promptly, interrupted tasks are not rescheduled, and interrupt-disabled (the default) preserves historical behavior.

Amp-Thread-ID: https://ampcode.com/threads/T-019dfd53-8e08-768b-bbf8-866aa2fca7c6
Co-authored-by: Amp <amp@ampcode.com>

sokolikp and others added 2 commits May 6, 2026 09:40

Merge branch 'master' into ssokolik/repeated-task-queue-interrupt-on-…

c29d3e4

sokolikp wants to merge 2 commits into cashapp:master from sokolikp:ssokolik/repeated-task-queue-interrupt-on-shutdown
