RepeatedTaskQueue: add interrupt_on_shutdown to abort in-flight tasks #3803

Open
sokolikp wants to merge 2 commits into cashapp:master from sokolikp:ssokolik/repeated-task-queue-interrupt-on-shutdown

@sokolikp sokolikp commented May 6, 2026

Summary

Adds an opt-in RepeatedTaskQueueConfig.interrupt_on_shutdown flag that makes RepeatedTaskQueue.triggerShutdown() interrupt in-flight task lambdas via taskExecutor.shutdownNow(), so awaitTerminated() actually waits for the lambda to unwind rather than just for the dispatch loop.

Motivation

RepeatedTaskQueue extends AbstractExecutionThreadService, which only manages the single dispatch thread that drives run(). When stopAsync() triggers shutdown:

  1. running is flipped to false and a wakeup task is added to the DelayQueue.
  2. The dispatch loop wakes, sees running == false, returns.
  3. The service is marked TERMINATED and awaitTerminated() returns.

The taskExecutor and any worker thread currently executing a previously-submitted task lambda are never touched. For tasks that perform long-blocking I/O (e.g. an SQS receiveMessage long-poll, which can park inside an HTTP call for up to 20s), this means the caller's awaitTerminated() returns essentially immediately while the lambda is still mid-I/O.
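The gap described above can be reproduced with plain JDK primitives. The following is a minimal sketch, not misk's code: the function name, the latches, and the polling dispatch thread are hypothetical stand-ins for the dispatch loop and taskExecutor.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical model of the dispatch-thread / worker-thread split.
// Returns true if the "dispatch" thread has exited while the worker is still busy.
fun dispatchExitsBeforeWorker(): Boolean {
  val running = AtomicBoolean(true)
  val taskExecutor = Executors.newSingleThreadExecutor()
  val taskStarted = CountDownLatch(1)
  val releaseTask = CountDownLatch(1)
  val taskFinished = AtomicBoolean(false)

  // Worker: stands in for a task lambda parked inside long-blocking I/O.
  taskExecutor.execute {
    taskStarted.countDown()
    releaseTask.await()
    taskFinished.set(true)
  }

  // Dispatch thread: only ever looks at the running flag.
  val dispatch = Thread {
    while (running.get()) Thread.sleep(10)
  }
  dispatch.start()
  taskStarted.await()

  running.set(false) // roughly what triggerShutdown() does today
  dispatch.join()    // roughly what awaitTerminated() waits for
  val workerStillBusy = !taskFinished.get()

  // Clean up so the sketch does not leak a thread.
  releaseTask.countDown()
  taskExecutor.shutdown()
  taskExecutor.awaitTermination(5, TimeUnit.SECONDS)
  return workerStillBusy
}
```

In this model the join on the dispatch thread completes while the worker is still blocked, which is the window in which a caller could close a downstream resource underneath the task.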

If the caller then closes a downstream resource (HTTP connection pool, AWS SDK client, DB connection, etc.) right after awaitTerminated() — which is the textbook lifecycle — it races the still-in-flight call. We've been chasing exactly this race in cash-server's SQS consumer (IllegalStateException: Connection pool shut down from SqsConsumer.processMessages at pod shutdown).

The current options are all unsatisfying:

  • Don't close the downstream resource at all and rely on JVM/OS shutdown to reclaim it. Works, but leaves a leak across restarts and obscures the lifecycle contract.
  • Catch the resulting exception inside the task lambda. Treats the symptom; the call is still aborted mid-flight.
  • Track the in-flight count manually with a ReentrantLock + Condition and have the caller block on it before close. Works, but it takes ~150 lines of bespoke concurrency code that every consumer of RepeatedTaskQueue would need to re-derive.

The standard JDK pattern for "stop the workers now" is ExecutorService.shutdownNow(), which calls Thread.interrupt() on each active worker. AWS SDK v2, Apache HTTP Client, and most blocking JDK calls (Thread.sleep, Object.wait, BlockingQueue.poll, etc.) respond to interrupts. This change wires that pattern into RepeatedTaskQueue behind an opt-in flag.
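As an illustration of the JDK pattern on its own, here is a small sketch using only java.util.concurrent. The function name is hypothetical, and the 60-second sleep stands in for any interruptible blocking call.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Returns true if shutdownNow() interrupted the sleeping worker and the
// executor terminated well before the 60s sleep would have finished.
fun shutdownNowInterruptsSleepingWorker(): Boolean {
  val executor = Executors.newSingleThreadExecutor()
  val started = CountDownLatch(1)
  val sawInterrupt = AtomicBoolean(false)

  executor.execute {
    started.countDown()
    try {
      Thread.sleep(60_000) // interruptible blocking call
    } catch (e: InterruptedException) {
      sawInterrupt.set(true)
    }
  }

  started.await()
  executor.shutdownNow() // interrupts the active worker
  val terminated = executor.awaitTermination(5, TimeUnit.SECONDS)
  return terminated && sawInterrupt.get()
}
```

Note that shutdownNow() itself does not block; it is the subsequent awaitTermination() call in this sketch that waits for the worker to unwind.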

Change

RepeatedTaskQueueConfig

New field:

val interrupt_on_shutdown: Boolean = false

RepeatedTaskQueue.triggerShutdown()

After the existing running.set(false) + wakeup-task path, if the flag is set, call taskExecutor.shutdownNow(). This sends Thread.interrupt() to every active worker.
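A rough sketch of the shape of that change, assuming a simplified class: ShutdownSketch and its constructor parameters are hypothetical, and the wakeup-task step of the real method is omitted.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical, simplified model; the real triggerShutdown() also enqueues
// a wakeup task on the DelayQueue, which is left out here.
class ShutdownSketch(
  private val taskExecutor: ExecutorService,
  private val interruptOnShutdown: Boolean,
) {
  val running = AtomicBoolean(true)

  fun triggerShutdown() {
    // Existing behavior: tell the dispatch loop to stop.
    running.set(false)
    // New, opt-in behavior: interrupt workers that are mid-task.
    if (interruptOnShutdown) {
      taskExecutor.shutdownNow()
    }
  }
}
```

With the flag left at false, the executor is not touched at all, which matches the stated goal of preserving the existing default.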

RepeatedTaskQueue.schedule() and scheduleWithBackoff()

Both catch (Throwable) blocks now check whether the throwable indicates a shutdown interrupt (direct InterruptedException, an InterruptedException somewhere in the cause chain, or Thread.currentThread().isInterrupted == true). If so, restore the interrupt flag and return Status.NO_RESCHEDULE instead of Status.FAILED. This prevents an interrupted shutdown from silently re-enqueueing the task and immediately re-running it on the still-shutting-down executor.
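One way the detection could look is sketched below. The helper names isShutdownInterrupt and runGuarded are hypothetical, and the Status enum is a local stand-in with only the values mentioned in this description; the actual catch blocks may be structured differently.

```kotlin
// Local stand-in for the task status type; only the values named above.
enum class Status { OK, FAILED, NO_RESCHEDULE }

// True if the throwable looks like a shutdown interrupt: the thread's
// interrupt flag is set, or an InterruptedException is the throwable
// itself or appears somewhere in its cause chain.
fun isShutdownInterrupt(t: Throwable): Boolean {
  if (Thread.currentThread().isInterrupted) return true
  val seen = HashSet<Throwable>() // guards against cyclic cause chains
  var cause: Throwable? = t
  while (cause != null && seen.add(cause)) {
    if (cause is InterruptedException) return true
    cause = cause.cause
  }
  return false
}

// Hypothetical wrapper showing the catch-block decision.
fun runGuarded(task: () -> Status): Status =
  try {
    task()
  } catch (t: Throwable) {
    if (isShutdownInterrupt(t)) {
      Thread.currentThread().interrupt() // restore the flag for callers
      Status.NO_RESCHEDULE
    } else {
      Status.FAILED
    }
  }
```

Checking the cause chain matters because libraries often wrap InterruptedException in their own exception types before rethrowing.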

RepeatedTaskQueue.run()

Wrap taskExecutor.submit { ... } in a try/catch (RejectedExecutionException). With shutdownNow() enabled, there's a small window in which the dispatch loop has passed its running.get() check but the executor is already shut down; submit would then throw and tear down the dispatch thread as FAILED rather than TERMINATED. We swallow the RejectedExecutionException only when running == false (i.e. clearly during shutdown).
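The guard might look roughly like the sketch below. The function name and signature are hypothetical, it uses execute() rather than submit() for brevity, and rethrowing when running is still true is an assumption about what "only when running == false" implies.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.RejectedExecutionException
import java.util.concurrent.atomic.AtomicBoolean

// Returns true if the task was handed to the executor, false if the
// rejection was swallowed because shutdown is already in progress.
fun submitUnlessShuttingDown(
  taskExecutor: ExecutorService,
  running: AtomicBoolean,
  task: Runnable,
): Boolean =
  try {
    taskExecutor.execute(task)
    true
  } catch (e: RejectedExecutionException) {
    // A rejection while still running is unexpected; surface it (assumption).
    if (running.get()) throw e
    false
  }
```

Tying the swallow to the running flag keeps an unexpected rejection during normal operation visible instead of silently dropping tasks.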

Why opt-in

Tasks that swallow InterruptedException without surfacing it, or perform genuinely uninterruptible work that must complete (e.g. a critical-section DB write), would have their work silently abandoned at shutdown. The default of false preserves existing behavior exactly. Only callers that explicitly opt in get the new semantics.

Tests

New RepeatedTaskQueueShutdownInterruptTest (uses real, not direct, executors so interrupt behavior actually exercises the worker thread):

  1. awaitTerminated returns promptly when interrupt_on_shutdown is enabled — schedules a task that sleeps for 60s, calls stopAsync() + awaitTerminated(), asserts the call returns in well under 5s and the task observed InterruptedException.
  2. interrupted task is not rescheduled — verifies the catch block's NO_RESCHEDULE path: an interrupted task runs exactly once, never re-enqueued.
  3. awaitTerminated may not wait for in-flight task when interrupt_on_shutdown is disabled — pins down the historical default: the in-flight task is not interrupted by us, the service still terminates.

All 20 existing RepeatedTaskQueueTest tests still pass.

Public API

misk/api/misk.api updated. The change is the canonical Kotlin "data class + new param with default" diff: a new (JJIZ)V ctor and a copy(JJIZ) replacing copy(JJI). Source-compat is preserved for callers using named arguments; the only callers affected are those that called RepeatedTaskQueueConfig.copy(...) positionally with all three existing args, which appears to be no one in this repo.
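For readers less familiar with this kind of diff, here is a small sketch of the pattern. QueueConfigSketch is a hypothetical stand-in: only interrupt_on_shutdown comes from this change, and the other field names and defaults are invented to match the (long, long, int) shape.

```kotlin
// Hypothetical stand-in for the config class; the first three fields are
// placeholders, not the real property names.
data class QueueConfigSketch(
  val first_delay_ms: Long = 50,
  val second_delay_ms: Long = 1000,
  val parallelism: Int = 1,
  // New trailing parameter with a default, as in this change.
  val interrupt_on_shutdown: Boolean = false,
)

// Call sites written before the new field existed still compile from source.
fun legacyStyleConfig(): QueueConfigSketch = QueueConfigSketch(10, 20, 3)

// Opting in via a named argument.
fun optedInConfig(): QueueConfigSketch =
  legacyStyleConfig().copy(interrupt_on_shutdown = true)
```

Adding the parameter changes the JVM descriptors of the constructor and of copy, which is what shows up in the API dump even though source call sites like the ones above are unaffected.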

Verification

  • bin/gradle :misk:compileKotlin --warn
  • bin/gradle :misk:test --tests "misk.tasks.RepeatedTaskQueueShutdownInterruptTest" ✅ (3/3)
  • bin/gradle :misk:test --tests "misk.tasks.RepeatedTaskQueueTest" ✅ (20/20)
  • bin/gradle :misk:apiCheck --warn
  • bin/gradle :misk:detekt --warn

sokolikp and others added 2 commits May 6, 2026 09:40
RepeatedTaskQueue.awaitTerminated() only joins the dispatch loop thread,
not the underlying taskExecutor's worker carrying the task lambda. For
tasks that perform long-blocking I/O (e.g. an SQS receiveMessage
long-poll) this means awaitTerminated returns essentially immediately
on stopAsync(), while the task lambda is still parked inside its I/O
call. Callers that want to release downstream resources (HTTP client
pools, etc.) right after awaitTerminated then race with the still-in-
flight call, often producing 'connection pool shut down' style errors.

This change adds an opt-in RepeatedTaskQueueConfig.interrupt_on_shutdown
flag. When true:

  - triggerShutdown() additionally calls taskExecutor.shutdownNow(),
    which Thread.interrupt()s any worker currently running a task
    lambda. Blocking I/O calls that respect interrupts (Apache HTTP
    Client, AWS SDK v2, Thread.sleep, etc.) unwind immediately.

  - The catch (Throwable) blocks in schedule() and scheduleWithBackoff()
    detect InterruptedException (directly, via the thread's interrupt
    flag, or as a wrapped cause), restore the interrupt flag, and
    return Status.NO_RESCHEDULE instead of marking the task FAILED and
    rescheduling it.

  - The dispatch loop's submit() call is wrapped in a try/catch for
    RejectedExecutionException, since the executor may be shut down
    between the running.get() check and the submit. This avoids the
    dispatch thread terminating with a failure state at shutdown.

Defaults to false. Existing callers see no behavior change.

Add RepeatedTaskQueueShutdownInterruptTest covering: interrupt-enabled
shutdown completes promptly, interrupted tasks are not rescheduled,
and interrupt-disabled (the default) preserves historical behavior.

Amp-Thread-ID: https://ampcode.com/threads/T-019dfd53-8e08-768b-bbf8-866aa2fca7c6
Co-authored-by: Amp <amp@ampcode.com>