Drain polled tasks on shutdown #2261
select {
case bw.taskQueueCh <- &polledTask{task: task, permit: slotPermit}:
	didSendTask = true
case <-bw.stopCh:
Just checking the stop channel is still used elsewhere since we removed it in two spots
yep! still used in plenty of other places
recheck
b930672 to 29b355a
bw.limiterContextCancel()

// Wait for pollers to finish. (pollTaskServiceTimeOut) bounds this if the connection is broken.
bw.pollerWG.Wait()
Unbounded pollerWG.Wait() bypasses user-configured stopTimeout
Medium Severity
bw.pollerWG.Wait() in Stop() blocks without any timeout, and runs before awaitWaitGroup(&bw.stopWG, bw.options.stopTimeout). Combined with doPoll now waiting unconditionally on <-doneC (bounded only by pollTaskServiceTimeOut = 70s), Stop() can block for up to 70 seconds before the user's stopTimeout even begins counting. Previously, a 5-second fallback cancellation bounded this. In failure scenarios (broken gRPC connection, unresponsive server), total Stop() duration becomes ~70s + stopTimeout instead of just stopTimeout.
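One possible way to address this concern, sketched under assumptions rather than taken from the PR: spend a single stopTimeout budget across both waits, so a stuck poller cannot add its own delay on top. The helpers waitWithTimeout and stopWithBudget below are hypothetical names; they are not the SDK's awaitWaitGroup, whose implementation is not shown here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitWithTimeout is a hypothetical helper. It reports true if wg finished
// before timeout elapsed. On timeout, the helper goroutine stays parked until
// wg eventually completes.
func waitWithTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// stopWithBudget (hypothetical) waits for pollers first, then for the
// remaining work, with both waits sharing one deadline.
func stopWithBudget(pollerWG, stopWG *sync.WaitGroup, stopTimeout time.Duration) bool {
	deadline := time.Now().Add(stopTimeout)
	if !waitWithTimeout(pollerWG, time.Until(deadline)) {
		return false
	}
	return waitWithTimeout(stopWG, time.Until(deadline))
}

// demoStopBudget runs one shutdown that finishes in time and one that does not.
func demoStopBudget() (finished bool, timedOut bool) {
	var pollerWG, stopWG sync.WaitGroup
	pollerWG.Add(1)
	go func() {
		defer pollerWG.Done()
		time.Sleep(10 * time.Millisecond) // stand-in for an in-flight poll
	}()
	finished = stopWithBudget(&pollerWG, &stopWG, time.Second)

	var stuckWG sync.WaitGroup
	stuckWG.Add(1) // stand-in for a poll on a broken connection
	timedOut = !stopWithBudget(&stuckWG, &stopWG, 30*time.Millisecond)
	stuckWG.Done() // release the parked helper goroutine
	return finished, timedOut
}

func main() {
	fmt.Println(demoStopBudget())
}
```

The trade-off is that a poll still in flight when the budget expires may return a task after Stop() has given up, which is the dropped-task case this PR is trying to remove.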
This reverts commit be5c0e4.


What was changed
Refactored worker shutdown to use a two-stage approach: pollers shut down first, then the task dispatcher drains any remaining tasks before exiting. This ensures tasks polled during shutdown are processed rather than silently dropped.
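The two-stage ordering can be illustrated with a minimal, self-contained sketch. It is not the SDK code: runTwoStageShutdown is a hypothetical function, tasks are plain ints, and only the names pollerWG and taskQueueCh are borrowed from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// runTwoStageShutdown starts nPollers pollers that each deliver one task,
// shuts down in two stages, and returns how many tasks were processed.
func runTwoStageShutdown(nPollers int) int {
	taskQueueCh := make(chan int) // unbuffered, as a simplifying assumption
	var pollerWG sync.WaitGroup
	dispatcherDone := make(chan struct{})
	processed := 0

	// Dispatcher: the range loop ends only once the channel is closed
	// and every queued task has been received.
	go func() {
		defer close(dispatcherDone)
		for range taskQueueCh {
			processed++
		}
	}()

	for i := 0; i < nPollers; i++ {
		pollerWG.Add(1)
		go func(id int) {
			defer pollerWG.Done()
			taskQueueCh <- id // always send; there is no stop case that could drop it
		}(i)
	}

	// Stage 1: wait until every poller has handed off its task and exited.
	pollerWG.Wait()
	// Stage 2: no sender remains, so closing is safe; the dispatcher drains and exits.
	close(taskQueueCh)
	<-dispatcherDone
	return processed
}

func main() {
	fmt.Println(runTwoStageShutdown(5))
}
```

Closing the channel only after pollerWG.Wait() returns matters here: closing it while a poller could still send would panic.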
Key changes:
Why?
PR #2199 changed shutdown to let the server complete in-flight polls instead of cancelling them. This exposed a pre-existing race: when a poller received a task during shutdown, the Go SDK would silently drop it. The dispatcher had the same issue: it could exit on stopCh before reading pending tasks from the channel.
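A small sketch of the race, under simplifying assumptions (int tasks, an unbuffered channel, and the hypothetical helper names sendOrDrop and demoDrop): once stopCh is closed and no dispatcher is ready to receive, a select with a stop case takes the stop branch and the task is lost, whereas an unconditional send waits for the dispatcher.

```go
package main

import "fmt"

// sendOrDrop mimics the old pattern: hand the task to the dispatcher unless
// the stop channel fires first. It returns false when the task was dropped.
func sendOrDrop(taskQueueCh chan<- int, stopCh <-chan struct{}, task int) bool {
	select {
	case taskQueueCh <- task:
		return true
	case <-stopCh:
		return false
	}
}

// demoDrop shows the drop and then the unconditional send that avoids it.
func demoDrop() (dropped bool, delivered bool) {
	stopCh := make(chan struct{})
	close(stopCh) // shutdown has started

	taskQueueCh := make(chan int) // no receiver is ready yet
	// Only the stop case is ready, so the polled task is silently dropped.
	dropped = !sendOrDrop(taskQueueCh, stopCh, 1)

	// With an unconditional send, the poller blocks until the dispatcher receives.
	got := make(chan int, 1)
	go func() { got <- <-taskQueueCh }()
	taskQueueCh <- 2
	delivered = <-got == 2
	return dropped, delivered
}

func main() {
	fmt.Println(demoDrop())
}
```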
This aligns the Go SDK's shutdown with how Core SDK handles it:
Checklist
Closes #1197
How was this tested:
Any docs updates needed?
No
Note
Medium Risk
Changes worker shutdown ordering and channel lifecycle, which is concurrency-sensitive and could impact task processing/Stop() behavior under load or broken connections.
Overview
Refactors worker shutdown into a two-stage drain: pollers are tracked via a new pollerWG, and once all pollers exit the worker closes taskQueueCh so the dispatcher can range and drain remaining tasks before exiting.

Adjusts polling/shutdown semantics to avoid dropped tasks: pollTask now always sends polled tasks to taskQueueCh (no stop-select drop), runTaskDispatcher processes queued polled tasks even after limiter cancellation, and basePoller.doPoll removes the prior 5s shutdown timer hack and simply waits for the poll to complete when workerPollCompleteOnShutdown is enabled.

Adds TestTaskNotDroppedDuringShutdown to validate a task returned during shutdown is still dispatched/processed, and tweaks an integration-test sleep comment/timing guidance.

Reviewed by Cursor Bugbot for commit 57fdffe.