@gagantrivedi gagantrivedi commented Oct 24, 2025

Problem

TaskRunner threads were appearing "dead" and unable to process new tasks after some tasks timed out. This occurred because the ThreadPoolExecutor's context manager was blocking indefinitely when tasks exceeded their
timeout but continued running.

Root Cause

When a task times out in _run_task():

  1. future.result(timeout=N) correctly raises TimeoutError after N seconds
  2. The exception is caught and handled properly
  3. BUT when the with ThreadPoolExecutor(...) block exits, __exit__ calls executor.shutdown(wait=True) by default
  4. shutdown(wait=True) blocks the calling thread until all worker threads complete
  5. If the timed-out task continues running (e.g., iterating over thousands of S3 objects), the TaskRunner thread blocks indefinitely
  6. The blocked TaskRunner thread appears "dead" - it can't process new tasks or update health checks

Over time, all 12 TaskRunner threads would get stuck, causing task processing to halt completely.
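The blocking described above can be reproduced in isolation (a minimal sketch with durations shortened for illustration; the task function is a stand-in, not the real task code):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def overrunning_task():
    # Stand-in for a task that keeps running past its timeout,
    # e.g. iterating over thousands of S3 objects.
    time.sleep(3)

start = time.monotonic()
try:
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(overrunning_task)
        future.result(timeout=0.5)  # raises TimeoutError after 0.5s
except FutureTimeout:
    # Reached only AFTER __exit__ finished shutdown(wait=True),
    # i.e. after the full 3-second task completed.
    pass
blocked_for = time.monotonic() - start
print(f"blocked for {blocked_for:.1f}s")  # ~3s, not ~0.5s
```

Even though the timeout fires at 0.5s, the calling thread does not regain control until the worker finishes, which is exactly what starved the TaskRunner threads.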

Solution

Changed from using the ThreadPoolExecutor context manager to explicit executor management with shutdown(wait=False):

Before:

try:
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(task.run)
        future.result(timeout=timeout)
    # __exit__ calls shutdown(wait=True) - BLOCKS HERE!
except TimeoutError:
    # Handle timeout...

After:

executor = None
try:
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task.run)
    future.result(timeout=timeout)
except TimeoutError:
    # Handle timeout...
finally:
    if executor is not None:
        executor.shutdown(wait=False)  # Don't wait - return immediately

With shutdown(wait=False), the TaskRunner thread returns immediately after the timeout, allowing it to continue processing other tasks.
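Put together, the fixed pattern looks roughly like this (timings illustrative; note the timed-out task still runs to completion on the worker thread in the background, it just no longer blocks the caller):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def overrunning_task():
    time.sleep(3)  # stand-in for a task that overruns its timeout

start = time.monotonic()
executor = None
try:
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(overrunning_task)
    future.result(timeout=0.5)
except FutureTimeout:
    pass  # handle timeout as before
finally:
    if executor is not None:
        executor.shutdown(wait=False)  # returns immediately
elapsed = time.monotonic() - start
print(f"returned after {elapsed:.1f}s")  # ~0.5s, not ~3s
```

One trade-off worth noting: shutdown(wait=False) only stops waiting; it does not cancel or kill the running task, so the orphaned worker thread holds its resources until the task finishes on its own.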

Changes

  • src/task_processor/processor.py: Replaced with ThreadPoolExecutor(...) with explicit executor management and added finally block calling shutdown(wait=False)
  • tests/unit/task_processor/test_unit_task_processor_processor.py: Added test_run_task_does_not_block_on_timeout() to verify TaskRunner threads return quickly (< 2s) even when tasks would run for 10+ seconds

Why Existing Tests Didn't Catch This

Existing timeout tests use tasks that sleep for 1 second. While they time out correctly, the resulting 1-second block is imperceptible. The bug only becomes apparent with longer-running tasks (10+ seconds), as seen in production with S3 streaming tasks.
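The added regression test presumably follows this shape (a hedged sketch only: the real test exercises the actual TaskRunner code path, and the sleep here is shortened from the 10+ seconds mentioned above to keep the example quick):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def test_run_task_does_not_block_on_timeout():
    def slow_task():
        # Shortened stand-in; the real test uses a 10+ second task
        time.sleep(3)

    start = time.monotonic()
    executor = None
    try:
        executor = ThreadPoolExecutor(max_workers=1)
        executor.submit(slow_task).result(timeout=0.5)
    except FutureTimeout:
        pass
    finally:
        if executor is not None:
            executor.shutdown(wait=False)
    # The caller must return well before slow_task finishes
    assert time.monotonic() - start < 2
```

Run directly or via pytest, this should pass in well under 2 seconds; with the old shutdown(wait=True) behavior the same assertion would fail.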

@gagantrivedi gagantrivedi requested a review from a team as a code owner October 24, 2025 08:21
@gagantrivedi gagantrivedi requested review from khvn26 and removed request for a team October 24, 2025 08:21
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.58%. Comparing base (4a42ef3) to head (318eb32).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
+ Coverage   94.53%   94.58%   +0.04%     
==========================================
  Files          75       75              
  Lines        2417     2437      +20     
==========================================
+ Hits         2285     2305      +20     
  Misses        132      132              


@khvn26 khvn26 left a comment

LGTM 👍

@gagantrivedi gagantrivedi merged commit c0b8dde into main Oct 24, 2025
5 checks passed
@gagantrivedi gagantrivedi deleted the fix/threadpool-executor-blocking branch October 31, 2025 06:49