@gagantrivedi gagantrivedi commented Oct 24, 2025

Problem

TaskRunner threads were appearing "dead" and unable to process new tasks after some tasks timed out. This occurred because the ThreadPoolExecutor's context manager was blocking indefinitely when tasks exceeded their
timeout but continued running.

Root Cause

When a task times out in _run_task():

  1. future.result(timeout=N) correctly raises TimeoutError after N seconds
  2. The exception is caught and handled properly
  3. BUT when the with ThreadPoolExecutor(...) block exits, __exit__ calls executor.shutdown(wait=True) by default
  4. shutdown(wait=True) blocks the calling thread until all worker threads complete
  5. If the timed-out task continues running (e.g., iterating over thousands of S3 objects), the TaskRunner thread blocks indefinitely
  6. The blocked TaskRunner thread appears "dead" - it can't process new tasks or update health checks

Over time, all 12 TaskRunner threads would get stuck, causing task processing to halt completely.
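The blocking described above can be reproduced in isolation (a minimal sketch with durations shortened for illustration; the task function is a stand-in, not the real task code):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def overrunning_task():
    # Stand-in for a task that keeps running past its timeout,
    # e.g. iterating over thousands of S3 objects.
    time.sleep(3)

start = time.monotonic()
try:
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(overrunning_task)
        future.result(timeout=0.5)  # raises TimeoutError after 0.5s
except FutureTimeout:
    # Reached only AFTER __exit__ finished shutdown(wait=True),
    # i.e. after the full 3-second task completed.
    pass
blocked_for = time.monotonic() - start
print(f"blocked for {blocked_for:.1f}s")  # ~3s, not ~0.5s
```

Even though the timeout fires at 0.5s, the calling thread does not regain control until the worker finishes, which is exactly what starved the TaskRunner threads.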

Solution

Changed from using the ThreadPoolExecutor context manager to explicit executor management with shutdown(wait=False):

Before:

try:
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(task.run)
        future.result(timeout=timeout)
    # __exit__ calls shutdown(wait=True) - BLOCKS HERE!
except TimeoutError:
    # Handle timeout...

After:

executor = None
try:
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(task.run)
    future.result(timeout=timeout)
except TimeoutError:
    # Handle timeout...
finally:
    if executor is not None:
        executor.shutdown(wait=False)  # Don't wait - return immediately

With shutdown(wait=False), the TaskRunner thread returns immediately after the timeout, allowing it to continue processing other tasks.
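Put together, the fixed pattern looks roughly like this (timings illustrative; note the timed-out task still runs to completion on the worker thread in the background, it just no longer blocks the caller):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def overrunning_task():
    time.sleep(3)  # stand-in for a task that overruns its timeout

start = time.monotonic()
executor = None
try:
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(overrunning_task)
    future.result(timeout=0.5)
except FutureTimeout:
    pass  # handle timeout as before
finally:
    if executor is not None:
        executor.shutdown(wait=False)  # returns immediately
elapsed = time.monotonic() - start
print(f"returned after {elapsed:.1f}s")  # ~0.5s, not ~3s
```

One trade-off worth noting: shutdown(wait=False) only stops waiting; it does not cancel or kill the running task, so the orphaned worker thread holds its resources until the task finishes on its own.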

Changes

  • src/task_processor/processor.py: Replaced with ThreadPoolExecutor(...) with explicit executor management and added finally block calling shutdown(wait=False)
  • tests/unit/task_processor/test_unit_task_processor_processor.py: Added test_run_task_does_not_block_on_timeout() to verify TaskRunner threads return quickly (< 2s) even when tasks would run for 10+ seconds

Why Existing Tests Didn't Catch This

Existing timeout tests use tasks that sleep for 1 second. While they time out correctly, the resulting 1-second block is imperceptible. The bug only becomes apparent with longer-running tasks (10+ seconds), as seen in production with S3 streaming tasks.
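The added regression test presumably follows this shape (a hedged sketch only: the real test exercises the actual TaskRunner code path, and the sleep here is shortened from the 10+ seconds mentioned above to keep the example quick):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def test_run_task_does_not_block_on_timeout():
    def slow_task():
        # Shortened stand-in; the real test uses a 10+ second task
        time.sleep(3)

    start = time.monotonic()
    executor = None
    try:
        executor = ThreadPoolExecutor(max_workers=1)
        executor.submit(slow_task).result(timeout=0.5)
    except FutureTimeout:
        pass
    finally:
        if executor is not None:
            executor.shutdown(wait=False)
    # The caller must return well before slow_task finishes
    assert time.monotonic() - start < 2
```

Run directly or via pytest, this should pass in well under 2 seconds; with the old shutdown(wait=True) behavior the same assertion would fail.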

@gagantrivedi gagantrivedi requested a review from a team as a code owner October 24, 2025 08:21
@gagantrivedi gagantrivedi requested review from khvn26 and removed request for a team October 24, 2025 08:21
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.58%. Comparing base (4a42ef3) to head (318eb32).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #105      +/-   ##
==========================================
+ Coverage   94.53%   94.58%   +0.04%     
==========================================
  Files          75       75              
  Lines        2417     2437      +20     
==========================================
+ Hits         2285     2305      +20     
  Misses        132      132              


@khvn26 khvn26 left a comment

LGTM 👍

@gagantrivedi gagantrivedi merged commit c0b8dde into main Oct 24, 2025
5 checks passed
@gagantrivedi gagantrivedi deleted the fix/threadpool-executor-blocking branch October 31, 2025 06:49