
[BUG] Task remains assigned on engine after worker heartbeat timeout interrupts execution loop #2432

@matheusgr

Description


Describe the issue

The Hatchet engine assigns a task to a worker that then misses a heartbeat due to DEADLINE_EXCEEDED.

Hatchet believes the worker is running this task, but the worker itself never acknowledges the task as executing.

Environment

  • SDK: python 1.17.1
  • Engine: Cloud

Expected behavior

The task should either be reassigned or fail as soon as the heartbeat fails, instead of remaining assigned until the execution timeout.

Code to Reproduce, Logs, or Screenshots

Relevant workflow description on hatchet cloud:

2025-10-16T16:56:23.218 - Task queued
2025-10-16T16:56:23.230 - Assigned to worker
2025-10-16T17:11:23.711 - Execution timed out (Task exceeded timeout of 15m)

Relevant log on worker (chronological):

2025-10-16T16:56:23.195498103Z - [INFO] 🪓 -- 2025-10-16 16:56:23,195 - finished step run: XXXXXXXXXXX

2025-10-16T16:56:30.206659795Z - [ERROR] 🪓 -- 2025-10-16 16:56:30,205 - failed to send heartbeat
Traceback (most recent call last):
  File "/app/lib/python3.11/site-packages/hatchet_sdk/clients/dispatcher/action_listener.py", line 110, in heartbeat
    await self.aio_client.Heartbeat(
  File "/app/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-10-16T16:56:30.205401749+00:00"}"
>

2025-10-16T16:56:30.206699778Z - [WARNING] 🪓 -- 2025-10-16 16:56:30,206 - interrupted read_with_interrupt task of action listener

Additional context

It seems that the assigned action's execution path never completes because the heartbeat failure triggers an interrupt that prematurely cancels the action listener loop.

When a DEADLINE_EXCEEDED occurs during the Heartbeat() RPC:

  1. The heartbeat() coroutine catches the error and calls self.interrupt.set().
  2. The main _generator() loop (which is awaiting self.interrupt.wait()) wakes up.
  3. It sees that its read_with_interrupt task isn’t done yet, so it:
    • Cancels the read task (t.cancel()),
    • Cancels the gRPC listener (listener.cancel()),
    • Breaks out of the loop.
  4. This stops the action listener coroutine entirely, so the currently assigned action never finishes execution or reports completion.
  5. Meanwhile, the server still thinks the worker is alive and running the action, since no error or unsubscribe message is sent.
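The sequence above can be reproduced with a minimal asyncio sketch. The names (heartbeat, generator, read task, interrupt) mirror the rough shape of the SDK's action listener but are illustrative, not the actual hatchet_sdk code:

```python
import asyncio

async def main() -> list[str]:
    interrupt = asyncio.Event()
    events: list[str] = []

    async def heartbeat() -> None:
        # Step 1: the Heartbeat() RPC fails (DEADLINE_EXCEEDED) and the
        # error handler sets the interrupt instead of retrying.
        await asyncio.sleep(0.01)  # stand-in for the failing RPC
        events.append("heartbeat failed")
        interrupt.set()

    async def read_actions() -> None:
        # Stand-in for the gRPC listener read; no action arrives here.
        await asyncio.sleep(10)

    async def generator() -> None:
        read_task = asyncio.create_task(read_actions())
        # Step 2: the main loop wakes up on the interrupt.
        await interrupt.wait()
        if not read_task.done():
            # Steps 3-4: the in-flight read is cancelled and the loop
            # exits, so no completion or unsubscribe message is ever
            # sent (step 5) and the engine still sees the task assigned.
            read_task.cancel()
            events.append("listener stopped")

    await asyncio.gather(heartbeat(), generator())
    return events

events = asyncio.run(main())
print(events)  # → ['heartbeat failed', 'listener stopped']
```

A single failed heartbeat is enough to stop the listener here, which matches the observed behavior: the worker goes quiet while the engine keeps the task assigned.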

Maybe Hatchet should avoid interrupting the listener on transient heartbeat failures, for example by retrying the heartbeat before tearing down the listener.
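One hedged sketch of that mitigation: treat DEADLINE_EXCEEDED as transient and retry the heartbeat with backoff, only setting the interrupt after a retry budget is exhausted. `heartbeat_with_retry`, `TransientRpcError`, and the retry budget are all assumptions for illustration, not the actual SDK API:

```python
import asyncio

class TransientRpcError(Exception):
    """Stands in for a transient gRPC error such as DEADLINE_EXCEEDED."""

async def heartbeat_with_retry(
    send_heartbeat,            # coroutine function performing the RPC
    interrupt: asyncio.Event,
    max_failures: int = 3,
    base_delay: float = 0.05,
) -> bool:
    """Return True once a heartbeat succeeds; set interrupt on give-up."""
    for attempt in range(max_failures):
        try:
            await send_heartbeat()
            return True
        except TransientRpcError:
            # Back off and retry instead of interrupting the listener
            # on the first failure.
            await asyncio.sleep(base_delay * 2 ** attempt)
    interrupt.set()  # tear down the listener only after repeated failures
    return False

async def demo() -> bool:
    interrupt = asyncio.Event()
    calls = {"n": 0}

    async def flaky_heartbeat() -> None:
        calls["n"] += 1
        if calls["n"] == 1:        # first attempt times out...
            raise TransientRpcError
        # ...second attempt succeeds, so the listener keeps running

    return await heartbeat_with_retry(flaky_heartbeat, interrupt)

result = asyncio.run(demo())
print(result)  # → True
```

With this shape, a single DEADLINE_EXCEEDED would not cancel the read task, and a worker that is genuinely gone still gets interrupted once the budget runs out.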
