
[BUG] Task remains assigned on engine after worker heartbeat timeout interrupts execution loop #2432

@matheusgr

Description


Describe the issue

The Hatchet engine assigns a task to a worker that then misses a heartbeat due to DEADLINE_EXCEEDED.

Hatchet believes the worker is running this task, but the worker itself never acknowledges the task as executing.

Environment

  • SDK: python 1.17.1
  • Engine: Cloud

Expected behavior

The task should either be reassigned or fail as soon as the heartbeat fails, instead of remaining assigned until the execution timeout.

Code to Reproduce, Logs, or Screenshots

Relevant workflow description on hatchet cloud:

2025-10-16T16:56:23.218 - Task queued
2025-10-16T16:56:23.230 - Assigned to worker
2025-10-16T17:11:23.711 - Execution timed out (Task exceeded timeout of 15m)

Relevant log on worker (chronological):

2025-10-16T16:56:23.195498103Z - [INFO] 🪓 -- 2025-10-16 16:56:23,195 - finished step run: XXXXXXXXXXX

2025-10-16T16:56:30.206659795Z - [ERROR] 🪓 -- 2025-10-16 16:56:30,205 - failed to send heartbeat
Traceback (most recent call last):
  File "/app/lib/python3.11/site-packages/hatchet_sdk/clients/dispatcher/action_listener.py", line 110, in heartbeat
    await self.aio_client.Heartbeat(
  File "/app/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-10-16T16:56:30.205401749+00:00"}"
>

2025-10-16T16:56:30.206699778Z - [WARNING] 🪓 -- 2025-10-16 16:56:30,206 - interrupted read_with_interrupt task of action listener

Additional context

It seems that the assigned action's execution path never completes because the heartbeat failure triggers an interrupt that prematurely cancels the action listener loop.

When a DEADLINE_EXCEEDED occurs during the Heartbeat() RPC:

  1. The heartbeat() coroutine catches the error and calls self.interrupt.set().
  2. The main _generator() loop (which is awaiting self.interrupt.wait()) wakes up.
  3. It sees that its read_with_interrupt task isn’t done yet, so it:
    • Cancels the read task (t.cancel()),
    • Cancels the gRPC listener (listener.cancel()),
    • Breaks out of the loop.
  4. This stops the action listener coroutine entirely, so the currently assigned action never finishes execution or reports completion.
  5. Meanwhile, the server still thinks the worker is alive and running the action, since no error or unsubscribe message is sent.
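The sequence above can be reproduced with a minimal asyncio sketch. The names (heartbeat, generator, read task, interrupt) mirror the rough shape of the SDK's action listener but are illustrative, not the actual hatchet_sdk code:

```python
import asyncio

async def main() -> list[str]:
    interrupt = asyncio.Event()
    events: list[str] = []

    async def heartbeat() -> None:
        # Step 1: the Heartbeat() RPC fails (DEADLINE_EXCEEDED) and the
        # error handler sets the interrupt instead of retrying.
        await asyncio.sleep(0.01)  # stand-in for the failing RPC
        events.append("heartbeat failed")
        interrupt.set()

    async def read_actions() -> None:
        # Stand-in for the gRPC listener read; no action arrives here.
        await asyncio.sleep(10)

    async def generator() -> None:
        read_task = asyncio.create_task(read_actions())
        # Step 2: the main loop wakes up on the interrupt.
        await interrupt.wait()
        if not read_task.done():
            # Steps 3-4: the in-flight read is cancelled and the loop
            # exits, so no completion or unsubscribe message is ever
            # sent (step 5) and the engine still sees the task assigned.
            read_task.cancel()
            events.append("listener stopped")

    await asyncio.gather(heartbeat(), generator())
    return events

events = asyncio.run(main())
print(events)  # → ['heartbeat failed', 'listener stopped']
```

A single failed heartbeat is enough to stop the listener here, which matches the observed behavior: the worker goes quiet while the engine keeps the task assigned.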

Maybe Hatchet should avoid interrupting the listener on transient heartbeat failures, for example by retrying the heartbeat before tearing down the listener.
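One hedged sketch of that mitigation: treat DEADLINE_EXCEEDED as transient and retry the heartbeat with backoff, only setting the interrupt after a retry budget is exhausted. `heartbeat_with_retry`, `TransientRpcError`, and the retry budget are all assumptions for illustration, not the actual SDK API:

```python
import asyncio

class TransientRpcError(Exception):
    """Stands in for a transient gRPC error such as DEADLINE_EXCEEDED."""

async def heartbeat_with_retry(
    send_heartbeat,            # coroutine function performing the RPC
    interrupt: asyncio.Event,
    max_failures: int = 3,
    base_delay: float = 0.05,
) -> bool:
    """Return True once a heartbeat succeeds; set interrupt on give-up."""
    for attempt in range(max_failures):
        try:
            await send_heartbeat()
            return True
        except TransientRpcError:
            # Back off and retry instead of interrupting the listener
            # on the first failure.
            await asyncio.sleep(base_delay * 2 ** attempt)
    interrupt.set()  # tear down the listener only after repeated failures
    return False

async def demo() -> bool:
    interrupt = asyncio.Event()
    calls = {"n": 0}

    async def flaky_heartbeat() -> None:
        calls["n"] += 1
        if calls["n"] == 1:        # first attempt times out...
            raise TransientRpcError
        # ...second attempt succeeds, so the listener keeps running

    return await heartbeat_with_retry(flaky_heartbeat, interrupt)

result = asyncio.run(demo())
print(result)  # → True
```

With this shape, a single DEADLINE_EXCEEDED would not cancel the read task, and a worker that is genuinely gone still gets interrupted once the budget runs out.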
