Describe the issue
The Hatchet engine assigns a task to a worker that misses a heartbeat due to DEADLINE_EXCEEDED.
Hatchet then believes that worker is running the task, but the worker itself never picks up or acknowledges the task as executing.
Environment
- SDK: python 1.17.1
- Engine: Cloud
Expected behavior
The task should either be reassigned or fail when the heartbeat fails, rather than staying assigned to a worker that never received it.
Code to Reproduce, Logs, or Screenshots
Relevant workflow timeline on Hatchet Cloud:
2025-10-16T16:56:23.218 - Task queued
2025-10-16T16:56:23.230 - Assigned to worker
2025-10-16T17:11:23.711 - Execution timed out (Task exceeded timeout of 15m)
Relevant log on the worker:
2025-10-16T16:56:23.195498103Z - [INFO] 🪓 -- 2025-10-16 16:56:23,195 - finished step run: XXXXXXXXXXX
2025-10-16T16:56:30.206659795Z - [ERROR] 🪓 -- 2025-10-16 16:56:30,205 - failed to send heartbeat
Traceback (most recent call last):
  File "/app/lib/python3.11/site-packages/hatchet_sdk/clients/dispatcher/action_listener.py", line 110, in heartbeat
    await self.aio_client.Heartbeat(
  File "/app/lib/python3.11/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-10-16T16:56:30.205401749+00:00"}"
2025-10-16T16:56:30.206699778Z - [WARNING] 🪓 -- 2025-10-16 16:56:30,206 - interrupted read_with_interrupt task of action listener
Additional context
It seems that the assigned action’s execution path is never completed because the heartbeat failure triggers an interrupt that prematurely cancels the action listener loop.
When a DEADLINE_EXCEEDED occurs during the `Heartbeat()` RPC (a minimal reproduction of this control flow is sketched after this list):
- The `heartbeat()` coroutine catches the error and calls `self.interrupt.set()`.
- The main `_generator()` loop (which is awaiting `self.interrupt.wait()`) wakes up.
- It sees that its `read_with_interrupt` task isn't done yet, so it:
  - cancels the read task (`t.cancel()`),
  - cancels the gRPC listener (`listener.cancel()`),
  - breaks out of the loop.
- This stops the action listener coroutine entirely, so the currently assigned action never finishes execution or reports completion.
- Meanwhile, the server still thinks the worker is alive and running the action, since no error or unsubscribe message is sent.
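A minimal, self-contained sketch of that control flow. This is not the SDK source: the `ListenerSketch` class, the `asyncio.Queue` standing in for the gRPC action stream, and the timings are all illustrative assumptions; only the method names mirror those in the report.

```python
import asyncio


class ListenerSketch:
    def __init__(self) -> None:
        self.interrupt = asyncio.Event()
        # Stand-in for the gRPC action stream from the dispatcher.
        self.stream: asyncio.Queue[str] = asyncio.Queue()

    async def heartbeat(self) -> None:
        # Simulate a single heartbeat RPC failing with DEADLINE_EXCEEDED;
        # the failure is escalated straight to the interrupt event.
        await asyncio.sleep(0.1)
        print("heartbeat failed (DEADLINE_EXCEEDED) -> setting interrupt")
        self.interrupt.set()

    async def read_with_interrupt(self) -> str:
        # Stand-in for reading the next assigned action off the stream.
        return await self.stream.get()

    async def _generator(self) -> None:
        while not self.interrupt.is_set():
            read_task = asyncio.create_task(self.read_with_interrupt())
            await self.interrupt.wait()
            if not read_task.done():
                # The interrupt fires while the read is still pending: the
                # read task is cancelled and the loop exits, so an action the
                # server already considers assigned is never received or run.
                read_task.cancel()
                break
        print("listener stopped; the assigned action never executes")


async def main() -> None:
    listener = ListenerSketch()
    await asyncio.gather(listener.heartbeat(), listener._generator())


asyncio.run(main())
```

Running this prints the two messages and exits with the read still unserved, which matches the observed behaviour: the engine's assignment is never surfaced to user code, while the server sees no failure from the worker.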
Maybe Hatchet should avoid interrupting the listener on transient heartbeat failures; one possible shape for that is sketched below.
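Purely as an illustration of the idea (the `send_heartbeat` callable, the retry budget, and the interval are assumptions, not the actual SDK API): tolerate a bounded number of consecutive DEADLINE_EXCEEDED failures before escalating to the interrupt.

```python
import asyncio
from collections.abc import Awaitable, Callable

import grpc
from grpc import aio as grpc_aio

# Both thresholds are illustrative assumptions, not SDK defaults.
MAX_CONSECUTIVE_FAILURES = 3
HEARTBEAT_INTERVAL_SECONDS = 4.0


async def heartbeat_loop(
    send_heartbeat: Callable[[], Awaitable[None]],
    interrupt: asyncio.Event,
) -> None:
    """Keep heartbeating; only interrupt the listener after repeated failures.

    send_heartbeat is a hypothetical coroutine wrapping the Heartbeat() RPC.
    """
    failures = 0
    while not interrupt.is_set():
        try:
            await send_heartbeat()
            failures = 0  # a success resets the failure budget
        except grpc_aio.AioRpcError as e:
            if e.code() != grpc.StatusCode.DEADLINE_EXCEEDED:
                raise  # non-transient errors keep the current behaviour
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                # Escalate only once the failure looks persistent rather
                # than a single slow RPC.
                interrupt.set()
                return
        await asyncio.sleep(HEARTBEAT_INTERVAL_SECONDS)
```

An alternative with the same effect would be to keep the interrupt but have the listener loop resubscribe instead of breaking, so a single slow heartbeat RPC does not strand an already-assigned action.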