-
Notifications
You must be signed in to change notification settings - Fork 119
Activity not cancelled after start_to_close_timeout when heartbeat_timeout is shorter #1188
Description
Bug Description
When an activity has both startToCloseTimeout and heartbeatTimeout set, and heartbeatTimeout < startToCloseTimeout, the Core SDK's local timeout timer never fires because successful heartbeats keep resetting it. The activity runs indefinitely on the worker even after the server has timed it out.
This affects all SDKs built on the Core SDK (Python, TypeScript). The Go SDK is not affected because it uses context.WithDeadline based on startToCloseTimeout directly.
Reproduction
Setup: Activity that polls in a loop, heartbeats every 2 seconds, never completes.
| Setting | Value |
|---|---|
startToCloseTimeout |
10s |
heartbeatTimeout |
5s |
retryPolicy.maximumAttempts |
1 |
Expected: Activity receives cancellation ~10s after start (when startToCloseTimeout fires).
Actual: Activity runs forever. No cancellation delivered. Heartbeats succeed with no errors even after the server has timed out the activity.
Tested on:
- Python SDK v1.23.0 with
temporalio.testing.WorkflowEnvironment.start_local()(Temporal Server v1.30.2) — not cancelled - TypeScript SDK (latest) with
TestWorkflowEnvironment.createLocal()(Temporal Server v1.30.2) — not cancelled - Go SDK v1.41.1 with real local server — properly cancelled via
context.WithDeadline
Python Reproduction
import asyncio
import time
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker
activity_cancelled = False
@activity.defn
async def stuck_activity() -> str:
global activity_cancelled
# Background heartbeat
async def hb_loop():
while True:
try:
activity.heartbeat()
except Exception:
pass
await asyncio.sleep(2)
hb_task = asyncio.create_task(hb_loop())
try:
while True:
await asyncio.sleep(1)
except asyncio.CancelledError:
activity_cancelled = True
raise
finally:
hb_task.cancel()
@workflow.defn
class TestWf:
@workflow.run
async def run(self) -> str:
try:
return await workflow.execute_activity(
stuck_activity,
start_to_close_timeout=timedelta(seconds=10),
heartbeat_timeout=timedelta(seconds=5),
retry_policy=RetryPolicy(maximum_attempts=1),
)
except Exception as e:
return f"failed: {e}"
async def main():
async with await WorkflowEnvironment.start_local() as env:
async with Worker(env.client, task_queue="q", workflows=[TestWf], activities=[stuck_activity]):
result = await env.client.execute_workflow(TestWf.run, id="test", task_queue="q", execution_timeout=timedelta(seconds=30))
print(f"Result: {result}")
print(f"Activity cancelled: {activity_cancelled}")
# Prints: activity_cancelled = False
asyncio.run(main())TypeScript Reproduction
// Same pattern — activity polls with setInterval heartbeat, never completes.
// startToCloseTimeout=10s, heartbeatTimeout=5s, maximumAttempts=1.
// Result: activity NOT cancelled after timeout. Only cancelled on worker shutdown.Go SDK (NOT affected)
// Same pattern — activity heartbeats in a loop.
// startToCloseTimeout=10s, heartbeatTimeout=5s, maximumAttempts=1.
// Result: ctx.Done() fires after exactly 10s with "context deadline exceeded".
// Go SDK uses context.WithDeadline — works correctly.Root Cause
In crates/sdk-core/src/worker/activities.rs (lines ~585-638), the local timeout implementation picks only the minimum of heartbeat_timeout and start_to_close_timeout:
let timeout_at = [
(HEARTBEAT_TYPE, task.resp.heartbeat_timeout),
("start_to_close", task.resp.start_to_close_timeout),
]
.into_iter()
.filter_map(...)
.min_by(|(_, d1), (_, d2)| d1.cmp(d2));When heartbeat_timeout (5s) < start_to_close_timeout (10s):
- Only the heartbeat timer is created (type =
HEARTBEAT_TYPE) - A
timeout_resetter(Notify) is created for the heartbeat timer - Every successful heartbeat calls
resetter.notify_one(), resetting the timer - The
start_to_close_timeouttimer is never created - As long as heartbeats succeed, the activity runs forever
What the Go SDK Does Differently
The Go SDK (internal/internal_task_handlers.go) sets a hard context deadline:
ctx, dlCancelFunc := context.WithDeadline(ctx, info.deadline)Where info.deadline is computed from startToCloseTimeout (not heartbeat_timeout). This fires regardless of heartbeating.
Suggested Fix
Create both timers when both timeouts are set:
- A resettable heartbeat timer that resets on each heartbeat
- A non-resettable
start_to_close_timeouttimer that always fires
This was explicitly requested by @cretz in the PR #620 review:
"Ideally we can have a start to close timer that will fail task AND a heartbeat timer that will fail the task, the latter simply getting reset every heartbeat call."
Production Impact
This caused a production incident where activities (with startToCloseTimeout=15min, heartbeatTimeout=3min) ran for 5+ hours as zombies on a worker after the server timed them out. Each zombie held a worker task slot, eventually saturating all 20 slots and rendering the worker unable to process any new activities.
Related
- Local timeouts for remote activities #620 — PR that added local timeout enforcement (this bug is in its implementation)
- [Feature Request] Auto cancel activities after they timeout #490 — Original feature request for local timeout enforcement
- [Feature Request] Worker enforcement of activity timeouts features#170 — Feature spec
- [Bug] No clean way to cancel an activity and wait until it's cancelled sdk-python#700 — Related cancellation UX gap