Skip to content

Activity not cancelled after start_to_close_timeout when heartbeat_timeout is shorter #1188

@zeev-finaloop

Description

@zeev-finaloop

Bug Description

When an activity has both startToCloseTimeout and heartbeatTimeout set, and heartbeatTimeout < startToCloseTimeout, the Core SDK's local timeout timer never fires because successful heartbeats keep resetting it. The activity runs indefinitely on the worker even after the server has timed it out.

This affects all SDKs built on the Core SDK (Python, TypeScript). The Go SDK is not affected because it uses context.WithDeadline based on startToCloseTimeout directly.

Reproduction

Setup: Activity that polls in a loop, heartbeats every 2 seconds, never completes.

Setting Value
startToCloseTimeout 10s
heartbeatTimeout 5s
retryPolicy.maximumAttempts 1

Expected: Activity receives cancellation ~10s after start (when startToCloseTimeout fires).

Actual: Activity runs forever. No cancellation delivered. Heartbeats succeed with no errors even after the server has timed out the activity.

Tested on:

  • Python SDK v1.23.0 with temporalio.testing.WorkflowEnvironment.start_local() (Temporal Server v1.30.2) — not cancelled
  • TypeScript SDK (latest) with TestWorkflowEnvironment.createLocal() (Temporal Server v1.30.2) — not cancelled
  • Go SDK v1.41.1 with real local server — properly cancelled via context.WithDeadline

Python Reproduction

import asyncio
import time
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

activity_cancelled = False

@activity.defn
async def stuck_activity() -> str:
    global activity_cancelled
    # Background heartbeat
    async def hb_loop():
        while True:
            try:
                activity.heartbeat()
            except Exception:
                pass
            await asyncio.sleep(2)
    hb_task = asyncio.create_task(hb_loop())
    try:
        while True:
            await asyncio.sleep(1)
    except asyncio.CancelledError:
        activity_cancelled = True
        raise
    finally:
        hb_task.cancel()

@workflow.defn
class TestWf:
    @workflow.run
    async def run(self) -> str:
        try:
            return await workflow.execute_activity(
                stuck_activity,
                start_to_close_timeout=timedelta(seconds=10),
                heartbeat_timeout=timedelta(seconds=5),
                retry_policy=RetryPolicy(maximum_attempts=1),
            )
        except Exception as e:
            return f"failed: {e}"

async def main():
    async with await WorkflowEnvironment.start_local() as env:
        async with Worker(env.client, task_queue="q", workflows=[TestWf], activities=[stuck_activity]):
            result = await env.client.execute_workflow(TestWf.run, id="test", task_queue="q", execution_timeout=timedelta(seconds=30))
            print(f"Result: {result}")
            print(f"Activity cancelled: {activity_cancelled}")
            # Prints: activity_cancelled = False

asyncio.run(main())

TypeScript Reproduction

// Same pattern — activity polls with setInterval heartbeat, never completes.
// startToCloseTimeout=10s, heartbeatTimeout=5s, maximumAttempts=1.
// Result: activity NOT cancelled after timeout. Only cancelled on worker shutdown.

Go SDK (NOT affected)

// Same pattern — activity heartbeats in a loop.
// startToCloseTimeout=10s, heartbeatTimeout=5s, maximumAttempts=1.
// Result: ctx.Done() fires after exactly 10s with "context deadline exceeded".
// Go SDK uses context.WithDeadline — works correctly.

Root Cause

In crates/sdk-core/src/worker/activities.rs (lines ~585-638), the local timeout implementation picks only the minimum of heartbeat_timeout and start_to_close_timeout:

let timeout_at = [
    (HEARTBEAT_TYPE, task.resp.heartbeat_timeout),
    ("start_to_close", task.resp.start_to_close_timeout),
]
.into_iter()
.filter_map(...)
.min_by(|(_, d1), (_, d2)| d1.cmp(d2));

When heartbeat_timeout (5s) < start_to_close_timeout (10s):

  1. Only the heartbeat timer is created (type = HEARTBEAT_TYPE)
  2. A timeout_resetter (Notify) is created for the heartbeat timer
  3. Every successful heartbeat calls resetter.notify_one(), resetting the timer
  4. The start_to_close_timeout timer is never created
  5. As long as heartbeats succeed, the activity runs forever

What the Go SDK Does Differently

The Go SDK (internal/internal_task_handlers.go) sets a hard context deadline:

ctx, dlCancelFunc := context.WithDeadline(ctx, info.deadline)

Where info.deadline is computed from startToCloseTimeout (not heartbeat_timeout). This fires regardless of heartbeating.

Suggested Fix

Create both timers when both timeouts are set:

  1. A resettable heartbeat timer that resets on each heartbeat
  2. A non-resettable start_to_close_timeout timer that always fires

This was explicitly requested by @cretz in the PR #620 review:

"Ideally we can have a start to close timer that will fail task AND a heartbeat timer that will fail the task, the latter simply getting reset every heartbeat call."

Production Impact

This caused a production incident where activities (with startToCloseTimeout=15min, heartbeatTimeout=3min) ran for 5+ hours as zombies on a worker after the server timed them out. Each zombie held a worker task slot, eventually saturating all 20 slots and rendering the worker unable to process any new activities.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions