ChildWorkflowFuture.Get hangs forever (and leaks a goroutine) if parent workflow times out #1407

Open
@petergardfjall

Describe the bug
If a workflow times out while calling Get on a child workflow future (as illustrated below), the Get call never returns and a goroutine is leaked.

func MyWorkflow(ctx workflow.Context) error {
    // .. snip
    future := workflow.ExecuteChildWorkflow(childCtx, childWorkflow)
    future.Get(childCtx, nil) // <- hangs forever if MyWorkflow exceeds START_TO_CLOSE timeout.
    // .. snip
}

Notably, all other involved "entities" behave the way (I think) they should:

  • The caller of the parent workflow (MyWorkflow above) correctly sees a TimeoutType: START_TO_CLOSE.
  • The child workflow (childWorkflow above) correctly sees a CanceledError.
  • Any activities started by the child workflow correctly see a context canceled.

Steps to reproduce the behavior:
The attached sample application (sample.tar.gz) reproduces and illustrates the issue. Assuming that a Cadence cluster is serving a cadence-frontend on localhost:7833, the following steps can be used to reproduce it:

  1. Start a workflow worker:
    go run ./worker/
  2. Execute the parent workflow with a 60s start-to-close timeout:
    go run ./client/

After one minute the client will fail (as expected):

{"level":"info","time":"2024-12-03T09:26:42+01:00","message":"starting workflow ..."}
{"level":"info","time":"2024-12-03T09:26:42+01:00","message":"awaiting workflow 1733214402 ..."}
{"level":"error","error":"TimeoutType: START_TO_CLOSE","time":"2024-12-03T09:27:42+01:00","message":"parent workflow execution failed: TimeoutType: START_TO_CLOSE (*internal.TimeoutError)\n&internal.TimeoutError{timeoutType:0, details:internal.ErrorDetailsValues(nil)}"}

Looking at the worker output, we will see that the child workflow failed (as expected):

{"level":"error","error":"CanceledError","time":"2024-12-03T09:27:42+01:00","message":"ChildWorkflow.Run: doActivity failed: CanceledError (*internal.CanceledError)\n&internal.CanceledError{details:internal.ErrorDetailsValues(nil)}"}

Similarly, we will see that the activity's context was canceled (as expected):

{"level":"error","error":"context canceled","time":"2024-12-03T09:27:56+01:00","message":"ChildWorkflow.doActivity: context done: context canceled (*errors.errorString)"}

But, and here is the big BUT, looking at the debug output from pprof on http://localhost:6060/debug/pprof/goroutine?debug=1 we will see that the parent workflow still hangs on the Get call. Something like:


1 @ 0x47672e 0x40b8dc 0x40b492 0xbf87df 0xbf780f 0xbf7787 0xbfc3f5 0xedae65 0x4d26c6 0x4d17d9 0xc03cb9 0xbeea52 0xbf6ecc 0xbf9119 0x47e781
# 0xbf87de  go.uber.org/cadence/internal.(*coroutineState).initialYield+0x5e      /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:798
# 0xbf780e  go.uber.org/cadence/internal.(*coroutineState).yield+0x22e        /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:808
# 0xbf7786  go.uber.org/cadence/internal.(*channelImpl).Receive+0x1a6       /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:623
# 0xbfc3f4  go.uber.org/cadence/internal.(*decodeFutureImpl).Get+0x54       /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:1301
# 0xedae64  main.(*ParentWorkflow).Run+0x144              /home/peterg/dev/cadence-bug/worker/parent_workflow.go:51

That call never returns, which I think violates the intended semantics (in the documentation) and results in a resource leak.

Expected behavior
I would expect the Get call made by the (timed out) parent workflow to eventually return with a TimeoutError.
