Description
Describe the bug
If a workflow times out while calling Get on a child workflow future (as illustrated below), the Get call never returns and a goroutine is leaked.
func MyWorkflow(ctx workflow.Context) error {
// .. snip
future := workflow.ExecuteChildWorkflow(childCtx, childWorkflow)
future.Get(childCtx, nil) // <- hangs forever if MyWorkflow exceeds START_TO_CLOSE timeout.
// .. snip
}
Notably, all other involved "entities" behave the way (I think) they should:
- The caller of the parent workflow (MyWorkflow above) correctly sees a TimeoutType: START_TO_CLOSE.
- The child workflow (childWorkflow above) correctly sees a CanceledError.
- Any activities started by the child workflow correctly see a context canceled.
Steps to reproduce the behavior:
Find a sample application attached (sample.tar.gz) that can be used to reproduce and illustrate the issue. Assuming that a Cadence cluster is serving a cadence-frontend on localhost:7833, the following steps can be used to reproduce the issue:
- Start a workflow worker: go run ./worker/
- Execute the parent workflow with a 60s start-to-close timeout: go run ./client/
After one minute the client will fail (as expected):
{"level":"info","time":"2024-12-03T09:26:42+01:00","message":"starting workflow ..."}
{"level":"info","time":"2024-12-03T09:26:42+01:00","message":"awaiting workflow 1733214402 ..."}
{"level":"error","error":"TimeoutType: START_TO_CLOSE","time":"2024-12-03T09:27:42+01:00","message":"parent workflow execution failed: TimeoutType: START_TO_CLOSE (*internal.TimeoutError)\n&internal.TimeoutError{timeoutType:0, details:internal.ErrorDetailsValues(nil)}"}
Looking at the worker output, we will see that the child workflow failed (as expected):
{"level":"error","error":"CanceledError","time":"2024-12-03T09:27:42+01:00","message":"ChildWorkflow.Run: doActivity failed: CanceledError (*internal.CanceledError)\n&internal.CanceledError{details:internal.ErrorDetailsValues(nil)}"}
Similarly, we will see that the activity's context was canceled (as expected):
{"level":"error","error":"context canceled","time":"2024-12-03T09:27:56+01:00","message":"ChildWorkflow.doActivity: context done: context canceled (*errors.errorString)"}
But, and here is the big BUT, looking at the debug output from pprof on http://localhost:6060/debug/pprof/goroutine?debug=1 we will see that the parent workflow still hangs on the Get call. Something like:
1 @ 0x47672e 0x40b8dc 0x40b492 0xbf87df 0xbf780f 0xbf7787 0xbfc3f5 0xedae65 0x4d26c6 0x4d17d9 0xc03cb9 0xbeea52 0xbf6ecc 0xbf9119 0x47e781
# 0xbf87de go.uber.org/cadence/internal.(*coroutineState).initialYield+0x5e /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:798
# 0xbf780e go.uber.org/cadence/internal.(*coroutineState).yield+0x22e /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:808
# 0xbf7786 go.uber.org/cadence/internal.(*channelImpl).Receive+0x1a6 /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:623
# 0xbfc3f4 go.uber.org/cadence/internal.(*decodeFutureImpl).Get+0x54 /home/peterg/go/pkg/mod/go.uber.org/[email protected]/internal/internal_workflow.go:1301
# 0xedae64 main.(*ParentWorkflow).Run+0x144 /home/peterg/dev/cadence-bug/worker/parent_workflow.go:51
That call never returns, which I think violates the intended semantics (in the documentation) and results in a resource leak.
Expected behavior
I would expect the Get call made by the (timed out) parent workflow to eventually return with a TimeoutError.