fix(test): eliminate data race in TestBatchWorkflowV2_TuneSignal#7892
Conversation
The test had a data race between the activity mock goroutine writing to capturedParams and the main test goroutine reading it for assertions. The first activity goroutine blocked on firstActivityDone which was closed by t.Cleanup — but cleanup runs after the test function returns, meaning the goroutine could still be running during assertions. Replace t.Cleanup with explicit synchronization: - Close firstActivityDone before assertions so the blocked goroutine can complete - Add a sync.WaitGroup to track activity mock goroutines and call Wait() before reading capturedParams - This ensures all concurrent writes are finished before any read Verified with: go test -race -count=5 ./service/worker/batcher/... -run TestBatchWorkflowV2_TuneSignal Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
| close(firstActivityDone) | ||
| activityWG.Wait() | ||
|
|
||
| mu.Lock() |
There was a problem hiding this comment.
If all activities have finished, we don't need this lock, right?
There was a problem hiding this comment.
Removed the final read-side lock. We still need the mutex while the two activity goroutines append into capturedParams, but once activityWG.Wait() returns the post-wait copy can read the slice directly.
Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
505be2a to
7376ab7
Compare
| activityWG.Add(1) | ||
| defer activityWG.Done() |
There was a problem hiding this comment.
💡 Bug: WaitGroup.Add(1) called inside goroutine instead of before it
At line 166, activityWG.Add(1) is called inside the activity callback goroutine rather than before it starts. Per the sync.WaitGroup documentation, "calls with a positive delta that occur when the counter is zero must happen before a Wait." If the Go scheduler delays the goroutine start, activityWG.Wait() (line 204) could observe a zero counter and return before Add(1) has executed, reading capturedParams while the goroutine is still writing to it.
In practice this is very unlikely because the Cadence test framework invokes the second activity synchronously during ExecuteWorkflow (which completes before Wait is called), and the first activity must have started to block on the channel. Still, it violates the documented contract and could theoretically cause the race detector to fire under extreme scheduling.
A cleaner approach: add both counts eagerly before the mock is invoked, or use activityWG.Add(1) inside a wrapper that is called synchronously by the framework before spawning the goroutine.
Suggested fix:
// Before registering the mock, add the expected count:
activityWG.Add(2) // we expect exactly 2 invocations
env.OnActivity(batchActivityV2Name, mock.Anything, mock.Anything).
Return(func(_ context.Context, p BatchParams) (HeartBeatDetails, error) {
defer activityWG.Done()
// ... rest unchanged
Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion
Code Review 👍 Approved with suggestions 0 resolved / 1 findingsEliminates a data race in TestBatchWorkflowV2_TuneSignal by removing redundant locking after WaitGroup operations. Consider moving WaitGroup.Add(1) outside the goroutine to prevent potential race conditions during concurrent execution. 💡 Bug: WaitGroup.Add(1) called inside goroutine instead of before it📄 service/worker/batcher/workflow_v2_test.go:166-167 📄 service/worker/batcher/workflow_v2_test.go:204 At line 166, In practice this is very unlikely because the Cadence test framework invokes the second activity synchronously during A cleaner approach: add both counts eagerly before the mock is invoked, or use Suggested fix🤖 Prompt for agentsRules ✅ All requirements metRepository Rules
2 rules not applicable. Show all rules by commenting OptionsAuto-apply is off → Gitar will not commit updates to this branch. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
This is superseded by #7898 on |
What changed?
Fixed a data race in
TestBatchWorkflowV2_TuneSignalthat caused flaky test failures under the race detector.Relates to #7868, #7864, #7865.
Why?
The test used
t.Cleanup(func() { close(firstActivityDone) })to release a blocked goroutine, butt.Cleanupruns after the test function returns — at which point the test was already readingcapturedParamswithout synchronization. The race detector flagged concurrent read/write on thecapturedParamsslice because the first activity goroutine could still be appending to it while the test was asserting on it.The fix replaces
t.Cleanupwith explicit synchronization:close(firstActivityDone)is called after all assertions on the workflow result, so the first activity goroutine is released at a controlled point.sync.WaitGrouptracks running activity mock goroutines, andactivityWG.Wait()ensures all goroutines have finished before readingcapturedParams.capturedParams, including the final read afterWait().How did you test it?
go test -race -count=5 ./service/worker/batcher/... -run TestBatchWorkflowV2_TuneSignalAll 5 iterations passed cleanly with no race detector warnings.
Potential risks
None. This is a test-only change with no impact on production code.
Release notes
N/A — test-only fix.
Documentation Changes
N/A