Skip to content

fix(test): eliminate data race in TestBatchWorkflowV2_TuneSignal#7892

Open
officialasishkumar wants to merge 2 commits intocadence-workflow:masterfrom
officialasishkumar:fix/flaky-batch-workflow-v2-tune-signal
Open

fix(test): eliminate data race in TestBatchWorkflowV2_TuneSignal#7892
officialasishkumar wants to merge 2 commits intocadence-workflow:masterfrom
officialasishkumar:fix/flaky-batch-workflow-v2-tune-signal

Conversation

@officialasishkumar
Copy link
Copy Markdown
Contributor

What changed?

Fixed a data race in TestBatchWorkflowV2_TuneSignal that caused flaky test failures under the race detector.

Relates to #7868, #7864, #7865.

Why?

The test used t.Cleanup(func() { close(firstActivityDone) }) to release a blocked goroutine, but t.Cleanup runs after the test function returns — at which point the test was already reading capturedParams without synchronization. The race detector flagged concurrent read/write on the capturedParams slice because the first activity goroutine could still be appending to it while the test was asserting on it.

The fix replaces t.Cleanup with explicit synchronization:

  • close(firstActivityDone) is called after all assertions on the workflow result, so the first activity goroutine is released at a controlled point.
  • A sync.WaitGroup tracks running activity mock goroutines, and activityWG.Wait() ensures all goroutines have finished before reading capturedParams.
  • A mutex protects all access to capturedParams, including the final read after Wait().

How did you test it?

go test -race -count=5 ./service/worker/batcher/... -run TestBatchWorkflowV2_TuneSignal

All 5 iterations passed cleanly with no race detector warnings.

Potential risks

None. This is a test-only change with no impact on production code.

Release notes

N/A — test-only fix.

Documentation Changes

N/A

The test had a data race between the activity mock goroutine writing to
capturedParams and the main test goroutine reading it for assertions.
The first activity goroutine blocked on firstActivityDone which was
closed by t.Cleanup — but cleanup runs after the test function returns,
meaning the goroutine could still be running during assertions.

Replace t.Cleanup with explicit synchronization:
- Close firstActivityDone before assertions so the blocked goroutine
  can complete
- Add a sync.WaitGroup to track activity mock goroutines and call
  Wait() before reading capturedParams
- This ensures all concurrent writes are finished before any read

Verified with: go test -race -count=5 ./service/worker/batcher/... -run TestBatchWorkflowV2_TuneSignal

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
close(firstActivityDone)
activityWG.Wait()

mu.Lock()
Copy link
Copy Markdown
Member

@Shaddoll Shaddoll Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all activities have finished, we don't need this lock, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the final read-side lock. We still need the mutex while the two activity goroutines append into capturedParams, but once activityWG.Wait() returns the post-wait copy can read the slice directly.

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
@officialasishkumar officialasishkumar force-pushed the fix/flaky-batch-workflow-v2-tune-signal branch from 505be2a to 7376ab7 Compare April 6, 2026 17:35
Comment on lines +166 to +167
activityWG.Add(1)
defer activityWG.Done()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Bug: WaitGroup.Add(1) called inside goroutine instead of before it

At line 166, activityWG.Add(1) is called inside the activity callback goroutine rather than before it starts. Per the sync.WaitGroup documentation, "calls with a positive delta that occur when the counter is zero must happen before a Wait." If the Go scheduler delays the goroutine start, activityWG.Wait() (line 204) could observe a zero counter and return before Add(1) has executed, reading capturedParams while the goroutine is still writing to it.

In practice this is very unlikely because the Cadence test framework invokes the second activity synchronously during ExecuteWorkflow (which completes before Wait is called), and the first activity must have started to block on the channel. Still, it violates the documented contract and could theoretically cause the race detector to fire under extreme scheduling.

A cleaner approach: add both counts eagerly before the mock is invoked, or use activityWG.Add(1) inside a wrapper that is called synchronously by the framework before spawning the goroutine.

Suggested fix:

// Before registering the mock, add the expected count:
activityWG.Add(2) // we expect exactly 2 invocations

env.OnActivity(batchActivityV2Name, mock.Anything, mock.Anything).
    Return(func(_ context.Context, p BatchParams) (HeartBeatDetails, error) {
        defer activityWG.Done()
        // ... rest unchanged

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 6, 2026

Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Eliminates a data race in TestBatchWorkflowV2_TuneSignal by removing redundant locking after WaitGroup operations. Consider moving WaitGroup.Add(1) outside the goroutine to prevent potential race conditions during concurrent execution.

💡 Bug: WaitGroup.Add(1) called inside goroutine instead of before it

📄 service/worker/batcher/workflow_v2_test.go:166-167 📄 service/worker/batcher/workflow_v2_test.go:204

At line 166, activityWG.Add(1) is called inside the activity callback goroutine rather than before it starts. Per the sync.WaitGroup documentation, "calls with a positive delta that occur when the counter is zero must happen before a Wait." If the Go scheduler delays the goroutine start, activityWG.Wait() (line 204) could observe a zero counter and return before Add(1) has executed, reading capturedParams while the goroutine is still writing to it.

In practice this is very unlikely because the Cadence test framework invokes the second activity synchronously during ExecuteWorkflow (which completes before Wait is called), and the first activity must have started to block on the channel. Still, it violates the documented contract and could theoretically cause the race detector to fire under extreme scheduling.

A cleaner approach: add both counts eagerly before the mock is invoked, or use activityWG.Add(1) inside a wrapper that is called synchronously by the framework before spawning the goroutine.

Suggested fix
// Before registering the mock, add the expected count:
activityWG.Add(2) // we expect exactly 2 invocations

env.OnActivity(batchActivityV2Name, mock.Anything, mock.Anything).
    Return(func(_ context.Context, p BatchParams) (HeartBeatDetails, error) {
        defer activityWG.Done()
        // ... rest unchanged
🤖 Prompt for agents
Code Review: Eliminates a data race in TestBatchWorkflowV2_TuneSignal by removing redundant locking after WaitGroup operations. Consider moving WaitGroup.Add(1) outside the goroutine to prevent potential race conditions during concurrent execution.

1. 💡 Bug: WaitGroup.Add(1) called inside goroutine instead of before it
   Files: service/worker/batcher/workflow_v2_test.go:166-167, service/worker/batcher/workflow_v2_test.go:204

   At line 166, `activityWG.Add(1)` is called inside the activity callback goroutine rather than before it starts. Per the `sync.WaitGroup` documentation, "calls with a positive delta that occur when the counter is zero must happen before a Wait." If the Go scheduler delays the goroutine start, `activityWG.Wait()` (line 204) could observe a zero counter and return before `Add(1)` has executed, reading `capturedParams` while the goroutine is still writing to it.
   
   In practice this is very unlikely because the Cadence test framework invokes the second activity synchronously during `ExecuteWorkflow` (which completes before `Wait` is called), and the first activity must have started to block on the channel. Still, it violates the documented contract and could theoretically cause the race detector to fire under extreme scheduling.
   
   A cleaner approach: add both counts eagerly before the mock is invoked, or use `activityWG.Add(1)` inside a wrapper that is called synchronously by the framework before spawning the goroutine.

   Suggested fix:
   // Before registering the mock, add the expected count:
   activityWG.Add(2) // we expect exactly 2 invocations
   
   env.OnActivity(batchActivityV2Name, mock.Anything, mock.Anything).
       Return(func(_ context.Context, p BatchParams) (HeartBeatDetails, error) {
           defer activityWG.Done()
           // ... rest unchanged

Rules ✅ All requirements met

Repository Rules

PR Description Quality Standards: The PR description includes all required sections with substantive, high-quality content following the template guidance.

2 rules not applicable. Show all rules by commenting gitar display:verbose.

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@officialasishkumar
Copy link
Copy Markdown
Contributor Author

This is superseded by #7898 on master (8e5b4b31), which rewrote the same test around the activity-start signal path. I did not push a new branch change here because rebasing this WaitGroup-based approach now just conflicts with the upstream fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants