
feat(simulation): add history shard movement simulation with chaos monkey#7959

Open
arzonus wants to merge 2 commits into cadence-workflow:master from arzonus:add-shard-movement-in-history-simulation

Conversation

arzonus (Contributor) commented Apr 15, 2026

What changed?

Adds a history shard movement simulation that exercises host stop/start while
workflows run. Relates to #7953.

New components:

  • host/chaos.go: generic ChaosMonkey driven by a HostController interface; configurable stop/start probability, jitter, and a MinHosts floor to guarantee recovery (see the sketch after this list)
  • host/history_chaos.go: HistoryChaosMonkey wraps ChaosMonkey with a historyCadenceController adapter that maps the Cadence interface to HostController
  • host/membership_hashring.go: AddHost/RemoveHost for live ring mutation
  • host/membership_resolver.go: NotifySubscribers so surviving history nodes receive membership change events; NewSimpleResolverWithHashrings to share rings across dynamically started hosts
  • host/onebox.go: StopHistoryHost/StartHistoryHost — removes/adds the host from the shared hashring, notifies all resolvers, then stops/starts the service
  • simulation/history/history_simulation_test.go: runs HistoryChaosMonkey in the background while 20 workflows execute across 3 hosts / 16 shards; waits for all workflows to complete before stopping chaos
  • simulation/history/testdata/history_simulation_shard_movement.yaml: 10-min scenario
  • simulation/history/testdata/history_simulation_shard_movement_quick.yaml: 2-min scenario for fast iteration
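
For orientation, here is a minimal sketch of the shape the chaos loop takes, as mentioned in the first item above. Only HostController, ChaosMonkey, MinHosts, and the jittered stop/start probabilities come from this description; the field names and loop body below are illustrative, not the actual implementation:

package host

import (
	"math/rand"
	"time"
)

// HostController abstracts the service under chaos so ChaosMonkey stays generic.
type HostController interface {
	HostCount() int
	StopHost(idx int)
	StartHost(idx int)
}

// ChaosMonkey randomly stops and starts hosts on a jittered interval,
// never letting the number of running hosts fall below minHosts.
type ChaosMonkey struct {
	controller HostController
	interval   time.Duration // base tick; random jitter is added each round
	stopChance float64       // probability of attempting a stop on a tick
	minHosts   int           // floor of running hosts, guarantees recovery
	running    []bool
	done       chan struct{}
}

func (m *ChaosMonkey) loop() {
	for {
		jitter := time.Duration(rand.Int63n(int64(m.interval)))
		select {
		case <-m.done:
			return
		case <-time.After(m.interval + jitter):
			idx := rand.Intn(m.controller.HostCount())
			switch {
			case m.running[idx] && rand.Float64() < m.stopChance && m.runningCount() > m.minHosts:
				m.controller.StopHost(idx)
				m.running[idx] = false
			case !m.running[idx]:
				m.controller.StartHost(idx)
				m.running[idx] = true
			}
		}
	}
}

func (m *ChaosMonkey) runningCount() int {
	n := 0
	for _, up := range m.running {
		if up {
			n++
		}
	}
	return n
}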

Why?

There was no automated way to verify that history shard movement does not lose tasks or leave workflows stuck. This simulation closes that gap: workflows must complete successfully despite hosts being repeatedly stopped and restarted, forcing shard ownership to migrate mid-flight.

The HostController abstraction also keeps ChaosMonkey reusable — a matching-service or frontend chaos monkey can be wired in with a one-liner adapter.
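
For illustration, such an adapter might look like the sketch below; StopMatchingHost and StartMatchingHost are hypothetical method names, not existing harness API:

// Hypothetical adapter sketch: StopMatchingHost/StartMatchingHost are
// illustrative names, not existing methods on the Cadence test harness.
type matchingCadenceController struct {
	cadence Cadence
	hosts   int
}

func (c *matchingCadenceController) HostCount() int    { return c.hosts }
func (c *matchingCadenceController) StopHost(idx int)  { c.cadence.StopMatchingHost(idx) }
func (c *matchingCadenceController) StartHost(idx int) { c.cadence.StartMatchingHost(idx) }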

How did you test it?

Local CI checks passed:

make .idl-status  # submodule on master
make pr           # tidy → go-generate → fmt → lint — exit 0, no file drift
make build        # exit 0
make test         # all packages pass (cmd/server/cadence SQLite schema failure is pre-existing, unrelated)

Both simulation scenarios ran end-to-end:

# 2-min quick scenario
./simulation/history/run.sh --scenario shard_movement_quick --dockerfile-suffix .local
# → Passed

# 10-min full scenario (3 hosts, 16 shards, chaos every 5 s)
./simulation/history/run.sh --scenario shard_movement --dockerfile-suffix .local
# → Passed

All 20 workflows completed. Unexecuted timer tasks in the summary are retention timers (type 4, 1 day out) and a few backoff timers (type 3) created near test shutdown — both are expected.

Potential risks

None — simulation-only change. No production code paths modified; the history task executor logging additions are guarded by the existing simulation event flag.

Release notes

N/A — internal change.

Documentation Changes

N/A

Comment thread host/onebox.go
// Reuse the shared hashrings stored during Start()
resolver := NewSimpleResolverWithHashrings(params.Name, c.serviceRings, hostport)
params.MembershipResolver = resolver
c.allResolvers = append(c.allResolvers, resolver.(*simpleResolver))

⚠️ Bug: StartHistoryHost appends to allResolvers without cleanup

Each call to StartHistoryHost appends a new *simpleResolver to c.allResolvers (line 756), but StopHistoryHost never removes it. After repeated stop/start cycles, allResolvers grows unboundedly. Each subsequent NotifySubscribers call iterates over all accumulated resolvers, including stale ones from stopped hosts that still hold subscriber channels.

This is both a memory leak and a correctness concern: stale resolvers receive notifications even after their host is stopped, which may cause unexpected behavior if the subscriber channels are still being read by a previously-stopped service's goroutines during drain.

Suggested fix:

Track resolvers per host index (e.g. `historyResolvers []*simpleResolver`) and replace/remove them on stop/start cycles, or clear the resolver's subscribers on stop.
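
A minimal sketch of the per-index variant; c.historyResolvers and hostIndex are illustrative names, not code from this PR:

// Illustrative fix sketch: overwrite this host's slot rather than appending,
// so repeated stop/start cycles cannot grow the list or leave stale
// subscribers receiving membership events. historyResolvers is hypothetical.
resolver := NewSimpleResolverWithHashrings(params.Name, c.serviceRings, hostport)
params.MembershipResolver = resolver
c.historyResolvers[hostIndex] = resolver.(*simpleResolver)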


arzonus force-pushed the add-shard-movement-in-history-simulation branch from 7ec3665 to bffed9d (April 16, 2026 11:32)
arzonus added 2 commits April 17, 2026 09:06
feat(simulation): add history shard movement simulation with chaos monkey

Introduce an end-to-end simulation that exercises history shard movement
by randomly stopping and starting history hosts while workflows run.

Key changes:
- host/chaos.go: generic ChaosMonkey driven by a HostController interface;
  configurable stop/start chances, jitter, and MinHosts floor
- host/history_chaos.go: HistoryChaosMonkey adapter wiring Cadence →
  HostController for history hosts
- host/membership_hashring.go: AddHost/RemoveHost for live hashring mutation
- host/membership_resolver.go: NotifySubscribers so surviving history hosts
  react to membership changes; NewSimpleResolverWithHashrings to share
  hashrings across dynamically started hosts
- host/onebox.go: StopHistoryHost/StartHistoryHost — removes/adds host from
  shared hashring, notifies all resolvers, then stops/starts the service
- simulation/history: TestHistorySimulation now runs HistoryChaosMonkey in
  the background while 20 workflows execute across 3 hosts / 16 shards
- testdata/history_simulation_shard_movement.yaml: 10-min scenario (3 hosts,
  16 shards, chaos every 5 s)
- testdata/history_simulation_shard_movement_quick.yaml: 2-min scenario for
  fast iteration
- service/history/simulation/event.go: HostStopped/HostStarted event names
  for structured simulation logs

Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
arzonus force-pushed the add-shard-movement-in-history-simulation branch from bffed9d to c9e40d2 (April 17, 2026 07:07)
gitar-bot (Bot) commented Apr 17, 2026

Code Review: ⚠️ Changes requested (2 resolved / 3 findings)

Integrates a history shard movement simulation with a chaos monkey, and corrects the silent error returns in StartHistoryHost. The implementation currently leaks resolvers because StartHistoryHost appends to allResolvers without cleanup.

⚠️ Bug: StartHistoryHost appends to allResolvers without cleanup

📄 host/onebox.go:756 📄 host/onebox.go:717-719 📄 host/onebox.go:818-820

Each call to StartHistoryHost appends a new *simpleResolver to c.allResolvers (line 756), but StopHistoryHost never removes it. After repeated stop/start cycles, allResolvers grows unboundedly. Each subsequent NotifySubscribers call iterates over all accumulated resolvers, including stale ones from stopped hosts that still hold subscriber channels.

This is both a memory leak and a correctness concern: stale resolvers receive notifications even after their host is stopped, which may cause unexpected behavior if the subscriber channels are still being read by a previously-stopped service's goroutines during drain.

Suggested fix
Track resolvers per host index (e.g. `historyResolvers []*simpleResolver`) and replace/remove them on stop/start cycles, or clear the resolver's subscribers on stop.
✅ 2 resolved
Edge Case: StartHistoryHost silently returns on error instead of failing

📄 host/onebox.go:793-797 📄 host/chaos.go:197-202
StartHistoryHost logs errors and returns silently on failure (e.g., line 795 when history.NewService fails). Since this is called by the ChaosMonkey which then marks the host as running (c.running[idx] = true in chaos.go:202), a silent failure means the chaos monkey's bookkeeping will be wrong — it will think the host is running when it's not. This could violate the MinHosts invariant and lead to all hosts being stopped.

Consider returning an error from StartHistoryHost (or panicking, since this is test infrastructure), and having the ChaosMonkey handle the failure by not updating its running state.
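
A sketch of the chaos-monkey side, assuming StartHost is changed to return an error; m.logger and the helper name are illustrative:

// Hypothetical handling, assuming StartHistoryHost (and HostController's
// StartHost) are changed to return an error.
func (m *ChaosMonkey) startHost(idx int) {
	if err := m.controller.StartHost(idx); err != nil {
		// Leave running[idx] false so the MinHosts accounting stays truthful.
		m.logger.Error("chaos monkey failed to start host", tag.Error(err))
		return
	}
	m.running[idx] = true
}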

Bug: StopHistoryHost/StartHistoryHost have no synchronization

📄 host/onebox.go:696-710 📄 host/onebox.go:728-742
Both StopHistoryHost and StartHistoryHost read and write shared fields (c.historyServices, c.allResolvers, c.historyHashring) on cadenceImpl without any mutex. The ChaosMonkey calls these methods from its goroutine, while Stop() reads c.historyServices from the test goroutine. If chaos is still running when the test tears down (or if two ticks overlap due to a slow Stop/Start), this is a data race.

While simpleHashring and simpleResolver individually have mutexes, the compound operations in these methods (e.g., check-nil-then-stop, append to allResolvers, set historyServices[index]) are unprotected.

Consider adding a sync.Mutex to cadenceImpl and locking it around the critical sections in StopHistoryHost, StartHistoryHost, and Stop.
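
One possible shape, with fields abbreviated; the mutex and the check-then-clear under the lock are the point, and the field types shown are assumptions:

// Illustrative synchronization sketch; actual field types may differ.
type cadenceImpl struct {
	mu              sync.Mutex // guards the compound stop/start state below
	historyServices []common.Daemon
	historyHashring *simpleHashring
	allResolvers    []*simpleResolver
}

func (c *cadenceImpl) StopHistoryHost(idx int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	svc := c.historyServices[idx]
	if svc == nil {
		return // already stopped by an earlier chaos tick
	}
	c.historyServices[idx] = nil // check-then-clear is atomic under mu
	svc.Stop()
}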


Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR explicitly links to issue #7953 in the description, satisfying the issue linking requirement.
PR Description Quality Standards: The PR description clearly explains the technical rationale, provides specific technical details of the changes, and includes concrete command-line examples for verifying the changes.


Comment thread host/onebox.go
c.frontendService.Stop()
for _, historyService := range c.historyServices {
-	historyService.Stop()
+	if historyService != nil {
Member


How come it is empty?

Comment thread host/membership_resolver.go
}

// NewSimpleResolver returns a membership resolver interface
func NewSimpleResolver(serviceName string, hosts map[string][]membership.HostInfo, currentHost membership.HostInfo) membership.Resolver {
Member


Could this be deleted?

Comment thread host/onebox.go
}

-func (c *cadenceImpl) startHistory(hosts map[string][]membership.HostInfo, startWG *sync.WaitGroup) {
+func (c *cadenceImpl) startHistory(rings map[string]*simpleHashring, startWG *sync.WaitGroup) {
Member


Some logic is duplicated inside the StartHistoryHost method; could this method reuse that one?
