feat(simulation): add history shard movement simulation with chaos monkey #7959
arzonus wants to merge 2 commits into cadence-workflow:master from
Conversation
```go
// Reuse the shared hashrings stored during Start()
resolver := NewSimpleResolverWithHashrings(params.Name, c.serviceRings, hostport)
params.MembershipResolver = resolver
c.allResolvers = append(c.allResolvers, resolver.(*simpleResolver))
```
⚠️ Bug: StartHistoryHost appends to allResolvers without cleanup
Each call to StartHistoryHost appends a new *simpleResolver to c.allResolvers (line 756), but StopHistoryHost never removes it. After repeated stop/start cycles, allResolvers grows unboundedly. Each subsequent NotifySubscribers call iterates over all accumulated resolvers, including stale ones from stopped hosts that still hold subscriber channels.
This is both a memory leak and a correctness concern: stale resolvers receive notifications even after their host is stopped, which may cause unexpected behavior if the subscriber channels are still being read by a previously-stopped service's goroutines during drain.
Suggested fix:
Track resolvers per host index (e.g. `historyResolvers []*simpleResolver`) and replace/remove them on stop/start cycles, or clear the resolver's subscribers on stop.
Force-pushed from 7ec3665 to bffed9d
…nkey

Introduce an end-to-end simulation that exercises history shard movement by randomly stopping and starting history hosts while workflows run.

Key changes:
- host/chaos.go: generic ChaosMonkey driven by a HostController interface; configurable stop/start chances, jitter, and MinHosts floor
- host/history_chaos.go: HistoryChaosMonkey adapter wiring Cadence → HostController for history hosts
- host/membership_hashring.go: AddHost/RemoveHost for live hashring mutation
- host/membership_resolver.go: NotifySubscribers so surviving history hosts react to membership changes; NewSimpleResolverWithHashrings to share hashrings across dynamically started hosts
- host/onebox.go: StopHistoryHost/StartHistoryHost removes/adds host from shared hashring, notifies all resolvers, then stops/starts the service
- simulation/history: TestHistorySimulation now runs HistoryChaosMonkey in the background while 20 workflows execute across 3 hosts / 16 shards
- testdata/history_simulation_shard_movement.yaml: 10-min scenario (3 hosts, 16 shards, chaos every 5 s)
- testdata/history_simulation_shard_movement_quick.yaml: 2-min scenario for fast iteration
- service/history/simulation/event.go: HostStopped/HostStarted event names for structured simulation logs

Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
Force-pushed from bffed9d to c9e40d2
Code Review
```go
c.frontendService.Stop()
for _, historyService := range c.historyServices {
	// nil-guard added: stopped hosts leave nil entries in the slice
	if historyService != nil {
		historyService.Stop()
	}
}
```
```go
// NewSimpleResolver returns a membership resolver interface
func NewSimpleResolver(serviceName string, hosts map[string][]membership.HostInfo, currentHost membership.HostInfo) membership.Resolver {
	// ...
}
```
```diff
-func (c *cadenceImpl) startHistory(hosts map[string][]membership.HostInfo, startWG *sync.WaitGroup) {
+func (c *cadenceImpl) startHistory(rings map[string]*simpleHashring, startWG *sync.WaitGroup) {
```
Some logic is duplicated inside the `StartHistoryHost` method; could this method reuse it?
What changed?
Adds a history shard movement simulation that exercises host stop/start while
workflows run. Relates to #7953.
New components:
- `host/chaos.go`: generic `ChaosMonkey` driven by a `HostController` interface; configurable stop/start probability, jitter, and a `MinHosts` floor to guarantee recovery
- `host/history_chaos.go`: `HistoryChaosMonkey` wraps `ChaosMonkey` with a `historyCadenceController` adapter that maps the `Cadence` interface to `HostController`
- `host/membership_hashring.go`: `AddHost`/`RemoveHost` for live ring mutation
- `host/membership_resolver.go`: `NotifySubscribers` so surviving history nodes receive membership change events; `NewSimpleResolverWithHashrings` to share rings across dynamically started hosts
- `host/onebox.go`: `StopHistoryHost`/`StartHistoryHost` removes/adds the host from the shared hashring, notifies all resolvers, then stops/starts the service
- `simulation/history/history_simulation_test.go`: runs `HistoryChaosMonkey` in the background while 20 workflows execute across 3 hosts / 16 shards; waits for all workflows to complete before stopping chaos
- `simulation/history/testdata/history_simulation_shard_movement.yaml`: 10-min scenario
- `simulation/history/testdata/history_simulation_shard_movement_quick.yaml`: 2-min scenario for fast iteration

Why?
There was no automated way to verify that history shard movement does not lose tasks or leave workflows stuck. This simulation closes that gap: workflows must complete successfully despite hosts being repeatedly stopped and restarted, forcing shard ownership to migrate mid-flight.
The `HostController` abstraction also keeps `ChaosMonkey` reusable: a matching-service or frontend chaos monkey can be wired in with a one-line adapter.

How did you test it?
Local CI checks passed:
Both simulation scenarios ran end-to-end:
All 20 workflows completed. Unexecuted timer tasks in the summary are retention timers (type 4, 1 day out) and a few backoff timers (type 3) created near test shutdown — both are expected.
Potential risks
None — simulation-only change. No production code paths modified; the history task executor logging additions are guarded by the existing simulation event flag.
Release notes
N/A — internal change.
Documentation Changes
N/A