
feat(simulation): add history shard movement simulation with chaos monkey#7959

Open
arzonus wants to merge 2 commits into cadence-workflow:master from arzonus:add-shard-movement-in-history-simulation

Conversation

arzonus (Contributor) commented Apr 15, 2026

What changed?

Adds a history shard movement simulation that exercises host stop/start while
workflows run. Relates to #7953.

New components:

  • host/chaos.go: generic ChaosMonkey driven by a HostController interface; configurable stop/start probability, jitter, and a MinHosts floor to guarantee recovery (see the sketch after this list)
  • host/history_chaos.go: HistoryChaosMonkey wraps ChaosMonkey with a historyCadenceController adapter that maps the Cadence interface to HostController
  • host/membership_hashring.go: AddHost/RemoveHost for live ring mutation
  • host/membership_resolver.go: NotifySubscribers so surviving history nodes receive membership change events; NewSimpleResolverWithHashrings to share rings across dynamically started hosts
  • host/onebox.go: StopHistoryHost/StartHistoryHost — removes/adds the host from the shared hashring, notifies all resolvers, then stops/starts the service
  • simulation/history/history_simulation_test.go: runs HistoryChaosMonkey in the background while 20 workflows execute across 3 hosts / 16 shards; waits for all workflows to complete before stopping chaos
  • simulation/history/testdata/history_simulation_shard_movement.yaml: 10-min scenario
  • simulation/history/testdata/history_simulation_shard_movement_quick.yaml: 2-min scenario for fast iteration
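
For orientation, here is a minimal sketch of the shape the chaos loop takes, as mentioned in the first item above. Only HostController, ChaosMonkey, MinHosts, and the jittered stop/start probabilities come from this description; the field names and loop body below are illustrative, not the actual implementation:

package host

import (
	"math/rand"
	"time"
)

// HostController abstracts the service under chaos so ChaosMonkey stays generic.
type HostController interface {
	HostCount() int
	StopHost(idx int)
	StartHost(idx int)
}

// ChaosMonkey randomly stops and starts hosts on a jittered interval,
// never letting the number of running hosts fall below minHosts.
type ChaosMonkey struct {
	controller HostController
	interval   time.Duration // base tick; random jitter is added each round
	stopChance float64       // probability of attempting a stop on a tick
	minHosts   int           // floor of running hosts, guarantees recovery
	running    []bool
	done       chan struct{}
}

func (m *ChaosMonkey) loop() {
	for {
		jitter := time.Duration(rand.Int63n(int64(m.interval)))
		select {
		case <-m.done:
			return
		case <-time.After(m.interval + jitter):
			idx := rand.Intn(m.controller.HostCount())
			switch {
			case m.running[idx] && rand.Float64() < m.stopChance && m.runningCount() > m.minHosts:
				m.controller.StopHost(idx)
				m.running[idx] = false
			case !m.running[idx]:
				m.controller.StartHost(idx)
				m.running[idx] = true
			}
		}
	}
}

func (m *ChaosMonkey) runningCount() int {
	n := 0
	for _, up := range m.running {
		if up {
			n++
		}
	}
	return n
}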

Why?

There was no automated way to verify that history shard movement does not lose tasks or leave workflows stuck. This simulation closes that gap: workflows must complete successfully despite hosts being repeatedly stopped and restarted, forcing shard ownership to migrate mid-flight.

The HostController abstraction also keeps ChaosMonkey reusable — a matching-service or frontend chaos monkey can be wired in with a one-liner adapter.
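
For illustration, such an adapter might look like the sketch below; StopMatchingHost and StartMatchingHost are hypothetical method names, not existing harness API:

// Hypothetical adapter sketch: StopMatchingHost/StartMatchingHost are
// illustrative names, not existing methods on the Cadence test harness.
type matchingCadenceController struct {
	cadence Cadence
	hosts   int
}

func (c *matchingCadenceController) HostCount() int    { return c.hosts }
func (c *matchingCadenceController) StopHost(idx int)  { c.cadence.StopMatchingHost(idx) }
func (c *matchingCadenceController) StartHost(idx int) { c.cadence.StartMatchingHost(idx) }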

How did you test it?

Local CI checks passed:

make .idl-status  # submodule on master
make pr           # tidy → go-generate → fmt → lint — exit 0, no file drift
make build        # exit 0
make test         # all packages pass (cmd/server/cadence SQLite schema failure is pre-existing, unrelated)

Both simulation scenarios ran end-to-end:

# 2-min quick scenario
./simulation/history/run.sh --scenario shard_movement_quick --dockerfile-suffix .local
# → Passed

# 10-min full scenario (3 hosts, 16 shards, chaos every 5 s)
./simulation/history/run.sh --scenario shard_movement --dockerfile-suffix .local
# → Passed

All 20 workflows completed. Unexecuted timer tasks in the summary are retention timers (type 4, 1 day out) and a few backoff timers (type 3) created near test shutdown — both are expected.

Potential risks

None — simulation-only change. No production code paths modified; the history task executor logging additions are guarded by the existing simulation event flag.

Release notes

N/A — internal change.

Documentation Changes

N/A

Comment thread host/onebox.go
// Reuse the shared hashrings stored during Start()
resolver := NewSimpleResolverWithHashrings(params.Name, c.serviceRings, hostport)
params.MembershipResolver = resolver
c.allResolvers = append(c.allResolvers, resolver.(*simpleResolver))

⚠️ Bug: StartHistoryHost appends to allResolvers without cleanup

Each call to StartHistoryHost appends a new *simpleResolver to c.allResolvers (line 756), but StopHistoryHost never removes it. After repeated stop/start cycles, allResolvers grows unboundedly. Each subsequent NotifySubscribers call iterates over all accumulated resolvers, including stale ones from stopped hosts that still hold subscriber channels.

This is both a memory leak and a correctness concern: stale resolvers receive notifications even after their host is stopped, which may cause unexpected behavior if the subscriber channels are still being read by a previously-stopped service's goroutines during drain.

Suggested fix:

Track resolvers per host index (e.g. `historyResolvers []*simpleResolver`) and replace/remove them on stop/start cycles, or clear the resolver's subscribers on stop.
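
A minimal sketch of the per-index variant; c.historyResolvers and hostIndex are illustrative names, not code from this PR:

// Illustrative fix sketch: overwrite this host's slot rather than appending,
// so repeated stop/start cycles cannot grow the list or leave stale
// subscribers receiving membership events. historyResolvers is hypothetical.
resolver := NewSimpleResolverWithHashrings(params.Name, c.serviceRings, hostport)
params.MembershipResolver = resolver
c.historyResolvers[hostIndex] = resolver.(*simpleResolver)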


arzonus force-pushed the add-shard-movement-in-history-simulation branch from 7ec3665 to bffed9d (April 16, 2026 11:32)
arzonus added 2 commits April 17, 2026 09:06
feat(simulation): add history shard movement simulation with chaos monkey

Introduce an end-to-end simulation that exercises history shard movement
by randomly stopping and starting history hosts while workflows run.

Key changes:
- host/chaos.go: generic ChaosMonkey driven by a HostController interface;
  configurable stop/start chances, jitter, and MinHosts floor
- host/history_chaos.go: HistoryChaosMonkey adapter wiring Cadence →
  HostController for history hosts
- host/membership_hashring.go: AddHost/RemoveHost for live hashring mutation
- host/membership_resolver.go: NotifySubscribers so surviving history hosts
  react to membership changes; NewSimpleResolverWithHashrings to share
  hashrings across dynamically started hosts
- host/onebox.go: StopHistoryHost/StartHistoryHost — removes/adds host from
  shared hashring, notifies all resolvers, then stops/starts the service
- simulation/history: TestHistorySimulation now runs HistoryChaosMonkey in
  the background while 20 workflows execute across 3 hosts / 16 shards
- testdata/history_simulation_shard_movement.yaml: 10-min scenario (3 hosts,
  16 shards, chaos every 5 s)
- testdata/history_simulation_shard_movement_quick.yaml: 2-min scenario for
  fast iteration
- service/history/simulation/event.go: HostStopped/HostStarted event names
  for structured simulation logs

Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
arzonus force-pushed the add-shard-movement-in-history-simulation branch from bffed9d to c9e40d2 (April 17, 2026 07:07)
gitar-bot (Bot) commented Apr 17, 2026

Code Review: ⚠️ Changes requested (2 resolved / 3 findings)

Integrates a history shard movement simulation with a chaos monkey, and corrects the silent error returns in StartHistoryHost. The implementation currently leaks resolvers because StartHistoryHost appends to allResolvers without cleanup.

⚠️ Bug: StartHistoryHost appends to allResolvers without cleanup

📄 host/onebox.go:756 📄 host/onebox.go:717-719 📄 host/onebox.go:818-820

Each call to StartHistoryHost appends a new *simpleResolver to c.allResolvers (line 756), but StopHistoryHost never removes it. After repeated stop/start cycles, allResolvers grows unboundedly. Each subsequent NotifySubscribers call iterates over all accumulated resolvers, including stale ones from stopped hosts that still hold subscriber channels.

This is both a memory leak and a correctness concern: stale resolvers receive notifications even after their host is stopped, which may cause unexpected behavior if the subscriber channels are still being read by a previously-stopped service's goroutines during drain.

Suggested fix
Track resolvers per host index (e.g. `historyResolvers []*simpleResolver`) and replace/remove them on stop/start cycles, or clear the resolver's subscribers on stop.
✅ 2 resolved
Edge Case: StartHistoryHost silently returns on error instead of failing

📄 host/onebox.go:793-797 📄 host/chaos.go:197-202
StartHistoryHost logs errors and returns silently on failure (e.g., line 795 when history.NewService fails). Since this is called by the ChaosMonkey which then marks the host as running (c.running[idx] = true in chaos.go:202), a silent failure means the chaos monkey's bookkeeping will be wrong — it will think the host is running when it's not. This could violate the MinHosts invariant and lead to all hosts being stopped.

Consider returning an error from StartHistoryHost (or panicking, since this is test infrastructure), and having the ChaosMonkey handle the failure by not updating its running state.
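
A sketch of the chaos-monkey side, assuming StartHost is changed to return an error; m.logger and the helper name are illustrative:

// Hypothetical handling, assuming StartHistoryHost (and HostController's
// StartHost) are changed to return an error.
func (m *ChaosMonkey) startHost(idx int) {
	if err := m.controller.StartHost(idx); err != nil {
		// Leave running[idx] false so the MinHosts accounting stays truthful.
		m.logger.Error("chaos monkey failed to start host", tag.Error(err))
		return
	}
	m.running[idx] = true
}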

Bug: StopHistoryHost/StartHistoryHost have no synchronization

📄 host/onebox.go:696-710 📄 host/onebox.go:728-742
Both StopHistoryHost and StartHistoryHost read and write shared fields (c.historyServices, c.allResolvers, c.historyHashring) on cadenceImpl without any mutex. The ChaosMonkey calls these methods from its goroutine, while Stop() reads c.historyServices from the test goroutine. If chaos is still running when the test tears down (or if two ticks overlap due to a slow Stop/Start), this is a data race.

While simpleHashring and simpleResolver individually have mutexes, the compound operations in these methods (e.g., check-nil-then-stop, append to allResolvers, set historyServices[index]) are unprotected.

Consider adding a sync.Mutex to cadenceImpl and locking it around the critical sections in StopHistoryHost, StartHistoryHost, and Stop.
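
One possible shape, with fields abbreviated; the mutex and the check-then-clear under the lock are the point, and the field types shown are assumptions:

// Illustrative synchronization sketch; actual field types may differ.
type cadenceImpl struct {
	mu              sync.Mutex // guards the compound stop/start state below
	historyServices []common.Daemon
	historyHashring *simpleHashring
	allResolvers    []*simpleResolver
}

func (c *cadenceImpl) StopHistoryHost(idx int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	svc := c.historyServices[idx]
	if svc == nil {
		return // already stopped by an earlier chaos tick
	}
	c.historyServices[idx] = nil // check-then-clear is atomic under mu
	svc.Stop()
}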


Rules ✅ All requirements met

Repository Rules

GitHub Issue Linking Requirement: The PR explicitly links to issue #7953 in the description, satisfying the issue linking requirement.
PR Description Quality Standards: The PR description clearly explains the technical rationale, provides specific technical details of the changes, and includes concrete command-line examples for verifying the changes.


Comment thread host/onebox.go
c.frontendService.Stop()
for _, historyService := range c.historyServices {
-	historyService.Stop()
+	if historyService != nil {
Member


How come it is empty?

Comment thread host/membership_resolver.go
}

// NewSimpleResolver returns a membership resolver interface
func NewSimpleResolver(serviceName string, hosts map[string][]membership.HostInfo, currentHost membership.HostInfo) membership.Resolver {
Member


Could this be deleted?

Comment thread host/onebox.go
}

-func (c *cadenceImpl) startHistory(hosts map[string][]membership.HostInfo, startWG *sync.WaitGroup) {
+func (c *cadenceImpl) startHistory(rings map[string]*simpleHashring, startWG *sync.WaitGroup) {
Member


Some logic is duplicated inside the StartHistoryHost method; could this method reuse that one?
