Disconnect tasklist pollers on domain failover using callback #6903

Open · wants to merge 18 commits into master

Conversation

Member

@fimanishi commented May 9, 2025

What changed?
Created a callback that disconnects pollers for all tasklists after a domain failover. Pollers are disconnected on both the newly active and the newly passive side.

Why?
For an active-passive global domain, Cadence only processes tasks in the domain's active cluster. In the passive (standby) cluster, all task polls are redirected by matching to be matched against query tasks, and they keep trying to match query tasks until the poll times out. When a failover happens, polls that were redirected in the previously standby cluster have to wait for that timeout (60s by default) before they are released to poll for decision and activity tasks, which can delay task processing by up to 60s after a failover. By registering a callback that disconnects pollers on domain failover, we minimize that delay as much as possible.
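As a rough illustration of the mechanism (a minimal sketch with assumed names and shapes, not the actual matcher code): a blocked poll waits on a task channel and on the matcher's cancel context, so cancelling that context on failover releases the poller immediately instead of letting it sit until the poll timeout.

package matching

import (
	"context"
	"errors"
)

// Hypothetical placeholders for illustration only.
type task struct{ workflowID string }

var errPollerDisconnected = errors.New("poller disconnected by domain failover")

// pollForTask sketches a blocked poll: it returns when a task arrives, when the
// matcher's cancel context is cancelled (the failover callback), or when the
// caller's own poll deadline expires, whichever comes first.
func pollForTask(ctx, cancelCtx context.Context, taskC <-chan *task) (*task, error) {
	select {
	case t := <-taskC:
		return t, nil
	case <-cancelCtx.Done():
		return nil, errPollerDisconnected
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}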

How did you test it?
Unit tests and local testing.

Potential risks
If the checks are not made correctly, pollers could be disconnected on any domain update (not just a failover), needlessly delaying task processing.

Release notes

Documentation Changes

if domainActiveCluster != nil {
	c.domainActiveCluster = *domainActiveCluster
}
c.matcher.DisconnectBlockedPollers()
Member

Isn't this going to cancel the cancelCtx on the matcher, which will result in all future pollers getting immediately cancelled? I think we need to change the behavior in matcher to replace the context with a new one when we do this.

Member Author

My understanding is that we attach that cancellation method to new incoming requests' contexts. So the function cancels all the existing contexts, but future contexts can still be created and associated with the cancel function; they will only be cancelled if the cancelFunc is called again. I tested multiple failovers locally (because it also cancels the pollers on the previously active side) and the workflows still completed after failing back.

Member

Can you add a unit test to cover this?

Member

The context on the matcher is a normal context:

cancelCtx, cancelFunc := context.WithCancel(context.Background())

DisconnectBlockedPollers is just:

func (tm *taskMatcherImpl) DisconnectBlockedPollers() {
	tm.cancelFunc()
}

Unless I'm missing something, once it's been called, cancelCtx is permanently cancelled.

Member Author

Yeah, @natemort is correct. The implementation now takes into account whether the matcher's context has been cancelled: I've added a function to refresh the cancellation context.
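A small self-contained example of the issue and the fix being discussed (standard library only; the variable names mirror the matcher fields, but this is not the PR's code): once cancelFunc is called, the context stays cancelled, so it has to be replaced for future pollers.

package main

import (
	"context"
	"fmt"
)

func main() {
	cancelCtx, cancelFunc := context.WithCancel(context.Background())

	// Disconnecting pollers: cancelling releases everything blocked on cancelCtx.Done()...
	cancelFunc()
	fmt.Println(cancelCtx.Err()) // context.Canceled

	// ...but the context stays cancelled forever, so without a refresh every future
	// poll would be rejected immediately. Replacing it restores normal behaviour.
	cancelCtx, cancelFunc = context.WithCancel(context.Background())
	fmt.Println(cancelCtx.Err()) // <nil>
	_ = cancelFunc
}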

newNotificationVersion := e.notificationVersion

for _, domain := range nextDomains {
	if !isDomainEligibleToDisconnectPollers(domain, e.notificationVersion) {
Member

Does the domain notification version only change when an active -> passive switch happens?

Member Author

That is true. I guess it can be more efficient; I'll change that.

Member

The notification version should change for every domain change, I should think.

I guess my question here is: what's the use case or thing you're guarding against?

Member Author

I was trying to be more efficient when handling domain updates, but if the value is monotonically increasing I'm not sure this adds anything. If we always receive values higher than the stored one, it won't make any difference. Is my understanding correct here?

Member

Took me a second to get my head around the code structure, but that makes sense. No concerns.

Member

I guess this could be more efficient by checking the failover version of the domain. From my understanding, notification version is also updated when the domain metadata is updated.

Member Author

But the failover version is independent for each domain, right? I'd have to keep track of each domain's failover version independently in the manager, not in the engine. I guess I could use that to track failovers instead of using the domain's active cluster name. Does that make sense?
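A hedged sketch of what per-domain failover version tracking could look like (type and method names are hypothetical, not the PR's code): the callback only disconnects pollers when a domain's failover version actually advances, ignoring unrelated metadata updates.

package matching

import "sync"

// failoverVersionTracker is a hypothetical helper: it remembers the last observed
// failover version per domain so the callback can ignore updates that don't move it.
type failoverVersionTracker struct {
	mu       sync.Mutex
	versions map[string]int64 // domainID -> last observed failover version
}

func newFailoverVersionTracker() *failoverVersionTracker {
	return &failoverVersionTracker{versions: make(map[string]int64)}
}

// shouldDisconnect reports whether the domain's failover version advanced since
// the last callback invocation; it also records the new version.
func (t *failoverVersionTracker) shouldDisconnect(domainID string, failoverVersion int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	prev, seen := t.versions[domainID]
	t.versions[domainID] = failoverVersion
	return seen && failoverVersion > prev
}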

@@ -162,6 +163,7 @@ func NewEngine(
}

func (e *matchingEngineImpl) Start() {
	e.registerDomainFailoverCallback()
Member

The matching engine is not created on demand, so it probably doesn't matter, but for consistency let's unregister during Stop.
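For reference, a sketch of the symmetry being suggested (unregisterDomainFailoverCallback is an assumed counterpart to the registration in Start, not necessarily what the PR implements):

func (e *matchingEngineImpl) Start() {
	e.registerDomainFailoverCallback()
}

func (e *matchingEngineImpl) Stop() {
	// Assumed counterpart to the registration in Start.
	e.unregisterDomainFailoverCallback()
	// ... existing shutdown logic ...
}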

Member

The matching engine could be removed entirely and everything put directly inside the handler, but that change isn't necessary here.

@@ -29,7 +29,7 @@ operations:
# failoverTimeoutSec: 5 # unset means force failover. setting it means graceful failover request

- op: validate
at: 120s # todo: this should work at 40s mark
at: 40s
Member

💯

Member

@davidporter-id-au left a comment

I didn't quite understand Nate's concerns, but otherwise lgtm

func (c *taskListManagerImpl) DisconnectBlockedPollers(domainActiveCluster *string) {
	if domainActiveCluster != nil {
		c.domainActiveCluster = *domainActiveCluster
Member

This isn't thread-safe

		c.domainActiveCluster = *domainActiveCluster
	}
	c.matcher.DisconnectBlockedPollers()
	c.matcher.RefreshCancelContext()
Member

Nit: I think we could do this as part of DisconnectBlockedPollers

Member Author

I'm not sure what the initial intention of DisconnectBlockedPollers was, but I agree.

@@ -488,6 +492,10 @@ func (tm *taskMatcherImpl) Rate() float64 {
	return rate
}

func (tm *taskMatcherImpl) RefreshCancelContext() {
	tm.cancelCtx, tm.cancelFunc = context.WithCancel(context.Background())
Member

This isn't thread-safe
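One way to address both this comment and the earlier nit, sketched under the assumption that a mutex around the cancel context is acceptable (the locking and the cancelContext accessor are not part of the PR): cancel and replace the context atomically inside DisconnectBlockedPollers, and make readers go through the same lock.

package matching

import (
	"context"
	"sync"
)

// Sketch only: the real taskMatcherImpl has more fields; shown here are just the
// parts relevant to disconnecting and refreshing pollers.
type taskMatcherImpl struct {
	cancelMu   sync.Mutex
	cancelCtx  context.Context
	cancelFunc context.CancelFunc
}

// DisconnectBlockedPollers releases every poll blocked on the current context and
// immediately installs a fresh context so future pollers behave normally.
func (tm *taskMatcherImpl) DisconnectBlockedPollers() {
	tm.cancelMu.Lock()
	defer tm.cancelMu.Unlock()
	tm.cancelFunc()
	tm.cancelCtx, tm.cancelFunc = context.WithCancel(context.Background())
}

// cancelContext returns the current cancel context; readers take the same lock so
// they never observe a torn update.
func (tm *taskMatcherImpl) cancelContext() context.Context {
	tm.cancelMu.Lock()
	defer tm.cancelMu.Unlock()
	return tm.cancelCtx
}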
