Disconnect tasklist pollers on domain failover using callback #6903

Open · wants to merge 18 commits into master

Conversation

Member

@fimanishi commented May 9, 2025

What changed?
Created a callback that disconnects pollers for all tasklists after a domain failover. Pollers are disconnected on both the newly active and the newly passive side.

Why?
For an active-passive global domain, Cadence only processes tasks in the domain's active cluster. In the passive (standby) cluster, all task polls are redirected by matching to be matched against query tasks, and they keep trying to match query tasks until the poll times out. When a failover happens, polls that were redirected in the previously standby cluster have to wait for that timeout (60s by default) before they are released to poll for decision and activity tasks, which can delay task processing by up to 60s after a failover. By registering a callback that disconnects pollers on domain failover, we minimize that delay as much as possible.
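As a rough illustration of the mechanism (a minimal sketch with assumed names and shapes, not the actual matcher code): a blocked poll waits on a task channel and on the matcher's cancel context, so cancelling that context on failover releases the poller immediately instead of letting it sit until the poll timeout.

package matching

import (
	"context"
	"errors"
)

// Hypothetical placeholders for illustration only.
type task struct{ workflowID string }

var errPollerDisconnected = errors.New("poller disconnected by domain failover")

// pollForTask sketches a blocked poll: it returns when a task arrives, when the
// matcher's cancel context is cancelled (the failover callback), or when the
// caller's own poll deadline expires, whichever comes first.
func pollForTask(ctx, cancelCtx context.Context, taskC <-chan *task) (*task, error) {
	select {
	case t := <-taskC:
		return t, nil
	case <-cancelCtx.Done():
		return nil, errPollerDisconnected
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}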

How did you test it?
Unit tests and local testing.

Potential risks
If the checks are not made correctly, pollers could be disconnected on any domain update (not just a failover), needlessly delaying task processing.

Release notes

Documentation Changes

if domainActiveCluster != nil {
	c.domainActiveCluster = *domainActiveCluster
}
c.matcher.DisconnectBlockedPollers()
Member

Isn't this going to cancel the cancelCtx on the matcher, which will result in all future pollers getting immediately cancelled? I think we need to change the behavior in matcher to replace the context with a new one when we do this.

Member Author

My understanding is that we attach that cancellation method to new incoming requests' contexts. So the function cancels all the existing contexts, but future contexts can still be created and associated with the cancel function; they will only be cancelled if the cancelFunc is called again. I tested multiple failovers locally (because it also cancels the pollers on the previously active side) and the workflows still completed after failing back.

Member

Can you add a unit test to cover this?

Member

The context on the matcher is a normal context:

cancelCtx, cancelFunc := context.WithCancel(context.Background())

DisconnectBlockedPollers is just:

func (tm *taskMatcherImpl) DisconnectBlockedPollers() {
	tm.cancelFunc()
}

Unless I'm missing something, once it's been called, cancelCtx is permanently cancelled.

Member Author

Yeah, @natemort is correct. The implementation now takes into account whether the matcher's context has been cancelled: I've added a function to refresh the cancellation context.
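A small self-contained example of the issue and the fix being discussed (standard library only; the variable names mirror the matcher fields, but this is not the PR's code): once cancelFunc is called, the context stays cancelled, so it has to be replaced for future pollers.

package main

import (
	"context"
	"fmt"
)

func main() {
	cancelCtx, cancelFunc := context.WithCancel(context.Background())

	// Disconnecting pollers: cancelling releases everything blocked on cancelCtx.Done()...
	cancelFunc()
	fmt.Println(cancelCtx.Err()) // context.Canceled

	// ...but the context stays cancelled forever, so without a refresh every future
	// poll would be rejected immediately. Replacing it restores normal behaviour.
	cancelCtx, cancelFunc = context.WithCancel(context.Background())
	fmt.Println(cancelCtx.Err()) // <nil>
	_ = cancelFunc
}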

newNotificationVersion := e.notificationVersion

for _, domain := range nextDomains {
	if !isDomainEligibleToDisconnectPollers(domain, e.notificationVersion) {
Member

Does the domain notification version only change when an active -> passive switch happens?

Member Author

That is true. I guess it can be more efficient; I'll change that.

Member

The notification version should change for every domain change, I should think.

I guess my question here is: what's the use case or thing you're guarding against?

Member Author

I was trying to be more efficient when handling domain updates, but if the value is monotonically increasing I'm not sure this adds anything. If we always receive values higher than the stored one, it won't make any difference. Is my understanding correct here?

Member

Took me a second to get my head around the code structure, but that makes sense. No concerns.

Member

I guess this could be more efficient by checking the failover version of the domain. From my understanding, notification version is also updated when the domain metadata is updated.

Member Author

But the failover version is independent for each domain, right? I'd have to keep track of each domain's failover version independently in the manager, not in the engine. I guess I could use that to track failovers instead of using the domain's active cluster name. Does that make sense?
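A hedged sketch of what per-domain failover version tracking could look like (type and method names are hypothetical, not the PR's code): the callback only disconnects pollers when a domain's failover version actually advances, ignoring unrelated metadata updates.

package matching

import "sync"

// failoverVersionTracker is a hypothetical helper: it remembers the last observed
// failover version per domain so the callback can ignore updates that don't move it.
type failoverVersionTracker struct {
	mu       sync.Mutex
	versions map[string]int64 // domainID -> last observed failover version
}

func newFailoverVersionTracker() *failoverVersionTracker {
	return &failoverVersionTracker{versions: make(map[string]int64)}
}

// shouldDisconnect reports whether the domain's failover version advanced since
// the last callback invocation; it also records the new version.
func (t *failoverVersionTracker) shouldDisconnect(domainID string, failoverVersion int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	prev, seen := t.versions[domainID]
	t.versions[domainID] = failoverVersion
	return seen && failoverVersion > prev
}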

@@ -162,6 +163,7 @@ func NewEngine(
}

func (e *matchingEngineImpl) Start() {
	e.registerDomainFailoverCallback()
Member

The matching engine is not created on demand, so it probably doesn't matter, but for consistency let's unregister during Stop.
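For reference, a sketch of the symmetry being suggested (unregisterDomainFailoverCallback is an assumed counterpart to the registration in Start, not necessarily what the PR implements):

func (e *matchingEngineImpl) Start() {
	e.registerDomainFailoverCallback()
}

func (e *matchingEngineImpl) Stop() {
	// Assumed counterpart to the registration in Start.
	e.unregisterDomainFailoverCallback()
	// ... existing shutdown logic ...
}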

Member

The matching engine could be removed entirely and everything put directly inside the handler, but that change isn't necessary here.

@@ -29,7 +29,7 @@ operations:
# failoverTimeoutSec: 5 # unset means force failover. setting it means graceful failover request

- op: validate
at: 120s # todo: this should work at 40s mark
at: 40s
Member

💯

Member

@davidporter-id-au left a comment

I didn't quite understand Nate's concerns, but otherwise lgtm

func (c *taskListManagerImpl) DisconnectBlockedPollers(domainActiveCluster *string) {
	if domainActiveCluster != nil {
		c.domainActiveCluster = *domainActiveCluster
Member

This isn't thread-safe

		c.domainActiveCluster = *domainActiveCluster
	}
	c.matcher.DisconnectBlockedPollers()
	c.matcher.RefreshCancelContext()
Member

Nit: I think we could do this as part of DisconnectBlockedPollers

Member Author

I'm not sure what the initial intention of DisconnectBlockedPollers was, but I agree.

@@ -488,6 +492,10 @@ func (tm *taskMatcherImpl) Rate() float64 {
	return rate
}

func (tm *taskMatcherImpl) RefreshCancelContext() {
	tm.cancelCtx, tm.cancelFunc = context.WithCancel(context.Background())
Member

This isn't thread-safe
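One way to address both this comment and the earlier nit, sketched under the assumption that a mutex around the cancel context is acceptable (the locking and the cancelContext accessor are not part of the PR): cancel and replace the context atomically inside DisconnectBlockedPollers, and make readers go through the same lock.

package matching

import (
	"context"
	"sync"
)

// Sketch only: the real taskMatcherImpl has more fields; shown here are just the
// parts relevant to disconnecting and refreshing pollers.
type taskMatcherImpl struct {
	cancelMu   sync.Mutex
	cancelCtx  context.Context
	cancelFunc context.CancelFunc
}

// DisconnectBlockedPollers releases every poll blocked on the current context and
// immediately installs a fresh context so future pollers behave normally.
func (tm *taskMatcherImpl) DisconnectBlockedPollers() {
	tm.cancelMu.Lock()
	defer tm.cancelMu.Unlock()
	tm.cancelFunc()
	tm.cancelCtx, tm.cancelFunc = context.WithCancel(context.Background())
}

// cancelContext returns the current cancel context; readers take the same lock so
// they never observe a torn update.
func (tm *taskMatcherImpl) cancelContext() context.Context {
	tm.cancelMu.Lock()
	defer tm.cancelMu.Unlock()
	return tm.cancelCtx
}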
