Disconnect tasklist pollers on domain failover using callback #6903
Conversation
# Conflicts:
#    common/cache/domainCache_test.go
    if domainActiveCluster != nil {
        c.domainActiveCluster = *domainActiveCluster
    }
    c.matcher.DisconnectBlockedPollers()
Isn't this going to cancel the cancelCtx on the matcher, which will result in all future pollers getting immediately cancelled? I think we need to change the behavior in matcher to replace the context with a new one when we do this.
My understanding is that we attach that cancellation to new incoming requests' contexts. So the function cancels all the existing contexts, but future contexts can still be created and associated with the cancel function; they will only be cancelled if the cancelFunc is called again. I tested multiple failovers locally (because it also cancels the pollers on the previously active side) and they still completed after failing back.
Can you add a unit test to cover this?
The context on the matcher is a normal context:

    cancelCtx, cancelFunc := context.WithCancel(context.Background())

DisconnectBlockedPollers is just:

    func (tm *taskMatcherImpl) DisconnectBlockedPollers() {
        tm.cancelFunc()
    }
Unless I'm missing something, once it's been called, cancelCtx is permanently cancelled.
Yeah, @natemort is correct. The way it's implemented now takes into account whether the matcher's context has been cancelled: I've added a function to refresh the cancellation context.
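To make the cancellation semantics concrete, here is a tiny standalone sketch (not the matcher code, just the context behaviour being discussed): once the cancel function is called, the old cancelCtx stays cancelled forever, so the matcher has to install a fresh context for future pollers to block on.

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// The matcher starts with a plain cancellable context.
	cancelCtx, cancelFunc := context.WithCancel(context.Background())

	// DisconnectBlockedPollers effectively does this: everyone blocked on
	// cancelCtx is released, but the context itself stays cancelled.
	cancelFunc()
	fmt.Println(cancelCtx.Err()) // context canceled

	// Any future poller waiting on the same cancelCtx would return
	// immediately, hence the need to swap in a brand-new context.
	cancelCtx, cancelFunc = context.WithCancel(context.Background())
	defer cancelFunc()
	fmt.Println(cancelCtx.Err()) // <nil> -- pollers can block again
}
```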
    newNotificationVersion := e.notificationVersion

    for _, domain := range nextDomains {
        if !isDomainEligibleToDisconnectPollers(domain, e.notificationVersion) {
Does the domain notification version only change when an active -> passive switch happens?
That is true. I guess it could be more efficient; I'll change that.
The notification version should change for every domain change, I would think.
I guess my question is: what's the use case or thing you're guarding against here?
I was trying to be more efficient when handling domain updates, but if the value is monotonically increasing I'm not sure it adds anything. If we always get values higher than the stored one, it won't make any difference. Is my understanding correct here?
It took me a second to get my head around the code structure, but that makes sense. No concerns.
I guess this could be more efficient by checking the failover version of the domain. From my understanding, the notification version is also updated when the domain metadata is updated.
But the failover version is independent for each domain, right? I'd have to keep track of each domain's failover version independently in the manager, not in the engine. I guess I could use that to track failover instead of using the domain's active cluster name. Does that make sense?
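For illustration only, a hypothetical reconstruction of the eligibility check under discussion; the real isDomainEligibleToDisconnectPollers in this PR may look different. It assumes the domain cache entry exposes IsGlobalDomain and GetNotificationVersion, and notes where a per-domain failover version could be compared instead.

```go
import "github.com/uber/cadence/common/cache"

// Sketch of the filter discussed above: only react to global domains whose
// update has not been processed yet. Comparing a per-domain failover version
// (e.g. domain.GetFailoverVersion()) instead of the shared notification
// version would additionally skip metadata-only updates.
func isDomainEligibleToDisconnectPollers(domain *cache.DomainCacheEntry, notificationVersion int64) bool {
	if !domain.IsGlobalDomain() {
		// Local domains never fail over, so there is nothing to disconnect.
		return false
	}
	// Only consider updates newer than the last one this engine processed.
	return domain.GetNotificationVersion() >= notificationVersion
}
```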
@@ -162,6 +163,7 @@ func NewEngine(
    }

    func (e *matchingEngineImpl) Start() {
        e.registerDomainFailoverCallback()
The matching engine is not created on demand, so it probably doesn't matter, but for consistency reasons let's unregister during Stop.
The matching engine could be removed entirely and everything put inside the handler directly, but that change isn't necessary here.
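A minimal sketch of the Start/Stop symmetry being asked for, written against an assumed callback-registry interface rather than the real domain cache API (whose registration signature differs):

```go
// failoverCallbackRegistry is an assumed stand-in for the domain cache's
// register/unregister API, used only to illustrate the lifecycle symmetry.
type failoverCallbackRegistry interface {
	RegisterDomainFailoverCallback(id string, cb func())
	UnregisterDomainFailoverCallback(id string)
}

type engineLifecycleSketch struct {
	registry failoverCallbackRegistry
}

func (e *engineLifecycleSketch) Start() {
	// Register on Start so failover notifications reach the engine.
	e.registry.RegisterDomainFailoverCallback("matching-engine", e.onDomainFailover)
}

func (e *engineLifecycleSketch) Stop() {
	// Unregister on Stop: the engine normally lives for the whole process
	// lifetime, so this mostly matters for consistency and for tests.
	e.registry.UnregisterDomainFailoverCallback("matching-engine")
}

func (e *engineLifecycleSketch) onDomainFailover() {
	// Disconnect blocked pollers for the affected tasklists here.
}
```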
@@ -29,7 +29,7 @@ operations:
   # failoverTimeoutSec: 5 # unset means force failover. setting it means graceful failover request

   - op: validate
-    at: 120s # todo: this should work at 40s mark
+    at: 40s
💯
I didn't quite understand Nate's concerns, but otherwise lgtm
    func (c *taskListManagerImpl) DisconnectBlockedPollers(domainActiveCluster *string) {
        if domainActiveCluster != nil {
            c.domainActiveCluster = *domainActiveCluster
This isn't thread-safe
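One possible way to address the race, sketched with an assumed mutex on the manager (everything beyond the field and method names in the quoted diff is hypothetical):

```go
import "sync"

type taskListManagerImpl struct {
	domainActiveClusterLock sync.RWMutex
	domainActiveCluster     string
	// ...other fields elided
}

// setDomainActiveCluster guards writes coming from the failover callback.
func (c *taskListManagerImpl) setDomainActiveCluster(cluster string) {
	c.domainActiveClusterLock.Lock()
	defer c.domainActiveClusterLock.Unlock()
	c.domainActiveCluster = cluster
}

// getDomainActiveCluster guards reads on the poller/dispatch paths.
func (c *taskListManagerImpl) getDomainActiveCluster() string {
	c.domainActiveClusterLock.RLock()
	defer c.domainActiveClusterLock.RUnlock()
	return c.domainActiveCluster
}
```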
        c.domainActiveCluster = *domainActiveCluster
    }
    c.matcher.DisconnectBlockedPollers()
    c.matcher.RefreshCancelContext()
Nit: I think we could do this as part of DisconnectBlockedPollers
I'm not sure what the initial intention of DisconnectBlockedPollers was, but I agree with that.
@@ -488,6 +492,10 @@ func (tm *taskMatcherImpl) Rate() float64 {
        return rate
    }

    func (tm *taskMatcherImpl) RefreshCancelContext() {
        tm.cancelCtx, tm.cancelFunc = context.WithCancel(context.Background())
This isn't thread-safe
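A sketch of a lock-protected variant that also folds the refresh into DisconnectBlockedPollers, as suggested in the nit above; the mutex and the accessor are assumptions, not code from this PR.

```go
import (
	"context"
	"sync"
)

type taskMatcherImpl struct {
	cancelMu   sync.Mutex
	cancelCtx  context.Context
	cancelFunc context.CancelFunc
	// ...other matcher fields elided
}

// DisconnectBlockedPollers releases every poller blocked on the current
// context and installs a fresh one so future pollers can block again.
func (tm *taskMatcherImpl) DisconnectBlockedPollers() {
	tm.cancelMu.Lock()
	defer tm.cancelMu.Unlock()
	tm.cancelFunc()
	tm.cancelCtx, tm.cancelFunc = context.WithCancel(context.Background())
}

// cancellationContext is how readers would obtain the context, so they never
// observe a torn cancelCtx/cancelFunc pair.
func (tm *taskMatcherImpl) cancellationContext() context.Context {
	tm.cancelMu.Lock()
	defer tm.cancelMu.Unlock()
	return tm.cancelCtx
}
```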
What changed?
Created a callback that disconnects pollers for all tasklists after a domain failover. It disconnects pollers on both the newly active side and the newly passive side.
Why?
For an active-passive global domain, Cadence only processes tasks in the domain's active cluster. In the passive/standby cluster, all task polls made by pollers are redirected by matching to be matched against query tasks, and they keep trying to match query tasks until the poll times out. When a failover happens, polls that were redirected in the previously standby cluster have to wait for those timeouts (60s by default) before they are released to poll for decision and activity tasks, which can add up to a 60s delay in task processing after a failover. By disconnecting pollers through a callback on domain failover, we minimize that delay as much as possible.
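For illustration, a simplified outline of that flow under assumed names; apart from DisconnectBlockedPollers, which appears in this PR, the identifiers below are made up.

```go
// pollerDisconnector is the minimal behaviour a tasklist manager needs here.
type pollerDisconnector interface {
	DisconnectBlockedPollers(domainActiveCluster *string)
}

type engineSketch struct {
	// taskListsByDomain maps a domain ID to its tasklist managers.
	taskListsByDomain map[string][]pollerDisconnector
}

// onDomainFailover runs on both sides of a failover: the newly passive side
// releases pollers parked on real task matching, the newly active side
// releases pollers parked on query-only matching, so neither group waits out
// the default 60s poll timeout.
func (e *engineSketch) onDomainFailover(domainID, newActiveCluster string) {
	for _, tlm := range e.taskListsByDomain[domainID] {
		tlm.DisconnectBlockedPollers(&newActiveCluster)
	}
}
```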
How did you test it?
Unit tests and local testing.
Potential risks
If the checks are not made correctly, pollers could be disconnected on an ordinary domain update (not necessarily a failover) for no reason, delaying task processing.
Release notes
Documentation Changes