-
Notifications
You must be signed in to change notification settings - Fork 4.6k
rls: only reset backoff on recovery from TRANSIENT_FAILURE #8720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
rls: only reset backoff on recovery from TRANSIENT_FAILURE #8720
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8720 +/- ##
==========================================
- Coverage 83.42% 83.21% -0.21%
==========================================
Files 418 414 -4
Lines 32897 32754 -143
==========================================
- Hits 27443 27255 -188
- Misses 4069 4087 +18
- Partials 1385 1412 +27
🚀 New features to boost your workflow:
|
Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY. Fixes grpc#8693
e2406a9 to
a6fcb7e
Compare
- Add testOnlyInitialReadyDone channel for proper test synchronization - Signal when monitoring goroutine processes initial READY state - Tests wait for this signal instead of using time.Sleep - All synchronization now uses channels/callbacks - no arbitrary sleeps - Tests pass consistently with race detector Addresses review feedback about removing time.Sleep for state transitions.
135b43d to
ed5ab2c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes control channel connectivity monitoring in the RLS balancer to only reset backoff timers when genuinely recovering from a TRANSIENT_FAILURE state, not during benign state changes like READY → IDLE → READY.
- Adds explicit tracking of TRANSIENT_FAILURE state with a boolean flag
- Updates callback invocation logic to only trigger after recovery from TRANSIENT_FAILURE
- Adds comprehensive test coverage for various state transition scenarios
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| balancer/rls/control_channel.go | Implements TRANSIENT_FAILURE tracking with a boolean flag, adds nil check for callback, and includes test synchronization channel |
| balancer/rls/control_channel_test.go | Adds comprehensive test cases covering different connectivity state transition scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
balancer/rls/control_channel_test.go
Outdated
| } | ||
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there might be some way to improve the test. Maybe we can use waitGroups , but I will defer to @easwars for his opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be able to test this in an e2e style, we would need to do make it possible for the test to see the connectivity state changes on the control channel, but without adding hooks into the test code. I propose the following:
- Add a new variable that can be overridden by the test. This variable will point to a function that returns the subscriber to the passed to the
SubscribeToConnectivityStateChangesAPI
var newConnectivityStateSubscriber = connStateSubscriber- Set the above variable to the following piece of code in
control_channel.go
func connStateSubscriber(sub grpcsync.Subscriber) grpcsync.Subscriber {
return sub
}- When calling the
SubscribeToConnectivityStateChangesAPI, use the above function as follows:
ctrlCh.unsubscribe = internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(ctrlCh.cc, newConnectivityStateSubscriber(ctrlCh))- In the test, create an implementation of the
grpcsync.Subscriberinterface that contains a delegate. This makes it possible for the test to delegate to the subscriber set by the non-test code, but make the connectivity state changes available to the test:
type wrappingConnectivityStateSubscriber struct {
delegate grpcsync.Subscriber
connStateCh chan connectivity.State
}
func (w *wrappingConnectivityStateSubscriber) OnMessage(msg any) {
w.delegate.OnMessage(msg)
w.connStateCh <- msg.(connectivity.State)
}- Modify
TestControlChannelConnectivityStateMonitoringor add a new test where we do the override. You would probably want to use a `testutils.
// Override the connectivity state subscriber.
wrappedSubscriber := &wrappingConnectivityStateSubscriber{connStateCh: make(chan connectivity.State, 1)}
origConnectivityStateSubscriber := newConnectivityStateSubscriber
newConnectivityStateSubscriber = func(delegate grpcsync.Subscriber) grpcsync.Subscriber {
wrappedSubscriber.delegate = delegate
return wrappedSubscriber
}
defer func() { newConnectivityStateSubscriber = origConnectivityStateSubscriber }()- In the test body, we need to check the appropriate state transitions in a bunch of places:
// Make sure an RLS request is sent out.
verifyRLSRequest(t, rlsReqCh, true)
// Verify that the control channel moves to READY.
wantStates := []connectivity.State{
connectivity.Connecting,
connectivity.Ready,
}
for _, wantState := range wantStates {
select {
case gotState := <-wrappedSubscriber.connStateCh:
if gotState != wantState {
t.Fatalf("Unexpected connectivity state: got %v, want %v", gotState, wantState)
}
case <-ctx.Done():
t.Fatalf("Timeout waiting for RLS control channel to become %q", wantState)
}
}
// Stop the RLS server.
lis.Stop()
// Verify that the control channel moves to IDLE.
wantStates = []connectivity.State{
connectivity.Idle,
}
for _, wantState := range wantStates {
select {
case gotState := <-wrappedSubscriber.connStateCh:
if gotState != wantState {
t.Fatalf("Unexpected connectivity state: got %v, want %v", gotState, wantState)
}
case <-ctx.Done():
t.Fatalf("Timeout waiting for RLS control channel to become %q", wantState)
}
}
// Make another RPC similar to the first one. Since the above cache entry
// would have expired by now, this should trigger another RLS request. And
// since the RLS server is down, RLS request will fail and the cache entry
// will enter backoff, and we have overridden the default backoff strategy to
// return a value which will keep this entry in backoff for the whole duration
// of the test.
makeTestRPCAndVerifyError(ctx, t, cc, codes.Unavailable, nil)
// Verify that the control channel moves to TRANSIENT_FAILURE.
wantStates = []connectivity.State{
connectivity.Connecting,
connectivity.TransientFailure,
}
for _, wantState := range wantStates {
select {
case gotState := <-wrappedSubscriber.connStateCh:
if gotState != wantState {
t.Fatalf("Unexpected connectivity state: got %v, want %v", gotState, wantState)
}
case <-ctx.Done():
t.Fatalf("Timeout waiting for RLS control channel to become %q", wantState)
}
}
// Restart the RLS server.
lis.Restart()The above will test the READY --> TF --> READY transition.
For the READY --> IDLE --> READY, we need to restart the RLS server once the control channel goes IDLE, and then wait for it to go READY before attempting another RPC and verifying that backoffs are not reset.
Let me know what you think about this approach.
Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this option not work out? I still see two new tests being added instead of modifying the existing one TestControlChannelConnectivityStateMonitoring which is more e2e style than the ones currently being added in this PR. Please let me know if you are running into any issues when trying to modify the existing test as per my suggestion. Thanks.
|
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
balancer/rls/control_channel_test.go
Outdated
| } | ||
|
|
||
| for _, tt := range tests { | ||
| t.Run(tt.name, func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this option not work out? I still see two new tests being added instead of modifying the existing one TestControlChannelConnectivityStateMonitoring which is more e2e style than the ones currently being added in this PR. Please let me know if you are running into any issues when trying to modify the existing test as per my suggestion. Thanks.
…el-state-monitoring
|
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
|
This PR is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed. |
balancer/rls/balancer_test.go
Outdated
| t.Fatal("Timeout waiting for RLS control channel to become TRANSIENT_FAILURE") | ||
| } | ||
| } | ||
| transientFailureReached: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please use a labeled for statement and an associated break with that label, instead of goto. Thanks.
balancer/rls/balancer_test.go
Outdated
| t.Fatal("Timeout waiting for RLS control channel to become READY after TRANSIENT_FAILURE") | ||
| } | ||
| } | ||
| readyAfterTransientFailure: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here as above. Let's not use goto please.
balancer/rls/balancer_test.go
Outdated
| select { | ||
| case <-resetBackoffDone: | ||
| t.Fatal("Backoff reset was triggered for initial READY state, want no reset") | ||
| default: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we replace the default case with a short deadline. Something like case <-time.After(10* time.Millisecond)?
Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY.
RELEASE NOTES: N/A
Fixes #8693