Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 19 additions & 2 deletions balancer/rls/control_channel.go
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: This is the exact text that describes the expected behavior:

The policy will monitor the state of the control plane channel. When the state transitions to TRANSIENT_FAILURE, it will record that transition, and the next time it transitions to READY, the policy will iterate through the cache to reset the backoff timeouts in all cache entries. Specifically, this means that it will reset the backoff state and cancel the pending backoff timer. Note that when cancelling the backoff timer, just like when the backoff timer fires normally, a new picker is returned to the channel, to force it to re-process any wait-for-ready RPCs that may still be queued if we failed them while we were in backoff. However, we should optimize this case by returning only one new picker, regardless of how many backoff timers are cancelled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the above text, we don't even have to wait for the first time the control channel goes READY. This means, that we can simplify the code quite a bit and not even have a control channel connectivity state monitoring goroutine. All we need is the following:

  • Continue to subscribe to connectivity state changes as we do today when we create the RLS control channel:
    ctrlCh.unsubscribe = internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(ctrlCh.cc, ctrlCh)
  • In the implementation of the grpcsync.Subscriber interface, we currently push the received connectivity state update on to an unbounded buffer here:
    cc.connectivityStateCh.Put(st)
  • The above buffer is read from the for loop in the monitoring goroutine here:
    for s, ok := <-cc.connectivityStateCh.Get(); s != connectivity.Ready; s, ok = <-cc.connectivityStateCh.Get() {
  • Instead, what we can do is:
func (cc *controlChannel) OnMessage(msg any) {
	st, ok := msg.(connectivity.State)
	if !ok {
		panic(fmt.Sprintf("Unexpected message type %T , wanted connectectivity.State type", msg))
	}

    - If new connectivity state is READY, and we have previously seen TRANSIENT_FAILURE:
      - set the boolean for tracking previously seen TRANSIENT_FAILURE to false
      - reset backoffs by invoking the `backToReadyFunc`
    - else if new connectivity state is TRANSIENT_FAILURE
      - set the boolean for tracking previously seen TRANSIENT_FAILURE to true
    - else
      - do nothing 
}

The above if-elseif-else can also be implemented as a switch and the linter might complain if that is not the case.

Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,11 @@ func (cc *controlChannel) monitorConnectivityState() {
cc.connectivityStateCh.Load()
cc.logger.Infof("Connectivity state is READY")

// Track whether we've seen TRANSIENT_FAILURE since the last READY state.
// We only want to reset backoff when recovering from an actual failure,
// not when transitioning through benign states like IDLE.
seenTransientFailure := false

for {
s, ok := <-cc.connectivityStateCh.Get()
if !ok {
Expand All @@ -197,9 +202,21 @@ func (cc *controlChannel) monitorConnectivityState() {
if s == connectivity.Shutdown {
return
}

// Track if we've entered TRANSIENT_FAILURE state
if s == connectivity.TransientFailure {
seenTransientFailure = true
}

// Only reset backoff if we're returning to READY after a failure
if s == connectivity.Ready {
cc.logger.Infof("Control channel back to READY")
cc.backToReadyFunc()
if seenTransientFailure {
cc.logger.Infof("Control channel back to READY after TRANSIENT_FAILURE")
cc.backToReadyFunc()
seenTransientFailure = false
} else {
cc.logger.Infof("Control channel back to READY (no prior failure)")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment can be improved and made a little more explicit for ease of users.

}
}

cc.logger.Infof("Connectivity state is %s", s)
Expand Down
93 changes: 93 additions & 0 deletions balancer/rls/control_channel_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,13 +26,15 @@ import (
"fmt"
"os"
"regexp"
"sync"
"testing"
"time"

"github.com/google/go-cmp/cmp"
"google.golang.org/grpc"
"google.golang.org/grpc/balancer"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/connectivity"
"google.golang.org/grpc/credentials"
"google.golang.org/grpc/internal"
rlspb "google.golang.org/grpc/internal/proto/grpc_lookup_v1"
Expand Down Expand Up @@ -463,3 +465,94 @@ func (s) TestNewControlChannelUnsupportedCredsBundle(t *testing.T) {
t.Fatal("newControlChannel succeeded when expected to fail")
}
}

// TestControlChannelConnectivityStateTransitions verifies that the control
// channel only resets backoff when recovering from TRANSIENT_FAILURE, not
// when going through benign state changes like READY → IDLE → READY.
func (s) TestControlChannelConnectivityStateTransitions(t *testing.T) {
tests := []struct {
name string
states []connectivity.State
wantCallbackCount int
}{
{
name: "READY → TRANSIENT_FAILURE → READY triggers callback",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subtest names show up in logs with the test names and can be used to filter test, we prefer names that can be easily used to filter and put the extra text (if we want any) in description, and also we try to avoid using spaces in the names. So for example for the 1st test case,
name : ready_after_transient_failure
description : ready after transient failure triggers callback to reset the timer.

Then the way we will see the name in logs will be Test/ControlChannelConnectivityStateTransitions/ready_after_transient_failure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

states: []connectivity.State{
connectivity.TransientFailure,
connectivity.Ready,
},
wantCallbackCount: 1,
},
{
name: "READY → IDLE → READY does not trigger callback",
states: []connectivity.State{
connectivity.Idle,
connectivity.Ready,
},
wantCallbackCount: 0,
},
{
name: "Multiple failures trigger callback each time",
states: []connectivity.State{
connectivity.TransientFailure,
connectivity.Ready,
connectivity.TransientFailure,
connectivity.Ready,
},
wantCallbackCount: 2,
},
{
name: "IDLE between failures doesn't affect callback",
states: []connectivity.State{
connectivity.TransientFailure,
connectivity.Idle,
connectivity.Ready,
},
wantCallbackCount: 1,
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be some way to improve the test. Maybe we can use waitGroups , but I will defer to @easwars for his opinion on this.

// Start an RLS server
rlsServer, _ := rlstest.SetupFakeRLSServer(t, nil)

// Setup callback to count invocations
callbackCount := 0
var mu sync.Mutex
callback := func() {
mu.Lock()
callbackCount++
mu.Unlock()
}

// Create control channel
ctrlCh, err := newControlChannel(rlsServer.Address, "", defaultTestTimeout, balancer.BuildOptions{}, callback)
if err != nil {
t.Fatalf("Failed to create control channel: %v", err)
}
defer ctrlCh.close()

// Give the channel time to reach initial READY state
time.Sleep(100 * time.Millisecond)

// Inject the test state sequence
for _, state := range tt.states {
ctrlCh.OnMessage(state)
// Give time for the monitoring goroutine to process the state
time.Sleep(50 * time.Millisecond)
}

// Give extra time for any pending callbacks
time.Sleep(100 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think adding time.Sleep() to wait for state changes is a good idea, since it can make tests flake if sometimes the state transitions take longer, we should look for better guarantees to make sure the states have transitioned.
cc : @easwars


mu.Lock()
gotCallbackCount := callbackCount
mu.Unlock()

if gotCallbackCount != tt.wantCallbackCount {
t.Errorf("Got %d callback invocations, want %d", gotCallbackCount, tt.wantCallbackCount)
}
})
}
}
Loading