
Conversation


@mbissa mbissa commented Jan 7, 2026

This PR changes the RLS cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to use the async gauge framework instead of synchronous capturing.
It also removes the NoopMetricsRecorder struct and addresses some old nits.

RELEASE NOTES:

  • rls: Change rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to be recorded asynchronously going forward.
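
At a high level, instead of calling the gauge recorder on every cache mutation, the balancer registers a reporter that runs once per collection cycle and unregisters it on Close. A rough sketch of that lifecycle (registerCacheMetricsReporter, dataCache.stats(), the metric handles, and labelVals are placeholders here, not the exact API in this PR):

```go
// Sketch only: registerCacheMetricsReporter stands in for the async gauge
// framework's registration call; the callback runs once per collection cycle.
unregister := registerCacheMetricsReporter(func(recorder estats.MetricsRecorder) {
	entries, size := lb.dataCache.stats() // hypothetical accessor for (entry count, byte size)
	recorder.RecordInt64AsyncGauge(cacheEntriesMetric, entries, labelVals...)
	recorder.RecordInt64AsyncGauge(cacheSizeMetric, size, labelVals...)
})
// Later, in Close:
unregister()
```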

@mbissa mbissa added this to the 1.79 Release milestone Jan 7, 2026
@mbissa mbissa requested a review from arjan-bal January 7, 2026 02:27
@mbissa mbissa self-assigned this Jan 7, 2026
@mbissa mbissa added the Type: Feature New features or improvements in behavior label Jan 7, 2026

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.40%. Comparing base (c6d5e5e) to head (07a7921).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
balancer/rls/balancer.go 81.81% 1 Missing and 1 partial ⚠️
balancer/rls/cache.go 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8808      +/-   ##
==========================================
+ Coverage   83.26%   83.40%   +0.14%     
==========================================
  Files         418      418              
  Lines       33004    33014      +10     
==========================================
+ Hits        27480    27535      +55     
+ Misses       4109     4075      -34     
+ Partials     1415     1404      -11     
Files with missing lines Coverage Δ
internal/testutils/stats/test_metrics_recorder.go 84.05% <ø> (ø)
balancer/rls/balancer.go 86.24% <81.81%> (+0.47%) ⬆️
balancer/rls/cache.go 87.28% <85.71%> (-1.70%) ⬇️

... and 32 files with indirect coverage changes



@arjan-bal arjan-bal left a comment


I think we can clarify the meaning of asynchronous in the release notes by adding something like: "The metric is recorded once per collection cycle, rather than every time its value changes."

lb.logger = internalgrpclog.NewPrefixLogger(logger, fmt.Sprintf("[rls-experimental-lb %p] ", lb))
lb.dataCache = newDataCache(maxCacheSize, lb.logger, cc.MetricsRecorder(), opts.Target.String())
lb.dataCache = newDataCache(maxCacheSize, lb.logger, opts.Target.String())
if metricsRecorder := cc.MetricsRecorder(); metricsRecorder != nil {

We can remove the nil check here. cc.MetricsRecorder() must always return a non-nil metrics recorder. The ClientConn implementation provided by gRPC ensures this, and we enforce that LB policies embed a valid ClientConn when wrapping. A nil MetricsRecorder would indicate a bug in a custom LB policy that is explicitly returning nil.
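
For example (sketch; registerCacheMetricsReporter is a placeholder for whatever registration call this PR actually uses):

```go
// cc.MetricsRecorder() is guaranteed non-nil by the gRPC-provided ClientConn,
// so it can be used directly without a nil check.
lb.dataCache = newDataCache(maxCacheSize, lb.logger, opts.Target.String())
lb.metricHandler = registerCacheMetricsReporter(cc.MetricsRecorder(), lb)
```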

logger *internalgrpclog.PrefixLogger

// metricHandler is the function to deregister the async metric reporter.
metricHandler func()

nit: Can we include unregister in the name to emphasize that it's a cleanup function? metricHandler seems a little generic.

if b.ctrlCh != nil {
b.ctrlCh.close()
}
if b.metricHandler != nil {

We should omit the nil check. The constructor guarantees metricHandler is non-nil. If it is nil, the balancer is in an invalid state, and we should let it crash to surface the bug.
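
For example, in Close (sketch; the field name assumes the unregister-style rename suggested above):

```go
if b.ctrlCh != nil {
	b.ctrlCh.close()
}
// The constructor always sets the unregister function, so call it without a
// nil check; a nil value here is a bug that should surface as a panic.
b.unregisterMetricsReporter()
```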

b.cacheMu.Lock()
defer b.cacheMu.Unlock()

if b.dataCache == nil {

Same here, we should omit the nil check.

Comment on lines +250 to +255
func (r *testAsyncMetricsRecorder) RecordInt64AsyncGauge(h *estats.Int64AsyncGaugeHandle, v int64, _ ...string) {
if r.data == nil {
r.data = make(map[string]int64)
}
r.data[h.Descriptor().Name] = v
}

Can this functionality be added to the existing NewTestMetricsRecorder so it can be reused by other tests in the future?

Comment on lines +251 to +253
if r.data == nil {
r.data = make(map[string]int64)
}

We should ensure the object is initialized with a non-nil map during creation (via a constructor or otherwise). This makes it easier to reason about the state and avoids potential nil pointer dereferences.
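
A rough sketch of both points together in internal/testutils/stats (the mu and int64AsyncGauges field names are assumptions about that file, not its actual contents):

```go
func NewTestMetricsRecorder() *TestMetricsRecorder {
	return &TestMetricsRecorder{
		// ...existing fields...
		// Initialize the map up front so RecordInt64AsyncGauge never has to
		// lazily create it and callers never see a nil map.
		int64AsyncGauges: make(map[string]int64),
	}
}

// RecordInt64AsyncGauge records the latest value reported for an async gauge,
// keyed by metric name, so any test can assert on it.
func (r *TestMetricsRecorder) RecordInt64AsyncGauge(h *estats.Int64AsyncGaugeHandle, v int64, _ ...string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.int64AsyncGauges[h.Descriptor().Name] = v
}
```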

}
if got, _ := tmr.Metric(cacheSizeKey); got != 15 {
t.Fatalf("Unexpected data for metric %v, got: %v, want: %v", cacheSizeKey, got, 15)
verifyMetrics := func(wantEntries, wantSize int64) {

This closure looks like an assertion helper, which the style guide discourages: https://google.github.io/styleguide/go/best-practices#leave-testing-to-the-test-function

It does have useful error messages, though. I would still recommend inlining the logic since it's only a few lines anyway.
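
Inlined at each assertion site, that would look roughly like this (cacheEntriesKey is assumed as the counterpart of cacheSizeKey from the existing test; wantEntries and wantSize stand in for whatever literals the test expects at that point):

```go
if got, _ := tmr.Metric(cacheEntriesKey); got != wantEntries {
	t.Fatalf("Unexpected data for metric %v, got: %v, want: %v", cacheEntriesKey, got, wantEntries)
}
if got, _ := tmr.Metric(cacheSizeKey); got != wantSize {
	t.Fatalf("Unexpected data for metric %v, got: %v, want: %v", cacheSizeKey, got, wantSize)
}
```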

@arjan-bal arjan-bal removed their assignment Jan 7, 2026
@arjan-bal arjan-bal added the Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability label Jan 7, 2026
Comment on lines +717 to +718
b.cacheMu.Lock()
defer b.cacheMu.Unlock()

In my opinion, we should avoid reporting metrics while holding the mutex. If the metrics collector is slow or hangs (due to slow I/O, for instance), it could potentially block the RLS from functioning. We can fetch the required data while holding the lock, but we should release it before calling the metrics reporter.
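
Concretely, something like this (sketch; stats() and the handle/label names are placeholders):

```go
b.cacheMu.Lock()
entries, size := b.dataCache.stats() // copy the values while holding the lock
b.cacheMu.Unlock()

// Report outside the lock so a slow or hung metrics collector cannot block
// RLS cache operations.
recorder.RecordInt64AsyncGauge(cacheEntriesMetric, entries, labelVals...)
recorder.RecordInt64AsyncGauge(cacheSizeMetric, size, labelVals...)
```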

