
Conversation


@mbissa mbissa commented Jan 7, 2026

This PR changes the RLS cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to use the async gauge framework instead of synchronous capturing.
It also removes the NoopMetricsRecorder struct and addresses some old nits.

RELEASE NOTES:

  • rls: Change rls cache metrics ("grpc.lb.rls.cache_entries" & "grpc.lb.rls.cache_size") to be recorded asynchronously going forward.
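
At a high level, instead of calling the gauge recorder on every cache mutation, the balancer registers a reporter that runs once per collection cycle and unregisters it on Close. A rough sketch of that lifecycle (registerCacheMetricsReporter, dataCache.stats(), the metric handles, and labelVals are placeholders here, not the exact API in this PR):

```go
// Sketch only: registerCacheMetricsReporter stands in for the async gauge
// framework's registration call; the callback runs once per collection cycle.
unregister := registerCacheMetricsReporter(func(recorder estats.MetricsRecorder) {
	entries, size := lb.dataCache.stats() // hypothetical accessor for (entry count, byte size)
	recorder.RecordInt64AsyncGauge(cacheEntriesMetric, entries, labelVals...)
	recorder.RecordInt64AsyncGauge(cacheSizeMetric, size, labelVals...)
})
// Later, in Close:
unregister()
```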

@mbissa mbissa added this to the 1.79 Release milestone Jan 7, 2026
@mbissa mbissa requested a review from arjan-bal January 7, 2026 02:27
@mbissa mbissa self-assigned this Jan 7, 2026
@mbissa mbissa added the Type: Feature New features or improvements in behavior label Jan 7, 2026

codecov bot commented Jan 7, 2026

Codecov Report

❌ Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.40%. Comparing base (c6d5e5e) to head (07a7921).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
balancer/rls/balancer.go 81.81% 1 Missing and 1 partial ⚠️
balancer/rls/cache.go 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8808      +/-   ##
==========================================
+ Coverage   83.26%   83.40%   +0.14%     
==========================================
  Files         418      418              
  Lines       33004    33014      +10     
==========================================
+ Hits        27480    27535      +55     
+ Misses       4109     4075      -34     
+ Partials     1415     1404      -11     
Files with missing lines Coverage Δ
internal/testutils/stats/test_metrics_recorder.go 84.05% <ø> (ø)
balancer/rls/balancer.go 86.24% <81.81%> (+0.47%) ⬆️
balancer/rls/cache.go 87.28% <85.71%> (-1.70%) ⬇️

... and 32 files with indirect coverage changes



@arjan-bal arjan-bal left a comment


I think we can clarify the meaning of asynchronous in the release notes by adding something like: "The metric is recorded once per collection cycle, rather than every time its value changes."

lb.logger = internalgrpclog.NewPrefixLogger(logger, fmt.Sprintf("[rls-experimental-lb %p] ", lb))
lb.dataCache = newDataCache(maxCacheSize, lb.logger, cc.MetricsRecorder(), opts.Target.String())
lb.dataCache = newDataCache(maxCacheSize, lb.logger, opts.Target.String())
if metricsRecorder := cc.MetricsRecorder(); metricsRecorder != nil {

We can remove the nil check here. cc.MetricsRecorder() must always return a non-nil metrics recorder. The ClientConn implementation provided by gRPC ensures this, and we enforce that LB policies embed a valid ClientConn when wrapping. A nil MetricsRecorder would indicate a bug in a custom LB policy that is explicitly returning nil.
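
For example (sketch; registerCacheMetricsReporter is a placeholder for whatever registration call this PR actually uses):

```go
// cc.MetricsRecorder() is guaranteed non-nil by the gRPC-provided ClientConn,
// so it can be used directly without a nil check.
lb.dataCache = newDataCache(maxCacheSize, lb.logger, opts.Target.String())
lb.metricHandler = registerCacheMetricsReporter(cc.MetricsRecorder(), lb)
```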

logger *internalgrpclog.PrefixLogger

// metricHandler is the function to deregister the async metric reporter.
metricHandler func()

nit: Can we include unregister in the name to emphasize that it's a cleanup function? metricHandler seems a little generic.

if b.ctrlCh != nil {
b.ctrlCh.close()
}
if b.metricHandler != nil {

We should omit the nil check. The constructor guarantees metricHandler is non-nil. If it is nil, the balancer is in an invalid state, and we should let it crash to surface the bug.
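
For example, in Close (sketch; the field name assumes the unregister-style rename suggested above):

```go
if b.ctrlCh != nil {
	b.ctrlCh.close()
}
// The constructor always sets the unregister function, so call it without a
// nil check; a nil value here is a bug that should surface as a panic.
b.unregisterMetricsReporter()
```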

b.cacheMu.Lock()
defer b.cacheMu.Unlock()

if b.dataCache == nil {

Same here, we should omit the nil check.

Comment on lines +250 to +255
func (r *testAsyncMetricsRecorder) RecordInt64AsyncGauge(h *estats.Int64AsyncGaugeHandle, v int64, _ ...string) {
if r.data == nil {
r.data = make(map[string]int64)
}
r.data[h.Descriptor().Name] = v
}

Can this functionality be added to the existing NewTestMetricsRecorder so it can be reused by other tests in the future?

Comment on lines +251 to +253
if r.data == nil {
r.data = make(map[string]int64)
}

We should ensure the object is initialized with a non-nil map during creation (via a constructor or otherwise). This makes it easier to reason about the state and avoids potential nil pointer dereferences.
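
A rough sketch of both points together in internal/testutils/stats (the mu and int64AsyncGauges field names are assumptions about that file, not its actual contents):

```go
func NewTestMetricsRecorder() *TestMetricsRecorder {
	return &TestMetricsRecorder{
		// ...existing fields...
		// Initialize the map up front so RecordInt64AsyncGauge never has to
		// lazily create it and callers never see a nil map.
		int64AsyncGauges: make(map[string]int64),
	}
}

// RecordInt64AsyncGauge records the latest value reported for an async gauge,
// keyed by metric name, so any test can assert on it.
func (r *TestMetricsRecorder) RecordInt64AsyncGauge(h *estats.Int64AsyncGaugeHandle, v int64, _ ...string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.int64AsyncGauges[h.Descriptor().Name] = v
}
```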

}
if got, _ := tmr.Metric(cacheSizeKey); got != 15 {
t.Fatalf("Unexpected data for metric %v, got: %v, want: %v", cacheSizeKey, got, 15)
verifyMetrics := func(wantEntries, wantSize int64) {

This closure looks like an assertion helper, which the style guide discourages: https://google.github.io/styleguide/go/best-practices#leave-testing-to-the-test-function

It does have useful error messages, though. I would still recommend inlining the logic since it's only a few lines anyway.
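
Inlined at each assertion site, that would look roughly like this (cacheEntriesKey is assumed as the counterpart of cacheSizeKey from the existing test; wantEntries and wantSize stand in for whatever literals the test expects at that point):

```go
if got, _ := tmr.Metric(cacheEntriesKey); got != wantEntries {
	t.Fatalf("Unexpected data for metric %v, got: %v, want: %v", cacheEntriesKey, got, wantEntries)
}
if got, _ := tmr.Metric(cacheSizeKey); got != wantSize {
	t.Fatalf("Unexpected data for metric %v, got: %v, want: %v", cacheSizeKey, got, wantSize)
}
```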

@arjan-bal arjan-bal removed their assignment Jan 7, 2026
@arjan-bal arjan-bal added the Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability label Jan 7, 2026
Comment on lines +717 to +718
b.cacheMu.Lock()
defer b.cacheMu.Unlock()

In my opinion, we should avoid reporting metrics while holding the mutex. If the metrics collector is slow or hangs (due to slow I/O, for instance), it could potentially block the RLS from functioning. We can fetch the required data while holding the lock, but we should release it before calling the metrics reporter.
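
Concretely, something like this (sketch; stats() and the handle/label names are placeholders):

```go
b.cacheMu.Lock()
entries, size := b.dataCache.stats() // copy the values while holding the lock
b.cacheMu.Unlock()

// Report outside the lock so a slow or hung metrics collector cannot block
// RLS cache operations.
recorder.RecordInt64AsyncGauge(cacheEntriesMetric, entries, labelVals...)
recorder.RecordInt64AsyncGauge(cacheSizeMetric, size, labelVals...)
```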

