util: add grpc channelz collector for prometheus by zyguan · Pull Request #1938 · tikv/client-go

zyguan · 2026-04-02T09:44:31Z

The collector walks client-side channelz objects (Channel, Subchannel, Socket) via the gRPC channelz API and exports raw channel/socket metrics with configurable namespace, optional filtering, and optional local/remote socket labels. Channel state and trace metrics are opt-in via IncludeChannelState and IncludeChannelTrace, while collector self-metrics are limited to fetch error counters.

Summary by CodeRabbit

New Features
- Added a Prometheus collector that exports detailed gRPC Channelz metrics: channel/subchannel/socket activity, call counts and timestamps, optional connectivity-state and trace metrics, keepalive and flow-control data, address normalization, and configurable filtering/labeling.
Tests
- Added end-to-end tests validating metric emission, label sets, filtering/traversal behavior, timestamp handling, and fetch-error reporting.

Signed-off-by: zyguan <zhongyangguan@gmail.com>

coderabbitai · 2026-04-02T09:44:45Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c33009a1-5495-4a00-a5ec-7605ecb52b55

📥 Commits

Reviewing files that changed from the base of the PR and between f5a9f44 and 4931328.

📒 Files selected for processing (1)

util/collectors/channelz_test.go

✅ Files skipped from review due to trivial changes (1)

util/collectors/channelz_test.go

📝 Walkthrough


Adds a Prometheus collector that traverses gRPC Channelz (TopChannels → Channels/Subchannels → Sockets) via a ChannelzClient, emitting metrics for calls, streams, sockets, optional channel state and trace events, with configurable filtering and per-scrape de-duplication.
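
The traversal and per-scrape de-duplication described above can be sketched with an in-memory stand-in for the Channelz hierarchy. This is only an illustration under stated assumptions: the `node` and `walker` types and the `visit` and `walkExample` functions are hypothetical, and the real collector fetches entities through a ChannelzClient and emits Prometheus metrics rather than appending to a slice.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a Channelz entity (channel, subchannel
// or socket); the real collector fetches these over gRPC.
type node struct {
	id       int64
	kind     string // "channel", "subchannel" or "socket"
	children []*node
}

// walker holds per-scrape state. A fresh walker is created for every scrape,
// so the seen set never carries over between scrapes.
type walker struct {
	seen    map[string]struct{}
	visited []string
}

func newWalker() *walker {
	return &walker{seen: make(map[string]struct{})}
}

// visit records each entity once, even if it is reachable via several parents,
// and then descends into its children.
func (w *walker) visit(n *node) {
	key := fmt.Sprintf("%s/%d", n.kind, n.id)
	if _, dup := w.seen[key]; dup {
		return
	}
	w.seen[key] = struct{}{}
	w.visited = append(w.visited, key)
	for _, c := range n.children {
		w.visit(c)
	}
}

// walkExample builds a small hierarchy in which one socket is shared by two
// subchannels and returns the entities in visit order.
func walkExample() []string {
	shared := &node{id: 7, kind: "socket"}
	top := &node{id: 1, kind: "channel", children: []*node{
		{id: 2, kind: "subchannel", children: []*node{shared}},
		{id: 3, kind: "subchannel", children: []*node{shared}},
	}}
	w := newWalker()
	w.visit(top)
	return w.visited
}

func main() {
	for _, key := range walkExample() {
		fmt.Println(key)
	}
}
```

The sketch uses a single map keyed by kind and ID for brevity; keeping one map per entity kind would work the same way.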

Changes

| Cohort / File(s) | Summary |
|---|---|
| Channelz Collector `util/collectors/channelz.go` | New Prometheus collector `NewChannelzCollector` that walks Channelz hierarchy, emits metrics for channels/subchannels/sockets, supports `ChannelzCollectorOpts` (Filter, emit state, emit trace, socket label toggles), de-duplication maps, and per-API fetch error counters. |
| Collector Tests `util/collectors/channelz_test.go` | End-to-end tests using an in-memory bufconn fake Channelz server; exercises pagination, per-method error injection, filter behavior, optional state/trace metrics, socket label toggles, timestamp handling, and asserts emitted metric families and labels. |

Sequence Diagram(s)

sequenceDiagram
  participant Prometheus as Prometheus Scrape
  participant Collector as ChannelzCollector
  participant Channelz as Channelz gRPC Server
  participant Registry as Prometheus Registry

  Prometheus->>Collector: HTTP scrape -> Collect()
  Collector->>Channelz: GetTopChannels(page_token)
  Channelz-->>Collector: TopChannels(list, next_token)
  loop per channel/subchannel/socket
    Collector->>Channelz: GetChannel / GetSubchannel / GetSocket(id)
    Channelz-->>Collector: Channel/Subchannel/Socket data
    Collector->>Collector: apply Filter? decide emit & descend
    Collector->>Registry: expose metrics (labels, gauges, counters)
  end
  Collector-->>Prometheus: HTTP response (metrics)
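
The paging step at the top of the diagram can be sketched as a simple loop. This is a minimal illustration, not the PR's code: `fakeServer`, `page`, and `listAll` are hypothetical names, and the cursor is modeled as "first ID not yet seen" plus an end flag, which is an assumption about how the listing call is driven.

```go
package main

import "fmt"

// fakeServer is a hypothetical in-memory stand-in for a paginated listing API.
type fakeServer struct {
	ids      []int64 // sorted ascending
	pageSize int
}

// page returns up to pageSize IDs that are >= start, plus a flag telling the
// caller whether the listing is exhausted.
func (s *fakeServer) page(start int64) (ids []int64, end bool) {
	for i, id := range s.ids {
		if id >= start {
			hi := i + s.pageSize
			if hi >= len(s.ids) {
				return s.ids[i:], true
			}
			return s.ids[i:hi], false
		}
	}
	return nil, true
}

// listAll pages through the server until it reports the end. The empty-page
// check guards against looping forever on a misbehaving server.
func listAll(s *fakeServer) []int64 {
	var all []int64
	start := int64(0)
	for {
		ids, end := s.page(start)
		all = append(all, ids...)
		if end || len(ids) == 0 {
			return all
		}
		start = ids[len(ids)-1] + 1
	}
}

func main() {
	s := &fakeServer{ids: []int64{1, 2, 5, 8, 13}, pageSize: 2}
	fmt.Println(listAll(s))
}
```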

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hopped through channels, soft and fleet,
Counting calls with tiny feet,
Subchannels hummed, sockets chimed,
Metrics gathered, neatly timed,
A rabbit's scrape — tidy and sweet. 🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 3.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a gRPC channelz collector for Prometheus monitoring, which aligns with the primary purpose of the new files introduced in the changeset. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

util/collectors/channelz.go (3)

498-500: Remove unnecessary identity function.

The cDesc function simply returns its input unchanged and adds no value. Consider removing it and using the descriptors directly.

♻️ Suggested removal

-func cDesc(desc *prometheus.Desc) *prometheus.Desc {
-	return desc
-}

Then replace all cDesc(w.collector.channelCallsDesc) calls with just w.collector.channelCallsDesc.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@util/collectors/channelz.go` around lines 498 - 500, Remove the trivial
identity function cDesc which simply returns its input; delete the cDesc
function and replace all calls like cDesc(w.collector.channelCallsDesc) with the
descriptor directly (e.g., w.collector.channelCallsDesc) across the file (and
any other callers), updating imports/usages if necessary to compile. Ensure no
other logic depends on cDesc and run tests/build to confirm no references
remain.

233-240: Consider adding a context timeout for gRPC calls.

Using context.Background() without a timeout means the collector could hang indefinitely if the channelz server is unresponsive. While Prometheus enforces a scrape timeout, the underlying goroutine may leak. Consider accepting a context from Collect or using a fixed timeout.

This pattern repeats in walkChannel (lines 279, 287, 295) and walkSubchannel (lines 326, 334, 342).

♻️ Suggested approach

 func (c *channelzCollector) Collect(ch chan<- prometheus.Metric) {
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
 	w := channelzWalker{
 		collector:       c,
 		ch:              ch,
+		ctx:             ctx,
 		now:             time.Now(),
 		seenChannels:    make(map[int64]struct{}),
 		seenSubchannels: make(map[int64]struct{}),
 		seenSockets:     make(map[int64]struct{}),
 	}
 	w.walkTopChannels()
 	c.collectFetchErrorMetrics(ch)
 }

Then use w.ctx in all gRPC calls within the walker methods.
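
The effect of sharing one deadline across all calls in a scrape can be sketched with the standard library alone. The names here (`slowCall`, `collectOnce`, `timedOut`, `demo`) are hypothetical, and `slowCall` merely simulates a unary RPC that honors context cancellation; it is not the collector's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall is a hypothetical stand-in for a unary gRPC call: it returns after
// delay, or earlier with ctx.Err() if the context is cancelled or times out.
func slowCall(ctx context.Context, delay time.Duration) error {
	timer := time.NewTimer(delay)
	defer timer.Stop()
	select {
	case <-timer.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// collectOnce derives one deadline for the whole scrape and shares it across
// every call, so the total time is bounded by budget.
func collectOnce(budget time.Duration, delays ...time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), budget)
	defer cancel()
	for _, d := range delays {
		if err := slowCall(ctx, d); err != nil {
			return err
		}
	}
	return nil
}

// timedOut reports whether err was caused by the deadline expiring.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

// demo runs one scrape that fits its budget and one that does not.
func demo() (fast, slow error) {
	fast = collectOnce(500*time.Millisecond, time.Millisecond, time.Millisecond)
	slow = collectOnce(20*time.Millisecond, time.Millisecond, time.Second)
	return fast, slow
}

func main() {
	fast, slow := demo()
	fmt.Println("fast:", fast)
	fmt.Println("slow timed out:", timedOut(slow))
}
```

A single scrape-wide deadline bounds the total walk; a per-call timeout derived from it would additionally cap each individual request.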

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@util/collectors/channelz.go` around lines 233 - 240, The gRPC calls currently
use context.Background() which can hang; replace those calls to use a
cancellable context from the collector/walker instead (e.g., use w.ctx or a
context.WithTimeout(w.ctx, <reasonableDuration>)) so each call (GetTopChannels
in the initial loop and all gRPC calls inside walkChannel and walkSubchannel)
honors timeouts/cancellation. Ensure the walker has a ctx field initialized from
Collect (or accept a context parameter in Collect and pass it to the walker) and
use that ctx for all client calls (GetTopChannels, and the calls inside
walkChannel and walkSubchannel) so goroutines won’t leak if the server is
unresponsive.

435-446: Inconsistent timestamp validation between stream-created and message timestamps.

Lines 435-440 use hasUsableTimestamp() which checks for non-zero values, but lines 441-446 only check ts != nil. If a socket has never sent/received a message, the timestamp could be zero, leading to a metric with value 0 (Unix epoch 1970-01-01).

Consider using hasUsableTimestamp() consistently for all timestamp metrics, or document why the behavior differs.

♻️ Suggested fix for consistency

-	if ts := data.GetLastMessageSentTimestamp(); ts != nil {
+	if ts := data.GetLastMessageSentTimestamp(); hasUsableTimestamp(ts) {
 		w.ch <- prometheus.MustNewConstMetric(cDesc(w.collector.socketLastMessageDesc), prometheus.GaugeValue, timestampSeconds(ts.AsTime()), appendLabels(labels, "sent")...)
 	}
-	if ts := data.GetLastMessageReceivedTimestamp(); ts != nil {
+	if ts := data.GetLastMessageReceivedTimestamp(); hasUsableTimestamp(ts) {
 		w.ch <- prometheus.MustNewConstMetric(cDesc(w.collector.socketLastMessageDesc), prometheus.GaugeValue, timestampSeconds(ts.AsTime()), appendLabels(labels, "received")...)
 	}
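
A guard of this kind might look like the sketch below. It is one plausible reading of `hasUsableTimestamp`, not the PR's implementation: `stamp` is a hypothetical stand-in that mirrors the Seconds/Nanos fields of a protobuf timestamp so the example needs no third-party imports, and `usable`, `seconds`, and `emit` are illustrative names.

```go
package main

import "fmt"

// stamp is a hypothetical stand-in mirroring the Seconds/Nanos fields of a
// protobuf timestamp.
type stamp struct {
	Seconds int64
	Nanos   int32
}

// usable reports whether ts is set and not the zero value. A nil-only check
// would let a zero timestamp through and export the Unix epoch.
func usable(ts *stamp) bool {
	return ts != nil && (ts.Seconds != 0 || ts.Nanos != 0)
}

// seconds converts ts to fractional Unix seconds.
func seconds(ts *stamp) float64 {
	return float64(ts.Seconds) + float64(ts.Nanos)/1e9
}

// emit returns the gauge value and whether a sample should be emitted at all.
func emit(ts *stamp) (float64, bool) {
	if !usable(ts) {
		return 0, false
	}
	return seconds(ts), true
}

func main() {
	for _, ts := range []*stamp{nil, {}, {Seconds: 1700000000, Nanos: 500000000}} {
		v, ok := emit(ts)
		fmt.Println(v, ok)
	}
}
```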

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@util/collectors/channelz.go` around lines 435 - 446, The message timestamp
checks are inconsistent: replace the nil-only checks in the block using
data.GetLastMessageSentTimestamp() and data.GetLastMessageReceivedTimestamp()
with the same validation used for stream-created timestamps by calling
hasUsableTimestamp(ts) (or an equivalent non-zero check) before emitting metrics
to avoid reporting epoch-zero values; update the two if conditions that
currently use "ts != nil" to use hasUsableTimestamp(ts) and keep the rest of the
metric creation using cDesc(w.collector.socketLastMessageDesc) and
appendLabels(labels, "sent"/"received") unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@util/collectors/channelz_test.go`:
- Around line 486-493: The test currently calls grpc.DialContext to connect to
the bufconn listener (see grpc.DialContext usage around "bufnet" and the
WithContextDialer that calls listener.Dial); replace that call with
grpc.NewClient and use the passthrough:/// scheme (e.g.,
"passthrough:///bufnet") so the bufconn dialer is honored, keeping the existing
WithContextDialer(listener.Dial) and
grpc.WithTransportCredentials(insecure.NewCredentials()) options intact so the
test behavior remains the same.



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8640cd42-0329-4a49-817f-f851aaba6068

📥 Commits

Reviewing files that changed from the base of the PR and between 1b3c5b5 and f5a9f44.

📒 Files selected for processing (2)

util/collectors/channelz.go
util/collectors/channelz_test.go

Signed-off-by: zyguan <zhongyangguan@gmail.com>

zyguan · 2026-04-02T10:14:38Z

/retest

zyguan · 2026-04-02T14:29:22Z

@cfzjywxk PTAL

cfzjywxk

LGTM. Is the next step to adapt it to the client-go repo?

ti-chi-bot · 2026-04-03T02:58:45Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, lcwangchao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cfzjywxk]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-03T02:58:47Z

[LGTM Timeline notifier]

Timeline:

2026-04-03 02:58:01.716665989 +0000 UTC m=+493086.922026046: ☑️ agreed by lcwangchao.
2026-04-03 02:58:45.995219488 +0000 UTC m=+493131.200579535: ☑️ agreed by cfzjywxk.

util: add grpc channelz collector for prometheus

f5a9f44

Signed-off-by: zyguan <zhongyangguan@gmail.com>

ti-chi-bot Bot added the `dco-signoff: yes` label (indicates the PR's author has signed the DCO) on Apr 2, 2026

ti-chi-bot Bot added the `size/XXL` label (denotes a PR that changes 1000+ lines, ignoring generated files) on Apr 2, 2026

coderabbitai Bot reviewed Apr 2, 2026


Comment thread util/collectors/channelz_test.go Outdated

fix the lint issue

4931328

Signed-off-by: zyguan <zhongyangguan@gmail.com>

cfzjywxk requested review from cfzjywxk, ekexium and lcwangchao April 3, 2026 02:05

lcwangchao approved these changes Apr 3, 2026


ti-chi-bot Bot added the `needs-1-more-lgtm` label (indicates a PR needs 1 more LGTM) on Apr 3, 2026

cfzjywxk approved these changes Apr 3, 2026


ti-chi-bot Bot added the lgtm label Apr 3, 2026

ti-chi-bot Bot added the `approved` label and removed the `needs-1-more-lgtm` label (indicates a PR needs 1 more LGTM) on Apr 3, 2026

ti-chi-bot Bot merged commit a888f42 into tikv:master Apr 3, 2026
15 of 16 checks passed

ti-chi-bot[bot] merged 2 commits into tikv:master from zyguan:dev/channelz-collector
