
util: add grpc channelz collector for prometheus#1938

Merged
ti-chi-bot[bot] merged 2 commits into tikv:master from zyguan:dev/channelz-collector
Apr 3, 2026

Conversation

@zyguan
Contributor

@zyguan zyguan commented Apr 2, 2026

The collector walks client-side channelz objects (Channel, Subchannel, Socket) via the gRPC channelz API and exports raw channel/socket metrics with configurable namespace, optional filtering, and optional local/remote socket labels. Channel state and trace metrics are opt-in via IncludeChannelState and IncludeChannelTrace, while collector self-metrics are limited to fetch error counters.

Summary by CodeRabbit

  • New Features

    • Added a Prometheus collector that exports detailed gRPC Channelz metrics: channel/subchannel/socket activity, call counts and timestamps, optional connectivity-state and trace metrics, keepalive and flow-control data, address normalization, and configurable filtering/labeling.
  • Tests

    • Added end-to-end tests validating metric emission, label sets, filtering/traversal behavior, timestamp handling, and fetch-error reporting.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot Bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Apr 2, 2026
@coderabbitai

coderabbitai Bot commented Apr 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c33009a1-5495-4a00-a5ec-7605ecb52b55

📥 Commits

Reviewing files that changed from the base of the PR and between f5a9f44 and 4931328.

📒 Files selected for processing (1)
  • util/collectors/channelz_test.go
✅ Files skipped from review due to trivial changes (1)
  • util/collectors/channelz_test.go

📝 Walkthrough

Walkthrough

Adds a Prometheus collector that traverses gRPC Channelz (TopChannels → Channels/Subchannels → Sockets) via a ChannelzClient, emitting metrics for calls, streams, sockets, optional channel state and trace events, with configurable filtering and per-scrape de-duplication.
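The per-scrape de-duplication described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `node`, `walker`, and `demoEmitted` are hypothetical names, channels and subchannels are merged into one `node` type for brevity, and the "metrics" are plain strings. The idea is that a fresh set of seen IDs is built for each scrape, so an object reachable through several parents is emitted only once.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a channelz Channel or Subchannel:
// it references child nodes and sockets by ID.
type node struct {
	children []int64
	sockets  []int64
}

// walker holds the state of a single scrape. A new walker is created for
// every scrape, so the seen sets never carry over between scrapes.
type walker struct {
	nodes       map[int64]node
	seenNodes   map[int64]struct{}
	seenSockets map[int64]struct{}
	emitted     []string
}

func newWalker(nodes map[int64]node) *walker {
	return &walker{
		nodes:       nodes,
		seenNodes:   make(map[int64]struct{}),
		seenSockets: make(map[int64]struct{}),
	}
}

// walk emits a node and its sockets at most once, then descends into children.
func (w *walker) walk(id int64) {
	if _, dup := w.seenNodes[id]; dup {
		return
	}
	w.seenNodes[id] = struct{}{}
	n, ok := w.nodes[id]
	if !ok {
		return // a real collector would count a fetch error here
	}
	w.emitted = append(w.emitted, fmt.Sprintf("node:%d", id))
	for _, s := range n.sockets {
		if _, dup := w.seenSockets[s]; dup {
			continue
		}
		w.seenSockets[s] = struct{}{}
		w.emitted = append(w.emitted, fmt.Sprintf("socket:%d", s))
	}
	for _, c := range n.children {
		w.walk(c)
	}
}

// demoEmitted walks a small hierarchy in which node 2 and socket 10 are
// each reachable through two different parents.
func demoEmitted() []string {
	nodes := map[int64]node{
		1: {children: []int64{2, 3}},
		2: {sockets: []int64{10}},
		3: {children: []int64{2}, sockets: []int64{10, 11}},
	}
	w := newWalker(nodes)
	w.walk(1)
	return w.emitted
}

func main() {
	fmt.Println(demoEmitted())
}
```

Emitting each object once matters because a Prometheus registry reports an error when the same metric name and label values are collected twice in one gather.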

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Channelz Collector (util/collectors/channelz.go) | New Prometheus collector NewChannelzCollector that walks Channelz hierarchy, emits metrics for channels/subchannels/sockets, supports ChannelzCollectorOpts (Filter, emit state, emit trace, socket label toggles), de-duplication maps, and per-API fetch error counters. |
| Collector Tests (util/collectors/channelz_test.go) | End-to-end tests using an in-memory bufconn fake Channelz server; exercises pagination, per-method error injection, filter behavior, optional state/trace metrics, socket label toggles, timestamp handling, and asserts emitted metric families and labels. |
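As a rough illustration of the per-API fetch error counters mentioned in the table above, the sketch below keeps one counter per channelz method name. The type `fetchErrors` and its methods are hypothetical; the actual collector presumably exposes these as Prometheus counters, which is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fetchErrors counts failed channelz calls per API method name.
// The zero value is ready to use.
type fetchErrors struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// observe increments the counter for method when err is non-nil.
func (f *fetchErrors) observe(method string, err error) {
	if err == nil {
		return
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.counts == nil {
		f.counts = make(map[string]uint64)
	}
	f.counts[method]++
}

// get returns the current error count for method.
func (f *fetchErrors) get(method string) uint64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.counts[method]
}

// demoErrors records one successful and two failed calls.
func demoErrors() (top, socket uint64) {
	var f fetchErrors
	f.observe("GetTopChannels", nil)
	f.observe("GetSocket", errors.New("unavailable"))
	f.observe("GetSocket", errors.New("unavailable"))
	return f.get("GetTopChannels"), f.get("GetSocket")
}

func main() {
	fmt.Println(demoErrors())
}
```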

Sequence Diagram(s)

sequenceDiagram
  participant Prometheus as Prometheus Scrape
  participant Collector as ChannelzCollector
  participant Channelz as Channelz gRPC Server
  participant Registry as Prometheus Registry

  Prometheus->>Collector: HTTP scrape -> Collect()
  Collector->>Channelz: GetTopChannels(page_token)
  Channelz-->>Collector: TopChannels(list, next_token)
  loop per channel/subchannel/socket
    Collector->>Channelz: GetChannel / GetSubchannel / GetSocket(id)
    Channelz-->>Collector: Channel/Subchannel/Socket data
    Collector->>Collector: apply Filter? decide emit & descend
    Collector->>Registry: expose metrics (labels, gauges, counters)
  end
  Collector-->>Prometheus: HTTP response (metrics)
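The paging step at the top of the diagram can be sketched as a simple loop. Note that the diagram labels it `page_token`/`next_token`, while the channelz v1 service, to my understanding, pages `GetTopChannels` by a starting channel ID and signals the last page with an `end` flag. The sketch follows the latter shape with a fake in-memory server; `page`, `fetchFunc`, `listAll`, and `fakeServer` are hypothetical names, not the PR's.

```go
package main

import "fmt"

// page mimics a GetTopChannels response: a batch of channel IDs plus an
// end flag that is true when no further pages remain.
type page struct {
	ids []int64
	end bool
}

// fetchFunc is a hypothetical stand-in for the RPC. It returns channels
// whose ID is >= startID, at most maxResults of them.
type fetchFunc func(startID int64, maxResults int) (page, error)

// listAll pages through fetch until the server reports the end, resuming
// each request just past the last ID already received.
func listAll(fetch fetchFunc, maxResults int) ([]int64, error) {
	var all []int64
	start := int64(0)
	for {
		p, err := fetch(start, maxResults)
		if err != nil {
			return all, err
		}
		all = append(all, p.ids...)
		if p.end || len(p.ids) == 0 {
			return all, nil
		}
		start = p.ids[len(p.ids)-1] + 1
	}
}

// fakeServer serves the given ascending IDs in pages.
func fakeServer(ids []int64) fetchFunc {
	return func(startID int64, maxResults int) (page, error) {
		var out []int64
		i := 0
		for ; i < len(ids) && len(out) < maxResults; i++ {
			if ids[i] >= startID {
				out = append(out, ids[i])
			}
		}
		return page{ids: out, end: i == len(ids)}, nil
	}
}

// demoPaged lists five channels two at a time and returns how many arrived.
func demoPaged() int {
	all, err := listAll(fakeServer([]int64{1, 2, 5, 7, 9}), 2)
	if err != nil {
		return -1
	}
	return len(all)
}

func main() {
	fmt.Println(demoPaged())
}
```

The extra `len(p.ids) == 0` check guards against looping forever if a server returns an empty page without setting the end flag.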

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I hopped through channels, soft and fleet,
Counting calls with tiny feet,
Subchannels hummed, sockets chimed,
Metrics gathered, neatly timed,
A rabbit's scrape — tidy and sweet. 🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 3.57%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a gRPC channelz collector for Prometheus monitoring, which aligns with the primary purpose of the new files introduced in the changeset. |


@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 2, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
util/collectors/channelz.go (3)

498-500: Remove unnecessary identity function.

The cDesc function simply returns its input unchanged and adds no value. Consider removing it and using the descriptors directly.

♻️ Suggested removal
-func cDesc(desc *prometheus.Desc) *prometheus.Desc {
-	return desc
-}

Then replace all cDesc(w.collector.channelCallsDesc) calls with just w.collector.channelCallsDesc.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@util/collectors/channelz.go` around lines 498 - 500, Remove the trivial
identity function cDesc which simply returns its input; delete the cDesc
function and replace all calls like cDesc(w.collector.channelCallsDesc) with the
descriptor directly (e.g., w.collector.channelCallsDesc) across the file (and
any other callers), updating imports/usages if necessary to compile. Ensure no
other logic depends on cDesc and run tests/build to confirm no references
remain.

233-240: Consider adding a context timeout for gRPC calls.

Using context.Background() without a timeout means the collector could hang indefinitely if the channelz server is unresponsive. While Prometheus enforces a scrape timeout, the underlying goroutine may leak. Consider accepting a context from Collect or using a fixed timeout.

This pattern repeats in walkChannel (lines 279, 287, 295) and walkSubchannel (lines 326, 334, 342).

♻️ Suggested approach
 func (c *channelzCollector) Collect(ch chan<- prometheus.Metric) {
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
 	w := channelzWalker{
 		collector:       c,
 		ch:              ch,
+		ctx:             ctx,
 		now:             time.Now(),
 		seenChannels:    make(map[int64]struct{}),
 		seenSubchannels: make(map[int64]struct{}),
 		seenSockets:     make(map[int64]struct{}),
 	}
 	w.walkTopChannels()
 	c.collectFetchErrorMetrics(ch)
 }

Then use w.ctx in all gRPC calls within the walker methods.
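To make the effect of the suggested timeout concrete, here is a standard-library sketch in which a stand-in for an unresponsive RPC is cut off by a deadline shared across the whole scrape. `slowCall`, `collectOnce`, and `demoTimeout` are hypothetical names, and the durations are shortened for the demo. Since `prometheus.Collector`'s `Collect` method takes only the metrics channel, deriving a fixed timeout inside `Collect`, as in the suggestion above, is likely the simpler of the two options.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// slowCall is a hypothetical stand-in for a channelz RPC: it returns when
// the delay elapses or when the context is done, whichever comes first.
func slowCall(ctx context.Context, delay time.Duration) error {
	t := time.NewTimer(delay)
	defer t.Stop()
	select {
	case <-t.C:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// collectOnce derives one deadline for the whole scrape and passes the
// same context to every call, so the total time is bounded by timeout.
func collectOnce(timeout time.Duration, delays ...time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for _, d := range delays {
		if err := slowCall(ctx, d); err != nil {
			return err
		}
	}
	return nil
}

// demoTimeout reports whether a responsive and an unresponsive server
// each hit the deadline.
func demoTimeout() (fastTimedOut, slowTimedOut bool) {
	fast := collectOnce(500*time.Millisecond, time.Millisecond)
	slow := collectOnce(20*time.Millisecond, 5*time.Second)
	return errors.Is(fast, context.DeadlineExceeded),
		errors.Is(slow, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demoTimeout())
}
```

With `context.Background()` alone, the second case would block for the full five seconds; with the shared deadline it returns after roughly 20 ms.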

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@util/collectors/channelz.go` around lines 233 - 240, The gRPC calls currently
use context.Background() which can hang; replace those calls to use a
cancellable context from the collector/walker instead (e.g., use w.ctx or a
context.WithTimeout(w.ctx, <reasonableDuration>)) so each call (GetTopChannels
in the initial loop and all gRPC calls inside walkChannel and walkSubchannel)
honors timeouts/cancellation. Ensure the walker has a ctx field initialized from
Collect (or accept a context parameter in Collect and pass it to the walker) and
use that ctx for all client calls (GetTopChannels, and the calls inside
walkChannel and walkSubchannel) so goroutines won’t leak if the server is
unresponsive.

435-446: Inconsistent timestamp validation between stream-created and message timestamps.

Lines 435-440 use hasUsableTimestamp() which checks for non-zero values, but lines 441-446 only check ts != nil. If a socket has never sent/received a message, the timestamp could be zero, leading to a metric with value 0 (Unix epoch 1970-01-01).

Consider using hasUsableTimestamp() consistently for all timestamp metrics, or document why the behavior differs.

♻️ Suggested fix for consistency
-	if ts := data.GetLastMessageSentTimestamp(); ts != nil {
+	if ts := data.GetLastMessageSentTimestamp(); hasUsableTimestamp(ts) {
 		w.ch <- prometheus.MustNewConstMetric(cDesc(w.collector.socketLastMessageDesc), prometheus.GaugeValue, timestampSeconds(ts.AsTime()), appendLabels(labels, "sent")...)
 	}
-	if ts := data.GetLastMessageReceivedTimestamp(); ts != nil {
+	if ts := data.GetLastMessageReceivedTimestamp(); hasUsableTimestamp(ts) {
 		w.ch <- prometheus.MustNewConstMetric(cDesc(w.collector.socketLastMessageDesc), prometheus.GaugeValue, timestampSeconds(ts.AsTime()), appendLabels(labels, "received")...)
 	}
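The difference between a nil-only check and a non-zero check can be shown with a small sketch. It does not reproduce the PR's `hasUsableTimestamp`; `stamp`, `usable`, and `emit` are hypothetical stand-ins, with `stamp` mirroring the seconds/nanos layout of a protobuf timestamp.

```go
package main

import "fmt"

// stamp is a hypothetical stand-in for a protobuf Timestamp: seconds and
// nanoseconds since the Unix epoch.
type stamp struct {
	Seconds int64
	Nanos   int32
}

// usable reports whether ts is set and is not the zero timestamp, which
// would otherwise be exported as 1970-01-01.
func usable(ts *stamp) bool {
	return ts != nil && (ts.Seconds != 0 || ts.Nanos != 0)
}

// emit returns the gauge value in seconds and whether a sample should be
// emitted at all.
func emit(ts *stamp) (float64, bool) {
	if !usable(ts) {
		return 0, false
	}
	return float64(ts.Seconds) + float64(ts.Nanos)/1e9, true
}

// demoStamps checks a nil, a zero, and a real timestamp.
func demoStamps() [3]bool {
	_, a := emit(nil)
	_, b := emit(&stamp{})
	_, c := emit(&stamp{Seconds: 1775000000})
	return [3]bool{a, b, c}
}

func main() {
	fmt.Println(demoStamps())
}
```

A check of only `ts != nil` would emit a sample for the second case, a non-nil but zero timestamp, which is the situation the comment above describes.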
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@util/collectors/channelz.go` around lines 435 - 446, The message timestamp
checks are inconsistent: replace the nil-only checks in the block using
data.GetLastMessageSentTimestamp() and data.GetLastMessageReceivedTimestamp()
with the same validation used for stream-created timestamps by calling
hasUsableTimestamp(ts) (or an equivalent non-zero check) before emitting metrics
to avoid reporting epoch-zero values; update the two if conditions that
currently use "ts != nil" to use hasUsableTimestamp(ts) and keep the rest of the
metric creation using cDesc(w.collector.socketLastMessageDesc) and
appendLabels(labels, "sent"/"received") unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@util/collectors/channelz_test.go`:
- Around line 486-493: The test currently calls grpc.DialContext to connect to
the bufconn listener (see grpc.DialContext usage around "bufnet" and the
WithContextDialer that calls listener.Dial); replace that call with
grpc.NewClient and use the passthrough:/// scheme (e.g.,
"passthrough:///bufnet") so the bufconn dialer is honored, keeping the existing
WithContextDialer(listener.Dial) and
grpc.WithTransportCredentials(insecure.NewCredentials()) options intact so the
test behavior remains the same.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8640cd42-0329-4a49-817f-f851aaba6068

📥 Commits

Reviewing files that changed from the base of the PR and between 1b3c5b5 and f5a9f44.

📒 Files selected for processing (2)
  • util/collectors/channelz.go
  • util/collectors/channelz_test.go

Comment thread util/collectors/channelz_test.go Outdated
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@zyguan
Contributor Author

zyguan commented Apr 2, 2026

/retest

@zyguan
Contributor Author

zyguan commented Apr 2, 2026

@cfzjywxk PTAL

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 3, 2026
Contributor

@cfzjywxk cfzjywxk left a comment


LGTM. Is the next step to adapt it to the client-go repo?

@ti-chi-bot ti-chi-bot Bot added the lgtm label Apr 3, 2026
@ti-chi-bot

ti-chi-bot Bot commented Apr 3, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, lcwangchao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added approved and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 3, 2026
@ti-chi-bot

ti-chi-bot Bot commented Apr 3, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-03 02:58:01.716665989 +0000 UTC m=+493086.922026046: ☑️ agreed by lcwangchao.
  • 2026-04-03 02:58:45.995219488 +0000 UTC m=+493131.200579535: ☑️ agreed by cfzjywxk.

@ti-chi-bot ti-chi-bot Bot merged commit a888f42 into tikv:master Apr 3, 2026
15 of 16 checks passed

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.


3 participants