util: add grpc channelz collector for prometheus #1938
ti-chi-bot[bot] merged 2 commits into tikv:master
Conversation
Signed-off-by: zyguan <zhongyangguan@gmail.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)

📝 Walkthrough
Adds a Prometheus collector that traverses gRPC Channelz (TopChannels → Channels/Subchannels → Sockets) via a ChannelzClient, emitting metrics for calls, streams, and sockets, plus optional channel state and trace events, with configurable filtering and per-scrape de-duplication.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Prometheus as Prometheus Scrape
    participant Collector as ChannelzCollector
    participant Channelz as Channelz gRPC Server
    participant Registry as Prometheus Registry
    Prometheus->>Collector: HTTP scrape -> Collect()
    Collector->>Channelz: GetTopChannels(page_token)
    Channelz-->>Collector: TopChannels(list, next_token)
    loop per channel/subchannel/socket
        Collector->>Channelz: GetChannel / GetSubchannel / GetSocket(id)
        Channelz-->>Collector: Channel/Subchannel/Socket data
        Collector->>Collector: apply Filter? decide emit & descend
        Collector->>Registry: expose metrics (labels, gauges, counters)
    end
    Collector-->>Prometheus: HTTP response (metrics)
```
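The pagination and per-scrape de-duplication loop shown in the diagram can be sketched with a stdlib-only Go program; the `page` type and `pages` map below are hypothetical stand-ins for the channelz `GetTopChannels` RPC and its paged responses, not the PR's actual types:

```go
package main

import "fmt"

// page simulates one GetTopChannels response (hypothetical shape): a batch
// of channel IDs, the start ID of the next page, and an end-of-list flag.
type page struct {
	ids  []int64
	next int64
	end  bool
}

// pages stands in for the channelz server; real code would issue RPCs.
var pages = map[int64]page{
	0: {ids: []int64{1, 2}, next: 3},
	3: {ids: []int64{3}, end: true},
}

// walkTopChannels pages through the channel list, de-duplicating IDs seen
// within a single scrape, and returns the IDs in walk order.
func walkTopChannels() []int64 {
	seen := make(map[int64]struct{})
	var order []int64
	for start := int64(0); ; {
		p := pages[start]
		for _, id := range p.ids {
			if _, dup := seen[id]; dup {
				continue // already walked in this scrape
			}
			seen[id] = struct{}{}
			order = append(order, id)
		}
		if p.end {
			break
		}
		start = p.next
	}
	return order
}

func main() {
	fmt.Println(walkTopChannels()) // [1 2 3]
}
```

The real walker then descends from each channel into subchannels and sockets, applying the same seen-set idea per object kind.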
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
🧹 Nitpick comments (3)
util/collectors/channelz.go (3)
498-500: Remove unnecessary identity function.

The `cDesc` function simply returns its input unchanged and adds no value. Consider removing it and using the descriptors directly.

♻️ Suggested removal

```diff
-func cDesc(desc *prometheus.Desc) *prometheus.Desc {
-	return desc
-}
```

Then replace all `cDesc(w.collector.channelCallsDesc)` calls with just `w.collector.channelCallsDesc`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@util/collectors/channelz.go` around lines 498 - 500, Remove the trivial identity function cDesc which simply returns its input; delete the cDesc function and replace all calls like cDesc(w.collector.channelCallsDesc) with the descriptor directly (e.g., w.collector.channelCallsDesc) across the file (and any other callers), updating imports/usages if necessary to compile. Ensure no other logic depends on cDesc and run tests/build to confirm no references remain.
233-240: Consider adding a context timeout for gRPC calls.

Using `context.Background()` without a timeout means the collector could hang indefinitely if the channelz server is unresponsive. While Prometheus enforces a scrape timeout, the underlying goroutine may leak. Consider accepting a context in `Collect` or using a fixed timeout. This pattern repeats in `walkChannel` (lines 279, 287, 295) and `walkSubchannel` (lines 326, 334, 342).

♻️ Suggested approach

```diff
 func (c *channelzCollector) Collect(ch chan<- prometheus.Metric) {
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
 	w := channelzWalker{
 		collector:       c,
 		ch:              ch,
+		ctx:             ctx,
 		now:             time.Now(),
 		seenChannels:    make(map[int64]struct{}),
 		seenSubchannels: make(map[int64]struct{}),
 		seenSockets:     make(map[int64]struct{}),
 	}
 	w.walkTopChannels()
 	c.collectFetchErrorMetrics(ch)
 }
```

Then use `w.ctx` in all gRPC calls within the walker methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@util/collectors/channelz.go` around lines 233 - 240, The gRPC calls currently use context.Background() which can hang; replace those calls to use a cancellable context from the collector/walker instead (e.g., use w.ctx or a context.WithTimeout(w.ctx, <reasonableDuration>)) so each call (GetTopChannels in the initial loop and all gRPC calls inside walkChannel and walkSubchannel) honors timeouts/cancellation. Ensure the walker has a ctx field initialized from Collect (or accept a context parameter in Collect and pass it to the walker) and use that ctx for all client calls (GetTopChannels, and the calls inside walkChannel and walkSubchannel) so goroutines won’t leak if the server is unresponsive.
435-446: Inconsistent timestamp validation between stream-created and message timestamps.

Lines 435-440 use `hasUsableTimestamp()`, which checks for non-zero values, but lines 441-446 only check `ts != nil`. If a socket has never sent/received a message, the timestamp could be zero, leading to a metric with value 0 (Unix epoch 1970-01-01).

Consider using `hasUsableTimestamp()` consistently for all timestamp metrics, or document why the behavior differs.

♻️ Suggested fix for consistency

```diff
-	if ts := data.GetLastMessageSentTimestamp(); ts != nil {
+	if ts := data.GetLastMessageSentTimestamp(); hasUsableTimestamp(ts) {
 		w.ch <- prometheus.MustNewConstMetric(cDesc(w.collector.socketLastMessageDesc), prometheus.GaugeValue, timestampSeconds(ts.AsTime()), appendLabels(labels, "sent")...)
 	}
-	if ts := data.GetLastMessageReceivedTimestamp(); ts != nil {
+	if ts := data.GetLastMessageReceivedTimestamp(); hasUsableTimestamp(ts) {
 		w.ch <- prometheus.MustNewConstMetric(cDesc(w.collector.socketLastMessageDesc), prometheus.GaugeValue, timestampSeconds(ts.AsTime()), appendLabels(labels, "received")...)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@util/collectors/channelz.go` around lines 435 - 446, The message timestamp checks are inconsistent: replace the nil-only checks in the block using data.GetLastMessageSentTimestamp() and data.GetLastMessageReceivedTimestamp() with the same validation used for stream-created timestamps by calling hasUsableTimestamp(ts) (or an equivalent non-zero check) before emitting metrics to avoid reporting epoch-zero values; update the two if conditions that currently use "ts != nil" to use hasUsableTimestamp(ts) and keep the rest of the metric creation using cDesc(w.collector.socketLastMessageDesc) and appendLabels(labels, "sent"/"received") unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@util/collectors/channelz_test.go`:
- Around line 486-493: The test currently calls grpc.DialContext to connect to
the bufconn listener (see grpc.DialContext usage around "bufnet" and the
WithContextDialer that calls listener.Dial); replace that call with
grpc.NewClient and use the passthrough:/// scheme (e.g.,
"passthrough:///bufnet") so the bufconn dialer is honored, keeping the existing
WithContextDialer(listener.Dial) and
grpc.WithTransportCredentials(insecure.NewCredentials()) options intact so the
test behavior remains the same.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8640cd42-0329-4a49-817f-f851aaba6068
📒 Files selected for processing (2)
util/collectors/channelz.go
util/collectors/channelz_test.go
Signed-off-by: zyguan <zhongyangguan@gmail.com>
/retest
@cfzjywxk PTAL
cfzjywxk
left a comment
LGTM, is the next step to adapt it to the client-go repo?
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, lcwangchao

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
[LGTM Timeline notifier]
Timeline:
The collector walks client-side channelz objects (Channel, Subchannel, Socket) via the gRPC channelz API and exports raw channel/socket metrics with configurable namespace, optional filtering, and optional local/remote socket labels. Channel state and trace metrics are opt-in via IncludeChannelState and IncludeChannelTrace, while collector self-metrics are limited to fetch error counters.