Skip to content

feat: Metrics for queues and event writer#922

Closed
jannfis wants to merge 3 commits into
argoproj-labs:mainfrom
jannfis:feat/queue-metrics
Closed

feat: Metrics for queues and event writer#922
jannfis wants to merge 3 commits into
argoproj-labs:mainfrom
jannfis:feat/queue-metrics

Conversation

@jannfis

@jannfis jannfis commented Apr 22, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do / why we need it:

Add metrics for queue depths and events in the event writer

Which issue(s) this PR fixes:

Fixes #?

How to test changes / Special notes to the reviewer:

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics for event queue depth monitoring across agent and principal components
    • Introduced event writer state metrics tracking pending events, resend eligibility, and backoff status
    • Added counter for tracking exhausted event retries
  • Documentation

    • Updated metrics documentation with new metric definitions, including queue depth, event writer state, and retry exhaustion drop counters

Signed-off-by: jannfis <jann@mistrust.net>
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 527beffc-61b7-4a96-a171-f29a2f5dfc16

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds comprehensive Prometheus metrics instrumentation for event writing and queue depth monitoring across principal and agent components. New collectors track event queue depths, EventWriter pending/resend states, and retry exhaustion events through optional callbacks and observer patterns integrated into the EventWriter.

Changes

Cohort / File(s) Summary
EventWriter Core Enhancement
internal/event/event_writer.go, internal/event/event_writer_test.go
Added EventWriterOption pattern with WithOnRetryExhausted callback configuration, ObserveSentResendState observer for pending/resend/backoff counts, ObserveWriters iteration on EventWritersMap, and updated retrySentEvent to invoke callback when retry exhaustion occurs. Comprehensive test coverage validates observer counts, callback invocation, and map iteration.
Queue Depth Metrics
internal/metrics/queue_depth.go, internal/metrics/queue_depth_test.go
Introduced new eventQueueDepthCollector implementing prometheus.Collector emitting principal_event_queue_depth or agent_event_queue_depth gauges with queue and direction labels. Includes registration helpers with sync.Once for safe single-process registration and comprehensive tests validating collector output, empty queues, and metric describe output.
Queue Observable
internal/queue/queue.go, internal/queue/queue_test.go
Added ObserveDepths method on SendRecvQueues to expose per-queue pair send/receive lengths via callback. Test validates correct depth reporting and updates after dequeue operations.
Agent Instrumentation
agent/agent.go, agent/connection.go
Exposed new public method CurrentEventWriter() returning the agent's outbound eventWriter. Updated handleStreamEvents to build and apply EventWriterOption slice with retry exhaustion callback (when a.metrics != nil) during EventWriter creation.
Principal Instrumentation
principal/server.go, principal/apis/eventstream/eventstream.go
Registered queue depth and event writer metrics collectors in NewServer when metrics port is enabled. Updated Server.Subscribe to construct and pass EventWriterOption slice with retry exhaustion callback to event.NewEventWriter.
Documentation
docs/operations/metrics.md
Added documentation for new metrics: event queue depth gauges, EventWriter pending/resend/backoff gauges, and retry exhaustion drop counters. Extended labels section with queue, direction, and agent label definitions.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested reviewers

  • jgwest
  • chetan-rns
  • drewbailey
  • mikeshng

Poem

📊 Metrics glimmer, queues shine bright,
🐰 Observing depths with pure delight,
Callbacks fire when retries cease,
Now our systems have sweet peace! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately captures the main objective of the changeset: adding metrics for queue depths and event writer monitoring across principal and agent components.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jannfis

jannfis commented Apr 22, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
internal/queue/queue_test.go (1)

126-147: Minor: inconsistent callback guard style.

The first ObserveDepths callback (Line 128) writes unconditionally while the second (Line 139) guards with if name == "agent1". Since only one pair exists, both work, but mixing styles makes it unclear whether the guard matters. Consider making both consistent — either always unconditional (since only one pair exists here) or always guarded — to keep intent unambiguous.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/queue/queue_test.go` around lines 126 - 147, Make the two
ObserveDepths callbacks consistent by guarding assignments in the first callback
the same way as the second: in the ObserveDepths call where you set gotName,
gotSend, gotRecv, wrap the assignments with if name == "agent1" { ... } (keeping
the second callback as-is that checks name) so both callbacks use the same
intent; update the ObserveDepths lambda that currently assigns unconditionally
to only assign when name == "agent1" and leave uses of sendQ.Get and sendQ.Done
unchanged.
internal/event/event_writer.go (1)

351-358: Callback invoked while holding ew.mu write lock.

onRetryExhausted(resID) is called inside the ew.mu.Lock() critical section. For the metric-increment callers added in this PR this is fine, but this puts a sharp edge on the public WithOnRetryExhausted API: any user-supplied callback that calls back into EventWriter (e.g., Get/Add/Remove) will deadlock, and a slow callback will stall all other writers/readers on this EventWriter.

Consider either (a) documenting on WithOnRetryExhausted that the callback must be non-blocking and must not touch the EventWriter, or (b) invoking the hook after the lock is released:

♻️ Proposed refactor to invoke hook outside the lock
 		ew.mu.Lock()
+		removed := false
 		if cur, ok := ew.sentEvents[resID]; ok && cur == sentMsg {
-			if ew.onRetryExhausted != nil {
-				ew.onRetryExhausted(resID)
-			}
 			delete(ew.sentEvents, resID)
+			removed = true
 		}
 		ew.mu.Unlock()
+		if removed && ew.onRetryExhausted != nil {
+			ew.onRetryExhausted(resID)
+		}
 		return

Note: ew.onRetryExhausted is set once at construction and never mutated, so reading it outside the lock is safe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/event/event_writer.go` around lines 351 - 358, The callback
ew.onRetryExhausted is being invoked while holding ew.mu which can deadlock or
block other operations; change the code to perform the map lookup, deletion, and
capture whether the hook should be called (and a local copy of
ew.onRetryExhausted) while holding ew.mu, then release ew.mu and invoke the
captured callback outside the lock; specifically update the block around
ew.sentEvents[resID] so you still check cur == sentMsg and delete(ew.sentEvents,
resID) under the lock but assign a local variable like cb := ew.onRetryExhausted
and call cb(resID) only after ew.mu.Unlock().
internal/metrics/queue_depth_test.go (1)

63-81: Tighten the collector test against unexpected series.

This test currently passes even if the collector emits extra metric series or an unexpected direction label. Add an exact metric count and a default failure branch so regressions are caught.

Proposed test hardening
 	var sendVal, recvVal float64
 	var seenSend, seenRecv bool
+	require.Len(t, mf.GetMetric(), 2)
 	for _, m := range mf.GetMetric() {
+		require.Equal(t, "agent-a", labelValue(m.GetLabel(), queueDepthLabelQueue))
 		dir := labelValue(m.GetLabel(), queueDepthLabelDirection)
 		switch dir {
 		case "send":
 			seenSend = true
-			require.Equal(t, "agent-a", labelValue(m.GetLabel(), queueDepthLabelQueue))
 			sendVal = m.GetGauge().GetValue()
 		case "recv":
 			seenRecv = true
-			require.Equal(t, "agent-a", labelValue(m.GetLabel(), queueDepthLabelQueue))
 			recvVal = m.GetGauge().GetValue()
+		default:
+			require.Failf(t, "unexpected direction label", "direction=%q", dir)
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/metrics/queue_depth_test.go` around lines 63 - 81, Tighten the test
in queue_depth_test.go by asserting the exact number of series and failing on
unexpected directions: check that len(mf.GetMetric()) == 2 before iterating, and
add a default branch in the switch over labelValue(m.GetLabel(),
queueDepthLabelDirection) that calls require.Failf or t.Fatalf (with a message
including the unexpected direction and metric labels) so any extra or unknown
series will make the test fail; keep the existing checks for "send" and "recv"
(asserting queue label via queueDepthLabelQueue and the gauge values) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@agent/agent.go`:
- Around line 642-646: CurrentEventWriter currently reads a.eventWriter without
synchronization while handleStreamEvents assigns it from another goroutine,
causing a data race; change the Agent field backing eventWriter to a
synchronized holder (either use sync/atomic's atomic.Pointer[*event.EventWriter]
or protect with a sync.RWMutex), update CurrentEventWriter to return the safely
loaded value (use Load() or read under RLock), and update the code in
handleStreamEvents (where it calls event.NewEventWriter and ew.UpdateTarget) to
Store/CompareAndSwap the new writer or perform writes under Lock so reads from
the Prometheus scrape handler see a consistent value.

---

Nitpick comments:
In `@internal/event/event_writer.go`:
- Around line 351-358: The callback ew.onRetryExhausted is being invoked while
holding ew.mu which can deadlock or block other operations; change the code to
perform the map lookup, deletion, and capture whether the hook should be called
(and a local copy of ew.onRetryExhausted) while holding ew.mu, then release
ew.mu and invoke the captured callback outside the lock; specifically update the
block around ew.sentEvents[resID] so you still check cur == sentMsg and
delete(ew.sentEvents, resID) under the lock but assign a local variable like cb
:= ew.onRetryExhausted and call cb(resID) only after ew.mu.Unlock().

In `@internal/metrics/queue_depth_test.go`:
- Around line 63-81: Tighten the test in queue_depth_test.go by asserting the
exact number of series and failing on unexpected directions: check that
len(mf.GetMetric()) == 2 before iterating, and add a default branch in the
switch over labelValue(m.GetLabel(), queueDepthLabelDirection) that calls
require.Failf or t.Fatalf (with a message including the unexpected direction and
metric labels) so any extra or unknown series will make the test fail; keep the
existing checks for "send" and "recv" (asserting queue label via
queueDepthLabelQueue and the gauge values) unchanged.

In `@internal/queue/queue_test.go`:
- Around line 126-147: Make the two ObserveDepths callbacks consistent by
guarding assignments in the first callback the same way as the second: in the
ObserveDepths call where you set gotName, gotSend, gotRecv, wrap the assignments
with if name == "agent1" { ... } (keeping the second callback as-is that checks
name) so both callbacks use the same intent; update the ObserveDepths lambda
that currently assigns unconditionally to only assign when name == "agent1" and
leave uses of sendQ.Get and sendQ.Done unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f468389-8199-4170-aa85-66945b73f136

📥 Commits

Reviewing files that changed from the base of the PR and between dcdc499 and c259330.

📒 Files selected for processing (11)
  • agent/agent.go
  • agent/connection.go
  • docs/operations/metrics.md
  • internal/event/event_writer.go
  • internal/event/event_writer_test.go
  • internal/metrics/queue_depth.go
  • internal/metrics/queue_depth_test.go
  • internal/queue/queue.go
  • internal/queue/queue_test.go
  • principal/apis/eventstream/eventstream.go
  • principal/server.go

Comment thread agent/agent.go
Comment on lines +642 to +646
// CurrentEventWriter returns the outbound event writer when connected, or nil.
// It implements metrics.EventWriterLookup for Prometheus scraping.
func (a *Agent) CurrentEventWriter() *event.EventWriter {
return a.eventWriter
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Data race on a.eventWriter.

CurrentEventWriter() reads a.eventWriter without synchronization, while handleStreamEvents in agent/connection.go (Line 188) assigns a.eventWriter = event.NewEventWriter(...) from a different goroutine. Metrics scrapes happen on the Prometheus HTTP handler goroutine, so this is a concurrent read/write of an interface/pointer value — a data race per the Go memory model, even though the pointer is never reassigned after the first write.

At best the scraper will see nil briefly and skip; at worst -race builds will flag it and on some architectures the read could observe a torn value.

🔒 Proposed fix using atomic.Pointer (or a mutex)
-	eventWriter *event.EventWriter
+	eventWriter atomic.Pointer[event.EventWriter]

Callers then use a.eventWriter.Load() / a.eventWriter.Store(ew) / a.eventWriter.CompareAndSwap(...). CurrentEventWriter() becomes:

func (a *Agent) CurrentEventWriter() *event.EventWriter {
    return a.eventWriter.Load()
}

And in agent/connection.go:

if ew := a.eventWriter.Load(); ew == nil {
    // build opts...
    a.eventWriter.Store(event.NewEventWriter("", stream, ewOpts...))
} else {
    ew.UpdateTarget(stream)
}

A dedicated sync.RWMutex guarding eventWriter would also work.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@agent/agent.go` around lines 642 - 646, CurrentEventWriter currently reads
a.eventWriter without synchronization while handleStreamEvents assigns it from
another goroutine, causing a data race; change the Agent field backing
eventWriter to a synchronized holder (either use sync/atomic's
atomic.Pointer[*event.EventWriter] or protect with a sync.RWMutex), update
CurrentEventWriter to return the safely loaded value (use Load() or read under
RLock), and update the code in handleStreamEvents (where it calls
event.NewEventWriter and ew.UpdateTarget) to Store/CompareAndSwap the new writer
or perform writes under Lock so reads from the Prometheus scrape handler see a
consistent value.

jannfis added 2 commits April 22, 2026 03:31
Signed-off-by: jannfis <jann@mistrust.net>
Signed-off-by: jannfis <jann@mistrust.net>
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 58.33333% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 47.08%. Comparing base (dcdc499) to head (832b067).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
internal/metrics/event_writer.go 55.55% 56 Missing ⚠️
internal/metrics/queue_depth.go 41.46% 24 Missing ⚠️
agent/connection.go 0.00% 6 Missing ⚠️
agent/agent.go 0.00% 4 Missing ⚠️
principal/apis/eventstream/eventstream.go 57.14% 2 Missing and 1 partial ⚠️
principal/server.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
+ Coverage   46.85%   47.08%   +0.23%     
==========================================
  Files         122      124       +2     
  Lines       17524    22213    +4689     
==========================================
+ Hits         8210    10458    +2248     
- Misses       8555    10995    +2440     
- Partials      759      760       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jannfis jannfis mentioned this pull request Apr 29, 2026
1 task
@jannfis

jannfis commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator Author

Superseded by #931

@jannfis jannfis closed this Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants