
feat: configure workqueue's internal queue to support event de-duplication #1003

Draft
chetan-rns wants to merge 5 commits into argoproj-labs:main from chetan-rns:configure-internal-queue

Conversation

chetan-rns commented Jun 17, 2026
Collaborator

What does this PR do / why we need it:

client-go's workqueue lets us customize its internal queue data structure. It also provides a hook called Touch that is invoked when we add an item that already exists in the queue but has not yet been processed. In Touch, we can remove the duplicate item and append the new item to the end of the queue. Instead of storing the entire resource event in the queue, we store event keys and maintain a map from each key to its latest event. Whether an item is de-duplicable depends on which fields make up its key:

  1. De-duplicable (SpecUpdate, StatusUpdate): ResourceID, EventType
  2. Non-deduplicable (ACKs, Resync, etc): ResourceID, EventType, EventID

The EventID is unique for each event and thereby prevents de-duplication.
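
To make the two key shapes concrete, here is a minimal sketch of how such a key could be derived. The names `EventType`, its constants, `isDedupable`, and `keyFor` are hypothetical and chosen for illustration; only the field split between the two categories comes from the description above.

```go
package main

// EventType and the constants below are illustrative names only; they are
// assumptions, not the definitions used in this PR.
type EventType string

const (
	SpecUpdate   EventType = "spec-update"
	StatusUpdate EventType = "status-update"
	Ack          EventType = "ack"
	Resync       EventType = "resync"
)

// EventKey is a comparable struct, so it can be used as a map key and
// compared with ==.
type EventKey struct {
	ResourceID string
	EventType  EventType
	EventID    string // left empty for de-duplicable events
}

// isDedupable reports whether events of this type may be collapsed into the
// latest one for the same resource.
func isDedupable(t EventType) bool {
	return t == SpecUpdate || t == StatusUpdate
}

// keyFor builds the key for an event. De-duplicable events omit the EventID,
// so two updates for the same resource and type produce equal keys. All other
// events include the EventID, which keeps every key distinct.
func keyFor(resourceID string, t EventType, eventID string) EventKey {
	k := EventKey{ResourceID: resourceID, EventType: t}
	if !isDedupable(t) {
		k.EventID = eventID
	}
	return k
}
```

With this shape, `keyFor("app1", SpecUpdate, "e1")` and `keyFor("app1", SpecUpdate, "e2")` compare equal, while the same two calls with `Ack` do not.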

Which issue(s) this PR fixes:

Fixes #?

How to test changes / Special notes to the reviewer:

Checklist

  • Documentation update is required by this PR (and has been updated) OR no documentation update is required.

Summary by CodeRabbit

  • New Features
    • Event Deduplication: The system now automatically deduplicates incoming events based on resource identifier and event type. Duplicate updates for the same resource are processed only once, reducing unnecessary event processing, improving system efficiency, and preventing redundant operations across distributed agents.

Assisted-by: Cursor
Signed-off-by: Chetan Banavikalmutt <chetanrns1997@gmail.com>
coderabbitai Bot commented Jun 17, 2026
Contributor

Walkthrough

Introduces a new dedupeQueue implementation that deduplicates CloudEvents by (ResourceID, EventType) for spec/status updates and keeps all events unique for non-deduplicable types. Exposes a WorkQueue interface, replaces newBoundedQueue everywhere with NewDedupeQueue, and migrates all consumers and tests to the new abstraction.

Changes

DedupeQueue and WorkQueue migration

| Layer / File(s) | Summary |
| --- | --- |
| EventKey, reorderQueue, and dedupeQueue implementation (internal/queue/dedupe_queue.go) | Adds EventKey struct, a generic reorderQueue[T] with Touch-driven tail reordering, and dedupeQueue that wraps a typed rate-limiting workqueue with mutex-protected latestEvents/eventKeys maps, bounded eviction on Add, latest-event return on Get, pointer-tracking cleanup on Done, and a buffered notify channel. |
| WorkQueue interface and QueuePair wiring (internal/queue/queue.go) | Introduces the WorkQueue interface (Add, Get, Done, Len, ShutDown), updates QueuePair's SendQ/RecvQ contracts and queuepair struct fields to WorkQueue, switches Create to call NewDedupeQueue, and aligns GetWithContext to type-assert to *dedupeQueue. |
| dedupeQueue and reorderQueue test suite (internal/queue/dedupe_queue_test.go, internal/queue/queue_test.go) | Adds 20 test functions covering deduplication semantics, FIFO ordering, bounded eviction, Done lifecycle, shutdown, concurrency, notification signaling, and all reorderQueue Touch edge cases. Updates existing queue overflow tests to set eventid/resourceid extensions. |
| Consumer migration to WorkQueue interface (internal/resync/resync.go, principal/event.go, principal/callbacks_test.go, internal/queue/mocks/QueuePair.go) | Migrates RequestHandler.sendQ, processRecvQueue, eventProcessor goroutine, and drainQueue test helper from the raw Kubernetes workqueue.TypedRateLimitingInterface to queue.WorkQueue. Updates the generated QueuePair mock RecvQ return type accordingly. |
| Test migration to real dedupeQueue (principal/event_test.go, agent/outbound_test.go) | Replaces wqmock-based queue setup in processRecvQueue-driven tests with a newTestRecvQueue helper that constructs a real NewDedupeQueue. Adds Done calls in three agent outbound tests to satisfy the dedupeQueue lifecycle after Get. |
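
As an illustration of the Touch-driven tail reordering mentioned in the table, here is a standalone sketch of a reorder queue. It does not import client-go. The method names Push, Len, and Pop are an assumption about the shape of the pluggable queue; only Touch is named in the PR description, and the PR's actual reorderQueue may differ.

```go
package main

// reorderQueue is a FIFO of unique items in which Touch moves an item that is
// already queued to the tail. This is an illustrative sketch, not the PR's
// implementation.
type reorderQueue[T comparable] struct {
	items []T
}

// Push appends a new item to the tail.
func (q *reorderQueue[T]) Push(item T) {
	q.items = append(q.items, item)
}

// Len returns the number of queued items.
func (q *reorderQueue[T]) Len() int {
	return len(q.items)
}

// Pop removes and returns the head. It assumes the caller has checked that
// Len() > 0; calling it on an empty queue panics.
func (q *reorderQueue[T]) Pop() T {
	item := q.items[0]
	var zero T
	q.items[0] = zero // drop the reference held by the backing array
	q.items = q.items[1:]
	return item
}

// Touch moves an already queued item to the tail so that a re-added item is
// processed after everything that was queued before the re-add. It is a no-op
// when the item is not queued. The linear scan makes this O(n).
func (q *reorderQueue[T]) Touch(item T) {
	for i, it := range q.items {
		if it == item {
			copy(q.items[i:], q.items[i+1:]) // shift the remainder left by one
			q.items[len(q.items)-1] = item   // place the item at the tail
			return
		}
	}
}
```

After pushing "a", "b", "c" and calling `Touch("a")`, successive Pop calls return "b", "c", "a". A linked list with an index map would avoid the linear scan if queue sizes make it matter.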

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • jgwest
  • jannfis
  • mikeshng

Poem

🐇 Hop hop, no double hops for me,
I dedupe events with glee!
The oldest gets evicted fast,
The latest event always last.
Touch moves my item to the tail,
My bounded queue will never fail! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.98% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: configuring workqueue's internal queue for event de-duplication, which is exactly what the PR implements across multiple files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@chetan-rns

Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Contributor
✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/queue/mocks/QueuePair.go (1)

257-301: ⚠️ Potential issue | 🟠 Major

QueuePair mock is only partially migrated and violates the interface contract.

The interface defines both SendQ and RecvQ to return queue.WorkQueue (internal/queue/queue.go:46-47), but the mock has RecvQ correctly updated to queue.WorkQueue while SendQ still returns workqueue.TypedRateLimitingInterface[*event.Event]. This inconsistency breaks the mock's compliance with the queue.QueuePair interface. Regenerate this mock from the updated interface definition or update SendQ and its helper method signatures to queue.WorkQueue to match.
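
A compile-time assertion is one common way to catch this kind of drift between an interface and its mock. The sketch below uses trimmed stand-ins: the one-method `WorkQueue`, the `name string` parameters, `mockQueuePair`, and `stubQueue` are all assumptions for illustration and do not reproduce the generated mock.

```go
package main

// WorkQueue is trimmed to a single method for this sketch.
type WorkQueue interface {
	Len() int
}

// QueuePair mirrors the contract described above: both accessors return
// WorkQueue. The name parameter is an assumption.
type QueuePair interface {
	SendQ(name string) WorkQueue
	RecvQ(name string) WorkQueue
}

// stubQueue is a trivial WorkQueue used to exercise the mock.
type stubQueue struct{ n int }

func (s stubQueue) Len() int { return s.n }

// mockQueuePair is a hand-written stand-in for the generated mock.
type mockQueuePair struct {
	send, recv WorkQueue
}

func (m *mockQueuePair) SendQ(name string) WorkQueue { return m.send }
func (m *mockQueuePair) RecvQ(name string) WorkQueue { return m.recv }

// Compile-time check: the build fails here if mockQueuePair stops satisfying
// QueuePair, for example if SendQ kept returning a different queue type.
var _ QueuePair = (*mockQueuePair)(nil)
```

Placing such an assertion next to the mock turns a partially migrated signature into a build error at the mock itself rather than at a distant call site.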

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/queue/mocks/QueuePair.go` around lines 257 - 301, The QueuePair mock
in the file has an inconsistency where RecvQ correctly returns queue.WorkQueue
but SendQ still returns workqueue.TypedRateLimitingInterface[*event.Event]. To
fix this, locate the SendQ method and all its helper methods
(QueuePair_SendQ_Call struct and its associated Run and Return methods in the
QueuePair_Expecter type) and update their return type signatures from
workqueue.TypedRateLimitingInterface[*event.Event] to queue.WorkQueue to match
the actual interface definition. Alternatively, regenerate the entire mock from
the updated interface definition in internal/queue/queue.go to ensure complete
consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/queue/dedupe_queue_test.go`:
- Around line 223-237: The TestDedupeQueue_Done test is missing the critical
regression test scenario where a duplicate event with the same dedupe key
arrives while the first item is being processed (between Get and Done). Modify
the test to add a second event with the same dedupe key (app1_uid1) after
calling Get on the first event but before calling Done, then add a second Get
call after Done to verify it returns the new event with the updated payload (v2)
rather than nil. This ensures the queue correctly handles the case where new
items arrive for the same resource while an existing item is still being
processed.

In `@internal/queue/dedupe_queue.go`:
- Around line 160-162: The issue is that the Add() method uses a blocking
q.queue.Get() call inside the eviction path after checking q.queue.Len() ==
q.maxSize, which creates a race condition where another goroutine could drain
the queue between the length check and the Get() call, causing the producer
thread to block indefinitely. Replace the blocking Get() approach with a
non-blocking alternative for item eviction, such as implementing a separate
lock-protected data structure (like a map or ordered list) that tracks key
insertion order for determining which item to evict, or switch to a non-blocking
peek/remove operation that won't wait if the queue becomes empty.
- Around line 99-123: Add validation in the NewDedupeQueue function to check if
the maxSize parameter is less than or equal to zero before proceeding with queue
initialization. If maxSize is invalid, either panic with a descriptive error
message or log a warning and clamp it to a sensible minimum value (such as 1) to
prevent the queue from entering eviction immediately on the first Add() call and
causing blocking issues on Get().
- Around line 147-153: The issue is that when Done() is called for an older
event that has been superseded by a newer event with the same key, the code
unconditionally deletes latestEvents[key], losing the newer pending event. In
the Done() method (around lines 188-201), before deleting latestEvents[key], you
need to verify that the event being completed is actually the current event
stored in latestEvents for that key by checking if q.eventKeys[item] still
points to that key and if latestEvents[key] still equals item. Only delete
latestEvents[key] if the item matches the current event; otherwise, preserve the
newer event that has arrived.

In `@internal/queue/queue.go`:
- Around line 213-217: The GetWithContext function accepts a WorkQueue interface
parameter but only works for *dedupeQueue implementations, silently returning
nil, false for other types. Either change the parameter type from WorkQueue to
*dedupeQueue to match the actual implementation, or add an explicit failure
mechanism (such as a panic or error return) for unsupported queue types instead
of silently returning nil, false. This will prevent nil events from leaking into
callers when using non-dedupeQueue implementations.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0f933d5-ffa3-4545-b610-b0277a960794

📥 Commits

Reviewing files that changed from the base of the PR and between c37e3bf and 409766e.

📒 Files selected for processing (10)
  • agent/outbound_test.go
  • internal/queue/dedupe_queue.go
  • internal/queue/dedupe_queue_test.go
  • internal/queue/mocks/QueuePair.go
  • internal/queue/queue.go
  • internal/queue/queue_test.go
  • internal/resync/resync.go
  • principal/callbacks_test.go
  • principal/event.go
  • principal/event_test.go

Comment on lines +223 to +237
func TestDedupeQueue_Done(t *testing.T) {
	q := NewDedupeQueue("test", 100)

	ev := newDedupableEvent("app1_uid1", "v1")
	q.Add(ev)

	got, _ := q.Get()
	assert.Equal(t, 0, q.Len())
	q.Done(got)

	// After Done, adding a new event for the same resource should work fresh
	ev2 := newDedupableEvent("app1_uid1", "v2")
	q.Add(ev2)
	assert.Equal(t, 1, q.Len())
}


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add a regression test for “duplicate arrives while first item is processing”.

Current tests don't exercise Add(new) between Get(old) and Done(old) for the same dedupe key. That path is critical for this queue and should assert that the second Get() returns the latest payload (not nil).

Suggested test
+func TestDedupeQueue_DuplicateWhileProcessingKeepsLatest(t *testing.T) {
+	q := NewDedupeQueue("test", 100)
+
+	ev1 := newDedupableEvent("app1_uid1", "v1")
+	q.Add(ev1)
+
+	inFlight, shutdown := q.Get()
+	require.False(t, shutdown)
+	require.NotNil(t, inFlight)
+
+	ev2 := newDedupableEvent("app1_uid1", "v2")
+	q.Add(ev2)
+	q.Done(inFlight)
+
+	got, shutdown := q.Get()
+	require.False(t, shutdown)
+	require.NotNil(t, got)
+	var data string
+	_ = got.DataAs(&data)
+	assert.Equal(t, "v2", data)
+	q.Done(got)
+}

Also applies to: 310-328

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/queue/dedupe_queue_test.go` around lines 223 - 237, The
TestDedupeQueue_Done test is missing the critical regression test scenario where
a duplicate event with the same dedupe key arrives while the first item is being
processed (between Get and Done). Modify the test to add a second event with the
same dedupe key (app1_uid1) after calling Get on the first event but before
calling Done, then add a second Get call after Done to verify it returns the new
event with the updated payload (v2) rather than nil. This ensures the queue
correctly handles the case where new items arrive for the same resource while an
existing item is still being processed.

Comment on lines +99 to +123
func NewDedupeQueue(name string, maxSize int) WorkQueue {
	baseQueue := workqueue.NewTypedWithConfig(workqueue.TypedQueueConfig[EventKey]{
		Name:  name,
		Queue: newReorderQueue[EventKey](),
	})

	delayingQueue := workqueue.NewTypedDelayingQueueWithConfig(workqueue.TypedDelayingQueueConfig[EventKey]{
		Queue: baseQueue,
	})

	queue := workqueue.NewTypedRateLimitingQueueWithConfig(
		workqueue.DefaultTypedControllerRateLimiter[EventKey](),
		workqueue.TypedRateLimitingQueueConfig[EventKey]{
			DelayingQueue: delayingQueue,
		},
	)

	return &dedupeQueue{
		queue:        queue,
		maxSize:      maxSize,
		latestEvents: make(map[EventKey]*event.Event),
		eventKeys:    make(map[*event.Event]EventKey),
		notify:       make(chan struct{}, 10),
	}
}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate maxSize at construction time.

maxSize <= 0 causes the first Add() to enter eviction immediately and can block on Get(). Add an explicit constructor guard (panic or clamp with logged warning) to prevent invalid queue instances.
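
The two guard styles mentioned above could look roughly like this. `validateMaxSize` and `clampMaxSize` are hypothetical helper names; how the constructor would surface the result (panic, error return, or logged warning) is left open here.

```go
package main

import "fmt"

// validateMaxSize is a hypothetical helper for the strict option: it rejects
// non-positive sizes so the caller can panic or return the error.
func validateMaxSize(maxSize int) error {
	if maxSize <= 0 {
		return fmt.Errorf("dedupe queue: maxSize must be positive, got %d", maxSize)
	}
	return nil
}

// clampMaxSize is a hypothetical helper for the lenient option: it raises
// non-positive sizes to 1 and reports whether it changed the value, so the
// caller can log a warning.
func clampMaxSize(maxSize int) (size int, clamped bool) {
	if maxSize < 1 {
		return 1, true
	}
	return maxSize, false
}
```

The strict variant surfaces misconfiguration early, while the clamping variant keeps the process running at the cost of a queue that holds a single key.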

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/queue/dedupe_queue.go` around lines 99 - 123, Add validation in the
NewDedupeQueue function to check if the maxSize parameter is less than or equal
to zero before proceeding with queue initialization. If maxSize is invalid,
either panic with a descriptive error message or log a warning and clamp it to a
sensible minimum value (such as 1) to prevent the queue from entering eviction
immediately on the first Add() call and causing blocking issues on Get().

Comment on lines +147 to +153
	q.mu.Lock()
	oldEvent := q.latestEvents[key]
	_, exists := q.latestEvents[key]
	q.latestEvents[key] = item
	q.eventKeys[item] = key
	if exists && oldEvent != nil {
		delete(q.eventKeys, oldEvent)


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

In-flight duplicate handling drops the newest event for the same key.

When a key is re-added after Get() but before Done(), the new payload is put back into latestEvents, then Done() for the older payload deletes latestEvents[key] unconditionally. That can lose the pending update and produce nil from a subsequent Get().

Proposed fix
 type dedupeQueue struct {
 	queue workqueue.TypedRateLimitingInterface[EventKey]

 	maxSize int

 	mu           sync.Mutex
 	latestEvents map[EventKey]*event.Event
 	eventKeys    map[*event.Event]EventKey
+	processing   map[EventKey]*event.Event

 	notify chan struct{}
 }
@@
 	return &dedupeQueue{
 		queue:        queue,
 		maxSize:      maxSize,
 		latestEvents: make(map[EventKey]*event.Event),
 		eventKeys:    make(map[*event.Event]EventKey),
+		processing:   make(map[EventKey]*event.Event),
 		notify:       make(chan struct{}, 10),
 	}
 }
@@
 func (q *dedupeQueue) Add(item *event.Event) {
 	key := getKey(item)

 	q.mu.Lock()
 	oldEvent := q.latestEvents[key]
-	_, exists := q.latestEvents[key]
+	_, exists := q.latestEvents[key]
+	_, inFlight := q.processing[key]
 	q.latestEvents[key] = item
 	q.eventKeys[item] = key
 	if exists && oldEvent != nil {
 		delete(q.eventKeys, oldEvent)
 	}
+	isNewKey := !exists && !inFlight
 	q.mu.Unlock()
@@
-	if !exists && q.queue.Len() == q.maxSize {
+	if isNewKey && q.queue.Len() == q.maxSize {
@@
 func (q *dedupeQueue) Get() (*event.Event, bool) {
@@
 	q.mu.Lock()
 	ev := q.latestEvents[key]
+	if ev != nil {
+		q.processing[key] = ev
+	}
 	delete(q.latestEvents, key)
 	q.mu.Unlock()

 	return ev, shutdown
 }

 func (q *dedupeQueue) Done(item *event.Event) {
 	q.mu.Lock()
 	key, ok := q.eventKeys[item]
 	if ok {
 		delete(q.eventKeys, item)
-		delete(q.latestEvents, key)
+		delete(q.processing, key)
+		if cur, exists := q.latestEvents[key]; exists && cur == item {
+			delete(q.latestEvents, key)
+		}
 	}
 	q.mu.Unlock()

Also applies to: 188-201

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/queue/dedupe_queue.go` around lines 147 - 153, The issue is that
when Done() is called for an older event that has been superseded by a newer
event with the same key, the code unconditionally deletes latestEvents[key],
losing the newer pending event. In the Done() method (around lines 188-201),
before deleting latestEvents[key], you need to verify that the event being
completed is actually the current event stored in latestEvents for that key by
checking if q.eventKeys[item] still points to that key and if latestEvents[key]
still equals item. Only delete latestEvents[key] if the item matches the current
event; otherwise, preserve the newer event that has arrived.

Comment on lines +160 to +162
	if !exists && q.queue.Len() == q.maxSize {
		oldest, shutdown := q.queue.Get()
		if !shutdown {


⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Eviction path can block producer threads indefinitely.

Using blocking q.queue.Get() inside Add() is unsafe. The queue can be drained by another goroutine after the Len()==maxSize check, causing this producer to block before enqueueing the new item.

Please switch eviction to a non-blocking internal structure (or lock-protected key-order structure) rather than calling blocking Get() from the producer path.
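
One possible shape for the lock-protected key-order structure is sketched below, using only the standard library. `boundedKeys`, `Add`, and `TryPop` are hypothetical names; the sketch covers key ordering and eviction only and says nothing about how it would be wired into the rate-limiting workqueue.

```go
package main

import (
	"container/list"
	"sync"
)

// boundedKeys tracks unique keys in insertion order under a mutex. The
// capacity check and the eviction happen under the same lock and nothing
// waits, so a producer cannot block even if a consumer drains concurrently.
type boundedKeys[K comparable] struct {
	mu      sync.Mutex
	maxSize int
	order   *list.List // front is the oldest key
	index   map[K]*list.Element
}

// newBoundedKeys clamps maxSize to at least 1 so eviction always has a
// candidate to remove.
func newBoundedKeys[K comparable](maxSize int) *boundedKeys[K] {
	if maxSize < 1 {
		maxSize = 1
	}
	return &boundedKeys[K]{
		maxSize: maxSize,
		order:   list.New(),
		index:   make(map[K]*list.Element),
	}
}

// Add records key. A key that is already present moves to the back and
// nothing is evicted. A new key arriving at capacity evicts the oldest key,
// which is returned so the caller can drop its payload.
func (b *boundedKeys[K]) Add(key K) (evicted K, didEvict bool) {
	b.mu.Lock()
	defer b.mu.Unlock()

	if el, ok := b.index[key]; ok {
		b.order.MoveToBack(el)
		return evicted, false
	}
	if b.order.Len() >= b.maxSize {
		evicted = b.order.Remove(b.order.Front()).(K)
		delete(b.index, evicted)
		didEvict = true
	}
	b.index[key] = b.order.PushBack(key)
	return evicted, didEvict
}

// TryPop removes and returns the oldest key, or ok=false when empty. It never
// waits for an item to arrive.
func (b *boundedKeys[K]) TryPop() (key K, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()

	front := b.order.Front()
	if front == nil {
		return key, false
	}
	key = b.order.Remove(front).(K)
	delete(b.index, key)
	return key, true
}
```

Because Add decides and evicts in one critical section, the window between the length check and the removal described above does not exist in this design.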

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/queue/dedupe_queue.go` around lines 160 - 162, The issue is that the
Add() method uses a blocking q.queue.Get() call inside the eviction path after
checking q.queue.Len() == q.maxSize, which creates a race condition where
another goroutine could drain the queue between the length check and the Get()
call, causing the producer thread to block indefinitely. Replace the blocking
Get() approach with a non-blocking alternative for item eviction, such as
implementing a separate lock-protected data structure (like a map or ordered
list) that tracks key insertion order for determining which item to evict, or
switch to a non-blocking peek/remove operation that won't wait if the queue
becomes empty.

Comment thread internal/queue/queue.go
Comment on lines +213 to 217
func GetWithContext(q WorkQueue, ctx context.Context) (*event.Event, bool) {
	bq, ok := q.(*dedupeQueue)
	if !ok {
		return nil, false
	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

GetWithContext advertises WorkQueue but only works for *dedupeQueue.

This silently returns (nil, false) for any non-*dedupeQueue implementation, which can leak nil events into callers under a “not shutdown” state. Please either narrow the parameter type or make unsupported queue types fail explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/queue/queue.go` around lines 213 - 217, The GetWithContext function
accepts a WorkQueue interface parameter but only works for *dedupeQueue
implementations, silently returning nil, false for other types. Either change
the parameter type from WorkQueue to *dedupeQueue to match the actual
implementation, or add an explicit failure mechanism (such as a panic or error
return) for unsupported queue types instead of silently returning nil, false.
This will prevent nil events from leaking into callers when using
non-dedupeQueue implementations.

@codecov-commenter


Codecov Report

❌ Patch coverage is 98.24561% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.26%. Comparing base (f89b78e) to head (409766e).
⚠️ Report is 7 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/queue/queue.go | 66.66% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1003      +/-   ##
==========================================
- Coverage   48.40%   48.26%   -0.15%     
==========================================
  Files         122      126       +4     
  Lines       18341    18806     +465     
==========================================
+ Hits         8878     9076     +198     
- Misses       8664     8914     +250     
- Partials      799      816      +17     
| Flag | Coverage Δ |
| --- | --- |
| unit-tests | 48.26% <98.24%> (-0.15%) ⬇️ |
