feat(queuev2): cached timer queue reader with prefetch and time-based eviction #7962
Conversation
// putTasks adds tasks to the cache and enforces the size cap.
// Caller must hold q.mu.
// putTasks adds tasks to the cache and enforces the size cap.
// Returns true if RTrimBySize fired and updated exclusiveUpperBound,
// meaning the caller must not re-advance the bound.
// Caller must hold q.mu.
💡 Quality: Duplicate doc comment on putTasks
Lines 427-428 and 429-432 both start with `// putTasks adds tasks to the cache and enforces the size cap.` and end with `// Caller must hold q.mu.`. The old comment block (lines 427-428) was left in place when the new, expanded comment block (lines 429-432) was added.
Suggested fix:
Remove the duplicate lines 427-428, keeping only the fuller comment block (lines 429-432):
// putTasks adds tasks to the cache and enforces the size cap.
// Returns true if RTrimBySize fired and updated exclusiveUpperBound,
// meaning the caller must not re-advance the bound.
// Caller must hold q.mu.
func (q *cachedQueueReader) putTasks(tasks []persistence.Task) bool {
… eviction

Adds a cached layer over the timer (scheduled) queue reader that keeps a look-ahead window of upcoming tasks in memory, eliminating repeated DB reads on the hot path. Relates to cadence-workflow#7953.

## Components

### InMemQueue (queue_mem.go)
Sorted in-memory task store. GetTasks guards PageSize<=0 and breaks early. RTrimBySize nils evicted slots for GC and resets exclusiveUpperBound when the queue is emptied.

### CachedQueueReader (queue_reader_cached.go)
Three modes: disabled (default), shadow, enabled.
- isEnabled/isShadow — strict single-value checks
- isDisabled — catch-all: returns true for 'disabled' and any unrecognised mode (Warn log), ensuring misconfigurations default to safe

Key design:
- prefetchLoop: fires after WarmupGracePeriod; returns nil for cache-full so the loop reschedules quietly without Warn noise
- RTrim guard: after putTasks, if exclusiveUpperBound changed (!Equal, catches both shrink and raise cases), skip re-advancing to avoid false coverage
- putTasks resets exclusiveUpperBound to MinimumHistoryTaskKey when RTrim empties the cache (MaxSize<=0)
- injectAllowedAfter: set once in constructor (now+WarmupGracePeriod), no mutex needed; isInWarmup() is a plain clock.Now().Before() check
- ctx/cancel initialized in constructor to avoid nil-cancel race with Stop
- LookAHead uses strict Less on upper bound (exclusive) and checks inclusiveLowerBound so evicted tasks fall through to DB

Shadow mode comparison:
- findMismatchesInShadow — pure function returning findMismatchesInShadowResult with MissingFromCache, ExtraInCache, NextKeyMismatch, HasMismatches
- reportShadowComparison — logs + increments metric from result
- getTaskInShadow — orchestrates DB fetch, live cache re-read, and both above
- preFetchLowerBound filters eviction between snapshot and live re-read to suppress false mismatch warnings

### Config (service/history/config, dynamicconfig/dynamicproperties)
Nine new global dynamic config properties (no ShardID filter). Feature flag: TimerProcessorEnableCachedScheduledQueue (default: off).

### CachedScheduledQueue (queue_scheduled_cached.go)
Wires NotifyNewTask→Inject, updateQueueStateFn→UpdateReadLevel, and reader lifecycle into the existing scheduled queue.

### Simulation scenarios
Two new scenarios (enabled + shadow mode) requiring cadence-workflow#7952 for infra.

## Tests
Coverage: 87.4% on new code. Tests cover prefetch scheduling, time eviction, shadow mismatch detection (bidirectional, NextTaskKey, preFetchLowerBound filtering), inject races, lifecycle, all three modes, RTrim guards (shrink, raise, empty-cache reset), LookAHead bounds, warmup gating, unknown mode fallback, compareTasksInShadow unit tests.

Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
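The LookAHead bound semantics above (strict Less on the exclusive upper bound, plus an inclusive lower bound so evicted tasks fall through to the DB) can be modelled roughly as follows. This is a minimal sketch with int64 keys standing in for history task keys; the function name and types are illustrative, not the actual queuev2 code:

```go
package main

import "fmt"

// isRangeCovered is a simplified model of the cache coverage check:
// a key is servable from cache only if it lies in
// [inclusiveLowerBound, exclusiveUpperBound).
func isRangeCovered(key, inclusiveLowerBound, exclusiveUpperBound int64) bool {
	return key >= inclusiveLowerBound && key < exclusiveUpperBound // strict Less on upper
}

func main() {
	lower, upper := int64(100), int64(200)
	fmt.Println(isRangeCovered(150, lower, upper)) // true: inside the window
	fmt.Println(isRangeCovered(200, lower, upper)) // false: upper bound is exclusive
	fmt.Println(isRangeCovered(50, lower, upper))  // false: evicted keys fall through to DB
}
```

The half-open interval is what makes eviction safe: anything trimmed below the lower bound is simply no longer "covered" and is served from the DB instead.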
GetTask warmup bypass
GetTask now delegates to the base reader when isInWarmup() is true,
same as disabled mode. Previously the prefetch loop had not fully
populated the cache yet, but GetTask would still attempt to serve from
it. The result was stale/partial responses and spurious 'cache misses'
that were really just warmup bypasses.
ExtraInCache excluded from HasMismatches
Tasks in the cache but absent from the DB are benign: Inject writes
tasks to the cache as soon as NotifyNewTask fires, before the DB
read in the shadow path completes. Due to task independence, having
extra tasks in cache causes no harm. ExtraInCache is still tracked and
logged at Debug for observability but no longer sets HasMismatches or
triggers the mismatch Warn/counter.
Reason tag on bound advances
updateExclusiveUpperBound and updateInclusiveLowerBound now accept a
reason string logged alongside prevBound/newBound:
prefetch-partial-page, prefetch-full-page — prefetch advancing window
gap-detected-reset — gap reset after concurrent bound change
rtrim-shrink, rtrim-empty — RTrim narrowing or clearing cache
time-eviction — periodic time-based eviction loop
ack-level-update — UpdateReadLevel from queue state persist
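A reason-tagged bound setter like the one described above might look roughly like this minimal sketch; the struct, field, and returned log line are illustrative stand-ins, not the actual cadence code:

```go
package main

import "fmt"

// reader is a toy holder for the cache's exclusive upper bound.
type reader struct{ exclusiveUpperBound int64 }

// updateExclusiveUpperBound records prevBound/newBound alongside a
// reason tag (e.g. "prefetch-full-page", "rtrim-shrink") so bound
// movements are attributable in the logs. Returns the log line here
// only so the sketch is easy to inspect.
func (r *reader) updateExclusiveUpperBound(newBound int64, reason string) string {
	line := fmt.Sprintf("reason=%s prevBound=%d newBound=%d", reason, r.exclusiveUpperBound, newBound)
	r.exclusiveUpperBound = newBound
	return line
}

func main() {
	r := &reader{exclusiveUpperBound: 100}
	fmt.Println(r.updateExclusiveUpperBound(250, "prefetch-full-page")) // reason=prefetch-full-page prevBound=100 newBound=250
	fmt.Println(r.updateExclusiveUpperBound(180, "rtrim-shrink"))       // reason=rtrim-shrink prevBound=250 newBound=180
}
```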
Tests
GetTask_Enabled and GetTask_Shadow set r.injectAllowedAfter = time.Time{}
so unit tests skip the warmup check and exercise cache logic directly.
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
…dLevel Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
1. extraInCache moved to mismatchTags (logged on mismatch only)
2. All mismatch log fields prefixed shadowMismatch.*
3. HasMismatches restored to include ExtraInCache — still a mismatch even if benign, for consistent observability
4. ack-level-update → read-level-update in updateInclusiveLowerBound
5. LookAHead now delegates to base during warmup, matching GetTask

Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
Collapse the partial/full-page if/else into a single target computation:
target := exclusiveMaxKey (partial page)
target = NextTaskKey (full page)
One updateExclusiveUpperBound call, one reason tag 'prefetch-advance'. Drop the verbose comment block — the early return on trim is self-explanatory.
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
…efetch putTasks now returns true when RTrimBySize fired and updated the upper bound. prefetch uses this directly instead of comparing exclusiveUpperBound before and after the call, removing the prevUpper variable and the verbose !Equal guard. Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
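The putTasks contract above (return true when the size-cap trim fired and pulled the upper bound back) can be modelled with a toy cache of sorted int64 keys. This is a simplified sketch under the assumption that callers append keys above the existing ones; the real queuev2 code deduplicates and handles arbitrary inserts:

```go
package main

import "fmt"

// cache is a toy model: keys are kept sorted ascending; when the size
// cap is exceeded the highest keys are trimmed (RTrimBySize) and the
// exclusive upper bound is pulled back to the first evicted key.
type cache struct {
	keys                []int64
	maxSize             int
	exclusiveUpperBound int64
}

// putTasks returns true when the trim fired, so callers skip
// re-advancing the bound instead of comparing it before and after.
func (c *cache) putTasks(keys []int64) bool {
	c.keys = append(c.keys, keys...) // assumes keys sort above existing ones
	if len(c.keys) <= c.maxSize {
		return false
	}
	evicted := c.keys[c.maxSize:]
	c.keys = c.keys[:c.maxSize]
	c.exclusiveUpperBound = evicted[0] // bound now excludes the evicted tail
	return true
}

func main() {
	c := &cache{maxSize: 3, exclusiveUpperBound: 100}
	fmt.Println(c.putTasks([]int64{10, 20}))     // false: under the cap
	fmt.Println(c.putTasks([]int64{30, 40, 50})) // true: trim fired
	fmt.Println(c.exclusiveUpperBound)           // 40: pulled back to first evicted key
}
```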
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
… reset

updateExclusiveUpperBound: now logs inclusiveLowerBound + cacheSize for context, and records the size histogram.
updateInclusiveLowerBound: now a pure setter with log + metrics. No advance guard, no LTrim — safe to call for resets too.
advanceInclusiveLowerBound: new helper wrapping the advance-only logic (upper-bound cap, advance guard, LTrim) then calls updateInclusiveLowerBound.
Gap detection: uses updateInclusiveLowerBound directly after Clear() since going backwards is intentional and no LTrim is needed.
timeEvict + UpdateReadLevel: fixed to call advanceInclusiveLowerBound (were calling updateInclusiveLowerBound after the rename, which skipped the LTrim — tasks below the eviction point were never removed).
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
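The setter/advance split described in this commit can be sketched as two methods on a toy queue; the names mirror the commit message, but the bodies are simplified stand-ins for the real logic:

```go
package main

import "fmt"

// q is a toy model of the cached reader's bounds plus sorted keys.
type q struct {
	inclusiveLowerBound int64
	exclusiveUpperBound int64
	keys                []int64 // sorted ascending
}

// updateInclusiveLowerBound: pure setter, accepts backwards resets,
// no LTrim. Safe to call after Clear() in gap detection.
func (c *q) updateInclusiveLowerBound(b int64) { c.inclusiveLowerBound = b }

// advanceInclusiveLowerBound: caps at the upper bound, ignores
// non-advances, trims keys below the new bound (LTrim), then
// delegates to the pure setter.
func (c *q) advanceInclusiveLowerBound(b int64) {
	if b > c.exclusiveUpperBound {
		b = c.exclusiveUpperBound // upper-bound cap
	}
	if b <= c.inclusiveLowerBound {
		return // advance guard
	}
	i := 0
	for i < len(c.keys) && c.keys[i] < b { // LTrim
		i++
	}
	c.keys = c.keys[i:]
	c.updateInclusiveLowerBound(b)
}

func main() {
	c := &q{inclusiveLowerBound: 10, exclusiveUpperBound: 100, keys: []int64{10, 20, 30, 40}}
	c.advanceInclusiveLowerBound(25)
	fmt.Println(c.inclusiveLowerBound, c.keys) // 25 [30 40]
	c.updateInclusiveLowerBound(5)             // backwards reset: allowed by the pure setter
	fmt.Println(c.inclusiveLowerBound)         // 5
}
```

Calling the pure setter where the advance helper is needed is exactly the bug the commit fixes: the LTrim is skipped and tasks below the eviction point linger in the cache.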
…fy Inject guard

Gap detection now calls updateInclusiveLowerBound (the simple setter) instead of assigning inclusiveLowerBound directly. No LTrim needed since the queue is already cleared, and the setter accepts backwards resets.
Inject: combine isDisabled and isInWarmup into a single guard condition.
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
ExtraInCache (tasks in cache not in DB) is benign given task independence. It now logs at Info without incrementing the mismatch counter, so it is still observable without polluting the Warn metric used for alerting. MissingFromCache and NextKeyMismatch are real divergences: they log at Warn and increment the counter as before. shadowMismatch.dbTaskCount and shadowMismatch.cacheTaskCount moved into the mismatch tag set so they are only emitted when there is something to act on. The match path logs a bare Debug with no extra fields. Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
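The severity split described above (real divergences warn and count; extra-in-cache alone logs at Info; a clean comparison logs a bare Debug) can be summarised in a small classifier. The struct and function names are illustrative, loosely modelled on the findMismatchesInShadowResult fields named earlier:

```go
package main

import "fmt"

// shadowResult is a toy stand-in for the shadow comparison result.
type shadowResult struct {
	MissingFromCache int
	ExtraInCache     int
	NextKeyMismatch  bool
}

// severity maps a comparison result to its reporting tier:
// real divergences -> Warn + mismatch counter,
// benign extras only -> Info (observable, no counter),
// clean match -> bare Debug.
func severity(r shadowResult) string {
	switch {
	case r.MissingFromCache > 0 || r.NextKeyMismatch:
		return "warn+counter"
	case r.ExtraInCache > 0:
		return "info"
	default:
		return "debug"
	}
}

func main() {
	fmt.Println(severity(shadowResult{ExtraInCache: 2}))     // info: benign, Inject raced the DB read
	fmt.Println(severity(shadowResult{MissingFromCache: 1})) // warn+counter: real divergence
	fmt.Println(severity(shadowResult{}))                    // debug: clean match
}
```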
Adds a readLevelSyncLoop goroutine to cachedScheduledQueue that ticks every TimerProcessorCacheReadLevelSyncInterval (default 1s) and calls reader.UpdateReadLevel(virtualQueueManager.GetMinReadLevel()). This keeps the cache lower bound within ~1s of actual processing progress, compared to the previous ~30s lag from updateQueueStateFn which is gated on DB writes. New config: TimerProcessorCacheReadLevelSyncInterval (global, default 1s). Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
…vent startup false coverage

At shard takeover, the cached reader starts with inclusiveLowerBound = MinimumHistoryTaskKey and exclusiveUpperBound = MinimumHistoryTaskKey. After the first prefetch, exclusiveUpperBound jumps to now + MaxLookAheadWindow but inclusiveLowerBound stays at MinimumHistoryTaskKey until timeEvict catches up (~60s). During this window, isRangeCovered returns true for ALL historical tasks (from the beginning of time to the cache ceiling). Tasks from the previous shard owner exist in DB but not in the fresh cache. GetTask returns 0, the virtual slice pops its progress entry, and those tasks are permanently skipped at shard takeover.

Fix: after the first prefetch, advance inclusiveLowerBound to the fetch anchor (now - EvictionSafeWindow). Tasks before that anchor now correctly miss the cache and fall through to the DB.
Signed-off-by: Seva Kaloshin <seva.kaloshin@gmail.com>
CI failed: the build failed due to flaky integration tests emitting background logs after test completion. These 'logged too late' errors indicate incomplete lifecycle teardown for the global ratelimiter component: multiple integration tests failed because background goroutines kept logging after the tests finished. Flaky integration test teardown (confidence: high).
Summary
Code Review: 👍 Approved with suggestions — 9 resolved / 10 findings
Introduces a cached timer queue reader with prefetch and eviction, resolving several concurrency and performance issues with bulk inserts and dynamic config reads. Please address the duplicate doc comments on putTasks.
💡 Quality: Duplicate doc comment on putTasks
📄 service/history/queuev2/queue_reader_cached.go:427-432 — lines 427-428 and 429-432 both start with the same doc comment; see the suggested fix above.
✅ 9 resolved
✅ Edge Case: putTasks doesn't reset exclusiveUpperBound when RTrim empties cache
✅ Quality: Mode string comparison has no validation for unknown values
✅ Edge Case: Shadow mode false mismatches from cache eviction between reads
✅ Quality: InMemQueue.PutTasks is O(n*m) for unsorted bulk inserts
✅ Edge Case: Non-atomic dynamic config Mode() reads can cause inconsistency
...and 4 more resolved from earlier reviews
Rules: ✅ All requirements met.
1 rule not applicable.
What changed?
Adds a cached layer over the timer (scheduled) queue reader that keeps a look-ahead window of upcoming tasks in memory, eliminating repeated DB reads on the hot path. Relates to #7953.
Replaces four stacked PRs (#7954 #7955 #7956 #7957) with a single consolidated draft for easier review iteration.
Components:
- **InMemQueue** (`queue_mem.go`) — sorted in-memory task store with `PutTasks`, `GetTasks`, `LookAHead`, `LTrim`, `RTrimBySize`, `Clear`. Deduplicates inserts by task key. `RTrimBySize` nils evicted interface slots for GC.
- **CachedQueueReader** (`queue_reader_cached.go`) — wraps any `QueueReader` with an in-memory look-ahead cache. Three rollout modes: `disabled` (default), `shadow` (cache runs, DB always wins, mismatches logged), `enabled` (cache serves reads). Key behaviours: prefetch loop, time-eviction loop, `Inject` for new tasks, `UpdateReadLevel` for ack-level eviction. `injectAllowedAfter` is computed once at construction (now + WarmupGracePeriod) — no mutex needed.
- **CachedScheduledQueue** (`queue_scheduled_cached.go`) — thin wrapper wiring `NotifyNewTask → Inject`, `updateQueueStateFn → UpdateReadLevel`, and reader lifecycle into the existing scheduled queue.
- **Config** — nine new global dynamic config properties (no ShardID filter, apply uniformly per host). Feature flag: `TimerProcessorEnableCachedScheduledQueue` (default: off).
- **Simulation scenarios** — two new scenarios (enabled + shadow mode) requiring #7952 for infra.
Why?
The timer queue currently issues a DB round-trip for every virtual slice read. For scheduled tasks most reads land within a predictable look-ahead window. A pre-fetched in-memory cache removes those round-trips in the common case.
How did you test it?
Unit test coverage for new code: 87.4%. Tests cover prefetch, eviction, shadow mismatch, inject races, lifecycle, RTrim guard (both shrink and raise cases).
Simulation:
Potential risks
- Behind `TimerProcessorEnableCachedScheduledQueue` (default: false) and `TimerProcessorCachedQueueReaderMode` (default: `disabled`). Zero code path changes when flags are off.
- `TimerProcessorCacheMaxSize` (default: 1000 tasks per shard).
N/A — internal change behind a feature flag.
Documentation Changes
N/A