fix: avoid LRU promotion on Lookup to unblock concurrent readers#562
fix: avoid LRU promotion on Lookup to unblock concurrent readers#562vMaroon wants to merge 1 commit into
Conversation
InMemoryIndex.Lookup walked requestKeys with hashicorp/lru's Get(), which takes an exclusive sync.Mutex on every call (LRU promotion is a write op on the linked list). For long prompts the loop runs hundreds of Get() calls, and concurrent Lookup callers serialize fully behind each other on that lock. In production telemetry from the inference-scheduler's precise prefix cache scorer (40-pod fleet, ~263 blocks per prompt), this manifested as bimodal scorer latency: p50=0.31ms when prompts early-break on a miss, p95=735ms / max=1.42s when reader concurrency built up on shared prefixes. A mutex profile attributed 95.44% of contention to hashicorp/lru.Cache.Get under InMemoryIndex.Lookup. Switching Lookup to Peek() takes the LRU's RLock instead, letting concurrent reads run in parallel. In a reproducer harness that faithfully mimics production load (full prefix warm, 40 KV-event writers Add()ing throughout): readers blocks Get p50 Peek p50 speedup 64 263 21.46ms 2.86ms 7.5x 128 263 37.88ms 2.36ms 16x 64 1024 94.52ms 37.10ms 2.5x Trade-off: LRU eviction order tracks Add-time recency only, no read-time recency. For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines, this is acceptable -- hot data refreshes through the write path. Existing concurrent stress test (testStressConcurrentAddLookup) and all other in_memory tests pass without modification.
There was a problem hiding this comment.
Pull request overview
This PR improves concurrent read performance of the KV block in-memory index by avoiding LRU “promotion” work during Lookup, reducing lock contention when many callers scan long key chains.
Changes:
- Switch
InMemoryIndex.Lookupfromlru.Cache.Get()tolru.Cache.Peek()to avoid MRU promotion on reads. - Add an explanatory comment documenting the contention root cause and the trade-off (eviction recency based on Add-time only).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Peek (RLock) instead of Get (Lock): Get promotes the entry to MRU | ||
| // and so takes hashicorp/lru's exclusive Mutex on every call. With | ||
| // concurrent Lookup callers each iterating hundreds of blocks, the | ||
| // per-Get serialization dominates latency (95%+ of mutex profile). | ||
| // Peek takes RLock and lets readers run in parallel. The tradeoff | ||
| // is that LRU eviction order tracks Add-time recency only; for a | ||
| // prefix cache where hot blocks are continually re-Added by KV | ||
| // events, this is acceptable. |
There was a problem hiding this comment.
The new comment explains the motivation, but it relies on hashicorp/golang-lru internal locking details ("exclusive Mutex" / "Peek takes RLock"). Since that’s an implementation detail that could change between versions, consider rephrasing to focus on the stable behavior difference (Get promotes MRU; Peek does not) and the observed contention, without asserting the specific lock type used internally.
| // Peek (RLock) instead of Get (Lock): Get promotes the entry to MRU | |
| // and so takes hashicorp/lru's exclusive Mutex on every call. With | |
| // concurrent Lookup callers each iterating hundreds of blocks, the | |
| // per-Get serialization dominates latency (95%+ of mutex profile). | |
| // Peek takes RLock and lets readers run in parallel. The tradeoff | |
| // is that LRU eviction order tracks Add-time recency only; for a | |
| // prefix cache where hot blocks are continually re-Added by KV | |
| // events, this is acceptable. | |
| // Use Peek instead of Get here because Get updates the entry's MRU | |
| // position, while Peek leaves recency unchanged. In this read-heavy | |
| // path, concurrent Lookup callers may each scan hundreds of blocks, | |
| // and avoiding that recency update significantly reduced observed | |
| // contention. The tradeoff is that eviction order reflects Add-time | |
| // recency rather than lookup-time recency; for a prefix cache where | |
| // hot blocks are continually re-added by KV events, this is | |
| // acceptable. |
KV events will be emitted even for blocks that are already cached? |
No - this PR tradeoffs lookup performance (currently ~1-5ms in a high-load environment) with LRU semantics, but requires more investigation. Should be WIP. |
|
This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the |
InMemoryIndex.Lookup walked requestKeys with hashicorp/lru's Get(), which takes an exclusive sync.Mutex on every call (LRU promotion is a write op on the linked list). For long prompts the loop runs hundreds of Get() calls, and concurrent Lookup callers serialize fully behind each other on that lock.
In production telemetry from the inference-scheduler's precise prefix cache scorer (40-pod fleet, ~263 blocks per prompt), this manifested as bimodal scorer latency: p50=0.31ms when prompts early-break on a miss, p95=735ms / max=1.42s when reader concurrency built up on shared prefixes. A mutex profile attributed 95.44% of contention to hashicorp/lru.Cache.Get under InMemoryIndex.Lookup.
Switching Lookup to Peek() takes the LRU's RLock instead, letting concurrent reads run in parallel. In a reproducer harness that faithfully mimics production load (full prefix warm, 40 KV-event writers Add()ing throughout):
readers blocks Get p50 Peek p50 speedup
64 263 21.46ms 2.86ms 7.5x
128 263 37.88ms 2.36ms 16x
64 1024 94.52ms 37.10ms 2.5x
Trade-off: LRU eviction order tracks Add-time recency only, no read-time recency. For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines, this is acceptable -- hot data refreshes through the write path.
Existing concurrent stress test (testStressConcurrentAddLookup) and all other in_memory tests pass without modification.