fix: avoid LRU promotion on Lookup to unblock concurrent readers by vMaroon · Pull Request #562 · llm-d/llm-d-kv-cache

vMaroon · 2026-05-01T11:15:08Z

InMemoryIndex.Lookup walked requestKeys with hashicorp/lru's Get(), which takes an exclusive sync.Mutex on every call (LRU promotion is a write op on the linked list). For long prompts the loop runs hundreds of Get() calls, and concurrent Lookup callers serialize fully behind each other on that lock.

In production telemetry from the inference-scheduler's precise prefix cache scorer (40-pod fleet, ~263 blocks per prompt), this manifested as bimodal scorer latency: p50=0.31ms when prompts early-break on a miss, p95=735ms / max=1.42s when reader concurrency built up on shared prefixes. A mutex profile attributed 95.44% of contention to hashicorp/lru.Cache.Get under InMemoryIndex.Lookup.

Switching Lookup to Peek() takes the LRU's RLock instead, letting concurrent reads run in parallel. In a reproducer harness that faithfully mimics production load (full prefix warm, 40 KV-event writers Add()ing throughout):

readers blocks Get p50 Peek p50 speedup
64 263 21.46ms 2.86ms 7.5x
128 263 37.88ms 2.36ms 16x
64 1024 94.52ms 37.10ms 2.5x

Trade-off: LRU eviction order tracks Add-time recency only, no read-time recency. For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines, this is acceptable -- hot data refreshes through the write path.

Existing concurrent stress test (testStressConcurrentAddLookup) and all other in_memory tests pass without modification.

InMemoryIndex.Lookup walked requestKeys with hashicorp/lru's Get(), which takes an exclusive sync.Mutex on every call (LRU promotion is a write op on the linked list). For long prompts the loop runs hundreds of Get() calls, and concurrent Lookup callers serialize fully behind each other on that lock. In production telemetry from the inference-scheduler's precise prefix cache scorer (40-pod fleet, ~263 blocks per prompt), this manifested as bimodal scorer latency: p50=0.31ms when prompts early-break on a miss, p95=735ms / max=1.42s when reader concurrency built up on shared prefixes. A mutex profile attributed 95.44% of contention to hashicorp/lru.Cache.Get under InMemoryIndex.Lookup. Switching Lookup to Peek() takes the LRU's RLock instead, letting concurrent reads run in parallel. In a reproducer harness that faithfully mimics production load (full prefix warm, 40 KV-event writers Add()ing throughout): readers blocks Get p50 Peek p50 speedup 64 263 21.46ms 2.86ms 7.5x 128 263 37.88ms 2.36ms 16x 64 1024 94.52ms 37.10ms 2.5x Trade-off: LRU eviction order tracks Add-time recency only, no read-time recency. For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines, this is acceptable -- hot data refreshes through the write path. Existing concurrent stress test (testStressConcurrentAddLookup) and all other in_memory tests pass without modification.

Copilot

Pull request overview

This PR improves concurrent read performance of the KV block in-memory index by avoiding LRU “promotion” work during Lookup, reducing lock contention when many callers scan long key chains.

Changes:

Switch InMemoryIndex.Lookup from lru.Cache.Get() to lru.Cache.Peek() to avoid MRU promotion on reads.
Add an explanatory comment documenting the contention root cause and the trade-off (eviction recency based on Add-time only).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-05-01T11:17:23Z

+		// Peek (RLock) instead of Get (Lock): Get promotes the entry to MRU
+		// and so takes hashicorp/lru's exclusive Mutex on every call. With
+		// concurrent Lookup callers each iterating hundreds of blocks, the
+		// per-Get serialization dominates latency (95%+ of mutex profile).
+		// Peek takes RLock and lets readers run in parallel. The tradeoff
+		// is that LRU eviction order tracks Add-time recency only; for a
+		// prefix cache where hot blocks are continually re-Added by KV
+		// events, this is acceptable.


The new comment explains the motivation, but it relies on hashicorp/golang-lru internal locking details ("exclusive Mutex" / "Peek takes RLock"). Since that’s an implementation detail that could change between versions, consider rephrasing to focus on the stable behavior difference (Get promotes MRU; Peek does not) and the observed contention, without asserting the specific lock type used internally.

Suggested change

// Peek (RLock) instead of Get (Lock): Get promotes the entry to MRU

// and so takes hashicorp/lru's exclusive Mutex on every call. With

// concurrent Lookup callers each iterating hundreds of blocks, the

// per-Get serialization dominates latency (95%+ of mutex profile).

// Peek takes RLock and lets readers run in parallel. The tradeoff

// is that LRU eviction order tracks Add-time recency only; for a

// prefix cache where hot blocks are continually re-Added by KV

// events, this is acceptable.

// Use Peek instead of Get here because Get updates the entry's MRU

// position, while Peek leaves recency unchanged. In this read-heavy

// path, concurrent Lookup callers may each scan hundreds of blocks,

// and avoiding that recency update significantly reduced observed

// contention. The tradeoff is that eviction order reflects Add-time

// recency rather than lookup-time recency; for a prefix cache where

// hot blocks are continually re-added by KV events, this is

// acceptable.

liu-cong · 2026-05-01T14:32:23Z

For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines,

KV events will be emitted even for blocks that are already cached?

vMaroon · 2026-05-01T17:23:29Z

For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines,

KV events will be emitted even for blocks that are already cached?

No - this PR tradeoffs lookup performance (currently ~1-5ms in a high-load environment) with LRU semantics, but requires more investigation. Should be WIP.

github-actions · 2026-05-23T01:41:03Z

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

Copilot AI review requested due to automatic review settings May 1, 2026 11:15

vMaroon requested review from dannyharnik, kfirtoledo and liu-cong as code owners May 1, 2026 11:15

github-actions Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 1, 2026

github-actions Bot requested review from hyeongyun0916, sagearc and yankay May 1, 2026 11:15

Copilot started reviewing on behalf of vMaroon May 1, 2026 11:15 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

vMaroon marked this pull request as draft May 1, 2026 17:23

github-actions Bot added the lifecycle/stale label May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: avoid LRU promotion on Lookup to unblock concurrent readers#562

fix: avoid LRU promotion on Lookup to unblock concurrent readers#562
vMaroon wants to merge 1 commit into
llm-d:mainfrom
vMaroon:fix/lookup-peek-contention

vMaroon commented May 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI May 1, 2026

Uh oh!

liu-cong commented May 1, 2026

Uh oh!

vMaroon commented May 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

vMaroon commented May 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

liu-cong commented May 1, 2026

Uh oh!

vMaroon commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

vMaroon commented May 1, 2026 •

edited

Loading