Skip to content

fix: avoid LRU promotion on Lookup to unblock concurrent readers#562

Draft
vMaroon wants to merge 1 commit into
llm-d:mainfrom
vMaroon:fix/lookup-peek-contention
Draft

fix: avoid LRU promotion on Lookup to unblock concurrent readers#562
vMaroon wants to merge 1 commit into
llm-d:mainfrom
vMaroon:fix/lookup-peek-contention

Conversation

@vMaroon
Copy link
Copy Markdown
Member

@vMaroon vMaroon commented May 1, 2026

InMemoryIndex.Lookup walked requestKeys with hashicorp/lru's Get(), which takes an exclusive sync.Mutex on every call (LRU promotion is a write op on the linked list). For long prompts the loop runs hundreds of Get() calls, and concurrent Lookup callers serialize fully behind each other on that lock.

In production telemetry from the inference-scheduler's precise prefix cache scorer (40-pod fleet, ~263 blocks per prompt), this manifested as bimodal scorer latency: p50=0.31ms when prompts early-break on a miss, p95=735ms / max=1.42s when reader concurrency built up on shared prefixes. A mutex profile attributed 95.44% of contention to hashicorp/lru.Cache.Get under InMemoryIndex.Lookup.

Switching Lookup to Peek() takes the LRU's RLock instead, letting concurrent reads run in parallel. In a reproducer harness that faithfully mimics production load (full prefix warm, 40 KV-event writers Add()ing throughout):

readers blocks Get p50 Peek p50 speedup
64 263 21.46ms 2.86ms 7.5x
128 263 37.88ms 2.36ms 16x
64 1024 94.52ms 37.10ms 2.5x

Trade-off: LRU eviction order tracks Add-time recency only, no read-time recency. For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines, this is acceptable -- hot data refreshes through the write path.

Existing concurrent stress test (testStressConcurrentAddLookup) and all other in_memory tests pass without modification.

InMemoryIndex.Lookup walked requestKeys with hashicorp/lru's Get(),
which takes an exclusive sync.Mutex on every call (LRU promotion is
a write op on the linked list). For long prompts the loop runs hundreds
of Get() calls, and concurrent Lookup callers serialize fully behind
each other on that lock.

In production telemetry from the inference-scheduler's precise prefix
cache scorer (40-pod fleet, ~263 blocks per prompt), this manifested
as bimodal scorer latency: p50=0.31ms when prompts early-break on a
miss, p95=735ms / max=1.42s when reader concurrency built up on
shared prefixes. A mutex profile attributed 95.44% of contention to
hashicorp/lru.Cache.Get under InMemoryIndex.Lookup.

Switching Lookup to Peek() takes the LRU's RLock instead, letting
concurrent reads run in parallel. In a reproducer harness that
faithfully mimics production load (full prefix warm, 40 KV-event
writers Add()ing throughout):

  readers   blocks   Get p50    Peek p50    speedup
  64        263      21.46ms    2.86ms      7.5x
  128       263      37.88ms    2.36ms      16x
  64        1024     94.52ms    37.10ms     2.5x

Trade-off: LRU eviction order tracks Add-time recency only, no
read-time recency. For a prefix cache where hot blocks are
continually re-Added by KV-cache events from the engines, this is
acceptable -- hot data refreshes through the write path.

Existing concurrent stress test (testStressConcurrentAddLookup) and
all other in_memory tests pass without modification.
Copilot AI review requested due to automatic review settings May 1, 2026 11:15
@github-actions github-actions Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves concurrent read performance of the KV block in-memory index by avoiding LRU “promotion” work during Lookup, reducing lock contention when many callers scan long key chains.

Changes:

  • Switch InMemoryIndex.Lookup from lru.Cache.Get() to lru.Cache.Peek() to avoid MRU promotion on reads.
  • Add an explanatory comment documenting the contention root cause and the trade-off (eviction recency based on Add-time only).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +120 to +127
// Peek (RLock) instead of Get (Lock): Get promotes the entry to MRU
// and so takes hashicorp/lru's exclusive Mutex on every call. With
// concurrent Lookup callers each iterating hundreds of blocks, the
// per-Get serialization dominates latency (95%+ of mutex profile).
// Peek takes RLock and lets readers run in parallel. The tradeoff
// is that LRU eviction order tracks Add-time recency only; for a
// prefix cache where hot blocks are continually re-Added by KV
// events, this is acceptable.
Copy link

Copilot AI May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new comment explains the motivation, but it relies on hashicorp/golang-lru internal locking details ("exclusive Mutex" / "Peek takes RLock"). Since that’s an implementation detail that could change between versions, consider rephrasing to focus on the stable behavior difference (Get promotes MRU; Peek does not) and the observed contention, without asserting the specific lock type used internally.

Suggested change
// Peek (RLock) instead of Get (Lock): Get promotes the entry to MRU
// and so takes hashicorp/lru's exclusive Mutex on every call. With
// concurrent Lookup callers each iterating hundreds of blocks, the
// per-Get serialization dominates latency (95%+ of mutex profile).
// Peek takes RLock and lets readers run in parallel. The tradeoff
// is that LRU eviction order tracks Add-time recency only; for a
// prefix cache where hot blocks are continually re-Added by KV
// events, this is acceptable.
// Use Peek instead of Get here because Get updates the entry's MRU
// position, while Peek leaves recency unchanged. In this read-heavy
// path, concurrent Lookup callers may each scan hundreds of blocks,
// and avoiding that recency update significantly reduced observed
// contention. The tradeoff is that eviction order reflects Add-time
// recency rather than lookup-time recency; for a prefix cache where
// hot blocks are continually re-added by KV events, this is
// acceptable.

Copilot uses AI. Check for mistakes.
@liu-cong
Copy link
Copy Markdown
Member

liu-cong commented May 1, 2026

For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines,

KV events will be emitted even for blocks that are already cached?

@vMaroon
Copy link
Copy Markdown
Member Author

vMaroon commented May 1, 2026

For a prefix cache where hot blocks are continually re-Added by KV-cache events from the engines,

KV events will be emitted even for blocks that are already cached?

No - this PR tradeoffs lookup performance (currently ~1-5ms in a high-load environment) with LRU semantics, but requires more investigation. Should be WIP.

@vMaroon vMaroon marked this pull request as draft May 1, 2026 17:23
@github-actions
Copy link
Copy Markdown

This PR is marked as stale after 21d of inactivity. After an additional 14d of inactivity (7d to become rotten, then 7d more), it will be closed. To prevent this PR from being closed, add a comment or remove the lifecycle/stale label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lifecycle/stale size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants