
fix: Close data race in InMemoryIndex Add/Evict with RWMutex#422

Open
gyliu513 wants to merge 1 commit into llm-d:main from gyliu513:race-memory

Conversation

@gyliu513
Contributor

Fixes #421

  • Add a sync.RWMutex to InMemoryIndex to make the compound read-modify-write sequences in Add and Evict atomic. Previously, a concurrent Evict could remove a PodCache from data while Add was still writing entries into it, silently losing those entries.
  • Lookup and GetRequestKey use RLock (concurrent reads allowed); Add and Evict use Lock (mutually exclusive).
  • Remove the now-redundant PodCache.mu and the double-check relocking pattern in Evict, simplifying the eviction path.

Collaborator

@yankay left a comment


Review: fix: Close data race in InMemoryIndex Add/Evict with RWMutex

The PR correctly identifies and fixes the race described in #421. However, the global RWMutex approach introduces a significant concurrency regression and removes the existing fine-grained locking entirely. I'd suggest a more targeted fix.

What the PR does right

The race analysis in #421 is spot-on. The gap between m.data.Get(requestKey) and podCache.mu.Lock() in Add allows a concurrent Evict to remove the PodCache from m.data, and subsequent writes by Add go into an orphaned object. The PR closes this gap.

Concern 1: Global write lock serializes all mutations — even for unrelated keys

Add and Evict both acquire mu.Lock() (exclusive) for their entire duration. This means:

  • Two Add calls for completely different requestKeys cannot run concurrently.
  • Add and Evict for unrelated keys cannot run concurrently.
  • A single large Add(requestKeys=[k1,k2,...,kN]) holds the write lock while looping over all N keys, blocking every other operation.

The original per-PodCache mu allowed different keys to be modified in parallel. This PR replaces that with full serialization.

Concern 2: Lookup blocked by any mutation

Lookup takes RLock for its entire loop over requestKeys. This means every Lookup (the hot path — called on every inference request) is blocked whenever any Add or Evict is in progress, regardless of whether they touch the same keys.

Concern 3: GetRequestKey does not need a read lock

GetRequestKey is a single m.engineToRequestKeys.Get() call. lru.Cache is already internally thread-safe, so this is atomic without the outer lock. The added RLock/RUnlock is unnecessary overhead.

Concern 4: No regression test

This is the root cause behind the flaky test that PR #366 worked around by removing correctness assertions from ConcurrentOperations:

"These assertions are invalid because Evict can remove the entire PodCache when it becomes temporarily empty, causing pods from other goroutines to be lost."

The fix for #421 should include a targeted test that reproduces the specific race (concurrent Add + Evict-to-empty on the same key) and asserts that newly added entries are not lost. This would also validate that #366's workaround is no longer needed.

Suggested alternative: per-PodCache removed flag with retry

The race is specifically in the window between fetching a PodCache and locking it. A surgical fix that preserves fine-grained concurrency:

  1. Add a removed bool field to PodCache.
  2. Evict: when the cache empties, set removed = true while holding podCache.mu, then call m.data.Remove() — both under the same lock hold.
  3. Add: after acquiring podCache.mu, check removed. If true, unlock and retry (the retry will either find a new PodCache or create one).
// PodCache — add a removed flag
type PodCache struct {
    cache   *lru.Cache[PodEntry, struct{}]
    mu      sync.Mutex
    removed bool
}

// Evict — set removed + remove from map atomically under podCache.mu
podCache.mu.Lock()
for _, entry := range entries {
    podCache.cache.Remove(entry)
}
if podCache.cache.Len() == 0 {
    podCache.removed = true
    if cur, ok := m.data.Peek(requestKey); ok && cur == podCache {
        m.data.Remove(requestKey)
        m.engineToRequestKeys.Remove(engineKey)
    }
}
podCache.mu.Unlock()

// Add — check removed after acquiring lock, retry if stale
for attempt := 0; attempt < maxRetries; attempt++ {
    podCache := getOrCreatePodCache(requestKey)
    podCache.mu.Lock()
    if podCache.removed {
        // Stale: a concurrent Evict removed this PodCache between the
        // fetch and the lock; retry to find or create a fresh one.
        podCache.mu.Unlock()
        continue
    }
    for _, entry := range entries {
        podCache.cache.Add(entry, struct{}{})
    }
    podCache.mu.Unlock()
    return nil
}
// If retries are exhausted (pathological eviction churn), surface an error
// rather than silently dropping the entries.
return fmt.Errorf("add: PodCache for %v repeatedly evicted; retries exhausted", requestKey)

This approach:

  • Preserves concurrent Add/Evict on different keys (the original benefit of per-PodCache locking).
  • Does not block Lookup at all (no locking needed in the read path).
  • Only serializes operations on the same key, which is exactly the scope of the #421 race.
  • Uses Peek + pointer equality to avoid removing a replacement PodCache that a concurrent Add may have inserted.

@gyliu513
Contributor Author

Thanks @yankay, the comments are addressed. Could you take another look?

@gyliu513 force-pushed the race-memory branch 2 times, most recently from 2c291f6 to f8d98f1 on March 20, 2026 02:20
Signed-off-by: Guangya Liu <gyliu513@gmail.com>

Successfully merging this pull request may close these issues.

InMemoryIndex: lost writes under concurrent Add/Evict due to non-atomic compound operations
