
fix: Close data race in InMemoryIndex Add/Evict with RWMutex#422

Open
gyliu513 wants to merge 1 commit into llm-d:main from gyliu513:race-memory

Conversation

@gyliu513
Contributor

Fixes #421

  • Add a sync.RWMutex to InMemoryIndex to make the compound read-modify-write sequences in Add and Evict atomic. Previously, a concurrent Evict could remove a PodCache from data while Add was still writing entries into it, silently losing those entries.
  • Lookup and GetRequestKey use RLock (concurrent reads allowed); Add and Evict use Lock (mutually exclusive).
  • Remove the now-redundant PodCache.mu and the double-check relocking pattern in Evict, simplifying the eviction path.

Collaborator

@yankay left a comment


Review: fix: Close data race in InMemoryIndex Add/Evict with RWMutex

The PR correctly identifies and fixes the race described in #421. However, the global RWMutex approach introduces a significant concurrency regression and removes the existing fine-grained locking entirely. I'd suggest a more targeted fix.

What the PR does right

The race analysis in #421 is spot-on. The gap between m.data.Get(requestKey) and podCache.mu.Lock() in Add allows a concurrent Evict to remove the PodCache from m.data, and subsequent writes by Add go into an orphaned object. The PR closes this gap.

Concern 1: Global write lock serializes all mutations — even for unrelated keys

Add and Evict both acquire mu.Lock() (exclusive) for their entire duration. This means:

  • Two Add calls for completely different requestKeys cannot run concurrently.
  • Add and Evict for unrelated keys cannot run concurrently.
  • A single large Add(requestKeys=[k1,k2,...,kN]) holds the write lock while looping over all N keys, blocking every other operation.

The original per-PodCache mu allowed different keys to be modified in parallel. This PR replaces that with full serialization.

Concern 2: Lookup blocked by any mutation

Lookup takes RLock for its entire loop over requestKeys. This means every Lookup (the hot path — called on every inference request) is blocked whenever any Add or Evict is in progress, regardless of whether they touch the same keys.

Concern 3: GetRequestKey does not need a read lock

GetRequestKey is a single m.engineToRequestKeys.Get() call. lru.Cache is already internally thread-safe, so this is atomic without the outer lock. The added RLock/RUnlock is unnecessary overhead.

Concern 4: No regression test

This is the root cause behind the flaky test that PR #366 worked around by removing correctness assertions from ConcurrentOperations:

"These assertions are invalid because Evict can remove the entire PodCache when it becomes temporarily empty, causing pods from other goroutines to be lost."

The fix for #421 should include a targeted test that reproduces the specific race (concurrent Add + Evict-to-empty on the same key) and asserts that newly added entries are not lost. This would also validate that #366's workaround is no longer needed.

Suggested alternative: per-PodCache removed flag with retry

The race is specifically in the window between fetching a PodCache and locking it. A surgical fix that preserves fine-grained concurrency:

  1. Add a removed bool field to PodCache.
  2. Evict: when the cache empties, set removed = true while holding podCache.mu, then call m.data.Remove() — both under the same lock hold.
  3. Add: after acquiring podCache.mu, check removed. If true, unlock and retry (the retry will either find a new PodCache or create one).
// PodCache — add a removed flag
type PodCache struct {
    cache   *lru.Cache[PodEntry, struct{}]
    mu      sync.Mutex
    removed bool
}

// Evict — set removed + remove from map atomically under podCache.mu
podCache.mu.Lock()
for _, entry := range entries {
    podCache.cache.Remove(entry)
}
if podCache.cache.Len() == 0 {
    podCache.removed = true
    if cur, ok := m.data.Peek(requestKey); ok && cur == podCache {
        m.data.Remove(requestKey)
        m.engineToRequestKeys.Remove(engineKey)
    }
}
podCache.mu.Unlock()

// Add — check removed after acquiring lock, retry if stale
for attempt := 0; attempt < maxRetries; attempt++ {
    podCache := getOrCreatePodCache(requestKey)
    podCache.mu.Lock()
    if podCache.removed {
        // Stale: a concurrent Evict removed this PodCache between the
        // fetch and the lock; retry to find or create a fresh one.
        podCache.mu.Unlock()
        continue
    }
    for _, entry := range entries {
        podCache.cache.Add(entry, struct{}{})
    }
    podCache.mu.Unlock()
    return nil
}
// If retries are exhausted (pathological eviction churn), surface an error
// rather than silently dropping the entries.
return fmt.Errorf("add: PodCache for %v repeatedly evicted; retries exhausted", requestKey)

This approach:

  • Preserves concurrent Add/Evict on different keys (the original benefit of per-PodCache locking).
  • Does not block Lookup at all (no locking needed in the read path).
  • Only serializes operations on the same key, which is exactly the scope of the #421 race.
  • Uses Peek + pointer equality to avoid removing a replacement PodCache that a concurrent Add may have inserted.

@gyliu513
Contributor Author

Thanks @yankay, the comments are addressed. Could you take another look?

@gyliu513 force-pushed the race-memory branch 2 times, most recently from 2c291f6 to f8d98f1 on March 20, 2026 02:20
Signed-off-by: Guangya Liu <gyliu513@gmail.com>

Successfully merging this pull request may close these issues.

InMemoryIndex: lost writes under concurrent Add/Evict due to non-atomic compound operations
