fix: Close data race in InMemoryIndex Add/Evict with RWMutex #422
gyliu513 wants to merge 1 commit into llm-d:main
Conversation
yankay left a comment
Review: fix: Close data race in InMemoryIndex Add/Evict with RWMutex
The PR correctly identifies and fixes the race described in #421. However, the global RWMutex approach introduces a significant concurrency regression and removes the existing fine-grained locking entirely. I'd suggest a more targeted fix.
What the PR does right
The race analysis in #421 is spot-on. The gap between m.data.Get(requestKey) and podCache.mu.Lock() in Add allows a concurrent Evict to remove the PodCache from m.data, and subsequent writes by Add go into an orphaned object. The PR closes this gap.
Concern 1: Global write lock serializes all mutations — even for unrelated keys
Add and Evict both acquire mu.Lock() (exclusive) for their entire duration. This means:
- Two `Add` calls for completely different `requestKey`s cannot run concurrently.
- `Add` and `Evict` for unrelated keys cannot run concurrently.
- A single large `Add(requestKeys=[k1, k2, ..., kN])` holds the write lock while looping over all N keys, blocking every other operation.
The original per-PodCache mu allowed different keys to be modified in parallel. This PR replaces that with full serialization.
Concern 2: Lookup blocked by any mutation
Lookup takes RLock for its entire loop over requestKeys. This means every Lookup (the hot path — called on every inference request) is blocked whenever any Add or Evict is in progress, regardless of whether they touch the same keys.
Concern 3: GetRequestKey does not need a read lock
GetRequestKey is a single m.engineToRequestKeys.Get() call. lru.Cache is already internally thread-safe, so this is atomic without the outer lock. The added RLock/RUnlock is unnecessary overhead.
Concern 4: No regression test
This is the root cause behind the flaky test that PR #366 worked around by removing correctness assertions from ConcurrentOperations:
"These assertions are invalid because Evict can remove the entire PodCache when it becomes temporarily empty, causing pods from other goroutines to be lost."
The fix for #421 should include a targeted test that reproduces the specific race (concurrent Add + Evict-to-empty on the same key) and asserts that newly added entries are not lost. This would also validate that #366's workaround is no longer needed.
Suggested alternative: per-PodCache removed flag with retry
The race is specifically between one PodCache's fetch and lock. A surgical fix that preserves fine-grained concurrency:
- Add a `removed bool` field to `PodCache`.
- Evict: when the cache empties, set `removed = true` while holding `podCache.mu`, then call `m.data.Remove()` — both under the same lock hold.
- Add: after acquiring `podCache.mu`, check `removed`. If true, unlock and retry (the retry will either find a new `PodCache` or create one).
```go
// PodCache — add a removed flag
type PodCache struct {
	cache   *lru.Cache[PodEntry, struct{}]
	mu      sync.Mutex
	removed bool
}
```
```go
// Evict — set removed + remove from map atomically under podCache.mu
podCache.mu.Lock()
for _, entry := range entries {
	podCache.cache.Remove(entry)
}
if podCache.cache.Len() == 0 {
	podCache.removed = true
	if cur, ok := m.data.Peek(requestKey); ok && cur == podCache {
		m.data.Remove(requestKey)
		m.engineToRequestKeys.Remove(engineKey)
	}
}
podCache.mu.Unlock()
```
```go
// Add — check removed after acquiring lock, retry if stale
for attempt := 0; attempt < maxRetries; attempt++ {
	podCache := getOrCreatePodCache(requestKey)
	podCache.mu.Lock()
	if podCache.removed {
		podCache.mu.Unlock()
		continue
	}
	for _, entry := range entries {
		podCache.cache.Add(entry, struct{}{})
	}
	podCache.mu.Unlock()
	return nil
}
// Unreachable in practice, but the bounded loop needs a fallthrough result.
return errors.New("add: exhausted retries against concurrently evicted PodCache")
```

This approach:
- Preserves concurrent Add/Evict on different keys (the original benefit of per-PodCache locking).
- Does not block Lookup at all (no locking needed in the read path).
- Only serializes operations on the same key, which is exactly the scope of the #421 race.
- Uses `Peek` + pointer equality to avoid removing a replacement PodCache that a concurrent Add may have inserted.
Thanks @yankay, the comments are addressed. Can you help review and comment?
Force-pushed from 2c291f6 to f8d98f1.
Signed-off-by: Guangya Liu <gyliu513@gmail.com>
Fixed #421

- Added `sync.RWMutex` to `InMemoryIndex` to protect compound read-modify-write sequences in `Add` and `Evict` that were not atomic, allowing a concurrent `Evict` to remove a `PodCache` from `data` while `Add` was writing entries into it, silently losing those entries.
- `Lookup` and `GetRequestKey` use `RLock` (concurrent reads allowed); `Add` and `Evict` use `Lock` (mutually exclusive).
- Removed `PodCache.mu` and the double-check relocking pattern in `Evict`, simplifying the eviction path.