feat(cache): add in flight deduping #4459
MasterPtato wants to merge 1 commit into 03-18-fix_cache_clean_up_lib
Conversation
🚅 Deployed to the rivet-pr-4459 environment in rivet-frontend
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. This stack of pull requests is managed by Graphite.
Code Review: feat(cache): add in flight deduping

The core idea of in-flight request deduplication is sound and addresses a real stampede problem. Here are my findings:

Critical Issue 1: Lease not released on getter error. If the getter returns an Err, try_join short-circuits via ?, so the lease release loop never executes. This permanently stalls subsequent requests for those keys (5-second wait per attempt) until the process restarts. A guard/defer pattern would ensure cleanup regardless of the error path.

Critical Issue 2: Broadcast not sent on empty resolution or getter error. broadcast_tx.send is only called when entries_values is non-empty. If the getter resolves no values or errors, waiters are stuck until IN_FLIGHT_TIMEOUT (5 seconds). The broadcast should be sent unconditionally after the getter completes. Combined with issue 1, a getter error causes both a leaked lease and a 5-second stall for all waiters.

Moderate Issue 3: HashMap iteration order creates implicit coupling. In req_config.rs, keys and cache_keys are unzipped from ctx.entries() with non-deterministic HashMap iteration order, then keys is zipped with cached_values from the driver. This works because both were derived from the same iterator in the same pass, but it is fragile. A Vec<(Key, RawCacheKey)> pairing would make the relationship explicit and safe. The same issue applies in the waiting-keys path (succeeded_keys / succeeded_cache_keys).

Moderate Issue 4: Rate limit tests silently removed. integration.rs contained test_rate_limit_basic and test_rate_limit_ip_isolation. These do not appear in any of the new test files (fetch.rs, in_flight.rs, ttl.rs). If rate limiting is still a feature of this crate, these tests should be preserved.

Minor Issue 5: timeout_falls_back_to_getter test adds 5 seconds to the test suite. The test necessarily waits for IN_FLIGHT_TIMEOUT. Making the timeout configurable via a cfg(test) override or a parameter on CacheInner would allow faster test runs.
Minor Issue 6: Inline await changes latency behavior on cache misses. The cache write was previously done in a background task; now it is awaited inline. This is likely intentional (to ensure the broadcast happens after the write), but it adds write latency to every cache-miss response. A short comment documenting the trade-off would help future readers.

Positive Observations:
- Switching GetterCtx from Vec to HashMap removes O(n) deduplication and makes key lookups O(1).
- scc::HashMap is an appropriate choice for concurrent in-flight tracking.
- Test coverage for the deduplication logic is solid: single waiter, multiple waiters, independent keys, mixed cached/in-flight, and timeout fallback are all covered.
- Removing the anyhow glob import aligns with the project style guidelines.
- The Driver::get signature change to a slice of RawCacheKey is a correct ergonomics improvement.
Good overall approach. Deduplicating concurrent cache misses with a broadcast-based mechanism is the right design. The refactor of GetterCtx from Vec to HashMap is a clean improvement. A few issues need attention before merge.

Critical: Lease not released on getter error. In req_config.rs, if the lease-holder getter fails, tokio::try_join! short-circuits and returns early. The broadcast_tx.send and in_flight.remove_async calls are never reached. Effect: waiters hang for the full IN_FLIGHT_TIMEOUT (5 s) before falling back. The in_flight entries are never removed and leak for the lifetime of CacheInner. Every future request for those keys will find a stale entry, subscribe to a dead broadcast channel, and always time out. Fix: use a RAII guard so cleanup always runs on drop, even on error.

Bug: Broadcast skipped when getter resolves no values. The broadcast send lives inside if !entries_values.is_empty(), but lease removal happens unconditionally after it. If the getter resolves nothing (entity not found), entries_values is empty, the broadcast is never sent, but leases are removed. Waiters hang the full 5 s before timing out: a correct result, but a severe latency penalty for a common not-found path. Fix: move broadcast_tx.send outside the if so it fires unconditionally after the write attempt.

Ordering: remove from in_flight before broadcasting. Current order: (1) write to cache, (2) broadcast, (3) remove lease from in_flight. Between steps 2 and 3, a new request can find the stale entry, subscribe to the already-consumed channel, and wait 5 s. Swap the order: remove from in_flight first, then broadcast. New requests after removal will do a fresh cache read (already populated) and return immediately.

Test coverage gaps: the new in_flight.rs tests are excellent, and two additional cases would be worth adding.
Minor: broadcast::channel with capacity 16 is created even when leased_keys is empty, and only one message is ever sent; capacity 1 is sufficient.

Overall: the core mechanism is sound, but the error-path cleanup is a real correctness/leak bug and the broadcast-skip-on-empty is a meaningful latency regression on not-found paths. Both should be fixed before merging.
