fix(cache-aware): gate hash_index hot-path writes behind explicit flag#1565
fix(cache-aware): gate hash_index hot-path writes behind explicit flag#1565ekzhang wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds an atomic ChangesHash-Index Population Control
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a gating mechanism (populate_hash_index via an AtomicBool) to control the population of hash_index on the request hot path in CacheAwarePolicy. This prevents memory leaks and potential OOM issues when mesh is disabled. A critical issue was identified in the test changes where a non-existent method set_mesh_sync is called, which will cause a compilation failure; it should be replaced with set_populate_hash_index(true).
| let stores = Arc::new(smg_mesh::StateStores::with_self_name("test".to_string())); | ||
| let mesh = Arc::new(smg_mesh::MeshSyncManager::new(stores, "test".to_string())); | ||
| policy.set_mesh_sync(Some(mesh)); |
There was a problem hiding this comment.
The test attempts to call policy.set_mesh_sync(Some(mesh)) which does not exist on CacheAwarePolicy due to recent refactors that removed mesh_sync. This will cause a compilation error. Since you introduced set_populate_hash_index to explicitly gate the hash_index population, you should call policy.set_populate_hash_index(true) instead.
| let stores = Arc::new(smg_mesh::StateStores::with_self_name("test".to_string())); | |
| let mesh = Arc::new(smg_mesh::MeshSyncManager::new(stores, "test".to_string())); | |
| policy.set_mesh_sync(Some(mesh)); | |
| policy.set_populate_hash_index(true); |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 347b4f207d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| }); | ||
| let stores = Arc::new(smg_mesh::StateStores::with_self_name("test".to_string())); | ||
| let mesh = Arc::new(smg_mesh::MeshSyncManager::new(stores, "test".to_string())); | ||
| policy.set_mesh_sync(Some(mesh)); |
There was a problem hiding this comment.
Replace the removed mesh setter in the test
When building tests, this new call does not resolve: CacheAwarePolicy now exposes set_populate_hash_index, and a repo-wide rg "fn set_mesh_sync|set_mesh_sync\\(" model_gateway/src finds no set_mesh_sync implementation other than this added test call. Any cargo test --all-targets/CI run that compiles this test module will fail before exercising the cache-aware changes; the test should enable the new hash-index gate instead.
Useful? React with 👍 / 👎.
|
Review summary ( The gating approach is sound — all four Issues found:
The test at lines 1347-1349 won't compile: Not approving due to the compilation error. |
The hash_index (DashMap<model_id, PerModelHashIndex>) is written from four select_worker_* request-hot-path sites on every request, but its only readers are mesh-only methods on the TreeHandle trait (apply_known_remote_insert reads, apply_repair_page also writes during cold-start sync). When no mesh adapter is attached to the policy these entries accumulate at ~300/sec with no consumer until per-model count exceeds max_tree_size (4M), producing unbounded memory growth (~10 GiB/hour) and OOMKills every ~6 hours in production. Add an AtomicBool populate_hash_index field to CacheAwarePolicy (default false) and a set_populate_hash_index() setter. Gate the four hot-path inserts on the flag; mesh wiring is expected to call set_populate_hash_index(true) when attaching the policy to a TreeSyncAdapter. The cold-start apply_repair_page writes are not gated since they only run when mesh is actually applying remote pages. Note: there is currently no production code in server.rs that wires the v2 mesh adapters to the policy (per the existing comment about v1->v2 migration landing in a follow-up). Until that wiring lands and opts in, the index stays empty and the memory leak is closed. Signed-off-by: Eric Zhang <eric@thinkingmachines.ai>
347b4f2 to
b8ace41
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8ace41be1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| _eviction_task: eviction_task, | ||
| kv_monitor: RwLock::new(None), | ||
| hash_index, | ||
| populate_hash_index: AtomicBool::new(false), |
There was a problem hiding this comment.
Enable hash indexing when attaching mesh
With this new default disabled, mesh-backed cache-aware policies stop recording request-hot-path hashes unless production wiring calls set_populate_hash_index(true). I checked the repo with rg "set_populate_hash_index" and the only caller is the unit test added below; TreeSyncAdapter::handle_incoming_batch still relies on apply_known_remote_insert resolving these hashes before it avoids repair. In an enabled mesh deployment, every prompt learned from normal traffic now looks unknown to peers, causing repair requests instead of applying tenant deltas, so the flag needs to be flipped where the real CacheAwarePolicy is attached to tree sync.
Useful? React with 👍 / 👎.
Add an AtomicBool populate_hash_index field to CacheAwarePolicy (default false) and a set_populate_hash_index() setter. Gate the four hot-path inserts on the flag; mesh wiring is expected to call set_populate_hash_index(true) when attaching the policy to a TreeSyncAdapter. The cold-start apply_repair_page writes are not gated since they only run when mesh is actually applying remote pages.
Note: there is currently no production code in server.rs that wires the v2 mesh adapters to the policy (per the existing comment about v1->v2 migration landing in a follow-up). Until that wiring lands and opts in, the index stays empty and the memory leak is closed.
Description
Problem
The hash_index (DashMap<model_id, PerModelHashIndex>) is written from four select_worker_* request-hot-path sites on every request, but its only readers are mesh-only methods on the TreeHandle trait (apply_known_remote_insert reads, apply_repair_page also writes during cold-start sync). When no mesh adapter is attached to the policy these entries accumulate with no consumer, resulting in OOM crashes every ~15 minutes in production.
Solution
Gate writing on
hash_indexto only when mesh mode is enabled. This way, the disabled path (default) doesn't run into this issue.Changes
We needed to add an
AtomicBoolfield to make this work due to recent refactors that removedmesh_sync.Test Plan
We've been running this in production for a while.
Checklist
cargo +nightly fmtpassescargo clippy --all-targets --all-features -- -D warningspassesSummary by CodeRabbit
New Features
Bug Fixes