fix(cache-aware): gate hash_index hot-path writes behind explicit flag #1565

Open
ekzhang wants to merge 1 commit into main from ekzhang/cache-aware-hash-index-leak

Conversation

Collaborator

@ekzhang ekzhang commented May 27, 2026

Add an AtomicBool populate_hash_index field to CacheAwarePolicy (default false) and a set_populate_hash_index() setter. Gate the four hot-path inserts on the flag; mesh wiring is expected to call set_populate_hash_index(true) when attaching the policy to a TreeSyncAdapter. The cold-start apply_repair_page writes are not gated since they only run when mesh is actually applying remote pages.

Note: there is currently no production code in server.rs that wires the v2 mesh adapters to the policy (per the existing comment about v1->v2 migration landing in a follow-up). Until that wiring lands and opts in, the index stays empty and the memory leak is closed.

Description

Problem

The hash_index (DashMap<model_id, PerModelHashIndex>) is written from four select_worker_* request-hot-path sites on every request, but its only readers are mesh-only methods on the TreeHandle trait (apply_known_remote_insert reads it; apply_repair_page also writes to it during cold-start sync). When no mesh adapter is attached to the policy, these entries accumulate with no consumer, producing unbounded memory growth and OOMKills every ~6 hours in production.

Solution

Gate hash_index writes so that they happen only when mesh mode is enabled. The default (disabled) path then never populates the index, so it cannot leak.
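For readers who want the shape of the gate without opening the diff, here is a minimal sketch. It is not the code in cache_aware.rs: `GatedPolicy`, `record_prefix`, and `index_len` are hypothetical names, and a `Mutex<HashMap>` stands in for the real DashMap-backed index. Only the `populate_hash_index` field and the `set_populate_hash_index` / `should_populate_hash_index` names come from this PR.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Simplified stand-in for the policy (assumption: the real type holds a
/// DashMap-backed index and prefix trees, which are not modeled here).
pub struct GatedPolicy {
    /// Stand-in for hash_index: prefix hash -> matched prefix.
    hash_index: Mutex<HashMap<u64, String>>,
    /// Closed by default, so request paths write nothing until mesh opts in.
    populate_hash_index: AtomicBool,
}

impl GatedPolicy {
    pub fn new() -> Self {
        Self {
            hash_index: Mutex::new(HashMap::new()),
            populate_hash_index: AtomicBool::new(false),
        }
    }

    /// Takes &self so the flag can be flipped through a shared reference.
    pub fn set_populate_hash_index(&self, enabled: bool) {
        self.populate_hash_index.store(enabled, Ordering::Relaxed);
    }

    fn should_populate_hash_index(&self) -> bool {
        self.populate_hash_index.load(Ordering::Relaxed)
    }

    /// Hypothetical hot-path hook: the insert is skipped while the gate is closed.
    pub fn record_prefix(&self, hash: u64, prefix: &str) {
        if self.should_populate_hash_index() {
            self.hash_index
                .lock()
                .unwrap()
                .insert(hash, prefix.to_owned());
        }
    }

    /// Number of entries currently held by the stand-in index.
    pub fn index_len(&self) -> usize {
        self.hash_index.lock().unwrap().len()
    }
}
```

With the gate closed, `record_prefix` leaves the index empty; after `set_populate_hash_index(true)` the same call inserts an entry. Checking the flag before doing any hashing work keeps the disabled path close to free.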

Changes

We needed to add an AtomicBool field to make this work due to recent refactors that removed mesh_sync.

Test Plan

We've been running this in production for a while.

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Added a runtime toggle to enable or disable hot-path population of per-model cache hash indexes, letting operators control when request paths write index entries.
  • Bug Fixes

    • Ensures cache-index population is gated to avoid unintended writes during inactive sync scenarios, reducing unnecessary overhead and improving stability.

Review Change Stack

@ekzhang ekzhang requested a review from slin1237 as a code owner May 27, 2026 16:36

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 347b4f2 and b8ace41.

📒 Files selected for processing (1)
  • model_gateway/src/policies/cache_aware.rs

Walkthrough

Adds an atomic populate_hash_index gate to CacheAwarePolicy (default false) with set_populate_hash_index(), and conditions four hot-paths/routing sites to populate per-model hash_index only when the gate is enabled. A test now enables the gate before asserting behavior.

Changes

Hash-Index Population Control

| Layer | File(s) | Summary |
| --- | --- | --- |
| Gate field and public API | model_gateway/src/policies/cache_aware.rs | populate_hash_index: AtomicBool field added to CacheAwarePolicy, initialized to false. Adds pub fn set_populate_hash_index(&self, enabled: bool) and internal should_populate_hash_index(); imports updated for AtomicBool and Ordering. |
| Conditional hash-index population in routing and cache-update paths | model_gateway/src/policies/cache_aware.rs | Token/gRPC imbalanced cache-update, string/HTTP imbalanced cache-update, token-tree balanced routing, and string-tree balanced routing now check should_populate_hash_index() before computing matched-prefix values and inserting into hash_index.token_tree / hash_index.string_tree. |
| Test enablement for hot-path population | model_gateway/src/policies/cache_aware.rs | test_apply_known_remote_insert_from_request_hot_path updated to call policy.set_populate_hash_index(true) so hot-path population sites write resolvable metadata prior to asserting apply_known_remote_insert. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • lightseekorg/smg#1364: Related prior changes that added hash-index population sites which this PR gates.
  • lightseekorg/smg#1535: Related changes touching matched-prefix/hash computation used by CacheAwarePolicy.

Suggested reviewers

  • slin1237
  • tonyluj
  • llfl
  • claude

Poem

🐰 I tuck a tiny gate in code so neat,
To hush the hash when meshes sleep complete.
Atomic whiskers hold the path in check,
Hot hops only write when I say, "go trek!"
A rabbit's nod — the index stays discreet.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a gate (flag) to control hash_index hot-path writes in the cache-aware policy, which directly matches the core objective of preventing unbounded memory growth. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@github-actions github-actions Bot added the model-gateway Model gateway crate changes label May 27, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a gating mechanism (populate_hash_index via an AtomicBool) to control the population of hash_index on the request hot path in CacheAwarePolicy. This prevents memory leaks and potential OOM issues when mesh is disabled. A critical issue was identified in the test changes where a non-existent method set_mesh_sync is called, which will cause a compilation failure; it should be replaced with set_populate_hash_index(true).

Comment on lines +1347 to +1349
let stores = Arc::new(smg_mesh::StateStores::with_self_name("test".to_string()));
let mesh = Arc::new(smg_mesh::MeshSyncManager::new(stores, "test".to_string()));
policy.set_mesh_sync(Some(mesh));
critical

The test attempts to call policy.set_mesh_sync(Some(mesh)) which does not exist on CacheAwarePolicy due to recent refactors that removed mesh_sync. This will cause a compilation error. Since you introduced set_populate_hash_index to explicitly gate the hash_index population, you should call policy.set_populate_hash_index(true) instead.

Suggested change
- let stores = Arc::new(smg_mesh::StateStores::with_self_name("test".to_string()));
- let mesh = Arc::new(smg_mesh::MeshSyncManager::new(stores, "test".to_string()));
- policy.set_mesh_sync(Some(mesh));
+ policy.set_populate_hash_index(true);

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 347b4f207d

});
let stores = Arc::new(smg_mesh::StateStores::with_self_name("test".to_string()));
let mesh = Arc::new(smg_mesh::MeshSyncManager::new(stores, "test".to_string()));
policy.set_mesh_sync(Some(mesh));
P1: Replace the removed mesh setter in the test

When building tests, this new call does not resolve: CacheAwarePolicy now exposes set_populate_hash_index, and a repo-wide rg "fn set_mesh_sync|set_mesh_sync\\(" model_gateway/src finds no set_mesh_sync implementation other than this added test call. Any cargo test --all-targets/CI run that compiles this test module will fail before exercising the cache-aware changes; the test should enable the new hash-index gate instead.

claude Bot commented May 27, 2026

Review summary (cache_aware.rs):

The gating approach is sound — all four hash_index hot-path write sites are correctly guarded by the AtomicBool, tree inserts remain unconditional, apply_repair_page is correctly exempted (mesh-only by definition), and Ordering::Relaxed is the right choice for a simple gate flag with no data dependencies.
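To illustrate the ordering argument, the sketch below runs several worker threads against a Relaxed gate. It is an illustration under assumptions, not code from this PR: `count_gated_writes` is a hypothetical helper, and a counter stands in for the hash_index inserts. The flag guards no other memory, so a thread that briefly observes a stale value only skips or performs one extra insert; no reader depends on data published alongside the flag.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: spawns `workers` threads that each attempt
/// `attempts` gated writes, and returns how many writes actually happened.
pub fn count_gated_writes(gate_open: bool, workers: usize, attempts: usize) -> usize {
    let gate = Arc::new(AtomicBool::new(false));
    let writes = Arc::new(AtomicUsize::new(0));

    // The flag is set before any worker is spawned. Spawning a thread
    // synchronizes with the start of that thread, so every worker observes
    // this value even though the store itself is Relaxed.
    gate.store(gate_open, Ordering::Relaxed);

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let gate = Arc::clone(&gate);
            let writes = Arc::clone(&writes);
            thread::spawn(move || {
                for _ in 0..attempts {
                    // Same check shape as the hot-path gate: load, then write.
                    if gate.load(Ordering::Relaxed) {
                        writes.fetch_add(1, Ordering::Relaxed);
                    }
                }
            })
        })
        .collect();

    // Joining synchronizes with each worker's exit, so the final load
    // sees every increment.
    for handle in handles {
        handle.join().unwrap();
    }
    writes.load(Ordering::Relaxed)
}
```

A closed gate yields zero writes and an open gate yields `workers * attempts`. If the flag were later used to publish other state (for example, a pointer to a freshly attached adapter), Release/Acquire would be the safer pairing.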

Issues found:

| Severity | Count |
| --- | --- |
| 🔴 Important | 0 new (1 already flagged by Gemini) |
| 🟡 Nit | 0 |
| 🟣 Pre-existing | 0 |

The test at lines 1347-1349 won't compile: smg_mesh::StateStores and smg_mesh::MeshSyncManager are not defined in the smg-mesh crate, and set_mesh_sync is not a method on CacheAwarePolicy. As Gemini noted, replace with policy.set_populate_hash_index(true) and remove the unused stores/mesh lines.

Not approving due to the compilation error.

The hash_index (DashMap<model_id, PerModelHashIndex>) is written from
four select_worker_* request-hot-path sites on every request, but its
only readers are mesh-only methods on the TreeHandle trait
(apply_known_remote_insert reads, apply_repair_page also writes during
cold-start sync). When no mesh adapter is attached to the policy these
entries accumulate at ~300/sec with no consumer until per-model count
exceeds max_tree_size (4M), producing unbounded memory growth
(~10 GiB/hour) and OOMKills every ~6 hours in production.
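
As a rough sanity check on the figures above, the arithmetic can be written out. This is a back-of-the-envelope sketch that assumes a constant insert rate, a constant growth rate, and a single model; the helper names are hypothetical. At 300 entries/sec, one model's index reaches 4,000,000 entries after about 13,333 seconds (roughly 3.7 hours), and 10 GiB/hour sustained for 6 hours is about 60 GiB.

```rust
/// Seconds needed to accumulate `limit` entries at `rate_per_sec`
/// (assumes a constant rate).
pub fn seconds_to_reach(limit: u64, rate_per_sec: u64) -> f64 {
    limit as f64 / rate_per_sec as f64
}

/// Converts seconds to hours.
pub fn seconds_to_hours(seconds: f64) -> f64 {
    seconds / 3600.0
}

/// Total memory growth in GiB after `hours` at `gib_per_hour`
/// (assumes a constant growth rate).
pub fn growth_gib(gib_per_hour: f64, hours: f64) -> f64 {
    gib_per_hour * hours
}
```

Calling `seconds_to_hours(seconds_to_reach(4_000_000, 300))` gives the time for one model to hit max_tree_size, and `growth_gib(10.0, 6.0)` gives the approximate memory consumed by the time of an OOMKill.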

Add an AtomicBool populate_hash_index field to CacheAwarePolicy
(default false) and a set_populate_hash_index() setter. Gate the four
hot-path inserts on the flag; mesh wiring is expected to call
set_populate_hash_index(true) when attaching the policy to a
TreeSyncAdapter. The cold-start apply_repair_page writes are not
gated since they only run when mesh is actually applying remote
pages.

Note: there is currently no production code in server.rs that wires
the v2 mesh adapters to the policy (per the existing comment about
v1->v2 migration landing in a follow-up). Until that wiring lands and
opts in, the index stays empty and the memory leak is closed.

Signed-off-by: Eric Zhang <eric@thinkingmachines.ai>
@ekzhang ekzhang force-pushed the ekzhang/cache-aware-hash-index-leak branch from 347b4f2 to b8ace41 Compare May 27, 2026 18:28
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b8ace41be1

_eviction_task: eviction_task,
kv_monitor: RwLock::new(None),
hash_index,
populate_hash_index: AtomicBool::new(false),
P1: Enable hash indexing when attaching mesh

With this new default disabled, mesh-backed cache-aware policies stop recording request-hot-path hashes unless production wiring calls set_populate_hash_index(true). I checked the repo with rg "set_populate_hash_index" and the only caller is the unit test added below; TreeSyncAdapter::handle_incoming_batch still relies on apply_known_remote_insert resolving these hashes before it avoids repair. In an enabled mesh deployment, every prompt learned from normal traffic now looks unknown to peers, causing repair requests instead of applying tenant deltas, so the flag needs to be flipped where the real CacheAwarePolicy is attached to tree sync.
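
One way to address this concern is to make attaching and opting in a single step, so the flag cannot be forgotten at the call site. The sketch below is hypothetical: `Policy`, `SyncAdapter`, `attach_tree_sync`, and `populates_hash_index` are illustrative stand-ins, not the real CacheAwarePolicy or TreeSyncAdapter APIs, and the actual wiring in server.rs may look quite different. Only the `set_populate_hash_index` name comes from this PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for the policy; only the gate is modeled.
pub struct Policy {
    populate_hash_index: AtomicBool,
}

impl Policy {
    pub fn new() -> Self {
        Self {
            populate_hash_index: AtomicBool::new(false),
        }
    }

    pub fn set_populate_hash_index(&self, enabled: bool) {
        self.populate_hash_index.store(enabled, Ordering::Relaxed);
    }

    /// Hypothetical read accessor, used here only to observe the gate.
    pub fn populates_hash_index(&self) -> bool {
        self.populate_hash_index.load(Ordering::Relaxed)
    }
}

/// Hypothetical stand-in for the sync adapter that consumes the index.
pub struct SyncAdapter {
    policy: Arc<Policy>,
}

impl SyncAdapter {
    /// The policy this adapter was attached to.
    pub fn policy(&self) -> &Arc<Policy> {
        &self.policy
    }
}

/// Hypothetical wiring helper: opens the gate as part of attaching, so a
/// mesh-backed policy always records hot-path hashes.
pub fn attach_tree_sync(policy: Arc<Policy>) -> SyncAdapter {
    policy.set_populate_hash_index(true);
    SyncAdapter { policy }
}
```

A policy built with `Policy::new()` starts with the gate closed; after `attach_tree_sync(Arc::clone(&policy))` the gate is open for every holder of that `Arc`.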

Labels

model-gateway Model gateway crate changes


2 participants