Problem
In the SGLang PD router path for OpenAI-compatible chat completions, the cache-aware routing text is currently built only from the first chat message:
let request_text = body.messages.first().and_then(|msg| { ... });
For typical chat workloads, this is usually just the system prompt, rather than the actual reusable prefill prefix:
system prompt
+ previous user/assistant/tool messages
+ current user message
+ tool definitions / template-expanded content
As a result, cache-aware routing decisions can be dominated by shared system-prompt overlap instead of the multi-turn conversation prefix that actually drives KV-cache reuse on workers.
Code pointers
Permalinks below reference upstream SGLang main at commit e86fdf3a3c43e311b7599ecfed7f30a7a2271a94.
policies_need_request_text() only requests request_text when required by the routing policy (sglang/sgl-model-gateway/src/routers/http/pd_router.rs, lines 787 to 790):

    fn policies_need_request_text(&self) -> bool {
        let prefill_policy = self.policy_registry.get_prefill_policy();
        let decode_policy = self.policy_registry.get_decode_policy();
        prefill_policy.needs_request_text() || decode_policy.needs_request_text()

/v1/chat/completions builds request_text from body.messages.first() only (sglang/sgl-model-gateway/src/routers/http/pd_router.rs, lines 1415 to 1454):

    async fn route_chat(
        &self,
        headers: Option<&HeaderMap>,
        body: &ChatCompletionRequest,
        model_id: Option<&str>,
    ) -> Response {
        let is_stream = body.stream;
        let return_logprob = body.logprobs;

        let request_text = if self.policies_need_request_text() {
            body.messages.first().and_then(|msg| match msg {
                ChatMessage::User { content, .. } => match content {
                    MessageContent::Text(text) => Some(text.clone()),
                    MessageContent::Parts(_) => None,
                },
                ChatMessage::Developer { content, .. } => match content {
                    MessageContent::Text(text) => Some(text.clone()),
                    MessageContent::Parts(_) => None,
                },
                ChatMessage::System { content, .. } => Some(content.to_simple_string()),
                _ => None,
            })
        } else {
            None
        };

        // Calculate batch size
        let batch_size = Self::get_chat_batch_size(body);

        let context = PDRequestContext {
            route: "/v1/chat/completions",
            batch_size,
            is_stream,
            return_logprob,
            request_text,
            model_id,
            headers: headers.cloned(),
        };

        self.execute_dual_dispatch(headers, body, context).await

By contrast, /generate uses body.text, which is much closer to the full prompt (sglang/sgl-model-gateway/src/routers/http/pd_router.rs, lines 1388 to 1412):

        body: &GenerateRequest,
        model_id: Option<&str>,
    ) -> Response {
        let is_stream = body.stream;
        let return_logprob = body.return_logprob.unwrap_or(false);

        let request_text = if self.policies_need_request_text() {
            body.text.as_deref().map(|s| s.to_string())
        } else {
            None
        };

        let batch_size = Self::get_generate_batch_size(body);

        let context = PDRequestContext {
            route: "/generate",
            batch_size,
            is_stream,
            return_logprob,
            request_text,
            model_id,
            headers: headers.cloned(),
        };

        self.execute_dual_dispatch(headers, body, context).await

The cache-aware policy maintains an approximate radix tree over request text and routes by prefix match ratio (sglang/sgl-model-gateway/src/policies/cache_aware.rs, lines 19 to 30):

    1. Cache-Aware Routing (Approximate Tree)
    -------------------------------------------
    This strategy maintains an approximate radix tree for each worker based on request history,
    eliminating the need for direct cache state queries. The tree stores raw text characters
    instead of token IDs to avoid tokenization overhead.

    Process:
    a. For each request, find the worker with the highest prefix match
    b. If match rate > cache_threshold:
       Route to the worker with highest match (likely has relevant data cached)
    c. If match rate ≤ cache_threshold:
       Route to the worker with smallest tree size (most available cache capacity)

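The process described in that doc comment can be sketched in a few lines. This is a simplified illustration, not the gateway's implementation: WorkerView, pick_worker, and their fields are hypothetical names, and the per-worker matched character counts are assumed to be given rather than looked up in a radix tree.

```rust
/// Hypothetical per-worker summary, used only for this sketch.
struct WorkerView {
    /// Chars of the request text already present in this worker's approximate tree.
    matched_chars: usize,
    /// Stand-in for the size of this worker's approximate tree.
    tree_size: usize,
}

/// Applies steps (a) to (c): returns the index of the chosen worker, or None if there are none.
fn pick_worker(workers: &[WorkerView], input_chars: usize, cache_threshold: f32) -> Option<usize> {
    // (a) Find the worker with the highest prefix match.
    let (best_idx, best) = workers
        .iter()
        .enumerate()
        .max_by_key(|(_, w)| w.matched_chars)?;

    // Match rate is matched chars over input chars; an empty input counts as no match.
    let match_rate = if input_chars == 0 {
        0.0
    } else {
        best.matched_chars as f32 / input_chars as f32
    };

    if match_rate > cache_threshold {
        // (b) Strong match: route to the best-matching worker.
        Some(best_idx)
    } else {
        // (c) Weak match: route to the worker with the smallest tree.
        workers
            .iter()
            .enumerate()
            .min_by_key(|(_, w)| w.tree_size)
            .map(|(idx, _)| idx)
    }
}
```

Because the match rate is a ratio over the routing text, its value depends entirely on what text the router feeds in. Note that the quoted select_worker code picks the minimum-load healthy worker in the low-match branch, whereas the doc comment (and this sketch) describe smallest tree size.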
Worker selection ultimately uses request_text via tree.prefix_match_with_counts(text) (sglang/sgl-model-gateway/src/policies/cache_aware.rs, lines 376 to 452):

    async fn select_worker(
        &self,
        workers: &[Arc<dyn Worker>],
        info: &SelectWorkerInfo<'_>,
    ) -> Option<usize> {
        let request_text = info.request_text;
        let healthy_indices = get_healthy_worker_indices(workers);

        if healthy_indices.is_empty() {
            return None;
        }

        // Determine the (pool, model) key for this set of workers — the router pre-filters
        // so every healthy worker here belongs to the same pool and same model.
        let pivot = workers[healthy_indices[0]].as_ref();
        let tree_key = tree_key_for_worker(pivot);

        // Get current load statistics - compute min/max in single pass without allocation
        let (min_load, max_load) = workers.iter().fold((usize::MAX, 0usize), |(min, max), w| {
            let load = w.load();
            (min.min(load), max.max(load))
        });
        let min_load = if min_load == usize::MAX { 0 } else { min_load };

        // Check if load is imbalanced
        let is_imbalanced = max_load.saturating_sub(min_load) > self.config.balance_abs_threshold
            && (max_load as f32) > (min_load as f32 * self.config.balance_rel_threshold);

        if is_imbalanced {
            return self.select_worker_min_load(
                workers,
                &request_text,
                &healthy_indices,
                &tree_key,
                max_load,
                min_load,
            );
        }

        // Use cache-aware routing when balanced
        let text = request_text.unwrap_or("");

        // Get the tree reference without locking the entire HashMap
        // DashMap only locks the specific shard containing this key
        let tree = self.trees.get(&tree_key).map(|entry| entry.value().clone());

        if let Some(tree) = tree {
            // Now we work with the tree without holding the HashMap lock
            // Use prefix_match_with_counts to avoid redundant chars().count() calls
            let result = tree.prefix_match_with_counts(text);
            let match_rate = if result.input_char_count == 0 {
                0.0
            } else {
                result.matched_char_count as f32 / result.input_char_count as f32
            };

            // Select worker without String allocation
            let selected_idx = if match_rate > self.config.cache_threshold {
                // Cache hit path: find worker by URL (compare &str directly, no allocation)
                let tenant_url: &str = &result.tenant;
                workers
                    .iter()
                    .position(|w| w.url() == tenant_url)
                    .filter(|&idx| workers[idx].is_healthy())
            } else {
                // Low cache match: use worker with minimum load
                healthy_indices
                    .iter()
                    .min_by_key(|&&idx| workers[idx].load())
                    .copied()
            };

            if let Some(idx) = selected_idx {
                // Update the tree with this request (use worker URL directly, no allocation)
                tree.insert(text, workers[idx].url());

                // Sync insert operation to mesh if enabled (no-op if mesh is not enabled)

Why this matters
This becomes more visible in disaggregated serving with DP-aware routing.
Even if the router can target a specific DP rank, the cache-aware policy still needs a routing signal that reflects actual KV reuse. Using only the first message can:
- overvalue unrelated conversations sharing the same system prompt;
- miss strong reuse between consecutive turns of the same conversation;
- route requests away from the DP rank that already owns the useful KV prefix.
Minimal example:
Conversation A, turn 1:
[system S, user A1]
Conversation A, turn 2:
[system S, user A1, assistant A1, user A2]
Conversation B, turn 1:
[system S, user B1]
With the current routing text, all three requests look identical to the router, because each one reduces to system S. Worker-side KV reuse, however, is much stronger between A turn 1 and A turn 2.
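The example can be made concrete with a small sketch that compares the two routing texts by shared prefix length. It assumes placeholder message bodies and a hypothetical newline-joined flattening for the all-messages case; none of these helper names exist in the gateway.

```rust
/// Number of leading chars shared by `a` and `b`.
fn common_prefix_chars(a: &str, b: &str) -> usize {
    a.chars().zip(b.chars()).take_while(|(x, y)| x == y).count()
}

/// Routing text as built today: the first message only.
fn first_message_text(messages: &[&str]) -> String {
    messages.first().map(|m| m.to_string()).unwrap_or_default()
}

/// Hypothetical alternative: every message, joined with newlines.
fn all_messages_text(messages: &[&str]) -> String {
    messages.join("\n")
}

/// The three requests from the example, with placeholder message bodies:
/// conversation A turn 1, conversation A turn 2, conversation B turn 1.
fn example_requests() -> [Vec<&'static str>; 3] {
    let s = "system: S";
    [
        vec![s, "user: A1"],
        vec![s, "user: A1", "assistant: A1", "user: A2"],
        vec![s, "user: B1"],
    ]
}
```

With first_message_text, all three requests produce the same string, so the tree cannot tell them apart. With all_messages_text, the text for A turn 1 is a complete prefix of the text for A turn 2, while B turn 1 diverges right after the shared system prompt.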
Observed impact
On an internal multi-turn coding-agent benchmark:
- no conversation affinity:
  - cache hit ~69%
  - output TPS ~678
- tuned native DP-aware/cache-aware routing:
  - cache hit ~85-90%
  - TPS still unstable and much lower than expected
- explicit conversation affinity / terminal-prefix affinity:
  - cache hit ~96%
  - output TPS ~1078-1085
The main difference is that conversation-affinity routing consistently sends later turns back to the same DP rank, matching the reusable KV prefix much more closely.
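This issue does not describe how the affinity routing in the benchmark was implemented. As a rough illustration of the idea only, one common way to send every turn of a conversation to the same rank is to hash a stable conversation key. The function names and the notion of a client-supplied conversation key are assumptions made for this sketch.

```rust
/// FNV-1a 64-bit hash: simple, and stable across processes and builds.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Maps a conversation key to a rank index in 0..num_ranks (None if there are no ranks).
/// The key is assumed to be stable across all turns of one conversation.
fn rank_for_conversation(conversation_key: &str, num_ranks: usize) -> Option<usize> {
    if num_ranks == 0 {
        return None;
    }
    Some((fnv1a64(conversation_key.as_bytes()) % num_ranks as u64) as usize)
}
```

A plain modulo like this remaps most conversations whenever the number of ranks changes, and it ignores load, so a real router would likely combine it with load checks or a consistent-hashing scheme.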
Expected behavior
For chat completions, cache-aware routing should support using a deterministic representation of the full chat prefill input instead of only messages[0].
Possible approaches:
- build routing text from all chat messages;
- optionally include tool/template-affecting fields;
- add a configurable mode such as first_message vs all_messages;
- expose an explicit routing/cache key hook for gateways or clients.
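A minimal sketch of the configurable-mode approach follows. The RoutingTextMode enum, the Msg struct, and the role-marker format are all hypothetical: they stand in for the gateway's richer ChatMessage type and do not try to reproduce worker-side chat templating.

```rust
/// Hypothetical configuration switch; these names are not existing gateway options.
#[derive(Clone, Copy, Debug, PartialEq)]
enum RoutingTextMode {
    FirstMessage,
    AllMessages,
}

/// Simplified stand-in for a chat message (the gateway's real ChatMessage enum is richer).
struct Msg<'a> {
    role: &'a str,
    text: &'a str,
}

/// Builds a deterministic routing string from the messages, or None if there are none.
fn build_routing_text(messages: &[Msg<'_>], mode: RoutingTextMode) -> Option<String> {
    match mode {
        // Mirrors the current behavior: only the first message contributes.
        RoutingTextMode::FirstMessage => messages.first().map(|m| m.text.to_string()),
        RoutingTextMode::AllMessages => {
            if messages.is_empty() {
                return None;
            }
            let mut out = String::new();
            for m in messages {
                // Role markers keep a user "hi" distinct from an assistant "hi".
                out.push('<');
                out.push_str(m.role);
                out.push('>');
                out.push_str(m.text);
                out.push('\n');
            }
            Some(out)
        }
    }
}
```

The property that matters for prefix matching is that, in all_messages mode, the routing text of an earlier turn is an exact prefix of the routing text of any later turn in the same conversation, since messages are only ever appended.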
The router likely does not need to perfectly reproduce worker-side chat templating to improve routing quality. However, first-message-only matching is too lossy for multi-turn chat workloads where KV reuse is dominated by conversation history rather than the shared system prompt.