PD router cache-aware routing for chat completions only uses the first message #26263

@YAMY1234

Problem

In the SGLang PD router path for OpenAI-compatible chat completions, the cache-aware routing text is currently built only from the first chat message:

let request_text = body.messages.first().and_then(|msg| { ... });

For typical chat workloads, this is usually just the system prompt, rather than the actual reusable prefill prefix:

system prompt
+ previous user/assistant/tool messages
+ current user message
+ tool definitions / template-expanded content

As a result, cache-aware routing decisions can be dominated by shared system-prompt overlap instead of the multi-turn conversation prefix that actually drives KV-cache reuse on workers.


Code pointers

Permalinks below reference upstream SGLang main at commit e86fdf3a3c43e311b7599ecfed7f30a7a2271a94.

request_text is only built when the prefill or decode policy reports that it needs it, via policies_need_request_text():

fn policies_need_request_text(&self) -> bool {
    let prefill_policy = self.policy_registry.get_prefill_policy();
    let decode_policy = self.policy_registry.get_decode_policy();
    prefill_policy.needs_request_text() || decode_policy.needs_request_text()
}

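/v1/chat/completions builds request_text from body.messages.first() only, and yields None when that message has multi-part content or any role other than user, developer, or system: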

async fn route_chat(
    &self,
    headers: Option<&HeaderMap>,
    body: &ChatCompletionRequest,
    model_id: Option<&str>,
) -> Response {
    let is_stream = body.stream;
    let return_logprob = body.logprobs;
    let request_text = if self.policies_need_request_text() {
        body.messages.first().and_then(|msg| match msg {
            ChatMessage::User { content, .. } => match content {
                MessageContent::Text(text) => Some(text.clone()),
                MessageContent::Parts(_) => None,
            },
            ChatMessage::Developer { content, .. } => match content {
                MessageContent::Text(text) => Some(text.clone()),
                MessageContent::Parts(_) => None,
            },
            ChatMessage::System { content, .. } => Some(content.to_simple_string()),
            _ => None,
        })
    } else {
        None
    };
    // Calculate batch size
    let batch_size = Self::get_chat_batch_size(body);
    let context = PDRequestContext {
        route: "/v1/chat/completions",
        batch_size,
        is_stream,
        return_logprob,
        request_text,
        model_id,
        headers: headers.cloned(),
    };
    self.execute_dual_dispatch(headers, body, context).await
}

By contrast, /generate uses body.text, which is much closer to the full prompt:

    body: &GenerateRequest,
    model_id: Option<&str>,
) -> Response {
    let is_stream = body.stream;
    let return_logprob = body.return_logprob.unwrap_or(false);
    let request_text = if self.policies_need_request_text() {
        body.text.as_deref().map(|s| s.to_string())
    } else {
        None
    };
    let batch_size = Self::get_generate_batch_size(body);
    let context = PDRequestContext {
        route: "/generate",
        batch_size,
        is_stream,
        return_logprob,
        request_text,
        model_id,
        headers: headers.cloned(),
    };
    self.execute_dual_dispatch(headers, body, context).await

The cache-aware policy maintains an approximate radix tree over request text and routes by prefix match ratio:

1. Cache-Aware Routing (Approximate Tree)
-------------------------------------------
This strategy maintains an approximate radix tree for each worker based on request history,
eliminating the need for direct cache state queries. The tree stores raw text characters
instead of token IDs to avoid tokenization overhead.
Process:
  a. For each request, find the worker with the highest prefix match
  b. If match rate > cache_threshold:
     Route to the worker with highest match (likely has relevant data cached)
  c. If match rate ≤ cache_threshold:
     Route to the worker with smallest tree size (most available cache capacity)
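
The process described in that docstring can be sketched as follows. This is a simplified illustration, not the SGLang implementation: WorkerHistory, best_match, size, and route are hypothetical names, and a plain list of previously routed texts stands in for the approximate radix tree. The fallback follows the docstring (smallest tree); the select_worker excerpt below falls back to minimum load instead.

```rust
/// Number of leading characters shared by `a` and `b`.
fn common_prefix_chars(a: &str, b: &str) -> usize {
    a.chars().zip(b.chars()).take_while(|(x, y)| x == y).count()
}

/// Hypothetical stand-in for one worker's approximate tree:
/// simply the request texts previously routed to that worker.
struct WorkerHistory {
    url: &'static str,
    seen: Vec<&'static str>,
}

impl WorkerHistory {
    /// Longest prefix (in chars) that `text` shares with anything this worker has seen.
    fn best_match(&self, text: &str) -> usize {
        self.seen
            .iter()
            .map(|s| common_prefix_chars(s, text))
            .max()
            .unwrap_or(0)
    }

    /// Crude "tree size": total characters stored for this worker.
    fn size(&self) -> usize {
        self.seen.iter().map(|s| s.chars().count()).sum()
    }
}

/// Steps a-c of the docstring: pick the best-matching worker if the match
/// rate exceeds the threshold, otherwise the worker with the smallest history.
fn route(workers: &[WorkerHistory], text: &str, cache_threshold: f32) -> Option<&'static str> {
    let input_chars = text.chars().count();
    let (best, matched) = workers
        .iter()
        .map(|w| (w, w.best_match(text)))
        .max_by_key(|&(_, m)| m)?;
    let match_rate = if input_chars == 0 {
        0.0
    } else {
        matched as f32 / input_chars as f32
    };
    if match_rate > cache_threshold {
        Some(best.url)
    } else {
        workers.iter().min_by_key(|w| w.size()).map(|w| w.url)
    }
}
```

Because the match rate is normalized by the length of the incoming text, whatever string is passed as text directly determines both which worker wins and whether the threshold is cleared.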

Worker selection ultimately uses request_text via tree.prefix_match_with_counts(text):

async fn select_worker(
    &self,
    workers: &[Arc<dyn Worker>],
    info: &SelectWorkerInfo<'_>,
) -> Option<usize> {
    let request_text = info.request_text;
    let healthy_indices = get_healthy_worker_indices(workers);
    if healthy_indices.is_empty() {
        return None;
    }
    // Determine the (pool, model) key for this set of workers — the router pre-filters
    // so every healthy worker here belongs to the same pool and same model.
    let pivot = workers[healthy_indices[0]].as_ref();
    let tree_key = tree_key_for_worker(pivot);
    // Get current load statistics - compute min/max in single pass without allocation
    let (min_load, max_load) = workers.iter().fold((usize::MAX, 0usize), |(min, max), w| {
        let load = w.load();
        (min.min(load), max.max(load))
    });
    let min_load = if min_load == usize::MAX { 0 } else { min_load };
    // Check if load is imbalanced
    let is_imbalanced = max_load.saturating_sub(min_load) > self.config.balance_abs_threshold
        && (max_load as f32) > (min_load as f32 * self.config.balance_rel_threshold);
    if is_imbalanced {
        return self.select_worker_min_load(
            workers,
            &request_text,
            &healthy_indices,
            &tree_key,
            max_load,
            min_load,
        );
    }
    // Use cache-aware routing when balanced
    let text = request_text.unwrap_or("");
    // Get the tree reference without locking the entire HashMap
    // DashMap only locks the specific shard containing this key
    let tree = self.trees.get(&tree_key).map(|entry| entry.value().clone());
    if let Some(tree) = tree {
        // Now we work with the tree without holding the HashMap lock
        // Use prefix_match_with_counts to avoid redundant chars().count() calls
        let result = tree.prefix_match_with_counts(text);
        let match_rate = if result.input_char_count == 0 {
            0.0
        } else {
            result.matched_char_count as f32 / result.input_char_count as f32
        };
        // Select worker without String allocation
        let selected_idx = if match_rate > self.config.cache_threshold {
            // Cache hit path: find worker by URL (compare &str directly, no allocation)
            let tenant_url: &str = &result.tenant;
            workers
                .iter()
                .position(|w| w.url() == tenant_url)
                .filter(|&idx| workers[idx].is_healthy())
        } else {
            // Low cache match: use worker with minimum load
            healthy_indices
                .iter()
                .min_by_key(|&&idx| workers[idx].load())
                .copied()
        };
        if let Some(idx) = selected_idx {
            // Update the tree with this request (use worker URL directly, no allocation)
            tree.insert(text, workers[idx].url());
            // Sync insert operation to mesh if enabled (no-op if mesh is not enabled)
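
Note that the excerpt only reaches prefix matching when load is considered balanced. A minimal sketch of that gate, mirroring the is_imbalanced expression above as a free function; the function name and parameter order are assumptions made for illustration.

```rust
/// Mirrors the imbalance check from the excerpt: load counts as imbalanced
/// only when BOTH the absolute gap and the relative ratio exceed their thresholds.
fn is_imbalanced(min_load: usize, max_load: usize, abs_threshold: usize, rel_threshold: f32) -> bool {
    max_load.saturating_sub(min_load) > abs_threshold
        && (max_load as f32) > (min_load as f32 * rel_threshold)
}
```

For example, with made-up thresholds of 32 (absolute) and 1.5 (relative), which are not claimed to be SGLang defaults, loads of (2, 40) count as imbalanced and bypass prefix matching, while (30, 40) and (100, 140) do not: the first pair fails the absolute test, the second fails the relative one.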


Why this matters

This becomes more visible in disaggregated serving with DP-aware routing.

Even if the router can target a specific DP rank, the cache-aware policy still needs a routing signal that reflects actual KV reuse. Using only the first message can:

  • overvalue unrelated conversations sharing the same system prompt;
  • miss strong reuse between consecutive turns of the same conversation;
  • route requests away from the DP rank that already owns the useful KV prefix.

Minimal example:

Conversation A, turn 1:
  [system S, user A1]

Conversation A, turn 2:
  [system S, user A1, assistant A1, user A2]

Conversation B, turn 1:
  [system S, user B1]

With the current routing text, all three requests look identical to the router, because request_text is just system S in each case. Worker-side KV reuse, however, is much stronger between A turn 1 and A turn 2 than between either of them and B turn 1.
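
The example above can be made concrete with a small sketch. It assumes worker 1 has already served A turn 1 and worker 2 has served B turn 1, and asks how many leading characters of A turn 2 each worker would match. The helper names, the (role, content) tuple representation, and the tagged serialization in all_messages_text are all hypothetical; they are not the router's types or format.

```rust
/// Number of leading characters shared by `a` and `b`.
fn common_prefix_chars(a: &str, b: &str) -> usize {
    a.chars().zip(b.chars()).take_while(|(x, y)| x == y).count()
}

/// Routing text as in the current behavior: content of the first message only.
fn first_message_text(messages: &[(&str, &str)]) -> String {
    messages
        .first()
        .map(|&(_, content)| content.to_string())
        .unwrap_or_default()
}

/// Hypothetical alternative: every message, serialized in order with a role tag.
fn all_messages_text(messages: &[(&str, &str)]) -> String {
    messages
        .iter()
        .map(|(role, content)| format!("<{role}>{content}\n"))
        .collect()
}
```

With first_message_text, A turn 2 matches worker 1 and worker 2 equally (both hold exactly S), so the router has no signal to prefer the worker that owns conversation A. With all_messages_text, the text for A turn 1 is a complete prefix of the text for A turn 2, so worker 1 matches strictly more characters than worker 2.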


Observed impact

On an internal multi-turn coding-agent benchmark:

  • no conversation affinity:
    • cache hit ~69%
    • output TPS ~678
  • tuned native DP-aware/cache-aware routing:
    • cache hit ~85-90%
    • TPS still unstable and much lower than expected
  • explicit conversation affinity / terminal-prefix affinity:
    • cache hit ~96%
    • output TPS ~1078-1085

The main difference is that conversation-affinity routing consistently sends later turns back to the same DP rank, matching the reusable KV prefix much more closely.
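
How the affinity routing in this benchmark was implemented is not described here. As one illustration of the general idea only, a gateway could hash a stable per-conversation key onto a DP rank, so that every turn of a conversation lands on the same rank. The function name, the key, and the hash-modulo scheme below are assumptions, not the setup that produced the numbers above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical affinity function: map a stable conversation key to one of
/// `num_ranks` DP ranks. Returns None when there are no ranks to choose from.
fn affinity_rank(conversation_key: &str, num_ranks: usize) -> Option<usize> {
    if num_ranks == 0 {
        return None;
    }
    let mut hasher = DefaultHasher::new();
    conversation_key.hash(&mut hasher);
    Some((hasher.finish() % num_ranks as u64) as usize)
}
```

A plain modulo remaps most conversations whenever the number of ranks changes, and it ignores load entirely, so a real deployment would likely need something more careful. The sketch only shows why affinity keeps later turns on the rank that already holds the conversation's KV prefix.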


Expected behavior

For chat completions, cache-aware routing should support using a deterministic representation of the full chat prefill input instead of only messages[0].

Possible approaches:

  • build routing text from all chat messages;
  • optionally include tool/template-affecting fields;
  • add a configurable mode such as first_message vs all_messages;
  • expose an explicit routing/cache key hook for gateways or clients.
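
A rough sketch of the first and third approaches combined, under stated assumptions: Msg is a simplified stand-in for the router's ChatMessage (text is None for multi-part content), and RoutingTextMode, msg, and build_routing_text are hypothetical names. The FirstMessage arm is simplified and does not reproduce the per-role handling in route_chat. The serialization is intentionally naive; the property that matters is that it is deterministic and append-only, so the text for turn N is a prefix of the text for turn N+1.

```rust
/// Hypothetical configuration switch between the current and proposed behavior.
#[derive(Clone, Copy, Debug, PartialEq)]
enum RoutingTextMode {
    FirstMessage,
    AllMessages,
}

/// Simplified stand-in for a chat message; `text` is None for multi-part content.
struct Msg {
    role: String,
    text: Option<String>,
}

/// Convenience constructor for text-only messages.
fn msg(role: &str, text: &str) -> Msg {
    Msg { role: role.to_string(), text: Some(text.to_string()) }
}

/// Build the string handed to the cache-aware policy.
fn build_routing_text(messages: &[Msg], mode: RoutingTextMode) -> Option<String> {
    match mode {
        // Roughly the current behavior: only the first message contributes.
        RoutingTextMode::FirstMessage => messages.first().and_then(|m| m.text.clone()),
        // Proposed behavior: serialize every message in order.
        RoutingTextMode::AllMessages => {
            if messages.is_empty() {
                return None;
            }
            let mut out = String::new();
            for m in messages {
                out.push_str(&m.role);
                out.push('\n');
                // Multi-part content contributes only its role line in this sketch.
                if let Some(t) = &m.text {
                    out.push_str(t);
                }
                out.push('\n');
            }
            Some(out)
        }
    }
}
```

Keeping FirstMessage available as a mode would preserve the existing behavior for deployments that rely on it, while AllMessages gives the approximate tree a prefix that grows with the conversation.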

The router likely does not need to perfectly reproduce worker-side chat templating to improve routing quality. However, first-message-only matching is too lossy for multi-turn chat workloads where KV reuse is dominated by conversation history rather than the shared system prompt.
