# PD router cache-aware routing for chat completions only uses the first message (#26263)


## Problem

In the SGLang PD router path for OpenAI-compatible chat completions, the cache-aware routing text is currently built only from the first chat message:

```rust
let request_text = body.messages.first().and_then(|msg| { ... });
```

For typical chat workloads, the first message is usually just the system prompt, whereas the prefill prefix that workers can actually reuse is built from:

```text
system prompt
+ previous user/assistant/tool messages
+ current user message
+ tool definitions / template-expanded content
```

As a result, cache-aware routing decisions can be dominated by shared system-prompt overlap instead of the multi-turn conversation prefix that actually drives KV-cache reuse on workers.

---

## Code pointers

Permalinks below reference upstream SGLang main at commit `e86fdf3a3c43e311b7599ecfed7f30a7a2271a94`.

`policies_need_request_text()` returns `true` when either the prefill or the decode policy needs request text, so `request_text` is only built in that case:

https://github.com/sgl-project/sglang/blob/e86fdf3a3c43e311b7599ecfed7f30a7a2271a94/sgl-model-gateway/src/routers/http/pd_router.rs#L787-L790

`/v1/chat/completions` builds `request_text` from `body.messages.first()` only:

https://github.com/sgl-project/sglang/blob/e86fdf3a3c43e311b7599ecfed7f30a7a2271a94/sgl-model-gateway/src/routers/http/pd_router.rs#L1415-L1454

By contrast, `/generate` uses `body.text`, which is much closer to the full prompt:

https://github.com/sgl-project/sglang/blob/e86fdf3a3c43e311b7599ecfed7f30a7a2271a94/sgl-model-gateway/src/routers/http/pd_router.rs#L1388-L1412

The cache-aware policy maintains an approximate radix tree over request text and routes by prefix match ratio:

https://github.com/sgl-project/sglang/blob/e86fdf3a3c43e311b7599ecfed7f30a7a2271a94/sgl-model-gateway/src/policies/cache_aware.rs#L19-L30

Worker selection ultimately uses `request_text` via `tree.prefix_match_with_counts(text)`:

https://github.com/sgl-project/sglang/blob/e86fdf3a3c43e311b7599ecfed7f30a7a2271a94/sgl-model-gateway/src/policies/cache_aware.rs#L376-L452
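
The selection rule these pointers describe can be condensed into a small sketch. This is not the upstream implementation: `Candidate` and `pick_worker` are hypothetical names, the per-worker `matched_chars` value stands in for what the shared radix tree would report, and the load-imbalance shortcut is left out. Only the threshold rule is taken from the referenced code: a match rate above `cache_threshold` selects the best-matching worker, and anything else falls back to the least-loaded worker.

```rust
/// Hypothetical, simplified view of one worker as seen by the policy.
struct Candidate {
    url: &'static str,
    /// Current number of in-flight requests.
    load: usize,
    /// Characters of the routing text already present for this worker
    /// (assumed to come from a prefix lookup; not computed here).
    matched_chars: usize,
}

/// Returns the URL of the chosen worker, or `None` if there are no candidates.
fn pick_worker(
    candidates: &[Candidate],
    input_chars: usize,
    cache_threshold: f32,
) -> Option<&'static str> {
    // Worker with the longest matched prefix.
    let best = candidates.iter().max_by_key(|c| c.matched_chars)?;

    // Match rate is relative to the length of the routing text.
    let match_rate = if input_chars == 0 {
        0.0
    } else {
        best.matched_chars as f32 / input_chars as f32
    };

    if match_rate > cache_threshold {
        // Cache-hit path: follow the prefix match.
        Some(best.url)
    } else {
        // Low match: fall back to the least-loaded worker.
        candidates.iter().min_by_key(|c| c.load).map(|c| c.url)
    }
}
```

Because the match rate is computed over whatever string is passed in as routing text, the quality of that string directly bounds the quality of the decision.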

---

## Why this matters

The problem is most visible in prefill/decode (PD) disaggregated serving with data-parallel (DP) aware routing.

Even if the router can target a specific DP rank, the cache-aware policy still needs a routing signal that reflects actual KV reuse. Using only the first message can:

* overvalue unrelated conversations sharing the same system prompt;
* miss strong reuse between consecutive turns of the same conversation;
* route requests away from the DP rank that already owns the useful KV prefix.

Minimal example:

```text
Conversation A, turn 1:
  [system S, user A1]

Conversation A, turn 2:
  [system S, user A1, assistant A1, user A2]

Conversation B, turn 1:
  [system S, user B1]
```

With first-message-only routing text, all three requests reduce to the same string (`system S`), so the router cannot tell them apart, even though worker-side KV reuse is much stronger between A turn 1 and A turn 2 than between either of them and B turn 1.
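
A minimal sketch of that comparison, under stated assumptions: messages are plain strings joined with newlines (not real chat templating), and `shared_prefix_chars`, `first_message_text`, and `all_messages_text` are hypothetical helpers rather than router functions. The incoming request is A turn 2; the two earlier requests are the candidates it could be matched against.

```rust
/// Number of leading characters that `cached` and `request` have in common.
fn shared_prefix_chars(cached: &str, request: &str) -> usize {
    cached
        .chars()
        .zip(request.chars())
        .take_while(|(a, b)| a == b)
        .count()
}

/// Routing text as built today: the first message only.
fn first_message_text(messages: &[&str]) -> String {
    messages.first().map(|m| m.to_string()).unwrap_or_default()
}

/// Hypothetical alternative: every message, joined with newlines.
fn all_messages_text(messages: &[&str]) -> String {
    messages.join("\n")
}
```

With `a1 = ["system S", "user A1"]`, `b1 = ["system S", "user B1"]`, and `a2 = ["system S", "user A1", "assistant A1", "user A2"]`, the first-message text of `a2` shares 8 characters with both `a1` and `b1`, a tie. The all-messages text of `a2` shares 16 characters with `a1` but only 14 with `b1`, so the worker that served A turn 1 becomes the better match.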

---

## Observed impact

On an internal multi-turn coding-agent benchmark:

* no conversation affinity:

  * cache hit ~69%
  * output TPS ~678

* tuned native DP-aware/cache-aware routing:

  * cache hit ~85-90%
  * TPS still unstable and much lower than expected

* explicit conversation affinity / terminal-prefix affinity:

  * cache hit ~96%
  * output TPS ~1078-1085

The main difference is that conversation-affinity routing consistently sends later turns back to the same DP rank, matching the reusable KV prefix much more closely.
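
The affinity mechanism used in the benchmark is not described here. One common way to get this behavior, shown purely as an illustration, is to hash a stable conversation key to a DP rank. The names `fnv1a_64` and `dp_rank_for` are hypothetical, as is the assumption that the client or gateway supplies a conversation key; FNV-1a is used only because it is simple and gives the same result across processes.

```rust
/// 64-bit FNV-1a hash: deterministic across processes and Rust versions.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Maps a conversation key (assumed to be supplied by the client or gateway)
/// to a DP rank in `0..dp_size`. Returns `None` when there are no ranks.
fn dp_rank_for(conversation_key: &str, dp_size: usize) -> Option<usize> {
    if dp_size == 0 {
        return None;
    }
    Some((fnv1a_64(conversation_key.as_bytes()) % dp_size as u64) as usize)
}
```

Every turn that carries the same key lands on the same rank, which is the property the affinity results above depend on. Unlike the cache-aware policy, a plain modulo mapping ignores load and remaps most conversations when `dp_size` changes.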

---

## Expected behavior

For chat completions, cache-aware routing should support using a deterministic representation of the full chat prefill input instead of only `messages[0]`.

Possible approaches:

* build routing text from all chat messages;
* optionally include tool/template-affecting fields;
* add a configurable mode such as `first_message` vs `all_messages`;
* expose an explicit routing/cache key hook for gateways or clients.

The router likely does not need to perfectly reproduce worker-side chat templating to improve routing quality. However, first-message-only matching is too lossy for multi-turn chat workloads where KV reuse is dominated by conversation history rather than the shared system prompt.
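
A rough sketch of the configurable-mode idea, assuming a simplified message type: `RoutingTextMode`, `Message`, and `routing_text` are hypothetical names, and the `role\ncontent\n` rendering is an arbitrary deterministic format, not the worker-side chat template. Tool definitions and multi-part content are ignored.

```rust
/// Hypothetical configuration switch for how chat routing text is built.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RoutingTextMode {
    /// Current behavior: only the first message.
    FirstMessage,
    /// Proposed behavior: every message, in order.
    AllMessages,
}

/// Simplified stand-in for a chat message (text content only).
struct Message<'a> {
    role: &'a str,
    content: &'a str,
}

/// Builds deterministic routing text, or `None` when there are no messages.
fn routing_text(messages: &[Message<'_>], mode: RoutingTextMode) -> Option<String> {
    match mode {
        RoutingTextMode::FirstMessage => messages.first().map(|m| m.content.to_string()),
        RoutingTextMode::AllMessages => {
            if messages.is_empty() {
                return None;
            }
            let mut out = String::new();
            for m in messages {
                // Append-only rendering: an earlier turn's text is always a
                // prefix of a later turn's text in the same conversation.
                out.push_str(m.role);
                out.push('\n');
                out.push_str(m.content);
                out.push('\n');
            }
            Some(out)
        }
    }
}
```

The property that matters for a prefix tree is that the rendering is append-only: the text for turn N is a strict prefix of the text for turn N+1, so consecutive turns of one conversation match deeply while unrelated conversations diverge right after the system prompt.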

Excerpt of `pd_router.rs#L787-L790` (`policies_need_request_text`):

	fn policies_need_request_text(&self) -> bool {
	    let prefill_policy = self.policy_registry.get_prefill_policy();
	    let decode_policy = self.policy_registry.get_decode_policy();
	    prefill_policy.needs_request_text() || decode_policy.needs_request_text()

Excerpt of `pd_router.rs#L1415-L1454` (`route_chat`, the `/v1/chat/completions` path):

	async fn route_chat(
	    &self,
	    headers: Option<&HeaderMap>,
	    body: &ChatCompletionRequest,
	    model_id: Option<&str>,
	) -> Response {
	    let is_stream = body.stream;
	    let return_logprob = body.logprobs;

	    let request_text = if self.policies_need_request_text() {
	        body.messages.first().and_then(|msg| match msg {
	            ChatMessage::User { content, .. } => match content {
	                MessageContent::Text(text) => Some(text.clone()),
	                MessageContent::Parts(_) => None,
	            },
	            ChatMessage::Developer { content, .. } => match content {
	                MessageContent::Text(text) => Some(text.clone()),
	                MessageContent::Parts(_) => None,
	            },
	            ChatMessage::System { content, .. } => Some(content.to_simple_string()),
	            _ => None,
	        })
	    } else {
	        None
	    };

	    // Calculate batch size
	    let batch_size = Self::get_chat_batch_size(body);

	    let context = PDRequestContext {
	        route: "/v1/chat/completions",
	        batch_size,
	        is_stream,
	        return_logprob,
	        request_text,
	        model_id,
	        headers: headers.cloned(),
	    };

	    self.execute_dual_dispatch(headers, body, context).await

Excerpt of `pd_router.rs#L1388-L1412` (the `/generate` path; the linked range starts mid-signature):

	    body: &GenerateRequest,
	    model_id: Option<&str>,
	) -> Response {
	    let is_stream = body.stream;
	    let return_logprob = body.return_logprob.unwrap_or(false);

	    let request_text = if self.policies_need_request_text() {
	        body.text.as_deref().map(|s| s.to_string())
	    } else {
	        None
	    };

	    let batch_size = Self::get_generate_batch_size(body);

	    let context = PDRequestContext {
	        route: "/generate",
	        batch_size,
	        is_stream,
	        return_logprob,
	        request_text,
	        model_id,
	        headers: headers.cloned(),
	    };

	    self.execute_dual_dispatch(headers, body, context).await

Excerpt of `cache_aware.rs#L19-L30` (module documentation):

	1. Cache-Aware Routing (Approximate Tree)
	-------------------------------------------
	This strategy maintains an approximate radix tree for each worker based on request history,
	eliminating the need for direct cache state queries. The tree stores raw text characters
	instead of token IDs to avoid tokenization overhead.

	Process:
	a. For each request, find the worker with the highest prefix match
	b. If match rate > cache_threshold:
	   Route to the worker with highest match (likely has relevant data cached)
	c. If match rate ≤ cache_threshold:
	   Route to the worker with smallest tree size (most available cache capacity)

Excerpt of `cache_aware.rs#L376-L452` (`select_worker`; the linked range ends mid-function):

	async fn select_worker(
	    &self,
	    workers: &[Arc<dyn Worker>],
	    info: &SelectWorkerInfo<'_>,
	) -> Option<usize> {
	    let request_text = info.request_text;
	    let healthy_indices = get_healthy_worker_indices(workers);

	    if healthy_indices.is_empty() {
	        return None;
	    }

	    // Determine the (pool, model) key for this set of workers; the router pre-filters
	    // so every healthy worker here belongs to the same pool and same model.
	    let pivot = workers[healthy_indices[0]].as_ref();
	    let tree_key = tree_key_for_worker(pivot);

	    // Get current load statistics - compute min/max in single pass without allocation
	    let (min_load, max_load) = workers.iter().fold((usize::MAX, 0usize), |(min, max), w| {
	        let load = w.load();
	        (min.min(load), max.max(load))
	    });
	    let min_load = if min_load == usize::MAX { 0 } else { min_load };

	    // Check if load is imbalanced
	    let is_imbalanced = max_load.saturating_sub(min_load) > self.config.balance_abs_threshold
	        && (max_load as f32) > (min_load as f32 * self.config.balance_rel_threshold);

	    if is_imbalanced {
	        return self.select_worker_min_load(
	            workers,
	            &request_text,
	            &healthy_indices,
	            &tree_key,
	            max_load,
	            min_load,
	        );
	    }

	    // Use cache-aware routing when balanced
	    let text = request_text.unwrap_or("");

	    // Get the tree reference without locking the entire HashMap
	    // DashMap only locks the specific shard containing this key
	    let tree = self.trees.get(&tree_key).map(|entry| entry.value().clone());

	    if let Some(tree) = tree {
	        // Now we work with the tree without holding the HashMap lock
	        // Use prefix_match_with_counts to avoid redundant chars().count() calls
	        let result = tree.prefix_match_with_counts(text);
	        let match_rate = if result.input_char_count == 0 {
	            0.0
	        } else {
	            result.matched_char_count as f32 / result.input_char_count as f32
	        };

	        // Select worker without String allocation
	        let selected_idx = if match_rate > self.config.cache_threshold {
	            // Cache hit path: find worker by URL (compare &str directly, no allocation)
	            let tenant_url: &str = &result.tenant;
	            workers
	                .iter()
	                .position(|w| w.url() == tenant_url)
	                .filter(|&idx| workers[idx].is_healthy())
	        } else {
	            // Low cache match: use worker with minimum load
	            healthy_indices
	                .iter()
	                .min_by_key(|&&idx| workers[idx].load())
	                .copied()
	        };

	        if let Some(idx) = selected_idx {
	            // Update the tree with this request (use worker URL directly, no allocation)
	            tree.insert(text, workers[idx].url());

	            // Sync insert operation to mesh if enabled (no-op if mesh is not enabled)
