Commit ee4dc5d

Emit X-SMG-Routing-Key from the session server for sticky agentic routing (#30)
The session server proxies each agent turn to the externally-launched SMG gateway. Tag every proxied chat-completion with X-SMG-Routing-Key=session_id so a routing-key gateway policy (manual / consistent_hashing) pins the session to one worker, reusing its KV cache across turns.

Emitted unconditionally: the gateway is launched by the cluster job (RL360), not by miles, so miles cannot know its policy. The header is ignored by policies that do not route on it (e.g. cache_aware); only manual / consistent_hashing read it.

Selecting manual + min_load is a gateway-launch (RL360) change, not a miles change.

Co-authored-by: Rupesh K Srivastava <rupspace@gmail.com>
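To illustrate why a stable routing key pins a session to one worker, here is a minimal sketch of a consistent-hash ring. It is not the SMG gateway's implementation: the `HashRing` class, the `vnodes` parameter, and the worker names are hypothetical and exist only for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring mapping routing keys to workers (illustrative only)."""

    def __init__(self, workers, vnodes=64):
        # Each worker is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, routing_key: str) -> str:
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(routing_key)) % len(self._ring)
        return self._ring[idx][1]


workers = ["worker-0", "worker-1", "worker-2"]  # hypothetical worker names
ring = HashRing(workers)

# Every turn of a session carries the same key, so it lands on the same worker.
first_turn = ring.pick("session-abc")
later_turn = ring.pick("session-abc")
```

Because the key is the session ID, every turn of one session hashes to the same point on the ring, so a policy of this kind would keep sending it to the worker that already holds its KV cache.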
1 parent a73cff0 commit ee4dc5d

1 file changed

Lines changed: 12 additions & 2 deletions

miles/rollout/session/sessions.py

@@ -253,9 +253,17 @@ async def chat_completions(request: Request, session_id: str):
 expected_num_assistant = session.num_assistant
 # --- lock released here ---
 
+# Tag every turn of this session with a routing key so a routing-key
+# gateway policy (manual / consistent_hashing) pins the session to one
+# worker, reusing the worker that holds its KV cache. Emitted
+# unconditionally: the gateway is launched externally (miles does not
+# know its policy), and policies that don't route on the key
+# (e.g. cache_aware) ignore the header.
+proxy_headers = {**dict(request.headers), "X-SMG-Routing-Key": session_id}
+
 # --- Phase 2: proxy to SGLang (NO lock held) ---
 t_proxy_start = time.monotonic()
-result = await backend.do_proxy(request, "v1/chat/completions", body=body)
+result = await backend.do_proxy(request, "v1/chat/completions", body=body, headers=proxy_headers)
 t_proxy_end = time.monotonic()
 
 # If SGLang returned a non-200 error (e.g. 400 for context too long),
@@ -279,7 +287,9 @@ async def chat_completions(request: Request, session_id: str):
 )
 request_body.pop("input_ids", None)
 retry_body = orjson.dumps(request_body)
-result = await backend.do_proxy(request, "v1/chat/completions", body=retry_body)
+result = await backend.do_proxy(
+    request, "v1/chat/completions", body=retry_body, headers=proxy_headers
+)
 t_proxy_end = time.monotonic()
 if result["status_code"] != 200:
     return backend.build_proxy_response(result)
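
The header merge in the first hunk can be sketched in isolation. HTTP header names are case-insensitive, and if the framework lowercases incoming names (as Starlette does), a client-supplied `x-smg-routing-key` would sit alongside the new `X-SMG-Routing-Key` entry in the merged dict. The helper below, `with_routing_key`, is a hypothetical name that is not part of the commit. It drops any existing variant of the header before setting the session's key. How `backend.do_proxy` treats duplicate header names is not shown in the diff, so this is a defensive sketch rather than a required fix.

```python
def with_routing_key(incoming_headers, session_id, header_name="X-SMG-Routing-Key"):
    """Return a copy of the headers with exactly one routing-key entry.

    Hypothetical helper for illustration; not part of miles or SMG.
    """
    target = header_name.lower()
    # Drop any client-supplied variant of the header, whatever its casing.
    merged = {
        name: value
        for name, value in incoming_headers.items()
        if name.lower() != target
    }
    # Always route on the session ID chosen by the session server.
    merged[header_name] = session_id
    return merged


client_headers = {"content-type": "application/json", "x-smg-routing-key": "stale"}
proxy_headers = with_routing_key(client_headers, "session-abc")
```

With this helper the proxied request carries a single routing key regardless of what the client sent, so the session ID always decides the target worker.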
