Emit X-SMG-Routing-Key from the session server for sticky agentic routing (#30)

DavidBellamy · flukeskywalker · web-flow · commit ee4dc5d65382 · 2026-06-16T23:15:51.000-07:00
The session server proxies each agent turn to the externally launched SMG
gateway. Tag every proxied chat-completion request with the header
X-SMG-Routing-Key set to session_id, so that a routing-key gateway policy
(manual / consistent_hashing) pins the session to one worker and reuses that
worker's KV cache across turns.
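
To illustrate why a stable routing key pins a session to one worker, here is a
minimal sketch of a consistent-hashing picker. It is not the SMG gateway's
implementation: the HashRing class, the vnodes parameter, and the worker names
are all hypothetical and exist only for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent-hash ring; not the SMG gateway's code."""

    def __init__(self, workers, vnodes=64):
        if not workers:
            raise ValueError("at least one worker is required")
        # Place several virtual nodes per worker on the ring to even out load.
        self._ring = sorted(
            (self._hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # Take the first 8 bytes of SHA-256 as a 64-bit ring position.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def pick(self, routing_key):
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(routing_key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["worker-0", "worker-1", "worker-2"])
# Every turn carrying the same session_id lands on the same worker.
first_turn = ring.pick("session-abc")
second_turn = ring.pick("session-abc")
```

Because the worker is a pure function of the key and the worker set, repeated
turns of one session keep hitting the worker that already holds its KV cache.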

The header is emitted unconditionally: the gateway is launched by the cluster
job (RL360), not by miles, so miles cannot know which policy is active. Policies
that do not route on the key (e.g. cache_aware) ignore the header; only manual /
consistent_hashing read it. Selecting manual + min_load is therefore a
gateway-launch (RL360) change, not a miles change.
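
The unconditional tagging amounts to merging one header into the forwarded set.
A minimal sketch, assuming headers arrive as a plain string mapping; the helper
name with_routing_key is hypothetical and not part of miles. The diff below uses
a plain dict merge; dropping a differently-cased incoming copy of the key first
is an extra assumption of this sketch.

```python
from typing import Mapping

ROUTING_KEY_HEADER = "X-SMG-Routing-Key"


def with_routing_key(incoming: Mapping[str, str], session_id: str) -> dict[str, str]:
    """Return a copy of the incoming headers tagged with the session's routing key."""
    # HTTP header names are case-insensitive, so remove any client-supplied
    # variant of the key before setting the value for this session.
    merged = {
        name: value
        for name, value in incoming.items()
        if name.lower() != ROUTING_KEY_HEADER.lower()
    }
    merged[ROUTING_KEY_HEADER] = session_id
    return merged


headers = with_routing_key(
    {"content-type": "application/json", "x-smg-routing-key": "stale"},
    "session-abc",
)
```

The same headers object can then be reused for both the initial proxy call and
any retry, so every attempt for a turn carries the same key.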

Co-authored-by: Rupesh K Srivastava <rupspace@gmail.com>
diff --git a/miles/rollout/session/sessions.py b/miles/rollout/session/sessions.py
@@ -253,9 +253,17 @@ async def chat_completions(request: Request, session_id: str):
                 expected_num_assistant = session.num_assistant
             # --- lock released here ---
 
+            # Tag every turn of this session with a routing key so a routing-key
+            # gateway policy (manual / consistent_hashing) pins the session to one
+            # worker, reusing the worker that holds its KV cache. Emitted
+            # unconditionally: the gateway is launched externally (miles does not
+            # know its policy), and policies that don't route on the key
+            # (e.g. cache_aware) ignore the header.
+            proxy_headers = {**dict(request.headers), "X-SMG-Routing-Key": session_id}
+
             # --- Phase 2: proxy to SGLang (NO lock held) ---
             t_proxy_start = time.monotonic()
-            result = await backend.do_proxy(request, "v1/chat/completions", body=body)
+            result = await backend.do_proxy(request, "v1/chat/completions", body=body, headers=proxy_headers)
             t_proxy_end = time.monotonic()
 
             # If SGLang returned a non-200 error (e.g. 400 for context too long),
@@ -279,7 +287,9 @@ async def chat_completions(request: Request, session_id: str):
                     )
                     request_body.pop("input_ids", None)
                     retry_body = orjson.dumps(request_body)
-                    result = await backend.do_proxy(request, "v1/chat/completions", body=retry_body)
+                    result = await backend.do_proxy(
+                        request, "v1/chat/completions", body=retry_body, headers=proxy_headers
+                    )
                     t_proxy_end = time.monotonic()
                     if result["status_code"] != 200:
                         return backend.build_proxy_response(result)