Context
The session server proxy uses an httpx timeout (`miles_router_timeout`, default 600s) that starts counting as soon as the request is submitted to the engine. As a result, engine-side queue wait and GPU stalls caused by other requests' prefill operations all count against the timeout budget, even though the engine isn't actively working on the request during those periods.
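For orientation, here is a rough sketch of how that timeout presumably ends up on the forwarded request. Names like `do_proxy_sketch` and `MILES_ROUTER_TIMEOUT` are illustrative; the real `do_proxy` in session_server.py may pass the value differently:

```python
import httpx

# Illustrative only: the 600s budget is handed to httpx when the request is
# forwarded, so the clock starts at submission and keeps running through
# engine-side queueing and stalls caused by other requests' prefills.
MILES_ROUTER_TIMEOUT = 600.0  # seconds (the documented default)

async def do_proxy_sketch(
    client: httpx.AsyncClient, engine_url: str, payload: dict
) -> httpx.Response:
    # For a response that only arrives once generation completes, the read
    # timeout effectively covers the whole generation, whether or not the
    # engine was actively decoding this particular request.
    return await client.post(
        engine_url,
        json=payload,
        timeout=httpx.Timeout(MILES_ROUTER_TIMEOUT),
    )
```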
Problem
In TITO-enabled agentic rollouts, the following sequence causes spurious timeouts:
```
t=0s    Request A submitted to engine via session server proxy
t=0s    httpx timeout clock starts (600s budget)
        ...engine is decoding A at 70 tok/s...
t=480s  New TITO session starts on the same engine
        Engine prefills 100K accumulated tokens (30-40s)
        Request A's decode throughput drops from 70 to 3-8 tok/s
t=520s  Prefill done, A resumes at 70 tok/s
t=550s  Another TITO prefill starts (20s)
        A drops to 3 tok/s again
t=600s  TIMEOUT - but A was making progress the whole time
        ~90s of the 600s budget was consumed by OTHER requests'
        prefill operations, not by A itself
```
This was reproduced with a minimal 2-task, 2-sample job on a single node: only 4 concurrent requests, no thundering herd, yet timeouts still occurred. The cause is GPU contention, with TITO prefill storms starving active decode requests.
Proposed Fix (credit: @moonfolk)
Use a client-side dispatch mechanism (e.g., a semaphore) so that the timeout only begins once the engine actually starts processing the request; minimal sketches of both variants are included below:
- Client-side concurrency control: limit the number of in-flight requests to the session server (matching engine capacity). New requests wait in the client's queue, where no timeout applies.
- Timeout starts at dispatch: the httpx timeout clock begins when the request is actually sent to the engine, not when it enters the client queue.
- Alternative, a server-side "started" signal: the engine could signal when it begins processing a request (e.g., via a header or callback), and the proxy could reset or start the timeout from that point.
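A minimal sketch of the first two points, assuming an asyncio-based client; `ENGINE_CAPACITY`, `DISPATCH_TIMEOUT`, and `generate_with_dispatch_timeout` are made-up names, not existing code:

```python
import asyncio
import httpx

ENGINE_CAPACITY = 8          # assumed per-engine concurrency limit
DISPATCH_TIMEOUT = 600.0     # budget measured from dispatch, not from enqueue

_dispatch_slots = asyncio.Semaphore(ENGINE_CAPACITY)

async def generate_with_dispatch_timeout(
    client: httpx.AsyncClient, url: str, payload: dict
) -> httpx.Response:
    # Waiting here costs no timeout budget: the request is still in the
    # client's own queue, not on the engine.
    async with _dispatch_slots:
        # The timeout clock only starts now, when the request is actually sent.
        return await client.post(
            url,
            json=payload,
            timeout=httpx.Timeout(DISPATCH_TIMEOUT),
        )
```

With this shape, a request that waits a long time for a slot still gets its full budget once dispatched, and the semaphore doubles as backpressure so the engine never sees more concurrent requests than it can actively serve.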
The goal: the timeout should measure "how long has the engine been working on this specific request" rather than "how long since the client first asked." Queue wait and GPU stalls from other requests should not consume timeout budget.
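The server-side variant could look roughly like the following, assuming the engine streams its response and that the first streamed chunk (or a hypothetical "started" header/event) can act as the signal; `QUEUE_ALLOWANCE`, `PROCESSING_BUDGET`, and `proxy_with_started_signal` are invented for illustration and do not exist in the codebase:

```python
import asyncio
import httpx

QUEUE_ALLOWANCE = 3600.0   # how long a request may sit queued before it starts
PROCESSING_BUDGET = 600.0  # budget measured from the "started" signal onward

async def proxy_with_started_signal(
    client: httpx.AsyncClient, url: str, payload: dict
) -> bytes:
    chunks: list[bytes] = []
    # Disable httpx's own timeouts and enforce the two-phase budget manually.
    async with client.stream(
        "POST", url, json=payload, timeout=httpx.Timeout(None)
    ) as response:
        stream = response.aiter_bytes()
        # Treat the first chunk as the engine's "started" signal: queue wait
        # only counts against the (generous) queue allowance.
        first = await asyncio.wait_for(stream.__anext__(), QUEUE_ALLOWANCE)
        chunks.append(first)

        async def drain() -> None:
            async for chunk in stream:
                chunks.append(chunk)

        # From here on, the processing budget applies to this request only.
        await asyncio.wait_for(drain(), PROCESSING_BUDGET)
    return b"".join(chunks)
```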
Evidence
From the single-node job logs:
- DP5 decoding at 68 tok/s, then DP7 starts TITO prefill
- DP5 drops to 3-4 tok/s for 30+ seconds during prefill
- DP5 has been decoding for ~460s already
- 460s of active decode + 90s of prefill stalls = 550s, putting the request on track to hit the 600s timeout
From a larger 33-node job:
- 4,418 ReadTimeouts in 7 hours
- All tracebacks identical: httpx.ReadTimeout in session_server.py do_proxy
- Timeouts come in bursts correlated with TITO session transitions
Related Issues
/cc @mingshanhee