Context
The session server proxy uses an httpx timeout (`miles_router_timeout`, default 600s) that starts counting as soon as the request is submitted to the engine. As a result, engine-side queue wait and GPU stalls caused by other requests' prefill operations all count against the timeout budget, even though the engine isn't actively working on the request during those periods.
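For orientation, here is a rough sketch of how that timeout presumably ends up on the forwarded request. Names like `do_proxy_sketch` and `MILES_ROUTER_TIMEOUT` are illustrative; the real `do_proxy` in session_server.py may pass the value differently:

```python
import httpx

# Illustrative only: the 600s budget is handed to httpx when the request is
# forwarded, so the clock starts at submission and keeps running through
# engine-side queueing and stalls caused by other requests' prefills.
MILES_ROUTER_TIMEOUT = 600.0  # seconds (the documented default)

async def do_proxy_sketch(
    client: httpx.AsyncClient, engine_url: str, payload: dict
) -> httpx.Response:
    # For a response that only arrives once generation completes, the read
    # timeout effectively covers the whole generation, whether or not the
    # engine was actively decoding this particular request.
    return await client.post(
        engine_url,
        json=payload,
        timeout=httpx.Timeout(MILES_ROUTER_TIMEOUT),
    )
```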
Problem
In TITO-enabled agentic rollouts, the following sequence causes spurious timeouts:
```
t=0s    Request A submitted to engine via session server proxy
t=0s    httpx timeout clock starts (600s budget)
        ...engine is decoding A at 70 tok/s...
t=480s  New TITO session starts on the same engine
        Engine prefills 100K accumulated tokens (30-40s)
        Request A's decode throughput drops from 70 to 3-8 tok/s
t=520s  Prefill done, A resumes at 70 tok/s
t=550s  Another TITO prefill starts (20s)
        A drops to 3 tok/s again
t=600s  TIMEOUT - but A was making progress the whole time
        ~90s of the 600s budget was consumed by OTHER requests'
        prefill operations, not by A itself
```
This was reproduced with a minimal 2-task, 2-sample job on a single node: only 4 concurrent requests, no thundering herd, yet timeouts still occurred. The cause is GPU contention, with TITO prefill storms starving active decode requests.
Proposed Fix (credit: @moonfolk)
Use a client-side dispatch mechanism (e.g., a semaphore) so that the timeout only begins once the engine actually starts processing the request; minimal sketches of both variants are included below:
- Client-side concurrency control: limit the number of in-flight requests to the session server (matching engine capacity). New requests wait in the client's queue, where no timeout applies.
- Timeout starts at dispatch: the httpx timeout clock begins when the request is actually sent to the engine, not when it enters the client queue.
- Alternative, a server-side "started" signal: the engine could signal when it begins processing a request (e.g., via a header or callback), and the proxy could reset or start the timeout from that point.
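A minimal sketch of the first two points, assuming an asyncio-based client; `ENGINE_CAPACITY`, `DISPATCH_TIMEOUT`, and `generate_with_dispatch_timeout` are made-up names, not existing code:

```python
import asyncio
import httpx

ENGINE_CAPACITY = 8          # assumed per-engine concurrency limit
DISPATCH_TIMEOUT = 600.0     # budget measured from dispatch, not from enqueue

_dispatch_slots = asyncio.Semaphore(ENGINE_CAPACITY)

async def generate_with_dispatch_timeout(
    client: httpx.AsyncClient, url: str, payload: dict
) -> httpx.Response:
    # Waiting here costs no timeout budget: the request is still in the
    # client's own queue, not on the engine.
    async with _dispatch_slots:
        # The timeout clock only starts now, when the request is actually sent.
        return await client.post(
            url,
            json=payload,
            timeout=httpx.Timeout(DISPATCH_TIMEOUT),
        )
```

With this shape, a request that waits a long time for a slot still gets its full budget once dispatched, and the semaphore doubles as backpressure so the engine never sees more concurrent requests than it can actively serve.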
The goal: the timeout should measure "how long has the engine been working on this specific request" rather than "how long since the client first asked." Queue wait and GPU stalls from other requests should not consume timeout budget.
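The server-side variant could look roughly like the following, assuming the engine streams its response and that the first streamed chunk (or a hypothetical "started" header/event) can act as the signal; `QUEUE_ALLOWANCE`, `PROCESSING_BUDGET`, and `proxy_with_started_signal` are invented for illustration and do not exist in the codebase:

```python
import asyncio
import httpx

QUEUE_ALLOWANCE = 3600.0   # how long a request may sit queued before it starts
PROCESSING_BUDGET = 600.0  # budget measured from the "started" signal onward

async def proxy_with_started_signal(
    client: httpx.AsyncClient, url: str, payload: dict
) -> bytes:
    chunks: list[bytes] = []
    # Disable httpx's own timeouts and enforce the two-phase budget manually.
    async with client.stream(
        "POST", url, json=payload, timeout=httpx.Timeout(None)
    ) as response:
        stream = response.aiter_bytes()
        # Treat the first chunk as the engine's "started" signal: queue wait
        # only counts against the (generous) queue allowance.
        first = await asyncio.wait_for(stream.__anext__(), QUEUE_ALLOWANCE)
        chunks.append(first)

        async def drain() -> None:
            async for chunk in stream:
                chunks.append(chunk)

        # From here on, the processing budget applies to this request only.
        await asyncio.wait_for(drain(), PROCESSING_BUDGET)
    return b"".join(chunks)
```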
Evidence
From the single-node job logs:
- DP5 decoding at 68 tok/s, then DP7 starts TITO prefill
- DP5 drops to 3-4 tok/s for 30+ seconds during prefill
- DP5 has been decoding for ~460s already
- 460s of active decode + 90s of prefill stalls = 550s, putting the request on track to hit the 600s timeout
From a larger 33-node job:
- 4,418 ReadTimeouts in 7 hours
- All tracebacks identical: httpx.ReadTimeout in session_server.py do_proxy
- Timeouts come in bursts correlated with TITO session transitions
Related Issues
/cc @mingshanhee