Summary
During sustained timeout storms, SGLang engines evict sessions from memory. Subsequent requests to those sessions return 404 "session not found" errors, killing the affected trials.
Mechanism
1. Failed requests leave orphaned session state on the engine.
2. Engine memory pressure increases as new sessions are created while old ones still hold KV cache.
3. The engine evicts older sessions to reclaim memory (likely via radix cache eviction with `radix_eviction_policy='lru'`).
4. The client retries or sends the next turn to a session that no longer exists.
5. The engine returns 404.
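The sequence above can be reproduced in miniature. The sketch below is a toy model, not SGLang's implementation: `LruSessionStore`, its `capacity` parameter, and the method names are hypothetical, and the real engine evicts based on KV cache memory rather than a session count. It only illustrates how opening new sessions under pressure silently invalidates an older session that a client still intends to use.

```python
from collections import OrderedDict


class LruSessionStore:
    """Toy stand-in for an engine's session table (hypothetical, not SGLang code)."""

    def __init__(self, capacity):
        self.capacity = capacity          # assumed proxy for available KV cache memory
        self._sessions = OrderedDict()    # session_id -> list of turns, oldest first

    def open(self, session_id):
        # Under pressure, drop the least recently used session to make room.
        while len(self._sessions) >= self.capacity:
            self._sessions.popitem(last=False)
        self._sessions[session_id] = []

    def request(self, session_id, turn):
        # An evicted session is indistinguishable from one that never existed.
        if session_id not in self._sessions:
            return 404
        self._sessions.move_to_end(session_id)  # mark as most recently used
        self._sessions[session_id].append(turn)
        return 200


store = LruSessionStore(capacity=2)
store.open("a")
store.open("b")
store.open("c")  # memory pressure: "a" is evicted here
statuses = [store.request(sid, "next turn") for sid in ("a", "b", "c")]
print(statuses)  # [404, 200, 200]
```

The client holding session "a" did nothing wrong, yet its next turn fails with the same status it would get for an invalid ID.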
Impact
- Each 404 kills the trial with `AgentError`, `reward=0.0`.
- The 99 occurrences in this run were concentrated in a ~9 minute window, indicating a snowball effect in which each eviction triggers further evictions.
- Combined with rollback failures (127x, see filed issue), these two secondary failure modes accounted for 226 wasted rollouts on top of the 625 direct timeouts.
Suggested fix
1. Return a retryable status code (e.g., 410 Gone or a custom code) with metadata about why the session was evicted, so the client can distinguish "session expired due to eviction" from "session ID was never valid".
2. Implement session pinning during active rollouts so that sessions with in-flight requests are not eligible for eviction.
3. Add session eviction logging so that evictions are visible in the log with the session ID, eviction reason, and memory stats.
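A minimal sketch of how the three suggestions could fit together, again as a toy model rather than a proposed patch. `PinnedSessionStore`, `begin`, `end`, the `"lru_memory_pressure"` reason string, and the use of a live-session count as the "memory stat" are all assumptions made for illustration; none of them are existing SGLang APIs.

```python
import logging
from collections import OrderedDict

log = logging.getLogger("session_store")


class PinnedSessionStore:
    """Toy session table with pinning, eviction logging, and a 410 status (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._sessions = OrderedDict()  # session_id -> number of in-flight requests
        self._evicted = {}              # session_id -> eviction reason

    def open(self, session_id):
        while len(self._sessions) >= self.capacity:
            if not self._evict_one():
                raise MemoryError("all sessions are pinned; cannot admit a new one")
        self._sessions[session_id] = 0

    def _evict_one(self):
        # Pick the least recently used session that has no in-flight requests.
        victim = next((s for s, n in self._sessions.items() if n == 0), None)
        if victim is None:
            return False
        del self._sessions[victim]
        self._evicted[victim] = "lru_memory_pressure"
        log.warning("evicted session=%s reason=%s live_sessions=%d",
                    victim, self._evicted[victim], len(self._sessions))
        return True

    def begin(self, session_id):
        """Start a request: pin the session, or explain why it is unavailable."""
        if session_id in self._sessions:
            self._sessions[session_id] += 1
            self._sessions.move_to_end(session_id)
            return 200
        if session_id in self._evicted:
            return 410  # existed but was evicted: the client may recreate and retry
        return 404      # never a valid session ID

    def end(self, session_id):
        """Finish a request: unpin the session."""
        self._sessions[session_id] -= 1


store = PinnedSessionStore(capacity=2)
store.open("a")
store.open("b")
store.begin("a")   # "a" now has an in-flight request and is pinned
store.open("c")    # "b" is evicted instead of the pinned "a"
results = (store.begin("a"), store.begin("b"), store.begin("never-existed"))
print(results)     # (200, 410, 404)
```

With the statuses separated this way, a client could treat 410 as "recreate the session and replay the trial" and reserve the fatal `AgentError` path for a genuine 404.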
Error
Reproduction
In a 1h25m test run with 33 nodes and verbose logging:
`9cdcaab1...`, `c1194948...`, `24cea266...`, `62689204...`
Related: #920 (root cause), #936 (timeout measurement)