Context
Analysis of a Harbor RL training job shows that ~33% of httpx.ReadTimeout bursts correlate with weight update events. With --update-weights-interval 2, weights are pushed to engines every 2 training steps.
Problem
During update_weights, SGLang engines freeze inference for 20-30s (measured: 22.1s sync) to load the new weight checkpoint. Any in-flight inference requests continue waiting, and any new requests dispatched during the freeze queue up. If an in-flight request was already 580+ seconds into a long generation, the 20-30s freeze pushes it past the proxy timeout (600s).
Example from the log:
- Weight update completed at 10:37:04 after 22.1s sync
- 119 ReadTimeout errors immediately followed
- All errors from the RolloutManager session server proxy
The engines themselves are healthy before and after the freeze (130-170 tok/s, <7% token usage). The timeouts are purely from the freeze duration stacking on top of existing request latency.
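To make the failure mode concrete, here is a minimal, hypothetical sketch of a proxy forwarding call with a 600s read budget. It is not the actual session_server.py code; `do_proxy`, `engine_url`, and the `/generate` endpoint are illustrative assumptions. The point it shows: when the engine sends nothing back until generation finishes, a 22s freeze on top of ~580s of generation pushes the single read wait past the budget and httpx raises ReadTimeout.

```python
# Illustrative sketch only -- not the real session_server.py:65 implementation.
import httpx

PROXY_READ_TIMEOUT = 600.0  # assumed proxy-side limit, matching the 600s mentioned above

async def do_proxy(engine_url: str, payload: dict) -> dict:
    # Hypothetical forwarding helper; names and endpoint are assumptions.
    timeout = httpx.Timeout(connect=10.0, read=PROXY_READ_TIMEOUT, write=10.0, pool=10.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        # For a non-streaming generate call, no bytes arrive until the response is
        # complete, so the read timeout covers the whole generation. A 20-30s engine
        # freeze for weight loading simply adds to that wait; a request already
        # ~580s in then crosses 600s and ReadTimeout fires, even though the engine
        # itself is healthy before and after the freeze.
        resp = await client.post(f"{engine_url}/generate", json=payload)
        resp.raise_for_status()
        return resp.json()
```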
Proposed Fix
Add a dispatch gate around weight updates:
- Before weight update: stop dispatching new inference requests to engines about to receive weights. Let in-flight requests drain (or at least stop adding new ones).
- During weight update: hold new requests in the router queue (not forwarded to engines).
- After weight update: resume dispatch once engines confirm they are ready to serve again.
This prevents wasting GPU-hours on requests that will inevitably time out during the freeze, and avoids the post-update error burst.
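A minimal asyncio sketch of such a gate is below, assuming a router-side dispatch path that can be wrapped. The class, method names, and callbacks (`WeightUpdateGate`, `push_weights`, `wait_ready`) are hypothetical, not existing Harbor or SGLang APIs.

```python
# Illustrative sketch of the proposed dispatch gate; names are hypothetical.
import asyncio

class WeightUpdateGate:
    def __init__(self) -> None:
        self._open = asyncio.Event()
        self._open.set()                 # gate starts open: dispatch allowed
        self._in_flight = 0
        self._drained = asyncio.Event()
        self._drained.set()

    async def dispatch(self, send_to_engine, request):
        # New requests wait here while a weight update is in progress,
        # so they queue in the router instead of piling onto a frozen engine.
        await self._open.wait()
        self._in_flight += 1
        self._drained.clear()
        try:
            return await send_to_engine(request)
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._drained.set()

    async def update_weights(self, push_weights, wait_ready, drain_timeout=60.0):
        # 1. Stop dispatching new requests to the engines about to receive weights.
        self._open.clear()
        # 2. Let in-flight requests drain, but bound the wait so a very long
        #    generation cannot stall training indefinitely.
        try:
            await asyncio.wait_for(self._drained.wait(), timeout=drain_timeout)
        except asyncio.TimeoutError:
            pass  # proceed anyway; at minimum no new requests were added
        # 3. Perform the ~22s weight sync while the gate is closed.
        await push_weights()
        # 4. Resume dispatch only after the engines confirm they can serve again.
        await wait_ready()
        self._open.set()
```

The drain timeout is a design choice: waiting for every in-flight request to finish delays training, while proceeding immediately keeps the current behavior for long generations, so a bounded drain is a middle ground.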
Evidence
- Weight update sync: ~22s measured
- 5 weight updates observed in a 7h run
- ~33% of timeout bursts occurred within 10 minutes of a weight update
- Correlation is strong for early updates, weaker for later ones (suggesting other factors also contribute)
- All tracebacks are identical:
httpx.ReadTimeout in session_server.py:65 (do_proxy)
/cc @mingshanhee