Context
Analysis of a Harbor RL training job shows that ~33% of httpx.ReadTimeout bursts correlate with weight update events. With --update-weights-interval 2, weights are pushed to engines every 2 training steps.
Problem
During update_weights, SGLang engines freeze inference for 20-30s (measured: 22.1s sync) to load the new weight checkpoint. Any in-flight inference requests continue waiting, and any new requests dispatched during the freeze queue up. If an in-flight request was already 580+ seconds into a long generation, the 20-30s freeze pushes it past the proxy timeout (600s).
Example from the log:
- Weight update completed at 10:37:04 after 22.1s sync
- 119 ReadTimeout errors immediately followed
- All errors from the RolloutManager session server proxy
The engines themselves are healthy before and after the freeze (130-170 tok/s, <7% token usage). The timeouts are purely from the freeze duration stacking on top of existing request latency.
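To make the failure mode concrete, here is a minimal, hypothetical sketch of a proxy forwarding call with a 600s read budget. It is not the actual session_server.py code; `do_proxy`, `engine_url`, and the `/generate` endpoint are illustrative assumptions. The point it shows: when the engine sends nothing back until generation finishes, a 22s freeze on top of ~580s of generation pushes the single read wait past the budget and httpx raises ReadTimeout.

```python
# Illustrative sketch only -- not the real session_server.py:65 implementation.
import httpx

PROXY_READ_TIMEOUT = 600.0  # assumed proxy-side limit, matching the 600s mentioned above

async def do_proxy(engine_url: str, payload: dict) -> dict:
    # Hypothetical forwarding helper; names and endpoint are assumptions.
    timeout = httpx.Timeout(connect=10.0, read=PROXY_READ_TIMEOUT, write=10.0, pool=10.0)
    async with httpx.AsyncClient(timeout=timeout) as client:
        # For a non-streaming generate call, no bytes arrive until the response is
        # complete, so the read timeout covers the whole generation. A 20-30s engine
        # freeze for weight loading simply adds to that wait; a request already
        # ~580s in then crosses 600s and ReadTimeout fires, even though the engine
        # itself is healthy before and after the freeze.
        resp = await client.post(f"{engine_url}/generate", json=payload)
        resp.raise_for_status()
        return resp.json()
```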
Proposed Fix
Add a dispatch gate around weight updates:
- Before weight update: stop dispatching new inference requests to engines about to receive weights. Let in-flight requests drain (or at least stop adding new ones).
- During weight update: hold new requests in the router queue (not forwarded to engines).
- After weight update: resume dispatch once engines confirm they are ready to serve again.
This prevents wasting GPU-hours on requests that will inevitably time out during the freeze, and avoids the post-update error burst.
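A minimal asyncio sketch of such a gate is below, assuming a router-side dispatch path that can be wrapped. The class, method names, and callbacks (`WeightUpdateGate`, `push_weights`, `wait_ready`) are hypothetical, not existing Harbor or SGLang APIs.

```python
# Illustrative sketch of the proposed dispatch gate; names are hypothetical.
import asyncio

class WeightUpdateGate:
    def __init__(self) -> None:
        self._open = asyncio.Event()
        self._open.set()                 # gate starts open: dispatch allowed
        self._in_flight = 0
        self._drained = asyncio.Event()
        self._drained.set()

    async def dispatch(self, send_to_engine, request):
        # New requests wait here while a weight update is in progress,
        # so they queue in the router instead of piling onto a frozen engine.
        await self._open.wait()
        self._in_flight += 1
        self._drained.clear()
        try:
            return await send_to_engine(request)
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._drained.set()

    async def update_weights(self, push_weights, wait_ready, drain_timeout=60.0):
        # 1. Stop dispatching new requests to the engines about to receive weights.
        self._open.clear()
        # 2. Let in-flight requests drain, but bound the wait so a very long
        #    generation cannot stall training indefinitely.
        try:
            await asyncio.wait_for(self._drained.wait(), timeout=drain_timeout)
        except asyncio.TimeoutError:
            pass  # proceed anyway; at minimum no new requests were added
        # 3. Perform the ~22s weight sync while the gate is closed.
        await push_weights()
        # 4. Resume dispatch only after the engines confirm they can serve again.
        await wait_ready()
        self._open.set()
```

The drain timeout is a design choice: waiting for every in-flight request to finish delays training, while proceeding immediately keeps the current behavior for long generations, so a bounded drain is a middle ground.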
Evidence
- Weight update sync: ~22s measured
- 5 weight updates observed in a 7h run
- ~33% of timeout bursts occurred within 10 minutes of a weight update
- Correlation is strong for early updates, weaker for later ones (suggesting other factors also contribute)
- All tracebacks are identical:
httpx.ReadTimeout in session_server.py:65 (do_proxy)
/cc @mingshanhee