
Gate inference dispatch during weight updates to prevent wasted timeouts #921

@DavidBellamy


Context

Analysis of a Harbor RL training job shows that ~33% of httpx.ReadTimeout bursts occur within 10 minutes of a weight update. With --update-weights-interval 2, weights are pushed to the engines every 2 training steps.

Problem

During update_weights, SGLang engines freeze inference for 20-30s (measured: 22.1s sync) to load the new weight checkpoint. Any in-flight inference requests continue waiting, and any new requests dispatched during the freeze queue up. If an in-flight request was already 580+ seconds into a long generation, the 20-30s freeze pushes it past the proxy timeout (600s).
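To make the arithmetic concrete, here is a minimal sketch using the figures quoted in this issue (600s proxy timeout, 22.1s measured sync). The function names are hypothetical, and the model assumes the freeze simply adds to the request's elapsed time.

```python
PROXY_TIMEOUT_S = 600.0  # proxy timeout quoted in this issue


def latest_safe_elapsed(freeze_s: float, timeout_s: float = PROXY_TIMEOUT_S) -> float:
    """Largest elapsed time (seconds) a request can have when the freeze
    starts and still fit the whole freeze inside the proxy timeout."""
    return timeout_s - freeze_s


def survives_freeze(
    elapsed_s: float,
    freeze_s: float,
    remaining_s: float = 0.0,
    timeout_s: float = PROXY_TIMEOUT_S,
) -> bool:
    """True if elapsed time + freeze + remaining generation time stays
    within the proxy timeout (simple additive model, an assumption)."""
    return elapsed_s + freeze_s + remaining_s <= timeout_s
```

With the measured 22.1s sync, any request more than about 577.9s into its generation when the freeze starts cannot finish inside the 600s timeout, which matches the 580+ second case described above.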

Example from the log:

  • Weight update completed at 10:37:04 after 22.1s sync
  • 119 ReadTimeout errors immediately followed
  • All errors from the RolloutManager session server proxy

The engines themselves are healthy before and after the freeze (130-170 tok/s, <7% token usage). The timeouts are purely from the freeze duration stacking on top of existing request latency.

Proposed Fix

Add a dispatch gate around weight updates:

  1. Before weight update: stop dispatching new inference requests to engines about to receive weights. Let in-flight requests drain (or at least stop adding new ones).
  2. During weight update: hold new requests in the router queue (not forwarded to engines).
  3. After weight update: resume dispatch once engines confirm they are ready to serve again.

This prevents wasting GPU-hours on requests that will inevitably time out during the freeze, and avoids the post-update error burst.
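The three steps could be sketched as a small asyncio gate. This is an illustration only, not the existing router code: `DispatchGate`, `dispatch`, `update_weights`, `send`, `do_update`, and `drain_timeout_s` are all hypothetical names, and it assumes `do_update` returns only once the engines have confirmed they are ready to serve again.

```python
import asyncio


class DispatchGate:
    """Hypothetical gate that holds new dispatches during a weight update."""

    def __init__(self) -> None:
        self._open = asyncio.Event()     # set => dispatch allowed
        self._open.set()
        self._drained = asyncio.Event()  # set => no requests in flight
        self._drained.set()
        self._in_flight = 0

    async def dispatch(self, send):
        """Forward one request via `send()`, waiting while the gate is closed."""
        while not self._open.is_set():   # step 2: hold in the queue
            await self._open.wait()
        self._in_flight += 1
        self._drained.clear()
        try:
            return await send()
        finally:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._drained.set()

    async def update_weights(self, do_update, drain_timeout_s=None):
        """Close the gate, optionally drain, run the update, then reopen."""
        self._open.clear()               # step 1: stop new dispatches
        try:
            if drain_timeout_s is not None:
                try:
                    await asyncio.wait_for(self._drained.wait(), drain_timeout_s)
                except asyncio.TimeoutError:
                    pass  # proceed anyway; no new requests were added
            await do_update()            # assumed to return when engines are ready
        finally:
            self._open.set()             # step 3: resume dispatch
```

The gate is reopened in a `finally` block so that a failed update does not leave the router holding requests indefinitely. Whether to drain in-flight requests or only stop adding new ones is left as the `drain_timeout_s` choice, since the proposal allows either.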

Evidence

  • Weight update sync: ~22s measured
  • 5 weight updates observed in 7h run
  • 33% of timeout bursts within 10 minutes of a weight update
  • Correlation is strong for early updates, weaker for later ones (suggesting other factors also contribute)
  • All tracebacks are identical: httpx.ReadTimeout in session_server.py:65 (do_proxy)
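For reference, the "within 10 minutes of a weight update" figure could be computed along these lines. This is a sketch, not the script used for the analysis above: the function name is hypothetical, and it assumes the window is symmetric (a burst counts if it falls within 10 minutes before or after an update).

```python
from datetime import datetime, timedelta
from typing import Sequence


def fraction_near_updates(
    burst_times: Sequence[datetime],
    update_times: Sequence[datetime],
    window: timedelta = timedelta(minutes=10),
) -> float:
    """Fraction of timeout bursts whose timestamp lies within `window`
    of at least one weight update (symmetric window, an assumption)."""
    if not burst_times:
        return 0.0
    near = sum(
        1
        for burst in burst_times
        if any(abs(burst - update) <= window for update in update_times)
    )
    return near / len(burst_times)
```

A one-sided window (bursts only after an update) may be the better measure if the goal is to isolate the post-update error burst.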

/cc @mingshanhee
