[Bug] tokenizer_manager KeyError in _wait_one_response (rid_to_state) — regression from #29012, breaks nightly-perf-2-gpu-vlm #29256

@kangwangamd

Summary

nightly-perf-2-gpu-vlm (ROCm) started failing on 2026-06-24 with a server-side KeyError in tokenizer_manager.py that aborts the HTTP response mid-stream:

File ".../sglang/srt/managers/tokenizer_manager.py", line 1441, in _wait_one_response
    state = self.rid_to_state[obj.rid]
KeyError: '5e25b243e77743e6b3711ae7206ffdf3'

Client side:

urllib3.exceptions.ProtocolError: Response ended prematurely
requests.exceptions.ChunkedEncodingError: Response ended prematurely
Error running benchmark for Qwen/Qwen3-VL-30B-A3B-Instruct

By the time _wait_one_response reads rid_to_state, the request's state has already been removed. This points to a race between request-state cleanup (the abort/error branch that does del self.rid_to_state[rid], ~line 1421) and the wait path.

Bisect — clean before/after boundary at #29012

Checking the nightly-perf-2-gpu-vlm job outcome specifically (run-level conclusions are misleading because of skipped/partial runs):

| Date | perf-vlm job | Result |
|---|---|---|
| 2026-06-22 19:52 | 82793291659 | ✅ success |
| 2026-06-23 19:05 | 83026842412 | ✅ success |
| 2026-06-23 22:54 | | #29012 merged (34dd9c28) |
| 2026-06-24 20:48 | 83267124889 | ❌ KeyError |

The first completed perf-vlm run after #29012 merged is the first failure. #29012 ("[Refactor] Introduce sock_send/sock_recv wrappers for zmq IPC") rewrote the scheduler/detokenizer IPC in tokenizer_manager.py (+42/-16), including converting several send_pyobj / recv_pyobj calls to sock_send / async_sock_send / async_sock_recv (i.e. introducing new await / yield points into the dispatch & receive paths). That reshapes the interleaving between request registration, output handling, and the abort-cleanup del self.rid_to_state[rid], which is consistent with this newly-exposed race.

(I'm confident in the bisect; I'll leave the exact offending interleaving to you since you authored the refactor and know the intended ordering. The single-request dispatch path is synchronous, so the race most likely involves the abort / batched / output-receive paths that now await.)
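To make the suspected mechanism concrete, here is a minimal asyncio sketch of how a new await between registration and lookup can open a window for cleanup. It is a toy under stated assumptions, not the actual tokenizer_manager.py code: ToyManager, handle, abort, and run_once are hypothetical names, and asyncio.sleep(0) merely stands in for an awaited IPC send.

```python
import asyncio


class ToyManager:
    """Hypothetical stand-in for illustration; not sglang's TokenizerManager."""

    def __init__(self):
        self.rid_to_state = {}

    async def handle(self, rid, awaited_send):
        self.rid_to_state[rid] = {"finished": False}  # request registration
        if awaited_send:
            # Assumed stand-in for an awaited IPC send: control returns to the
            # event loop here, so other coroutines may run before the lookup.
            await asyncio.sleep(0)
        return self.rid_to_state[rid]  # lookup on the wait path

    async def abort(self, rid):
        # Cleanup branch: drop the state if it is currently registered.
        if rid in self.rid_to_state:
            del self.rid_to_state[rid]


async def run_once(awaited_send):
    mgr = ToyManager()
    waiter = asyncio.create_task(mgr.handle("r1", awaited_send))
    aborter = asyncio.create_task(mgr.abort("r1"))
    results = await asyncio.gather(waiter, aborter, return_exceptions=True)
    # Report what the wait path ended with: a state dict or an exception.
    return type(results[0]).__name__


if __name__ == "__main__":
    print("synchronous send:", asyncio.run(run_once(False)))
    print("awaited send:    ", asyncio.run(run_once(True)))
```

With awaited_send=False the lookup completes before the abort task gets a chance to run. With awaited_send=True the abort task runs in the gap and the lookup raises KeyError, which matches the shape of the traceback above. Whether the real interleaving goes through the abort, batched, or output-receive path is still open.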

Reproduction

  • Workflow: nightly-test-amd-rocm720, job nightly-perf-2-gpu-vlm-rocm720
  • Test: test/registered/amd/perf/mi30x/test_vlms_perf_amd.py::test_bench_one_batch, model Qwen/Qwen3-VL-30B-A3B-Instruct
  • The perf harness issues /flush_cache followed by bursts of concurrent /generate requests; the KeyError fires on one of the concurrent requests.
  • Failing job log: run 28119401197, job 83267124889.
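For anyone attempting a local repro outside the perf harness, the flush-then-burst pattern could be sketched roughly as below. This is an untested approximation under assumptions: the helper names are hypothetical, the /generate payload shape (text plus sampling_params) and the POST to /flush_cache are assumed rather than taken from the failing test, and the requests are text-only, whereas the failing job drives a VLM benchmark. It may therefore not trigger the race as written.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_generate_payload(prompt, max_new_tokens=32):
    # Assumed request shape; adjust to match the server's /generate schema.
    return {"text": prompt, "sampling_params": {"max_new_tokens": max_new_tokens}}


def post_json(url, payload=None, timeout=120):
    # POST a JSON body and read the full response so truncation surfaces here.
    data = json.dumps(payload or {}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()


def flush_then_burst(base_url, concurrency=16, rounds=4):
    """Flush the cache, then fire a burst of concurrent /generate requests."""
    failures = []
    for _ in range(rounds):
        post_json(f"{base_url}/flush_cache")
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = [
                pool.submit(
                    post_json,
                    f"{base_url}/generate",
                    build_generate_payload(f"Count to {i}."),
                )
                for i in range(concurrency)
            ]
            for fut in futures:
                try:
                    fut.result()
                except Exception as exc:  # e.g. a prematurely ended response
                    failures.append(repr(exc))
    return failures
```

Calling flush_then_burst with the base URL of a locally launched server returns the list of client-side failures. A non-empty list alongside the KeyError in the server log would indicate the same failure mode.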

This is likely reproducible on any platform under the same concurrent-generate-after-flush pattern, since the changed code is platform-agnostic. It surfaced first on the AMD VLM perf job but does not appear to be ROCm-specific.

Related prior races (context)

There is a history of rid_to_state / abort-race fixes here (#21257, #21281, #28341, #28380, #28694). This appears to be a fresh re-exposure of the same class of bug after the IPC refactor, not a duplicate.

Note

Not opening a PR: this is core (non-AMD) tokenizer-manager IPC code with whole-project impact, and a blind patch to the async ordering without a local high-concurrency repro would be risky. Filing with the bisect + evidence so the change author can place the fix correctly. Happy to test a candidate fix on the AMD perf job.
