Summary
nightly-perf-2-gpu-vlm (ROCm) started failing on 2026-06-24 with a server-side KeyError in tokenizer_manager.py that aborts the HTTP response mid-stream:

    File ".../sglang/srt/managers/tokenizer_manager.py", line 1441, in _wait_one_response
      state = self.rid_to_state[obj.rid]
    KeyError: '5e25b243e77743e6b3711ae7206ffdf3'

Client side:

    urllib3.exceptions.ProtocolError: Response ended prematurely
    requests.exceptions.ChunkedEncodingError: Response ended prematurely
    Error running benchmark for Qwen/Qwen3-VL-30B-A3B-Instruct

The request's state has already been removed from rid_to_state by the time _wait_one_response reads it — a race between request-state cleanup (the abort/error branch that does del self.rid_to_state[rid], ~line 1421) and the wait path.
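To make the failure shape concrete, here is a minimal sketch of the same kind of race in plain asyncio. It is not the sglang code: `ToyManager`, `wait_one_response`, and `abort` are hypothetical stand-ins, and the single `asyncio.sleep(0)` is an assumed placeholder for whatever await point sits between registration and the lookup in the real wait path.

```python
import asyncio


class ToyManager:
    """Toy stand-in for the tokenizer manager (hypothetical, not sglang code)."""

    def __init__(self):
        self.rid_to_state = {}

    async def wait_one_response(self, rid):
        # Assumption: one yield to the event loop before the lookup,
        # standing in for an awaited IPC call in the real wait path.
        await asyncio.sleep(0)
        # Same shape as the failing line: an unguarded dict lookup.
        return self.rid_to_state[rid]

    async def abort(self, rid):
        # Cleanup branch: drop the request state.
        del self.rid_to_state[rid]


async def demo():
    mgr = ToyManager()
    rid = "req-1"
    mgr.rid_to_state[rid] = {"finished": False}
    # The waiter is scheduled first, but it yields once, so the abort
    # task runs to completion before the waiter performs its lookup.
    waiter = asyncio.create_task(mgr.wait_one_response(rid))
    aborter = asyncio.create_task(mgr.abort(rid))
    results = await asyncio.gather(waiter, aborter, return_exceptions=True)
    return results[0]


outcome = asyncio.run(demo())
print(type(outcome).__name__, outcome)
```

In this toy the waiter ends with a `KeyError` on the request id, which mirrors the server-side traceback above. Which real code path plays the role of `abort` here is not established by this report.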
Bisect — clean before/after boundary at #29012
Checking the nightly-perf-2-gpu-vlm job outcome specifically (run-level conclusions are misleading because of skipped/partial runs):
| Date | perf-vlm job | Result |
|---|---|---|
| 2026-06-22 19:52 | 82793291659 | ✅ success |
| 2026-06-23 19:05 | 83026842412 | ✅ success |
| 2026-06-23 22:54 | - | #29012 merged (34dd9c28) |
| 2026-06-24 20:48 | 83267124889 | ❌ KeyError |
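The boundary in the table can be checked mechanically. The sketch below uses only the timestamps and job IDs listed above; `is_clean_boundary` is a hypothetical helper, and it assumes all timestamps are in the same timezone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Merge time of #29012 and the perf-vlm job results from the table.
merge_time = datetime.strptime("2026-06-23 22:54", FMT)
runs = [
    ("2026-06-22 19:52", "82793291659", True),
    ("2026-06-23 19:05", "83026842412", True),
    ("2026-06-24 20:48", "83267124889", False),
]


def is_clean_boundary(runs, merge_time):
    """True if every run before the merge passed and every run after it failed."""
    before = [ok for ts, _, ok in runs if datetime.strptime(ts, FMT) < merge_time]
    after = [ok for ts, _, ok in runs if datetime.strptime(ts, FMT) >= merge_time]
    # Require at least one run on each side so the boundary is meaningful.
    return bool(before) and bool(after) and all(before) and not any(after)


clean = is_clean_boundary(runs, merge_time)
print(clean)
```

With only one post-merge data point the boundary is clean but thin; further nightly results would strengthen or weaken it.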
The first completed perf-vlm run after #29012 merged is the first failure. #29012 ("[Refactor] Introduce sock_send/sock_recv wrappers for zmq IPC") rewrote the scheduler/detokenizer IPC in tokenizer_manager.py (+42/-16), including converting several send_pyobj / recv_pyobj calls to sock_send / async_sock_send / async_sock_recv (i.e. introducing new await / yield points into the dispatch & receive paths). That reshapes the interleaving between request registration, output handling, and the abort-cleanup del self.rid_to_state[rid], which is consistent with this newly-exposed race.
(I'm confident in the bisect; I'll leave the exact offending interleaving to you since you authored the refactor and know the intended ordering. The single-request dispatch path is synchronous, so the race most likely involves the abort / batched / output-receive paths that now await.)
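As an illustration of why a new await point matters, the sketch below records the event order for a dispatch whose send is synchronous versus one whose send yields to the event loop. Everything in it is hypothetical (`run_case`, `dispatch`, `abort`, and the `asyncio.sleep(0)` standing in for an awaited socket send); it shows the general mechanism only and does not claim to identify the interleaving that #29012 actually exposes.

```python
import asyncio


async def run_case(send_is_async):
    """Run one dispatch and one abort concurrently and return the event order."""
    rid_to_state = {}
    events = []

    async def dispatch(rid):
        rid_to_state[rid] = object()
        events.append("register")
        if send_is_async:
            # Assumed stand-in for an awaited socket send: yields once.
            await asyncio.sleep(0)
        events.append("send")
        events.append("read-ok" if rid in rid_to_state else "read-miss")

    async def abort(rid):
        # Cleanup that removes the request state if it is present.
        rid_to_state.pop(rid, None)
        events.append("abort")

    await asyncio.gather(dispatch("r"), abort("r"))
    return events


sync_events = asyncio.run(run_case(send_is_async=False))
async_events = asyncio.run(run_case(send_is_async=True))
print(sync_events)
print(async_events)
```

With the synchronous send, the dispatch runs from registration to the read without interruption, so the abort can only land afterwards. With the awaited send, the abort lands between registration and the read, and the read misses.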
Reproduction
- Workflow: nightly-test-amd-rocm720, job nightly-perf-2-gpu-vlm-rocm720
- Test: test/registered/amd/perf/mi30x/test_vlms_perf_amd.py::test_bench_one_batch, model Qwen/Qwen3-VL-30B-A3B-Instruct
- The perf harness issues /flush_cache followed by bursts of concurrent /generate requests; the KeyError fires on one of the concurrent requests.
- Failing job log: run 28119401197, job 83267124889.
This is likely reproducible on any platform under the same concurrent-generate-after-flush pattern (the changed code is platform-agnostic); it surfaced first on the AMD VLM perf job but is not ROCm-specific.
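For anyone who wants to try the pattern outside the nightly job, here is a rough, untested sketch of a flush-then-burst load generator using only the standard library. It does not replicate the perf harness. The base URL, the text-only payload, the round and burst sizes, and the use of POST for /flush_cache are all assumptions, and `http_post` and `flush_then_burst` are hypothetical helpers.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:30000"  # assumption: address of a locally launched server


def http_post(path, payload):
    """POST JSON to the server and return the HTTP status code."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        resp.read()  # drain the body so a response cut off mid-stream can surface as an error
        return resp.status


def flush_then_burst(post, rounds=3, burst=16):
    """Flush the cache, then fire `burst` concurrent /generate calls, `rounds` times."""
    # Assumed request body; the real harness sends VLM benchmark requests.
    payload = {"text": "Hello", "sampling_params": {"max_new_tokens": 32}}
    failures = []
    for round_idx in range(rounds):
        post("/flush_cache", {})
        with ThreadPoolExecutor(max_workers=burst) as pool:
            futures = [pool.submit(post, "/generate", payload) for _ in range(burst)]
            for fut in futures:
                try:
                    fut.result()
                except Exception as exc:  # collect every failure instead of stopping
                    failures.append((round_idx, repr(exc)))
    return failures
```

Calling `flush_then_burst(http_post)` against a live server returns a list of `(round, error)` pairs; an empty list means no request failed. The `post` argument is injectable so the loop can be exercised without a server.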
Related prior races (context)
There is a history of rid_to_state / abort-race fixes here (#21257, #21281, #28341, #28380, #28694) — this appears to be a fresh re-exposure of the same class after the IPC refactor, not a duplicate.
Note
Not opening a PR: this is core (non-AMD) tokenizer-manager IPC code with whole-project impact, and a blind patch to the async ordering without a local high-concurrency repro would be risky. Filing with the bisect + evidence so the change author can place the fix correctly. Happy to test a candidate fix on the AMD perf job.