[Bug] tokenizer_manager KeyError in _wait_one_response (rid_to_state) — regression from #29012, breaks nightly-perf-2-gpu-vlm

## Summary

`nightly-perf-2-gpu-vlm` (ROCm) started failing on 2026-06-24 with a server-side `KeyError` in `tokenizer_manager.py` that aborts the HTTP response mid-stream:

```
File ".../sglang/srt/managers/tokenizer_manager.py", line 1441, in _wait_one_response
    state = self.rid_to_state[obj.rid]
KeyError: '5e25b243e77743e6b3711ae7206ffdf3'
```

Client side:
```
urllib3.exceptions.ProtocolError: Response ended prematurely
requests.exceptions.ChunkedEncodingError: Response ended prematurely
Error running benchmark for Qwen/Qwen3-VL-30B-A3B-Instruct
```

The request's state has already been removed from `rid_to_state` by the time `_wait_one_response` reads it — a race between request-state cleanup (the abort/error branch that does `del self.rid_to_state[rid]`, ~line 1421) and the wait path.

## Bisect — clean before/after boundary at #29012

Checking the `nightly-perf-2-gpu-vlm` **job** outcome specifically (run-level conclusions are misleading because of skipped/partial runs):

| Date | perf-vlm job | Result |
|------|--------------|--------|
| 2026-06-22 19:52 | 82793291659 | ✅ success |
| 2026-06-23 19:05 | 83026842412 | ✅ success |
| **2026-06-23 22:54** | — | **#29012 merged (`34dd9c28`)** |
| 2026-06-24 20:48 | 83267124889 | ❌ KeyError |

The first completed perf-vlm run after #29012 merged is the first failure. #29012 ("[Refactor] Introduce sock_send/sock_recv wrappers for zmq IPC") rewrote the scheduler/detokenizer IPC in `tokenizer_manager.py` (+42/-16), including converting several `send_pyobj` / `recv_pyobj` calls to `sock_send` / `async_sock_send` / `async_sock_recv` (i.e. introducing new `await` / yield points into the dispatch & receive paths). That reshapes the interleaving between request registration, output handling, and the abort-cleanup `del self.rid_to_state[rid]`, which is consistent with this newly-exposed race.

(I'm confident in the bisect; I'll leave the exact offending interleaving to you since you authored the refactor and know the intended ordering. The single-request dispatch path is synchronous, so the race most likely involves the abort / batched / output-receive paths that now await.)

## Reproduction

- Workflow: `nightly-test-amd-rocm720`, job `nightly-perf-2-gpu-vlm-rocm720`
- Test: `test/registered/amd/perf/mi30x/test_vlms_perf_amd.py::test_bench_one_batch`, model `Qwen/Qwen3-VL-30B-A3B-Instruct`
- The perf harness issues `/flush_cache` followed by bursts of concurrent `/generate` requests; the KeyError fires on one of the concurrent requests.
- Failing job log: run 28119401197, job 83267124889.

This is likely reproducible on any platform under the same concurrent-generate-after-flush pattern (the changed code is platform-agnostic); it surfaced first on the AMD VLM perf job but is not ROCm-specific.

## Related prior races (context)

There is a history of `rid_to_state` / abort-race fixes here (#21257, #21281, #28341, #28380, #28694) — this appears to be a fresh re-exposure of the same class after the IPC refactor, not a duplicate.

## Note

Not opening a PR: this is core (non-AMD) tokenizer-manager IPC code with whole-project impact, and a blind patch to the async ordering without a local high-concurrency repro would be risky. Filing with the bisect + evidence so the change author can place the fix correctly. Happy to test a candidate fix on the AMD perf job.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug] tokenizer_manager KeyError in _wait_one_response (rid_to_state) — regression from #29012, breaks nightly-perf-2-gpu-vlm #29256

Summary

Bisect — clean before/after boundary at #29012

Reproduction

Related prior races (context)

Note

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Date	perf-vlm job	Result
2026-06-22 19:52	82793291659	✅ success
2026-06-23 19:05	83026842412	✅ success
2026-06-23 22:54	—	#29012 merged (`34dd9c28`)
2026-06-24 20:48	83267124889	❌ KeyError

Uh oh!

[Bug] tokenizer_manager KeyError in _wait_one_response (rid_to_state) — regression from #29012, breaks nightly-perf-2-gpu-vlm #29256

Description

Summary

Bisect — clean before/after boundary at #29012

Reproduction

Related prior races (context)

Note

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions