[Fix] TokenizerManager: avoid KeyError race in _wait_one_response #29529
GodlyDonuts wants to merge 1 commit into
_wait_one_response looked up self.rid_to_state[obj.rid] on entry. For batched and parallel-sample requests the per-request generators are built in a loop and only advanced later under asyncio.gather. If one of them finishes first, _handle_batch_output removes its rid before the waiter runs, so the lookup raised KeyError and cut off the response. The extra await points added in sgl-project#29012 made the window wider and broke nightly-perf-2-gpu-vlm.

Pass the state the caller already holds into _wait_one_response instead of looking it up again. The notify path sets state.event on the same object, so the output still arrives after the rid has been removed.

Add a test that runs _wait_one_response with the rid already gone and checks the output is still returned.

Fixes sgl-project#29256
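The race can be reproduced in miniature with plain asyncio. This is a hypothetical toy, not SGLang code: ReqState here is a simplified stand-in, ToyTokenizerManager, handle_output, wait_by_lookup, wait_with_state and finish_later are illustrative names, and the waiters are plain coroutines rather than the async generators the real _wait_one_response uses. The property it relies on is that creating a coroutine does not run its body, so a lookup done "on entry" only happens once asyncio.gather schedules it.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ReqState:
    # Hypothetical stand-in for ReqState: a wake-up event plus an output slot.
    event: asyncio.Event = field(default_factory=asyncio.Event)
    out: Optional[Any] = None


class ToyTokenizerManager:
    def __init__(self) -> None:
        self.rid_to_state: Dict[str, ReqState] = {}

    def handle_output(self, rid: str, out: Any) -> None:
        # Toy notify path: store the output, set the event, then drop the rid.
        state = self.rid_to_state[rid]
        state.out = out
        state.event.set()
        del self.rid_to_state[rid]

    async def wait_by_lookup(self, rid: str) -> Any:
        # Old pattern: re-read the map when the coroutine first runs.
        state = self.rid_to_state[rid]  # raises KeyError if rid was removed
        await state.event.wait()
        return state.out

    async def wait_with_state(self, state: ReqState) -> Any:
        # Fixed pattern: use the state object the caller already holds.
        await state.event.wait()
        return state.out


async def finish_later(mgr: ToyTokenizerManager, rid: str, out: Any) -> None:
    # Completes a request after the waiters have had a chance to start.
    await asyncio.sleep(0.01)
    mgr.handle_output(rid, out)


async def run(use_state: bool) -> List[Any]:
    mgr = ToyTokenizerManager()
    waiters = []
    for rid in ("a", "b"):
        state = ReqState()
        mgr.rid_to_state[rid] = state
        # Creating the coroutine does not execute its body yet.
        if use_state:
            waiters.append(mgr.wait_with_state(state))
        else:
            waiters.append(mgr.wait_by_lookup(rid))
    # Request "a" finishes before any waiter has been advanced.
    mgr.handle_output("a", "out-a")
    results = await asyncio.gather(
        *waiters, finish_later(mgr, "b", "out-b"), return_exceptions=True
    )
    return results[:2]  # drop finish_later's None result


if __name__ == "__main__":
    print(asyncio.run(run(use_state=False)))  # [KeyError('a'), 'out-b']
    print(asyncio.run(run(use_state=True)))   # ['out-a', 'out-b']
```

With the lookup-based waiter, the request that finished early surfaces as a KeyError while the slower one still completes; passing the state object removes the dependency on the map entry still being present.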
Code Review
This pull request resolves a race condition in TokenizerManager where _wait_one_response could raise a KeyError if a request's rid was cleaned up from rid_to_state before the generator was first advanced. To fix this, _wait_one_response now accepts the ReqState object directly from the caller instead of performing a dictionary lookup. Corresponding unit tests have been added to verify this behavior. I have no additional feedback to provide.
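A minimal sketch of the kind of check the added test performs, assuming a simplified ReqState and a free function wait_one_response standing in for the TokenizerManager method (both hypothetical; the PR's actual test may be structured differently): the request is completed and its rid deleted before the waiter ever runs, and the waiter must still return the output.

```python
import asyncio
import unittest
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ReqState:
    # Hypothetical minimal ReqState; the real one carries more fields.
    event: asyncio.Event = field(default_factory=asyncio.Event)
    out: Optional[Any] = None


async def wait_one_response(state: ReqState) -> Any:
    # Sketch of the fixed waiter: it never touches rid_to_state.
    await state.event.wait()
    return state.out


class WaitAfterRidRemovedTest(unittest.IsolatedAsyncioTestCase):
    async def test_output_returned_after_rid_removed(self) -> None:
        rid_to_state: Dict[str, ReqState] = {}
        state = ReqState()
        rid_to_state["rid-0"] = state
        # Simulate the batch-output handler finishing the request first.
        state.out = {"text": "done"}
        state.event.set()
        del rid_to_state["rid-0"]
        self.assertNotIn("rid-0", rid_to_state)
        # The waiter still sees the output because it holds the same object.
        out = await asyncio.wait_for(wait_one_response(state), timeout=1.0)
        self.assertEqual(out, {"text": "done"})
```

The wait_for timeout keeps a regression that never sets the event from hanging the test suite.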
Motivation
Fixes #29256.
nightly-perf-2-gpu-vlm started failing with a KeyError that cuts off the response mid-stream (Response ended prematurely on the client). The cause: _wait_one_response looked up self.rid_to_state[obj.rid] on entry. For batched and parallel-sample requests the per-request generators are built in a loop and only advanced later under asyncio.gather. If one finishes first, _handle_batch_output removes its rid (del self.rid_to_state[rid]) before the waiter runs, so the lookup raised KeyError. The extra await points added in #29012 made the window wider, which is why the failure bisects to that PR.
Modifications
Pass the ReqState the caller already holds into _wait_one_response instead of looking it up again. Every caller fetches state right after registering the request; the warmup path now does the same. The notify path in _handle_batch_output sets state.event on that same object, so the output still arrives after the rid is removed.
Add a test that runs _wait_one_response with the rid already gone from rid_to_state and checks the output is still returned.
Checklist
CI States
Latest PR Test (Base): ❌ Run #28300592522
Latest PR Test (Extra): ❌ Run #28300592456