[Fix] TokenizerManager: avoid KeyError race in _wait_one_response by GodlyDonuts · Pull Request #29529 · sgl-project/sglang

GodlyDonuts · 2026-06-27T20:16:22Z

Motivation

Fixes #29256. nightly-perf-2-gpu-vlm started failing with a KeyError that cuts off the response mid-stream (Response ended prematurely on the client):

File ".../managers/tokenizer_manager.py", line 1451, in _wait_one_response
    state = self.rid_to_state[obj.rid]
KeyError: '5e25b243e77743e6b3711ae7206ffdf3'

_wait_one_response looked up self.rid_to_state[obj.rid] on entry. For batched and parallel-sample requests the per-request generators are built in a loop and only advanced later under asyncio.gather. If one finishes first, _handle_batch_output removes its rid (del self.rid_to_state[rid]) before the waiter runs, so the lookup raised KeyError. The extra await points added in #29012 made the window wider, which is why the failure bisects to that PR.

Modifications

Pass the ReqState the caller already holds into _wait_one_response instead of looking it up again. Every caller fetches state right after registering the request; the warmup path now does the same. The notify path in _handle_batch_output sets state.event on that same object, so the output still arrives after the rid is removed.
Add a regression test that runs _wait_one_response with the rid already gone from rid_to_state and checks the output is still returned.

Checklist

Format code with pre-commit
Add unit tests

CI States

Latest PR Test (Base): ❌ Run #28300592522
Latest PR Test (Extra): ❌ Run #28300592456

_wait_one_response looked up self.rid_to_state[obj.rid] on entry. For batched and parallel-sample requests the per-request generators are built in a loop and only advanced later under asyncio.gather. If one of them finishes first, _handle_batch_output removes its rid before the waiter runs, so the lookup raised KeyError and cut off the response. The extra await points added in sgl-project#29012 made the window wider and broke nightly-perf-2-gpu-vlm. Pass the state the caller already holds into _wait_one_response instead of looking it up again. The notify path sets state.event on the same object, so the output still arrives after the rid has been removed. Add a test that runs _wait_one_response with the rid already gone and checks the output is still returned. Fixes sgl-project#29256

gemini-code-assist

Code Review

This pull request resolves a race condition in TokenizerManager where _wait_one_response could raise a KeyError if a request's rid was cleaned up from rid_to_state before the generator was first advanced. To fix this, _wait_one_response now accepts the ReqState object directly from the caller instead of performing a dictionary lookup. Corresponding unit tests have been added to verify this behavior. I have no additional feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

GodlyDonuts requested review from Ying1123, hnyls2002, merrymercy and xiezhq-hermann as code owners June 27, 2026 20:16

gemini-code-assist Bot reviewed Jun 27, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Fix] TokenizerManager: avoid KeyError race in _wait_one_response#29529

[Fix] TokenizerManager: avoid KeyError race in _wait_one_response#29529
GodlyDonuts wants to merge 1 commit into
sgl-project:mainfrom
GodlyDonuts:fix/tokenizer-manager-rid-race

GodlyDonuts commented Jun 27, 2026 •

edited by github-actions Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

GodlyDonuts commented Jun 27, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Checklist

CI States

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

GodlyDonuts commented Jun 27, 2026 •

edited by github-actions Bot

Loading