[Bugfix] Fix missing first token in tool calls during reasoning-to-tool transition #30671
Conversation
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
The pull request refactors the logic within the `chat_completion_stream_generator` function, specifically the handling of reasoning extraction and the detection of reasoning completion. The `reasoning_parser.extract_reasoning_streaming` call, along with its associated `delta_message` processing, is now conditional, executing only when reasoning is actively ongoing and not yet marked as ended. This change ensures that `current_text` is not updated unnecessarily when reasoning ends via prompt token IDs, and clarifies that tool calls are processed only after the reasoning phase has concluded.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. You can ask your reviewers to trigger select CI tests on top of that. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
chaunceyjiang left a comment
Please provide a minimal reproducible example so that I can reproduce it in my local environment.
The deployment config is the same as https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#launching-deepseek-v32, and a reproducible example is shown below. You can run it multiple times to check whether the first token is missing. In my tests, nearly 90% of runs lose the first token.
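(The author's original snippet is not preserved in this page capture; the following is a minimal sketch of what such a streaming tool-call request could look like against an OpenAI-compatible vLLM endpoint. The base URL, model name, and tool schema are assumptions.)

```python
# Minimal repro sketch (not the author's original script): stream a reasoning +
# tool-call request and print every delta so a dropped first token is visible.
# Base URL, model name, and tool schema are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Exp",  # assumed model name
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # Before the fix, the first chunk after </think> could arrive with content=None.
    print(getattr(delta, "reasoning_content", None), delta.content, delta.tool_calls)
```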
Thanks~ @mondaylord you need to DCO
Force-pushed ddd7919 to a7ae345
chaunceyjiang left a comment
Thanks~
Signed-off-by: mondaylord <[email protected]>
Force-pushed a7ae345 to 24e0084
…ol transition (vllm-project#30671) Signed-off-by: mondaylord <[email protected]> Signed-off-by: Joachim Studnia <[email protected]>
Purpose
This PR refactors the streaming logic in the generation handler to fix an issue where the first token of a tool call could be dropped or set to `None` during the transition from Reasoning to Tool Calling.
The Problem
Previously, the logic used a mutually exclusive `if-else` structure. In streaming scenarios, if the "Reasoning End" token (e.g., `</think>`) appeared in the current iteration, the `reasoning_end_arr` flag was set to `True`, but the `else` block was skipped for that iteration. This resulted in the immediate next token being generated by the engine but dropped from the streaming response.
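(The diff itself is not reproduced in this description; the following is a simplified Python paraphrase of the old control flow, reusing the identifiers named above — `reasoning_end_arr`, `extract_reasoning_streaming`, `tool_parser` — with argument lists and the `is_reasoning_end` helper as simplifications, not the exact vLLM source.)

```python
# Simplified paraphrase of the OLD flow (not the exact vLLM source).
def stream_step_old(i, reasoning_end_arr, reasoning_parser, tool_parser,
                    previous_text, current_text, delta_text, output_token_ids):
    if not reasoning_end_arr[i]:
        # Reasoning phase: emit the reasoning delta.
        delta_message = reasoning_parser.extract_reasoning_streaming(
            previous_text, current_text, delta_text)
        if reasoning_parser.is_reasoning_end(output_token_ids):
            # End-of-reasoning marker (e.g. </think>) seen in THIS iteration:
            # the flag flips, but the else branch below is skipped this time,
            # so a content/tool token generated in the same step is dropped.
            reasoning_end_arr[i] = True
    else:
        # Only reached on the NEXT iteration.
        delta_message = tool_parser.extract_tool_calls_streaming(
            previous_text, current_text, delta_text)
    return delta_message
```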
Evidence (Log Analysis)
The following SSE logs demonstrate the issue before the fix. Observe the second chunk: the model generated the token "Ap" (visible in `logprobs`), but `delta.content` was set to null. The next chunk continues with "ologies". Result: the user receives "ologies" instead of "Apologies".
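(The original log excerpt is not reproduced here; schematically, the two chunks described above have roughly this shape in the OpenAI chat-completions SSE format, with all values other than the tokens named above being placeholders.)

```
data: {"choices":[{"index":0,"delta":{"content":null},"logprobs":{"content":[{"token":"Ap","logprob":-0.01}]}}]}

data: {"choices":[{"index":0,"delta":{"content":"ologies"},"logprobs":{"content":[{"token":"ologies","logprob":-0.02}]}}]}
```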
The Fix
The logic is now changed to sequential `if` statements (sketched below):
- Sequential Processing: After checking/processing reasoning, the code immediately checks `if reasoning_end_arr[i]:`. This ensures that if reasoning finishes in the current step, the `tool_parser` is immediately invoked to process the remaining tokens in the same iteration.
- Optimization: The check for `prompt_token_ids` (disabling reasoning via the prompt) is moved before the expensive `extract_reasoning_streaming` call, avoiding unnecessary processing when thinking is disabled.
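(Again a hedged paraphrase rather than the exact patch; the `is_reasoning_end` helper and the argument lists are simplifications.)

```python
# Simplified paraphrase of the NEW flow (sequential ifs; not the exact vLLM source).
def stream_step_new(i, reasoning_end_arr, reasoning_parser, tool_parser,
                    previous_text, current_text, delta_text,
                    prompt_token_ids, output_token_ids):
    delta_message = None
    # Optimization: if reasoning is already disabled via the prompt, mark it ended
    # before paying for extract_reasoning_streaming at all.
    if not reasoning_end_arr[i] and reasoning_parser.is_reasoning_end(prompt_token_ids):
        reasoning_end_arr[i] = True
    if not reasoning_end_arr[i]:
        # Still reasoning: emit the reasoning delta as before.
        delta_message = reasoning_parser.extract_reasoning_streaming(
            previous_text, current_text, delta_text)
        if reasoning_parser.is_reasoning_end(output_token_ids):
            reasoning_end_arr[i] = True
    if reasoning_end_arr[i]:
        # Sequential check, no else: if reasoning ended in THIS iteration, the tool
        # parser still runs in the same step, so the first post-reasoning token is
        # forwarded instead of being dropped.
        delta_message = tool_parser.extract_tool_calls_streaming(
            previous_text, current_text, delta_text)
    return delta_message
```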
Test Plan
Directly test a tool-call + reasoning case, such as the reproducible example given above.
Test Result
The first token is generated normally, rather than being set to `None` or dropped.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.