
Conversation

@mondaylord
Contributor

@mondaylord mondaylord commented Dec 15, 2025

Purpose

This PR refactors the streaming logic in the generation handler to fix an issue where the first token of a tool call could be dropped or set to None during the transition from Reasoning to Tool Calling.

The Problem

Previously, the logic used a mutually exclusive if-else structure:

if not reasoning_end_arr[i]:
    # Handle reasoning...
    if is_end:
        reasoning_end_arr[i] = True  # flag flips here...
else:
    # Handle tool calls... (skipped in the iteration where the flag flips)

In streaming scenarios, when the reasoning-end token (e.g. </think>) appeared in the current iteration, the reasoning_end_arr flag was set to True, but the else block was skipped for that same iteration. As a result, the first token after the reasoning-end marker was generated by the engine but dropped from the streaming response.
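
To make this concrete, here is a small, self-contained toy reproduction of the old control flow (the chunk contents and the split_reasoning helper are invented for illustration; this is not the actual handler code):

END = "</think>"

def split_reasoning(chunk: str):
    """Return (reasoning_text, content_text, is_end) for one streamed chunk."""
    if END in chunk:
        reasoning, _, content = chunk.partition(END)
        return reasoning, content, True
    return chunk, "", False

def old_stream(chunks):
    content_out, reasoning_done = [], False
    for chunk in chunks:
        if not reasoning_done:
            _reasoning, content, is_end = split_reasoning(chunk)
            if is_end:
                reasoning_done = True
                # BUG: `content` is silently dropped -- the else branch below
                # does not run in the iteration where the flag flips.
        else:
            content_out.append(chunk)
    return "".join(content_out)

chunks = ["Let me call the weather tool.", "</think>Ap", "ologies", ", one moment."]
print(old_stream(chunks))  # -> "ologies, one moment."  ("Ap" is lost)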

Evidence (Log Analysis)

The following SSE logs demonstrate the issue before the fix. Observe the second chunk: the model generated the token "Ap" (visible in logprobs), but the delta.content was set to null. The next chunk continues with "ologies". Result: The user receives "ologies" instead of "Apologies".

// 1. Initial chunk
data: {"id": "...", "choices": [{"delta": {"role": "assistant", "content": "", "reasoning_content": null}}]}

// 2. THE BUG: Transition happens here.
// 'token' is "Ap" (logprob exists), but 'delta.content' is null.
data: {"id": "...", "choices": [{"delta": {"content": null, "reasoning_content": null}, "logprobs": {"content": [{"token": "Ap", "logprob": -8.55, ...}]}}]}

// 3. Next chunk continues, missing the start.
data: {"id": "...", "choices": [{"delta": {"content": "ologies", "reasoning_content": null}, "logprobs": {"content": [{"token": "ologies", "logprob": -0.0008, ...}]}}]}

The Fix

The logic now uses sequential if statements instead:

  • Sequential Processing: After checking/processing reasoning, the code immediately checks if reasoning_end_arr[i]:. This ensures that if reasoning finishes in the current step, the tool_parser is invoked in the same iteration to process the remaining tokens (see the sketch after this list).

  • Optimization: The check for prompt_token_ids (reasoning disabled via the prompt) is moved before the comparatively expensive extract_reasoning_streaming call, avoiding unnecessary processing when thinking is disabled.
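
Using the same toy model as above, a minimal sketch of the restructured flow looks roughly like this (again illustrative only, not the actual vLLM handler):

END = "</think>"

def split_reasoning(chunk: str):
    """Return (reasoning_text, content_text, is_end) for one streamed chunk."""
    if END in chunk:
        reasoning, _, content = chunk.partition(END)
        return reasoning, content, True
    return chunk, "", False

def new_stream(chunks):
    content_out, reasoning_done = [], False
    for chunk in chunks:
        pending = chunk
        if not reasoning_done:
            _reasoning, pending, is_end = split_reasoning(chunk)
            if is_end:
                reasoning_done = True
        # Sequential check: runs in the SAME iteration the flag flips, so the
        # first post-reasoning token is forwarded instead of being dropped.
        if reasoning_done and pending:
            content_out.append(pending)
    return "".join(content_out)

chunks = ["Let me call the weather tool.", "</think>Ap", "ologies", ", one moment."]
print(new_stream(chunks))  # -> "Apologies, one moment."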

Test Plan

Directly test the tool-call + reasoning case, e.g. with the request body below (a small replay script follows the body):

'{
  "model": "deepseek-ai/DeepSeek-V3.2",
  "stream": true,
  "stream_options": {
    "include_usage": true
  },
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in Boston in fahrenheit?"
    }
  ],
  "max_tokens": 65536,
  "temperature": 1,
  "top_p": 1,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "seed": null,
  "min_p": 0,
  "repetition_penalty": 1,
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "strict": true,
        "parameters": {
          "type": "object",
          "properties": {
            "location": {
              "type": "string",
              "description": "The city and state, e.g. San Francisco, CA"
            },
            "unit": {
              "type": "string",
              "enum": [
                "celsius",
                "fahrenheit"
              ]
            }
          },
          "additionalProperties": false,
          "required": [
            "location",
            "unit"
          ]
        }
      }
    }
  ]
}'
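
For reference, a minimal way to replay this request and inspect the streamed deltas; the URL assumes a local vLLM OpenAI-compatible server on the default port, and request.json is assumed to hold the body above (without the surrounding shell quotes):

import json

import requests  # any HTTP client works; requests is used here for brevity

with open("request.json") as f:
    payload = json.load(f)

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed local vLLM address
    json=payload,
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
        continue
    chunk = json.loads(line[len(b"data: "):])
    for choice in chunk.get("choices") or []:
        # Before the fix, the first post-reasoning delta had content == null
        # even though the corresponding token appeared in logprobs.
        print(choice.get("delta"))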

Test Result

The first token of the post-reasoning output is now streamed normally; it is neither set to None nor dropped.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the frontend label Dec 15, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request refactors the logic within the chat_completion_stream_generator function, specifically concerning the handling of reasoning extraction and the detection of reasoning completion. The reasoning_parser.extract_reasoning_streaming call, along with its associated delta_message processing, is moved to be conditional, now only executing when reasoning is actively ongoing and not yet marked as ended. This change ensures that current_text is not updated unnecessarily when reasoning ends via prompt token IDs, and clarifies that tool calls are processed only after the reasoning phase has concluded.

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

Collaborator

@chaunceyjiang chaunceyjiang left a comment


Please provide a minimal reproducible example so that I can reproduce it on my local environment.

@mondaylord
Contributor Author

Please provide a minimal reproducible example so that I can reproduce it on my local environment.

The deployment config is the same as https://docs.vllm.ai/projects/recipes/en/latest/DeepSeek/DeepSeek-V3_2.html#launching-deepseek-v32, and the reproducible example is the same request body shown in the Test Plan above.


You can run it multiple times to check whether the first token is missing. In my tests, roughly 90% of runs lost the first token.

@chaunceyjiang
Collaborator

Thanks~ @mondaylord

You need to sign off your commits (DCO).

@mondaylord mondaylord force-pushed the fix_dsv32_ignore_first_token branch from ddd7919 to a7ae345 Compare December 15, 2025 13:54
Collaborator

@chaunceyjiang chaunceyjiang left a comment


Thanks~

@mondaylord mondaylord force-pushed the fix_dsv32_ignore_first_token branch from a7ae345 to 24e0084 Compare December 15, 2025 13:57
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 15, 2025
@chaunceyjiang chaunceyjiang enabled auto-merge (squash) December 15, 2025 13:57
@chaunceyjiang chaunceyjiang self-assigned this Dec 15, 2025
@chaunceyjiang chaunceyjiang merged commit 17fec3a into vllm-project:main Dec 15, 2025
49 checks passed
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request Dec 15, 2025