
[Bugfix] Fix vllm bench serve to count multimodal tokens in "total input tokens"#38654

Open
mgehre-amd wants to merge 1 commit into vllm-project:main from mgehre-amd:matthias.fix-bench-mm-input-tokens

Conversation

@mgehre-amd
Contributor

Purpose

When benchmarking multimodal models, vllm bench serve reports total_input_tokens based on the client-side text-only prompt length, excluding image/encoder tokens. The server already reports the correct count (text + image tokens) via usage.prompt_tokens in the streaming response, but it was not captured.

Capture prompt_tokens from the streaming usage chunk and use it for input token metrics so total_input_tokens reflects the actual prefill size.
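To illustrate the approach, here is a minimal sketch of parsing the server-reported `prompt_tokens` out of a streamed usage chunk. This is not the actual vLLM benchmark code; `update_prompt_len` and the `record` dict (standing in for the benchmark's per-request output object) are illustrative names.

```python
import json

def update_prompt_len(chunk_line: str, record: dict) -> dict:
    """Parse one SSE data line; if it carries a usage block, overwrite the
    client-side prompt length with the server's count (text + image tokens)."""
    payload = chunk_line.removeprefix("data: ").strip()
    if payload == "[DONE]":
        return record
    data = json.loads(payload)
    usage = data.get("usage")
    if usage and (pt := usage.get("prompt_tokens")) is not None:
        record["prompt_len"] = pt  # actual prefill size, incl. encoder tokens
    return record
```

For a multimodal request, the client-side text-only count (e.g. 512) is replaced by the server's full prefill count once the usage chunk arrives.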

Test Plan

```shell
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct --port 8000 \
  --random-input-len 512 --output-len 128 --num-prompts 1 \
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name random-mm --random-mm-base-items-per-request 1 \
  --random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
  --random-mm-bucket-config '{(1024, 800, 1): 1.0}'
```

Before:

  Total input tokens:                      512

After:

  Total input tokens:                      1606

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
mgehre-amd changed the title from "[Bugfix] Fix benchmark to count multimodal tokens in input metrics" to "[Bugfix] Fix vllm bench serve to count multimodal tokens in "total input tokens"" on Mar 31, 2026
The mergify bot added the performance (Performance-related issues) and bug (Something isn't working) labels on Mar 31, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the OpenAI benchmark request functions to capture prompt_tokens from the server's usage response and uses that value for metric calculations in serve.py. One concern: the implementation reads usage data inside an elif block, which can miss metrics if a provider includes both choices and usage in a single streaming chunk; checking for usage independently would be more robust across backends.

Comment on lines +363 to +364

```python
if (pt := usage.get("prompt_tokens")) is not None:
    output.prompt_len = pt
```
Contributor

Severity: high

Similar to the completions endpoint, the elif usage := data.get("usage"): block might be skipped if a provider sends both choices and usage in the same streaming chunk. To ensure that prompt_len and output_tokens are always captured when provided by the server, it is safer to use a separate if statement for the usage data rather than an elif tied to the presence of choices.
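A minimal sketch of the reviewer's suggestion (not the merged code; the function and dict field names are illustrative): checking `usage` with its own `if` rather than an `elif` means a chunk carrying both `choices` and `usage` still updates every field.

```python
def handle_chunk(data: dict, output: dict) -> dict:
    """Process one parsed streaming chunk; `output` stands in for the
    benchmark's per-request result object (hypothetical simplification)."""
    if choices := data.get("choices"):
        delta = choices[0].get("delta", {})
        if text := delta.get("content"):
            output["generated_text"] += text
    # Independent check: not skipped when `choices` is also present.
    if usage := data.get("usage"):
        if (pt := usage.get("prompt_tokens")) is not None:
            output["prompt_len"] = pt
        if (ct := usage.get("completion_tokens")) is not None:
            output["output_tokens"] = ct
    return output
```

With an `elif` in place of the second `if`, a provider that attaches usage to a content-bearing chunk would have its token counts silently dropped.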

Contributor Author

This is a pre-existing issue, not related to this PR.

@shen-shanshan
Contributor

LGTM. The 1606 total = text input tokens + chat template tokens + ViT (image) input tokens.
