[Bugfix] Fix vllm bench serve to count multimodal tokens in "total input tokens" #38654
mgehre-amd wants to merge 1 commit into vllm-project:main
Conversation
When benchmarking multimodal models, `vllm bench serve` reports `total_input_tokens` based on the client-side text-only prompt length, excluding image/encoder tokens. The server already reports the correct count (text + image tokens) via `usage.prompt_tokens` in the streaming response, but it was not captured.

Capture `prompt_tokens` from the streaming usage chunk and use it for input token metrics so `total_input_tokens` reflects the actual prefill size.

Before (Qwen2.5-VL-7B-Instruct, 512 text tokens + 1024x800 image):
Total input tokens: 512

After:
Total input tokens: 1606

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
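The described change can be sketched as follows. This is a minimal illustration, not the exact vLLM benchmark code: `RequestFuncOutput` here is a hypothetical stand-in for the benchmark's per-request record, and `process_chunk` assumes OpenAI-style server-sent-event lines of the form `data: {...}`.

```python
import json
from dataclasses import dataclass


@dataclass
class RequestFuncOutput:
    # Hypothetical mirror of the benchmark's per-request output record.
    prompt_len: int = 0
    output_tokens: int = 0


def process_chunk(line: str, output: RequestFuncOutput) -> None:
    """Parse one SSE line from an OpenAI-style streaming response and
    record server-reported token counts when a usage payload appears."""
    if not line.startswith("data: ") or line == "data: [DONE]":
        return
    data = json.loads(line[len("data: "):])
    if usage := data.get("usage"):
        # The server-side count includes text + image/encoder tokens,
        # so it supersedes the client-side text-only estimate.
        if (pt := usage.get("prompt_tokens")) is not None:
            output.prompt_len = pt
        if (ct := usage.get("completion_tokens")) is not None:
            output.output_tokens = ct


# Example: the final streaming chunk carries the usage payload.
out = RequestFuncOutput(prompt_len=512)  # client-side text-only estimate
chunk = ('data: {"choices": [], '
         '"usage": {"prompt_tokens": 1606, "completion_tokens": 128}}')
process_chunk(chunk, out)
print(out.prompt_len)  # server-reported count replaces the estimate
```

With a multimodal request, the server-reported `prompt_tokens` (1606 in the example above) replaces the text-only client estimate (512), matching the before/after numbers in the test plan.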
Code Review
This pull request updates the OpenAI benchmark request functions to capture `prompt_tokens` from the server's usage response and uses this value for metric calculations in `serve.py`. Feedback notes that the current implementation handles usage data in an `elif` block, which can miss metrics if a provider includes both `choices` and `usage` in a single streaming chunk; checking for `usage` independently is recommended for robustness across backends.
```python
if (pt := usage.get("prompt_tokens")) is not None:
    output.prompt_len = pt
```
Similar to the completions endpoint, the `elif usage := data.get("usage"):` block might be skipped if a provider sends both `choices` and `usage` in the same streaming chunk. To ensure that `prompt_len` and `output_tokens` are always captured when provided by the server, it is safer to use a separate `if` statement for the usage data rather than an `elif` tied to the presence of `choices`.
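The reviewer's suggestion can be sketched like this. It is a simplified stand-in for the benchmark's chunk handler (the `Output` record and field names here are hypothetical), showing the independent `if` so a chunk carrying both `choices` and `usage` updates both metrics:

```python
from dataclasses import dataclass


@dataclass
class Output:
    # Hypothetical stand-in for the benchmark's output record.
    generated_text: str = ""
    prompt_len: int = 0


def handle_data(data: dict, output: Output) -> None:
    """Process one decoded streaming chunk. `usage` is checked with its
    own `if`, not an `elif`, so it is never skipped when `choices` is
    present in the same chunk."""
    if choices := data.get("choices"):
        if content := choices[0].get("delta", {}).get("content"):
            output.generated_text += content
    # Independent check: still runs even when `choices` was handled above.
    if usage := data.get("usage"):
        if (pt := usage.get("prompt_tokens")) is not None:
            output.prompt_len = pt


out = Output()
handle_data({"choices": [{"delta": {"content": "hi"}}],
             "usage": {"prompt_tokens": 1606}}, out)
# Both the text delta and the usage payload are captured from one chunk.
```

With the original `elif`, the `usage` payload in a chunk like the one above would be ignored because `choices` is non-empty.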
This is a pre-existing issue, not related to this PR
LGTM. This 1606 = text inputs + chat template + ViT inputs.