[Bugfix] Fix vllm bench serve to count multimodal tokens in "total input tokens" #38654
mgehre-amd wants to merge 1 commit into vllm-project:main
Conversation
When benchmarking multimodal models, `vllm bench serve` reports `total_input_tokens` based on the client-side text-only prompt length, excluding image/encoder tokens. The server already reports the correct count (text + image tokens) via `usage.prompt_tokens` in the streaming response, but it was not captured.

Capture `prompt_tokens` from the streaming usage chunk and use it for input token metrics so `total_input_tokens` reflects the actual prefill size.

Before (Qwen2.5-VL-7B-Instruct, 512 text tokens + 1024x800 image):
Total input tokens: 512

After:
Total input tokens: 1606

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
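The described change can be sketched as follows. This is a minimal illustration, not the exact vLLM benchmark code: `RequestFuncOutput` here is a hypothetical stand-in for the benchmark's per-request record, and `process_chunk` assumes OpenAI-style server-sent-event lines of the form `data: {...}`.

```python
import json
from dataclasses import dataclass


@dataclass
class RequestFuncOutput:
    # Hypothetical mirror of the benchmark's per-request output record.
    prompt_len: int = 0
    output_tokens: int = 0


def process_chunk(line: str, output: RequestFuncOutput) -> None:
    """Parse one SSE line from an OpenAI-style streaming response and
    record server-reported token counts when a usage payload appears."""
    if not line.startswith("data: ") or line == "data: [DONE]":
        return
    data = json.loads(line[len("data: "):])
    if usage := data.get("usage"):
        # The server-side count includes text + image/encoder tokens,
        # so it supersedes the client-side text-only estimate.
        if (pt := usage.get("prompt_tokens")) is not None:
            output.prompt_len = pt
        if (ct := usage.get("completion_tokens")) is not None:
            output.output_tokens = ct


# Example: the final streaming chunk carries the usage payload.
out = RequestFuncOutput(prompt_len=512)  # client-side text-only estimate
chunk = ('data: {"choices": [], '
         '"usage": {"prompt_tokens": 1606, "completion_tokens": 128}}')
process_chunk(chunk, out)
print(out.prompt_len)  # server-reported count replaces the estimate
```

With a multimodal request, the server-reported `prompt_tokens` (1606 in the example above) replaces the text-only client estimate (512), matching the before/after numbers in the test plan.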
Code Review
This pull request updates the OpenAI benchmark request functions to capture `prompt_tokens` from the server's usage response and uses this value for metric calculations in `serve.py`. Feedback notes that the current implementation handles usage data in an `elif` block, which can miss metrics if a provider includes both `choices` and `usage` in a single streaming chunk; checking for `usage` independently is recommended for robustness across backends.
```python
if (pt := usage.get("prompt_tokens")) is not None:
    output.prompt_len = pt
```
Similar to the completions endpoint, the `elif usage := data.get("usage"):` block might be skipped if a provider sends both `choices` and `usage` in the same streaming chunk. To ensure that `prompt_len` and `output_tokens` are always captured when provided by the server, it is safer to use a separate `if` statement for the usage data rather than an `elif` tied to the presence of `choices`.
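The reviewer's suggestion can be sketched like this. It is a simplified stand-in for the benchmark's chunk handler (the `Output` record and field names here are hypothetical), showing the independent `if` so a chunk carrying both `choices` and `usage` updates both metrics:

```python
from dataclasses import dataclass


@dataclass
class Output:
    # Hypothetical stand-in for the benchmark's output record.
    generated_text: str = ""
    prompt_len: int = 0


def handle_data(data: dict, output: Output) -> None:
    """Process one decoded streaming chunk. `usage` is checked with its
    own `if`, not an `elif`, so it is never skipped when `choices` is
    present in the same chunk."""
    if choices := data.get("choices"):
        if content := choices[0].get("delta", {}).get("content"):
            output.generated_text += content
    # Independent check: still runs even when `choices` was handled above.
    if usage := data.get("usage"):
        if (pt := usage.get("prompt_tokens")) is not None:
            output.prompt_len = pt


out = Output()
handle_data({"choices": [{"delta": {"content": "hi"}}],
             "usage": {"prompt_tokens": 1606}}, out)
# Both the text delta and the usage payload are captured from one chunk.
```

With the original `elif`, the `usage` payload in a chunk like the one above would be ignored because `choices` is non-empty.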
This is a pre-existing issue, not related to this PR
LGTM. This 1606 = text inputs + chat template + ViT inputs.