Fix to adapt VLM multi-image benchmarking with sglang.bench #5682
+56
−13
Overview:
Benchmark CLI based on sglang.bench, to align with sglang's PR:

```
python -m sglang.bench_serving --model /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct --num-prompts 1 --dataset-name image --random-input-len 128 --random-output-len 256 --image-count 8 --image-resolution 1080p --host localhost --port 8000 --backend vllm-chat --request-rate 1
```

Made some modifications to make this work; will open a PR. I fixed qps = 1 and only tuned the total num_prompts to be 1 or 10.
It seems like `--multimodal-encode-prefill-worker` only works with Llama4 but not the Qwen3-VL series due to vLLM limitations, so I put the E and P processes on the same GPU (id 0) to demonstrate EP/D, and limited the P worker's memory allocation to 0.4 to avoid OOM.

Details:

Successful benchmark
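The 0.4 memory cap on the P worker can be expressed through vLLM's `--gpu-memory-utilization` flag. A minimal sketch of co-locating both workers on GPU 0; the ports and the exact disaggregation flags are illustrative assumptions, not taken from this PR:

```shell
# Sketch: run the encode (E) and prefill (P) workers on the same GPU (id 0).
# The P worker is capped at 40% of GPU memory to leave room for the E worker
# and avoid OOM. Ports are hypothetical; real EP/D wiring flags are omitted.
CUDA_VISIBLE_DEVICES=0 vllm serve /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct \
    --port 8100 \
    --gpu-memory-utilization 0.4
```

With qps fixed at 1, only `--num-prompts` (1 or 10) needs to vary between runs, as described above.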

Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)