
Conversation

davilu-nvidia (Contributor) commented Jan 27, 2026

Overview:

Benchmark CLI based on sglang.bench_serving, to align with sglang's PR:
python -m sglang.bench_serving --model /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct --num-prompts 1 --dataset-name image --random-input-len 128 --random-output-len 256 --image-count 8 --image-resolution 1080p --host localhost --port 8000 --backend vllm-chat --request-rate 1

Made some modifications to make this work; will initiate a PR. I fixed qps = 1 and only tuned the total num_prompts to be 1 or 10, as in the sweep sketched below.
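
For reference, that fixed-qps sweep can be scripted as a small bash loop. This is a sketch only: every flag is reused from the command at the top, and only --num-prompts varies.

# Sketch: rerun the benchmark above at a fixed request rate of 1 qps,
# varying only the total prompt count (1 and 10).
for n in 1 10; do
  python -m sglang.bench_serving \
    --model /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct \
    --num-prompts "$n" \
    --dataset-name image \
    --random-input-len 128 \
    --random-output-len 256 \
    --image-count 8 \
    --image-resolution 1080p \
    --host localhost --port 8000 \
    --backend vllm-chat \
    --request-rate 1
done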

It seems like --multimodal-encode-prefill-worker only works with Llama 4 but not the Qwen3-VL series due to vLLM limitations, so I put the E and P processes on the same GPU (id 0) to demonstrate EP/D, and limited the P worker's memory allocation to 0.4 to avoid OOM. A rough sketch of that layout follows.
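
The following is a sketch only, not the tested launch commands: the worker entrypoint names (encode_worker, prefill_worker, decode_worker) are hypothetical placeholders for the repo's actual launch scripts, while CUDA_VISIBLE_DEVICES and vLLM's --gpu-memory-utilization flag are standard.

# Sketch: colocate E and P on GPU 0; cap the P worker at 40% of GPU
# memory (matching the 0.4 noted above) to avoid OOM.
CUDA_VISIBLE_DEVICES=0 python -m encode_worker \
    --model /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct &
CUDA_VISIBLE_DEVICES=0 python -m prefill_worker \
    --model /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct \
    --gpu-memory-utilization 0.4 &
# D worker runs on a separate GPU.
CUDA_VISIBLE_DEVICES=1 python -m decode_worker \
    --model /raid/model_hub/Qwen/Qwen2.5-VL-7B-Instruct &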

Details:

Successful benchmark run:
[screenshot of benchmark results]

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx


copy-pr-bot bot commented Jan 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

davilu-nvidia self-assigned this Jan 27, 2026
github-actions bot added the backend::vllm (Relates to the vllm backend) and multimodal labels Jan 27, 2026
davilu-nvidia requested a review from GuanLuo January 27, 2026 16:45
davilu-nvidia force-pushed the fix/support-Qwen-multi-img--EPD-for-sglang.bench branch from e4e68c0 to bf6f2d1 on January 27, 2026 16:49