Anything you want to discuss about vllm on ascend.
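Environment: vllm 0.11.0 on CANN 8.3.RC1, 910B4, Qwen3 dense and MoE models
Problem: compared with MindIE, vllm has a faster TTFT but a slower TPOT. Hoping this can be optimized.
Load test: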
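| Framework | Setup | TTFT | 1000/TPOT |
|---|---|---|---|
| vllm | qwen3-235b-int8, single node with 8 cards, 2k input, concurrency 16 | 1.3 s | 14 tok/s |
| MindIE | qwen3-235b-int8, single node with 8 cards, 2k input, concurrency 16 | 2.9 s | 20 tok/s |
| vllm | qwen3-235b-int8, single node with 8 cards, 8k input, concurrency 16 | 3.5 s | 12 tok/s |
| MindIE | qwen3-235b-int8, single node with 8 cards, 8k input, concurrency 16 | 12 s | 19 tok/s |
| vllm | qwen3-235b-int8, single node with 8 cards, 64k input, concurrency 4 | 22 s | 8 tok/s |
| MindIE | qwen3-235b-int8, single node with 8 cards, 64k input, concurrency 4 | 39 s | 9 tok/s |
| vllm | qwen3-32b, 2 cards, 2k input, concurrency 16 | 1.8 s | 15 tok/s |
| MindIE | qwen3-32b, 2 cards, 2k input, concurrency 16 | 4.7 s | 21 tok/s |
| vllm | qwen3-32b, 2 cards, 8k input, concurrency 16 | 9 s | 9 tok/s |
| MindIE | qwen3-32b, 2 cards, 8k input, concurrency 16 | 34 s | 15 tok/s |
| vllm | qwen3-32b, 2 cards, 64k input, concurrency 2 | 29 s | 7 tok/s |
| MindIE | qwen3-32b, 2 cards, 64k input, concurrency 2 | 97 s | 25 tok/s |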
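Conclusion:
First of all, yes, vllm's TTFT is indeed considerably better. However, the slower TPOT hurts the total generation time and how streaming output feels to the user. Looking forward to further TPOT optimization, thanks!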
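
To make the trade-off concrete, here is a rough sketch (not part of the measurements above) that estimates per-request total time as TTFT plus the remaining tokens at a constant decode rate. The constant-rate model and the function names `total_time` and `break_even_tokens` are assumptions for illustration only; real TPOT varies with context length and batch load.

```python
from typing import Optional


def total_time(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Approximate per-request latency in seconds.

    Assumes the first token arrives at TTFT and every later token
    arrives at a constant rate (1000/TPOT from the table).
    """
    remaining = max(output_tokens - 1, 0)
    return ttft_s + remaining / tokens_per_s


def break_even_tokens(ttft_a: float, rate_a: float,
                      ttft_b: float, rate_b: float) -> Optional[float]:
    """Output length at which frameworks A and B take the same total time.

    Returns None when both decode rates are equal (no crossover).
    """
    inv_diff = 1.0 / rate_a - 1.0 / rate_b
    if inv_diff == 0:
        return None
    return 1.0 + (ttft_b - ttft_a) / inv_diff


if __name__ == "__main__":
    # Values from the qwen3-235b-int8, 2k input, concurrency 16 rows.
    vllm = (1.3, 14.0)     # (TTFT in s, tok/s)
    mindie = (2.9, 20.0)
    n = break_even_tokens(*vllm, *mindie)
    print(f"break-even output length: {n:.1f} tokens")
    for length in (64, 256, 1024):  # hypothetical output lengths
        print(length,
              round(total_time(*vllm, length), 1),
              round(total_time(*mindie, length), 1))
```

Under this simplified model, the 2k-input 235B case crosses over at roughly 76 output tokens: shorter responses finish sooner on vllm, longer ones finish sooner on MindIE.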