Where is the current speed and performance bottleneck? #1142
Unanswered
JennieGao-njust
asked this question in
Q&A
Replies: 1 comment
-
Assuming the software side stays unchanged: note also that RAM and VRAM capacity determine which model variants you can load; the larger the model, the more RAM and VRAM it needs (solutions other than KT consume VRAM exclusively). For multi-GPU setups, please refer to the following tutorial:
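As a rough illustration of the sizing point above, here is a hypothetical back-of-the-envelope helper (not part of KTransformers; the ~2.7 bits-per-weight figure for UD-Q2_K_XL is an assumed approximation):

```python
def model_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only footprint in GB: giga-params * bits / 8 bytes.
    Ignores KV cache, activations, and runtime overhead."""
    return params_billion * bits_per_weight / 8

# DeepSeek-V3 has ~671B parameters; UD-Q2_K_XL averages very roughly
# 2.7 bits per weight (assumed figure), so the weights alone need about:
print(f"{model_weight_gb(671, 2.7):.0f} GB")  # prints "226 GB"
```

With KTransformers offloading expert weights to RAM, that footprint is split between system memory and VRAM; with VRAM-only backends the whole amount must fit on the GPUs.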
0 replies
-
Machine configuration: 500 GB RAM, 500 GB disk; model: UD-Q2_K_XL
CPU:
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       52 bits physical, 57 bits virtual
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  2
Core(s) per socket:  32
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Gold 6462C
Stepping:            8
```
Kernel version:
```
Linux Richco12 5.4.0-204-generic #224 SMP Thu Dec 5 13:38:28 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```
Memory usage: (screenshot)
GPU: L20; utilization: (screenshot)
Startup command:
```shell
python ktransformers/server/main.py \
    --port 10002 \
    --model_path /mnt/work/models/deepseek-ai/DeepSeek-V3-0324-GGUF/DeepSeek-V3 \
    --gguf_path /mnt/work/models/deepseek-ai/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/ \
    --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
    --max_new_tokens 1024 \
    --cache_lens 131072 \
    --chunk_size 128 \
    --max_batch_size 4 \
    --backend_type balance_serve \
    --cpu_infer 32
```
Single-request decode: 11 tokens/s. With two concurrent requests:
Request 1: Decode Speed = 6.07 tokens/s
Request 0: Decode Speed = 7.97 tokens/s
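For context, per-request decode speed drops under concurrency, but aggregate throughput still rises; a quick check of the numbers reported above:

```python
single = 11.0               # tokens/s with one request
concurrent = [6.07, 7.97]   # tokens/s per request with two concurrent requests

aggregate = sum(concurrent)
print(f"aggregate: {aggregate:.2f} tokens/s")   # prints "aggregate: 14.04 tokens/s"
print(f"vs single: {aggregate / single:.2f}x")  # prints "vs single: 1.28x"
```

So batching two requests yields roughly 1.28x the total throughput of a single request, well short of the 2x an unconstrained system would give, which points to a shared bottleneck (CPU-side expert compute or memory bandwidth) rather than GPU capacity.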
With the configuration above, changing parameters in the startup command has no noticeable effect on speed. Is there a multi-GPU setup that could further improve concurrent throughput? GPU memory utilization is currently around 35%, and over 100 GB of RAM is free.