DeepSeek-R1 -- 12 TPS with RTX 3090 #959
Replies: 6 comments 3 replies
-
Alternatively, should I wait for tensor parallelism support in ktransformers (like in vLLM), so I could stack 4 x RTX 3090 in one workstation?
-
#!/usr/bin/env bash
source /opt/ktransformers/.ktransformers/bin/activate
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="8.6"
export ENABLE_TAGS_GENERATION=False
export ENABLE_AUTOCOMPLETE_GENERATION=False
export TITLE_GENERATION_PROMPT_TEMPLATE=""
# --optimize_config_path /opt/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
# --cache_q4 false \
ktransformers \
--gguf_path /opt/unsloth/r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
--model_path deepseek-ai/DeepSeek-R1 \
--model_name unsloth/r1-1776-GGUF \
    --cpu_infer $(awk '/^cpu cores/ {print $4; exit}' /proc/cpuinfo) \
--max_new_tokens 32768 \
--cache_lens 32768 \
--total_context 32768 \
--cache_8bit True \
--cache_q4 False \
--temperature 0.6 \
--top_p 0.95 \
--optimize_config_path /opt/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
--force_think \
--use_cuda_graph \
--host 127.0.0.1 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG
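As an aside, the --cpu_infer value can be derived from /proc/cpuinfo. The sketch below is a standalone version of that computation (note the original pipeline's "| bc" step only subtracted 0, a no-op), reading the physical core count per socket rather than the logical CPU count:

#!/usr/bin/env bash
# Read physical cores per socket from /proc/cpuinfo. Whitespace-split,
# a "cpu cores : 64" line has the count in field 4; `exit` stops awk at
# the first match so hyperthreaded siblings are not double-counted.
cores=$(awk '/^cpu cores/ {print $4; exit}' /proc/cpuinfo)
echo "physical cores per socket: ${cores}"

On a hyperthreaded machine this deliberately reports half of what nproc would, since the extra logical CPUs do not add memory bandwidth for inference.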
-
It looks like memory bandwidth is the bottleneck. You should already be on 8-channel memory (I assume), but since the Threadripper Pro 3995WX tops out at DDR4-3200, a meaningful gain would require a CPU upgrade. The project team plans to release AMX-accelerated generate-phase support in the near future, so you may want to look into an AMX-capable Intel CPU at your discretion.
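To see why bandwidth dominates, here is a back-of-envelope sketch of the decode-speed ceiling. The channel count, active-parameter count, and bits-per-weight figures are assumptions (8 channels for the 3995WX, ~37B active parameters for DeepSeek-R1's MoE, ~2.7 bits/weight for a Q2_K-class quant), not measurements:

#!/usr/bin/env bash
# Decode speed is bounded by: tokens/s <= memory bandwidth / bytes read per token.
channels=8      # assumed: Threadripper Pro 3995WX has 8 memory channels
mts=2933        # DDR4-2933, as in the original post
bw=$(echo "scale=1; $channels * $mts * 8 / 1000" | bc)  # GB/s (8 bytes per transfer)
act=37          # assumed: ~37B active parameters per token (MoE)
bpw=2.7         # assumed: average bits per weight for a Q2_K-class quant
gbt=$(echo "scale=2; $act * $bpw / 8" | bc)             # GB touched per token
echo "bandwidth:   $bw GB/s"
echo "per token:   $gbt GB"
echo "upper bound: $(echo "scale=1; $bw / $gbt" | bc) tokens/s"

With these assumptions the bound works out to roughly 15 tokens/s, the same ballpark as the ~11.7 TPS reported, which is why a faster GPU alone barely moves the number.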
-
Upgrading the GPU will not help much, since the bottleneck is memory bandwidth. If money is not an issue, build a new system with:
This should give you over 20 t/s for Q4 or 15 t/s for Q8. It will cost a lot of money, though.
-
I'm running with the hardware environment listed below:
Because of thermal limits, CPU load is capped at ~50% at 1.2 GHz (the default is 2.3 GHz at 100% load). Under these constraints, my Q4 run peaks at 4 tk/s (with mostly cache hits) and averages 3 tk/s.
-
Hello! I have a Threadripper Pro 3995WX, 256 GB of DDR4-2933, and an RTX 3090 FE.
I am using unsloth/r1-1776-GGUF/UD-Q2_K_XL and getting about 11.7 tokens per second with a 32k context window.
I am looking for ways to upgrade the hardware for better TPS. What can you recommend?
I have noticed that FP8 optimizations are available starting from the RTX 3090 (https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/fp8_kernel.md).
What would be better to upgrade in my case: a dual-CPU motherboard, or would I be better off just upgrading the GPU?
Please advise.