Deepseek-R1 -- 12 TPS with RTX 3090 #959
Replies: 9 comments 8 replies
-
Alternatively, should I wait for tensor parallelism support in ktransformers (like vLLM has), so I could stack 4x RTX 3090 into a single workstation?
-
```bash
#!/usr/bin/env bash
# Launch ktransformers serving unsloth/r1-1776 (DeepSeek-R1) from a Python venv.
source /opt/ktransformers/.ktransformers/bin/activate

# CUDA toolchain paths; arch 8.6 = Ampere (RTX 3090).
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="8.6"

# Disable auxiliary background generation (Open WebUI-style toggles).
export ENABLE_TAGS_GENERATION=False
export ENABLE_AUTOCOMPLETE_GENERATION=False
export TITLE_GENERATION_PROMPT_TEMPLATE=""

# --optimize_config_path /opt/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
# --cache_q4 false \
ktransformers \
  --gguf_path /opt/unsloth/r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
  --model_path deepseek-ai/DeepSeek-R1 \
  --model_name unsloth/r1-1776-GGUF \
  --cpu_infer $(grep -m1 '^cpu cores' /proc/cpuinfo | awk '{print $4}') \
  --max_new_tokens 32768 \
  --cache_lens 32768 \
  --total_context 32768 \
  --cache_8bit True \
  --cache_q4 False \
  --temperature 0.6 \
  --top_p 0.95 \
  --optimize_config_path /opt/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
  --force_think \
  --use_cuda_graph \
  --host 127.0.0.1 \
  --port 8080 \
  --fast_safetensors True \
  --log_level DEBUG
```
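Once the server is up, a quick smoke test confirms it answers and lets you eyeball generation speed from the timing. This is a sketch assuming ktransformers' OpenAI-compatible chat endpoint on the host/port configured above; the `model` field mirrors the `--model_name` flag.

```bash
# Assumes the OpenAI-compatible /v1/chat/completions route on the host/port above.
time curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "unsloth/r1-1776-GGUF",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 64
      }'
```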
-
It looks like memory bandwidth is the bottleneck. You should already be using 8-channel memory (I guess...), but since the Threadripper Pro 3995WX tops out at DDR4-3200, it seems you'll need to upgrade the CPU platform. The project team will be releasing AMX-accelerated support for the generate phase in the near future, so you may want to look into an AMX-enabled Intel CPU, at your discretion.
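If you do go looking at AMX-capable Intel CPUs, the Linux kernel exposes the relevant feature flags, so you can verify support before committing to a build. A minimal check (no output means no AMX):

```bash
# Sapphire Rapids and newer report amx_tile / amx_bf16 / amx_int8 in cpuinfo.
grep -o 'amx_[a-z0-9]*' /proc/cpuinfo | sort -u
```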
-
Upgrading the GPU will not help much, as the bottleneck is memory bandwidth. If money is not an issue, build a new system with more and faster memory channels.
This should give you over 20 t/s for Q4 or 15 t/s for Q8. It will cost a lot of money, though.
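As a rough sanity check on those numbers: decode speed on a MoE model is approximately memory bandwidth divided by bytes of weights read per token. A back-of-envelope sketch, assuming ~37B active parameters for DeepSeek-R1 and ~2.7 bits/weight for a Q2-class quant (both figures are approximations, not measurements):

```bash
# Roofline estimate: tokens/s ~= bandwidth / bytes moved per generated token.
awk 'BEGIN {
  bw_gbs     = 3200e6 * 8 * 8 / 1e9;   # DDR4-3200, 8 bytes/channel, 8 channels ~= 204.8 GB/s
  gb_per_tok = 37e9 * 2.7 / 8 / 1e9;   # ~12.5 GB of weights touched per token
  printf "theoretical ceiling: %.1f tok/s\n", bw_gbs / gb_per_tok;
}'
```

The ~16 tok/s ceiling this prints lines up with the ~11.7 t/s observed on the DDR4-2933 Threadripper, which is why more or faster memory channels move the needle far more than a GPU swap would.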
-
I'm running with the hardware environment listed below:
Because of thermal limits, CPU load is capped at ~50% at 1.2 GHz (the default is 2.3 GHz at 100% load). Under these constraints, my Q4 runs at a maximum of 4 tk/s (with mostly cache hits), 3 tk/s on average.
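For anyone comparing against this, it's easy to check whether your own cores are being throttled the same way; run this while generation is active:

```bash
# Average live core clock from cpuinfo; compare against the CPU's rated frequency.
grep 'cpu MHz' /proc/cpuinfo | awk '{s+=$4; n++} END {printf "avg %.0f MHz over %d cores\n", s/n, n}'
```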
-
Watch this build: it uses an Intel engineering-sample QYFS processor, $140 on eBay (56 cores / 112 threads), and supports DDR5 RAM.
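With any of these DDR5 builds, it's worth confirming the memory actually trains at the advertised speed on every channel. One way is via dmidecode (needs root, and the field name varies slightly across versions):

```bash
# Count populated DIMMs per configured speed; every channel should show the same value.
sudo dmidecode -t memory \
  | grep -E 'Configured (Memory|Clock) Speed: [0-9]' \
  | sort | uniq -c
```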
-
FP8 optimizations are not available on the 3090; native FP8 processing starts with the 40 series. I have 2x 3090, 2x 4090, and 2x 5090. I just upgraded to the 5090s and am trying to sell my 4090s. If you are in MN we can connect, and I'd be happy to help you upgrade.
-
Hello! I have a Threadripper Pro 3995WX, 256 GB of DDR4 at 2933 MT/s, and an RTX 3090 FE.
I am using unsloth/r1-1776-GGUF/UD-Q2_K_XL and getting about 11.7 tokens per second with a 32k context window.
I am looking for ways to upgrade the hardware to get better TPS. What can be recommended?
I have noticed that FP8 optimizations are available starting from the RTX 3090 (https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/fp8_kernel.md).
What would be the better upgrade in my case: some crazy dual-CPU motherboard, or just a newer GPU?
Please advise.
FP8 is natively supported only from the RTX 40 series up. The RTX 3090 DOES NOT support FP8.
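This is easy to verify locally: native FP8 (E4M3/E5M2) tensor cores require compute capability 8.9 (Ada) or newer, and recent nvidia-smi builds can report the capability directly:

```bash
# Ampere (3090) reports 8.6 -> no native FP8; Ada (4090) is 8.9, Blackwell (5090) is 12.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```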