DeepSeek-R1 -- 12 TPS with RTX 3090 #959
Replies: 6 comments 3 replies
-
Alternatively, should I wait for tensor parallelism support in ktransformers (like in vLLM), so I could stack 4 x RTX 3090 in one workstation?
-
#!/usr/bin/env bash
source /opt/ktransformers/.ktransformers/bin/activate
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_PATH=/usr/local/cuda
export TORCH_CUDA_ARCH_LIST="8.6"
export ENABLE_TAGS_GENERATION=False
export ENABLE_AUTOCOMPLETE_GENERATION=False
export TITLE_GENERATION_PROMPT_TEMPLATE=""
# --optimize_config_path /opt/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
# --cache_q4 false \
ktransformers \
--gguf_path /opt/unsloth/r1-1776-GGUF/UD-Q2_K_XL/r1-1776-UD-Q2_K_XL-00001-of-00005.gguf \
--model_path deepseek-ai/DeepSeek-R1 \
--model_name unsloth/r1-1776-GGUF \
    --cpu_infer $(awk '/^cpu cores/ {print $4; exit}' /proc/cpuinfo) \
--max_new_tokens 32768 \
--cache_lens 32768 \
--total_context 32768 \
--cache_8bit True \
--cache_q4 False \
--temperature 0.6 \
--top_p 0.95 \
--optimize_config_path /opt/ktransformers/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml \
--force_think \
--use_cuda_graph \
--host 127.0.0.1 \
--port 8080 \
--fast_safetensors True \
--log_level DEBUG
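As an aside, the --cpu_infer value can be derived from /proc/cpuinfo. The sketch below is a standalone version of that computation (note the original pipeline's "| bc" step only subtracted 0, a no-op), reading the physical core count per socket rather than the logical CPU count:

#!/usr/bin/env bash
# Read physical cores per socket from /proc/cpuinfo. Whitespace-split,
# a "cpu cores : 64" line has the count in field 4; `exit` stops awk at
# the first match so hyperthreaded siblings are not double-counted.
cores=$(awk '/^cpu cores/ {print $4; exit}' /proc/cpuinfo)
echo "physical cores per socket: ${cores}"

On a hyperthreaded machine this deliberately reports half of what nproc would, since the extra logical CPUs do not add memory bandwidth for inference.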
-
It looks like memory bandwidth is the bottleneck. You should already be on 8-channel memory (I assume), but since the Threadripper Pro 3995WX tops out at DDR4-3200, a meaningful gain would require a CPU upgrade. The project team plans to release AMX-accelerated generate-phase support in the near future, so you may want to look into an AMX-capable Intel CPU at your discretion.
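To see why bandwidth dominates, here is a back-of-envelope sketch of the decode-speed ceiling. The channel count, active-parameter count, and bits-per-weight figures are assumptions (8 channels for the 3995WX, ~37B active parameters for DeepSeek-R1's MoE, ~2.7 bits/weight for a Q2_K-class quant), not measurements:

#!/usr/bin/env bash
# Decode speed is bounded by: tokens/s <= memory bandwidth / bytes read per token.
channels=8      # assumed: Threadripper Pro 3995WX has 8 memory channels
mts=2933        # DDR4-2933, as in the original post
bw=$(echo "scale=1; $channels * $mts * 8 / 1000" | bc)  # GB/s (8 bytes per transfer)
act=37          # assumed: ~37B active parameters per token (MoE)
bpw=2.7         # assumed: average bits per weight for a Q2_K-class quant
gbt=$(echo "scale=2; $act * $bpw / 8" | bc)             # GB touched per token
echo "bandwidth:   $bw GB/s"
echo "per token:   $gbt GB"
echo "upper bound: $(echo "scale=1; $bw / $gbt" | bc) tokens/s"

With these assumptions the bound works out to roughly 15 tokens/s, the same ballpark as the ~11.7 TPS reported, which is why a faster GPU alone barely moves the number.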
-
Upgrading the GPU will not help much, since the bottleneck is memory bandwidth. If money is not an issue, build a new system with:
This should give you over 20 t/s for Q4 or 15 t/s for Q8. It will cost a lot of money, though.
-
I'm running with the hardware environment listed below:
Because of thermal limits, CPU load is capped at ~50% at 1.2 GHz (the default is 2.3 GHz at 100% load). Under these constraints, my Q4 run peaks at 4 tk/s (with mostly cache hits) and averages 3 tk/s.
-
Hello! I have a Threadripper Pro 3995WX, 256 GB of DDR4-2933, and an RTX 3090 FE.
I am using unsloth/r1-1776-GGUF/UD-Q2_K_XL and getting about 11.7 tokens per second with a 32k context window.
I am looking for ways to upgrade the hardware for better TPS. What can you recommend?
I have noticed that FP8 optimizations are available starting from the RTX 3090 (https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/fp8_kernel.md).
What would be better to upgrade in my case: a dual-CPU motherboard, or would I be better off just upgrading the GPU?
Please advise.