RPC results between main llamacpp and ik lcpp (GLM 4.6 full GPU, soon DeepSeek V3/R1 or Kimi K2) #1043
Replies: 4 comments · 5 replies
- Why not @Thireus quants? Why not llama-sweep-bench?
- @Panchovix For the OOM issues in …
- Thank you!
- Well, I am using the …
Hello guys, hope you're having a good day.
Just some small tests with RPC as of today, December 6th, 2025, in case it helps as a reference.
My setup is:
Server PC:
Client PC:
For the first test, I used GLM 4.6
Without RPC, using all GPUs on the server but the 3090s:
In lcpp: 1105.13 t/s PP, 27.80 t/s TG
In iklcpp (24576 ctx): 1176.91 t/s PP, 26.12 t/s TG
I had to reduce the ctx on iklcpp as I was getting OOM.
With RPC, replacing one 5090 with the RPC device:
In lcpp: 782.66 t/s PP, 23.88 t/s TG
In iklcpp (24576 ctx): 825.39 t/s PP, 22.5 t/s TG (when specifying devices, use RPC0[192.168.50.2:50052] instead of plain RPC0)
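For reference, this is roughly what the RPC setup looks like with mainline llama.cpp. It is only a sketch: the endpoint matches the RPC0[192.168.50.2:50052] device above, but the build options, model path, and the -ngl/-c values are assumptions, not the exact commands behind these numbers.

```bash
# Client PC (the machine whose GPU is shared over the network):
# build with the RPC backend (plus CUDA for the GPU backend) and start
# rpc-server on the port used above.
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# Server PC (the machine running inference): point llama-server at the
# remote GPU with --rpc; it then shows up as a device named like
# RPC0[192.168.50.2:50052]. Model path, -ngl and -c are placeholders.
./build/bin/llama-server \
  -m /path/to/GLM-4.6.gguf \
  --rpc 192.168.50.2:50052 \
  -ngl 99 -c 24576
```

ik_llama.cpp carries the same RPC backend (as the results above show), though its build option and exact flag spellings may differ slightly from mainline.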
For DeepSeek V3 0324:
RPC without a 5090:
lcpp: 211.25 t/s PP, 10.73 t/s TG
iklcpp: 217.68 t/s PP, 10.63 t/s TG
RPC with 8 GPUs:
lcpp: 216.95 t/s PP, 11.43 t/s TG
iklcpp: 234.02 t/s PP, 11.52 t/s TG
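On how PP/TG figures like these are usually measured: below is a quick llama-bench sketch over the same RPC endpoint (one of the replies also asks about ik_llama.cpp's llama-sweep-bench). The model path and the prompt/generation lengths are placeholders; this is not necessarily how the numbers above were produced.

```bash
# Measure prompt processing (pp) and token generation (tg) throughput,
# with one device reached over RPC. Model path is a placeholder.
./build/bin/llama-bench \
  -m /path/to/DeepSeek-V3-0324-Q3_K_XL.gguf \
  --rpc 192.168.50.2:50052 \
  -ngl 99 -p 512 -n 128
```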
Hope it helps!
EDIT 19/12/25: Added info for DeepSeek V3 0324 Q3_K_XL with RPC.