I need help with my Intel mini PC #898
I have a mini PC with an Intel Core Ultra 5 125H CPU and 96 GB of DDR5 RAM, so I can load big MoE models like GLM-4.5-Air on it. I run local services on it 24/7, and I like keeping a local AI model loaded so I have access to it all the time with low power consumption. I get about 4 t/s with GLM-4.5-Air in Q4_K_S quant using KoboldCpp, which is basically a fork of llama.cpp. I can live with that token generation speed, but prompt processing is also only about 4-5 t/s, and that is abysmally slow. I feel like I should be getting a lot more than that with the 125H and DDR5 RAM, so I don't know what I am doing wrong. I have some questions.
For CPU-only inference my expectation is that you should get much better PP performance with ik_llama.cpp than with mainline. TG is mostly memory bound, so any performance gains there will be small (and somewhat dependent on the quantization type).

I can run GLM-4.5-Air on my Ryzen-5975WX CPU and get about 300 t/s PP. If what I find on cpubenchmark.net is representative of LLM inference performance, my CPU is about 3.75X faster than yours, so from that I would estimate about 80 t/s PP for you. However, if a significant portion of the cpubenchmark.net multi-threaded score comes from the efficiency cores, you may get significantly less, perhaps more in the 20-40 t/s range (see also the thread-count sketch further down).

It is easiest to start with the same model(s) you have been using. For a MoE model such as GLM-4.5-Air there are a few main parameters that may influence performance; an example invocation is sketched below.
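A minimal sketch of what such a run could look like, assuming an ik_llama.cpp build; the model path, thread count, and context size are placeholders, and -rtr / -fmoe are ik_llama.cpp-specific options that mainline llama.cpp and KoboldCpp do not have:

```bash
# Hypothetical invocation; adjust paths and values to your machine.
# -rtr repacks the tensors to a faster layout at load time,
# -fmoe enables the fused MoE path; both are ik_llama.cpp options.
# -t is worth experimenting with on a hybrid P/E-core CPU.
./llama-server -m GLM-4.5-Air-Q4_K_S.gguf -rtr -fmoe -t 8 -c 16384
```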
There is one more argument you may want to try (see the llama-bench sketch below). The other command-line options are set by default to the best possible values, so it is unlikely you will gain performance by changing them.
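To probe the efficiency-core question and to see whether the batching parameters make a difference, a hedged llama-bench sweep along these lines could help (placeholder model path; on a 125H, 4 threads roughly means P-cores only, 12 means P+E cores, 14 means all cores):

```bash
# -p 512 times prompt processing, -n 64 times token generation.
# The comma-separated lists run every combination of thread count
# and micro-batch size, so the best setting shows up in one table.
./llama-bench -m GLM-4.5-Air-Q4_K_S.gguf -p 512 -n 64 -t 4,12,14 -ub 128,512
```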
Hi, I've been experimenting with both ik_llama and mainline llama.cpp for a while. Since you mentioned PP speed @ikawrakow, I'd like to share my test results on a dual Xeon E5-2696 v3 system (not fancy, but each CPU has 18 cores, so 36 physical cores and 72 threads in total).

1. llama.cpp result, build: 3cfa9c3f1 (6840)
   system_info: n_threads = 72 (n_threads_batch = 72) / 72 | CPU: SSE3=1 | SSSE3=1 | AVX=1 | AVX2=1 | F16C=1 | FMA=1 | BMI2=1 | LLAMAFILE=1 | OPENMP=1 | REPACK=1
2. ik_llama with BLAS, build: 575e2c2 (3958)
   system_info: n_threads = 72 / 72 | AVX=1 | AVX2=1 | AVX512=0 | FMA=1 | F16C=1 | BLAS=1 | SSE3=1 | SSSE3=1 | LLAMAFILE=1
3. ik_llama without BLAS, build: 575e2c2 (3958)
   system_info: n_threads = 72 / 72 | AVX=1 | AVX2=1 | AVX512=0 | FMA=1 | F16C=1 | BLAS=0 | SSE3=1 | SSSE3=1 | LLAMAFILE=1

My questions:

Why does the model size differ between the llama.cpp and ik_llama builds (603.87 MiB vs 761.24 MiB)? Could this be due to repack compression?

From my perspective, for embedding models, PP speed seems more critical than TG speed. Does that sound right?
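As an aside, on a dual-socket board the NUMA placement can influence results noticeably; a hedged sketch of how one might check (the model path is a placeholder, and --numa distribute is a mainline llama.cpp option that may not exist in every ik_llama build):

```bash
# PP-only runs (-n 0 skips token generation) at one and two sockets'
# worth of threads, spreading work across both NUMA nodes.
./llama-bench -m embedding-model.gguf -p 512 -n 0 -t 36,72 --numa distribute
```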
Repack does not compress. Most likely the model does not have a dedicated output tensor but uses the token embedding tensor for that, and the two builds handle that tensor differently, which would explain the different reported model sizes.
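One way to check whether the GGUF indeed lacks a dedicated output tensor is to dump its tensor list; a sketch assuming the gguf_dump.py script that ships with llama.cpp's gguf-py package (the script location varies between checkouts, and the model path is a placeholder):

```bash
# List tensor names: if no "output.weight" shows up, the model reuses
# "token_embd.weight" for the output projection.
python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py your-model.gguf | grep -E 'output\.weight|token_embd'
```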
Same reason as above
But apart from this: if batches of 512 tokens represent your use case, then yes, for an embedding model PP speed is what matters.
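And to measure PP at batch sizes that actually match an embedding workload, something like the following sweep could be used (placeholder model path; standard llama-bench options):

```bash
# PP-only (-n 0) at a 512-token prompt, across two batch sizes and
# two micro-batch sizes, to see which combination fits the workload.
./llama-bench -m embedding-model.gguf -p 512 -n 0 -b 512,2048 -ub 128,512
```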