I need help with my Intel mini PC #898
I have a mini PC with an Intel Core Ultra 5 125H CPU and 96 GB of DDR5 RAM, so I can load big MoE models like GLM-4.5-Air on it. I run local services on it 24/7, and I like keeping a local AI model loaded so I have access to it all the time with low power consumption. I get about 4 t/s with GLM-4.5-Air in Q4_K_S quant using KoboldCpp, which is basically a fork of llama.cpp. I can live with that token generation speed, but prompt processing is also only about 4-5 t/s, and that is abysmally slow. I feel like I should be getting a lot more than that with the 125H and DDR5 RAM, so I don't know what I am doing wrong. I have some questions.
For CPU-only inference my expectation is that you should get much better PP performance with ik_llama.cpp than with mainline. TG is mostly memory bound, so any performance gains there will be small (and somewhat dependent on the quantization type).

I can run GLM-4.5-Air on my Ryzen-5975WX CPU and get about 300 t/s PP. If what I find on cpubenchmark.net is representative of LLM inference performance, my CPU is about 3.75X faster than yours, so from that I would estimate about 80 t/s PP for you. However, if a significant portion of the cpubenchmark.net multi-threaded score comes from the efficiency cores, you may get significantly less, perhaps more in the 20-40 t/s range (see also the thread-count sketch further down).

It is easiest to start with the same model(s) you have been using. For a MoE model such as GLM-4.5-Air there are a few main parameters that may influence performance; an example invocation is sketched below.
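A minimal sketch of what such a run could look like, assuming an ik_llama.cpp build; the model path, thread count, and context size are placeholders, and -rtr / -fmoe are ik_llama.cpp-specific options that mainline llama.cpp and KoboldCpp do not have:

```bash
# Hypothetical invocation; adjust paths and values to your machine.
# -rtr repacks the tensors to a faster layout at load time,
# -fmoe enables the fused MoE path; both are ik_llama.cpp options.
# -t is worth experimenting with on a hybrid P/E-core CPU.
./llama-server -m GLM-4.5-Air-Q4_K_S.gguf -rtr -fmoe -t 8 -c 16384
```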
There is one more argument you may want to try (see the llama-bench sketch below). The other command-line options are set by default to the best possible values, so it is unlikely you will gain performance by changing them.
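To probe the efficiency-core question and to see whether the batching parameters make a difference, a hedged llama-bench sweep along these lines could help (placeholder model path; on a 125H, 4 threads roughly means P-cores only, 12 means P+E cores, 14 means all cores):

```bash
# -p 512 times prompt processing, -n 64 times token generation.
# The comma-separated lists run every combination of thread count
# and micro-batch size, so the best setting shows up in one table.
./llama-bench -m GLM-4.5-Air-Q4_K_S.gguf -p 512 -n 64 -t 4,12,14 -ub 128,512
```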
Hi, I've been experimenting with both ik_llama and mainline llama.cpp for a while. Since you mentioned PP speed @ikawrakow, I'd like to share my test results on a dual Xeon E5-2696 v3 system (not fancy, but each CPU has 18 cores, so 36 physical cores and 72 threads in total).

1. llama.cpp result, build: 3cfa9c3f1 (6840)
   system_info: n_threads = 72 (n_threads_batch = 72) / 72 | CPU: SSE3=1 | SSSE3=1 | AVX=1 | AVX2=1 | F16C=1 | FMA=1 | BMI2=1 | LLAMAFILE=1 | OPENMP=1 | REPACK=1
2. ik_llama with BLAS, build: 575e2c2 (3958)
   system_info: n_threads = 72 / 72 | AVX=1 | AVX2=1 | AVX512=0 | FMA=1 | F16C=1 | BLAS=1 | SSE3=1 | SSSE3=1 | LLAMAFILE=1
3. ik_llama without BLAS, build: 575e2c2 (3958)
   system_info: n_threads = 72 / 72 | AVX=1 | AVX2=1 | AVX512=0 | FMA=1 | F16C=1 | BLAS=0 | SSE3=1 | SSSE3=1 | LLAMAFILE=1

My questions:

Why does the model size differ between the llama.cpp and ik_llama builds (603.87 MiB vs 761.24 MiB)? Could this be due to repack compression?

From my perspective, for embedding models, PP speed seems more critical than TG speed. Does that sound right?
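As an aside, on a dual-socket board the NUMA placement can influence results noticeably; a hedged sketch of how one might check (the model path is a placeholder, and --numa distribute is a mainline llama.cpp option that may not exist in every ik_llama build):

```bash
# PP-only runs (-n 0 skips token generation) at one and two sockets'
# worth of threads, spreading work across both NUMA nodes.
./llama-bench -m embedding-model.gguf -p 512 -n 0 -t 36,72 --numa distribute
```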
Repack does not compress. Most likely the model does not have a dedicated output tensor but uses the token embedding tensor for that, and the two builds handle that tensor differently, which would explain the different reported model sizes.
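One way to check whether the GGUF indeed lacks a dedicated output tensor is to dump its tensor list; a sketch assuming the gguf_dump.py script that ships with llama.cpp's gguf-py package (the script location varies between checkouts, and the model path is a placeholder):

```bash
# List tensor names: if no "output.weight" shows up, the model reuses
# "token_embd.weight" for the output projection.
python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py your-model.gguf | grep -E 'output\.weight|token_embd'
```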
Same reason as above
But apart from this: if batches of 512 tokens represent your use case, then yes, for an embedding model PP speed is what matters.
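And to measure PP at batch sizes that actually match an embedding workload, something like the following sweep could be used (placeholder model path; standard llama-bench options):

```bash
# PP-only (-n 0) at a 512-token prompt, across two batch sizes and
# two micro-batch sizes, to see which combination fits the workload.
./llama-bench -m embedding-model.gguf -p 512 -n 0 -b 512,2048 -ub 128,512
```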