
For CPU-only inference my expectation is that you should get much better PP performance with ik_llama.cpp than with mainline. TG is mostly memory bound, so any performance gains there will be small (and somewhat dependent on the quantization type).

I can run GLM-4.5-Air on my Ryzen-5975WX CPU and get about 300 t/s PP. If what I find on cpubenchmark.net is representative of LLM inference performance, my CPU is about 3.75X faster than yours, so from that I would estimate about 80 t/s PP. Although, if a significant portion of the cpubenchmark.net multi-threaded score comes from the efficiency cores, then perhaps you will get significantly less, more in the 20-40 t/s range (see also the…
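The back-of-the-envelope estimate above can be sketched as a quick calculation. The numbers are the ones quoted in this thread (a ~300 t/s PP reference and a ~3.75X cpubenchmark.net score ratio); treating the benchmark ratio as a proxy for inference speed is itself an assumption, as the comment notes:

```python
# Rough PP throughput estimate: scale a reference CPU's measured
# prompt-processing speed by a CPU benchmark score ratio. The ratio is
# taken from cpubenchmark.net and only loosely tracks LLM inference.

def estimate_pp_tps(reference_tps: float, score_ratio: float) -> float:
    """Scale a reference prompt-processing speed (t/s) down by a benchmark ratio."""
    return reference_tps / score_ratio

# Values quoted in the thread: ~300 t/s PP on the Ryzen-5975WX,
# which is reportedly ~3.75X faster than the asker's CPU.
estimate = estimate_pp_tps(reference_tps=300.0, score_ratio=3.75)
print(f"Estimated PP: {estimate:.0f} t/s")  # 300 / 3.75 = 80 t/s
```

If efficiency cores inflate the multi-threaded score, the effective ratio for inference is larger, which is why the comment also floats a 20-40 t/s range.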

Replies: 3 comments 18 replies

13 replies (from @ikawrakow and @lannashelton)
Answer selected by ikawrakow
0 replies
5 replies (from @ikawrakow and @Kotali-2019)
Category: Q&A
Labels: none yet
3 participants