Interleave 8 rows (Q8_0, IQ4_XS) #178
Conversation
On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches maximum performance at 2 threads and is slightly higher than with 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14.28 t/s @ 4 threads).
It is also faster on AVX2. This is the NEON implementation. It is a tiny bit faster than 4 interleaved rows (~0.5%). So this looks like a winner, given the Zen4/AVX2 improvement without an associated NEON regression.
PP-512 is now 284 t/s compared to 257 t/s for 4 interleaved rows. TG-128 reaches a peak of 8.16 t/s at just 2 threads, compared to 7.95 t/s @ 4 threads before.
PP-512 is slightly better (138 t/s vs 132.5 t/s); TG-128 is about the same.
Very slightly faster than the general-purpose gemm, slightly slower than the D = 128 special-case gemm mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128, as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point in keeping the special-purpose implementation.
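To illustrate the register-pressure argument, here is a hypothetical AVX2 sketch, not the actual ik_llama.cpp kernel; the function name and the exact dot-product sequence are illustrative only:

```cpp
#include <immintrin.h>

// With 8 interleaved rows the inner loop already wants 8 accumulators plus
// 8 vectors of row quants, the activation vector and the scales -- more than
// the 16 ymm registers AVX2 provides.  A D = 128 special case that tries to
// keep a whole row resident on top of that can only spill, so it stops helping.
static inline void accumulate_8_rows(__m256 acc[8], __m256i activations,
                                     const __m256i row_quants[8],
                                     const __m256 combined_scales[8]) {
    for (int r = 0; r < 8; ++r) {
        // int8 dot product of one 32-quant block: |q| times sign-adjusted activations
        __m256i prod = _mm256_maddubs_epi16(
            _mm256_sign_epi8(row_quants[r], row_quants[r]),
            _mm256_sign_epi8(activations, row_quants[r]));
        __m256i isum = _mm256_madd_epi16(prod, _mm256_set1_epi16(1));
        acc[r] = _mm256_fmadd_ps(combined_scales[r], _mm256_cvtepi32_ps(isum), acc[r]);
    }
}
```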
Tested on my Xeon E5-2683 v4 machine via llama-bench.
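For reference, these PP-512/TG-128 numbers come from llama-bench; an invocation along the lines below would reproduce them (the model path and thread count are placeholders, not the exact command from this thread):

```
./llama-bench -m llama-3.1-8b-iq4_xs.gguf -p 512 -n 128 -t 16
```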
If you want me to test on my other machine (dual-socket Xeon E5-2690 v3) or with other models, let me know. Also, any chance you can sync the RPC code (I mostly care about #11047 and, to a lesser degree, #9389 and #11424/#9296)? If not, I'll do it when I have some free time and submit a PR.
Testing the batch performance difference, showing the peak numbers:

IQ4_XS_R8:

IQ4_XS_R4:
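The exact command behind these peak numbers isn't shown in the thread; a batch-size sweep of this kind could, for example, be produced with batched-bench along these lines (model path, context and sizes are placeholders):

```
./llama-batched-bench -m llama-3.1-8b-iq4_xs_r8.gguf -c 4096 -npp 512 -ntg 128 -npl 1,2,4,8,16,32
```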
So, it looks like a small (~2%) improvement. OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is IQ4_XS_R4.)
Yes, it is an improvement (there is an edge case where R4 was better at batch size 4).
Yes, it is okay to merge. That model is an IQ4_K_R4 (and IQ4_K), not IQ4_XS, as I prefer your quants over the mainline ones, which is why I didn't have comparison data against mainline for it. On the note of R1, this PR llama.cpp/pull/11446 will make me reconvert anyway; I want to use it, and it is easy to grab now before the KV refactor it is waiting on to implement the MLA KV cache. I was going to bring that up anyway in the Deepseek PR, because it is a change to the GGUF for Deepseek. #11397 is also showing significant improvements for Deepseek.
What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?
That would be my modified llama-bench from this PR: ggml-org/llama.cpp#11126. For the graph I used:
@fairydreaming Thanks for the clarification. I played a bit with your PR 11466. TG after a long prompt looks great compared to
(I did not have the patience to wait for the 16k-token benchmark to finish).
@ikawrakow Yup, I noticed this. I plan to reorganize the tensor dimensions for prompt processing in the PR; hopefully this will fix the issue. Edit: it helped, but only a bit (the PP rate is 6-8% higher with these changes); it's still slower than the original implementation.
Can't this already be done with batched-bench by setting a batch size of 1? It also has the benefit of showing PP speed.
Can you push that change? For my use cases the TG benefits outweigh the loss in PP; I'll try looking into the performance as well.
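For context, a batched-bench run with a single sequence (batch size 1) that reports both PP speed and TG speed after each prompt length might look like this (placeholder model path and prompt lengths, not the exact command used here):

```
./llama-batched-bench -m model.gguf -c 20480 -npp 1024,4096,16384 -ntg 128 -npl 1
```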
That is correct.
Pushed.
Will do when I come back from FOSDEM.
One can get better performance on AVX2/Zen4 by interleaving 8 instead of 4 rows. I did not do it earlier because in my previous attempts performance on ARM suffered significantly. But in this PR I found an ARM_NEON implementation for 8 interleaved rows for Q8_0 and IQ4_XS that is not slower, or is even slightly faster, than 4 interleaved rows.

Run-time repacking from Q8_0/IQ4_XS will of course work, but models quantized to Q8_0_R4 or IQ4_XS_R4 will stop working, so putting it out there for testing and feedback.

I did not rename the types to _R8 yet, but will in case this gets merged.

Below is a graph showing prompt processing (a.k.a. prefill) performance for LLaMA-3.1-8B quantized with IQ4_XS on a Ryzen-7950X CPU. The cyan symbols are the results with this PR. We now get over 300 t/s for prompts shorter than 1000 tokens.

@saood06 Can you test if this improves IQ4_XS_R4 performance on your system?
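For readers new to the repacked formats, here is a minimal C++ sketch of the interleaving idea. The block layout, type names and repack routine are simplified illustrations, not the actual Q8_0_R4/IQ4_XS_R4 (or future _R8) structures; the real blocks use fp16 scales and, for IQ4_XS, 4-bit packed quants.

```cpp
#include <cstdint>
#include <vector>

constexpr int QK    = 32;   // quants per block (as in Q8_0)
constexpr int NROWS = 8;    // number of interleaved rows (this PR; previously 4)

struct BlockQ8 {            // simplified stand-in for a Q8_0 block
    float  scale;
    int8_t qs[QK];
};

struct BlockQ8x8 {          // simplified stand-in for an 8-row interleaved block
    float  scale[NROWS];    // one scale per interleaved row
    int8_t qs[NROWS * QK];  // quants from 8 rows, stored contiguously
};

// Repack 8 consecutive rows of blocks so that, for every block index, the
// data of all 8 rows sits next to each other.  A GEMM micro-kernel can then
// stream one contiguous chunk and update 8 row accumulators at once, instead
// of gathering from 8 scattered row pointers.
std::vector<BlockQ8x8> repack_8_rows(const std::vector<std::vector<BlockQ8>>& rows) {
    const size_t nblocks = rows[0].size();
    std::vector<BlockQ8x8> out(nblocks);
    for (size_t ib = 0; ib < nblocks; ++ib) {
        for (int r = 0; r < NROWS; ++r) {
            out[ib].scale[r] = rows[r][ib].scale;
            for (int j = 0; j < QK; ++j)
                out[ib].qs[r * QK + j] = rows[r][ib].qs[j];
        }
    }
    return out;
}
```

Run-time repacking performs this kind of transformation when a plain Q8_0/IQ4_XS model is loaded, which is why those models keep working, while files already stored in the 4-row interleaved format no longer match the new 8-row kernels.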