Interleave 8 rows (Q8_0, IQ4_XS) #178


Merged: 10 commits merged into main on Jan 27, 2025
Conversation

ikawrakow (Owner)

One can get better performance on AVX2/Zen4 by interleaving 8 rows instead of 4. I did not do this earlier because in my previous attempts performance on ARM suffered significantly. But in this PR I found an ARM_NEON implementation for 8 interleaved rows for Q8_0 and IQ4_XS that is no slower than 4 interleaved rows, and in some cases even slightly faster.

Run-time repacking from Q8_0/IQ4_XS will of course still work, but models already quantized to Q8_0_R4 or IQ4_XS_R4 will stop working, so I'm putting this out there for testing and feedback.

I did not rename the types to _R8 yet, but will do so if this gets merged.
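To make the interleaving idea concrete, here is a minimal repacking sketch. The struct name `block_q8_0_r8`, the 4-byte chunking, and the placement of the scales are illustrative assumptions, not the repo's exact layout, which is chosen to match the SIMD load patterns of the GEMM kernels.

```cpp
#include <cstdint>
#include <cstring>

constexpr int QK8_0 = 32;           // quants per Q8_0 block

struct block_q8_0 {                 // standard Q8_0 block: fp16 scale + 32 int8 quants
    uint16_t d;                     // fp16 scale, kept as raw bits for this sketch
    int8_t   qs[QK8_0];
};

struct block_q8_0_r8 {              // hypothetical 8-row interleaved block
    uint16_t d[8];                  // scales of the same block index from 8 rows
    int8_t   qs[8 * QK8_0];         // quants interleaved in 4-byte chunks across rows
};

// Repack 8 rows of Q8_0 so a GEMM kernel can load the same block position of all
// 8 rows with a few contiguous SIMD loads instead of 8 strided ones.
void repack_q8_0_8rows(const block_q8_0 *rows[8], int nblocks, block_q8_0_r8 *out) {
    for (int ib = 0; ib < nblocks; ++ib) {
        block_q8_0_r8 &dst = out[ib];
        for (int r = 0; r < 8; ++r) dst.d[r] = rows[r][ib].d;
        // 4-byte chunk c of row r goes to slot (8*c + r)
        for (int c = 0; c < QK8_0/4; ++c)
            for (int r = 0; r < 8; ++r)
                std::memcpy(dst.qs + 4*(8*c + r), rows[r][ib].qs + 4*c, 4);
    }
}
```

Run-time repacking (the rtr path) is this kind of transformation applied at model load time, which is why plain Q8_0/IQ4_XS files keep working while pre-repacked _R4 files do not.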

Below is a graph showing prompt processing (a.k.a. prefill) performance for LLaMA-3.1-8B quantized with IQ4_XS on a Ryzen-7950X CPU. The cyan symbols are the results with this PR. We now get over 300 t/s for prompts less than 1000 tokens.

[Figure pp512_vs_ctx: PP-512 performance vs. context length for LLaMA-3.1-8B (IQ4_XS) on Ryzen-7950X; cyan = this PR]

@saood06 Can you test if this improves IQ4_XS_R4 performance on your system?

Commit notes:

- On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches maximum performance at 2 threads and is slightly higher than with 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14.28 t/s @ 4 threads). It is also faster on AVX2.
- This is the NEON implementation. It is a tiny bit faster than 4 interleaved rows (~0.5%). So this looks like a winner, given the Zen4/AVX2 improvement without an associated NEON regression.
- PP-512 is now 284 t/s compared to 257 t/s for 4 interleaved rows. TG-128 reaches a peak of 8.16 t/s at just 2 threads, compared to 7.95 t/s @ 4 threads before.
- PP-512 is slightly better (138 t/s vs 132.5 t/s); TG-128 is about the same.
- Very slightly faster than the general-purpose GEMM, slightly slower than the D = 128 special-case GEMM mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128, as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point in keeping the special-purpose implementation.
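As a rough illustration of the register-pressure point in the last note, here is a scalar reference of the 8-row dot product matching the interleaved layout sketched above; the register counts in the comments assume an AVX2-style kernel with 16 ymm registers and are not taken from the repo's actual kernels.

```cpp
#include <cstdint>

// One accumulator per interleaved row. In a vectorized kernel each accumulator
// becomes a vector register; add the loaded activation vector, the interleaved
// weight vectors and the block scales, and an 8-row tile already consumes most
// of the 16 ymm registers AVX2 provides. That leaves no headroom for also
// keeping a whole D = 128 row resident, which is what the special case
// mul_mat_q8_0_r4_q8_0_128 exploited with 4 interleaved rows.
// acc[8] must be zero-initialized by the caller; QK is the block size (32 for Q8_0).
void dot_8rows_ref(const int8_t *x /* 8*QK interleaved weights */,
                   const int8_t *y /* QK activations */,
                   int QK, int32_t acc[8]) {
    for (int c = 0; c < QK/4; ++c)           // walk 4-position chunks
        for (int r = 0; r < 8; ++r)          // 8 interleaved rows
            for (int k = 0; k < 4; ++k)
                acc[r] += int32_t(x[4*(8*c + r) + k]) * int32_t(y[4*c + k]);
}
```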
saood06 (Collaborator) commented Jan 26, 2025

@ikawrakow

Tested on my Xeon E5-2683 v4 machine via llama-bench.

| model | size | params | fa | rtr | test | master t/s | PR t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B IQ4_XS - 4.25 bpw | 34.30 GiB | 68.98 B | 1 | 1 | pp512 | 7.00 | 7.10 |
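(For reference, a command along these lines should produce such a row; the model path is a placeholder, and the -fa/-rtr flags are inferred from the table's fa and rtr columns, so treat the exact invocation as an assumption.)

```sh
# pp512 test with flash attention and run-time repacking enabled; model path is a placeholder
./llama-bench -m llama-70b-iq4_xs.gguf -fa 1 -rtr 1 -p 512 -n 0
```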

If you want me to test on my other machine (dual socket Xeon E5-2690 v3) or other models let me know.

Also, any chance you can sync the RPC code (I mostly care about #11047 and, to a lesser degree, #9389 and #11424/#9296)? If not, I'll do it when I have some free time and submit a PR.

saood06 (Collaborator) commented Jan 27, 2025

Testing the batch performance difference, showing the peak numbers:

IQ4_XS_R8:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 14 | 1920 | 18.944 | 6.76 | 272.880 | 6.57 | 291.824 | 6.58 |

IQ4_XS_R4:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 11 | 1536 | 19.367 | 6.61 | 220.288 | 6.39 | 239.655 | 6.41 |

ikawrakow (Owner, Author)

So, it looks like a small (~2%) improvement. OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is IQ4_XS_R4.)

saood06 (Collaborator) commented Jan 27, 2025

So, it looks like a small (~2%) improvement.

Yes, it is an improvement (there is an edge case where R4 was better at batch size 4).

OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is IQ4_XS_R4.)

Yes, it is okay to merge. That model is an IQ4_K_R4 (and IQ4_K), not IQ4_XS, as I prefer your quants over the mainline ones, which is also why I don't have comparison data for it against mainline.

On the note of R1, this PR llama.cpp/pull/11446 will make me reconvert anyway. I want to use it, and it is also easier to grab it now, before the KV refactor it is waiting on to implement the MLA KV cache. I was going to bring that up anyway in the Deepseek PR, because it is a change to the GGUF for Deepseek.

#11397 is also showing significant improvements to Deepseek.

ikawrakow merged commit d9c4ea4 into main on Jan 27, 2025
ikawrakow (Owner, Author)

On the note of R1, this PR 11446 will make me reconvert anyway

What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?

fairydreaming commented Jan 27, 2025

On the note of R1, this PR 11446 will make me reconvert anyway

What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?

That would be my modified llama-bench from this PR: ggml-org/llama.cpp#11126
It allows measuring the token generation rate after processing a prompt of a given size.

For the graph I used the -gp <prompt size>,32 option, so it's the mean token generation rate over 32 tokens after processing a prompt of <prompt size> tokens.
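For example, a run for one point on the graph might look like this (the model path is a placeholder and the flag syntax follows the description in the linked PR, so treat it as a sketch):

```sh
# mean TG rate over 32 generated tokens after first processing a 4096-token prompt
./llama-bench -m model.gguf -gp 4096,32
```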

ikawrakow (Owner, Author)

@fairydreaming Thanks for the clarification.

I played a bit with your PR 11466. TG after a long prompt looks great compared to llama.cpp, but it seems this comes at the expense of a much reduced prompt processing speed? Here is what I get on my Ryzen-7950X:

- llama.cpp

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp256 | 150.29 ± 0.31 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp512 | 153.23 ± 0.13 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp1024 | 149.27 ± 0.22 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp4096 | 133.74 ± 0.20 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp8192 | 117.74 ± 0.03 |

- PR 11466

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp256 | 142.08 ± 0.27 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp512 | 140.53 ± 0.03 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp1024 | 133.17 ± 0.12 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp4096 | 101.17 ± 0.10 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp8192 | 77.08 ± 0.08 |

(I did not have the patience to wait for the 16k tokens benchmark to finish).

fairydreaming commented Jan 28, 2025

@ikawrakow Yup, I noticed this. I plan to reorganize tensor dimensions for the prompt processing in the PR, hopefully this will fix the issue.

Edit: it helped, but only a bit (PP rate is 6-8% higher with these changes); it's still slower than the original implementation.

saood06 (Collaborator) commented Jan 29, 2025

@fairydreaming

It allows measuring the token generation rate after processing a prompt of a given size.

Can't this already be done with batched-bench by setting a batch size of 1? That has the benefit of showing PP speed as well.
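Something along these lines should give the same measurement plus the PP speed (a sketch; the model path is a placeholder and the flags are the usual batched-bench ones, so treat the exact invocation as an assumption):

```sh
# one sequence (-npl 1): prefill a 4096-token prompt, then generate 32 tokens;
# both S_PP and S_TG are reported
./llama-batched-bench -m model.gguf -c 4608 -npp 4096 -ntg 32 -npl 1
```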

it helped, but only a bit (PP rate is 6-8% higher with these changes); it's still slower than the original implementation.

Can you push that change? For my use cases the TG benefits outweigh the loss in PP; I'll try looking into the performance as well.

fairydreaming

@saood06

@fairydreaming

It allows measuring the token generation rate after processing a prompt of a given size.

Can't this already be done with batched-bench by setting a batch size of 1? That has the benefit of showing PP speed as well.

That is correct.

it helped, but only a bit (PP rate is 6-8% higher with these changes); it's still slower than the original implementation.

Can you push that change? For my use cases the TG benefits outweigh the loss in PP; I'll try looking into the performance as well.

Pushed.

saood06 (Collaborator) commented Jan 30, 2025

@ikawrakow

I did not rename the types to _R8 yet, but will do so if this gets merged.

ikawrakow (Owner, Author)

Will do when I come back from FOSDEM.
