Interleave 8 rows (Q8_0, IQ4_XS) #178


Merged: 10 commits merged into main on Jan 27, 2025
Conversation

ikawrakow (Owner)

One can get better performance on AVX2/Zen4 by interleaving 8 rows instead of 4. I did not do this earlier because in my previous attempts performance on ARM suffered significantly. But in this PR I found an ARM_NEON implementation for 8 interleaved rows for Q8_0 and IQ4_XS that is no slower than 4 interleaved rows, and in some cases even slightly faster.

Run-time repacking from Q8_0/IQ4_XS will of course still work, but models already quantized to Q8_0_R4 or IQ4_XS_R4 will stop working, so I'm putting this out there for testing and feedback.

I did not rename the types to _R8 yet, but will do so if this gets merged.
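To make the interleaving idea concrete, here is a minimal repacking sketch. The struct name `block_q8_0_r8`, the 4-byte chunking, and the placement of the scales are illustrative assumptions, not the repo's exact layout, which is chosen to match the SIMD load patterns of the GEMM kernels.

```cpp
#include <cstdint>
#include <cstring>

constexpr int QK8_0 = 32;           // quants per Q8_0 block

struct block_q8_0 {                 // standard Q8_0 block: fp16 scale + 32 int8 quants
    uint16_t d;                     // fp16 scale, kept as raw bits for this sketch
    int8_t   qs[QK8_0];
};

struct block_q8_0_r8 {              // hypothetical 8-row interleaved block
    uint16_t d[8];                  // scales of the same block index from 8 rows
    int8_t   qs[8 * QK8_0];         // quants interleaved in 4-byte chunks across rows
};

// Repack 8 rows of Q8_0 so a GEMM kernel can load the same block position of all
// 8 rows with a few contiguous SIMD loads instead of 8 strided ones.
void repack_q8_0_8rows(const block_q8_0 *rows[8], int nblocks, block_q8_0_r8 *out) {
    for (int ib = 0; ib < nblocks; ++ib) {
        block_q8_0_r8 &dst = out[ib];
        for (int r = 0; r < 8; ++r) dst.d[r] = rows[r][ib].d;
        // 4-byte chunk c of row r goes to slot (8*c + r)
        for (int c = 0; c < QK8_0/4; ++c)
            for (int r = 0; r < 8; ++r)
                std::memcpy(dst.qs + 4*(8*c + r), rows[r][ib].qs + 4*c, 4);
    }
}
```

Run-time repacking (the rtr path) is this kind of transformation applied at model load time, which is why plain Q8_0/IQ4_XS files keep working while pre-repacked _R4 files do not.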

Below is a graph showing prompt processing (a.k.a. prefill) performance for LLaMA-3.1-8B quantized with IQ4_XS on a Ryzen-7950X CPU. The cyan symbols are the results with this PR. We now get over 300 t/s for prompts less than 1000 tokens.

[Figure pp512_vs_ctx: PP-512 performance vs. context length for LLaMA-3.1-8B (IQ4_XS) on Ryzen-7950X; cyan = this PR]

@saood06 Can you test if this improves IQ4_XS_R4 performance on your system?

Commit notes:

- On Zen4, PP-512 goes up from ~260 t/s to 288 t/s for L3-8B. TG-128 reaches maximum performance at 2 threads and is slightly higher than with 4 interleaved rows (14.48 t/s vs 13.11 t/s @ 2 threads and 14.28 t/s @ 4 threads). It is also faster on AVX2.
- This is the NEON implementation. It is a tiny bit faster than 4 interleaved rows (~0.5%). So this looks like a winner, given the Zen4/AVX2 improvement without an associated NEON regression.
- PP-512 is now 284 t/s compared to 257 t/s for 4 interleaved rows. TG-128 reaches a peak of 8.16 t/s at just 2 threads, compared to 7.95 t/s @ 4 threads before.
- PP-512 is slightly better (138 t/s vs 132.5 t/s); TG-128 is about the same.
- Very slightly faster than the general-purpose GEMM, slightly slower than the D = 128 special-case GEMM mul_mat_q8_0_r4_q8_0_128. Still removing mul_mat_q8_0_r4_q8_0_128, as we simply don't have enough vector registers to hold 8 interleaved rows, so there is no point in keeping the special-purpose implementation.
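As a rough illustration of the register-pressure point in the last note, here is a scalar reference of the 8-row dot product matching the interleaved layout sketched above; the register counts in the comments assume an AVX2-style kernel with 16 ymm registers and are not taken from the repo's actual kernels.

```cpp
#include <cstdint>

// One accumulator per interleaved row. In a vectorized kernel each accumulator
// becomes a vector register; add the loaded activation vector, the interleaved
// weight vectors and the block scales, and an 8-row tile already consumes most
// of the 16 ymm registers AVX2 provides. That leaves no headroom for also
// keeping a whole D = 128 row resident, which is what the special case
// mul_mat_q8_0_r4_q8_0_128 exploited with 4 interleaved rows.
// acc[8] must be zero-initialized by the caller; QK is the block size (32 for Q8_0).
void dot_8rows_ref(const int8_t *x /* 8*QK interleaved weights */,
                   const int8_t *y /* QK activations */,
                   int QK, int32_t acc[8]) {
    for (int c = 0; c < QK/4; ++c)           // walk 4-position chunks
        for (int r = 0; r < 8; ++r)          // 8 interleaved rows
            for (int k = 0; k < 4; ++k)
                acc[r] += int32_t(x[4*(8*c + r) + k]) * int32_t(y[4*c + k]);
}
```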
saood06 (Collaborator) commented Jan 26, 2025

@ikawrakow

Tested on my Xeon E5-2683 v4 machine via llama-bench.

| model | size | params | fa | rtr | test | master t/s | PR t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B IQ4_XS - 4.25 bpw | 34.30 GiB | 68.98 B | 1 | 1 | pp512 | 7.00 | 7.10 |
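(For reference, a command along these lines should produce such a row; the model path is a placeholder, and the -fa/-rtr flags are inferred from the table's fa and rtr columns, so treat the exact invocation as an assumption.)

```sh
# pp512 test with flash attention and run-time repacking enabled; model path is a placeholder
./llama-bench -m llama-70b-iq4_xs.gguf -fa 1 -rtr 1 -p 512 -n 0
```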

If you want me to test on my other machine (dual socket Xeon E5-2690 v3) or other models let me know.

Also, any chance you can sync the RPC code (I mostly care about #11047 and, to a lesser degree, #9389 and #11424/#9296)? If not, I'll do it when I have some free time and submit a PR.

saood06 (Collaborator) commented Jan 27, 2025

Testing the batch performance difference, showing the peak numbers:

IQ4_XS_R8:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 14 | 1920 | 18.944 | 6.76 | 272.880 | 6.57 | 291.824 | 6.58 |

IQ4_XS_R4:

| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | 128 | 11 | 1536 | 19.367 | 6.61 | 220.288 | 6.39 | 239.655 | 6.41 |

ikawrakow (Owner, Author)

So, it looks like a small (~2%) improvement. OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is IQ4_XS_R4.)

saood06 (Collaborator) commented Jan 27, 2025

So, it looks like a small (~2%) improvement.

Yes, it is an improvement (there is an edge case where R4 was better at batch size 4).

OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is IQ4_XS_R4.)

Yes, it is okay to merge. That model is an IQ4_K_R4 (and IQ4_K), not IQ4_XS, as I prefer your quants over the mainline ones, which is also why I don't have comparison data for it against mainline.

On the note of R1, this PR llama.cpp/pull/11446 will make me reconvert anyway. I want to use it, and it is also easier to grab it now, before the KV refactor it is waiting on to implement the MLA KV cache. I was going to bring that up anyway in the Deepseek PR, because it is a change to the GGUF for Deepseek.

#11397 is also showing significant improvements to Deepseek.

ikawrakow merged commit d9c4ea4 into main on Jan 27, 2025
ikawrakow (Owner, Author)

On the note of R1, this PR 11446 will make me reconvert anyway

What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?

fairydreaming commented Jan 27, 2025

On the note of R1, this PR 11446 will make me reconvert anyway

What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?

That would be my modified llama-bench from this PR: ggml-org/llama.cpp#11126
It allows measuring the token generation rate after processing a prompt of a given size.

For the graph I used the -gp <prompt size>,32 option, so it's the mean token generation rate over 32 tokens after processing a prompt of <prompt size> tokens.
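For example, a run for one point on the graph might look like this (the model path is a placeholder and the flag syntax follows the description in the linked PR, so treat it as a sketch):

```sh
# mean TG rate over 32 generated tokens after first processing a 4096-token prompt
./llama-bench -m model.gguf -gp 4096,32
```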

ikawrakow (Owner, Author)

@fairydreaming Thanks for the clarification.

I played a bit with your PR 11466. TG after a long prompt looks great compared to llama.cpp, but it seems this comes at the expense of a much reduced prompt processing speed? Here is what I get on my Ryzen-7950X:

- llama.cpp

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp256 | 150.29 ± 0.31 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp512 | 153.23 ± 0.13 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp1024 | 149.27 ± 0.22 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp4096 | 133.74 ± 0.20 |
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp8192 | 117.74 ± 0.03 |

- PR 11466

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp256 | 142.08 ± 0.27 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp512 | 140.53 ± 0.03 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp1024 | 133.17 ± 0.12 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp4096 | 101.17 ± 0.10 |
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp8192 | 77.08 ± 0.08 |

(I did not have the patience to wait for the 16k tokens benchmark to finish).

fairydreaming commented Jan 28, 2025

@ikawrakow Yup, I noticed this. I plan to reorganize tensor dimensions for the prompt processing in the PR, hopefully this will fix the issue.

Edit: it helped, but only a bit (PP rate is 6-8% higher with these changes); it's still slower than the original implementation.

saood06 (Collaborator) commented Jan 29, 2025

@fairydreaming

It allows measuring the token generation rate after processing a prompt of a given size.

Can't this already be done with batched-bench by setting a batch size of 1? That has the benefit of showing PP speed as well.
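Something along these lines should give the same measurement plus the PP speed (a sketch; the model path is a placeholder and the flags are the usual batched-bench ones, so treat the exact invocation as an assumption):

```sh
# one sequence (-npl 1): prefill a 4096-token prompt, then generate 32 tokens;
# both S_PP and S_TG are reported
./llama-batched-bench -m model.gguf -c 4608 -npp 4096 -ntg 32 -npl 1
```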

it helped, but only a bit (PP rate is 6-8% higher with these changes); it's still slower than the original implementation.

Can you push that change? For my use cases the TG benefits outweigh the loss in PP; I'll try looking into the performance as well.

fairydreaming

@saood06

@fairydreaming

It allows measuring the token generation rate after processing a prompt of a given size.

Can't this already be done with batched-bench by setting a batch size of 1? That has the benefit of showing PP speed as well.

That is correct.

it helped, but only a bit (PP rate is 6-8% higher with these changes); it's still slower than the original implementation.

Can you push that change? For my use cases the TG benefits outweigh the loss in PP; I'll try looking into the performance as well.

Pushed.

saood06 (Collaborator) commented Jan 30, 2025

@ikawrakow

I did not rename the types to _R8 yet, but will do so if this gets merged.

ikawrakow (Owner, Author)

Will do when I come back from FOSDEM.
