
July 2024: Token generation performance comparison


Performance comparison to llama.cpp

The results in the following tables are obtained with these parameters:

  • Model is LLaMA-v3-8B for AVX2 and LLaMA-v2-7B for ARM_NEON
  • The AVX2 CPU is a 16-core Ryzen-7950X
  • The ARM_NEON CPU is M2-Max
  • tinyBLAS is enabled in llama.cpp
  • llama.cpp results are for build 081fe431 (3441), the current llama.cpp master branch when I pulled it on July 23, 2024.
  • The projects are built without CUDA, without BLAS, and with the Accelerate framework disabled (a build-configuration sketch follows this list).
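
The last bullet can be reproduced with a CMake configuration along the following lines. This is a sketch, not the exact command used for these results: the option names (GGML_CUDA, GGML_BLAS, GGML_ACCELERATE, and GGML_LLAMAFILE, which controls the tinyBLAS kernels) are the standard llama.cpp CMake options of that period, so verify them against the tree you are building.

```bash
# Sketch of a build matching the setup above; verify option names locally.
cmake -B build \
    -DGGML_CUDA=OFF \
    -DGGML_BLAS=OFF \
    -DGGML_ACCELERATE=OFF \
    -DGGML_LLAMAFILE=ON    # tinyBLAS (from llamafile) stays enabled
cmake --build build --config Release -j
```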

Token generation

On the Ryzen-7950X, TG is memory bound, and for many quantization types peak performance is already reached at 4 threads. Hence, AVX2 results are shown for at most 4 threads. The M2-Max has a much more capable memory subsystem, so performance keeps increasing up to 8 threads; ARM_NEON results are therefore given for up to 8 threads.

The command line to generate the data was

```
./bin/llama-bench -m $model -p 0 -n 128 -t $num_threads -ngl 0
```
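
To sweep several models and thread counts in one go, something along these lines can be used; the model directory and file pattern below are placeholders to adapt. llama-bench accepts a comma-separated list for -t and reports one result row per thread count.

```bash
# Hypothetical sweep; $MODEL_DIR and the glob pattern are placeholders.
MODEL_DIR=./models
for model in "$MODEL_DIR"/*.gguf; do
    # -p 0: skip prompt processing, -n 128: generate 128 tokens,
    # -ngl 0: run entirely on the CPU
    ./bin/llama-bench -m "$model" -p 0 -n 128 -t 2,4,8 -ngl 0
done
```
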
| Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
|---|---|---|---|---|---|---|
| 8B F16 | 14.96 GiB | AVX2 | 1 | 2.20 ± 0.00 | 2.25 ± 0.00 | 1.023 |
| | | | 2 | 3.63 ± 0.00 | 3.68 ± 0.00 | 1.014 |
| | | | 4 | 4.20 ± 0.00 | 4.20 ± 0.00 | 1.000 |
| 7B F16 | 12.55 GiB | NEON | 2 | 6.94 ± 0.27 | 7.40 ± 0.01 | 1.066 |
| | | | 4 | 8.73 ± 0.01 | 8.83 ± 0.01 | 1.011 |
| | | | 6 | 9.05 ± 0.02 | 9.05 ± 0.01 | 1.000 |
| 8B Q8_0 | 7.95 GiB | AVX2 | 2 | 5.03 ± 0.00 | 7.87 ± 0.00 | 1.565 |
| | | | 4 | 7.40 ± 0.00 | 7.82 ± 0.00 | 1.057 |
| 7B Q8_0 | 6.67 GiB | NEON | 2 | 8.29 ± 0.44 | 12.07 ± 0.10 | 1.456 |
| | | | 4 | 13.53 ± 0.03 | 15.77 ± 0.08 | 1.166 |
| | | | 8 | 16.24 ± 0.10 | 16.94 ± 0.04 | 1.043 |
| 8B Q4_0 | 4.35 GiB | AVX2 | 2 | 6.36 ± 0.00 | 10.28 ± 0.00 | 1.616 |
| | | | 4 | 10.97 ± 0.06 | 13.55 ± 0.07 | 1.235 |
| 7B Q4_0 | 3.57 GiB | NEON | 2 | 9.77 ± 0.02 | 13.69 ± 0.03 | 1.401 |
| | | | 4 | 17.82 ± 0.06 | 23.98 ± 0.11 | 1.346 |
| | | | 8 | 26.63 ± 0.41 | 29.86 ± 0.04 | 1.121 |
| 8B Q4_1 | 4.77 GiB | AVX2 | 2 | 5.11 ± 0.00 | 11.45 ± 0.00 | 2.241 |
| | | | 4 | 9.08 ± 0.02 | 12.58 ± 0.00 | 1.385 |
| 7B Q4_1 | 3.95 GiB | NEON | 2 | 9.11 ± 0.06 | 14.62 ± 0.04 | 1.605 |
| | | | 4 | 17.04 ± 0.09 | 24.08 ± 0.28 | 1.413 |
| | | | 8 | 25.26 ± 0.24 | 27.23 ± 0.14 | 1.078 |
| 8B Q5_0 | 5.22 GiB | AVX2 | 2 | 5.31 ± 0.01 | 8.30 ± 0.01 | 1.563 |
| | | | 4 | 9.40 ± 0.01 | 11.47 ± 0.00 | 1.220 |
| 7B Q5_0 | 4.34 GiB | NEON | 2 | 7.26 ± 0.06 | 7.52 ± 0.00 | 1.036 |
| | | | 4 | 13.63 ± 0.18 | 14.16 ± 0.10 | 1.039 |
| | | | 8 | 22.55 ± 0.35 | 24.34 ± 0.22 | 1.079 |
| 8B Q5_1 | 5.64 GiB | AVX2 | 2 | 4.52 ± 0.00 | 8.86 ± 0.00 | 1.960 |
| | | | 4 | 7.72 ± 0.05 | 10.68 ± 0.03 | 1.383 |
| 7B Q5_1 | 4.72 GiB | NEON | 2 | 6.51 ± 0.01 | 6.42 ± 0.03 | 0.986 |
| | | | 4 | 12.26 ± 0.18 | 12.21 ± 0.14 | 0.996 |
| | | | 8 | 20.33 ± 0.52 | 21.85 ± 0.22 | 1.075 |
| 8B Q2_K_S | 2.78 GiB | AVX2 | 2 | 11.30 ± 0.00 | 13.06 ± 0.01 | 1.156 |
| | | | 4 | 18.70 ± 0.00 | 19.04 ± 0.65 | 1.014 |
| 7B Q2_K_S | 2.16 GiB | NEON | 2 | 8.42 ± 0.05 | 11.97 ± 0.10 | 1.422 |
| | | | 4 | 15.74 ± 0.01 | 22.09 ± 0.08 | 1.403 |
| | | | 8 | 27.35 ± 0.05 | 38.32 ± 0.05 | 1.401 |
| 8B Q3_K_S | 3.41 GiB | AVX2 | 2 | 8.58 ± 0.00 | 10.82 ± 0.00 | 1.261 |
| | | | 4 | 15.26 ± 0.01 | 16.25 ± 0.01 | 1.065 |
| 7B Q3_K_S | 2.75 GiB | NEON | 2 | 6.40 ± 0.02 | 9.12 ± 0.09 | 1.425 |
| | | | 4 | 12.17 ± 0.00 | 17.11 ± 0.03 | 1.406 |
| | | | 8 | 22.04 ± 0.08 | 31.39 ± 0.31 | 1.424 |
| 8B Q4_K_S | 4.36 GiB | AVX2 | 2 | 9.61 ± 0.00 | 10.72 ± 0.01 | 1.116 |
| | | | 4 | 13.24 ± 0.31 | 13.28 ± 0.01 | 1.003 |
| 7B Q4_K_S | 3.59 GiB | NEON | 2 | 11.15 ± 0.05 | 12.93 ± 0.09 | 1.160 |
| | | | 4 | 20.24 ± 0.16 | 23.49 ± 0.29 | 1.161 |
| | | | 8 | 25.76 ± 0.07 | 28.31 ± 0.22 | 1.099 |
| 8B Q5_K_S | 5.21 GiB | AVX2 | 2 | 7.45 ± 0.00 | 9.73 ± 0.00 | 1.306 |
| | | | 4 | 11.05 ± 0.33 | 11.43 ± 0.02 | 1.034 |
| 7B Q5_K_S | 4.33 GiB | NEON | 2 | 7.20 ± 0.04 | 8.81 ± 0.04 | 1.224 |
| | | | 4 | 13.62 ± 0.15 | 16.81 ± 0.16 | 1.234 |
| | | | 8 | 20.56 ± 0.19 | 23.96 ± 0.14 | 1.165 |
| 8B Q6_K | 6.14 GiB | AVX2 | 2 | 7.53 ± 0.00 | 9.42 ± 0.00 | 1.251 |
| | | | 4 | 9.74 ± 0.00 | 9.97 ± 0.01 | 1.024 |
| 7B Q6_K | 5.15 GiB | NEON | 2 | 6.85 ± 0.04 | 8.30 ± 0.06 | 1.212 |
| | | | 4 | 13.03 ± 0.05 | 15.47 ± 0.17 | 1.187 |
| | | | 8 | 18.52 ± 0.07 | 20.67 ± 0.08 | 1.116 |
| 8B IQ2_XXS | 2.23 GiB | AVX2 | 2 | 5.33 ± 0.01 | 6.40 ± 0.00 | 1.201 |
| | | | 4 | 10.06 ± 0.03 | 11.76 ± 0.03 | 1.169 |
| 7B IQ2_XXS | 1.73 GiB | NEON | 2 | 5.07 ± 0.04 | 5.22 ± 0.05 | 1.030 |
| | | | 4 | 9.63 ± 0.00 | 9.91 ± 0.07 | 1.029 |
| | | | 8 | 17.40 ± 0.50 | 18.65 ± 0.22 | 1.072 |
| 8B IQ2_XS | 2.42 GiB | AVX2 | 2 | 5.83 ± 0.00 | 6.55 ± 0.00 | 1.123 |
| | | | 4 | 10.88 ± 0.09 | 12.07 ± 0.07 | 1.109 |
| 7B IQ2_XS | 1.89 GiB | NEON | 2 | 5.52 ± 0.01 | 5.60 ± 0.00 | 1.014 |
| | | | 4 | 10.50 ± 0.01 | 11.15 ± 0.00 | 1.062 |
| | | | 8 | 18.19 ± 1.30 | 20.94 ± 0.19 | 1.151 |
| 8B IQ2_M | 2.74 GiB | AVX2 | 2 | 5.12 ± 0.01 | 5.17 ± 0.00 | 1.010 |
| | | | 4 | 9.60 ± 0.28 | 9.68 ± 0.16 | 1.008 |
| 7B IQ2_M | 2.20 GiB | NEON | 2 | 3.73 ± 0.02 | 4.53 ± 0.00 | 1.214 |
| | | | 4 | 7.14 ± 0.05 | 8.70 ± 0.06 | 1.218 |
| | | | 8 | 11.99 ± 0.48 | 16.41 ± 0.05 | 1.369 |
| 8B IQ3_XXS | 3.04 GiB | AVX2 | 2 | 4.06 ± 0.01 | 5.00 ± 0.00 | 1.232 |
| | | | 4 | 7.75 ± 0.02 | 9.13 ± 0.45 | 1.178 |
| 7B IQ3_XXS | 2.41 GiB | NEON | 2 | 3.53 ± 0.00 | 3.82 ± 0.00 | 1.082 |
| | | | 4 | 6.74 ± 0.04 | 7.42 ± 0.07 | 1.103 |
| | | | 8 | 11.96 ± 0.40 | 13.19 ± 0.29 | 1.103 |
| 8B IQ3_S | 3.42 GiB | AVX2 | 2 | 3.62 ± 0.00 | 4.06 ± 0.00 | 1.122 |
| | | | 4 | 6.80 ± 0.01 | 7.62 ± 0.10 | 1.121 |
| 7B IQ3_S | 2.75 GiB | NEON | 2 | 2.96 ± 0.01 | 3.21 ± 0.03 | 1.084 |
| | | | 4 | 5.68 ± 0.01 | 6.25 ± 0.05 | 1.100 |
| | | | 8 | 10.32 ± 0.25 | 11.11 ± 0.37 | 1.077 |
| 8B IQ4_XS | 4.13 GiB | AVX2 | 2 | 8.08 ± 0.00 | 11.35 ± 0.00 | 1.405 |
| | | | 4 | 13.36 ± 0.72 | 14.32 ± 0.24 | 1.072 |
| 7B IQ4_XS | 3.37 GiB | NEON | 2 | 9.87 ± 0.03 | 12.06 ± 0.00 | 1.222 |
| | | | 4 | 17.78 ± 0.23 | 22.06 ± 0.28 | 1.241 |
| | | | 8 | 27.62 ± 0.09 | 29.70 ± 0.39 | 1.075 |
| 8B IQ4_NL | 4.35 GiB | AVX2 | 2 | 5.52 ± 0.00 | 10.26 ± 0.00 | 1.859 |
| | | | 4 | 10.78 ± 0.01 | 13.69 ± 0.08 | 1.270 |
| 7B IQ4_NL | 3.56 GiB | NEON | 2 | 8.32 ± 0.01 | 13.54 ± 0.01 | 1.627 |
| | | | 4 | 15.89 ± 0.00 | 24.28 ± 0.29 | 1.528 |
| | | | 8 | 26.56 ± 0.36 | 29.87 ± 0.08 | 1.125 |

Here the gains are generally lower than for prompt processing (PP), because TG performance is limited by memory bandwidth rather than by compute. Nevertheless, for some combinations of quantization type, architecture, and thread count the speedup is quite remarkable (e.g., almost a factor of 2 for Q5_1 on AVX2 with 2 threads).
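
Why memory bandwidth sets the ceiling: generating one token streams essentially all of the model weights through the memory system once, so TG speed is bounded by roughly bandwidth / model size. Multiplying a measured rate by the model size gives the effective bandwidth, e.g., 7.87 t/s × 7.95 GiB ≈ 63 GiB/s for Q8_0 on the Ryzen-7950X; this is in the ballpark of what its dual-channel DDR5 can sustain in practice (the exact figure depends on the memory configuration, so treat this as a rough estimate). Once that ceiling is hit, a faster matrix multiplication cannot raise TG throughput further; it can only reach the ceiling with fewer threads.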