
July 2024: Token generation performance comparison


Performance comparison to llama.cpp

The results in the following tables are obtained with these parameters:

  • Model is LLaMA-v3-8B for AVX2 and LLaMA-v2-7B for ARM_NEON
  • The AVX2 CPU is a 16-core Ryzen-7950X
  • The ARM_NEON CPU is M2-Max
  • tinyBLAS is enabled in llama.cpp
  • llama.cpp results are for build 081fe431 (3441), the current llama.cpp master branch when I pulled it on July 23, 2024.
  • The projects are built without CUDA, without BLAS, and with the Accelerate framework disabled (a build-configuration sketch follows this list).
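
The last bullet can be reproduced with a CMake configuration along the following lines. This is a sketch, not the exact command used for these results: the option names (GGML_CUDA, GGML_BLAS, GGML_ACCELERATE, and GGML_LLAMAFILE, which controls the tinyBLAS kernels) are the standard llama.cpp CMake options of that period, so verify them against the tree you are building.

```bash
# Sketch of a build matching the setup above; verify option names locally.
cmake -B build \
    -DGGML_CUDA=OFF \
    -DGGML_BLAS=OFF \
    -DGGML_ACCELERATE=OFF \
    -DGGML_LLAMAFILE=ON    # tinyBLAS (from llamafile) stays enabled
cmake --build build --config Release -j
```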

Token generation

On the Ryzen-7950X, TG is memory bound, and for many quantization types peak performance is already reached at 4 threads. Hence, AVX2 results are shown for at most 4 threads. The M2-Max has a much more capable memory subsystem, so performance keeps increasing up to 8 threads; ARM_NEON results are therefore given for up to 8 threads.

The command line to generate the data was

```
./bin/llama-bench -m $model -p 0 -n 128 -t $num_threads -ngl 0
```
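
To sweep several models and thread counts in one go, something along these lines can be used; the model directory and file pattern below are placeholders to adapt. llama-bench accepts a comma-separated list for -t and reports one result row per thread count.

```bash
# Hypothetical sweep; $MODEL_DIR and the glob pattern are placeholders.
MODEL_DIR=./models
for model in "$MODEL_DIR"/*.gguf; do
    # -p 0: skip prompt processing, -n 128: generate 128 tokens,
    # -ngl 0: run entirely on the CPU
    ./bin/llama-bench -m "$model" -p 0 -n 128 -t 2,4,8 -ngl 0
done
```
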
| Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
|---|---|---|---|---|---|---|
| 8B F16 | 14.96 GiB | AVX2 | 1 | 2.20 ± 0.00 | 2.25 ± 0.00 | 1.023 |
| | | | 2 | 3.63 ± 0.00 | 3.68 ± 0.00 | 1.014 |
| | | | 4 | 4.20 ± 0.00 | 4.20 ± 0.00 | 1.000 |
| 7B F16 | 12.55 GiB | NEON | 2 | 6.94 ± 0.27 | 7.40 ± 0.01 | 1.066 |
| | | | 4 | 8.73 ± 0.01 | 8.83 ± 0.01 | 1.011 |
| | | | 6 | 9.05 ± 0.02 | 9.05 ± 0.01 | 1.000 |
| 8B Q8_0 | 7.95 GiB | AVX2 | 2 | 5.03 ± 0.00 | 7.87 ± 0.00 | 1.565 |
| | | | 4 | 7.40 ± 0.00 | 7.82 ± 0.00 | 1.057 |
| 7B Q8_0 | 6.67 GiB | NEON | 2 | 8.29 ± 0.44 | 12.07 ± 0.10 | 1.456 |
| | | | 4 | 13.53 ± 0.03 | 15.77 ± 0.08 | 1.166 |
| | | | 8 | 16.24 ± 0.10 | 16.94 ± 0.04 | 1.043 |
| 8B Q4_0 | 4.35 GiB | AVX2 | 2 | 6.36 ± 0.00 | 10.28 ± 0.00 | 1.616 |
| | | | 4 | 10.97 ± 0.06 | 13.55 ± 0.07 | 1.235 |
| 7B Q4_0 | 3.57 GiB | NEON | 2 | 9.77 ± 0.02 | 13.69 ± 0.03 | 1.401 |
| | | | 4 | 17.82 ± 0.06 | 23.98 ± 0.11 | 1.346 |
| | | | 8 | 26.63 ± 0.41 | 29.86 ± 0.04 | 1.121 |
| 8B Q4_1 | 4.77 GiB | AVX2 | 2 | 5.11 ± 0.00 | 11.45 ± 0.00 | 2.241 |
| | | | 4 | 9.08 ± 0.02 | 12.58 ± 0.00 | 1.385 |
| 7B Q4_1 | 3.95 GiB | NEON | 2 | 9.11 ± 0.06 | 14.62 ± 0.04 | 1.605 |
| | | | 4 | 17.04 ± 0.09 | 24.08 ± 0.28 | 1.413 |
| | | | 8 | 25.26 ± 0.24 | 27.23 ± 0.14 | 1.078 |
| 8B Q5_0 | 5.22 GiB | AVX2 | 2 | 5.31 ± 0.01 | 8.30 ± 0.01 | 1.563 |
| | | | 4 | 9.40 ± 0.01 | 11.47 ± 0.00 | 1.220 |
| 7B Q5_0 | 4.34 GiB | NEON | 2 | 7.26 ± 0.06 | 7.52 ± 0.00 | 1.036 |
| | | | 4 | 13.63 ± 0.18 | 14.16 ± 0.10 | 1.039 |
| | | | 8 | 22.55 ± 0.35 | 24.34 ± 0.22 | 1.079 |
| 8B Q5_1 | 5.64 GiB | AVX2 | 2 | 4.52 ± 0.00 | 8.86 ± 0.00 | 1.960 |
| | | | 4 | 7.72 ± 0.05 | 10.68 ± 0.03 | 1.383 |
| 7B Q5_1 | 4.72 GiB | NEON | 2 | 6.51 ± 0.01 | 6.42 ± 0.03 | 0.986 |
| | | | 4 | 12.26 ± 0.18 | 12.21 ± 0.14 | 0.996 |
| | | | 8 | 20.33 ± 0.52 | 21.85 ± 0.22 | 1.075 |
| 8B Q2_K_S | 2.78 GiB | AVX2 | 2 | 11.30 ± 0.00 | 13.06 ± 0.01 | 1.156 |
| | | | 4 | 18.70 ± 0.00 | 19.04 ± 0.65 | 1.014 |
| 7B Q2_K_S | 2.16 GiB | NEON | 2 | 8.42 ± 0.05 | 11.97 ± 0.10 | 1.422 |
| | | | 4 | 15.74 ± 0.01 | 22.09 ± 0.08 | 1.403 |
| | | | 8 | 27.35 ± 0.05 | 38.32 ± 0.05 | 1.401 |
| 8B Q3_K_S | 3.41 GiB | AVX2 | 2 | 8.58 ± 0.00 | 10.82 ± 0.00 | 1.261 |
| | | | 4 | 15.26 ± 0.01 | 16.25 ± 0.01 | 1.065 |
| 7B Q3_K_S | 2.75 GiB | NEON | 2 | 6.40 ± 0.02 | 9.12 ± 0.09 | 1.425 |
| | | | 4 | 12.17 ± 0.00 | 17.11 ± 0.03 | 1.406 |
| | | | 8 | 22.04 ± 0.08 | 31.39 ± 0.31 | 1.424 |
| 8B Q4_K_S | 4.36 GiB | AVX2 | 2 | 9.61 ± 0.00 | 10.72 ± 0.01 | 1.116 |
| | | | 4 | 13.24 ± 0.31 | 13.28 ± 0.01 | 1.003 |
| 7B Q4_K_S | 3.59 GiB | NEON | 2 | 11.15 ± 0.05 | 12.93 ± 0.09 | 1.160 |
| | | | 4 | 20.24 ± 0.16 | 23.49 ± 0.29 | 1.161 |
| | | | 8 | 25.76 ± 0.07 | 28.31 ± 0.22 | 1.099 |
| 8B Q5_K_S | 5.21 GiB | AVX2 | 2 | 7.45 ± 0.00 | 9.73 ± 0.00 | 1.306 |
| | | | 4 | 11.05 ± 0.33 | 11.43 ± 0.02 | 1.034 |
| 7B Q5_K_S | 4.33 GiB | NEON | 2 | 7.20 ± 0.04 | 8.81 ± 0.04 | 1.224 |
| | | | 4 | 13.62 ± 0.15 | 16.81 ± 0.16 | 1.234 |
| | | | 8 | 20.56 ± 0.19 | 23.96 ± 0.14 | 1.165 |
| 8B Q6_K | 6.14 GiB | AVX2 | 2 | 7.53 ± 0.00 | 9.42 ± 0.00 | 1.251 |
| | | | 4 | 9.74 ± 0.00 | 9.97 ± 0.01 | 1.024 |
| 7B Q6_K | 5.15 GiB | NEON | 2 | 6.85 ± 0.04 | 8.30 ± 0.06 | 1.212 |
| | | | 4 | 13.03 ± 0.05 | 15.47 ± 0.17 | 1.187 |
| | | | 8 | 18.52 ± 0.07 | 20.67 ± 0.08 | 1.116 |
| 8B IQ2_XXS | 2.23 GiB | AVX2 | 2 | 5.33 ± 0.01 | 6.40 ± 0.00 | 1.201 |
| | | | 4 | 10.06 ± 0.03 | 11.76 ± 0.03 | 1.169 |
| 7B IQ2_XXS | 1.73 GiB | NEON | 2 | 5.07 ± 0.04 | 5.22 ± 0.05 | 1.030 |
| | | | 4 | 9.63 ± 0.00 | 9.91 ± 0.07 | 1.029 |
| | | | 8 | 17.40 ± 0.50 | 18.65 ± 0.22 | 1.072 |
| 8B IQ2_XS | 2.42 GiB | AVX2 | 2 | 5.83 ± 0.00 | 6.55 ± 0.00 | 1.123 |
| | | | 4 | 10.88 ± 0.09 | 12.07 ± 0.07 | 1.109 |
| 7B IQ2_XS | 1.89 GiB | NEON | 2 | 5.52 ± 0.01 | 5.60 ± 0.00 | 1.014 |
| | | | 4 | 10.50 ± 0.01 | 11.15 ± 0.00 | 1.062 |
| | | | 8 | 18.19 ± 1.30 | 20.94 ± 0.19 | 1.151 |
| 8B IQ2_M | 2.74 GiB | AVX2 | 2 | 5.12 ± 0.01 | 5.17 ± 0.00 | 1.010 |
| | | | 4 | 9.60 ± 0.28 | 9.68 ± 0.16 | 1.008 |
| 7B IQ2_M | 2.20 GiB | NEON | 2 | 3.73 ± 0.02 | 4.53 ± 0.00 | 1.214 |
| | | | 4 | 7.14 ± 0.05 | 8.70 ± 0.06 | 1.218 |
| | | | 8 | 11.99 ± 0.48 | 16.41 ± 0.05 | 1.369 |
| 8B IQ3_XXS | 3.04 GiB | AVX2 | 2 | 4.06 ± 0.01 | 5.00 ± 0.00 | 1.232 |
| | | | 4 | 7.75 ± 0.02 | 9.13 ± 0.45 | 1.178 |
| 7B IQ3_XXS | 2.41 GiB | NEON | 2 | 3.53 ± 0.00 | 3.82 ± 0.00 | 1.082 |
| | | | 4 | 6.74 ± 0.04 | 7.42 ± 0.07 | 1.103 |
| | | | 8 | 11.96 ± 0.40 | 13.19 ± 0.29 | 1.103 |
| 8B IQ3_S | 3.42 GiB | AVX2 | 2 | 3.62 ± 0.00 | 4.06 ± 0.00 | 1.122 |
| | | | 4 | 6.80 ± 0.01 | 7.62 ± 0.10 | 1.121 |
| 7B IQ3_S | 2.75 GiB | NEON | 2 | 2.96 ± 0.01 | 3.21 ± 0.03 | 1.084 |
| | | | 4 | 5.68 ± 0.01 | 6.25 ± 0.05 | 1.100 |
| | | | 8 | 10.32 ± 0.25 | 11.11 ± 0.37 | 1.077 |
| 8B IQ4_XS | 4.13 GiB | AVX2 | 2 | 8.08 ± 0.00 | 11.35 ± 0.00 | 1.405 |
| | | | 4 | 13.36 ± 0.72 | 14.32 ± 0.24 | 1.072 |
| 7B IQ4_XS | 3.37 GiB | NEON | 2 | 9.87 ± 0.03 | 12.06 ± 0.00 | 1.222 |
| | | | 4 | 17.78 ± 0.23 | 22.06 ± 0.28 | 1.241 |
| | | | 8 | 27.62 ± 0.09 | 29.70 ± 0.39 | 1.075 |
| 8B IQ4_NL | 4.35 GiB | AVX2 | 2 | 5.52 ± 0.00 | 10.26 ± 0.00 | 1.859 |
| | | | 4 | 10.78 ± 0.01 | 13.69 ± 0.08 | 1.270 |
| 7B IQ4_NL | 3.56 GiB | NEON | 2 | 8.32 ± 0.01 | 13.54 ± 0.01 | 1.627 |
| | | | 4 | 15.89 ± 0.00 | 24.28 ± 0.29 | 1.528 |
| | | | 8 | 26.56 ± 0.36 | 29.87 ± 0.08 | 1.125 |

Here the gains are generally lower than for prompt processing (PP), because TG performance is limited by memory bandwidth rather than by compute. Nevertheless, for some combinations of quantization type, architecture, and thread count the speedup is quite remarkable (e.g., almost a factor of 2 for Q5_1 on AVX2 with 2 threads).
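
Why memory bandwidth sets the ceiling: generating one token streams essentially all of the model weights through the memory system once, so TG speed is bounded by roughly bandwidth / model size. Multiplying a measured rate by the model size gives the effective bandwidth, e.g., 7.87 t/s × 7.95 GiB ≈ 63 GiB/s for Q8_0 on the Ryzen-7950X; this is in the ballpark of what its dual-channel DDR5 can sustain in practice (the exact figure depends on the memory configuration, so treat this as a rough estimate). Once that ceiling is hit, a faster matrix multiplication cannot raise TG throughput further; it can only reach the ceiling with fewer threads.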