# July 2024: Token generation performance comparison
The results in the following tables are obtained with these parameters:

- Model is LLaMA-v3-8B for `AVX2` and LLaMA-v2-7B for `ARM_NEON`
- The `AVX2` CPU is a 16-core Ryzen-7950X
- The `ARM_NEON` CPU is M2-Max
- `tinyBLAS` is enabled in `llama.cpp`
- `llama.cpp` results are for `build: 081fe431 (3441)`, which was the current `llama.cpp` master branch when I pulled on July 23 2024
- The projects are built without `CUDA` support, no `BLAS`, and with the Accelerate framework disabled (a possible configuration is sketched below)
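For reference, here is a minimal sketch of a CMake configuration satisfying the last constraint. The option names (`GGML_CUDA`, `GGML_BLAS`, `GGML_ACCELERATE`) are an assumption based on the ggml build options in use around mid-2024 and may differ in other checkouts:

```bash
# Sketch: CPU-only Release build with CUDA, BLAS, and Accelerate disabled.
# Option names are assumed; verify against the project's CMakeLists.txt.
cmake -B build -DCMAKE_BUILD_TYPE=Release \
      -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_ACCELERATE=OFF
cmake --build build --config Release -j
```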
On the Ryzen-7950X TG is memory bound, and for many quantization types peak performance is achieved at just 4 threads. Hence, only results for up to 4 threads are shown for `AVX2`. The M2-Max has a much more capable memory subsystem and as a result performance keeps increasing up to 8 threads. Thus, results are given for up to 8 threads for `ARM_NEON`.
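A quick sanity check on the memory-bound claim: each generated token has to stream essentially all model weights from RAM, so TG speed is bounded roughly by

$$\text{t/s} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{model size}}$$

Assuming, for illustration, about 60 GB/s of usable bandwidth on the dual-channel DDR5 Ryzen-7950X (an assumed figure, not measured here), a 4.35 GiB `Q4_0` model would be capped near 13-14 t/s, which is in the same ballpark as the 13.55 t/s measured at 4 threads below.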
The command line to generate the data was

```
./bin/llama-bench -m $model -p 0 -n 128 -t $num_threads -ngl 0
```
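Here `-p 0` skips the prompt-processing part of the benchmark, `-n 128` times the generation of 128 tokens, `-t` sets the thread count, and `-ngl 0` keeps all layers on the CPU. A full column of one of the tables can be reproduced by sweeping the thread count, e.g. (a sketch, assuming `$model` points at the corresponding GGUF file):

```bash
# Run the TG-only benchmark at several thread counts
for t in 2 4 8; do
    ./bin/llama-bench -m $model -p 0 -n 128 -t $t -ngl 0
done
```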
| Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
|---|---|---|---|---|---|---|
| 8B F16 | 14.96 GiB | AVX2 | 1 | 2.20 ± 0.00 | 2.25 ± 0.00 | 1.023 |
| | | | 2 | 3.63 ± 0.00 | 3.68 ± 0.00 | 1.014 |
| | | | 4 | 4.20 ± 0.00 | 4.20 ± 0.00 | 1.000 |
| 7B F16 | 12.55 GiB | NEON | 2 | 6.94 ± 0.27 | 7.40 ± 0.01 | 1.066 |
| | | | 4 | 8.73 ± 0.01 | 8.83 ± 0.01 | 1.011 |
| | | | 6 | 9.05 ± 0.02 | 9.05 ± 0.01 | 1.000 |
| 8B Q8_0 | 7.95 GiB | AVX2 | 2 | 5.03 ± 0.00 | 7.87 ± 0.00 | 1.565 |
| | | | 4 | 7.40 ± 0.00 | 7.82 ± 0.00 | 1.057 |
| 7B Q8_0 | 6.67 GiB | NEON | 2 | 8.29 ± 0.44 | 12.07 ± 0.10 | 1.456 |
| | | | 4 | 13.53 ± 0.03 | 15.77 ± 0.08 | 1.166 |
| | | | 8 | 16.24 ± 0.10 | 16.94 ± 0.04 | 1.043 |
| 8B Q4_0 | 4.35 GiB | AVX2 | 2 | 6.36 ± 0.00 | 10.28 ± 0.00 | 1.616 |
| | | | 4 | 10.97 ± 0.06 | 13.55 ± 0.07 | 1.235 |
| 7B Q4_0 | 3.57 GiB | NEON | 2 | 9.77 ± 0.02 | 13.69 ± 0.03 | 1.401 |
| | | | 4 | 17.82 ± 0.06 | 23.98 ± 0.11 | 1.346 |
| | | | 8 | 26.63 ± 0.41 | 29.86 ± 0.04 | 1.121 |
| 8B Q4_1 | 4.77 GiB | AVX2 | 2 | 5.11 ± 0.00 | 11.45 ± 0.00 | 2.241 |
| | | | 4 | 9.08 ± 0.02 | 12.58 ± 0.00 | 1.385 |
| 7B Q4_1 | 3.95 GiB | NEON | 2 | 9.11 ± 0.06 | 14.62 ± 0.04 | 1.605 |
| | | | 4 | 17.04 ± 0.09 | 24.08 ± 0.28 | 1.413 |
| | | | 8 | 25.26 ± 0.24 | 27.23 ± 0.14 | 1.078 |
| 8B Q5_0 | 5.22 GiB | AVX2 | 2 | 5.31 ± 0.01 | 8.30 ± 0.01 | 1.563 |
| | | | 4 | 9.40 ± 0.01 | 11.47 ± 0.00 | 1.220 |
| 7B Q5_0 | 4.34 GiB | NEON | 2 | 7.26 ± 0.06 | 7.52 ± 0.00 | 1.036 |
| | | | 4 | 13.63 ± 0.18 | 14.16 ± 0.10 | 1.039 |
| | | | 8 | 22.55 ± 0.35 | 24.34 ± 0.22 | 1.079 |
| 8B Q5_1 | 5.64 GiB | AVX2 | 2 | 4.52 ± 0.00 | 8.86 ± 0.00 | 1.960 |
| | | | 4 | 7.72 ± 0.05 | 10.68 ± 0.03 | 1.383 |
| 7B Q5_1 | 4.72 GiB | NEON | 2 | 6.51 ± 0.01 | 6.42 ± 0.03 | 0.986 |
| | | | 4 | 12.26 ± 0.18 | 12.21 ± 0.14 | 0.996 |
| | | | 8 | 20.33 ± 0.52 | 21.85 ± 0.22 | 1.075 |
| 8B Q2_K_S | 2.78 GiB | AVX2 | 2 | 11.30 ± 0.00 | 13.06 ± 0.01 | 1.156 |
| | | | 4 | 18.70 ± 0.00 | 19.04 ± 0.65 | 1.014 |
| 7B Q2_K_S | 2.16 GiB | NEON | 2 | 8.42 ± 0.05 | 11.97 ± 0.10 | 1.422 |
| | | | 4 | 15.74 ± 0.01 | 22.09 ± 0.08 | 1.403 |
| | | | 8 | 27.35 ± 0.05 | 38.32 ± 0.05 | 1.401 |
| 8B Q3_K_S | 3.41 GiB | AVX2 | 2 | 8.58 ± 0.00 | 10.82 ± 0.00 | 1.261 |
| | | | 4 | 15.26 ± 0.01 | 16.25 ± 0.01 | 1.065 |
| 7B Q3_K_S | 2.75 GiB | NEON | 2 | 6.40 ± 0.02 | 9.12 ± 0.09 | 1.425 |
| | | | 4 | 12.17 ± 0.00 | 17.11 ± 0.03 | 1.406 |
| | | | 8 | 22.04 ± 0.08 | 31.39 ± 0.31 | 1.424 |
| 8B Q4_K_S | 4.36 GiB | AVX2 | 2 | 9.61 ± 0.00 | 10.72 ± 0.01 | 1.116 |
| | | | 4 | 13.24 ± 0.31 | 13.28 ± 0.01 | 1.003 |
| 7B Q4_K_S | 3.59 GiB | NEON | 2 | 11.15 ± 0.05 | 12.93 ± 0.09 | 1.160 |
| | | | 4 | 20.24 ± 0.16 | 23.49 ± 0.29 | 1.161 |
| | | | 8 | 25.76 ± 0.07 | 28.31 ± 0.22 | 1.099 |
| 8B Q5_K_S | 5.21 GiB | AVX2 | 2 | 7.45 ± 0.00 | 9.73 ± 0.00 | 1.306 |
| | | | 4 | 11.05 ± 0.33 | 11.43 ± 0.02 | 1.034 |
| 7B Q5_K_S | 4.33 GiB | NEON | 2 | 7.20 ± 0.04 | 8.81 ± 0.04 | 1.224 |
| | | | 4 | 13.62 ± 0.15 | 16.81 ± 0.16 | 1.234 |
| | | | 8 | 20.56 ± 0.19 | 23.96 ± 0.14 | 1.165 |
| 8B Q6_K | 6.14 GiB | AVX2 | 2 | 7.53 ± 0.00 | 9.42 ± 0.00 | 1.251 |
| | | | 4 | 9.74 ± 0.00 | 9.97 ± 0.01 | 1.024 |
| 7B Q6_K | 5.15 GiB | NEON | 2 | 6.85 ± 0.04 | 8.30 ± 0.06 | 1.212 |
| | | | 4 | 13.03 ± 0.05 | 15.47 ± 0.17 | 1.187 |
| | | | 8 | 18.52 ± 0.07 | 20.67 ± 0.08 | 1.116 |
| 8B IQ2_XXS | 2.23 GiB | AVX2 | 2 | 5.33 ± 0.01 | 6.40 ± 0.00 | 1.201 |
| | | | 4 | 10.06 ± 0.03 | 11.76 ± 0.03 | 1.169 |
| 7B IQ2_XXS | 1.73 GiB | NEON | 2 | 5.07 ± 0.04 | 5.22 ± 0.05 | 1.030 |
| | | | 4 | 9.63 ± 0.00 | 9.91 ± 0.07 | 1.029 |
| | | | 8 | 17.40 ± 0.50 | 18.65 ± 0.22 | 1.072 |
| 8B IQ2_XS | 2.42 GiB | AVX2 | 2 | 5.83 ± 0.00 | 6.55 ± 0.00 | 1.123 |
| | | | 4 | 10.88 ± 0.09 | 12.07 ± 0.07 | 1.109 |
| 7B IQ2_XS | 1.89 GiB | NEON | 2 | 5.52 ± 0.01 | 5.60 ± 0.00 | 1.014 |
| | | | 4 | 10.50 ± 0.01 | 11.15 ± 0.00 | 1.062 |
| | | | 8 | 18.19 ± 1.30 | 20.94 ± 0.19 | 1.151 |
| 8B IQ2_M | 2.74 GiB | AVX2 | 2 | 5.12 ± 0.01 | 5.17 ± 0.00 | 1.010 |
| | | | 4 | 9.60 ± 0.28 | 9.68 ± 0.16 | 1.008 |
| 7B IQ2_M | 2.20 GiB | NEON | 2 | 3.73 ± 0.02 | 4.53 ± 0.00 | 1.214 |
| | | | 4 | 7.14 ± 0.05 | 8.70 ± 0.06 | 1.218 |
| | | | 8 | 11.99 ± 0.48 | 16.41 ± 0.05 | 1.369 |
| 8B IQ3_XXS | 3.04 GiB | AVX2 | 2 | 4.06 ± 0.01 | 5.00 ± 0.00 | 1.232 |
| | | | 4 | 7.75 ± 0.02 | 9.13 ± 0.45 | 1.178 |
| 7B IQ3_XXS | 2.41 GiB | NEON | 2 | 3.53 ± 0.00 | 3.82 ± 0.00 | 1.082 |
| | | | 4 | 6.74 ± 0.04 | 7.42 ± 0.07 | 1.103 |
| | | | 8 | 11.96 ± 0.40 | 13.19 ± 0.29 | 1.103 |
| 8B IQ3_S | 3.42 GiB | AVX2 | 2 | 3.62 ± 0.00 | 4.06 ± 0.00 | 1.122 |
| | | | 4 | 6.80 ± 0.01 | 7.62 ± 0.10 | 1.121 |
| 7B IQ3_S | 2.75 GiB | NEON | 2 | 2.96 ± 0.01 | 3.21 ± 0.03 | 1.084 |
| | | | 4 | 5.68 ± 0.01 | 6.25 ± 0.05 | 1.100 |
| | | | 8 | 10.32 ± 0.25 | 11.11 ± 0.37 | 1.077 |
| 8B IQ4_XS | 4.13 GiB | AVX2 | 2 | 8.08 ± 0.00 | 11.35 ± 0.00 | 1.405 |
| | | | 4 | 13.36 ± 0.72 | 14.32 ± 0.24 | 1.072 |
| 7B IQ4_XS | 3.37 GiB | NEON | 2 | 9.87 ± 0.03 | 12.06 ± 0.00 | 1.222 |
| | | | 4 | 17.78 ± 0.23 | 22.06 ± 0.28 | 1.241 |
| | | | 8 | 27.62 ± 0.09 | 29.70 ± 0.39 | 1.075 |
| 8B IQ4_NL | 4.35 GiB | AVX2 | 2 | 5.52 ± 0.00 | 10.26 ± 0.00 | 1.859 |
| | | | 4 | 10.78 ± 0.01 | 13.69 ± 0.08 | 1.270 |
| 7B IQ4_NL | 3.56 GiB | NEON | 2 | 8.32 ± 0.01 | 13.54 ± 0.01 | 1.627 |
| | | | 4 | 15.89 ± 0.00 | 24.28 ± 0.29 | 1.528 |
| | | | 8 | 26.56 ± 0.36 | 29.87 ± 0.08 | 1.125 |
Here the gains are generally lower than for PP (prompt processing) because TG performance is limited by memory bandwidth. Nevertheless, for some combinations of quantization type, architecture, and thread count the speedup is quite remarkable (e.g., almost a factor of 2 for `Q5_1` on `AVX2` with 2 threads).