-
Notifications
You must be signed in to change notification settings - Fork 32
July 2024: Prompt processing perfomance comparison
The results in the following tables are obtained with these parameters:
- Model is LLaMA-v3-8B for
AVX2
and LLaMA-v2-7B forARM_NEON
- The
AVX2
CPU is a 16-core Ryzen-7950X - The
ARM_NEON
CPU is M2-Max -
tinyBLAS
is enabled inllama.cpp
-
llama.cpp
results are forbuild: 081fe431 (3441)
, which was the currentllama.cpp
master branch when I pulled on July 23 2024. - The projects are built without
CUDA
support, noBLAS
, and Accelerate framework disabled
Here I set the number of threads to be equal to the number of (performance) cores of the CPU, so 16 threads for the Ryzen-7950X and 8 threads for the M2-Max. The following table summarizes the results. To not make the table too long, I have listed only quantized models containing predominantly one quantization type (i.e., excluded the QX_K - Medium/Large
variants, which are typically a mix of QX_K
and Q(X+1)_K
, as well as IQ2_S
and IQ3_XS
).
The command line to generate the benchmark data is
./bin/llama-bench -m $model -p 512 -n 0 -t $num_threads -ngl 0
Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
---|---|---|---|---|---|---|
8B F16 | 14.96 GiB | AVX2 | 16 | 112.37 ± 0.40 | 131.27 ± 0.38 | 1.168 |
7B F16 | 12.55 GiB | NEON | 8 | 90.28 ± 1.25 | 95.34 ± 0.15 | 1.056 |
8B Q8_0 | 7.95 GiB | AVX2 | 16 | 118.07 ± 0.53 | 134.00 ± 0.47 | 1.135 |
7B Q8_0 | 6.67 GiB | NEON | 8 | 77.25 ± 1.81 | 94.14 ± 1.15 | 1.219 |
8B Q4_0 | 4.35 GiB | AVX2 | 16 | 104.46 ± 0.33 | 130.20 ± 0.29 | 1.246 |
7B Q4_0 | 3.57 GiB | NEON | 8 | 65.46 ± 0.79 | 76.22 ± 0.71 | 1.164 |
8B Q4_1 | 4.77 GiB | AVX2 | 16 | 57.83 ± 0.24 | 160.69 ± 0.49 | 2.779 |
7B Q4_1 | 3.95 GiB | NEON | 8 | 37.40 ± 0.50 | 65.83 ± 0.98 | 1.760 |
8B Q5_0 | 5.22 GiB | AVX2 | 16 | 53.50 ± 0.35 | 122.62 ± 0.48 | 2.292 |
7B Q5_0 | 4.34 GiB | NEON | 8 | 29.31 ± 0.51 | 67.51 ± 1.17 | 2.303 |
8B Q5_1 | 5.64 GiB | AVX2 | 16 | 50.85 ± 0.36 | 147.15 ± 0.47 | 2.894 |
7B Q5_1 | 4.72 GiB | NEON | 8 | 26.02 ± 0.37 | 58.49 ± 0.85 | 2.248 |
8B Q2_K_S | 2.78 GiB | AVX2 | 16 | 110.11 ± 0.28 | 192.47 ± 1.35 | 1.748 |
7B Q2_K_S | 2.16 GiB | NEON | 8 | 35.44 ± 0.06 | 77.93 ± 1.64 | 2.199 |
8B Q3_K_S | 3.41 GiB | AVX2 | 16 | 77.42 ± 0.36 | 181.64 ± 0.44 | 2.346 |
7B Q3_K_S | 2.75 GiB | NEON | 8 | 26.79 ± 0.03 | 59.38 ± 1.08 | 2.216 |
8B Q4_K_S | 4.36 GiB | AVX2 | 16 | 98.92 ± 0.34 | 185.35 ± 0.39 | 1.874 |
7B Q4_K_S | 3.59 GiB | NEON | 8 | 46.55 ± 0.67 | 76.31 ± 0.38 | 1.639 |
8B Q5_K_S | 5.21 GiB | AVX2 | 16 | 69.44 ± 0.31 | 179.62 ± 0.69 | 2.587 |
7B Q5_K_S | 4.33 GiB | NEON | 8 | 30.18 ± 0.23 | 65.34 ± 0.79 | 2.165 |
8B Q6_K | 6.14 GiB | AVX2 | 16 | 74.89 ± 0.26 | 181.86 ± 0.55 | 2.428 |
7B Q6_K | 5.15 GiB | NEON | 8 | 28.12 ± 1.24 | 60.75 ± 1.15 | 2.160 |
8B IQ2_XXS | 2.23 GiB | AVX2 | 16 | 42.57 ± 0.16 | 126.63 ± 0.55 | 2.975 |
7B IQ2_XXS | 1.73 GiB | NEON | 8 | 20.87 ± 0.20 | 64.29 ± 1.12 | 3.080 |
8B IQ2_XS | 2.42 GiB | AVX2 | 16 | 46.45 ± 0.27 | 125.46 ± 0.43 | 2.701 |
7B IQ2_XS | 1.89 GiB | NEON | 8 | 22.77 ± 0.21 | 51.15 ± 0.24 | 2.246 |
8B IQ2_M | 2.74 GiB | AVX2 | 16 | 40.76 ± 0.18 | 113.07 ± 0.48 | 2.774 |
7B IQ2_M | 2.20 GiB | NEON | 8 | 14.95 ± 0.26 | 44.87 ± 0.50 | 3.001 |
8B IQ3_XXS | 3.04 GiB | AVX2 | 16 | 31.95 ± 0.20 | 109.86 ± 0.45 | 3.438 |
7B IQ3_XXS | 2.41 GiB | NEON | 8 | 14.40 ± 0.10 | 53.58 ± 0.85 | 3.721 |
8B IQ3_S | 3.42 GiB | AVX2 | 16 | 28.04 ± 0.08 | 96.28 ± 0.45 | 3.434 |
7B IQ3_S | 2.75 GiB | NEON | 8 | 12.08 ± 0.30 | 49.72 ± 0.06 | 4.116 |
8B IQ4_XS | 4.13 GiB | AVX2 | 16 | 68.98 ± 0.31 | 180.34 ± 0.55 | 2.614 |
7B IQ4_XS | 3.37 GiB | NEON | 8 | 40.67 ± 1.97 | 75.11 ± 1.97 | 1.847 |
8B IQ4_NL | 4.35 GiB | AVX2 | 16 | 59.94 ± 0.21 | 129.06 ± 0.43 | 2.153 |
7B IQ4_NL | 3.56 GiB | NEON | 8 | 34.36 ± 0.81 | 76.02 ± 1.36 | 2.212 |
We see that llama.cpp
achieves respectable performance for fp16
, Q8_0
, and Q4_0
, being only up to 25% slower than this implementation. This is thanks to the use of Justine Tunney's tinyBLAS
, which is utilized for these quantization types. For all other quants we observe performance gains in the 1.75X - 4X
range, which is not a small feat considering that the ggml
matrix multiplication functions has been rewritten several times since llama.cpp
was first published. Performance gains are larger for i-quants due to the higher quant unpacking cost (see discussion in "To tile or not to tile")