
July 2024: Prompt processing performance comparison


Performance comparison to llama.cpp

The results in the following tables are obtained with these parameters:

  • Model is LLaMA-v3-8B for AVX2 and LLaMA-v2-7B for ARM_NEON
  • The AVX2 CPU is a 16-core Ryzen-7950X
  • The ARM_NEON CPU is M2-Max
  • tinyBLAS is enabled in llama.cpp
  • llama.cpp results are for build: 081fe431 (3441), which was the current llama.cpp master branch when I pulled on July 23, 2024.
  • The projects are built without CUDA support, no BLAS, and with the Accelerate framework disabled (a possible build command is sketched below this list).
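
For reference, a CPU-only build of this kind can be reproduced roughly as follows. This is only a sketch: the CMake option names shown (GGML_CUDA, GGML_BLAS, GGML_ACCELERATE, and GGML_LLAMAFILE, which controls tinyBLAS) are my assumption for llama.cpp builds of that period and may differ in other versions, so check the project's CMakeLists.txt if they do not match.

# CPU-only build: CUDA, BLAS and Accelerate off, tinyBLAS (llamafile sgemm) on
cmake -B build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_ACCELERATE=OFF -DGGML_LLAMAFILE=ON
cmake --build build --config Release -j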

Prompt processing

Here I set the number of threads equal to the number of (performance) cores of the CPU, so 16 threads for the Ryzen-7950X and 8 threads for the M2-Max. The following table summarizes the results. To avoid making the table too long, I have listed only quantized models containing predominantly one quantization type (i.e., I excluded the QX_K Medium/Large variants, which are typically a mix of QX_K and Q(X+1)_K, as well as IQ2_S and IQ3_XS).

The command line to generate the benchmark data is

./bin/llama-bench -m $model -p 512 -n 0 -t $num_threads -ngl 0
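
To cover all quantizations in the table, the same command is simply repeated for each quantized model. A minimal sketch of such a loop, assuming the GGUF files sit in a directory $MODEL_DIR and follow a llama-3-8b-$quant.gguf naming scheme (both of which are hypothetical; adjust to your file layout):

#!/bin/bash
# Run llama-bench for each quantization listed below; MODEL_DIR and the file
# naming are placeholders, only the llama-bench flags match the command above.
MODEL_DIR=${MODEL_DIR:-models}
THREADS=16   # 16 on the Ryzen-7950X, 8 on the M2-Max
for quant in f16 q8_0 q4_0 q4_1 q5_0 q5_1 q2_k_s q3_k_s q4_k_s q5_k_s q6_k \
             iq2_xxs iq2_xs iq2_m iq3_xxs iq3_s iq4_xs iq4_nl; do
    model=$MODEL_DIR/llama-3-8b-$quant.gguf
    [ -f "$model" ] || continue
    ./bin/llama-bench -m "$model" -p 512 -n 0 -t $THREADS -ngl 0
done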
| Quantization | size | backend | threads | t/s (llama.cpp) | t/s (iqk_mul_mat) | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| 8B F16 | 14.96 GiB | AVX2 | 16 | 112.37 ± 0.40 | 131.27 ± 0.38 | 1.168 |
| 7B F16 | 12.55 GiB | NEON | 8 | 90.28 ± 1.25 | 95.34 ± 0.15 | 1.056 |
| 8B Q8_0 | 7.95 GiB | AVX2 | 16 | 118.07 ± 0.53 | 134.00 ± 0.47 | 1.135 |
| 7B Q8_0 | 6.67 GiB | NEON | 8 | 77.25 ± 1.81 | 94.14 ± 1.15 | 1.219 |
| 8B Q4_0 | 4.35 GiB | AVX2 | 16 | 104.46 ± 0.33 | 130.20 ± 0.29 | 1.246 |
| 7B Q4_0 | 3.57 GiB | NEON | 8 | 65.46 ± 0.79 | 76.22 ± 0.71 | 1.164 |
| 8B Q4_1 | 4.77 GiB | AVX2 | 16 | 57.83 ± 0.24 | 160.69 ± 0.49 | 2.779 |
| 7B Q4_1 | 3.95 GiB | NEON | 8 | 37.40 ± 0.50 | 65.83 ± 0.98 | 1.760 |
| 8B Q5_0 | 5.22 GiB | AVX2 | 16 | 53.50 ± 0.35 | 122.62 ± 0.48 | 2.292 |
| 7B Q5_0 | 4.34 GiB | NEON | 8 | 29.31 ± 0.51 | 67.51 ± 1.17 | 2.303 |
| 8B Q5_1 | 5.64 GiB | AVX2 | 16 | 50.85 ± 0.36 | 147.15 ± 0.47 | 2.894 |
| 7B Q5_1 | 4.72 GiB | NEON | 8 | 26.02 ± 0.37 | 58.49 ± 0.85 | 2.248 |
| 8B Q2_K_S | 2.78 GiB | AVX2 | 16 | 110.11 ± 0.28 | 192.47 ± 1.35 | 1.748 |
| 7B Q2_K_S | 2.16 GiB | NEON | 8 | 35.44 ± 0.06 | 77.93 ± 1.64 | 2.199 |
| 8B Q3_K_S | 3.41 GiB | AVX2 | 16 | 77.42 ± 0.36 | 181.64 ± 0.44 | 2.346 |
| 7B Q3_K_S | 2.75 GiB | NEON | 8 | 26.79 ± 0.03 | 59.38 ± 1.08 | 2.216 |
| 8B Q4_K_S | 4.36 GiB | AVX2 | 16 | 98.92 ± 0.34 | 185.35 ± 0.39 | 1.874 |
| 7B Q4_K_S | 3.59 GiB | NEON | 8 | 46.55 ± 0.67 | 76.31 ± 0.38 | 1.639 |
| 8B Q5_K_S | 5.21 GiB | AVX2 | 16 | 69.44 ± 0.31 | 179.62 ± 0.69 | 2.587 |
| 7B Q5_K_S | 4.33 GiB | NEON | 8 | 30.18 ± 0.23 | 65.34 ± 0.79 | 2.165 |
| 8B Q6_K | 6.14 GiB | AVX2 | 16 | 74.89 ± 0.26 | 181.86 ± 0.55 | 2.428 |
| 7B Q6_K | 5.15 GiB | NEON | 8 | 28.12 ± 1.24 | 60.75 ± 1.15 | 2.160 |
| 8B IQ2_XXS | 2.23 GiB | AVX2 | 16 | 42.57 ± 0.16 | 126.63 ± 0.55 | 2.975 |
| 7B IQ2_XXS | 1.73 GiB | NEON | 8 | 20.87 ± 0.20 | 64.29 ± 1.12 | 3.080 |
| 8B IQ2_XS | 2.42 GiB | AVX2 | 16 | 46.45 ± 0.27 | 125.46 ± 0.43 | 2.701 |
| 7B IQ2_XS | 1.89 GiB | NEON | 8 | 22.77 ± 0.21 | 51.15 ± 0.24 | 2.246 |
| 8B IQ2_M | 2.74 GiB | AVX2 | 16 | 40.76 ± 0.18 | 113.07 ± 0.48 | 2.774 |
| 7B IQ2_M | 2.20 GiB | NEON | 8 | 14.95 ± 0.26 | 44.87 ± 0.50 | 3.001 |
| 8B IQ3_XXS | 3.04 GiB | AVX2 | 16 | 31.95 ± 0.20 | 109.86 ± 0.45 | 3.438 |
| 7B IQ3_XXS | 2.41 GiB | NEON | 8 | 14.40 ± 0.10 | 53.58 ± 0.85 | 3.721 |
| 8B IQ3_S | 3.42 GiB | AVX2 | 16 | 28.04 ± 0.08 | 96.28 ± 0.45 | 3.434 |
| 7B IQ3_S | 2.75 GiB | NEON | 8 | 12.08 ± 0.30 | 49.72 ± 0.06 | 4.116 |
| 8B IQ4_XS | 4.13 GiB | AVX2 | 16 | 68.98 ± 0.31 | 180.34 ± 0.55 | 2.614 |
| 7B IQ4_XS | 3.37 GiB | NEON | 8 | 40.67 ± 1.97 | 75.11 ± 1.97 | 1.847 |
| 8B IQ4_NL | 4.35 GiB | AVX2 | 16 | 59.94 ± 0.21 | 129.06 ± 0.43 | 2.153 |
| 7B IQ4_NL | 3.56 GiB | NEON | 8 | 34.36 ± 0.81 | 76.02 ± 1.36 | 2.212 |
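
The Speedup column is simply the ratio of the two throughput columns, i.e., t/s (iqk_mul_mat) divided by t/s (llama.cpp); for the 8B F16 row on AVX2, for instance, 131.27 / 112.37 ≈ 1.168.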

We see that llama.cpp achieves respectable performance for fp16, Q8_0, and Q4_0, being only up to 25% slower than this implementation. This is thanks to Justine Tunney's tinyBLAS, which llama.cpp uses for these quantization types. For all other quants we observe performance gains in the 1.75X - 4X range, which is no small feat considering that the ggml matrix multiplication functions have been rewritten several times since llama.cpp was first published. Performance gains are larger for i-quants due to their higher quant unpacking cost (see the discussion in "To tile or not to tile").