The measured GEMM results presented here show a clear performance gap between OpenBLAS and Accelerate. I believe the main reason for the gap is that OpenBLAS runs GEMM on the CPU cores, whereas Accelerate may offload GEMM to the SME (Scalable Matrix Extension) unit. A simple experiment supports this: increase the number of threads in the benchmark code to 2, 3, and 4. With OpenBLAS, I observed performance increase in proportion to the thread count. With Accelerate, performance stayed the same as threads were added, which is what one would expect if a single thread already saturates a shared execution unit. I therefore think OpenBLAS and Accelerate use different execution units on Apple silicon.
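The thread-scaling experiment described above could be sketched roughly as follows. This is not the original benchmark code; it is a minimal harness under my own assumptions. The names `gemm_flops`, `measure_throughput`, and `scaling_table` are hypothetical, and `gemm` is any zero-argument callable that performs one GEMM with whichever BLAS library is being tested. Each thread issues its own GEMM calls, and the aggregate throughput is compared against the single-thread baseline.

```python
import threading
import time


def gemm_flops(m, n, k):
    """Floating-point operations in one C = A @ B, with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def measure_throughput(gemm, flops_per_call, num_threads, calls_per_thread=10):
    """Run `gemm` concurrently in `num_threads` threads; return aggregate GFLOP/s.

    `gemm` is a hypothetical zero-argument callable supplied by the caller.
    """
    # The extra party is the main thread, so timing starts when all workers are ready.
    barrier = threading.Barrier(num_threads + 1)

    def worker():
        barrier.wait()
        for _ in range(calls_per_thread):
            gemm()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    barrier.wait()
    start = time.perf_counter()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start

    total_flops = flops_per_call * calls_per_thread * num_threads
    return total_flops / elapsed / 1e9


def scaling_table(gemm, flops_per_call, thread_counts=(1, 2, 3, 4)):
    """Map each thread count to (GFLOP/s, speedup relative to the first count)."""
    results = {t: measure_throughput(gemm, flops_per_call, t) for t in thread_counts}
    base = results[thread_counts[0]]
    return {t: (gflops, gflops / base) for t, gflops in results.items()}
```

One way to use it, assuming NumPy is installed, is to pass `lambda: a @ b` for two preallocated float32 matrices and `gemm_flops(n, n, n)` as the operation count. Which BLAS that exercises depends on how NumPy was built (`numpy.show_config()` reports it), and the library's own internal threading should be held at one thread (for OpenBLAS, via the `OPENBLAS_NUM_THREADS` environment variable) so that only the harness threads vary. A speedup column close to 1, 2, 3, 4 would match the OpenBLAS behavior described above, while a flat column would match the Accelerate behavior.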