GEMM performance: OpenBLAS vs Accelerate #132

The measured GEMM results presented [here](https://julialinearalgebra.github.io/AppleAccelerate.jl/dev/benchmarks/#GEMM-(mul!)-%E2%80%94-GFLOPS-(higher-is-better)) show a clear performance gap between OpenBLAS and Accelerate. I believe the main reason for this large difference is that OpenBLAS runs GEMM on the CPU cores, whereas Accelerate may use the SME engine. A simple experiment supports this: set the number of threads in [the benchmark code](https://github.com/JuliaLinearAlgebra/AppleAccelerate.jl/blob/master/test/bench/bench_dense.jl) to 2, 3, and 4. With OpenBLAS, I observed performance increasing in proportion to the thread count. With Accelerate, performance stayed the same as the thread count grew. I therefore think OpenBLAS and Accelerate use different execution units on Apple silicon.
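
The thread-scaling argument above can be made concrete with a little arithmetic. This is a minimal sketch, assuming the conventional 2·m·n·k flop count for a GEMM; the helper names `gemm_gflops` and `scaling_efficiency` are hypothetical and are not part of the linked benchmark code.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for one C = A*B with A (m x k) and B (k x n).

    Assumes the conventional 2*m*n*k floating-point operation count.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * m * n * k / seconds / 1e9


def scaling_efficiency(gflops_by_threads):
    """Map thread count -> (speedup over the 1-thread run) / thread count.

    Values near 1.0 mean throughput grows in proportion to the threads;
    values near 1/threads mean throughput did not change at all.
    The input dict must contain an entry for 1 thread.
    """
    base = gflops_by_threads[1]
    return {t: (g / base) / t for t, g in sorted(gflops_by_threads.items())}
```

Feeding it the GFLOPS measured at 1, 2, 3, and 4 threads for each library should give efficiencies near 1.0 for a backend that scales with CPU threads, and values falling toward 1/t for one whose throughput stays flat.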
