With no code or hardware changes at all, and no package updates, there is a 2x performance regression after a month; OpenBLAS is also a bit slower:
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 224.000 GFLOP/s
Theoretical peak multi: 4032.000 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.
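For reference, the header figures above can be re-derived from M = N = K = 1920, assuming single-precision GEMM and counting only the two input matrices toward bytes read (a minimal sketch; the 9.440 ms OpenBLAS average from below is plugged in to check the GFLOP/s math):

```python
# Minimal sketch re-deriving the benchmark header figures.
# Assumes float32 GEMM; "required bytes" counts only the two
# input matrices A and B, which matches the 29.491 MB above.
M = N = K = 1920

flops = 2 * M * N * K                 # 2*M*N*K FLOPs for C = A*B
bytes_read = (M * K + K * N) * 4      # A and B as float32
intensity = flops / bytes_read        # FLOP per byte

print(flops / 1e6)        # 14155.776 millions of operations
print(bytes_read / 1e6)   # 29.4912 MB
print(intensity)          # 480.0 FLOP/byte

# Perf = FLOPs / time, e.g. with the OpenBLAS average time below:
print(flops / 9.440e-3 / 1e9)  # ~1499.6 GFLOP/s
```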
OpenBLAS benchmark
Collected 10 samples in 0.101 seconds
Average time: 9.440 ms
Stddev time: 0.141 ms
Min time: 9.315 ms
Max time: 9.733 ms
Perf: 1499.508 GFLOP/s
Laser production implementation
Collected 10 samples in 0.146 seconds
Average time: 14.000 ms
Stddev time: 25.706 ms
Min time: 5.839 ms
Max time: 87.161 ms
Perf: 1011.102 GFLOP/s
PyTorch Glow: libjit matmul implementation (with AVX+FMA)
Collected 10 samples in 2.041 seconds
Average time: 204.123 ms
Stddev time: 0.763 ms
Min time: 203.362 ms
Max time: 205.862 ms
Perf: 69.349 GFLOP/s
MKL-DNN reference GEMM benchmark
Collected 10 samples in 0.351 seconds
Average time: 34.305 ms
Stddev time: 5.588 ms
Min time: 30.013 ms
Max time: 49.684 ms
Perf: 412.645 GFLOP/s
MKL-DNN JIT AVX benchmark
Collected 10 samples in 0.130 seconds
Average time: 11.230 ms
Stddev time: 8.353 ms
Min time: 7.725 ms
Max time: 34.426 ms
Perf: 1260.573 GFLOP/s
MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.083 seconds
Average time: 7.716 ms
Stddev time: 7.932 ms
Min time: 4.601 ms
Max time: 30.078 ms
Perf: 1834.643 GFLOP/s
Mean Relative Error compared to vendor BLAS: 3.045843413929106e-06
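(A minimal sketch of how such a correctness check can be done, assuming float32 outputs and the vendor-BLAS result as the reference; ~3e-6 is in the expected range for float32 accumulation over K = 1920, so the results themselves are still correct:)

```python
import numpy as np

def mean_relative_error(c, c_ref, eps=1e-12):
    # Elementwise |c - c_ref| / |c_ref|, averaged over all entries;
    # eps guards against division by zero for zero reference entries.
    return np.mean(np.abs(c - c_ref) / (np.abs(c_ref) + eps))
```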
I suspect an issue with GNU OpenMP (libgomp). (MKL-DNN is linked against Intel OpenMP.)
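To confirm which OpenMP runtime each benchmark binary actually loads, something like the following works on Linux (the binary path is a placeholder; point it at the real executable):

```python
# Check which OpenMP runtime a benchmark binary pulls in.
# "./bench_gemm" is a hypothetical path, not the repo's real binary name.
import subprocess

out = subprocess.run(["ldd", "./bench_gemm"],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    if "omp" in line:  # libgomp.so = GNU OpenMP, libiomp5.so = Intel OpenMP
        print(line.strip())
```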