With #20, the parallel schedule seems to scale perfectly on many cores:
```
$ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 230.400 GFLOP/s
Theoretical peak multi: 4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 1.238 seconds
Average time: 123.713 ms
Stddev time: 0.444 ms
Min time: 123.335 ms
Max time: 124.890 ms
Perf: 114.425 GFLOP/s

Laser production implementation
Collected 10 samples in 1.465 seconds
Average time: 146.392 ms
Stddev time: 0.644 ms
Min time: 146.006 ms
Max time: 147.802 ms
Perf: 96.697 GFLOP/s

Mean Relative Error compared to OpenBLAS: 1.243059557509696e-07
------------------------------------------------------------
$ ./build/gemm_f32_omp
Warmup: 0.9021 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 230.400 GFLOP/s
Theoretical peak multi: 4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.079 seconds
Average time: 7.739 ms
Stddev time: 4.368 ms
Min time: 6.020 ms
Max time: 20.097 ms
Perf: 1829.200 GFLOP/s

Laser production implementation
Collected 10 samples in 0.083 seconds
Average time: 8.126 ms
Stddev time: 4.777 ms
Min time: 6.241 ms
Max time: 21.632 ms
Perf: 1742.123 GFLOP/s

Mean Relative Error compared to OpenBLAS: 0.01456451416015625
```
The scaling matches 96.7 GFLOP/s * 18 cores = 1740 GFLOP/s on my machine.
However, the single-threaded implementation still quite often falls below OpenBLAS.
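For reference, the shape-derived numbers in the log headers above (operation count, byte traffic, arithmetic intensity) follow directly from the problem size. A minimal sketch, assuming the byte count covers only the two input matrices A and B (which is what matches the printed figures):

```c
/* Reproduce the header numbers printed by the benchmarks above for an
 * f32 GEMM with M = N = K = 1920. "Required bytes" counts only the two
 * input matrices A (M x K) and B (K x N), not the output. */
void gemm_stats(long m, long n, long k,
                double *flops, double *bytes, double *ai)
{
    const long elem = 4;                       /* sizeof(float) */
    *flops = 2.0 * m * n * k;                  /* one mul + one add per inner iteration */
    *bytes = (double)((m * k + k * n) * elem); /* A and B inputs only */
    *ai    = *flops / *bytes;                  /* FLOP per byte of input read */
}
```

With M = N = K = 1920 this gives 14155.776 million FLOP, 29.491 MB, and 480 FLOP/byte, matching the logs.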
Causes:
- To fix regressions in #20 (Improve gemm threading), interleaving the load of the next A micro-panel with the computation on the current A micro-panel had to be removed; it is currently commented out: https://github.com/numforge/laser/blob/ebb01ad40f30d495f0f4b02ef1ff49c3f54230cd/laser/primitives/matrix_multiplication/gemm_ukernel_generator.nim#L237-L242. It should be reintroduced.
- `mc` and `kc` should be tuned depending on the available L1 and L2 caches and the TLB.
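The removed interleaving can be sketched roughly as follows. This is an illustrative C sketch, not laser's actual code: the names (`ukernel`, `MR`, `NR`, `KC`), the naive stand-in micro-kernel, and the use of `__builtin_prefetch` are all assumptions. The idea is to issue prefetches for the next packed A micro-panel before computing on the current one, so the loads overlap with the FMA work:

```c
#include <stddef.h>

enum { MR = 4, NR = 4, KC = 64 };

/* Naive stand-in micro-kernel: C_tile (MR x NR) += A_panel * B_panel.
 * A is packed so that the MR values for each k are contiguous. */
static void ukernel(const float *a, const float *b, float *c)
{
    for (size_t k = 0; k < KC; ++k)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                c[i * NR + j] += a[k * MR + i] * b[k * NR + j];
}

/* Loop over A micro-panels: before computing on panel i, prefetch panel
 * i + 1 so its cache lines are in flight while the kernel runs. */
void gemm_mloop(const float *packed_a, const float *b_panel, float *c,
                size_t n_panels)
{
    for (size_t i = 0; i < n_panels; ++i) {
        if (i + 1 < n_panels) {
            const float *next = packed_a + (i + 1) * MR * KC;
            for (size_t off = 0; off < MR * KC; off += 16) /* 16 floats = one 64 B line */
                __builtin_prefetch(next + off, /*rw=*/0, /*locality=*/3);
        }
        ukernel(packed_a + i * MR * KC, b_panel, c + i * MR * NR);
    }
}
```

The prefetch is a hint only, so re-enabling this must be benchmarked against the threading regression it originally caused.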
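One common way to derive `kc` and `mc` from cache sizes, in the spirit of Goto/BLIS-style blocking rather than laser's actual heuristic, is: pick `kc` so that the hot A and B micro-panels stay resident in L1, and `mc` so that the whole packed A block fits in part of L2. A hedged sketch; `MR`, `NR`, the fraction-of-cache budgets, and the hardcoded cache sizes below are illustrative assumptions (real code would query them via `sysconf` or CPUID):

```c
enum { MR = 6, NR = 16 };

static int round_down(int x, int multiple) { return (x / multiple) * multiple; }

/* Budget half of L1 for the two hot micro-panels:
 *     kc * (MR + NR) * elem <= L1 / 2
 * and half of L2 for the packed A block:
 *     mc * kc * elem <= L2 / 2
 * mc is rounded down to a multiple of MR so it splits into whole micro-panels. */
void tune_blocking(long l1_bytes, long l2_bytes, int *kc, int *mc)
{
    const int elem = 4; /* sizeof(float) */
    *kc = (int)(l1_bytes / 2 / ((MR + NR) * elem));
    *mc = round_down((int)(l2_bytes / 2 / ((long)*kc * elem)), MR);
}
```

For a typical 32 KiB L1d and 1 MiB L2 this yields kc = 186 and mc = 702; the TLB constraint (packed panels should span few enough pages) would further cap these.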