With #20, the parallel schedule seems to scale perfectly on many cores:
```
$ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 230.400 GFLOP/s
Theoretical peak multi: 4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 1.238 seconds
Average time: 123.713 ms
Stddev time: 0.444 ms
Min time: 123.335 ms
Max time: 124.890 ms
Perf: 114.425 GFLOP/s

Laser production implementation
Collected 10 samples in 1.465 seconds
Average time: 146.392 ms
Stddev time: 0.644 ms
Min time: 146.006 ms
Max time: 147.802 ms
Perf: 96.697 GFLOP/s

Mean Relative Error compared to OpenBLAS: 1.243059557509696e-07
------------------------------------------------------------
$ ./build/gemm_f32_omp
Warmup: 0.9021 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 230.400 GFLOP/s
Theoretical peak multi: 4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

OpenBLAS benchmark
Collected 10 samples in 0.079 seconds
Average time: 7.739 ms
Stddev time: 4.368 ms
Min time: 6.020 ms
Max time: 20.097 ms
Perf: 1829.200 GFLOP/s

Laser production implementation
Collected 10 samples in 0.083 seconds
Average time: 8.126 ms
Stddev time: 4.777 ms
Min time: 6.241 ms
Max time: 21.632 ms
Perf: 1742.123 GFLOP/s

Mean Relative Error compared to OpenBLAS: 0.01456451416015625
```
The scaling matches 96.7 GFLOP/s * 18 cores = 1740 GFLOP/s on my machine.
However, the single-threaded implementation still quite often falls below OpenBLAS.
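For reference, the shape-derived numbers in the log headers above (operation count, byte traffic, arithmetic intensity) follow directly from the problem size. A minimal sketch, assuming the byte count covers only the two input matrices A and B (which is what matches the printed figures):

```c
/* Reproduce the header numbers printed by the benchmarks above for an
 * f32 GEMM with M = N = K = 1920. "Required bytes" counts only the two
 * input matrices A (M x K) and B (K x N), not the output. */
void gemm_stats(long m, long n, long k,
                double *flops, double *bytes, double *ai)
{
    const long elem = 4;                       /* sizeof(float) */
    *flops = 2.0 * m * n * k;                  /* one mul + one add per inner iteration */
    *bytes = (double)((m * k + k * n) * elem); /* A and B inputs only */
    *ai    = *flops / *bytes;                  /* FLOP per byte of input read */
}
```

With M = N = K = 1920 this gives 14155.776 million FLOP, 29.491 MB, and 480 FLOP/byte, matching the logs.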
Causes:
- To fix regressions in #20 (Improve gemm threading), interleaving the load of the next A micro-panel with the computation on the current A micro-panel had to be removed; it is currently commented out: https://github.com/numforge/laser/blob/ebb01ad40f30d495f0f4b02ef1ff49c3f54230cd/laser/primitives/matrix_multiplication/gemm_ukernel_generator.nim#L237-L242. It should be reintroduced.
- `mc` and `kc` should be tuned depending on the available L1 and L2 caches and the TLB.
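The removed interleaving can be sketched roughly as follows. This is an illustrative C sketch, not laser's actual code: the names (`ukernel`, `MR`, `NR`, `KC`), the naive stand-in micro-kernel, and the use of `__builtin_prefetch` are all assumptions. The idea is to issue prefetches for the next packed A micro-panel before computing on the current one, so the loads overlap with the FMA work:

```c
#include <stddef.h>

enum { MR = 4, NR = 4, KC = 64 };

/* Naive stand-in micro-kernel: C_tile (MR x NR) += A_panel * B_panel.
 * A is packed so that the MR values for each k are contiguous. */
static void ukernel(const float *a, const float *b, float *c)
{
    for (size_t k = 0; k < KC; ++k)
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                c[i * NR + j] += a[k * MR + i] * b[k * NR + j];
}

/* Loop over A micro-panels: before computing on panel i, prefetch panel
 * i + 1 so its cache lines are in flight while the kernel runs. */
void gemm_mloop(const float *packed_a, const float *b_panel, float *c,
                size_t n_panels)
{
    for (size_t i = 0; i < n_panels; ++i) {
        if (i + 1 < n_panels) {
            const float *next = packed_a + (i + 1) * MR * KC;
            for (size_t off = 0; off < MR * KC; off += 16) /* 16 floats = one 64 B line */
                __builtin_prefetch(next + off, /*rw=*/0, /*locality=*/3);
        }
        ukernel(packed_a + i * MR * KC, b_panel, c + i * MR * NR);
    }
}
```

The prefetch is a hint only, so re-enabling this must be benchmarked against the threading regression it originally caused.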
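One common way to derive `kc` and `mc` from cache sizes, in the spirit of Goto/BLIS-style blocking rather than laser's actual heuristic, is: pick `kc` so that the hot A and B micro-panels stay resident in L1, and `mc` so that the whole packed A block fits in part of L2. A hedged sketch; `MR`, `NR`, the fraction-of-cache budgets, and the hardcoded cache sizes below are illustrative assumptions (real code would query them via `sysconf` or CPUID):

```c
enum { MR = 6, NR = 16 };

static int round_down(int x, int multiple) { return (x / multiple) * multiple; }

/* Budget half of L1 for the two hot micro-panels:
 *     kc * (MR + NR) * elem <= L1 / 2
 * and half of L2 for the packed A block:
 *     mc * kc * elem <= L2 / 2
 * mc is rounded down to a multiple of MR so it splits into whole micro-panels. */
void tune_blocking(long l1_bytes, long l2_bytes, int *kc, int *mc)
{
    const int elem = 4; /* sizeof(float) */
    *kc = (int)(l1_bytes / 2 / ((MR + NR) * elem));
    *mc = round_down((int)(l2_bytes / 2 / ((long)*kc * elem)), MR);
}
```

For a typical 32 KiB L1d and 1 MiB L2 this yields kc = 186 and mc = 702; the TLB constraint (packed panels should span few enough pages) would further cap these.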