We wanted to get an idea of what might be different between DGEMM from gh-146, which is brute-force "for loops," and the faster-performing version in OpenBLAS.
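For concreteness, that baseline is essentially the textbook triple loop. Here is a minimal Python sketch of that style of implementation (my approximation, not the exact gh-146 code), checked against NumPy's matmul, which typically dispatches to an OpenBLAS DGEMM:

```python
import numpy as np

def dgemm_naive(A, B):
    """Textbook O(m*n*k) triple loop: no blocking, no vectorization."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]
            C[i, j] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((128, 128))
B = rng.standard_normal((128, 128))
# A @ B goes through NumPy's BLAS (often OpenBLAS)--the fast path we compare against.
assert np.allclose(dgemm_naive(A, B), A @ B)
```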
From a quick scan of the OpenBLAS code base at the time of writing, there are a few relevant things to note:
- OpenBLAS uses hand-crafted assembly for many architectures and algorithms; for example, `kernel/x86_64/dgemm_kernel_4x8_haswell.S` has 5000 lines of assembly just for DGEMM--so that's clearly optimization at a different level of "tuning" than where we are at the moment; I'm not sure how easy it will be for me to read the assembly and check for specific optimizations like pipelining and so on (a rough sketch of the cache-blocking idea those kernels build on is included after this list)
- they also appear to ship GPU kernels, e.g., `cuda_dgemm_kernel`, so it may make sense to compare with them on the GPU using a build of OpenBLAS compiled for that scenario (though this is less convenient for the current benchmarks, because SciPy is not GPU-swappable off the shelf; CuPy's cuBLAS-backed matmul or some other Python interface could perhaps serve for comparison--see the sketch after this list)
- `git grep -E -i "strassen"` returns no results in OpenBLAS, so the use of an algorithm with fundamentally different asymptotic behavior seems unlikely; see also related discussion agreeing with this: https://stackoverflow.com/a/11421344/2942522
- in short, it seems like Strassen carries substantial constant-factor and caching overheads, and maybe even numerical stability issues, that prevent it from being the primary choice despite the asymptotic advantage (this may also be why I don't think it is even mentioned in the IEEE paper we were looking at); the minimal Strassen sketch after this list shows where the extra overhead comes from
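To make the first bullet concrete: the hand-written kernels are, at their core, aggressively tuned versions of cache/register blocking. A rough Python sketch of just the blocking structure (the tile size here is arbitrary, not a tuned value; the real kernels do this with SIMD registers, prefetching, and pipelining in assembly):

```python
import numpy as np

def dgemm_blocked(A, B, bs=64):
    """Tile C into bs x bs blocks so each tile's operands stay cache-resident
    across the inner update. bs=64 is illustrative, not a tuned value."""
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i0 in range(0, m, bs):
        for p0 in range(0, k, bs):
            for j0 in range(0, n, bs):
                # C_tile += A_tile @ B_tile; NumPy slicing handles ragged edges.
                C[i0:i0+bs, j0:j0+bs] += A[i0:i0+bs, p0:p0+bs] @ B[p0:p0+bs, j0:j0+bs]
    return C
```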
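For the GPU comparison idea, a hypothetical CuPy timing sketch--CuPy's matmul on float64 arrays dispatches to cuBLAS DGEMM. This assumes a CUDA-capable GPU and a CuPy install matching the local CUDA toolkit, and I have not run it against the current benchmarks:

```python
import numpy as np
import cupy as cp

rng = np.random.default_rng(0)
A_gpu = cp.asarray(rng.standard_normal((4096, 4096)))
B_gpu = cp.asarray(rng.standard_normal((4096, 4096)))

start, stop = cp.cuda.Event(), cp.cuda.Event()
start.record()
C_gpu = A_gpu @ B_gpu  # dispatches to cuBLAS DGEMM for float64
stop.record()
stop.synchronize()
# Kernel time only; host<->device transfer (cp.asarray above) is excluded.
print(cp.cuda.get_elapsed_time(start, stop), "ms")
```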
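And to illustrate the Strassen point: the recursion trades one of the eight block multiplications for many block additions, and each addition below materializes a quarter-size temporary--that extra memory traffic (plus Strassen's weaker error bounds) is the overhead in question. A minimal sketch, assuming square power-of-two sizes purely for simplicity:

```python
import numpy as np

def strassen(A, B, leaf=128):
    """One Strassen recursion per level; falls back to BLAS matmul at small
    sizes. Assumes square matrices with power-of-two dimensions (demo only)."""
    n = A.shape[0]
    if n <= leaf:
        return A @ B  # BLAS handles the base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # 7 multiplications instead of 8, but every +/- allocates a temporary:
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```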
Not sure how helpful all of this is, but my initial impression is that low-level, architecture-specific assembly optimizations drive most of the improvement, rather than fancier, asymptotically superior algorithms that are far more complex.