kaldi/src/matrix/kaldi-matrix.cc
Lines 55 to 56 in 813b731
The L1 cacheline size on all Xeons and AMD EPYCs I checked is 64 bytes. The alignment actually used is invariably 16, and that value is hard-coded all over the code. This causes a performance hit. For L1 constructive interference, the alignment should be 64 on x64 hardware.
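One way to see why the starting alignment matters is to count how many cache lines a buffer spans. The helper below is a hypothetical illustration, not Kaldi code, and it only models one possible mechanism behind the slowdown: a 64-byte block that starts on a 16-byte boundary can straddle two 64-byte lines, while the same block aligned to 64 fits in one.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Kaldi): number of cache lines covered by
// the byte range [addr, addr + size), assuming a line size of `line_size`.
inline std::size_t CacheLinesTouched(std::uintptr_t addr, std::size_t size,
                                     std::size_t line_size = 64) {
  if (size == 0) return 0;
  const std::uintptr_t first = addr / line_size;              // line of first byte
  const std::uintptr_t last = (addr + size - 1) / line_size;  // line of last byte
  return static_cast<std::size_t>(last - first + 1);
}
```

For instance, `CacheLinesTouched(64, 64)` is 1, while `CacheLinesTouched(16, 64)` is 2.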
Proposal:
- The C++17 "the toolchain knows it better" approach being out of reach, add a preprocessor variable `KALDI_CONSTRUCTIVE_MEMALIGN` to signify the constructive cacheline alignment, and default it to 64 unless it is defined otherwise with the `-D` compiler switch during compilation.
- Replace the compiler macros `KALDI_MEMALIGN` and `KALDI_MEMALIGN_FREE` with functions templated on the type being allocated, allowing the above code to be rewritten `calloc`-style (number of elements) without explicit arithmetic and casting. This will make it easier to add an overload that takes an alternative value for the alignment, should such a need arise. This will also ease a later transition to C++17 `std::aligned_alloc` and/or the aligned signature of `operator new` (signatures: throwing (4) and non-throwing (8)). Example:
if ((p_work = KaldiAlignedAlloc<Real>(l_work)) == nullptr)
. . .
KaldiAlignedFree(p_work);
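A minimal sketch of what the two proposed functions could look like follows. The names `KaldiAlignedAlloc`, `KaldiAlignedFree` and `KALDI_CONSTRUCTIVE_MEMALIGN` come from the proposal above; everything else is an assumption: the use of `posix_memalign` (and `_aligned_malloc` on Windows) as the backend, the default argument standing in for the alignment overload, and the overflow check. As in the proposal, "`calloc`-style" refers only to passing an element count; the memory is not zeroed.

```cpp
#include <stdlib.h>  // posix_memalign, free (POSIX)
#include <cstddef>
#include <limits>
#if defined(_WIN32)
#include <malloc.h>  // _aligned_malloc, _aligned_free
#endif

// Default to 64 unless overridden on the command line, e.g.
// -DKALDI_CONSTRUCTIVE_MEMALIGN=128.
#ifndef KALDI_CONSTRUCTIVE_MEMALIGN
#define KALDI_CONSTRUCTIVE_MEMALIGN 64
#endif

static_assert((KALDI_CONSTRUCTIVE_MEMALIGN &
               (KALDI_CONSTRUCTIVE_MEMALIGN - 1)) == 0,
              "alignment must be a power of two");

// Allocates uninitialized storage for num_elements objects of type T.
// Returns nullptr on failure. The alignment parameter is an assumed stand-in
// for the "alternative alignment" overload mentioned in the proposal.
template <typename T>
T* KaldiAlignedAlloc(std::size_t num_elements,
                     std::size_t alignment = KALDI_CONSTRUCTIVE_MEMALIGN) {
  // Guard against overflow in num_elements * sizeof(T).
  if (num_elements > std::numeric_limits<std::size_t>::max() / sizeof(T))
    return nullptr;
  const std::size_t bytes = num_elements * sizeof(T);
#if defined(_WIN32)
  return static_cast<T*>(_aligned_malloc(bytes, alignment));
#else
  void* p = nullptr;
  if (posix_memalign(&p, alignment, bytes) != 0) return nullptr;
  return static_cast<T*>(p);
#endif
}

// Releases memory obtained from KaldiAlignedAlloc.
inline void KaldiAlignedFree(void* p) {
#if defined(_WIN32)
  _aligned_free(p);
#else
  free(p);
#endif
}
```

With this shape, a call site needs neither a cast nor a `sizeof` multiplication, and a later switch to `std::aligned_alloc` would be confined to the two function bodies.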
On Linux, CPU cache descriptors are exposed through sysfs. In particular,
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
Here `index0` is the cache closest to the CPU, i.e. L1.
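The same value can be read programmatically, for example to sanity-check a compile-time default at startup. The function below is a hypothetical sketch (its name and the `-1` error convention are assumptions); it only parses the sysfs file shown above and is meaningful on Linux alone.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns the cache line size in bytes as reported by
// sysfs, or -1 if the file is missing, unreadable, or not a positive integer.
inline long ReadCoherencyLineSize(
    const std::string& path =
        "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size") {
  std::ifstream in(path);
  long value = -1;
  if (!(in >> value) || value <= 0) return -1;
  return value;
}
```

On systems without this sysfs entry the function returns `-1`, so a caller would fall back to the compile-time default.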