Default matrix memory alignment mismatches the L1 cacheline size of most (all?) x64 CPU #4404

Open
@kkm000

Description

if ((p_work = static_cast<Real*>(
        KALDI_MEMALIGN(16, sizeof(Real)*l_work, &temp))) == NULL) {

The L1 cacheline size on every Xeon and AMD EPYC I checked is 64 bytes, yet the actual alignment is invariably 16, and that value is sprinkled all over the code. This causes a performance hit: for L1 constructive interference, the alignment should be 64 on x64 hardware.

Proposal:

  • Since the C++17 "the toolchain knows it better" approach (std::hardware_constructive_interference_size) is out of reach, add a preprocessor variable KALDI_CONSTRUCTIVE_MEMALIGN to signify the constructive cacheline alignment, defaulting to 64 unless overridden with the -D compiler switch during compilation.
  • Replace the preprocessor macros KALDI_MEMALIGN and KALDI_MEMALIGN_FREE with functions templated on the allocated type, so that the code above can be rewritten calloc-style (number of elements) without explicit arithmetic and casting. This will make it easier to add an overload taking an alternative alignment value, should such a need arise. It will also ease a later transition to C++17 std::aligned_alloc and/or the aligned overloads of operator new (signatures: throwing (4) and non-throwing (8)). Example:
if ((p_work = KaldiAlignedAlloc<Real>(l_work)) == nullptr)
 . . .
KaldiAlignedFree(p_work);
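A minimal sketch of what such templated helpers could look like, assuming posix_memalign underneath (the names KaldiAlignedAlloc/KaldiAlignedFree are the ones proposed above, not existing Kaldi API):

```cpp
#include <cstdint>
#include <cstdlib>

#ifndef KALDI_CONSTRUCTIVE_MEMALIGN
#define KALDI_CONSTRUCTIVE_MEMALIGN 64  // L1 cacheline on common x64 CPUs
#endif

// Allocate num_elems elements of T, aligned to the constructive
// cacheline boundary by default. Returns nullptr on failure.
template <typename T>
T* KaldiAlignedAlloc(std::size_t num_elems,
                     std::size_t alignment = KALDI_CONSTRUCTIVE_MEMALIGN) {
  void* p = nullptr;
  if (posix_memalign(&p, alignment, sizeof(T) * num_elems) != 0)
    return nullptr;
  return static_cast<T*>(p);
}

template <typename T>
void KaldiAlignedFree(T* p) {
  std::free(p);
}
```

Keeping the alignment as a defaulted parameter rather than a second macro means call sites that need a different boundary can pass one explicitly without a new overload.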

On Linux, CPU cache descriptors are exposed through the sys filesystem. In particular,

$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64

index0 = the cache closest to the CPU = L1.
