Default matrix memory alignment mismatches the L1 cacheline size of most (all?) x64 CPU #4404

Open
@kkm000

Description

if ((p_work = static_cast<Real*>(
        KALDI_MEMALIGN(16, sizeof(Real)*l_work, &temp))) == NULL) {

The L1 cacheline size on every Xeon and AMD EPYC I checked is 64 bytes, yet the actual alignment is invariably 16, and that value is sprinkled all over the code. This causes a performance hit: for L1 constructive interference, the alignment should be 64 on x64 hardware.

Proposal:

  • Since the C++17 "the toolchain knows it better" approach (std::hardware_constructive_interference_size) is out of reach, add a preprocessor variable KALDI_CONSTRUCTIVE_MEMALIGN to signify the constructive cacheline alignment, defaulting to 64 unless overridden with the -D compiler switch during compilation.
  • Replace the preprocessor macros KALDI_MEMALIGN and KALDI_MEMALIGN_FREE with functions templated on the allocated type, so that the code above can be rewritten calloc-style (number of elements) without explicit arithmetic and casting. This will make it easier to add an overload taking an alternative alignment value, should such a need arise. It will also ease a later transition to C++17 std::aligned_alloc and/or the aligned overloads of operator new (signatures: throwing (4) and non-throwing (8)). Example:
if ((p_work = KaldiAlignedAlloc<Real>(l_work)) == nullptr)
 . . .
KaldiAlignedFree(p_work);
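A minimal sketch of what such templated helpers could look like, assuming posix_memalign underneath (the names KaldiAlignedAlloc/KaldiAlignedFree are the ones proposed above, not existing Kaldi API):

```cpp
#include <cstdint>
#include <cstdlib>

#ifndef KALDI_CONSTRUCTIVE_MEMALIGN
#define KALDI_CONSTRUCTIVE_MEMALIGN 64  // L1 cacheline on common x64 CPUs
#endif

// Allocate num_elems elements of T, aligned to the constructive
// cacheline boundary by default. Returns nullptr on failure.
template <typename T>
T* KaldiAlignedAlloc(std::size_t num_elems,
                     std::size_t alignment = KALDI_CONSTRUCTIVE_MEMALIGN) {
  void* p = nullptr;
  if (posix_memalign(&p, alignment, sizeof(T) * num_elems) != 0)
    return nullptr;
  return static_cast<T*>(p);
}

template <typename T>
void KaldiAlignedFree(T* p) {
  std::free(p);
}
```

Keeping the alignment as a defaulted parameter rather than a second macro means call sites that need a different boundary can pass one explicitly without a new overload.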

On Linux, CPU cache descriptors are exposed through the sys filesystem. In particular,

$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64

index0 = the cache closest to the CPU = L1.
