In this design, one or more AI Engine compute cores (spread across hardware columns, configurable as n_cores) perform a matrix-vector multiplication. We use the bfloat16 data type, and the dimensions M×K of the A matrix are set to 288×288 by default (N, the number of columns in B, is always 1, since B is a vector). The kernel itself consumes 32×32 (M×K) chunks of A, so it is invoked multiple times to complete the full result.
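The chunked computation described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the design's code: `matvec_chunk` and `tiled_matvec` are hypothetical names, and small integers stand in for bfloat16 values so the result can be compared exactly against a straightforward reference.

```python
import random

M, K = 288, 288  # default dimensions of A, as stated in the text
m, k = 32, 32    # chunk of A consumed by one kernel invocation


def matvec_chunk(a_tile, b_chunk, c_chunk):
    """Hypothetical stand-in for the microkernel: c_chunk += a_tile @ b_chunk."""
    for i in range(len(a_tile)):
        acc = 0
        for j in range(len(b_chunk)):
            acc += a_tile[i][j] * b_chunk[j]
        c_chunk[i] += acc


def tiled_matvec(A, B):
    """Compute C = A @ B by visiting A one m x k chunk at a time."""
    C = [0] * M
    calls = 0
    for row in range(0, M, m):
        c_chunk = [0] * m  # accumulator for one 32-element chunk of C
        for col in range(0, K, k):
            a_tile = [A[r][col:col + k] for r in range(row, row + m)]
            matvec_chunk(a_tile, B[col:col + k], c_chunk)
            calls += 1
        C[row:row + m] = c_chunk
    return C, calls


random.seed(0)
# Assumption: integer test data replaces bfloat16 so the comparison is exact.
A = [[random.randint(-3, 3) for _ in range(K)] for _ in range(M)]
B = [random.randint(-3, 3) for _ in range(K)]

C, calls = tiled_matvec(A, B)
reference = [sum(A[r][c] * B[c] for c in range(K)) for r in range(M)]
print(calls, C == reference)
```

With the default sizes there are 288/32 = 9 chunks along each dimension of A, so the stand-in kernel runs 9 × 9 = 81 times.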
This design relies on the same basic concepts as the whole-array matrix-matrix multiplication design and is structured very similarly. For a better understanding of this design, please refer to the in-depth explanation of that design together with the differences outlined below.
The original implementation of the design is found at matrix_vector.py. An alternative version of the design, featuring different runtime operations, is found at matrix_vector_placed.py. A version written in a higher-level form of IRON is found at matrix_vector_iron.py.
Differences from the Whole-Array Matrix-Matrix Multiplication Design
- A specialized matrix-vector microkernel, named matvec_vectorized, is used in this design, as opposed to the more general matrix-matrix microkernel (matmul_vectorized) used in the matrix-matrix-multiplication designs.
- The data movement in this design varies as follows: An identical 32-element chunk of the vector B is broadcast to the cores in all columns, whereas distinct subsequent 32×32-sized tiles of the A matrix are distributed to the cores. As such, each core is responsible for a distinct 32-element chunk of the output vector C. These chunks are assembled (joined) at the shim tile level (in the aiex.runtime_sequence()).
- This design does not use all available compute cores. Instead, it uses at most one core in each hardware column. The variable n_cores defines the number of columns to be used. It would however be possible to extend this design to use all cores.
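The data movement in the second bullet can be modeled in a few lines of Python. This is a minimal sketch under stated assumptions, not the design's actual data-movement code: `run_core` and `distributed_matvec` are hypothetical names, row blocks of A are assumed to be handed to cores round-robin, and n_cores = 3 is chosen only so that the 9 row blocks of a 288×288 matrix divide evenly.

```python
m, k = 32, 32  # tile of A handled per kernel call


def run_core(a_tiles, b_chunks):
    """One core: build a 32-element chunk of C from its A tiles and the broadcast B chunks."""
    c_chunk = [0] * m
    for a_tile, b_chunk in zip(a_tiles, b_chunks):
        for i in range(m):
            c_chunk[i] += sum(a_tile[i][j] * b_chunk[j] for j in range(k))
    return c_chunk


def distributed_matvec(A, B, n_cores):
    M, K = len(A), len(A[0])
    # Every core sees the same 32-element chunks of B (the broadcast).
    b_chunks = [B[c:c + k] for c in range(0, K, k)]
    C = [0] * M
    blocks_per_core = [0] * n_cores
    for idx, row in enumerate(range(0, M, m)):
        core = idx % n_cores  # assumption: round-robin assignment of row blocks
        # Distinct tiles of A go to the core that owns this row block.
        a_tiles = [[A[r][c:c + k] for r in range(row, row + m)]
                   for c in range(0, K, k)]
        # Joining step: the core's output chunk is placed at its offset in C.
        C[row:row + m] = run_core(a_tiles, b_chunks)
        blocks_per_core[core] += 1
    return C, blocks_per_core


# Deterministic integer test data (a stand-in for bfloat16 values).
M, K = 288, 288
A = [[(r * 7 + c * 3) % 5 - 2 for c in range(K)] for r in range(M)]
B = [c % 4 - 1 for c in range(K)]

C, blocks_per_core = distributed_matvec(A, B, n_cores=3)
reference = [sum(A[r][c] * B[c] for c in range(K)) for r in range(M)]
print(blocks_per_core, C == reference)
```

Because each output chunk depends only on one row block of A and the shared vector B, the cores never need to exchange partial results; joining the chunks is a matter of placing each one at the right offset.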
You need C++23 for bfloat16_t support, which is available in g++-13: https://lindevs.com/install-g-on-ubuntu
To compile and run the original design:
make
make matrix_vector.exe
make run
To compile and run the placed design:
env use_placed=1 make
env use_placed=1 make matrix_vector.exe
env use_placed=1 make run
To compile and run the higher-level IRON design:
env use_iron=1 make
env use_iron=1 make matrix_vector.exe
env use_iron=1 make run