Matrix-Vector Multiplication

In this design, one or more AI Engine compute cores (spread across hardware columns; the count is configurable as n_cores) perform a matrix-vector multiplication. The data type is bfloat16, and the A matrix has dimensions M×K, set to 288×288 by default (N, the number of columns in B, is always 1, since B is a vector). The kernel itself consumes 32×32 (M×K) chunks of A, so it is invoked multiple times to compute the full result.
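The tiling described above can be sketched in plain Python. This is a minimal host-side illustration only, not the AIE kernel: the helper names `matvec_tile` and `tiled_matvec` are hypothetical, small integers stand in for bfloat16 values, and the loop order is an assumption.

```python
def matvec_tile(a_tile, b_chunk, c_chunk):
    """Accumulate a_tile @ b_chunk into c_chunk in place.

    Hypothetical stand-in for one invocation of the 32x32 microkernel.
    """
    for i, row in enumerate(a_tile):
        c_chunk[i] += sum(x * y for x, y in zip(row, b_chunk))


def tiled_matvec(A, b, m=32, k=32):
    """Compute c = A @ b by visiting A in m x k tiles; also count kernel calls."""
    M, K = len(A), len(A[0])
    assert M % m == 0 and K % k == 0, "dimensions must be multiples of the tile size"
    c = [0] * M
    calls = 0
    for r in range(0, M, m):            # one m-element chunk of the output c
        c_chunk = [0] * m
        for col in range(0, K, k):      # accumulate over the K dimension
            a_tile = [A[i][col:col + k] for i in range(r, r + m)]
            matvec_tile(a_tile, b[col:col + k], c_chunk)
            calls += 1
        c[r:r + m] = c_chunk
    return c, calls


# Default problem size from the text: M = K = 288, tiles of 32 x 32.
M = K = 288
A = [[(i * 7 + j * 3) % 5 - 2 for j in range(K)] for i in range(M)]
b = [(j % 4) - 1 for j in range(K)]

c, calls = tiled_matvec(A, b)
reference = [sum(A[i][j] * b[j] for j in range(K)) for i in range(M)]
```

With the default sizes, A splits into (288 / 32) × (288 / 32) = 81 tiles, so the stand-in kernel runs 81 times in this sketch.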

This design relies on the same basic concepts as the whole-array matrix-matrix multiplication design and is structured very similarly. For a better understanding of this design, please refer to the in-depth explanation of that design together with the differences outlined below.

The original implementation of the design is found at matrix_vector.py. An alternative version of the design, featuring different runtime operations, is found at matrix_vector_placed.py. A version written in a higher-level form of IRON is found at matrix_vector_iron.py.

Differences from the Whole-Array Matrix-Matrix Multiplication Design

A specialized matrix-vector microkernel, named matvec_vectorized, is used in this design, as opposed to the more general matrix-matrix microkernel (matmul_vectorized) used in the matrix-matrix multiplication designs.
The data movement in this design differs as follows: an identical 32-element chunk of the vector B is broadcast to the cores in all columns, whereas distinct, consecutive 32×32 tiles of the A matrix are distributed to the cores. As such, each core is responsible for a distinct 32-element chunk of the output vector C. These chunks are assembled (joined) at the shim tile level (in the aiex.runtime_sequence()).
This design does not use all available compute cores. Instead, it uses at most one core in each hardware column. The variable n_cores defines the number of columns to be used. It would, however, be possible to extend this design to use all cores.
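The data movement described in the list above can be modeled on the host as follows. This is a rough sketch under stated assumptions, not the actual design: the function `simulate_cores` is hypothetical, the round-robin assignment of row blocks to cores is an assumption (the real mapping is defined in the design files), and `n_cores = 3` is chosen only so that the 9 row blocks of a 288-row matrix divide evenly.

```python
def simulate_cores(A, b, n_cores, m=32, k=32):
    """Model the broadcast of B chunks and the distribution of A tiles.

    Assumption: row blocks of A are assigned to cores round-robin.
    """
    M, K = len(A), len(A[0])
    n_blocks = M // m
    # Hypothetical mapping: core index -> row blocks it is responsible for.
    blocks_per_core = {
        core: list(range(core, n_blocks, n_cores)) for core in range(n_cores)
    }
    c = [0] * M
    for core, blocks in blocks_per_core.items():
        for blk in blocks:
            r = blk * m
            c_chunk = [0] * m                 # this core's chunk of C
            for col in range(0, K, k):
                b_chunk = b[col:col + k]      # identical chunk for every core (broadcast)
                for i in range(m):
                    a_row = A[r + i][col:col + k]   # distinct A tile per core
                    c_chunk[i] += sum(x * y for x, y in zip(a_row, b_chunk))
            c[r:r + m] = c_chunk              # chunks are joined into the full C
    return c, blocks_per_core


M = K = 288
n_cores = 3  # illustrative value only
A = [[(i + 2 * j) % 7 - 3 for j in range(K)] for i in range(M)]
b = [(j % 3) - 1 for j in range(K)]

c, blocks_per_core = simulate_cores(A, b, n_cores)
reference = [sum(A[i][j] * b[j] for j in range(K)) for i in range(M)]
```

In this model every core reads the same `b_chunk` at each step, while the A tiles and the resulting C chunks are disjoint across cores, which is why the output can be assembled by simple concatenation.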

Building and Running the Design

Building the host code requires C++23 for bfloat16_t support, which is available in g++-13: https://lindevs.com/install-g-on-ubuntu

To compile and run the original design:

make
make matrix_vector.exe
make run

To compile and run the placed design:

env use_placed=1 make
env use_placed=1 make matrix_vector.exe
env use_placed=1 make run

To compile and run the higher-level IRON design:

env use_iron=1 make
env use_iron=1 make matrix_vector.exe
env use_iron=1 make run
