The code in this directory showcases an example matrix multiplication design for a Ryzen AI device with an NPU (Neural Processing Unit). The NPU consists of an array of compute cores, called AI Engines (AIEs). The example design configures each of those compute cores to perform multiplications of distinct sub-matrices in parallel.
At a high level, the code does the following (in order):
1. Defining Matrix Dimensions and Data Types: We first specify the dimensions M, K, N for the input matrices A (M×K) and B (K×N), and the output matrix C (M×N), as well as their data type. To enable efficient computation, our design will split large input matrices into smaller sub-matrix blocks on two levels; we thus also define the sizes of those sub-matrices. At the first level, the constants m, k, and n define the size of the submatrices processed by each AIE core. At the second level, we further subdivide using smaller sizes r, s and t; these are the sizes required by the vector computation intrinsics of the AIEs.
2. Constructing an AIE Array Configuration: The NPU hardware is comprised of components laid out in a two-dimensional grid of rows and columns. Based on the matrix sizes and tiling factors, we choose the number of rows, columns, and total number of compute cores of the AIE device that the design should utilize. We then configure the AI Engine array, memory tiles, and shim tiles.
3. Defining Data Movement Inside the NPU: ObjectFIFOs are a data movement abstraction for buffering data and synchronizing between AIE components. We configure ObjectFIFOs for A, B and C to transfer and buffer data between AIE components in chunks of the previously defined sizes (m×k, k×n and m×n, respectively).
4. Defining Core Computations: The core_body() function contains the code that will be loaded onto each AIE core. This code describes the matrix multiplication using the input submatrices a and b acquired through the ObjectFIFOs. The results are accumulated in the output submatrix c.
5. Defining External Data Transfer Sequences: The aie.runtime_sequence() op sets up matrix data movement from the host into the AIE compute cores, and back to the host after computation. It initializes Direct Memory Access (DMA) transfers, sets memory access patterns, and performs synchronization.
6. Generating the Design: The my_matmul() function triggers the code generation process and represents the main entry point of the design. The final print statement outputs the MLIR representation of the AIE array configuration.
In summary, this design leverages an AI Engine accelerator to accomplish matrix multiplication efficiently by breaking large matrices into smaller, manageable submatrices. The design uses parallelism, pipelining, and efficient data movement strategies to minimize computation time on the AI Engine array.
With the default configuration, this design will set up an array of AIEs to perform matrix-matrix multiplication on an int16 input data type (int32 output). The tiling size is configured as 64 × 64 for a, b, and c by default.
You will need C++23 for bfloat16_t support in the test.cpp, which can be found in g++-13: https://lindevs.com/install-g-on-ubuntu
To compile and run the design:

    make
    make whole_array.exe
    make run

To compile and run the placed design with tiling:

    env use_placed=1 make
    env use_placed=1 make whole_array.exe
    env use_placed=1 make run

To compile and run the design written in higher-level IRON:

    env use_iron=1 make
    env use_iron=1 make whole_array.exe
    env use_iron=1 make run

The configuration of the AI Engine array is described in the whole_array.py file. There are two alternative versions of this design:

- whole_array_placed.py: This design integrates some data visualization tools for runtime data movement, which can be viewed using the accompanying notebook. It also features the use of placed instructions in the runtime sequence but is intended to be functionally equivalent to the original design.
- whole_array_iron.py: This design uses a higher-level version of IRON but is also intended to be functionally equivalent. Note that this design does not support tracing at this time.
The design is linked against a compute microkernel which is implemented in C++. The following sections elaborate on each of the steps outlined in the high-level summary above.
Note: The term "tile" has two distinct meanings in the following discussion that should be distinguishable from context:
- AIE tiles are components of the hardware, specifically Shim, Memory and Compute tiles.
- Matrix tiles are smaller sub-matrices of the larger input and output matrices.
In the first section of the code in whole_array.py, we define the following constants:
| Matrix | Size | Submatrix Size (1.) | Vector Intrinsic Size (2.) |
|---|---|---|---|
| A (Input) | M × K | m × k | r × s |
| B (Input) | K × N | k × n | s × t |
| C (Output) | M × N | m × n | r × t |
The input and output matrix sizes are given by the user. We subdivide the input matrices A, B and the output matrix C into smaller, manageable "tiles" (or submatrices) at two levels:
1. Tiling to Compute Core Submatrix Chunks: The input and output matrices stream to/from the AIE compute cores in chunks of size m×k, k×n and m×n. Tiling into these chunks allows the compute cores to work on distinct sub-sections of the input matrices in parallel, which improves performance. This also reduces on-chip memory requirements. The final result is re-assembled using the sub-matrix results of all cores.

   This tiling occurs in the aie.runtime_sequence() operation describing the host-to-memory-tile transfer. We describe it further below, in section "5. Defining External Data Transfer Sequences".

2. Tiling to Vector Intrinsic Size: The AIE compute cores calculate the matrix multiplication using efficient "multiply-accumulate" vector intrinsic instructions (MAC instructions). These hardware instructions process very small blocks of the matrix: size r×s blocks of A and size s×t blocks of B, producing an output of size r×t (C).

   This tiling occurs in the inner-AIE data movements. We describe it in the section "3. Defining Data Movement Inside the NPU".

The vector intrinsic size is dictated by the hardware and the compute microkernel.
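As a rough illustration of the two tiling levels, the following Python sketch counts the tiles at each level and checks that the sizes divide evenly. The concrete values of M, K, N and r, s, t are assumptions chosen for the example (only the 64 × 64 default submatrix size is taken from this document), check_tiling is a hypothetical helper that is not part of whole_array.py, and the even-divisibility requirement is assumed here for simplicity.

```python
# Illustrative sizes; M, K, N and r, s, t are assumptions for this example.
M, K, N = 512, 512, 512   # full matrix dimensions
m, k, n = 64, 64, 64      # level 1: per-core submatrix sizes
r, s, t = 4, 4, 4         # level 2: vector intrinsic sizes


def check_tiling(M, K, N, m, k, n, r, s, t):
    """Return tile counts for both tiling levels.

    Raises ValueError if any size is not a whole multiple of the
    next-smaller size (an assumption made for this sketch).
    """
    pairs = [(M, m, "M/m"), (K, k, "K/k"), (N, n, "N/n"),
             (m, r, "m/r"), (k, s, "k/s"), (n, t, "n/t")]
    for big, small, name in pairs:
        if big % small != 0:
            raise ValueError(f"{name}: {big} is not a multiple of {small}")
    return {
        # Number of level-1 tiles (rows, columns) in each full matrix.
        "core_tiles_A": (M // m, K // k),
        "core_tiles_B": (K // k, N // n),
        "core_tiles_C": (M // m, N // n),
        # Number of level-2 blocks (rows, columns) inside one level-1 tile.
        "blocks_per_a": (m // r, k // s),
        "blocks_per_b": (k // s, n // t),
        "blocks_per_c": (m // r, n // t),
    }


counts = check_tiling(M, K, N, m, k, n, r, s, t)
print(counts)
```

With these example sizes, each full matrix splits into 8 × 8 level-1 tiles, and each level-1 tile splits into 16 × 16 level-2 blocks.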
In the next section of the code, we obtain handles to the components of the hardware.
The Neural Processing Unit (NPU) is physically structured as an array of 6 rows and 4 columns. The lower two rows contain so-called "shim" and "memory" tiles, and the upper four rows are made up of AIE compute cores (AIEs):
- Shim tiles: A single row of shim tiles at the bottom of the core array is responsible for interfacing with the external host for data movement. In our code, they are represented by a list: [_0_ShimTile, _1_ShimTile, _2_ShimTile, _3_ShimTile]
- Memory tiles: A row of memory tiles with scratchpad memory is located above the shim tiles. These memory tiles are responsible for staging and distributing the data during processing. In our code, they are represented by a list: [_0_MemTile, _1_MemTile, _2_MemTile, _3_MemTile]
- Compute tiles: In each of the four columns, there are 4 rows of computation tiles above the memory tiles. This makes for a total of 16 computation cores, which in this design are configured to perform the matrix multiplication. In our code, they are represented by a list of lists, cores, showing their two-dimensional arrangement.
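The grid layout described above can be pictured with plain (column, row) coordinates. This is only a sketch in ordinary Python: it does not call the real tile-construction API, the constant names are hypothetical, and the row-first indexing order of cores is an assumption.

```python
# Hypothetical constants describing the 4-column, 6-row layout from the text.
N_COLS = 4
SHIM_ROW = 0               # bottom row: shim tiles (host interface)
MEM_ROW = 1                # next row: memory tiles (staging)
CORE_ROWS = range(2, 6)    # top four rows: compute tiles

# Each tile is modelled as a (column, row) coordinate pair.
shim_tiles = [(col, SHIM_ROW) for col in range(N_COLS)]
mem_tiles = [(col, MEM_ROW) for col in range(N_COLS)]

# A list of lists mirroring the two-dimensional arrangement of compute tiles.
# Assumption: the outer index is the compute row, the inner index the column.
cores = [[(col, row) for col in range(N_COLS)] for row in CORE_ROWS]

n_cores = sum(len(row) for row in cores)
print(n_cores)
```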
We use "ObjectFIFOs" to abstractly describe the data movement and synchronization between AIE Compute, Memory and Shim tiles. ObjectFIFOs present an interface that behaves like a First-In-First-Out queue. To achieve this, they take care of DMA configuration, acquiring and releasing locks, and managing buffers.
There are several ObjectFIFOs used in this design, which are created using the object_fifo() Python binding:
- Host → Memory Tiles: inA_fifos and inB_fifos move the input matrices from the external host (via the shim tiles) in row 0 to the memory tiles in row 1.
- Memory Tiles → Compute Tiles: memA_fifos and memB_fifos move input data from the memory tiles in row 1 to the compute tiles in rows 2-5.
- Compute Tiles → Memory Tiles → Host: Analogously, memC_fifos and outC_fifos move the output data out from the compute cores to the memory tiles (memC_fifos) and from there out to the external host via the shim tiles (outC_fifos).
Each of inA_fifos, inB_fifos, outC_fifos, memA_fifos, memB_fifos and memC_fifos is a Python dictionary containing a separate ObjectFIFO instance for each column of AIE compute cores in the array. The respective *_names lists contain the names of these ObjectFIFOs.
Of note is the object_fifo_link() operation. This operation establishes a connection between the mem* FIFOs and the in* and outC FIFOs. By linking ObjectFIFOs, the output received at one end of the source FIFO is fed as input into the ObjectFIFO listed as the destination.
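To make the queue-like behaviour and the effect of linking more concrete, here is a toy software model. It is not the ObjectFIFO API: ToyFifo and forward are hypothetical names invented for this sketch, and the real implementation relies on hardware locks and DMAs rather than Python exceptions.

```python
from collections import deque


class ToyFifo:
    """A bounded first-in-first-out queue with a fixed number of slots."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = deque()

    def can_produce(self):
        return len(self.slots) < self.depth

    def produce(self, obj):
        # In hardware the producer would block on a lock instead of raising.
        if not self.can_produce():
            raise RuntimeError("FIFO full")
        self.slots.append(obj)

    def consume(self):
        # In hardware the consumer would block on a lock instead of raising.
        if not self.slots:
            raise RuntimeError("FIFO empty")
        return self.slots.popleft()


def forward(src, dst):
    """Stand-in for a link: move as many objects as fit from src to dst, in order."""
    moved = 0
    while src.slots and dst.can_produce():
        dst.produce(src.consume())
        moved += 1
    return moved


in_fifo = ToyFifo(depth=2)    # plays the role of a host-to-memory-tile FIFO
mem_fifo = ToyFifo(depth=2)   # plays the role of a memory-to-compute-tile FIFO
in_fifo.produce("tile0")
in_fifo.produce("tile1")
moved = forward(in_fifo, mem_fifo)
first = mem_fifo.consume()
print(moved, first)
```

The point of the model is only the ordering guarantee: objects leave the linked destination FIFO in the same order in which they entered the source FIFO.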
We assume our data are stored in row-major format in the host's memory. For processing on the AIE compute cores, we need to transform the data layouts, such that the sub-matrix tiles listed above are laid out contiguously in AIE compute core memory. Thankfully, AIE hardware has extensive support for transforming data using the DMAs as it is received and sent, at zero cost. In the following, we will explain how we make use of this hardware feature to transform our data.
There is a notebook that includes a visualization of the runtime sequence npu_dma_memcpy_nd operations used to transfer matrices A, B, and C.
To run the notebook:

- Start a Jupyter server at the root directory of your clone of mlir-aie. Make sure you use a terminal that has run the utils/setup_env.sh script so that the correct environment variables are passed on to Jupyter. Below is an example of how to start a Jupyter server:

      python3 -m jupyter notebook --no-browser --port=8080

- In your browser, navigate to the URL (which includes a token) found in the output of the above command.
- Navigate to programming_examples/basic/matrix_multiplication/whole_array.
- Double-click mat_mul_whole_array_visualization.ipynb to start the notebook; choose the ipykernel called ironenv.
- You should now be good to go! Note that generating the animations in the notebook can take several minutes.

    make clean
    make run

The memA_fifos and memB_fifos receive sub-matrices of size m×k and k×n, respectively. The FIFOs translate those matrices from a row-major format (or, alternatively, column-major for B if b_col_maj is set) into the r×s-sized and s×t-sized blocks required by the hardware's vector intrinsics before sending them into the compute cores' memory.
For matrix A (memA_fifos), this transformation is expressed using the following wraps and strides as a list of tuples (wrap, stride), given as arguments to the object_fifo() operation:
(Note that // denotes integer floor-division in Python.)
[
(m // r, r * k), # Pair 1
(k // s, s), # Pair 2
(r, k), # Pair 3
(s, 1), # Pair 4
]

Let us break down each component of this pattern. We do so back-to-front for ease of understanding:
- Pair 4: (s, 1)
  - This dimension represents the transfer of a single row of an r×s-sized tile (our target tile size after the transformation).
  - Wrap: s is the length of a row of an r×s-sized block in units of 4 bytes (i32 elements).
  - Stride: A stride of 1 retrieves contiguous elements.
- Pair 3: (r, k)
  - Together with the previous dimension, this dimension represents the transfer of a single r×s-sized tile.
  - Wrap: r is the number of rows of an r×s-sized tile.
  - Stride: k is the stride between the first elements of consecutive rows along the m dimension, i.e. adding this stride to a memory address points to the element in the matrix directly below the original address.
- Pair 2: (k // s, s)
  - Together with the previous dimensions, this dimension represents the transfer of one row of r×s-sized tiles, i.e. the first r rows of the input array (r×k elements).
  - Wrap: k // s is the number of r×s-sized tiles along the k (columns) dimension.
  - Stride: s is the stride between starting elements of consecutive blocks along the k dimension, i.e. adding this stride to a memory address points to the same element in the r×s-sized block directly to the right of the block of the original address.
- Pair 1: (m // r, r * k)
  - Together with the previous dimensions, this dimension transfers the entire m×k-sized matrix as blocks of r×s-sized tiles.
  - Wrap: m // r is the number of r×s-sized blocks along the m (rows) dimension.
  - Stride: r * k is the stride between starting elements of consecutive blocks along the m dimension, i.e. adding this stride to a memory address points to the same element in the r×s-sized block directly below the block of the original address.
You can use this data layout visualizer to better understand data layout transformations expressed as wraps and strides.
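The effect of the four (wrap, stride) pairs can also be reproduced in a few lines of plain Python. This is a software sketch of the addressing pattern only: apply_wraps_strides is a hypothetical helper, the sizes are deliberately tiny, and offsets are counted in matrix elements rather than in the hardware's transfer units.

```python
def apply_wraps_strides(flat, dims):
    """Gather elements of `flat` in the order described by `dims`.

    `dims` is a list of (wrap, stride) pairs, outermost pair first,
    mirroring the order used in the pattern for matrix A.
    """
    out = []

    def walk(level, offset):
        if level == len(dims):
            out.append(flat[offset])
            return
        wrap, stride = dims[level]
        for i in range(wrap):
            walk(level + 1, offset + i * stride)

    walk(0, 0)
    return out


# Tiny example sizes (assumptions for illustration).
m, k, r, s = 4, 4, 2, 2
a = list(range(m * k))  # row-major m x k matrix; element (i, j) holds i*k + j

dims = [
    (m // r, r * k),  # Pair 1: step down by one row of tiles
    (k // s, s),      # Pair 2: step right by one tile
    (r, k),           # Pair 3: step down by one row inside a tile
    (s, 1),           # Pair 4: contiguous elements of one tile row
]
tiled = apply_wraps_strides(a, dims)
print(tiled)
# -> [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Each run of r * s = 4 consecutive output values is one 2×2 tile of the original matrix, so the tiles end up contiguous in memory.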
The matrix B transformation (memB_fifos) is equivalent after substituting the correct dimensions (k×n instead of m×k and s×t instead of r×s). If a column-major layout is used for B (argument b_col_maj is set), the transformation is analogous but transposed.
Analogously, the output matrix C is transformed from r×t-sized blocks back into a row-major matrix of contiguous rows with size m×n.
The core_body() function defines the computation that each core will perform.
We define a core_body() function for each compute core i, inside of which we do the following:
- We acquire a slot in the output buffer into which we will produce the next m×n-tile of output in memC_fifos. We name the acquired buffer elem_out.
- We zero out the acquired output slot using call(zero, [elem_out]), since it may contain stale results.
- K // k times, we:
  - Acquire the next m×k-tile of A and the next k×n-tile of B from ObjectFIFOs memA_fifos[i] and memB_fifos[i], respectively, as elem_in_a and elem_in_b.
  - Call our compute microkernel (implemented in C++ and linked against this design) to perform the matrix multiplication calculation, with call(matmul, [elem_in_a, elem_in_b, elem_out]). The result is summed element-wise in elem_out together with previous iterations.
  - Release elem_in_a and elem_in_b.
- After the complete result for the current m×n-block has been calculated, we can release elem_out.
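The acquire, zero, accumulate and release steps above can be mimicked in plain Python to show why the output slot is zeroed once and then accumulated into K // k times. This is only a host-side sketch: core_body_sketch, zero and matmul_acc are hypothetical stand-ins operating on flat row-major lists, and the ObjectFIFO acquire and release calls are represented by comments.

```python
def zero(c):
    """Overwrite every element of the output tile with 0."""
    for i in range(len(c)):
        c[i] = 0


def matmul_acc(a, b, c, m, k, n):
    """c += a @ b for flat row-major lists a (m x k), b (k x n), c (m x n)."""
    for i in range(m):
        for j in range(n):
            acc = c[i * n + j]
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc


def core_body_sketch(a_tiles, b_tiles, m, k, n):
    """Produce one m x n output tile from K // k pairs of input tiles."""
    elem_out = [-1] * (m * n)   # "acquired" slot; -1 models stale contents
    zero(elem_out)
    for elem_in_a, elem_in_b in zip(a_tiles, b_tiles):  # K // k iterations
        # (acquire of elem_in_a / elem_in_b would happen here)
        matmul_acc(elem_in_a, elem_in_b, elem_out, m, k, n)
        # (release of elem_in_a / elem_in_b would happen here)
    return elem_out             # (release of elem_out would happen here)


# Two 2x2 tile pairs, i.e. K // k == 2.
a_tiles = [[1, 2, 3, 4], [1, 0, 0, 1]]
b_tiles = [[1, 0, 0, 1], [5, 6, 7, 8]]
result = core_body_sketch(a_tiles, b_tiles, 2, 2, 2)
print(result)
# -> [6, 8, 10, 12]
```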
The signature of the aie.runtime_sequence() operation lists as its arguments all the external buffers from the host that we wish to read from or write to on the AI Engine's shim tiles. The body of this function describes how these buffers are transferred from and to the host, including tiling the input matrices into m×k and k×n-sized sub-matrices, and combining the m×n-sized output tiles into the larger output M×N matrix buffer.
- The tb variable segments the M dimension (rows of A) into smaller chunks, each containing tb_max_rows tile rows. This is done so the buffer descriptors (BDs) can be reused for efficient DMA transfers.
- For each column i:
  - For each tile_row in the current row block:
    - The DMA transfer function npu_dma_memcpy_nd loads a segment of matrix A and matrix B data (submatrix a, submatrix b) from the host into the corresponding input FIFOs (inA_fifos, inB_fifos) for the respective column, maintaining the appropriate strides and offsets.
    - Analogously to the data layout transformations described further above to translate an m×k matrix into blocks of r×s-submatrices, this transfer translates the input M×K and K×N matrices into submatrices of size m×k and k×n.
  - The DMA transfer function npu_dma_memcpy_nd sends a segment of matrix C data (submatrix c) from the corresponding outC_fifos for the respective column back to the host, while maintaining the appropriate strides and offsets.
- After completing DMA transfers for each column, dma_wait is used to synchronize their completion.
The aforementioned transfers of rows of tiles of the A matrix are further split into a "ping" and a "pong" phase.
This allows us to reconfigure half of the buffer descriptors used for transferring A concurrently with the other half running (transferring data).
This interleaved design improves performance thanks to overlapped reconfiguration and data movement, especially if there is a large number of rows of tiles of A.
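The chunking into row blocks and the ping/pong split can be sketched as a simple schedule. This is one plausible reading of the description above, not the actual runtime sequence: ping_pong_schedule is a hypothetical helper, the sizes are illustrative, and splitting each row block into two halves is an assumption.

```python
def ping_pong_schedule(n_tile_rows, tb_max_rows):
    """Return a list of (phase, first_tile_row, n_rows) transfer steps.

    Tile rows are grouped into blocks of at most `tb_max_rows`; each block
    is then split into a "ping" half and a "pong" half, so that one half
    could be reconfigured while the other half is transferring.
    """
    schedule = []
    for start in range(0, n_tile_rows, tb_max_rows):
        rows = min(tb_max_rows, n_tile_rows - start)
        half = (rows + 1) // 2
        schedule.append(("ping", start, half))
        if rows - half > 0:
            schedule.append(("pong", start + half, rows - half))
    return schedule


# Illustrative values: 10 tile rows, at most 4 tile rows per block.
schedule = ping_pong_schedule(10, 4)
for step in schedule:
    print(step)
```

Every tile row appears in exactly one step, and the phases alternate so that consecutive steps never share the same half of the buffer descriptors.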
This C++ code demonstrates how to implement matrix multiplication for different data types and operations using the AIE (AI Engine) API and templates. The AI Engine is designed for efficient computation and data movement, especially for matrix multiplication-intensive machine learning workloads. The code has the following main components:
- matmul_scalar: A scalar function that performs matrix multiplication for input matrices a and b and adds the result to matrix c. This function iterates through each row in matrix a and each column in matrix b, performing the multiplication of the corresponding elements and accumulating their sum to populate matrix c.
- matmul_vectorized and matmul_vectorized_XxX: Vectorized matrix multiplication functions for different block sizes and input/output types for the AI Engine. These functions use the AIE API for efficient vectorized matrix multiplication, with support for various input and output tensor data types (e.g., int16, bfloat16). These functions expand the vectorized matrix multiplications to different shapes (2x2, 4x4) to achieve higher kernel efficiency through higher accumulator register usage.
- matmul_vectorized_4x4x4_i16_i16, matmul_vectorized_4x8x4_bf16_bf16, matmul_vectorized_4x8x4_bf16_f32, ... : Helper functions for calling the corresponding matmul_vectorized functions with specific input and output types and block sizes. The shapes of the intrinsic calls (e.g. 4x8x4) have been selected among the available ones for their higher performance. The full list of available matrix multiplication modes can be found here.
- Extern "C" interface functions: These functions provide a C-compatible interface to the main matrix multiplication functions, making it easier to call these functions from other languages or environments.
- Zeroing functions: Functions like zero_vectorized and zero_scalar initialize the output matrix (c_out) with all zero values.
- matmul_vectorized_b_col_maj functions: These functions are identical to the matmul_vectorized_2x2 implementation except for differences in pointer arithmetic for accessing the B matrix and issuing a transpose instruction for B. This allows us to feed column-major s×t-sized tiles into the compute kernel, which then transposes those into row-major.
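To show how a kernel can work directly on the pre-tiled layout, the sketch below performs a blocked multiply-accumulate over operands stored as contiguous r×s, s×t and r×t blocks. It is written in Python purely for illustration and is not a translation of the C++ kernel: to_blocks, from_blocks and matmul_blocked are hypothetical helpers, and the innermost triple loop stands in for what a single hardware MAC intrinsic would compute.

```python
def to_blocks(flat, rows, cols, br, bc):
    """Reorder a row-major rows x cols matrix into contiguous br x bc blocks."""
    out = []
    for bi in range(rows // br):
        for bj in range(cols // bc):
            for i in range(br):
                for j in range(bc):
                    out.append(flat[(bi * br + i) * cols + bj * bc + j])
    return out


def from_blocks(blocked, rows, cols, br, bc):
    """Inverse of to_blocks: restore the row-major layout."""
    out = [0] * (rows * cols)
    idx = 0
    for bi in range(rows // br):
        for bj in range(cols // bc):
            for i in range(br):
                for j in range(bc):
                    out[(bi * br + i) * cols + bj * bc + j] = blocked[idx]
                    idx += 1
    return out


def matmul_blocked(a_t, b_t, c_t, m, k, n, r, s, t):
    """c_t += a_t @ b_t, with all three operands stored block by block."""
    for bi in range(m // r):
        for bj in range(n // t):
            c_off = (bi * (n // t) + bj) * r * t
            for bp in range(k // s):
                a_off = (bi * (k // s) + bp) * r * s
                b_off = (bp * (n // t) + bj) * s * t
                # One r x s x t block multiply-accumulate.
                for i in range(r):
                    for j in range(t):
                        for p in range(s):
                            c_t[c_off + i * t + j] += (
                                a_t[a_off + i * s + p] * b_t[b_off + p * t + j]
                            )


# Tiny example sizes (assumptions for illustration).
m = k = n = 4
r = s = t = 2
a = list(range(16))
b = list(range(16, 32))
c_t = [0] * (m * n)
matmul_blocked(to_blocks(a, m, k, r, s), to_blocks(b, k, n, s, t), c_t,
               m, k, n, r, s, t)
c = from_blocks(c_t, m, n, r, t)
print(c)
```

Because each block is contiguous, the offsets a_off, b_off and c_off advance in whole-block steps, which is the property the data layout transformations described earlier are designed to provide.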
This code showcases efficient performance in matrix multiplication-intensive workloads and can be adapted for other types of inputs and operations as needed.
