Skip to content

Latest commit

 

History

History
 
 

README.md

Matrix Multiplication - Whole Array Design

The code in this directory showcases an example matrix multiplication design for a Ryzen AI device with an NPU (Neural Processing Unit). The NPU consists of an array of compute cores, called AI Engines (AIEs). The example design configures each of those compute cores to perform multiplications of distinct sub-matrices in parallel.

At a high level, the code does the following (in order):

  1. Defining Matrix Dimensions and Data Types: We first specify the dimensions M, K, N for the input matrices A (M×K), and B (K×N), and the output matrix C (M×N), as well as their data type. To enable efficient computation, our design will split large input matrices into smaller sub-matrix blocks on two levels; we thus also define the sizes of those sub-matrices. At the first level, the constants m, k, and n define the size of the submatrices processed by each AIE core. At the second level, we further subdivide using smaller sizes r, s and t -- these are the sizes of required by the vector computation intrinsics of the AIEs.

  2. Constructing an AIE Array Configuration: The NPU hardware is comprised of components laid out in a two-dimensional grid of rows and columns. Based on the matrix sizes and tiling factors, we choose the number of rows, columns, and total number of compute cores of the AIE device that the design should utilize. We then configure the AI Engine array, memory tiles, and shim tiles.

  3. Defining Data Movement Inside the NPU: ObjectFIFOs are a data movement abstraction for buffering data and synchronizing between AIE components. We configure ObjectFIFOs for A, B and C to transfer and buffer data between AIE components in chunks of the previously defined sizes (m×k, k×n and m×n, respecively).

  4. Defining Core Computations: The core_body() function contains the code that will be loaded onto each AIE core. This code describes the matrix multiplication using the input submatrices a and b acquired through the ObjectFIFOs. The results are accumulated in the output submatrix c.

  5. Defining External Data Transfer Sequences: The aie.runtime_sequence() op sets up matrix data movement from the host into the AIE compute cores, and back to the host after computation. It initializes Direct Memory Access (DMA) transfers, sets memory access patterns, and performs synchronization.

  6. Generating the Design: The my_matmul() function triggers the code generation process and represents the main entry point of the design. The final print statement outputs the MLIR representation of the AIE array configuration.

In summary, this design leverages an AI Engine accelerator to accomplish matrix multiplication efficiently by breaking large matrices into smaller, manageable submatrices. The design uses parallelism, pipelining, and efficient data movement strategies to minimize computation time on the AI Engine array.

Building and Running the Design

With the default configuration, this design will set up an array of AIEs to perform matrix-matrix multiplication on a int16 input data type (int32 output). The tiling size is configured as 64 × 64 for a, b, and c by default.

You will need C++23 for bfloat16_t support in the test.cpp, which can be found in g++-13: https://lindevs.com/install-g-on-ubuntu

To compile and run the design:

make
make whole_array.exe
make run

To compile and run the placed design with tiling:

env use_placed=1 make
env use_placed=1 make whole_array.exe
env use_placed=1 make run

To compile and run the placed design with higher-level IRON:

env use_iron=1 make
env use_iron=1 make whole_array.exe
env use_iron=1 make run

Detailed Design Explanation

The configuration of the AI Engine array is described in the whole_array.py file. There are two placed versions of this design:

  • whole_array_placed.py: This design integrates some data visualization tools for runtime data movement, which can be viewed using the accompanying notebook. It also features the use of placed instructions in the runtime sequence but is intended to be functionally equivalent to the orginal design.
  • whole_array_iron.py: This design uses a higher-level version of IRON but is also intended to be functionally equivalent. Note that this design does not support tracing at this time.

It is linked against a compute microkernel which is implemented in C++. The following sections elaborate on each of the steps outlined in the high-level summary above.

Note: The term "tile" has two distinct meanings in the following discussion that should be distinguishable from context:

  • AIE tiles are components of the hardware, specifically Shim, Memory and Compute tiles.
  • Matrix tiles are smaller sub-matrices of the larger input and output matrices.

1. Defining Matrix Dimensions and Data Types

In the first section of the code in whole_array.py, we define the following constants:

Matrix Size Submatrix Size (1.) Vector Intrinsic Size (2.)
A (Input) M × K m × k r × s
B (Input) K × N k × n s × t
C (Output) M × N m × n r × t

The input and output matrix sizes are given by the user. We subdivide the input matrices A, B and the output matrix C into smaller, manageable "tiles" (or submatrices) at two levels:

  1. Tiling to Compute Core Submatrix Chunks: The input and output matrices stream to/from the AIE compute cores in chunks of size of m×k, k×n and n×m. Tiling into these chunks allows each of the computation cores to concurrently work on distinct sub-sections of the input matrices in parallel, which improves performance. This also reduces on-chip memory requirements. The final result is re-assembled using the sub-matrix results of all cores.

    This tiling occurs in the aie.runtime_sequence() operation describing the host-to-memory-tile transfer. We describe it further below, in section "5. Defining External Data Transfer Sequences".

  2. Tiling to Vector Intrinsic Size: The AIE compute cores calculate the matrix multiplication using efficient "multiply-accumulate" vector intrinsic instructions (MAC instructions). These hardware instructions process very small blocks of the matrix: size r×s blocks of A and size s×t blocks of B, producing an output of size r×t (C).

    This tiling occurs in the inner-AIE data movements. We describe it in the section "3. Defining Data Movement Inside the NPU".

    The vector intrinsic size is dictated by the hardware and the compute microkernel.

2. Constructing an AIE Array Configuration

In the next section of the code, we obtain handles to the components of the hardware.

The Neural Processing Unit (NPU) is physically structured as an array of 6 rows and 4 columns. The lower two rows contain so-called "shim" and "memory" tiles, and the upper four rows are made up of AIE compute cores (AIEs):

  1. Shim tiles: A single row of shim tiles on the bottom of the core array is responsible for interfacing with the external host for data movement. In our code, they are represented by a list: [_0_ShimTile, _1_ShimTile, _2_ShimTile, _3_ShimTile]

  2. Memory tiles: A row of memory tiles with scratchpad memory is located above the shim tiles. These memory cores are responsible for staging and distributing the data during processing. In our code, they are represented by a list: [_0_MemTile, _1_MemTile, _2_MemTile, _3_MemTile]

  3. Compute tiles: In each of the four columns, there are 4 rows of computation tiles above the memory tiles. This makes for a total of 16 computation cores, which in this design are configured to perform the matrix multiplication. In our code, they are represented by a list of lists, cores, showing their two-dimensional arrangement.

3. Defining Data Movement Inside the NPU:

We use "ObjectFIFOs" to abstractly describe the data movement and synchronization between AIE Compute, Memory and Shim tiles. ObjectFIFOs present an interface that behaves like a First-In-First-Out queue. To achieve this, they take care of DMA configuration, acquiring and releasing locks, and managing buffers.

There are several ObjectFIFOs used in this design, which are created using the object_fifo() Python binding:

  1. Host → Memory Tiles: inA_fifos, inB_fifos move the input matrices from the external host (via the shim tiles) in row 0 to the memory tiles in row 1.

  2. Memory Tiles → Compute Tiles: memA_fifos, memB_fifos move input data from the memory tiles in row 1 to the compute tiles in rows 2-5.

  3. Compute Tiles → Memory Tiles → Host: Analogously, memC_fifos and OutC_fifos move the output data out from the compute cores to the memory tiles (memC_fifos) and from there out to the external host via the shim tiles (OutC_fifos).

Each of inA_fifos, inB_fifos, OutC_fifos, memA_fifos, memB_fifos and memC_fifos are Python dictionaries, containing a separate ObjectFIFO instance for each column of AIE compute cores in the array. The respective *_names lists contain the names of these ObjectFIFOs.

Of note is the object_fifo_link() operation. This operation establishes a connection between the mem* FIFOs and the in* and outC FIFOs. By linking ObjectFIFOs, the output received at one end of the source FIFO is fed as input into the ObjectFIFO listed as the destination.

data movement diagram

Tiling and Data Layout Transformations

We assume our data are stored in row-major format in the host's memory. For processing on the AIE compute cores, we need to transform the data layouts, such the above listed sub-matrix tiles are laid out contiguously in AIE compute core memory. Thankfully, AIE hardware has extensive support for transforming data using the DMAs as it is received and sent with zero cost. In the following, we will explain how we make use of this hardware feature to transform our data.

Runtime Sequence Tiling and Data Layout Transformations Notebook

There is a notebook that includes visualization for the runtime sequence npu_dma_memcpy_nd operations use to transfer matrices A, B, and C.

To run the notebook:

  • Start a jupyter server at the root directory of your clone of mlir-aie. Make sure you use a terminal that has run the utils/setup_env.sh script so that the correct environment variables are percolated to jupyter. Below is an example of how to start a jupyter server:
    python3 -m jupyter notebook --no-browser --port=8080
  • In your browser, navigate to the URL (which includes a token) which is found in the output of the above command.
  • Navigate to programming_examples/basic/matrix_multiplication/whole_array
  • Double click mat_mul_whole_array_visualization.ipynb to start the notebook; choose the ipykernel called ironenv.
  • You should now be good to go! Note that generating the animations in the notebook can take several minutes.

Run the Notebook as a Script

make clean
make run
Tiling to Vector Intrinsic Size

The memA_fifos and memB_fifos receive sub-matrices of size m×k and k×n, respectively. The FIFOs translate those matrices from a row-major format (or, placedly, column-major for B if b_col_maj is set) into the r×s-sized and s×t-sized blocks required by the hardware's vector instrinsics before sending them into the compute cores memory.

For matrix A (memA_fifos), this transformation is expressed using the following wraps and strides as a list of tuples (wrap, stride), given as arguments to the object_fifo() operation: (Note that // denotes integer floor-division in Python.)

    [
        (m // r, r * k),   # Pair 1
        (k // s, s),       # Pair 2
        (r, k),            # Pair 3
        (s, 1),            # Pair 4
    ]

Let us break down each component of this pattern. We do so back-to-front for ease of understanding:

  • Pair 4: (s, 1)
    • This dimension represents the transfer of a single row of a r×s-sized tile (our target tile size after the transformation).
    • Wrap: s is the length of a row of a r×s-sized block in units of 4 bytes (i32 elements).
    • Stride: A stride of 1 retrieves contiguous elements.
  • Pair 3: (r, k)
    • Together with the previous dimension, this dimenison represents the transfer of a single r×s-sized tile.
    • Wrap: r is the number of rows of a r×s-sized tile.
    • Stride: k is the stride between first element of each consecutive row along the m dimension, i.e. adding this stride to a memory address points to the element in the matrix directly below the original address.
  • Pair 2: (k // s, s)
    • Together with the previous dimensions, this dimension represents the transfer of one row of r×s-sized tiles, i.e. the first k×s elements of the input array.
    • Wrap: k // s is the number of r×s-sized tiles along the k (columns) dimension.
    • Stride: s is the stride between starting elements of consecutive blocks along the k dimension, i.e. adding this stridde to a memory address points to the same element in the r×s-sized block directly to the right of the block of the original address.
  • Pair 1: (m // r, r * k)
    • Together with the previous dimensions, this dimension transfers the entire m×k-sized matrix as blocks of r×s-sized tiles.
    • Wrap: m // r is the number of r×s-sized blocks along the m (rows) dimension.
    • Stride: r * k is the stride between starting elements of consecutive blocks along the m dimension, i.e. adding this stride to a memory address points to the same element in the r×s-sized block directly below the block of the original address.

You can use this data layout visualizer to better understand data layout transformations expressed as wraps and strides.

The matrix B transformation (memB_fifos) is equivalent after substituting the correct dimensions (k×n instead of m×k and s×t isntead of r×s). If a column-major layout is used for B (argument b_col_maj is set), the transformation is analogous but transposed.

Analogously, the output matrix C is transformed back from r×t-sized blocks back into a row-major matrix of contiguous rows with size m×n.

4. Defining Core Computations

The core_body() function defines the computation that each core will perform. We define a core_body() function for each compute core i, inside of which we do the following:

  • We acquire a slot in the output buffer into which we will produce the next m×n-tile of output in memC_fifos. We name the acquired buffer elem_out.
  • We zero out the acquired output slot, since it may contain stale results using call(zero [elem_out]).
  • K // k times, we:
    • We acquire the next m×k-tile of A, and the next k×n tile of B from ObjectFIFOs memA_fifos[i] and memB_fifos[i], respectively, as elem_in_a and elem_in_b.
    • We call our compute microkernel (implemented in C++ and linked against this design) to perform the matrix multiplication calculation, with call(matmul, [elem_in_a, elem_in_b, elem_out]). The result is summed element-wise in elem_out together with previous iterations.
    • We release elem_in_a and elem_in_b.
  • After the complete result for the current m×n-block has been calculated, we can release elem_out.

5. Defining External Data Transfer Sequences

The signature of the aie.runtime_sequence() operation lists as its arguments all the external buffers from the host that we wish to read from or write to on the AI Engine's shim tiles. The body of this function describes how these buffers are transfered from and to the host, including tiling the input matrices into m×k and k×n-sized sub-matrices, and combining the m×n-sized output tiles into the larger output M×N matrix buffer.

  • The tb variable segments the M (rows of A) into smaller chunks, each containing tb_max_rows tile rows. This is done so the buffer descriptors (BDs) can be reused for efficient DMA transfers.
  • For each column i:
    • For each tile_row in the current row block:
      • The DMA transfer function npu_dma_memcpy_nd loads a segment of matrix A and matrix B data (submatrix a, submatrix b) from the host into the corresponding inA_fifos for the respective column, maintaining the appropriate strides and offsets.
      • Analogously to the data layout transformations described further above to translate a m×k matrix into blocks of r×s-submatrices, this transfer translates the input M×K and K×N matrices into submatrices of size m×k and k×n.
    • The DMA transfer function npu_dma_memcpy_nd sends a segment of matrix C data (submatrix c) from the corresponding outC_fifos for the respective column, back to the host while maintaining the appropriate strides and offsets.
    • After completing DMA transfers for each column, dma_wait is used to synchronize their completion.

The aforementioned transfers of rows of tiles of the A matrix are further split into a "ping" and a "pong" phase. This allows us to reconfigure half of the buffer descriptors used for transferring A concurrently with the other half running (transferring data). This interleaved design improves performance thanks to overlapped reconfiguration and data movement, especially if there is a large number of rows of tiles of A.

Compute Microkernels

This C++ code demonstrates how to implement matrix multiplication for different data types and operations using AIE (AI Engine) API and templates. The AI Engine is designed for efficient computation and data movement, especially for matrix multiplication-intensive machine learning workloads. The code has the following main components:

  1. matmul_scalar: A scalar function that performs matrix multiplication for input matrices a and b and adds the result to matrix c. This function iterates through each row in matrix a and each column in matrix b, performing the multiplication of the corresponding elements and accumulating their sum to populate matrix c.

  2. matmul_vectorized and matmul_vectorized_XxX: Vectorized matrix multiplication functions for different block sizes and input/output types for the AI Engine. These functions use the AIE API for efficient vectorized matrix multiplication, with support for various input and output tensor data types (e.g., int16, bfloat16). These functions expand the vectorized matrix multiplications to different shapes (4x4, 2x2, 4x4) to achieve higher kernel efficiency through higher accumulator register usage.

  3. matmul_vectorized_4x4x4_i16_i16, matmul_vectorized_4x8x4_bf16_bf16, matmul_vectorized_4x8x4_bf16_f32, ... : Helper functions for calling the corresponding matmul_vectorized functions with specific input and output types and block sizes. The shapes of the intrinsic calls (ex: 4x8x4) have been selected among the available ones for their higher performance. The full list of available matrix multiplication modes can be found here.

  4. Extern "C" interface functions: These functions provide a C-compatible interface to the main matrix multiplication functions, making it easier to call these functions from other languages or environments.

  5. Zeroing functions: Functions like zero_vectorized and zero_scalar initialize the output matrix (c_out) with all zero values.

  6. matmul_vectorized_b_col_maj functions: These functions are identical to the matmul_vectorized_2x2 implementation except for diffrences in pointer arithmetic for accessing the B matrix and issuing a transpose instruction for B. This allows us to feed column-major s×t-sized tiles into the compute kernel, which then transposes those into row-major.

This code showcases efficient performance in matrix multiplication-intensive workloads and can be adapted for other types of inputs and operations as needed.