This IRON design flow example, called "Passthrough Kernel", demonstrates a simple AIE implementation of a vectorized memcpy on a vector of integers. In this design, a single AIE core performs the memcpy operation on a vector with a default length of 4096. The kernel is configured to work on 1024-element subvectors and is invoked multiple times to complete the full copy. The example consists of two primary design files, passthrough_kernel.py and passThrough.cc, and a testbench, test.cpp or test.py.
- passthrough_kernel.py: A Python script that defines the AIE array structural design using MLIR-AIE operations. The file generates MLIR that is then compiled using aiecc to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI).
- passthrough_kernel_placed.py: A Python script that defines the AIE array structural design using an alternative IRON syntax that yields MLIR-AIE operations. The file generates MLIR that is then compiled using aiecc to produce design binaries (i.e., XCLBIN and inst.bin for the NPU in Ryzen™ AI).
- passThrough.cc: A C++ implementation of vectorized memcpy operations for AIE cores. Found here.
- test.cpp: This C++ code is a testbench for the Passthrough Kernel design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After execution, the testbench verifies the memcpy results and optionally outputs trace data.
- test.py: This Python code is a testbench for the Passthrough Kernel design example. The code is responsible for loading the compiled XCLBIN file, configuring the AIE module, providing input data, and executing the AIE design on the NPU. After execution, the testbench verifies the memcpy results and optionally outputs trace data.
This simple example effectively passes data through a single compute tile in the NPU's AIE array. The design is described as shown in the figure to the right. The overall design flow is as follows:
- An object FIFO called "of_in" connects a Shim Tile to a Compute Tile, and another called "of_out" connects the Compute Tile back to the Shim Tile.
- The runtime data movement is expressed to read 4096 uint8_t values from host memory to the compute tile and write the 4096 values back to host memory.
- The compute tile acquires this input data in "object" sized (1024) blocks from "of_in" and copies them to another output "object" it has acquired from "of_out". Note that a vectorized kernel running on the Compute Tile's AIE core copies the data from the input "object" to the output "object".
- After the vectorized copy is performed, the Compute Tile releases the "objects", allowing the DMAs (abstracted by the object FIFO) to transfer the data back to host memory and copy additional blocks into the Compute Tile, "of_out" and "of_in" respectively.
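The flow in the list above can be modeled in plain Python. This is only an illustrative sketch, not the IRON design itself: `run_passthrough` and `pass_through_line` are hypothetical stand-ins, the two deques merely imitate "of_in" and "of_out", and everything runs sequentially on the host.

```python
from collections import deque

N = 4096      # total number of uint8_t values (default vector length)
LINE = 1024   # size of one object FIFO "object"


def pass_through_line(src, dst):
    """Hypothetical stand-in for the vectorized kernel: copy one object."""
    dst[:] = src


def run_passthrough(host_in):
    of_in = deque()    # models "of_in": Shim Tile -> Compute Tile
    of_out = deque()   # models "of_out": Compute Tile -> Shim Tile
    host_out = bytearray()
    calls = 0
    for start in range(0, len(host_in), LINE):
        # The Shim Tile DMA moves one object of host data into of_in.
        of_in.append(host_in[start:start + LINE])
        # The core acquires an input object and an empty output object.
        elem_in = of_in.popleft()
        elem_out = bytearray(len(elem_in))
        pass_through_line(elem_in, elem_out)
        calls += 1
        # Releasing the output object lets the DMA return it to the host.
        of_out.append(elem_out)
        host_out += of_out.popleft()
    return bytes(host_out), calls


data = bytes(i % 256 for i in range(N))
result, kernel_calls = run_passthrough(data)
print(result == data, kernel_calls)
```

With the default sizes, the model invokes the kernel stand-in four times (4096 / 1024) and the output matches the input.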
It is important to note that the Shim Tile and Compute Tile DMAs move data concurrently, and the Compute Tile's AIE Core also processes data concurrently with the data movement. This is made possible by declaring the ObjectFifo with a depth of 2, for example, ObjectFifo(line_ty, name="in", default_depth=2), to denote ping-pong buffers. If default_depth is not declared, it defaults to 2 to match this pattern.
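As a rough host-side analogy for ping-pong buffering, a bounded queue of depth 2 lets a producer fill one buffer while a consumer works on the other. The sketch below uses only the Python standard library; the names `dma_producer` and `core_consumer` are hypothetical, and the threads only imitate the concurrency that the hardware DMAs and core provide.

```python
import queue
import threading

DEPTH = 2           # two buffers per FIFO (ping-pong)
NUM_OBJECTS = 4     # 4096 values / 1024 values per object
OBJECT_SIZE = 1024


def dma_producer(fifo):
    """Imitates a DMA filling buffers; put() blocks while both are full."""
    for i in range(NUM_OBJECTS):
        fifo.put(bytes([i]) * OBJECT_SIZE)
    fifo.put(None)  # sentinel: no more objects


def core_consumer(fifo, received):
    """Imitates the core draining buffers; get() blocks while both are empty."""
    while True:
        obj = fifo.get()
        if obj is None:
            break
        received.append(obj)


of_in = queue.Queue(maxsize=DEPTH)
received = []
threads = [
    threading.Thread(target=dma_producer, args=(of_in,)),
    threading.Thread(target=core_consumer, args=(of_in, received)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(received), [obj[0] for obj in received])
```

The producer never gets more than two objects ahead of the consumer, which is the property a depth of 2 expresses; a depth of 1 would force the two sides to alternate strictly.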
This design performs a memcpy operation on a vector of input data. The AIE design is described in a Python module as follows:
- Constants & Configuration: The script defines input/output dimensions (N, n), buffer sizes in lineWidthInBytes and lineWidthInInt32s, and tracing support.
- AIE Device Definition: @device defines the target device. The device_body function contains the AIE array design definition.
- Kernel Function Declarations: passThroughLine is an external function imported from passThrough.cc.
- Tile Definitions: ShimTile handles data movement, and ComputeTile2 processes the memcpy operations.
- Object Fifos: of_in and of_out are defined to facilitate communication between ShimTile and ComputeTile2.
- Tracing Flow Setup (Optional): A circuit-switched flow is set up for tracing information when enabled.
- Core Definition: The core_body function loops through sub-vectors of the input data, acquiring elements from of_in, processing using passThroughLine, and outputting the result to of_out.
- Data Movement Configuration: The aie.runtime_sequence operation configures data movement and synchronization on the ShimTile for input and output buffer management.
- Tracing Configuration (Optional): Trace control, event groups, and buffer descriptors are set up in the aie.runtime_sequence operation when tracing is enabled.
- Generate the design: The passthroughKernel() function triggers the code generation process. The final print statement outputs the MLIR representation of the AIE array configuration.
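To make the sizes in the first item concrete, the following sketch works through the arithmetic for the default configuration. It is not taken from passthrough_kernel.py: reading n as the sub-vector length and deriving the two line widths from it are assumptions based on the sizes quoted earlier (4096 uint8_t values, 1024-element objects).

```python
N = 4096          # total number of elements (default vector length)
n = 1024          # assumed: sub-vector ("object") length in elements
ELEM_BYTES = 1    # uint8_t elements are one byte each

# Assumed relationships between the constants named in the list above.
line_width_in_bytes = n * ELEM_BYTES
line_width_in_int32s = line_width_in_bytes // 4   # four bytes per int32

# Number of kernel invocations needed to cover the whole vector.
num_invocations = N // n

print(line_width_in_bytes, line_width_in_int32s, num_invocations)
```

Under these assumptions, each object is 1024 bytes (256 int32 words), and the core loop runs four times per full copy.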
passThrough.cc contains a C++ implementation of a vectorized memcpy operation designed for AIE cores. It consists of two main sections:
- Vectorized Copying: The passThrough_aie() function processes multiple data elements simultaneously, taking advantage of AIE vector datapath capabilities to load, copy and store data elements.
- C-style Wrapper Functions: passThroughLine() and passThroughTile() are two C-style wrapper functions to call the templated passThrough_aie() vectorized memcpy implementation from the AIE design implemented in passthrough_kernel.py. The passThroughLine() and passThroughTile() functions are compiled for uint8_t, int16_t, or int32_t, as determined by the value of the BIT_WIDTH variable.
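The idea of a vector-at-a-time copy with a width-dependent element type can be modeled in Python as follows. This is a behavioral sketch only, not the C++ kernel: `pass_through_model` and `lanes_for` are hypothetical names, and the 512-bit vector width is an assumption that is not stated in this document.

```python
# Element width in bits -> C type selected by BIT_WIDTH (per the text above).
C_TYPES = {8: "uint8_t", 16: "int16_t", 32: "int32_t"}
VECTOR_BITS = 512   # assumed vector width; not stated in this document


def lanes_for(bit_width):
    """Number of elements that fit in one (assumed) vector."""
    if bit_width not in C_TYPES:
        raise ValueError(f"unsupported BIT_WIDTH: {bit_width}")
    return VECTOR_BITS // bit_width


def pass_through_model(src, bit_width):
    """Copy src one vector's worth of elements at a time."""
    lanes = lanes_for(bit_width)
    if len(src) % lanes:
        raise ValueError("length must be a multiple of the vector lanes")
    dst = [0] * len(src)
    vector_ops = 0
    for j in range(0, len(src), lanes):
        dst[j:j + lanes] = src[j:j + lanes]   # models one vector load + store
        vector_ops += 1
    return dst, vector_ops


src = [i % 256 for i in range(1024)]
copied, ops_8bit = pass_through_model(src, 8)
_, ops_32bit = pass_through_model(src, 32)
print(copied == src, ops_8bit, ops_32bit)
```

Under the assumed vector width, a 1024-element object takes 16 vector copies for 8-bit elements and 64 for 32-bit elements, which illustrates why the element type is fixed at compile time.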
To compile the design:

    make

To compile the placed design:

    env use_placed=1 make

To complete compiling the C++ testbench and run the design:

    make run

To run the design using the Python testbench:

    make run_py