The npu_dma_memcpy_nd function enables non-blocking, multi-dimensional data transfers between the AI Engine array and external memory. It is essential for building real applications such as signal processing, machine learning, and video processing.
Function Signature and Parameters:
npu_dma_memcpy_nd(metadata, bd_id, mem, offsets=None, sizes=None, strides=None)
- metadata: This is a reference to the object FIFO or the string name of an object FIFO that records a Shim Tile and one of its DMA channels allocated for the host-side memory transfer. In order to associate the memcpy operation with an object FIFO, this metadata string needs to match the object FIFO name string.
- bd_id: Identifier integer for the particular Buffer Descriptor control registers used for this memcpy. A buffer descriptor contains all information needed for a DMA transfer described in the parameters below.
- mem: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
- tap (optional): A TensorAccessPattern is an alternative method of specifying offset/sizes/strides for determining an access pattern over the mem buffer.
- offsets (optional): Start points for data transfer in each dimension. There is a maximum of four offset dimensions.
- sizes: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
- strides (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
- burst_length (optional): The configuration of the burst length for the DMA task. If 0, defaults to the highest available value.
The strides and sizes express data transformations analogously to those described in Section 2C.
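To make the roles of sizes, strides, and offsets concrete, here is a minimal pure-Python sketch that lists the element indices such a transfer would visit. It is an illustrative model, not the library implementation: the helper name access_pattern is hypothetical, strides are assumed to be listed in the same right-to-left order as sizes (dimension 0 last), and per-dimension offsets are assumed to be scaled by the matching stride.

```python
from itertools import product


def access_pattern(sizes, strides, offsets=None):
    """Return the element indices visited by an ND transfer, in order.

    Hypothetical model: lists are ordered [dim3, dim2, dim1, dim0],
    and dimension 0 (the rightmost entry) varies fastest.
    """
    offsets = offsets or [0] * len(sizes)
    # Assumption: each offset is scaled by the stride of its dimension.
    start = sum(o * s for o, s in zip(offsets, strides))
    indices = []
    # itertools.product advances the rightmost range fastest, matching dim 0.
    for idx in product(*(range(n) for n in sizes)):
        indices.append(start + sum(i * s for i, s in zip(idx, strides)))
    return indices


# A 2 x 3 block taken from the top-left corner of a 4 x 6 row-major buffer.
block = access_pattern(sizes=[1, 1, 2, 3], strides=[0, 0, 6, 1])
print(block)  # [0, 1, 2, 6, 7, 8]

# A linear transfer of 30 contiguous elements.
linear = access_pattern(sizes=[1, 1, 1, 30], strides=[0, 0, 0, 1])
```

Unused higher dimensions have size 1, so their stride never contributes to an index.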
Example Usage:
npu_dma_memcpy_nd(of_in, 0, input_buffer, sizes=[1, 1, 1, 30])
The example above describes a linear transfer of 30 data elements, or 120 bytes, from the input_buffer in host memory into an object FIFO with matching metadata labeled "of_in". The size dimensions are expressed right to left, where the rightmost entry is dimension 0 and the leftmost is dimension 3. Unused higher dimensions should be set to 1.
For high-performance computing applications on AMD's AI Engine, mastering the npu_dma_memcpy_nd function for complex data movements is crucial. Here, we focus on using the sizes, strides, and offsets parameters to effectively manage intricate data transfers.
A common task such as tiling a 2D matrix can be implemented using the npu_dma_memcpy_nd operation. Here is a simplified example.
Scenario: Tiling a 2D matrix of shape [100, 200] and data type int16 into tiles of shape [20, 20], using the convention [row, col].
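The numbers used in the configurations for this scenario follow from the matrix shape, the tile shape, and the element width. The sketch below derives them in 4-byte words, the unit the configurations use; the helper name tile_transfer_params and its dictionary keys are hypothetical and exist only for illustration.

```python
def tile_transfer_params(rows, cols, tile_rows, tile_cols, elem_bytes, word_bytes=4):
    """Derive tiling quantities, counted in 4-byte words (hypothetical helper)."""
    if rows % tile_rows or cols % tile_cols:
        raise ValueError("tile shape must evenly divide the matrix shape")
    if (tile_cols * elem_bytes) % word_bytes:
        raise ValueError("a tile row must be a whole number of words")
    words_per_matrix_row = cols * elem_bytes // word_bytes
    return {
        "words_per_matrix_row": words_per_matrix_row,        # step to the next row
        "words_per_tile_row": tile_cols * elem_bytes // word_bytes,
        "rows_per_tile": tile_rows,
        "tiles_across": cols // tile_cols,                   # tiles along one row
        "tiles_down": rows // tile_rows,                     # rows of tiles
        "tile_band_stride": tile_rows * words_per_matrix_row,  # step to next row of tiles
    }


params = tile_transfer_params(100, 200, 20, 20, elem_bytes=2)
# Words moved when every tile is transferred once.
total_words = (
    params["tiles_down"]
    * params["tiles_across"]
    * params["rows_per_tile"]
    * params["words_per_tile_row"]
)
```

For the int16 [100, 200] matrix this gives 100 words per matrix row, 10 words per tile row, a grid of 5 by 10 tiles, and 2000 words between consecutive rows of tiles, which together cover all 10000 words of the matrix.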
1. Configuration to transfer one tile:
metadata = of_in
bd_id = 3
mem = matrix_memory # Memory object for the matrix
# Sizes define the extent of the tile to copy:
# 10 4-byte words (20 int16s) per tile row, repeated for 20 rows
sizes = [1, 1, 20, 10]
# Strides are '0' in the higher (unused) dimensions,
# '100' (length of a matrix row in 4B or "i32s") to reach the next row in dimension 1,
# and '1' for contiguous words within a tile row in dimension 0
strides = [0, 0, 100, 1]
# Offsets set to zero since we start from the beginning
offsets = [0, 0, 0, 0]
npu_dma_memcpy_nd(metadata, bd_id, mem, offsets, sizes, strides)

2. Configuration to tile the whole matrix:
metadata = of_in
bd_id = 3
mem = matrix_memory # Memory object for the matrix
# Sizes define the extent of the tile to copy.
# Dimension 0 is 10 to transfer 20 int16s for one row of the tile,
# Dimension 1 repeats that row transfer 20 times to complete a [20, 20] tile,
# Dimension 2 repeats that tile transfer 10 times along a row,
# Dimension 3 repeats the row of tiles transfer 5 times to complete.
sizes = [5, 10, 20, 10]
# Strides, listed from dimension 3 down to dimension 0:
# '2000' for the next row of tiles below the last (200 x 20 x 2B / 4B),
# '10' for the next tile to the 'right' of the last [20, 20] tile,
# '100' (length of a matrix row in 4B or "i32s") for the next row within a tile,
# and '1' for contiguous words within a tile row in dimension 0.
strides = [2000, 10, 100, 1]
# Offsets set to zero since we start from the beginning
offsets = [0, 0, 0, 0]
npu_dma_memcpy_nd(metadata, bd_id, mem, offsets, sizes, strides)

Synchronization between DMA channels and the host is facilitated by the dma_wait operation, ensuring data consistency and proper execution order. The dma_wait operation blocks until the BD associated with the ObjectFifo has completed and issued its task-complete token.
Function Signature:
dma_wait(metadata)
- metadata: The ObjectFifo Python object or the name of the object FIFO associated with the DMA operation we will wait on.
Example Usage:
Waiting on DMAs associated with one object fifo:
# Waits for the output data to transfer from the output object fifo to the host
dma_wait(of_out)
Waiting on DMAs associated with more than one object fifo:
dma_wait(of_in, of_out)

A contiguous row-major access is one where the dimension-0 stride (the rightmost entry of strides) is 1 and each outer stride equals the product of the inner sizes. The compiler automatically folds such an access to canonical linear form, so you can always write the natural multidimensional form and let the compiler handle it.
For example, transferring a height × width image or a H × W × C activation tensor:
# 2D image: naturally expressed, compiler linearizes to [1,1,1,height*width]
npu_dma_memcpy_nd(of_in, 0, buf, sizes=[1, 1, height, width], strides=[0, 0, width, 1])
# 3D activation tensor: naturally expressed, compiler linearizes to [1,1,1,H*W*C]
npu_dma_memcpy_nd(of_in, 0, buf, sizes=[1, H, W, C], strides=[0, W*C, C, 1])

The linearized form uses a wider hardware buffer-length register, so the total transfer size is not subject to the hardware d0 dimension size limit that applies to ND transfers.
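The contiguity condition can be checked by hand or with a few lines of Python. The sketch below models the condition as stated in the text, not the compiler's actual pass; the function name linearize_if_contiguous is hypothetical, and it assumes that dimensions of size 1 are ignored because their stride is never applied.

```python
from math import prod


def linearize_if_contiguous(sizes, strides):
    """Return [1, ..., 1, total] for a contiguous row-major access, else None.

    Hypothetical model; lists are ordered [dim3, dim2, dim1, dim0].
    """
    expected = 1  # stride a dimension needs in order to be contiguous
    # Walk from dimension 0 (rightmost) outward.
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size != 1 and stride != expected:
            return None  # a gap or overlap: must stay an ND transfer
        expected *= size
    return [1] * (len(sizes) - 1) + [prod(sizes)]


height, width = 4, 6
image = linearize_if_contiguous([1, 1, height, width], [0, 0, width, 1])

H, W, C = 2, 3, 4
tensor = linearize_if_contiguous([1, H, W, C], [0, W * C, C, 1])

# A 20 x 10 tile cut from rows that are 100 words long is not contiguous.
tile = linearize_if_contiguous([1, 1, 20, 10], [0, 0, 100, 1])
```

Here image and tensor both fold to a single linear dimension of 24 elements, while tile returns None because its row stride (100) is larger than its row length (10).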
- Sync to Reuse Buffer Descriptors: Each npu_dma_memcpy_nd is assigned a bd_id. There are a maximum of 16 BDs available to use in each Shim Tile. It is "safe" to reuse BDs once all transfers are complete. This can be managed by synchronizing with the data flow in mind: account for the BDs that must have completed to transfer data into the array for a compute operation, then sync on the BD that receives the data produced by that compute operation and writes it back to host memory.
- Note Non-blocking Transfers: Overlap data transfers with computation by leveraging the non-blocking nature of npu_dma_memcpy_nd.
- Minimize Synchronization Overhead: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.
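When a design issues more transfers than there are BDs, one simple way to follow the practices above is to split the transfers into batches of at most 16 and synchronize between batches. The following pure-Python sketch only plans the bd_id assignment; plan_bd_batches is a hypothetical helper, and it assumes all transfers go through a single Shim Tile and that a wait is placed between consecutive batches.

```python
def plan_bd_batches(num_transfers, max_bds=16):
    """Group transfers into batches that each fit within the BD budget.

    Returns a list of batches; each batch is a list of
    (bd_id, transfer_index) pairs. Hypothetical planning helper.
    """
    batches = []
    for start in range(0, num_transfers, max_bds):
        stop = min(start + max_bds, num_transfers)
        # bd_ids restart at 0 in every batch: reuse is safe only because
        # the previous batch is assumed to be waited on first.
        batches.append([(i - start, i) for i in range(start, stop)])
    return batches


batches = plan_bd_batches(40)
for batch in batches:
    for bd_id, transfer in batch:
        pass  # issue the transfer with this bd_id here
    # wait for this batch's transfers here before reusing the bd_ids
```

Larger batches mean fewer waits, which is in line with the advice to synchronize judiciously.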
As an alternative to npu_dma_memcpy_nd and dma_wait, there is a series of operations around DMA tasks that can serve a similar purpose.
There are two advantages of using the DMA task operations over using npu_dma_memcpy_nd:
- The user does not have to specify a BD number
- DMA task operations are capable of chaining BD operations; however, this is an advanced use case beyond the scope of this guide.
All programming examples have an *_placed.py version that is written using DMA task operations.
Function Signature and Parameters:
def shim_dma_single_bd_task(
    alloc,
    mem,
    tap: TensorAccessPattern | None = None,
    offset: int | None = None,
    sizes: MixedValues | None = None,
    strides: MixedValues | None = None,
    transfer_len: int | None = None,
    issue_token: bool = False,
)
- alloc: The alloc argument associates the DMA task with an ObjectFIFO. This argument is called alloc because the shim-side end of a data transfer (specifically a channel on a shim tile) is referenced through a so-called "shim DMA allocation". When an ObjectFIFO is created with a Shim Tile endpoint, an allocation with the same name as the ObjectFIFO is automatically generated.
- mem: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
- tap (optional): A TensorAccessPattern is an alternative method of specifying offset/sizes/strides for determining an access pattern over the mem buffer.
- offset (optional): Starting point for the data transfer. Default value is 0.
- sizes: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
- strides (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
- issue_token (optional): If a token is issued, one may call dma_await_task on the returned task. Default is False.
- burst_length (optional): The configuration of the burst length for the DMA task. If 0, defaults to the highest available value.
The strides and sizes express data transformations analogously to those described in Section 2C.
Example Usage:
out_task = shim_dma_single_bd_task(of_out, C, sizes=[1, 1, 1, N], issue_token=True)
The example above describes a linear transfer of N data elements between the C buffer in host memory and the object FIFO with matching metadata labeled "of_out". Because of_out is an output object FIFO, the data flows from the array into C. The sizes dimensions are expressed right to left, where the rightmost entry is dimension 0 and the leftmost is dimension 3. Unused higher dimensions should be set to 1.
Synchronization between DMA channels and the host is facilitated by the dma_await_task operation, ensuring data consistency and proper execution order. The dma_await_task operation waits until all the BDs associated with a task have completed.
Function Signature:
def dma_await_task(*args: DMAConfigureTaskForOp)
- args: One or more dma_task objects, where dma_task objects are the values returned by shim_dma_single_bd_task.
Example Usage:
Waiting on task completion of one DMA task:
# Waits for the output task to complete
dma_await_task(out_task)
Waiting on task completion of more than one DMA task:
# Waits for the input task and then the output task to complete
dma_await_task(in_task, out_task)

dma_await_task can only be called on a task created with issue_token=True. If issue_token=False (the default), dma_free_task should be called when the programmer knows that the task is complete. dma_free_task allows the compiler to reuse the BDs of a task without synchronization. Using dma_free_task(X) before task X has completed will lead to a race condition and unpredictable behavior. Only use dma_free_task(X) in conjunction with some other means of synchronization. For example, you may issue dma_free_task(X) after a call to dma_await_task(Y) if you can reason that task Y can only complete after task X has completed.
Function Signature:
def dma_free_task(*args: DMAConfigureTaskForOp)
- args: One or more dma_task objects, where dma_task objects are the values returned by shim_dma_single_bd_task.
Example Usage:
Release BDs belonging to DMAs associated with one task:
# Allow compiler to reuse BDs of a task. Should only be called if the programmer is sure the task is completed.
dma_free_task(out_task)
Release BDs belonging to DMAs associated with more than one task:
# Allow compiler to reuse BDs of more than one task. Should only be called if the programmer is sure all tasks are completed.
dma_free_task(in_task, out_task)

- Await or Free to Reuse Buffer Descriptors: While the exact buffer descriptor (BD) used for each operation is not visible to the user with the dma_task operations, there are still a finite number (maximum of 16 on a Shim Tile). Thus, it is important to use dma_await_task or dma_free_task before the BDs are exhausted so that they may be reused.
- Note Non-blocking Transfers: Overlap data transfers with computation by leveraging the non-blocking nature of dma_start_task.
- Minimize Synchronization Overhead: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.
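The reason awaiting or freeing matters can be seen in a toy model of a Shim Tile's BD budget. The class below is purely illustrative and is not part of the library: ShimBdPool, configure, and release are hypothetical names, with release standing in for the point at which dma_await_task or dma_free_task lets a task's BD be reused.

```python
class ShimBdPool:
    """Toy model of the finite BD budget on one Shim Tile (hypothetical)."""

    def __init__(self, num_bds=16):
        self.free = list(range(num_bds))
        self.in_use = {}  # task name -> BD id held by that task

    def configure(self, task):
        """Give a task a BD, failing if none are left."""
        if not self.free:
            raise RuntimeError("out of BDs: await or free a task first")
        self.in_use[task] = self.free.pop(0)
        return self.in_use[task]

    def release(self, *tasks):
        """Return the BDs of completed tasks to the pool."""
        for task in tasks:
            self.free.append(self.in_use.pop(task))


pool = ShimBdPool()
for i in range(16):
    pool.configure(f"task{i}")

try:
    pool.configure("task16")  # a 17th task with nothing released
    exhausted = False
except RuntimeError:
    exhausted = True

pool.release("task0", "task1")    # models awaiting or freeing two tasks
reused = pool.configure("task16")  # now succeeds with a recycled BD
```

In this model the seventeenth task fails until earlier tasks are released, which mirrors why long runtime sequences need periodic awaits or frees.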
Both the npu_dma_memcpy_nd/dma_wait interface and the shim_dma_single_bd_task/dma_await_task/dma_free_task interface are powerful tools for managing data transfers and synchronization with AI Engines in the Ryzen™ AI NPU. By understanding and effectively implementing applications leveraging these functions, developers can enhance the performance, efficiency, and accuracy of their high-performance computing applications.