Section 2d - Runtime Data Movement

Section 2 - Data Movement (Object FIFOs)
- Section 2a - Introduction
- Section 2b - Key Object FIFO Patterns
- Section 2c - Data Layout Transformations
- Section 2d - Runtime Data Movement
- Section 2e - Programming for multiple cores
- Section 2f - Practical Examples
- Section 2g - Data Movement Without Object FIFOs

Efficient Data Movement with `npu_dma_memcpy_nd`

The npu_dma_memcpy_nd function enables non-blocking, multi-dimensional data transfers between the AI Engine array and external memory. It is essential for developing real applications such as signal processing, machine learning, and video processing.

Function Signature and Parameters:

npu_dma_memcpy_nd(metadata, bd_id, mem, offsets=None, sizes=None, strides=None)

metadata: This is a reference to the object FIFO or the string name of an object FIFO that records a Shim Tile and one of its DMA channels allocated for the host-side memory transfer. In order to associate the memcpy operation with an object FIFO, this metadata string needs to match the object FIFO name string.
bd_id: Identifier integer for the particular Buffer Descriptor control registers used for this memcpy. A buffer descriptor contains all information needed for a DMA transfer described in the parameters below.
mem: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
tap (optional): A TensorAccessPattern is an alternative method of specifying offset/sizes/strides for determining an access pattern over the mem buffer.
offsets (optional): Start points for data transfer in each dimension. There is a maximum of four offset dimensions.
sizes: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
strides (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
burst_length (optional): The configuration of the burst length for the DMA task. If 0, defaults to the highest available value.

The strides and sizes express data transformations analogously to those described in Section 2c.

Example Usage:

npu_dma_memcpy_nd(of_in, 0, input_buffer, sizes=[1, 1, 1, 30])

The example above describes a linear transfer of 30 data elements (120 bytes for 4-byte elements) from the input_buffer in host memory into an object FIFO with matching metadata labeled "of_in". The size dimensions are expressed right to left: the rightmost entry is dimension 0 and the leftmost entry is dimension 3. Unused higher dimensions should be set to 1.
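To make the right-to-left dimension ordering concrete, the following pure-Python sketch lists the element indices that a four-dimensional sizes/strides pair would visit. It is only a reference model for reasoning about access patterns and is not part of the IRON API: the helper name access_order and its scalar base argument are hypothetical, and the model assumes one explicit stride per dimension, with the linear case using strides of [0, 0, 0, 1].

```python
from itertools import product


def access_order(sizes, strides, base=0):
    """Return the element indices visited by a 4D sizes/strides pattern.

    Lists are written [dim3, dim2, dim1, dim0], so the rightmost entry
    (dimension 0) varies fastest. `base` is a hypothetical scalar start index.
    """
    if len(sizes) != 4 or len(strides) != 4:
        raise ValueError("expected four sizes and four strides")
    order = []
    # itertools.product varies its last range fastest, which matches dimension 0.
    for idx in product(*(range(s) for s in sizes)):
        order.append(base + sum(i * st for i, st in zip(idx, strides)))
    return order


# Linear transfer of 30 elements, as in the example above.
linear = access_order([1, 1, 1, 30], [0, 0, 0, 1])

# Two rows of four elements taken from a buffer whose rows are ten elements long.
two_rows = access_order([1, 1, 2, 4], [0, 0, 10, 1])
```

In this model the linear pattern visits indices 0 through 29 in order, while the second pattern visits four consecutive elements, then jumps ahead by the row length of ten before reading the next four.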

Advanced Techniques for Multi-dimensional `npu_dma_memcpy_nd`

For high-performance computing applications on AMD's AI Engine, mastering the npu_dma_memcpy_nd function for complex data movements is crucial. Here, we focus on using the sizes, strides, and offsets parameters to effectively manage intricate data transfers.

Tiling a Large Matrix

Common tasks such as tiling a 2D matrix can be implemented using the npu_dma_memcpy_nd operation. Here’s a simplified example.

Scenario: Tiling a 2D matrix of shape [100, 200] into tiles of shape [20, 20], with data type int16 and the convention [row, col].

1. Configuration to transfer one tile:

metadata = of_in
bd_id = 3
mem = matrix_memory  # Memory object for the matrix

# Sizes define the extent of the tile to copy
sizes = [1, 1, 20, 10]

# Strides are in units of 4B ("i32s"): '0' in the higher (unused) dimensions,
# '100' (length of a matrix row) in dimension 1 to step to the next row of the tile,
# and '1' in dimension 0 for contiguous data within a tile row.
strides = [0, 0, 100, 1]

# Offsets set to zero since we start from the beginning
offsets = [0, 0, 0, 0]

npu_dma_memcpy_nd(metadata, bd_id, mem, offsets=offsets, sizes=sizes, strides=strides)

2. Configuration to tile the whole matrix:

metadata = of_in
bd_id = 3
mem = matrix_memory  # Memory object for the matrix

# Sizes define the extent of the tile to copy.
# Dimension 0 is 10 to transfer 20 int16s for one row of the tile,
# Dimension 1 repeats that row transfer 20 times to complete a [20, 20] tile,
# Dimension 2 repeats that tile transfer 10 times along a row,
# Dimension 3 repeats the row of tiles transfer 5 times to complete.
sizes = [5, 10, 20, 10]

# Strides, in units of 4B ("i32s"):
# '2000' in dimension 3 for the next row of tiles below the last (200 x 20 x 2B / 4B),
# '10' in dimension 2 for the next tile to the 'right' of the last [20, 20] tile,
# '100' (length of a matrix row) in dimension 1 for the next row within a tile,
# and '1' in dimension 0 for contiguous data within a tile row.
strides = [2000, 10, 100, 1]

# Offsets set to zero since we start from the beginning
offsets = [0, 0, 0, 0]

npu_dma_memcpy_nd(metadata, bd_id, mem, offsets=offsets, sizes=sizes, strides=strides)
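The address arithmetic behind the full-matrix configuration can be checked on the host with plain Python. The sketch below is an illustration, not IRON code: it derives the loop counts and step values (10, 20, 10, 5 and 100, 10, 2000) from the matrix and tile shapes, assumes addresses are counted in 4-byte words as in the comments above, and uses hypothetical variable names throughout.

```python
ROWS, COLS = 100, 200          # matrix shape in int16 elements
TILE = 20                      # tile edge in int16 elements
WORD_BYTES, ELEM_BYTES = 4, 2  # addresses are counted in 4-byte words

row_words = COLS * ELEM_BYTES // WORD_BYTES    # 100 words per matrix row
tile_words = TILE * ELEM_BYTES // WORD_BYTES   # 10 words per tile row
tiles_per_row = COLS // TILE                   # 10 tiles across the matrix
tile_rows = ROWS // TILE                       # 5 rows of tiles
tile_row_step = TILE * row_words               # 2000 words to the next row of tiles

addresses = []
for tr in range(tile_rows):              # outermost: rows of tiles
    for tc in range(tiles_per_row):      # tiles along one row of tiles
        for r in range(TILE):            # rows inside one tile
            for w in range(tile_words):  # words inside one tile row
                addresses.append(
                    tr * tile_row_step + tc * tile_words + r * row_words + w
                )
```

Each of the 10,000 words of the matrix appears exactly once in addresses. The first 200 entries cover the top-left [20, 20] tile, and entry 200 is word 10, the start of the tile to its right.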

Host Synchronization with `dma_wait` after one or more `npu_dma_memcpy_nd` operations

Synchronization between DMA channels and the host is facilitated by the dma_wait operation, ensuring data consistency and proper execution order. The dma_wait operation waits until the BD associated with the ObjectFifo is complete, issuing a task complete token.

Function Signature:

dma_wait(metadata)

metadata: The ObjectFifo Python object, or the name of the object FIFO, associated with the DMA operation to wait on. More than one may be passed, as shown in the second example below.

Example Usage:

Waiting on DMAs associated with one object fifo:

# Waits for the output data to transfer from the output object fifo to the host
dma_wait(of_out)

Waiting on DMAs associated with more than one object fifo:

dma_wait(of_in, of_out)

Automatic Linearization of Contiguous Accesses

A contiguous row-major access is one where the dimension 0 (rightmost) stride is 1 and each outer stride equals the product of the inner sizes. The compiler automatically folds such an access to canonical linear form, so you can always write the natural multidimensional form and let the compiler handle it.

For example, transferring a height × width image or a H × W × C activation tensor:

# 2D image: naturally expressed, compiler linearizes to [1,1,1,height*width]
npu_dma_memcpy_nd(of_in, 0, buf, sizes=[1, 1, height, width], strides=[0, 0, width, 1])

# 3D activation tensor: naturally expressed, compiler linearizes to [1,1,1,H*W*C]
npu_dma_memcpy_nd(of_in, 0, buf, sizes=[1, H, W, C], strides=[0, W*C, C, 1])

The linearized form uses a wider hardware buffer-length register, so the total transfer size is not subject to the hardware d0 dimension size limit that applies to ND transfers.
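The contiguity condition described in this section can be expressed as a short host-side check. This is a rough model of the rule as stated here, not the compiler's implementation: the function name linearized_length is hypothetical, and skipping dimensions of size 1 (whose stride is never applied) is an assumption based on the examples above using a stride of 0 for them.

```python
def linearized_length(sizes, strides):
    """Return the total element count if the pattern is contiguous row-major.

    Returns None when the pattern is not contiguous. Lists are ordered
    [dim3, dim2, dim1, dim0], matching the examples in this section.
    """
    expected = 1  # stride a contiguous layout needs at the current dimension
    for size, stride in zip(reversed(sizes), reversed(strides)):
        # A dimension of size 1 never advances, so its stride is ignored.
        if size != 1 and stride != expected:
            return None
        expected *= size
    return expected


image = linearized_length([1, 1, 4, 6], [0, 0, 6, 1])    # 4 x 6 image
tensor = linearized_length([1, 2, 3, 5], [0, 15, 5, 1])  # 2 x 3 x 5 tensor
padded = linearized_length([1, 1, 4, 6], [0, 0, 8, 1])   # rows padded to 8
```

Here the image and tensor patterns fold to linear lengths of 24 and 30 elements, while the padded pattern is not contiguous because its row stride of 8 exceeds the row length of 6.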

Best Practices for Data Movement and Synchronization with `npu_dma_memcpy_nd`

Sync to Reuse Buffer Descriptors: Each npu_dma_memcpy_nd is assigned a bd_id. A maximum of 16 BDs are available in each Shim Tile. It is "safe" to reuse a BD once its transfer is complete. Manage this by synchronizing with the compute flow: first account for the BDs that must complete to transfer input data into the array for a compute operation, then sync on the BD that receives the data produced by that operation and writes it back to host memory.
Note Non-blocking Transfers: Overlap data transfers with computation by leveraging the non-blocking nature of npu_dma_memcpy_nd.
Minimize Synchronization Overhead: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.
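The BD reuse rule can be pictured with a small bookkeeping model. The class below is a toy written for this guide, not an IRON or hardware interface: ShimBdPool, use, and sync are hypothetical names, and the model only tracks which of the 16 bd_id values on one Shim Tile are still in flight.

```python
class ShimBdPool:
    """Toy model of the buffer descriptor IDs available on one Shim Tile."""

    def __init__(self, num_bds=16):
        self.num_bds = num_bds
        self.in_flight = set()  # bd_ids whose transfers have not been synced

    def use(self, bd_id):
        """Model issuing a transfer that occupies bd_id."""
        if not 0 <= bd_id < self.num_bds:
            raise ValueError(f"bd_id {bd_id} is outside 0..{self.num_bds - 1}")
        if bd_id in self.in_flight:
            raise RuntimeError(f"bd_id {bd_id} reused before synchronization")
        self.in_flight.add(bd_id)

    def sync(self, *bd_ids):
        """Model a completed wait: the given bd_ids become reusable."""
        self.in_flight.difference_update(bd_ids)


pool = ShimBdPool()
pool.use(0)   # input transfer
pool.use(1)   # output transfer
pool.sync(0, 1)
pool.use(0)   # safe: the earlier transfer on BD 0 has completed
```

In this model, calling use(0) a second time before sync(0) raises an error, which mirrors the hazard of reusing a BD whose transfer may still be running.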

Efficient Data Movement with `dma_task` Operations

As an alternative to npu_dma_memcpy_nd and dma_wait, there is a series of operations around DMA tasks that can serve a similar purpose.

There are two advantages of using the DMA task operations over using npu_dma_memcpy_nd:

The user does not have to specify a BD number
DMA task operations are capable of chaining BD operations; however, this is an advanced use case beyond the scope of this guide.

All programming examples have an *_placed.py version that is written using DMA task operations.

Function Signature and Parameters:

def shim_dma_single_bd_task(
    alloc,
    mem,
    tap: TensorAccessPattern | None = None,
    offset: int | None = None,
    sizes: MixedValues | None = None,
    strides: MixedValues | None = None,
    transfer_len: int | None = None,
    issue_token: bool = False,
)

alloc: The alloc argument associates the DMA task with an ObjectFIFO. This argument is called alloc because the shim-side end of a data transfer (specifically a channel on a shim tile) is referenced through a so-called "shim DMA allocation". When an ObjectFIFO is created with a Shim Tile endpoint, an allocation with the same name as the ObjectFIFO is automatically generated.
mem: Reference to a host buffer, given as an argument to the sequence function, that this transfer will read from or write to.
tap (optional): A TensorAccessPattern is an alternative method of specifying offset/sizes/strides for determining an access pattern over the mem buffer.
offset (optional): Starting point for the data transfer. Default value is 0.
sizes: The extent of data to be transferred across each dimension. There is a maximum of four size dimensions.
strides (optional): Interval steps between data points in each dimension, useful for striding-across and reshaping data.
issue_token (optional): If a token is issued, one may call dma_await_task on the returned task. Default is False.
burst_length (optional): The configuration of the burst length for the DMA task. If 0, defaults to the highest available value.

The strides and sizes express data transformations analogously to those described in Section 2c.

Example Usage:

out_task = shim_dma_single_bd_task(of_out, C, sizes=[1, 1, 1, N], issue_token=True)

The example above describes a linear transfer of N data elements from the object FIFO whose shim DMA allocation is named "of_out" into the C buffer in host memory. The sizes dimensions are expressed right to left: the rightmost entry is dimension 0 and the leftmost entry is dimension 3. Unused higher dimensions should be set to 1.

Host Synchronization with `dma_await_task`

Synchronization between DMA channels and the host is facilitated by the dma_await_task operations, ensuring data consistency and proper execution order. The dma_await_task operation waits until all the BDs associated with a task have completed.

Function Signature:

def dma_await_task(*args: DMAConfigureTaskForOp)

args: One or more dma_task objects, where dma_task objects are the value returned by shim_dma_single_bd_task.

Example Usage:

Waiting on task completion of one DMA task:

# Waits for the output task to complete
dma_await_task(out_task)

Waiting on task completion of more than one DMA task:

# Waits for the input task and then the output task to complete
dma_await_task(in_task, out_task)

Free BDs without Waiting with `dma_free_task`

dma_await_task can only be called on a task created with issue_token=True. If issue_token=False (the default), then dma_free_task should be called when the programmer knows that the task is complete. dma_free_task allows the compiler to reuse the BDs of a task without synchronization. Using dma_free_task(X) before task X has completed will lead to a race condition and unpredictable behavior. Only use dma_free_task(X) in conjunction with some other means of synchronization. For example, you may issue dma_free_task(X) after a call to dma_await_task(Y) if you can reason that task Y can only complete after task X has completed.

Function Signature:

def dma_free_task(*args: DMAConfigureTaskForOp)

args: One or more dma_task objects, where dma_task objects are the value returned by shim_dma_single_bd_task.

Example Usage:

Release BDs belonging to DMAs associated with one task:

# Allow compiler to reuse BDs of a task. Should only be called if the programmer is sure the task is completed.
dma_free_task(out_task)

Release BDs belonging to DMAs associated with more than one task:

# Allow compiler to reuse BDs of more than one task. Should only be called if the programmer is sure all tasks are completed.
dma_free_task(in_task, out_task)

Best Practices for Data Movement and Synchronization with `dma_task` Operations

Await or Free to Reuse Buffer Descriptors: While the exact buffer descriptor (BD) used for each operation is not visible to the user with the dma_task operations, there is still a finite number of them (a maximum of 16 on a Shim Tile). Thus, it is important to use dma_await_task or dma_free_task before the BDs are exhausted so that they may be reused.
Note Non-blocking Transfers: Overlap data transfers with computation by leveraging the non-blocking nature of dma_start_task.
Minimize Synchronization Overhead: Synchronize/wait judiciously to avoid excessive overhead that might degrade performance.

Conclusion

Both the npu_dma_memcpy_nd/dma_wait interface and the shim_dma_single_bd_task/dma_await_task/dma_free_task interface are powerful tools for managing data transfers and synchronization with AI Engines in the Ryzen™ AI NPU. By understanding and effectively implementing applications leveraging these functions, developers can enhance the performance, efficiency, and accuracy of their high-performance computing applications.

[Up]
