Merged
67 changes: 58 additions & 9 deletions docs/AIRComputeModel.md
@@ -621,7 +621,7 @@ dimensions depend on the target backend:
The compiler may **reshape** the iteration space (e.g., collapse a 2D herd
into a 1D arrangement) via the `AIRCollapseHerdPass`. Reshaping is inhibited
automatically when the herd body uses cascade channels (`channel_type =
"cascade"`), because cascade connections are topology-dependent and cannot
"npu_cascade"`), because cascade connections are topology-dependent and cannot
survive reindexing. Explicit placement attributes (`x_loc`, `y_loc`,
`x_size`, `y_size`) on the enclosing segment also constrain the legal shapes
by fixing the tile footprint. The pass accepts a `max-col-size` option to
@@ -670,13 +670,30 @@ address spaces of the operand memrefs and mapped to the appropriate hardware mec
An empty `[offsets]`, `[sizes]`, or `[strides]` list for a side means the entire memref
is addressed with unit strides.

#### Cross-rank addressing (multi-GPU)

Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
enclosing `air.rank` scope. When one is present, the corresponding memref is
interpreted as living on the named rank's symmetric heap rather than in the
local process. The verifier requires the op to be enclosed by an `air.rank`
and the referenced memref to be `air.symmetric`-tagged (see §2.7). Lowering
for the GPU backend (planned: `air-cross-rank-dma-to-mgpu`) will expand these
into `mgpuGetHeapBases()`-based peer virtual-address arithmetic followed by
`mgpuMemcpy`; the NPU backend does not support these attributes.

```
// Read 1024 floats from rank 0's symmetric buffer into local L1.
air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
: (memref<1024xf32, 2>, memref<1024xf32, 0>)
```
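
The peer virtual-address arithmetic mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the planned lowering: `peerAddress` is a hypothetical helper, and `heapBases` is assumed to be a plain table holding each rank's symmetric-heap base address as seen from the calling process (the real `mgpuGetHeapBases()` return type is not shown in this PR).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the AIR runtime: translate an address in
// the local symmetric heap into the matching address in a peer's heap.
// Assumption: heapBases[r] is the base virtual address of rank r's symmetric
// heap, mapped into the calling process.
inline std::uintptr_t peerAddress(const std::vector<std::uintptr_t> &heapBases,
                                  std::size_t myRank, std::size_t peerRank,
                                  std::uintptr_t localAddr) {
  assert(myRank < heapBases.size() && peerRank < heapBases.size());
  assert(localAddr >= heapBases[myRank]);
  // Symmetric allocations sit at the same offset on every rank, so the
  // offset computed locally is valid against the peer's base as well.
  std::uintptr_t offset = localAddr - heapBases[myRank];
  return heapBases[peerRank] + offset;
}
```

With `src_rank = 0`, a lowering along these lines would compute the source pointer as `peerAddress(bases, myRank, 0, localPtr)` before issuing the copy.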

---

### 2.5 `air.channel`, `air.channel.put`, `air.channel.get`

```
// Channel declaration — at module scope
air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = <N>}
air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = <N>}

// Synchronous put/get — block until the transfer completes
air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
@@ -696,13 +713,17 @@ them independently and to introduce double-buffering.
A channel may be an array (e.g., `[4, 4]` for a 4×4 array). The `indices` operand on
`put`/`get` selects the specific channel within the array.

The `channel_type` attribute controls the underlying mechanism:
The `channel_type` attribute controls the underlying mechanism. Values are
namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels
use the `gpu_` prefix.

| Value | Mechanism |
|-------|-----------|
| `"dma_stream"` (default) | DMA engines with streaming (circuit-switched) interconnect |
| `"dma_packet"` | DMA engines with packet-switched interconnect |
| `"cascade"` | Core-to-core cascade connections between adjacent tiles |
| `"npu_dma_stream"` (default) | NPU: DMA engines with streaming (circuit-switched) interconnect |
| `"npu_dma_packet"` | NPU: DMA engines with packet-switched interconnect |
| `"npu_cascade"` | NPU: Core-to-core cascade connections between adjacent tiles |
| `"npu_mmio"` | NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer |
| `"gpu_symmetric_heap"` | GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing `air.rank` scope. |
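
Because the values are namespaced by prefix, a tool that consumes AIR can determine the target backend of a channel from the string alone. The following is a minimal sketch of that idea; `ChannelBackend` and `backendOf` are hypothetical names introduced for illustration and do not exist in the AIR code base.

```cpp
#include <string>

// Hypothetical classification of a channel_type string by its prefix.
enum class ChannelBackend { NPU, GPU, Unknown };

inline ChannelBackend backendOf(const std::string &channelType) {
  // rfind(prefix, 0) == 0 is the usual pre-C++20 "starts with" idiom.
  if (channelType.rfind("npu_", 0) == 0)
    return ChannelBackend::NPU;
  if (channelType.rfind("gpu_", 0) == 0)
    return ChannelBackend::GPU;
  // Un-prefixed spellings such as "dma_stream" land here.
  return ChannelBackend::Unknown;
}
```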

The `broadcast_shape` attribute enables one-to-many communication following NumPy
broadcasting rules.
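
As a rough illustration of the NumPy rule applied to channels: a channel array of size `[1, 4]` can broadcast to `[4, 4]` because each channel dimension either equals the target dimension or is 1, with dimensions aligned from the right. The sketch below checks that condition; `broadcastsTo` is a hypothetical helper, and the actual AIR verifier may impose further constraints.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical check: can a channel array of shape `channelSize` broadcast
// to `broadcastShape` under NumPy's rules (one-directional form)?
inline bool broadcastsTo(const std::vector<std::int64_t> &channelSize,
                         const std::vector<std::int64_t> &broadcastShape) {
  // The target must have at least as many dimensions as the source.
  if (channelSize.size() > broadcastShape.size())
    return false;
  // Align trailing dimensions; missing leading dimensions count as 1.
  std::size_t shift = broadcastShape.size() - channelSize.size();
  for (std::size_t i = 0; i < channelSize.size(); ++i) {
    std::int64_t from = channelSize[i];
    std::int64_t to = broadcastShape[i + shift];
    if (from != to && from != 1)
      return false;
  }
  return true;
}
```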
@@ -796,6 +817,28 @@ in the async dependency graph.

---

### 2.7 `air.symmetric` memref attribute (multi-GPU)

A `memref.alloc` op may carry the unit attribute `air.symmetric` to indicate
that the allocation should be backed by the **symmetric heap** runtime. Every
rank in the enclosing `air.rank` scope performs the same allocation in lockstep,
so each rank has a memref of the same size at the same offset within the heap.
Cross-rank addressing (via `air.dma_memcpy_nd` `src_rank`/`dst_rank` attributes
or `air.channel` with `channel_type = "gpu_symmetric_heap"`) refers to peer
ranks' symmetric memrefs at the same logical offset.

```
%buf = memref.alloc() {air.symmetric} : memref<1024xf32>
```

The GPU lowering (planned: `air-symmetric-alloc-to-mgpu`) will route such
allocations through `mgpuSymmetricAlloc` (`runtime_lib/airgpu/gpu_runtime.cpp`)
instead of plain `mgpuMemAlloc`. Peer ranks' base pointers are obtained at
runtime via `mgpuGetHeapBases()`. The NPU backend does not interpret this
attribute.
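
To see why lockstep allocation yields identical offsets, consider a toy bump allocator standing in for one rank's heap. This is an illustrative model only: `ToySymmetricHeap` is a hypothetical type, the 16-byte default alignment is an arbitrary assumption, and nothing here describes how `mgpuSymmetricAlloc` is actually implemented.

```cpp
#include <cstddef>

// Toy model of one rank's symmetric heap: a bump allocator that returns
// offsets from the heap base rather than raw pointers.
struct ToySymmetricHeap {
  std::size_t next = 0; // first free byte, relative to the heap base

  std::size_t alloc(std::size_t bytes, std::size_t align = 16) {
    // Round the cursor up to the requested alignment.
    std::size_t offset = (next + align - 1) / align * align;
    next = offset + bytes;
    return offset;
  }
};
```

If every rank issues the same sequence of `alloc` calls, every rank gets the same sequence of offsets, so an offset computed on one rank identifies the corresponding buffer on any peer.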

---

## 3. NPU (AIE) Backend Mapping

On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps
@@ -999,7 +1042,13 @@ See [buildingGPU.md](buildingGPU.md) for build instructions and the complete
| L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
| L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
| L3 (space 0) | DDR via NOC | HBM via global memory |
| `dma_memcpy_nd` | AIE Shim/Tile DMA engines | SCF load/store loops |
| `channel` (`dma_stream`) | Streaming AXI-S switch | — (not yet mapped to GPU) |
| Synchronization | AIE locks | `gpu.barrier` |
| `dma_memcpy_nd` (intra-rank) | AIE Shim/Tile DMA engines | SCF load/store loops |
| `dma_memcpy_nd` (cross-rank, `src_rank`/`dst_rank`) | n/a | Symmetric heap peer addressing (planned) |
| `channel` (`npu_dma_stream`) | Streaming AXI-S switch | n/a |
| `channel` (`npu_dma_packet`) | Packet-switched AXI-S overlay | n/a |
| `channel` (`npu_cascade`) | Core cascade interface | n/a |
| `channel` (`npu_mmio`) | Host MMIO blockwrite | n/a |
| `channel` (`gpu_symmetric_heap`) | n/a | Symmetric heap peer addressing (planned) |
| `air.symmetric` memref alloc | n/a | `mgpuSymmetricAlloc` (planned) |
| Synchronization | AIE locks | `gpu.barrier` (intra-rank), `mgpuBarrier` (cross-rank) |
| `!air.token` (dependency) | AIE runtime completion signals | GPU stream/event dependencies |
96 changes: 74 additions & 22 deletions mlir/include/air/Dialect/AIR/AIR.td
@@ -477,7 +477,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
Variadic<Index>:$src_sizes,
Variadic<Index>:$src_strides,
OptionalAttr<DenseI32ArrayAttr>:$pad_before,
OptionalAttr<DenseI32ArrayAttr>:$pad_after
OptionalAttr<DenseI32ArrayAttr>:$pad_after,
OptionalAttr<I64Attr>:$src_rank,
OptionalAttr<I64Attr>:$dst_rank
);
let results = (outs Optional<air_AsyncToken>:$async_token);
let assemblyFormat = [{
@@ -487,7 +489,15 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
`(` type($dst) `,` type($src) `)`
}];
let description = [{
dma operator
N-dimensional strided bulk copy between two memrefs.

Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
enclosing `air.rank` scope. When one is present, the corresponding memref is
interpreted as living on the named rank's symmetric heap rather than in the
local process. These attributes are only valid for `air.symmetric`-tagged
memref allocations and require an enclosing `air.rank`. Lowering for these
attributes will be added by a future GPU pass (planned:
`air-cross-rank-dma-to-mgpu`); this PR introduces only the IR surface and
verifier rules.
}];
let extraClassDeclaration = [{
Value getSrcMemref() { return getSrc(); }
@@ -501,7 +511,31 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
bool hasPadding() {
return getPadBefore().has_value();
}
bool hasCrossRank() {
return getSrcRank().has_value() || getDstRank().has_value();
}
}];
let builders = [
// Backward-compatible builder: defaults src_rank/dst_rank to absent.
OpBuilder<(ins "::mlir::TypeRange":$resultTypes,
"::mlir::ValueRange":$async_dependencies,
"::mlir::Value":$dst,
"::mlir::ValueRange":$dst_offsets,
"::mlir::ValueRange":$dst_sizes,
"::mlir::ValueRange":$dst_strides,
"::mlir::Value":$src,
"::mlir::ValueRange":$src_offsets,
"::mlir::ValueRange":$src_sizes,
"::mlir::ValueRange":$src_strides,
"::mlir::DenseI32ArrayAttr":$pad_before,
"::mlir::DenseI32ArrayAttr":$pad_after), [{
build($_builder, $_state, resultTypes, async_dependencies, dst,
dst_offsets, dst_sizes, dst_strides, src,
src_offsets, src_sizes, src_strides, pad_before, pad_after,
/*src_rank=*/IntegerAttr(),
/*dst_rank=*/IntegerAttr());
}]>
];
let hasCanonicalizer = 1;
let hasVerifier = 1;
}
@@ -535,26 +569,30 @@ def air_WaitAllOp: air_Op<"wait_all", [air_AsyncOpInterface]> {
def air_ChannelOp : air_Op<"channel", [Symbol]>,
Arguments<(ins SymbolNameAttr:$sym_name,
DefaultValuedAttr<I64ArrayAttr, "{}">:$size,
DefaultValuedAttr<StrAttr, "\"dma_stream\"">:$channel_type)> {
DefaultValuedAttr<StrAttr, "\"npu_dma_stream\"">:$channel_type)> {
let assemblyFormat = [{
$sym_name $size attr-dict
}];
let summary = "Channel for data movement.";
let description = [{
Operation to represent a communication channel as a point-to-point connection between two memrefs.
The array following the channel name symbol represents the channel's dimensional sizes. Default
size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled
by the `channel_type` attribute.

### Channel Types
The `channel_type` attribute is a string that determines the mechanism used for data movement:
- **"dma_stream"** (default):
The `channel_type` attribute is a string that determines the mechanism used for data movement.
Values are namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels use
the `gpu_` prefix.

NPU (AIE) channel types:
- **"npu_dma_stream"** (default):
Use DMA engines to send and receive data, with routing performed over a streaming interconnect.
- **"dma_packet"**:
- **"npu_dma_packet"**:
Use DMA engines to send and receive data, with routing performed over a packet-switched network.
- **"cascade"**:
- **"npu_cascade"**:
Use processor cores to send and receive data via cascade connections between adjacent tiles.
- **"mmio"**:
- **"npu_mmio"**:
Use host-side MMIO writes (e.g. `aiex.npu.blockwrite`) issued from the runtime
sequence to deliver a constant payload directly into a tile-local L1 buffer.
No DMA channel, no shim allocation, no flow is reserved.
@@ -565,32 +603,46 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>,
`memref.get_global`. The consumer-side `get` lowers to a no-op
because the L1 buffer is already populated when the core begins executing.

GPU channel types:
- **"gpu_symmetric_heap"**:
Cross-GPU messaging through the symmetric heap runtime
(`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). The channel must be enclosed
by an `air.rank` op; the put/get sites use rank indices to address peer
heaps. Lowering will be added by a future GPU pass (planned:
`air-gpu-channel-to-mgpu`) which expands put/get to peer-mapped
`mgpuMemcpy` calls plus a barrier; this PR introduces only the IR
surface and verifier rules.

### Broadcasting
If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute
annotates the output sizes after broadcasting. Broadcasting follows NumPy's broadcasting rules.

Example:

```mlir
// An array of 4 x 4 streaming DMA channels
air.channel @channel_0 [4, 4] {channel_type = "dma_stream"}
// An array of 4 x 4 streaming DMA channels (NPU)
air.channel @channel_0 [4, 4] {channel_type = "npu_dma_stream"}

// A streaming DMA channel broadcasting to 4 destinations
air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_stream"}
// A streaming DMA channel broadcasting to 4 destinations (NPU)
air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_stream"}

// An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations.
// An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations (NPU).
// Broadcasting follows NumPy's rules.
air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "dma_stream"}
air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "npu_dma_stream"}

// A packet-switched DMA channel
air.channel @channel_3 [] {channel_type = "dma_packet"}
// A packet-switched DMA channel (NPU)
air.channel @channel_3 [] {channel_type = "npu_dma_packet"}

// A cascade channel using core-to-core cascade connections
air.channel @channel_4 [] {channel_type = "cascade"}
// A cascade channel using core-to-core cascade connections (NPU)
air.channel @channel_4 [] {channel_type = "npu_cascade"}

// An MMIO channel: the put writes a constant from host into L1 of each
// get's destination tile via runtime-sequence blockwrites
air.channel @channel_5 [] {channel_type = "mmio"}
// get's destination tile via runtime-sequence blockwrites (NPU)
air.channel @channel_5 [] {channel_type = "npu_mmio"}

// A cross-GPU channel through the symmetric heap (GPU). Must appear inside
// an air.rank scope; the indices on put/get encode the peer rank.
air.channel @channel_6 [] {channel_type = "gpu_symmetric_heap"}
```
}];
let extraClassDeclaration = [{