Xilinx · erwei-xilinx · May 6, 2026 · May 3, 2026
@@ -621,7 +621,7 @@ dimensions depend on the target backend:
   The compiler may **reshape** the iteration space (e.g., collapse a 2D herd
   into a 1D arrangement) via the `AIRCollapseHerdPass`. Reshaping is inhibited
   automatically when the herd body uses cascade channels (`channel_type =
-  "cascade"`), because cascade connections are topology-dependent and cannot
+  "npu_cascade"`), because cascade connections are topology-dependent and cannot
   survive reindexing. Explicit placement attributes (`x_loc`, `y_loc`,
   `x_size`, `y_size`) on the enclosing segment also constrain the legal shapes
   by fixing the tile footprint. The pass accepts a `max-col-size` option to
@@ -670,13 +670,30 @@ address spaces of the operand memrefs and mapped to the appropriate hardware mec
 An empty `[offsets]`, `[sizes]`, or `[strides]` list for a side means the entire memref
 is addressed with unit strides.
 
+#### Cross-rank addressing (multi-GPU)
+
+Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
+enclosing `air.rank` scope. When present, the corresponding memref is
+interpreted as living on that rank's symmetric heap rather than on the local
+process. The verifier requires the op to be enclosed by an `air.rank` and the
+referenced memref to be `air.symmetric`-tagged (see §2.7). Lowering for the
+GPU backend (planned: `air-cross-rank-dma-to-mgpu`) will expand these into
+`mgpuGetHeapBases()`-based peer-VA arithmetic plus `mgpuMemcpy`; the NPU
+backend does not support these attributes.
+
+```
+// Read 1024 floats from rank 0's symmetric buffer into local L1.
+air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
+    : (memref<1024xf32, 2>, memref<1024xf32, 0>)
+```
+
 ---
 
 ### 2.5 `air.channel`, `air.channel.put`, `air.channel.get`
 
 ```
 // Channel declaration — at module scope
-air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = <N>}
+air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = <N>}
 
 // Synchronous put/get — block until the transfer completes
 air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
@@ -696,13 +713,17 @@ them independently and to introduce double-buffering.
 A channel may be an array (e.g., `[4, 4]` for a 4×4 array). The `indices` operand on
 `put`/`get` selects the specific channel within the array.
 
-The `channel_type` attribute controls the underlying mechanism:
+The `channel_type` attribute controls the underlying mechanism. Values are
+namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels
+use the `gpu_` prefix.
 
 | Value | Mechanism |
 |-------|-----------|
-| `"dma_stream"` (default) | DMA engines with streaming (circuit-switched) interconnect |
-| `"dma_packet"` | DMA engines with packet-switched interconnect |
-| `"cascade"` | Core-to-core cascade connections between adjacent tiles |
+| `"npu_dma_stream"` (default) | NPU: DMA engines with streaming (circuit-switched) interconnect |
+| `"npu_dma_packet"` | NPU: DMA engines with packet-switched interconnect |
+| `"npu_cascade"` | NPU: Core-to-core cascade connections between adjacent tiles |
+| `"npu_mmio"` | NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer |
+| `"gpu_symmetric_heap"` | GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing `air.rank` scope. |
 
 The `broadcast_shape` attribute enables one-to-many communication following NumPy
 broadcasting rules.
@@ -796,6 +817,28 @@ in the async dependency graph.
 
 ---
 
+### 2.7 `air.symmetric` memref attribute (multi-GPU)
+
+A `memref.alloc` op may carry the unit attribute `air.symmetric` to indicate
+that the allocation should be backed by the **symmetric heap** runtime. Every
+rank in the enclosing `air.rank` scope performs the same allocation in lockstep,
+so each rank has a memref of the same size at the same offset within the heap.
+Cross-rank addressing (via `air.dma_memcpy_nd` `src_rank`/`dst_rank` attributes
+or `air.channel` with `channel_type = "gpu_symmetric_heap"`) refers to peer
+ranks' symmetric memrefs at the same logical offset.
+
+```
+%buf = memref.alloc() {air.symmetric} : memref<1024xf32>
+```
+
+The GPU lowering (planned: `air-symmetric-alloc-to-mgpu`) will route such
+allocations through `mgpuSymmetricAlloc` (`runtime_lib/airgpu/gpu_runtime.cpp`)
+instead of plain `mgpuMemAlloc`. Peer ranks' base pointers are obtained at
+runtime via `mgpuGetHeapBases()`. The NPU backend does not interpret this
+attribute.
+
+---
+
 ## 3. NPU (AIE) Backend Mapping
 
 On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps
@@ -999,7 +1042,13 @@ See [buildingGPU.md](buildingGPU.md) for build instructions and the complete
 | L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
 | L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
 | L3 (space 0) | DDR via NOC | HBM via global memory |
-| `dma_memcpy_nd` | AIE Shim/Tile DMA engines | SCF load/store loops |
-| `channel` (`dma_stream`) | Streaming AXI-S switch | — (not yet mapped to GPU) |
-| Synchronization | AIE locks | `gpu.barrier` |
+| `dma_memcpy_nd` (intra-rank) | AIE Shim/Tile DMA engines | SCF load/store loops |
+| `dma_memcpy_nd` (cross-rank, `src_rank`/`dst_rank`) | n/a | Symmetric heap peer addressing (planned) |
+| `channel` (`npu_dma_stream`) | Streaming AXI-S switch | n/a |
+| `channel` (`npu_dma_packet`) | Packet-switched AXI-S overlay | n/a |
+| `channel` (`npu_cascade`) | Core cascade interface | n/a |
+| `channel` (`npu_mmio`) | Host MMIO blockwrite | n/a |
+| `channel` (`gpu_symmetric_heap`) | n/a | Symmetric heap peer addressing (planned) |
+| `air.symmetric` memref alloc | n/a | `mgpuSymmetricAlloc` (planned) |
+| Synchronization | AIE locks | `gpu.barrier` (intra-rank), `mgpuBarrier` (cross-rank) |
 | `!air.token` (dependency) | AIE runtime completion signals | GPU stream/event dependencies |
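
Reviewer note: the `channel_type` renames in this diff change the spelling of every existing value, so IR written against the old names may need updating. The diff does not say whether the old spellings are still accepted, so the following is only a minimal sketch of a text-level migration, assuming they are not. The helper name `migrate_channel_types` is hypothetical and not part of this PR; the rename table is taken directly from the hunks above and from the `.td` change below.

```python
import re

# Old -> new channel_type spellings, as renamed in this diff.
RENAMES = {
    "dma_stream": "npu_dma_stream",
    "dma_packet": "npu_dma_packet",
    "cascade": "npu_cascade",
    "mmio": "npu_mmio",
}

# Matches: channel_type = "<name>", allowing whitespace or a line break
# around '=' (the attribute can wrap across lines in docs and IR).
_PATTERN = re.compile(r'(channel_type\s*=\s*")([A-Za-z0-9_]+)(")')


def migrate_channel_types(mlir_text: str) -> str:
    """Rewrite old channel_type values to the backend-prefixed names.

    Hypothetical helper: values not listed in RENAMES (including the
    already-prefixed npu_* / gpu_* names) are left untouched, so running
    it twice gives the same result as running it once.
    """

    def _sub(match: re.Match) -> str:
        name = match.group(2)
        return match.group(1) + RENAMES.get(name, name) + match.group(3)

    return _PATTERN.sub(_sub, mlir_text)


if __name__ == "__main__":
    old_ir = 'air.channel @channel_0 [4, 4] {channel_type = "dma_stream"}'
    print(migrate_channel_types(old_ir))
```

This only touches the textual attribute; channels that omit `channel_type` pick up the new default from the op definition and need no rewrite.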
@@ -477,7 +477,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
         Variadic<Index>:$src_sizes,
         Variadic<Index>:$src_strides,
         OptionalAttr<DenseI32ArrayAttr>:$pad_before,
-        OptionalAttr<DenseI32ArrayAttr>:$pad_after
+        OptionalAttr<DenseI32ArrayAttr>:$pad_after,
+        OptionalAttr<I64Attr>:$src_rank,
+        OptionalAttr<I64Attr>:$dst_rank
   );
   let results = (outs Optional<air_AsyncToken>:$async_token);
   let assemblyFormat = [{
@@ -487,7 +489,15 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
     `(` type($dst) `,` type($src) `)`
   }];
   let description = [{
-    dma operator
+    N-dimensional strided bulk copy between two memrefs.
+
+    Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
+    enclosing `air.rank` scope. When present, the corresponding memref is
+    interpreted as living on that rank's symmetric heap rather than on the local
+    process. These attributes are only valid for `air.symmetric`-tagged memref
+    allocations and require an enclosing `air.rank`. Lowering will be added
+    by a future GPU pass (planned: `air-cross-rank-dma-to-mgpu`); this PR
+    introduces only the IR surface and verifier rules.
   }];
   let extraClassDeclaration = [{
     Value getSrcMemref() { return getSrc(); }
@@ -501,7 +511,31 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
     bool hasPadding() {
       return getPadBefore().has_value();
     }
+    bool hasCrossRank() {
+      return getSrcRank().has_value() || getDstRank().has_value();
+    }
   }];
+  let builders = [
+    // Backward-compatible builder: defaults src_rank/dst_rank to absent.
+    OpBuilder<(ins "::mlir::TypeRange":$resultTypes,
+                   "::mlir::ValueRange":$async_dependencies,
+                   "::mlir::Value":$dst,
+                   "::mlir::ValueRange":$dst_offsets,
+                   "::mlir::ValueRange":$dst_sizes,
+                   "::mlir::ValueRange":$dst_strides,
+                   "::mlir::Value":$src,
+                   "::mlir::ValueRange":$src_offsets,
+                   "::mlir::ValueRange":$src_sizes,
+                   "::mlir::ValueRange":$src_strides,
+                   "::mlir::DenseI32ArrayAttr":$pad_before,
+                   "::mlir::DenseI32ArrayAttr":$pad_after), [{
+      build($_builder, $_state, resultTypes, async_dependencies, dst,
+            dst_offsets, dst_sizes, dst_strides, src,
+            src_offsets, src_sizes, src_strides, pad_before, pad_after,
+            /*src_rank=*/IntegerAttr(),
+            /*dst_rank=*/IntegerAttr());
+    }]>
+  ];
   let hasCanonicalizer = 1;
   let hasVerifier = 1;
 }
@@ -535,26 +569,30 @@ def air_WaitAllOp: air_Op<"wait_all", [air_AsyncOpInterface]> {
 def air_ChannelOp : air_Op<"channel", [Symbol]>,
     Arguments<(ins SymbolNameAttr:$sym_name,
                    DefaultValuedAttr<I64ArrayAttr, "{}">:$size,
-                   DefaultValuedAttr<StrAttr, "\"dma_stream\"">:$channel_type)> {
+                   DefaultValuedAttr<StrAttr, "\"npu_dma_stream\"">:$channel_type)> {
   let assemblyFormat = [{
     $sym_name $size attr-dict
   }];
   let summary = "Channel for data movement.";
   let description = [{
     Operation to represent a communication channel as a point-to-point connection between two memrefs.
     The array following the channel name symbol represents the channel's dimensional sizes. Default
-    size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled 
+    size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled
     by the `channel_type` attribute.
 
     ### Channel Types
-    The `channel_type` attribute is a string that determines the mechanism used for data movement:
-    - **"dma_stream"** (default):
+    The `channel_type` attribute is a string that determines the mechanism used for data movement.
+    Values are namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels use
+    the `gpu_` prefix.
+
+    NPU (AIE) channel types:
+    - **"npu_dma_stream"** (default):
       Use DMA engines to send and receive data, with routing performed over a streaming interconnect.
-    - **"dma_packet"**:
+    - **"npu_dma_packet"**:
       Use DMA engines to send and receive data, with routing performed over a packet-switched network.
-    - **"cascade"**:
+    - **"npu_cascade"**:
       Use processor cores to send and receive data via cascade connections between adjacent tiles.
-    - **"mmio"**:
+    - **"npu_mmio"**:
       Use host-side MMIO writes (e.g. `aiex.npu.blockwrite`) issued from the runtime
       sequence to deliver a constant payload directly into a tile-local L1 buffer.
       No DMA channel, no shim allocation, no flow is reserved.
@@ -565,32 +603,46 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>,
       `memref.get_global`. The consumer-side `get` lowers to a no-op
       because the L1 buffer is already populated when the core begins executing.
 
+    GPU channel types:
+    - **"gpu_symmetric_heap"**:
+      Cross-GPU messaging through the symmetric heap runtime
+      (`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). The channel must be enclosed
+      by an `air.rank` op; the put/get sites use rank indices to address peer
+      heaps. Lowering will be added by a future GPU pass (planned:
+      `air-gpu-channel-to-mgpu`) which expands put/get to peer-mapped
+      `mgpuMemcpy` calls plus a barrier; this PR introduces only the IR
+      surface and verifier rules.
+
     ### Broadcasting
-    If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute  
+    If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute
     annotates the output sizes after broadcasting. Broadcasting follows NumPy's broadcasting rules.
 
     Example:
 
     ```mlir
-    // An array of 4 x 4 streaming DMA channels
-    air.channel @channel_0 [4, 4] {channel_type = "dma_stream"}
+    // An array of 4 x 4 streaming DMA channels (NPU)
+    air.channel @channel_0 [4, 4] {channel_type = "npu_dma_stream"}
 
-    // A streaming DMA channel broadcasting to 4 destinations
-    air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_stream"}
+    // A streaming DMA channel broadcasting to 4 destinations (NPU)
+    air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_stream"}
 
-    // An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations.
+    // An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations (NPU).
     // Broadcasting follows NumPy's rules.
-    air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "dma_stream"}
+    air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "npu_dma_stream"}
 
-    // A packet-switched DMA channel
-    air.channel @channel_3 [] {channel_type = "dma_packet"}
+    // A packet-switched DMA channel (NPU)
+    air.channel @channel_3 [] {channel_type = "npu_dma_packet"}
 
-    // A cascade channel using core-to-core cascade connections
-    air.channel @channel_4 [] {channel_type = "cascade"}
+    // A cascade channel using core-to-core cascade connections (NPU)
+    air.channel @channel_4 [] {channel_type = "npu_cascade"}
 
     // An MMIO channel: the put writes a constant from host into L1 of each
-    // get's destination tile via runtime-sequence blockwrites
-    air.channel @channel_5 [] {channel_type = "mmio"}
+    // get's destination tile via runtime-sequence blockwrites (NPU)
+    air.channel @channel_5 [] {channel_type = "npu_mmio"}
+
+    // A cross-GPU channel through the symmetric heap (GPU). Must appear inside
+    // an air.rank scope; the indices on put/get encode the peer rank.
+    air.channel @channel_6 [] {channel_type = "gpu_symmetric_heap"}
     ```
   }];
   let extraClassDeclaration = [{