Xilinx
diff --git a/‎docs/AIRComputeModel.md‎
Lines changed: 56 additions & 9 deletions b/‎docs/AIRComputeModel.md‎
diff --git a/‎docs/MultiGPUPlan.md‎
Lines changed: 224 additions & 0 deletions b/‎docs/MultiGPUPlan.md‎
@@ -621,7 +621,7 @@ dimensions depend on the target backend:
   The compiler may **reshape** the iteration space (e.g., collapse a 2D herd
   into a 1D arrangement) via the `AIRCollapseHerdPass`. Reshaping is inhibited
   automatically when the herd body uses cascade channels (`channel_type =
-  "cascade"`), because cascade connections are topology-dependent and cannot
+  "npu_cascade"`), because cascade connections are topology-dependent and cannot
   survive reindexing. Explicit placement attributes (`x_loc`, `y_loc`,
   `x_size`, `y_size`) on the enclosing segment also constrain the legal shapes
   by fixing the tile footprint. The pass accepts a `max-col-size` option to
@@ -670,13 +670,29 @@ address spaces of the operand memrefs and mapped to the appropriate hardware mec
 An empty `[offsets]`, `[sizes]`, or `[strides]` list for a side means the entire memref
 is addressed with unit strides.
 
+#### Cross-rank addressing (multi-GPU)
+
+Optional `src_rank` / `dst_rank` integer attributes name a peer rank R in the
+enclosing `air.rank` scope. When one is present, the corresponding memref is
+interpreted as living on rank R's symmetric heap rather than in the local
+process. The verifier requires the op to be enclosed by an `air.rank` and the
+referenced memref to be `air.symmetric`-tagged (see §2.7). The GPU backend
+(`air-to-rocdl`) lowers cross-rank DMAs through `mgpuGetHeapBases()`-based
+peer addressing; the NPU backend does not support these attributes.
+
+```
+// Read 1024 floats from rank 0's symmetric buffer into local L1.
+air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
+    : (memref<1024xf32, 2>, memref<1024xf32, 0>)
+```
+
 ---
 
 ### 2.5 `air.channel`, `air.channel.put`, `air.channel.get`
 
 ```
 // Channel declaration — at module scope
-air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = <N>}
+air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = <N>}
 
 // Synchronous put/get — block until the transfer completes
 air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
@@ -696,13 +712,17 @@ them independently and to introduce double-buffering.
 A channel may be an array (e.g., `[4, 4]` for a 4×4 array). The `indices` operand on
 `put`/`get` selects the specific channel within the array.
 
-The `channel_type` attribute controls the underlying mechanism:
+The `channel_type` attribute controls the underlying mechanism. Values are
+namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels
+use the `gpu_` prefix.
 
 | Value | Mechanism |
 |-------|-----------|
-| `"dma_stream"` (default) | DMA engines with streaming (circuit-switched) interconnect |
-| `"dma_packet"` | DMA engines with packet-switched interconnect |
-| `"cascade"` | Core-to-core cascade connections between adjacent tiles |
+| `"npu_dma_stream"` (default) | NPU: DMA engines with streaming (circuit-switched) interconnect |
+| `"npu_dma_packet"` | NPU: DMA engines with packet-switched interconnect |
+| `"npu_cascade"` | NPU: Core-to-core cascade connections between adjacent tiles |
+| `"npu_mmio"` | NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer |
+| `"gpu_symmetric_heap"` | GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing `air.rank` scope. |
 
 The `broadcast_shape` attribute enables one-to-many communication following NumPy
 broadcasting rules.
@@ -796,6 +816,27 @@ in the async dependency graph.
 
 ---
 
+### 2.7 `air.symmetric` memref attribute (multi-GPU)
+
+A `memref.alloc` op may carry the unit attribute `air.symmetric` to indicate
+that the allocation should be backed by the **symmetric heap** runtime. Every
+rank in the enclosing `air.rank` scope performs the same allocation in lockstep,
+so each rank has a memref of the same size at the same offset within the heap.
+Cross-rank addressing (via `air.dma_memcpy_nd` `src_rank`/`dst_rank` attributes
+or `air.channel` with `channel_type = "gpu_symmetric_heap"`) refers to peer
+ranks' symmetric memrefs at the same logical offset.
+
+```
+%buf = memref.alloc() {air.symmetric} : memref<1024xf32>
+```
+
+The GPU lowering routes such allocations through `mgpuSymmetricAlloc`
+(`runtime_lib/airgpu/gpu_runtime.cpp`) instead of plain `mgpuMemAlloc`.
+Peer ranks' base pointers are obtained via `mgpuGetHeapBases()`. The NPU
+backend does not interpret this attribute.
+
+---
+
 ## 3. NPU (AIE) Backend Mapping
 
 On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps
@@ -999,7 +1040,13 @@ See [buildingGPU.md](buildingGPU.md) for build instructions and the complete
 | L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
 | L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
 | L3 (space 0) | DDR via NOC | HBM via global memory |
-| `dma_memcpy_nd` | AIE Shim/Tile DMA engines | SCF load/store loops |
-| `channel` (`dma_stream`) | Streaming AXI-S switch | — (not yet mapped to GPU) |
-| Synchronization | AIE locks | `gpu.barrier` |
+| `dma_memcpy_nd` (intra-rank) | AIE Shim/Tile DMA engines | SCF load/store loops |
+| `dma_memcpy_nd` (cross-rank, `src_rank`/`dst_rank`) | — | Symmetric heap peer addressing (planned) |
+| `channel` (`npu_dma_stream`) | Streaming AXI-S switch | n/a |
+| `channel` (`npu_dma_packet`) | Packet-switched AXI-S overlay | n/a |
+| `channel` (`npu_cascade`) | Core cascade interface | n/a |
+| `channel` (`npu_mmio`) | Host MMIO blockwrite | n/a |
+| `channel` (`gpu_symmetric_heap`) | n/a | Symmetric heap peer addressing (planned) |
+| `air.symmetric` memref alloc | n/a | `mgpuSymmetricAlloc` (planned) |
+| Synchronization | AIE locks | `gpu.barrier` (intra-rank), `mgpuBarrier` (cross-rank) |
 | `!air.token` (dependency) | AIE runtime completion signals | GPU stream/event dependencies |
@@ -0,0 +1,224 @@
+# Multi-GPU Messaging Support for AIR — Design and Implementation Plan
+
+## Context
+
+mlir-air today supports single-GPU lowering: `air.launch`/`air.segment`/`air.herd` lowers
+to a single `gpu.launch`, and `air.dma_memcpy_nd` lowers to thread-cooperative SCF loops
+within that launch (`mlir/lib/Conversion/AIRToROCDLPass.cpp`). The runtime side
+(`runtime_lib/airgpu/`) already implements a complete multi-process / multi-GPU shared-memory
+fabric (symmetric heap over XGMI peer access via VMem export/import), but the MLIR layer
+that *emits* code targeting this runtime does not exist yet.
+
+This document is a plan to bridge that gap so that AIR programs can express and execute
+**cross-GPU messaging** (point-to-point and collective) across multiple GPUs in the same
+host (and, eventually, across hosts).
+
+## Current State
+
+### IR layer
+
+| Construct | Status | Reference |
+|-----------|--------|-----------|
+| `air.universe.alloc` (device pool) | Defined | `mlir/include/air/Dialect/AIR/AIR.td:22-33` |
+| `air.rank` (multi-device level above launch) | Defined | `AIR.td:35-126` |
+| `air.rank` lowering | Serialized to `scf.for` (placeholder) | `mlir/lib/Conversion/ConvertToAIRPass.cpp:1977-2045` |
+| `air.dma_memcpy_nd` on GPU | Single-launch SCF loops | `AIRToROCDLPass.cpp:737-853` |
+| `air.channel*` on GPU | **Not lowered** | confirmed by `docs/AIRComputeModel.md:1003` |
+| Cross-rank memref / cross-GPU messaging in MLIR | **None** | — |
+
+The hierarchy is **rank → launch → segment → herd**. `air.launch` may be nested
+inside `air.rank`, but `air.rank` may not appear inside a segment, a herd, or
+another rank (`mlir/lib/Dialect/AIR/IR/AIRDialect.cpp:1291-1300`).
+
+### Runtime layer (complete)
+
+| Component | Purpose | File |
+|-----------|---------|------|
+| `VMemAllocator` | `hipMemCreate` + `hipMemMap` allocations exportable as POSIX FDs | `runtime_lib/airgpu/vmem_allocator.{h,cpp}` |
+| `fd_passing` | Full AF_UNIX socket mesh, `SCM_RIGHTS` fd exchange | `runtime_lib/airgpu/fd_passing.{h,cpp}` |
+| `SymmetricHeap` | Collective constructor; per-rank heap bases mapped into local VA via XGMI peer access | `runtime_lib/airgpu/symmetric_heap.{h,cpp}` |
+| C ABI extensions | `mgpuSymmetricHeapInit/Destroy`, `mgpuGetRank/WorldSize`, `mgpuSymmetricAlloc/Free`, `mgpuGetHeapBase(rank)`, `mgpuGetHeapBases()`, `mgpuBarrier`, `mgpuSetDevice` | `runtime_lib/airgpu/gpu_runtime.cpp:256-317` |
+| Validated test | N processes, cross-rank XGMI read | `test/gpu/test_symmetric_heap.cpp`, `test/gpu/run_symmetric_heap_test.sh` |
+
+### NPU reference
+
+`air.channel` lowering for AIE/NPU
+(`mlir/lib/Conversion/AIRToAIEPass.cpp:2200-2330`, `LowerAIRChannelsPattern`) lowers
+`air.channel` to `aie.objectfifo`, finding puts/gets through symbol lookup. This
+named-symbol decoupled producer/consumer model is the right shape to reuse for
+GPU symmetric-heap messaging.
+
+## Design
+
+### Op design
+
+Two complementary additions, staged:
+
+**A. New `channel_type = "gpu_symmetric_heap"` on `air.channel`.**
+Existing values are renamed with an `npu_` prefix to make backend scope explicit:
+`dma_stream` → `npu_dma_stream`, `dma_packet` → `npu_dma_packet`, `cascade` →
+`npu_cascade`, `mmio` → `npu_mmio` (and `ddr_stream` → `npu_ddr_stream` if used).
+The new GPU value is `gpu_symmetric_heap`. This keeps `channel_type` as a single
+attribute that names both the backend (`npu_` / `gpu_`) and the mechanism within it.
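
The rename can be summarized as a lookup table. The sketch below is illustrative only: `migrateChannelType` and `channelBackend` are hypothetical helper names that do not exist in mlir-air, and the table simply restates the mapping listed above.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of mlir-air): maps a legacy channel_type
// string to its backend-prefixed replacement, as listed in the plan above.
std::string migrateChannelType(const std::string &legacy) {
  static const std::unordered_map<std::string, std::string> renames = {
      {"dma_stream", "npu_dma_stream"},
      {"dma_packet", "npu_dma_packet"},
      {"cascade", "npu_cascade"},
      {"mmio", "npu_mmio"},
      {"ddr_stream", "npu_ddr_stream"},
  };
  auto it = renames.find(legacy);
  // Already-prefixed or unknown values pass through unchanged.
  return it == renames.end() ? legacy : it->second;
}

// Hypothetical helper: returns "npu" or "gpu" based on the prefix, or an
// empty string when the value carries no backend prefix.
std::string channelBackend(const std::string &type) {
  if (type.rfind("npu_", 0) == 0) return "npu";
  if (type.rfind("gpu_", 0) == 0) return "gpu";
  return "";
}
```

With this shape, a backend can reject a channel whose prefix does not match it before inspecting the mechanism part of the name.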
+
+A put on rank A and a matching get on rank B exchange data through the symmetric heap.
+Either the channel's `indices` operand encodes the destination/source rank, or a
+separate `rank` operand is added. The verifier must require the enclosing scope to be
+`air.rank`-aware for `gpu_symmetric_heap`.
+
+This reuses the named-symbol decoupling that AIE channels already provide and matches
+the test pattern of "rank A writes, rank B reads at the same offset."
+
+**B. Rank-aware `air.dma_memcpy_nd`.**
+Optional `src_rank` / `dst_rank` operand (or attribute) on `air.dma_memcpy_nd`. When
+present, the corresponding memref refers semantically to "the same allocation but on
+rank R's heap." This is mechanically simpler for a first pass; it's a strict subset
+of what (A) can express.
+
+Recommendation: **implement (B) first** to demonstrate end-to-end cross-rank data
+movement through the existing DMA lowering, then add (A) for production patterns
+(broadcast/multicast, decoupled producer/consumer, balance-checked communication).
+
+### Symmetric memref tagging
+
+New attribute `air.symmetric` (or a dedicated address-space tag) on `memref.alloc`
+distinguishes symmetric-heap allocations from regular global memory. The
+`air-to-rocdl` pass routes tagged allocs to `mgpuSymmetricAlloc` instead of
+`mgpuMemAlloc`. The same offset is allocated on every rank (collective allocation
+guaranteed by symmetric-heap semantics), so cross-rank addressing is
+`mgpuGetHeapBases()[peer] + offset_of(buf)`.
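
A minimal host-only model of this addressing rule follows. It is a sketch under stated assumptions: `SymmetricHeapModel`, the bump-pointer allocation policy, and the 256-byte default alignment are all hypothetical and do not describe how `runtime_lib/airgpu/` is implemented. The only point is that identical allocation sequences yield identical offsets, so a peer address is `bases[peer] + offset`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one rank's view of the symmetric heap.
// Nothing here calls the real runtime.
struct SymmetricHeapModel {
  std::vector<std::uintptr_t> bases; // bases[r]: rank r's heap base in local VA
  std::size_t heapSize;              // bytes available per rank
  std::size_t next = 0;              // bump pointer, identical on every rank

  // Collective allocation: every rank calls this with the same sizes in the
  // same order, so every rank computes the same offset.
  std::size_t allocOffset(std::size_t bytes, std::size_t align = 256) {
    assert(align != 0 && (align & (align - 1)) == 0 && "power-of-two align");
    std::size_t offset = (next + align - 1) & ~(align - 1);
    assert(offset + bytes <= heapSize && "symmetric heap exhausted");
    next = offset + bytes;
    return offset;
  }

  // Address of the same allocation on a peer rank: bases[peer] + offset.
  std::uintptr_t peerAddress(std::size_t peer, std::size_t offset) const {
    assert(peer < bases.size());
    return bases[peer] + offset;
  }
};
```

Each rank holds its own instance with its own `bases` values (the peer heaps are mapped at different local virtual addresses), while the offsets agree across ranks.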
+
+### Process model
+
+`air.rank` becomes a real multi-process construct on GPU. Each process executes the
+entire body of the rank op, with `air.rank.id` resolving to `mgpuGetRank()`. The
+host-side launcher (an extension to `aircc`) forks `WORLD_SIZE` processes, setting
+`RANK=i` and `LOCAL_RANK=i` in the i-th process; all of them load the same compiled
+binary linked against `libairgpu.so`.
+
+## Implementation Phases
+
+### Phase 1 — Op extensions
+- Add `channel_type = "gpu_symmetric_heap"` to the `ChannelOp` verifier
+  (`AIRDialect.cpp:3296`)
+- Add optional `src_rank` / `dst_rank` operand or attribute to
+  `air.dma_memcpy_nd` (`AIR.td:458-501`)
+- Add `air.symmetric` memref attribute / address-space tag
+- Document semantics in `docs/AIRComputeModel.md`
+
+### Phase 2 — `air-rank-to-mgpu` pass
+Replaces `air-rank-to-launch` in the GPU pipeline. New file
+`mlir/lib/Conversion/AIRRankToMgpuPass.cpp`:
+- Lowers `air.rank.id` → `mgpuGetRank()`
+- Lowers `air.rank.size` → `mgpuGetWorldSize()`
+- Inserts `mgpuSymmetricHeapInit(heap_size)` at function entry,
+  `mgpuSymmetricHeapDestroy()` at exit (or assumes the launcher does this)
+- The body of `air.rank` is moved to a per-process function (no `scf.for` wrapping)
+
+### Phase 3 — Symmetric alloc lowering
+- Extend `hoistAlloc` (or add a sibling pass) in `AIRToROCDLPass.cpp:570-601` to
+  recognize `air.symmetric`-tagged `memref.alloc` ops and lower them to
+  `mgpuSymmetricAlloc` calls (emitted on the host, not as GPU workgroup attributions)
+- Memrefs from symmetric alloc remain in address space 0 (global) but carry an
+  attribute that downstream lowering uses to detect peer-access addressing
+
+### Phase 4 — Cross-rank DMA lowering
+Extend `convertDMAToGPUMemcpy` in `AIRToROCDLPass.cpp:737-853`:
+- Detect peer-tagged operand (rank attribute / op operand)
+- Resolve `mgpuGetHeapBases()` once at kernel launch (host-side), pass the peer
+  base pointer into the kernel as an extra argument
+- Inside the kernel, replace the source memref base with `peer_base + offset`
+- Use the same thread-cooperative load/store template as today
+- Insert `mgpuBarrier()` (host-side) at synchronization points before/after the
+  cross-rank transfer
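
The indexing of the thread-cooperative copy with a peer base can be checked on the host. The sketch below is an assumption-laden model, not the code that `convertDMAToGPUMemcpy` emits: `cooperativeCopy` and `emulateLaunch` are hypothetical names, the transfer is 1-D with unit stride, and the GPU threads are emulated by a sequential loop.

```cpp
#include <cassert>
#include <cstddef>

// Models one thread's share of a thread-cooperative copy: thread `tid` of
// `numThreads` copies elements tid, tid + numThreads, tid + 2*numThreads, ...
// On a GPU all threads would run concurrently; this is plain host code.
void cooperativeCopy(float *dst, const float *peerBase, std::size_t offsetElems,
                     std::size_t count, std::size_t tid,
                     std::size_t numThreads) {
  assert(numThreads > 0 && tid < numThreads);
  const float *src = peerBase + offsetElems; // peer_base + offset
  for (std::size_t i = tid; i < count; i += numThreads)
    dst[i] = src[i];
}

// Host-side emulation of the launch: run every "thread" in turn.
void emulateLaunch(float *dst, const float *peerBase, std::size_t offsetElems,
                   std::size_t count, std::size_t numThreads) {
  for (std::size_t tid = 0; tid < numThreads; ++tid)
    cooperativeCopy(dst, peerBase, offsetElems, count, tid, numThreads);
}
```

The only difference from an intra-rank copy in this model is where `src` comes from; the loop structure is unchanged, which is why the plan reuses the existing load/store template.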
+
+### Phase 5 — `air.channel` on GPU
+Add a new pattern (parallel to `convertDMAToGPUMemcpy`) for `air.channel.put` /
+`air.channel.get` with `channel_type = "gpu_symmetric_heap"`:
+- Producer (put): same thread-cooperative loop as DMA, writing into the
+  symmetric-heap slot at `bases[my_rank] + slot_offset`
+- Consumer (get): cooperative loop reading from `bases[peer_rank] + slot_offset`
+- Synchronization: per-channel notify-flag word in the symmetric heap, polled by
+  the consumer, set by the producer (matches `depth = 1` rendezvous semantics)
+- For `depth > 1`, allocate `depth` slots and a head/tail index per channel
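
The slot and head/tail bookkeeping described above can be modeled as a single-producer, single-consumer ring. This is a sketch only: `ChannelModel` is a hypothetical name, the state lives in ordinary host memory rather than in the symmetric heap, and `std::atomic` stands in for whatever cross-device memory ordering the real lowering would need, which is not modeled here. With `depth = 1` the head/tail pair behaves like a single notify flag.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one channel's state: `depth` payload slots plus
// monotonically increasing head/tail counters. One producer, one consumer.
class ChannelModel {
public:
  ChannelModel(std::size_t depth, std::size_t slotElems)
      : depth_(depth), slotElems_(slotElems), slots_(depth * slotElems) {
    assert(depth > 0 && slotElems > 0);
  }

  // Producer side (put): returns false when every slot is occupied.
  bool tryPut(const float *src) {
    std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail - head_.load(std::memory_order_acquire) == depth_)
      return false;
    float *slot = &slots_[(tail % depth_) * slotElems_];
    for (std::size_t i = 0; i < slotElems_; ++i)
      slot[i] = src[i];
    tail_.store(tail + 1, std::memory_order_release); // publish the slot
    return true;
  }

  // Consumer side (get): returns false when no slot has been published.
  bool tryGet(float *dst) {
    std::size_t head = head_.load(std::memory_order_relaxed);
    if (tail_.load(std::memory_order_acquire) == head)
      return false;
    const float *slot = &slots_[(head % depth_) * slotElems_];
    for (std::size_t i = 0; i < slotElems_; ++i)
      dst[i] = slot[i];
    head_.store(head + 1, std::memory_order_release); // free the slot
    return true;
  }

private:
  std::size_t depth_;
  std::size_t slotElems_;
  std::vector<float> slots_;
  std::atomic<std::size_t> head_{0};
  std::atomic<std::size_t> tail_{0};
};
```

A polling consumer would call `tryGet` in a loop until it returns true; a blocking `put` would do the same with `tryPut`.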
+
+### Phase 6 — `aircc` launcher integration
+- Add a runner mode (e.g. `--multi-rank=N`) that forks `N` processes with
+  `RANK`/`WORLD_SIZE`/`LOCAL_RANK` env vars set
+- Each child execs the same compiled binary linked with `libairgpu.so`
+- Host-side cleanup waits for all children
+- Reuse the pattern from `test/gpu/run_symmetric_heap_test.sh`
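
The fork-per-rank pattern can be sketched with POSIX calls. This is not the `aircc` implementation: `launchRanks` is a hypothetical name, and the `rankMain` callback stands in for exec'ing the compiled binary. The sketch sets the three environment variables named above and waits for every child, but it does not terminate the remaining ranks when one of them fails.

```cpp
#include <cstdlib>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical sketch: fork one process per rank, set RANK / LOCAL_RANK /
// WORLD_SIZE in each child, and return the number of ranks that failed.
int launchRanks(int worldSize, int (*rankMain)()) {
  std::vector<pid_t> children;
  for (int rank = 0; rank < worldSize; ++rank) {
    pid_t pid = fork();
    if (pid < 0)
      break; // fork failed: stop spawning, still reap what was started
    if (pid == 0) {
      std::string r = std::to_string(rank);
      setenv("RANK", r.c_str(), 1);
      setenv("LOCAL_RANK", r.c_str(), 1);
      setenv("WORLD_SIZE", std::to_string(worldSize).c_str(), 1);
      _exit(rankMain()); // a real launcher would exec the binary here
    }
    children.push_back(pid);
  }
  // Ranks that were never spawned count as failures.
  int failures = worldSize - static_cast<int>(children.size());
  for (pid_t pid : children) {
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
      ++failures;
  }
  return failures;
}
```

Setting `LOCAL_RANK` equal to `RANK` matches the single-host case described in this plan; a multi-host launcher would need to distinguish the two.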
+
+### Phase 7 — End-to-end test
+Add `test/gpu/symmetric_heap_dma/`:
+
+```mlir
+%c2 = arith.constant 2 : index
+air.rank (%r) in (%nr = %c2) {
+  %buf = memref.alloc() {air.symmetric} : memref<1024xf32, 0>
+  air.launch (%bx) in (%g = %c1) args(%b = %buf, %ri = %r) ... {
+    air.segment ... {
+      air.herd tile (%t) in (%nt = %c256) ... {
+        // Initialize on rank 0
+        scf.if %is_rank_0 { /* store pattern into %b */ }
+        // Cross-rank DMA: get from rank 0 into local
+        air.dma_memcpy_nd (%local[][][], %b[][][]) {src_rank = 0}
+            : (memref<1024xf32, 2>, memref<1024xf32, 0>)
+        // Verify on rank 1
+      }
+    }
+  }
+}
+```
+
+Driver script: same shape as `run_symmetric_heap_test.sh` — fork N processes,
+each loads the compiled binary, validates cross-rank read.
+
+## Key file paths
+
+### Existing
+- Op defs: `mlir/include/air/Dialect/AIR/AIR.td` (rank/universe at L22-139, channel at L535-594, DMA at L458-501)
+- Op verifiers: `mlir/lib/Dialect/AIR/IR/AIRDialect.cpp` (rank L890-1300, channel L3018-3300)
+- Rank-to-launch pass: `mlir/lib/Conversion/ConvertToAIRPass.cpp:1977-2045`
+- ROCDL conversion: `mlir/lib/Conversion/AIRToROCDLPass.cpp` (DMA lowering at L737-853, hoistAlloc at L570-601)
+- AIE channel reference: `mlir/lib/Conversion/AIRToAIEPass.cpp:2200-2330`
+- Runtime: `runtime_lib/airgpu/{symmetric_heap,vmem_allocator,fd_passing,gpu_runtime}.{cpp,h}`
+- Tests: `test/gpu/test_symmetric_heap.cpp`, `test/gpu/run_symmetric_heap_test.sh`,
+  `mlir/test/Dialect/AIR/air_rank.mlir`,
+  `mlir/test/Conversion/AIRRankToLaunch/rank_to_launch.mlir`
+- Doc: `docs/AIRComputeModel.md` (§4 GPU mapping, §5 summary table)
+
+### To create
+- `mlir/include/air/Conversion/AIRRankToMgpuPass.h`
+- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`
+- `test/gpu/symmetric_heap_dma/` (test inputs + run.sh)
+- `mlir/test/Conversion/AIRToROCDL/cross_rank_dma.mlir` (FileCheck unit test)
+- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir` (FileCheck unit test)
+
+## Verification
+
+For each phase:
+
+1. **Unit tests (FileCheck)** under `mlir/test/Conversion/` checking the IR shape
+   after each pass.
+2. **End-to-end run** on a multi-GPU host (e.g., MI300A):
+   ```bash
+   bash test/gpu/symmetric_heap_dma/run.sh
+   # Expected: "PASS: cross-rank read matches"
+   ```
+3. **Regression**: existing single-GPU tests
+   (`test/gpu/4k_4k_mul/run.sh`, `test/gpu/dma_copy/run.sh`,
+   `test/gpu/simple_test/`) must continue to pass.
+
+## Open questions
+
+- **Inter-host scaling**: the current `fd_passing` uses AF_UNIX sockets, so it's
+  single-host. Multi-host requires switching to TCP + RDMA or layering on
+  RCCL/UCX. Out of scope for the first version.
+- **Synchronization semantics for channels**: rendezvous (`depth=1`) is simple
+  via flag-poll; for `depth>1` an in-heap ring buffer is needed. The first version
+  can constrain channels to `depth=1`.
+- **Async tokens**: the current GPU DMA lowering does not handle async form
+  (`AIRToROCDLPass.cpp:735` TODO). Cross-rank async is even more complex; first
+  version should restrict to synchronous cross-rank ops.
+- **Error handling**: process-group cleanup if one rank crashes — `aircc`
+  launcher must terminate the whole group.