|
| 1 | +# Multi-GPU Messaging Support for AIR — Design and Implementation Plan |
| 2 | + |
| 3 | +## Context |
| 4 | + |
| 5 | +mlir-air today supports single-GPU lowering: `air.launch`/`air.segment`/`air.herd` lowers |
| 6 | +to a single `gpu.launch`, and `air.dma_memcpy_nd` lowers to thread-cooperative SCF loops |
| 7 | +within that launch (`mlir/lib/Conversion/AIRToROCDLPass.cpp`). The runtime side |
| 8 | +(`runtime_lib/airgpu/`) already implements a complete multi-process / multi-GPU shared-memory |
| 9 | +fabric (symmetric heap over XGMI peer access via VMem export/import), but the MLIR layer |
| 10 | +that *emits* code targeting this runtime does not exist yet. |
| 11 | + |
| 12 | +This document is a plan to bridge that gap so that AIR programs can express and execute |
| 13 | +**cross-GPU messaging** (point-to-point and collective) across multiple GPUs in the same |
| 14 | +host (and, eventually, across hosts). |
| 15 | + |
| 16 | +## Current State |
| 17 | + |
| 18 | +### IR layer |
| 19 | + |
| 20 | +| Construct | Status | Reference | |
| 21 | +|-----------|--------|-----------| |
| 22 | +| `air.universe.alloc` (device pool) | Defined | `mlir/include/air/Dialect/AIR/AIR.td:22-33` | |
| 23 | +| `air.rank` (multi-device level above launch) | Defined | `AIR.td:35-126` | |
| 24 | +| `air.rank` lowering | Serialized to `scf.for` (placeholder) | `mlir/lib/Conversion/ConvertToAIRPass.cpp:1977-2045` | |
| 25 | +| `air.dma_memcpy_nd` on GPU | Single-launch SCF loops | `AIRToROCDLPass.cpp:737-853` | |
| 26 | +| `air.channel*` on GPU | **Not lowered** | confirmed by `docs/AIRComputeModel.md:1003` | |
| 27 | +| Cross-rank memref / cross-GPU messaging in MLIR | **None** | — | |
| 28 | + |
| 29 | +Hierarchy is **rank → launch → segment → herd**. `air.launch` may be nested inside |
| 30 | +`air.rank`, but `air.rank` may not be nested inside a segment, a herd, or another rank |
| 31 | +(`mlir/lib/Dialect/AIR/IR/AIRDialect.cpp:1291-1300`). |
| 32 | + |
| 33 | +### Runtime layer (complete) |
| 34 | + |
| 35 | +| Component | Purpose | File | |
| 36 | +|-----------|---------|------| |
| 37 | +| `VMemAllocator` | `hipMemCreate` + `hipMemMap` allocations exportable as POSIX FDs | `runtime_lib/airgpu/vmem_allocator.{h,cpp}` | |
| 38 | +| `fd_passing` | Full AF_UNIX socket mesh, `SCM_RIGHTS` fd exchange | `runtime_lib/airgpu/fd_passing.{h,cpp}` | |
| 39 | +| `SymmetricHeap` | Collective constructor; per-rank heap bases mapped into local VA via XGMI peer access | `runtime_lib/airgpu/symmetric_heap.{h,cpp}` | |
| 40 | +| C ABI extensions | `mgpuSymmetricHeapInit/Destroy`, `mgpuGetRank/WorldSize`, `mgpuSymmetricAlloc/Free`, `mgpuGetHeapBase(rank)`, `mgpuGetHeapBases()`, `mgpuBarrier`, `mgpuSetDevice` | `runtime_lib/airgpu/gpu_runtime.cpp:256-317` | |
| 41 | +| Validated test | N processes, cross-rank XGMI read | `test/gpu/test_symmetric_heap.cpp`, `test/gpu/run_symmetric_heap_test.sh` | |
| 42 | + |
| 43 | +### NPU reference |
| 44 | + |
| 45 | +`air.channel` lowering for AIE/NPU |
| 46 | +(`mlir/lib/Conversion/AIRToAIEPass.cpp:2200-2330`, `LowerAIRChannelsPattern`) lowers |
| 47 | +`air.channel` to `aie.objectfifo`, finding puts/gets through symbol lookup. This |
| 48 | +named-symbol decoupled producer/consumer model is the right shape to reuse for |
| 49 | +GPU symmetric-heap messaging. |
| 50 | + |
| 51 | +## Design |
| 52 | + |
| 53 | +### Op design |
| 54 | + |
| 55 | +Two complementary additions, staged: |
| 56 | + |
| 57 | +**A. New `channel_type = "gpu_symmetric_heap"` on `air.channel`.** |
| 58 | +Existing values are renamed with a `npu_` prefix to make backend scope explicit: |
| 59 | +`dma_stream` → `npu_dma_stream`, `dma_packet` → `npu_dma_packet`, `cascade` → |
| 60 | +`npu_cascade`, `mmio` → `npu_mmio` (and `ddr_stream` → `npu_ddr_stream` if used). |
| 61 | +The new GPU value is `gpu_symmetric_heap`. This keeps `channel_type` as a single |
| 62 | +attribute that names both the backend (`npu_` / `gpu_`) and the mechanism within it. |
| 63 | + |
| 64 | +A put on rank A and a matching get on rank B exchange data through the symmetric heap. |
| 65 | +The destination/source rank is encoded either in the channel's `indices` operand or in |
| 66 | +a separate, newly added `rank` operand. The verifier must require the enclosing scope to be |
| 67 | +`air.rank`-aware for `gpu_symmetric_heap`. |
| 68 | + |
| 69 | +This reuses the named-symbol decoupling that AIE channels already provide and matches |
| 70 | +the test pattern of "rank A writes, rank B reads at the same offset." |
| 71 | + |
| 72 | +**B. Rank-aware `air.dma_memcpy_nd`.** |
| 73 | +Optional `src_rank` / `dst_rank` operand (or attribute) on `air.dma_memcpy_nd`. When |
| 74 | +present, the corresponding memref refers semantically to "the same allocation but on |
| 75 | +rank R's heap." This is mechanically simpler for a first pass; it's a strict subset |
| 76 | +of what (A) can express. |
| 77 | + |
| 78 | +Recommendation: **implement (B) first** to demonstrate end-to-end cross-rank data |
| 79 | +movement through the existing DMA lowering, then add (A) for production patterns |
| 80 | +(broadcast/multicast, decoupled producer/consumer, balance-checked communication). |
| 81 | + |
| 82 | +### Symmetric memref tagging |
| 83 | + |
| 84 | +New attribute `air.symmetric` (or a dedicated address-space tag) on `memref.alloc` |
| 85 | +distinguishes symmetric-heap allocations from regular global memory. The |
| 86 | +`air-to-rocdl` pass routes tagged allocs to `mgpuSymmetricAlloc` instead of |
| 87 | +`mgpuMemAlloc`. The same offset is allocated on every rank (collective allocation |
| 88 | +guaranteed by symmetric-heap semantics), so cross-rank addressing is |
| 89 | +`mgpuGetHeapBases()[peer] + offset_of(buf)`. |
| 90 | + |
| 91 | +### Process model |
| 92 | + |
| 93 | +`air.rank` becomes a real multi-process construct on GPU. Each process executes the |
| 94 | +entire body of the rank op, with `air.rank.id` resolving to `mgpuGetRank()`. The |
| 95 | +host-side launcher (an extension to `aircc`) forks `WORLD_SIZE` processes; process `i` runs |
| 96 | +with `RANK=i` and `LOCAL_RANK=i` set, and all processes load the same compiled binary linked against `libairgpu.so`. |
| 97 | + |
| 98 | +## Implementation Phases |
| 99 | + |
| 100 | +### Phase 1 — Op extensions |
| 101 | +- Add `channel_type = "gpu_symmetric_heap"` to the `ChannelOp` verifier |
| 102 | + (`AIRDialect.cpp:3296`) |
| 103 | +- Add optional `src_rank` / `dst_rank` operand or attribute to |
| 104 | + `air.dma_memcpy_nd` (`AIR.td:458-501`) |
| 105 | +- Add `air.symmetric` memref attribute / address-space tag |
| 106 | +- Document semantics in `docs/AIRComputeModel.md` |
| 107 | + |
| 108 | +### Phase 2 — `air-rank-to-mgpu` pass |
| 109 | +Replaces `air-rank-to-launch` in the GPU pipeline. New file |
| 110 | +`mlir/lib/Conversion/AIRRankToMgpuPass.cpp`: |
| 111 | +- Lowers `air.rank.id` → `mgpuGetRank()` |
| 112 | +- Lowers `air.rank.size` → `mgpuGetWorldSize()` |
| 113 | +- Inserts `mgpuSymmetricHeapInit(heap_size)` at function entry, |
| 114 | + `mgpuSymmetricHeapDestroy()` at exit (or assumes the launcher does this) |
| 115 | +- Moves the body of `air.rank` into a per-process function (no `scf.for` wrapping) |
| 116 | + |
| 117 | +### Phase 3 — Symmetric alloc lowering |
| 118 | +- Extend `hoistAlloc` (or add a sibling pass) in `AIRToROCDLPass.cpp:570-601` to |
| 119 | + recognize `air.symmetric`-tagged `memref.alloc` and lower them to |
| 120 | + `mgpuSymmetricAlloc` calls (host-side, not GPU workgroup attribution) |
| 121 | +- Memrefs from symmetric alloc remain in address space 0 (global) but carry an |
| 122 | + attribute that downstream lowering uses to detect peer-access addressing |
| 123 | + |
| 124 | +### Phase 4 — Cross-rank DMA lowering |
| 125 | +Extend `convertDMAToGPUMemcpy` in `AIRToROCDLPass.cpp:737-853`: |
| 126 | +- Detect peer-tagged operand (rank attribute / op operand) |
| 127 | +- Resolve `mgpuGetHeapBases()` once at kernel launch (host-side), pass the peer |
| 128 | + base pointer into the kernel as an extra argument |
| 129 | +- Inside the kernel, replace the base of the peer-tagged memref (the source for `src_rank`, the destination for `dst_rank`) with `peer_base + offset` |
| 130 | +- Use the same thread-cooperative load/store template as today |
| 131 | +- Insert `mgpuBarrier()` (host-side) at synchronization points before/after the |
| 132 | + cross-rank transfer |
| 133 | + |
| 134 | +### Phase 5 — `air.channel` on GPU |
| 135 | +Add a new pattern (parallel to `convertDMAToGPUMemcpy`) for `air.channel.put` / |
| 136 | +`air.channel.get` with `channel_type = "gpu_symmetric_heap"`: |
| 137 | +- Producer (put): same thread-cooperative loop as DMA, writing into the |
| 138 | + symmetric-heap slot at `bases[my_rank] + slot_offset` |
| 139 | +- Consumer (get): cooperative loop reading from `bases[peer_rank] + slot_offset` |
| 140 | +- Synchronization: per-channel notify-flag word in the symmetric heap, polled by |
| 141 | + the consumer, set by the producer (matches `depth = 1` rendezvous semantics) |
| 142 | +- For `depth > 1`, allocate `depth` slots and a head/tail index per channel |
| 143 | + |
| 144 | +### Phase 6 — `aircc` launcher integration |
| 145 | +- Add a runner mode (e.g. `--multi-rank=N`) that forks `N` processes with |
| 146 | + `RANK`/`WORLD_SIZE`/`LOCAL_RANK` env vars set |
| 147 | +- Each child execs the same compiled binary linked with `libairgpu.so` |
| 148 | +- Host-side cleanup waits for all children |
| 149 | +- Reuse the pattern from `test/gpu/run_symmetric_heap_test.sh` |
| 150 | + |
| 151 | +### Phase 7 — End-to-end test |
| 152 | +Add `test/gpu/symmetric_heap_dma/`: |
| 153 | + |
| 154 | +```mlir |
| 155 | +%c2 = arith.constant 2 : index |
| 156 | +air.rank (%r) in (%nr = %c2) { |
| 157 | +  %buf = memref.alloc() {air.symmetric} : memref<1024xf32, 0> |
| 158 | +  air.launch (%bx) in (%g = %c1) args(%b = %buf, %ri = %r) ... { |
| 159 | +    air.segment ... { |
| 160 | +      air.herd tile (%t) in (%nt = %c256) ... { |
| 161 | +        // Initialize on rank 0 |
| 162 | +        scf.if %is_rank_0 { /* store pattern into %b */ } |
| 163 | +        // Cross-rank DMA: get from rank 0 into local |
| 164 | +        air.dma_memcpy_nd (%local[][][], %b[][][]) {src_rank = 0} |
| 165 | +          : (memref<1024xf32, 2>, memref<1024xf32, 0>) |
| 166 | +        // Verify on rank 1 |
| 167 | +      } |
| 168 | +    } |
| 169 | +  } |
| 170 | +} |
| 171 | +``` |
| 172 | + |
| 173 | +Driver script: same shape as `run_symmetric_heap_test.sh` — fork N processes, |
| 174 | +each loads the compiled binary, validates cross-rank read. |
| 175 | + |
| 176 | +## Key file paths |
| 177 | + |
| 178 | +### Existing |
| 179 | +- Op defs: `mlir/include/air/Dialect/AIR/AIR.td` (rank/universe at L22-139, channel at L535-594, DMA at L458-501) |
| 180 | +- Op verifiers: `mlir/lib/Dialect/AIR/IR/AIRDialect.cpp` (rank L890-1300, channel L3018-3300) |
| 181 | +- Rank-to-launch pass: `mlir/lib/Conversion/ConvertToAIRPass.cpp:1977-2045` |
| 182 | +- ROCDL conversion: `mlir/lib/Conversion/AIRToROCDLPass.cpp` (DMA lowering at L737-853, hoistAlloc at L570-601) |
| 183 | +- AIE channel reference: `mlir/lib/Conversion/AIRToAIEPass.cpp:2200-2330` |
| 184 | +- Runtime: `runtime_lib/airgpu/{symmetric_heap,vmem_allocator,fd_passing,gpu_runtime}.{cpp,h}` |
| 185 | +- Tests: `test/gpu/test_symmetric_heap.cpp`, `test/gpu/run_symmetric_heap_test.sh`, |
| 186 | + `mlir/test/Dialect/AIR/air_rank.mlir`, |
| 187 | + `mlir/test/Conversion/AIRRankToLaunch/rank_to_launch.mlir` |
| 188 | +- Doc: `docs/AIRComputeModel.md` (§4 GPU mapping, §5 summary table) |
| 189 | + |
| 190 | +### To create |
| 191 | +- `mlir/include/air/Conversion/AIRRankToMgpuPass.h` |
| 192 | +- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp` |
| 193 | +- `test/gpu/symmetric_heap_dma/` (test inputs + run.sh) |
| 194 | +- `mlir/test/Conversion/AIRToROCDL/cross_rank_dma.mlir` (FileCheck unit test) |
| 195 | +- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir` (FileCheck unit test) |
| 196 | + |
| 197 | +## Verification |
| 198 | + |
| 199 | +For each phase: |
| 200 | + |
| 201 | +1. **Unit tests (FileCheck)** under `mlir/test/Conversion/` checking the IR shape |
| 202 | + after each pass. |
| 203 | +2. **End-to-end run** on a multi-GPU host (e.g., MI300A): |
| 204 | + ```bash |
| 205 | + bash test/gpu/symmetric_heap_dma/run.sh |
| 206 | + # Expected: "PASS: cross-rank read matches" |
| 207 | + ``` |
| 208 | +3. **Regression**: existing single-GPU tests |
| 209 | + (`test/gpu/4k_4k_mul/run.sh`, `test/gpu/dma_copy/run.sh`, |
| 210 | + `test/gpu/simple_test/`) must continue to pass. |
| 211 | + |
| 212 | +## Open questions |
| 213 | + |
| 214 | +- **Inter-host scaling**: the current `fd_passing` uses AF_UNIX sockets, so it's |
| 215 | + single-host. Multi-host requires switching to TCP + RDMA or layering on |
| 216 | + RCCL/UCX. Out of scope for the first version. |
| 217 | +- **Synchronization semantics for channels**: rendezvous (`depth=1`) is simple |
| 218 | + via flag polling; for `depth>1` an in-heap ring buffer is needed. The first version |
| 219 | + can be constrained to `depth=1`. |
| 220 | +- **Async tokens**: the current GPU DMA lowering does not handle the async form |
| 221 | + (`AIRToROCDLPass.cpp:735` TODO). Cross-rank async is even more complex; the first |
| 222 | + version should be restricted to synchronous cross-rank ops. |
| 223 | +- **Error handling**: process-group cleanup if one rank crashes — `aircc` |
| 224 | + launcher must terminate the whole group. |
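The launcher behaviour described in Phase 6 and in the error-handling question above can be sketched in Python. This is a minimal sketch under stated assumptions, not `aircc` code: `rank_env` and `launch_ranks` are hypothetical helper names, and the only details taken from this plan are the `RANK`/`WORLD_SIZE`/`LOCAL_RANK` variables (with `LOCAL_RANK` equal to `RANK` on a single host) and the rule that the whole group is terminated when one rank fails.

```python
import os
import subprocess
import time


def rank_env(rank, world_size, base_env=None):
    """Build the environment for one rank (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Single-host assumption: LOCAL_RANK is the same as RANK.
    env.update(RANK=str(rank), WORLD_SIZE=str(world_size), LOCAL_RANK=str(rank))
    return env


def launch_ranks(argv, world_size, poll_interval=0.05):
    """Start one process per rank; kill the whole group if any rank fails.

    Returns 0 when every rank exits cleanly, otherwise the first non-zero
    exit code that was observed.
    """
    procs = [
        subprocess.Popen(argv, env=rank_env(rank, world_size))
        for rank in range(world_size)
    ]
    try:
        while True:
            codes = [p.poll() for p in procs]
            failed = [c for c in codes if c not in (None, 0)]
            if failed:
                return failed[0]
            if all(c == 0 for c in codes):
                return 0
            time.sleep(poll_interval)
    finally:
        # Terminate any rank that is still running, then reap all children.
        for p in procs:
            if p.poll() is None:
                p.kill()
        for p in procs:
            p.wait()
```

A call such as `launch_ranks(["./compiled_binary"], 2)` (the binary name is a placeholder) would start two ranks and return a non-zero code as soon as either one fails, which avoids leaving a surviving rank blocked in a collective call.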
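The cross-rank addressing rule from the symmetric memref tagging section, `mgpuGetHeapBases()[peer] + offset_of(buf)`, is plain pointer arithmetic and can be illustrated with a small host-side model. The sketch below is written in Python for brevity and makes two assumptions that are not stated in the plan: `offset_of(buf)` is taken to mean the distance of `buf` from the local rank's heap base, and the base addresses are invented values. `peer_address` is a hypothetical helper, not a runtime function.

```python
def peer_address(heap_bases, my_rank, peer_rank, local_ptr, heap_size):
    """Translate a pointer into the local symmetric heap to the peer's mapping.

    heap_bases[r] models the address at which rank r's heap is mapped into
    this process (what mgpuGetHeapBases() is described as returning).
    """
    offset = local_ptr - heap_bases[my_rank]
    if not 0 <= offset < heap_size:
        raise ValueError("pointer is not inside the local symmetric heap")
    # Symmetric allocation: the same offset is valid on every rank.
    return heap_bases[peer_rank] + offset


# Hypothetical base addresses for a two-rank world; real values come from the runtime.
heap_bases = [0x7F00_0000_0000, 0x7F10_0000_0000]
buf_on_rank0 = heap_bases[0] + 4096
buf_on_rank1 = peer_address(heap_bases, 0, 1, buf_on_rank0, heap_size=1 << 20)
```

In the Phase 4 lowering this translation would happen once on the host, with only the resulting peer base passed to the kernel as an extra argument.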