Commit 38b7e10

erwei-xilinx and claude committed
[multi-gpu] Phase 1: namespace channel_type, add cross-rank attrs, doc plan
Step toward multi-GPU messaging support per docs/MultiGPUPlan.md. Pure IR/dialect changes; no lowering yet.

## channel_type namespace rename (Option 1)

Existing channel_type values gain a `npu_` prefix to make backend scope explicit:

- `dma_stream` → `npu_dma_stream` (default)
- `dma_packet` → `npu_dma_packet`
- `cascade` → `npu_cascade`
- `mmio` → `npu_mmio`

Mechanical rename across 33 files (verifier, transform/conversion passes, all .mlir tests, Python programming examples).

## New channel_type for GPU multi-rank messaging

- `gpu_symmetric_heap`: cross-rank channel through the symmetric heap runtime (runtime_lib/airgpu/symmetric_heap.{h,cpp}). Verifier requires put/get sites to be inside an `air.rank` scope.

## air.dma_memcpy_nd cross-rank addressing

- New optional integer attributes `src_rank` / `dst_rank` name a peer rank in the enclosing `air.rank` scope.
- Verifier requires:
  - an enclosing `air.rank` scope
  - the peer-side memref's `memref.alloc` (when directly available) to carry the `air.symmetric` attribute
- Backward-compatible builder so existing call sites compile unchanged.

## air.symmetric memref attribute

A unit attribute on `memref.alloc` indicating the allocation is backed by the symmetric heap. Documented in docs/AIRComputeModel.md §2.7.

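The symmetric-heap idea behind `air.symmetric` can be illustrated with a small address-translation sketch. This is not the runtime's code. It assumes that every rank allocates at the same offset within its heap (as docs/AIRComputeModel.md §2.7 describes) and that a list of per-rank heap base addresses is available, which is what `mgpuGetHeapBases()` is said to provide. The function name `peer_address`, the `heap_size` parameter, and the Python list standing in for the base-pointer table are all hypothetical.

```python
from typing import Sequence


def peer_address(local_addr: int, my_rank: int, peer_rank: int,
                 heap_bases: Sequence[int], heap_size: int) -> int:
    """Translate a local symmetric-heap address into a peer rank's address.

    heap_bases[r] is assumed to be the base address of rank r's heap
    (a stand-in for whatever mgpuGetHeapBases() returns); heap_size is a
    hypothetical per-rank heap size used only for bounds checking.
    """
    if not (0 <= my_rank < len(heap_bases) and 0 <= peer_rank < len(heap_bases)):
        raise ValueError("rank out of range")
    # Offset of the buffer inside the local heap; by the lockstep-allocation
    # assumption, the peer's copy sits at the same offset in its own heap.
    offset = local_addr - heap_bases[my_rank]
    if not (0 <= offset < heap_size):
        raise ValueError("address is not inside the local symmetric heap")
    return heap_bases[peer_rank] + offset
```

Under these assumptions, a buffer at offset 0x40 on rank 0 maps to offset 0x40 on rank 1, which is why an allocation outside the symmetric heap cannot be addressed this way and why the verifier asks for the `air.symmetric` tag.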
## Documentation

- New docs/MultiGPUPlan.md: full design and 7-phase implementation plan
- docs/AIRComputeModel.md: §2.4 cross-rank addressing, §2.7 air.symmetric, §2.5 channel_type table updated, §5 summary table updated

## Tests

- mlir/test/Dialect/AIR/air_cross_rank_dma.mlir (new): positive round-trip for src_rank/dst_rank, air.symmetric memref, gpu_symmetric_heap channel put/get inside air.rank
- mlir/test/Dialect/AIR/air_channel_invalid.mlir: gpu_symmetric_heap put/get outside air.rank rejected; updated unsupported channel_type error message
- mlir/test/Dialect/AIR/air_memcpy_invalid.mlir: src_rank/dst_rank outside air.rank rejected; missing air.symmetric on alloc rejected

All 21 mlir/test/Dialect/AIR/ tests pass; GPU dma_copy and 4k_4k_mul e2e tests pass on MI300A.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
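For out-of-tree code that still uses the old names, the `channel_type` rename listed in the commit message could be applied with a small script along these lines. This is only a sketch of one possible approach, not the tool used for this commit. The function name `rename_channel_types` and the assumption that the attribute always appears as `channel_type = "<value>"` (optionally without spaces around `=`) are mine.

```python
import re

# Old -> new channel_type values, as listed in the commit message.
RENAMES = {
    "dma_stream": "npu_dma_stream",
    "dma_packet": "npu_dma_packet",
    "cascade": "npu_cascade",
    "mmio": "npu_mmio",
}

# Matches `channel_type = "<old>"` with optional whitespace around `=`.
# The opening quote must be followed directly by an old name, so values
# that already carry a prefix (npu_*, gpu_*) are left untouched.
_PATTERN = re.compile(
    r'(channel_type\s*=\s*")(' + "|".join(map(re.escape, RENAMES)) + r')(")'
)


def rename_channel_types(text: str) -> str:
    """Return `text` with legacy channel_type values given the npu_ prefix."""
    return _PATTERN.sub(
        lambda m: m.group(1) + RENAMES[m.group(2)] + m.group(3), text
    )
```

Because the pattern only matches un-prefixed values, running the function twice gives the same result as running it once, so it is safe to re-apply to a partially migrated file.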
1 parent f9e8a6e commit 38b7e10

45 files changed

Lines changed: 536 additions & 231 deletions

docs/AIRComputeModel.md

Lines changed: 56 additions & 9 deletions
@@ -621,7 +621,7 @@ dimensions depend on the target backend:
 The compiler may **reshape** the iteration space (e.g., collapse a 2D herd
 into a 1D arrangement) via the `AIRCollapseHerdPass`. Reshaping is inhibited
 automatically when the herd body uses cascade channels (`channel_type =
-"cascade"`), because cascade connections are topology-dependent and cannot
+"npu_cascade"`), because cascade connections are topology-dependent and cannot
 survive reindexing. Explicit placement attributes (`x_loc`, `y_loc`,
 `x_size`, `y_size`) on the enclosing segment also constrain the legal shapes
 by fixing the tile footprint. The pass accepts a `max-col-size` option to
@@ -670,13 +670,29 @@ address spaces of the operand memrefs and mapped to the appropriate hardware mec
 An empty `[offsets]`, `[sizes]`, or `[strides]` list for a side means the entire memref
 is addressed with unit strides.
 
+#### Cross-rank addressing (multi-GPU)
+
+Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
+enclosing `air.rank` scope. When present, the corresponding memref is
+interpreted as living on rank R's symmetric heap rather than on the local
+process. The verifier requires the op to be enclosed by an `air.rank` and the
+referenced memref to be `air.symmetric`-tagged (see §2.7). The GPU backend
+(`air-to-rocdl`) lowers cross-rank DMAs through `mgpuGetHeapBases()`-based
+peer addressing; the NPU backend does not support these attributes.
+
+```
+// Read 1024 floats from rank 0's symmetric buffer into local L1.
+air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
+: (memref<1024xf32, 2>, memref<1024xf32, 0>)
+```
+
 ---
 
 ### 2.5 `air.channel`, `air.channel.put`, `air.channel.get`
 
 ```
 // Channel declaration — at module scope
-air.channel @name [dim₀, dim₁, …] {channel_type = "dma_stream", depth = <N>}
+air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = <N>}
 
 // Synchronous put/get — block until the transfer completes
 air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
@@ -696,13 +712,17 @@ them independently and to introduce double-buffering.
 A channel may be an array (e.g., `[4, 4]` for a 4×4 array). The `indices` operand on
 `put`/`get` selects the specific channel within the array.
 
-The `channel_type` attribute controls the underlying mechanism:
+The `channel_type` attribute controls the underlying mechanism. Values are
+namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels
+use the `gpu_` prefix.
 
 | Value | Mechanism |
 |-------|-----------|
-| `"dma_stream"` (default) | DMA engines with streaming (circuit-switched) interconnect |
-| `"dma_packet"` | DMA engines with packet-switched interconnect |
-| `"cascade"` | Core-to-core cascade connections between adjacent tiles |
+| `"npu_dma_stream"` (default) | NPU: DMA engines with streaming (circuit-switched) interconnect |
+| `"npu_dma_packet"` | NPU: DMA engines with packet-switched interconnect |
+| `"npu_cascade"` | NPU: Core-to-core cascade connections between adjacent tiles |
+| `"npu_mmio"` | NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer |
+| `"gpu_symmetric_heap"` | GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing `air.rank` scope. |
 
 The `broadcast_shape` attribute enables one-to-many communication following NumPy
 broadcasting rules.
@@ -796,6 +816,27 @@ in the async dependency graph.
 
 ---
 
+### 2.7 `air.symmetric` memref attribute (multi-GPU)
+
+A `memref.alloc` op may carry the unit attribute `air.symmetric` to indicate
+that the allocation should be backed by the **symmetric heap** runtime. Every
+rank in the enclosing `air.rank` scope performs the same allocation in lockstep,
+so each rank has a memref of the same size at the same offset within the heap.
+Cross-rank addressing (via `air.dma_memcpy_nd` `src_rank`/`dst_rank` attributes
+or `air.channel` with `channel_type = "gpu_symmetric_heap"`) refers to peer
+ranks' symmetric memrefs at the same logical offset.
+
+```
+%buf = memref.alloc() {air.symmetric} : memref<1024xf32>
+```
+
+The GPU lowering routes such allocations through `mgpuSymmetricAlloc`
+(`runtime_lib/airgpu/gpu_runtime.cpp`) instead of plain `mgpuMemAlloc`.
+Peer ranks' base pointers are obtained via `mgpuGetHeapBases()`. The NPU
+backend does not interpret this attribute.
+
+---
+
 ## 3. NPU (AIE) Backend Mapping
 
 On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps
@@ -999,7 +1040,13 @@ See [buildingGPU.md](buildingGPU.md) for build instructions and the complete
 | L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
 | L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
 | L3 (space 0) | DDR via NOC | HBM via global memory |
-| `dma_memcpy_nd` | AIE Shim/Tile DMA engines | SCF load/store loops |
-| `channel` (`dma_stream`) | Streaming AXI-S switch | — (not yet mapped to GPU) |
-| Synchronization | AIE locks | `gpu.barrier` |
+| `dma_memcpy_nd` (intra-rank) | AIE Shim/Tile DMA engines | SCF load/store loops |
+| `dma_memcpy_nd` (cross-rank, `src_rank`/`dst_rank`) || Symmetric heap peer addressing (planned) |
+| `channel` (`npu_dma_stream`) | Streaming AXI-S switch | n/a |
+| `channel` (`npu_dma_packet`) | Packet-switched AXI-S overlay | n/a |
+| `channel` (`npu_cascade`) | Core cascade interface | n/a |
+| `channel` (`npu_mmio`) | Host MMIO blockwrite | n/a |
+| `channel` (`gpu_symmetric_heap`) | n/a | Symmetric heap peer addressing (planned) |
+| `air.symmetric` memref alloc | n/a | `mgpuSymmetricAlloc` (planned) |
+| Synchronization | AIE locks | `gpu.barrier` (intra-rank), `mgpuBarrier` (cross-rank) |
 | `!air.token` (dependency) | AIE runtime completion signals | GPU stream/event dependencies |

mlir/include/air/Dialect/AIR/AIR.td

Lines changed: 71 additions & 22 deletions
@@ -477,7 +477,9 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
 Variadic<Index>:$src_sizes,
 Variadic<Index>:$src_strides,
 OptionalAttr<DenseI32ArrayAttr>:$pad_before,
-OptionalAttr<DenseI32ArrayAttr>:$pad_after
+OptionalAttr<DenseI32ArrayAttr>:$pad_after,
+OptionalAttr<I64Attr>:$src_rank,
+OptionalAttr<I64Attr>:$dst_rank
 );
 let results = (outs Optional<air_AsyncToken>:$async_token);
 let assemblyFormat = [{
@@ -487,7 +489,14 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
 `(` type($dst) `,` type($src) `)`
 }];
 let description = [{
-dma operator
+N-dimensional strided bulk copy between two memrefs.
+
+Optional `src_rank` / `dst_rank` integer attributes name a peer rank in the
+enclosing `air.rank` scope. When present, the corresponding memref is
+interpreted as living on rank R's symmetric heap rather than on the local
+process. These attributes are only valid for `air.symmetric`-tagged memref
+allocations and require an enclosing `air.rank`. They are currently only
+supported by the GPU lowering (`air-to-rocdl`).
 }];
 let extraClassDeclaration = [{
 Value getSrcMemref() { return getSrc(); }
@@ -501,7 +510,31 @@ def air_DmaMemcpyNdOp: air_Op<"dma_memcpy_nd",
 bool hasPadding() {
 return getPadBefore().has_value();
 }
+bool hasCrossRank() {
+return getSrcRank().has_value() || getDstRank().has_value();
+}
 }];
+let builders = [
+// Backward-compatible builder: defaults src_rank/dst_rank to absent.
+OpBuilder<(ins "::mlir::TypeRange":$resultTypes,
+"::mlir::ValueRange":$async_dependencies,
+"::mlir::Value":$dst,
+"::mlir::ValueRange":$dst_offsets,
+"::mlir::ValueRange":$dst_sizes,
+"::mlir::ValueRange":$dst_strides,
+"::mlir::Value":$src,
+"::mlir::ValueRange":$src_offsets,
+"::mlir::ValueRange":$src_sizes,
+"::mlir::ValueRange":$src_strides,
+"::mlir::DenseI32ArrayAttr":$pad_before,
+"::mlir::DenseI32ArrayAttr":$pad_after), [{
+build($_builder, $_state, resultTypes, async_dependencies, dst,
+dst_offsets, dst_sizes, dst_strides, src,
+src_offsets, src_sizes, src_strides, pad_before, pad_after,
+/*src_rank=*/IntegerAttr(),
+/*dst_rank=*/IntegerAttr());
+}]>
+];
 let hasCanonicalizer = 1;
 let hasVerifier = 1;
 }
@@ -535,26 +568,30 @@ def air_WaitAllOp: air_Op<"wait_all", [air_AsyncOpInterface]> {
 def air_ChannelOp : air_Op<"channel", [Symbol]>,
 Arguments<(ins SymbolNameAttr:$sym_name,
 DefaultValuedAttr<I64ArrayAttr, "{}">:$size,
-DefaultValuedAttr<StrAttr, "\"dma_stream\"">:$channel_type)> {
+DefaultValuedAttr<StrAttr, "\"npu_dma_stream\"">:$channel_type)> {
 let assemblyFormat = [{
 $sym_name $size attr-dict
 }];
 let summary = "Channel for data movement.";
 let description = [{
 Operation to represent a communication channel as a point-to-point connection between two memrefs.
 The array following the channel name symbol represents the channel's dimensional sizes. Default
-size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled
+size, with empty size array, is 1. The data movement mechanism that the channel uses is controlled
 by the `channel_type` attribute.
 
 ### Channel Types
-The `channel_type` attribute is a string that determines the mechanism used for data movement:
-- **"dma_stream"** (default):
+The `channel_type` attribute is a string that determines the mechanism used for data movement.
+Values are namespaced by backend: NPU (AIE) channels use the `npu_` prefix; GPU channels use
+the `gpu_` prefix.
+
+NPU (AIE) channel types:
+- **"npu_dma_stream"** (default):
 Use DMA engines to send and receive data, with routing performed over a streaming interconnect.
-- **"dma_packet"**:
+- **"npu_dma_packet"**:
 Use DMA engines to send and receive data, with routing performed over a packet-switched network.
-- **"cascade"**:
+- **"npu_cascade"**:
 Use processor cores to send and receive data via cascade connections between adjacent tiles.
-- **"mmio"**:
+- **"npu_mmio"**:
 Use host-side MMIO writes (e.g. `aiex.npu.blockwrite`) issued from the runtime
 sequence to deliver a constant payload directly into a tile-local L1 buffer.
 No DMA channel, no shim allocation, no flow is reserved.
@@ -565,32 +602,44 @@ def air_ChannelOp : air_Op<"channel", [Symbol]>,
 `memref.get_global`. The consumer-side `get` lowers to a no-op
 because the L1 buffer is already populated when the core begins executing.
 
+GPU channel types:
+- **"gpu_symmetric_heap"**:
+Cross-GPU messaging through the symmetric heap runtime
+(`runtime_lib/airgpu/symmetric_heap.{h,cpp}`). The channel must be enclosed
+by an `air.rank` op; the put/get sites use rank indices to address peer
+heaps. Lowered by `air-to-rocdl` to thread-cooperative loops over peer-mapped
+VMem buffers, with synchronization via in-heap notify flags or `mgpuBarrier`.
+
 ### Broadcasting
-If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute
+If a channel broadcasts to multiple destinations, the optional `broadcast_shape` attribute
 annotates the output sizes after broadcasting. Broadcasting follows NumPy's broadcasting rules.
 
 Example:
 
 ```mlir
-// An array of 4 x 4 streaming DMA channels
-air.channel @channel_0 [4, 4] {channel_type = "dma_stream"}
+// An array of 4 x 4 streaming DMA channels (NPU)
+air.channel @channel_0 [4, 4] {channel_type = "npu_dma_stream"}
 
-// A streaming DMA channel broadcasting to 4 destinations
-air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "dma_stream"}
+// A streaming DMA channel broadcasting to 4 destinations (NPU)
+air.channel @channel_1 [1, 1] {broadcast_shape = [1, 4], channel_type = "npu_dma_stream"}
 
-// An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations.
+// An array of 1 x 4 streaming DMA channels broadcasting to 4 x 4 destinations (NPU).
 // Broadcasting follows NumPy's rules.
-air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "dma_stream"}
+air.channel @channel_2 [1, 4] {broadcast_shape = [4, 4], channel_type = "npu_dma_stream"}
 
-// A packet-switched DMA channel
-air.channel @channel_3 [] {channel_type = "dma_packet"}
+// A packet-switched DMA channel (NPU)
+air.channel @channel_3 [] {channel_type = "npu_dma_packet"}
 
-// A cascade channel using core-to-core cascade connections
-air.channel @channel_4 [] {channel_type = "cascade"}
+// A cascade channel using core-to-core cascade connections (NPU)
+air.channel @channel_4 [] {channel_type = "npu_cascade"}
 
 // An MMIO channel: the put writes a constant from host into L1 of each
-// get's destination tile via runtime-sequence blockwrites
-air.channel @channel_5 [] {channel_type = "mmio"}
+// get's destination tile via runtime-sequence blockwrites (NPU)
+air.channel @channel_5 [] {channel_type = "npu_mmio"}
+
+// A cross-GPU channel through the symmetric heap (GPU). Must appear inside
+// an air.rank scope; the indices on put/get encode the peer rank.
+air.channel @channel_6 [] {channel_type = "gpu_symmetric_heap"}
 ```
 }];
 let extraClassDeclaration = [{
