[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass by erwei-xilinx · Pull Request #1579 · Xilinx/mlir-air

erwei-xilinx · 2026-05-03T18:46:33Z

Summary

Phase 4 of the multi-GPU stack: a new air-symmetric-alloc-to-mgpu conversion pass that uplevels symmetric-heap allocations from the runtime-ABI level (mgpuSymmetricAlloc + a hand-built memref descriptor) to the standard memref dialect.

What it does

Replaces the runtime-ABI dance:

%ptr   = call @mgpuSymmetricAlloc(size, stream)
%bytes = wrap_bytes(%ptr, size)            // hand-built memref<?xi8>
%m     = memref.view %bytes[0][]           // retype to memref<T>
...
call @mgpuSymmetricFree(%ptr, stream)

with the MLIR-native form:

%m = memref.alloc() {air.symmetric} : memref<T>
...
memref.dealloc %m

The pass lowers each memref.alloc {air.symmetric} to a runtime call plus an LLVM memref descriptor build + unrealized_conversion_cast back to the original memref type (so downstream uses keep working through convert-to-llvm). Each memref.dealloc whose operand traces back through the cast becomes mgpuSymmetricFree. No-op when no air.symmetric allocations are present.

Why this matters

Every test in main today carries a 15-line wrap_bytes helper that builds a memref descriptor by hand from a runtime !llvm.ptr — leaking the LLVM-struct ABI into the test IR. Phase 4 replaces this with the standard memref.alloc op + an attribute, putting the symmetric allocation at the same dialect level as the rest of the IR.

This is the foundation for phases 5/6 (DMA + channel ops): they can operate on clean memref<T> values from the user's perspective, without each test having to reconstruct memrefs from raw pointers.

What's new

mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h / .cpp — pass implementation (~200 LOC)
mlir/include/air/Conversion/GPUPasses.td — air-symmetric-alloc-to-mgpu def
mlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir — FileCheck unit tests with // REQUIRES: gpu (the pass isn't registered in non-GPU builds):
- 1D alloc + dealloc shape (size, descriptor, cast, free)
- 2D alloc with row-major strides in descriptor
- Element type byte-size: f32 (4B), f64 (8B), i32 (4B)
- Multiple symmetric allocs share one decl pair
- Pass is a no-op for non-symmetric allocs
- Pass is a no-op when there are zero symmetric allocs
test/gpu/multi_gpu/air_alloc/cacheline.mlir — e2e test that wraps the producer/consumer cacheline reference in air.rank AND uses memref.alloc {air.symmetric} for the symmetric data buffer. 1:1 wrap of the air_rank/cacheline reference, with only the symmetric alloc/free swapped for the new abstraction. After lowering through -air-rank-to-mgpu -air-symmetric-alloc-to-mgpu -air-translate-to-llvm, functionally equivalent to handwritten/cacheline.mlir.
test/gpu/multi_gpu/air_alloc/Makefile — self-contained, follows the same pattern as air_rank/Makefile.

Test plan

FileCheck unit tests pass (6 cases above)
E2E on real 2x MI325X (rad-mi325x-1, NUM_RANKS=2): cache-line message PASS (data[0]=100, flag=1) — output structurally identical to INPUT=cacheline in handwritten/ and air_rank/, only distinguished by the [mlir/alloc] log tag. 3/3 stability runs.
Equivalence demonstrated: the same payload values (lane+100 for lanes 0..30, flag=1 at lane 31) and the same cache-line atomicity contract reach the consumer in all three variants (handwritten/cacheline, air_rank/cacheline, air_alloc/cacheline).
git clang-format origin/main applied; "Python and C/C++ Check Format" check should pass

Followup

The non-symmetric staging allocations (verify_buf, heap_bases device copy) still go through mgpuMemAlloc + wrap_bytes. A future pass for non-symmetric device allocs would remove that helper too.

🤖 Generated with Claude Code

New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a `src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side `mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`. The peer pointer is computed at runtime as: peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank]) where `local_ptr` is extracted from the local-side memref via `memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]` gives this rank's symmetric heap base. ## Restrictions (this initial version) - Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global) - The op must be at host scope (not inside a `gpu.launch` or `gpu.func`) - "Entire memref" form only — no explicit `[offsets][sizes][strides]` - Only one of `src_rank` / `dst_rank` may be set per op These restrictions match the hand-written reference's Phase 2 pattern. They can be relaxed in follow-up work. ## Files - `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header - `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def - `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU` - `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation - `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration - `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck - `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering - `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector ## Test plan FileCheck unit tests cover: - src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi, byte-stride GEP, mgpuMemcpy) - dst_rank lowering (peer pointer becomes dst arg) - 2D memref byte size - f64 element type byte size - Multiple cross-rank DMAs share extern decls - Pass is a no-op for non-cross-rank DMAs End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks): - INPUT=handwritten — PASS (Phase 2 baseline) - INPUT=rank — PASS (Phase 3) - INPUT=alloc — PASS (Phase 4) - INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3) Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their own dst_buf; verification reads back 1.0. Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 — true multi-GPU re-validation is needed before declaring multi-GPU production-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a `src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side `mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`. The peer pointer is computed at runtime as: peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank]) where `local_ptr` is extracted from the local-side memref via `memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]` gives this rank's symmetric heap base. - Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global) - The op must be at host scope (not inside a `gpu.launch` or `gpu.func`) - "Entire memref" form only — no explicit `[offsets][sizes][strides]` - Only one of `src_rank` / `dst_rank` may be set per op These restrictions match the hand-written reference's Phase 2 pattern. They can be relaxed in follow-up work. - `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header - `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def - `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU` - `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation - `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration - `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck - `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering - `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector FileCheck unit tests cover: - src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi, byte-stride GEP, mgpuMemcpy) - dst_rank lowering (peer pointer becomes dst arg) - 2D memref byte size - f64 element type byte size - Multiple cross-rank DMAs share extern decls - Pass is a no-op for non-cross-rank DMAs End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks): - INPUT=handwritten — PASS (Phase 2 baseline) - INPUT=rank — PASS (Phase 3) - INPUT=alloc — PASS (Phase 4) - INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3) Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their own dst_buf; verification reads back 1.0. Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 — true multi-GPU re-validation is needed before declaring multi-GPU production-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New conversion pass that uplevels symmetric-heap allocations from the runtime-ABI level to the memref dialect level. Replaces: %ptr = call @mgpuSymmetricAlloc(size, stream) %bytes = wrap_bytes(%ptr, size) // hand-built memref<?xi8> %m = memref.view %bytes[0][] // retype to memref<T> ... call @mgpuSymmetricFree(%ptr, stream) with the MLIR-native form: %m = memref.alloc() {air.symmetric} : memref<T> ... memref.dealloc %m The pass lowers each `memref.alloc {air.symmetric}` to the runtime call plus a descriptor build and an `unrealized_conversion_cast` back to the original memref type so downstream uses keep working through `convert-to-llvm`. Each `memref.dealloc` whose operand traces back through the cast to a symmetric alloc becomes `mgpuSymmetricFree`. The pass is a no-op when no `air.symmetric` allocations are present. - `mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h` — header - `mlir/include/air/Conversion/GPUPasses.td` — `air-symmetric-alloc-to-mgpu` def - `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRSYMMETRICALLOCTOMGPU` - `mlir/lib/Conversion/AIRSymmetricAllocToMgpuPass.cpp` — implementation - `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration - `mlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir` — FileCheck unit test (REQUIRES:gpu, like other GPU-only passes). - `test/gpu/multi_gpu/air_alloc/cacheline.mlir` — e2e test that wraps the producer/consumer cacheline reference in `air.rank` AND uses `memref.alloc {air.symmetric}` for the symmetric data buffer. Functionally equivalent to handwritten/cacheline.mlir after lowering through `-air-rank-to-mgpu -air-symmetric-alloc-to-mgpu -air-translate-to-llvm`. - `test/gpu/multi_gpu/air_alloc/Makefile` — self-contained, same preconditions + driver pattern as other multi_gpu/<level>/Makefile. FileCheck unit tests cover: - 1D alloc + dealloc shape (size, descriptor, cast, free) - 2D alloc with row-major strides in descriptor - Element type byte-size: f32 (4B), f64 (8B), i32 (4B) - Multiple symmetric allocs share one decl pair - Pass is a no-op for non-symmetric allocs - Pass is a no-op when there are zero symmetric allocs End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2): cache-line message PASS (data[0]=100, flag=1) — output structurally identical to INPUT=cacheline in handwritten/ and air_rank/, only distinguished by the `[mlir/alloc]` log tag. 3/3 stability runs. Why this matters: every test today has a 15-line `wrap_bytes` helper that builds a memref descriptor by hand from a runtime `!llvm.ptr`, leaking the descriptor ABI into the test. Phase 4 replaces this with the standard `memref.alloc` op + an attribute, putting the symmetric allocation at the same dialect level as the rest of the IR. This is the foundation for phases 5/6 (DMA + channel ops) to operate on clean memrefs rather than reconstructing them from raw pointers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

erwei-xilinx · 2026-05-19T00:24:43Z

Superseded by erwei-xilinx:multigpu-air-hierarchy, which carries a refactored phase-4 pass dispatching on #air.symmetric_heap memref memory_space instead of the {air.symmetric} op attribute (commit 4a6cd86).

Rationale for the refactor: the {air.symmetric} op-attribute approach required walking the defining-op chain from each memref.load/store site to know whether the memref was symmetric-heap-backed. Encoding the property as a memory_space attribute carries it on the memref type — verifiers, helper passes, and air.translate can all answer "is this on the symmetric heap?" by inspecting the type, without a chain walk. It also lets memrefs constructed via memref.view / wrap_bytes (not just memref.alloc) carry the tag, which is needed for the cacheline tests where buffers are wrapped from raw runtime pointers.

Part of a larger redesign documented in docs/MultiGPUPhase56Redesign.md that also reshapes phases 5 and 6 (see #1580, #1581). Consolidated PR forthcoming.

…lti-launch fixes Replaces the in-flight phase 4 PR (Xilinx#1579), reframing symmetric-heap allocation as a memref memory_space attribute rather than an op-attribute on memref.alloc. Foundation for the future channel-to-cacheline lowering (see docs/MultiGPUPhase56Redesign.md). Why memory_space (not op-attribute): - Travels with SSA values automatically; no need to trace defining-op chain when checking from a memref.load/store site. - The AIR herd verifier can reject/accept based purely on the memref type, not on op metadata of a possibly-distant alloc. - Lets memrefs constructed via memref.view / wrap_bytes / etc. (not just memref.alloc) carry the symmetric-heap tag — required for the in-flight cacheline tests where buffers are wrapped from runtime pointers. Changes: * `#air.symmetric_heap` custom memref memory_space attribute, defined via TableGen AttrDef in mlir/include/air/Dialect/AIR/AIROpBase.td. Dialect plumbing in AIROpBase.td (let useDefaultAttributePrinterParser = 1), CMakeLists.txt (mlir_tablegen for AIRAttrs.{h,cpp}.inc), AIRDialect.h (#include the generated header), AIRDialect.cpp (addAttributes<> in initialize()). * AIR herd verifier (verifyComputeMemoryAccess in AIRDialect.cpp) skips the L1-or-better-only check for memrefs whose memory_space is #air.symmetric_heap. Same for verifyAllocMemorySpace (segment-level alloc check). Direct memref.load/store on symmetric-heap memrefs inside air.herd bodies is now legal — required for the kernel-driven cross-rank cacheline pattern on GPU. * AIRSymmetricAllocToMgpu pass dispatches on the memref result type's memory_space, replacing the op-attribute check `op->hasAttr( "air.symmetric")`. FileCheck unit tests rewritten to use `memref.alloc() : memref<..., #air.symmetric_heap>` instead of `memref.alloc() {air.symmetric}`. All 6 cases pass. * AIR-to-ROCDL pass: two fixes uncovered by the air_hierarchy/ cacheline e2e baseline (kept in-tree as the target shape but currently blocked on a separate multi-launch use-after-free in this same pass — see docs/MultiGPUPhase56Redesign.md). 1. 1D / N-D launch + herd shapes are now handled (previously assumed 2D, would crash on getSizeOperands()[1] OOB for 1D). The new `sizeOrOne` helper materializes a `1` constant for missing dims. 2. Pattern set is now frozen once and reused across multiple launches (previously moved into applyPatternsGreedily on first iteration, FrozenRewritePatternSet ctor crashed on the second). * Plan doc (docs/MultiGPUPhase56Redesign.md) updated with the infrastructure landed today, what's still blocked (air-to-rocdl multi-launch use-after-free), and recommended next steps. * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir + Makefile: the target IR shape that phase 6's redesigned lowering should emit. Kept as a draft baseline; will run e2e once the air-to-rocdl multi-launch issue is resolved separately. Tested: - lit unit test for AIRSymmetricAllocToMgpu: 6/6 pass - air_hierarchy e2e: blocked at air-to-rocdl multi-launch crash; the 1D and pattern-reuse fixes here unblock the first two errors but the third (use-after-free during block deletion) needs separate work Supersedes: - PR Xilinx#1579 (phase 4 op-attribute approach) — close as superseded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Xilinx#1618) * [multi-gpu] Add #air.symmetric_heap memory_space + air-to-rocdl 1D/multi-launch fixes Replaces the in-flight phase 4 PR (Xilinx#1579), reframing symmetric-heap allocation as a memref memory_space attribute rather than an op-attribute on memref.alloc. Foundation for the future channel-to-cacheline lowering (see docs/MultiGPUPhase56Redesign.md). Why memory_space (not op-attribute): - Travels with SSA values automatically; no need to trace defining-op chain when checking from a memref.load/store site. - The AIR herd verifier can reject/accept based purely on the memref type, not on op metadata of a possibly-distant alloc. - Lets memrefs constructed via memref.view / wrap_bytes / etc. (not just memref.alloc) carry the symmetric-heap tag — required for the in-flight cacheline tests where buffers are wrapped from runtime pointers. Changes: * `#air.symmetric_heap` custom memref memory_space attribute, defined via TableGen AttrDef in mlir/include/air/Dialect/AIR/AIROpBase.td. Dialect plumbing in AIROpBase.td (let useDefaultAttributePrinterParser = 1), CMakeLists.txt (mlir_tablegen for AIRAttrs.{h,cpp}.inc), AIRDialect.h (#include the generated header), AIRDialect.cpp (addAttributes<> in initialize()). * AIR herd verifier (verifyComputeMemoryAccess in AIRDialect.cpp) skips the L1-or-better-only check for memrefs whose memory_space is #air.symmetric_heap. Same for verifyAllocMemorySpace (segment-level alloc check). Direct memref.load/store on symmetric-heap memrefs inside air.herd bodies is now legal — required for the kernel-driven cross-rank cacheline pattern on GPU. * AIRSymmetricAllocToMgpu pass dispatches on the memref result type's memory_space, replacing the op-attribute check `op->hasAttr( "air.symmetric")`. FileCheck unit tests rewritten to use `memref.alloc() : memref<..., #air.symmetric_heap>` instead of `memref.alloc() {air.symmetric}`. All 6 cases pass. * AIR-to-ROCDL pass: two fixes uncovered by the air_hierarchy/ cacheline e2e baseline (kept in-tree as the target shape but currently blocked on a separate multi-launch use-after-free in this same pass — see docs/MultiGPUPhase56Redesign.md). 1. 1D / N-D launch + herd shapes are now handled (previously assumed 2D, would crash on getSizeOperands()[1] OOB for 1D). The new `sizeOrOne` helper materializes a `1` constant for missing dims. 2. Pattern set is now frozen once and reused across multiple launches (previously moved into applyPatternsGreedily on first iteration, FrozenRewritePatternSet ctor crashed on the second). * Plan doc (docs/MultiGPUPhase56Redesign.md) updated with the infrastructure landed today, what's still blocked (air-to-rocdl multi-launch use-after-free), and recommended next steps. * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir + Makefile: the target IR shape that phase 6's redesigned lowering should emit. Kept as a draft baseline; will run e2e once the air-to-rocdl multi-launch issue is resolved separately. Tested: - lit unit test for AIRSymmetricAllocToMgpu: 6/6 pass - air_hierarchy e2e: blocked at air-to-rocdl multi-launch crash; the 1D and pattern-reuse fixes here unblock the first two errors but the third (use-after-free during block deletion) needs separate work Supersedes: - PR Xilinx#1579 (phase 4 op-attribute approach) — close as superseded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] AIRToROCDL: fix three multi-launch bugs (use-after-free, wrong pairing, missing arg replace) Multi-launch programs (e.g., scf.if %is_producer { air.launch P } else { air.launch C }) crashed AIRToROCDLPass with "Cannot destroy a value that still has uses!" during block destruction. Three compounding bugs identified via a minimal reproducer (two empty air.launch ops with air.segment + air.herd): 1. blkIdx / gridIdx are class-level vectors that accumulated across iterations of the launch walk. The second launch ended up using the first launch's herd-size operands, dangling values into the new gpu.launch op. Fixed by clearing them at the top of each (launch, segment) pair (along with gridXVal / gridYVal). 2. The post-conversion body-move loop used a *nested* walk: module.walk(gpuLaunchOp) { module.walk(airLaunchOp) { move } } which pairwise-matched every gpu.launch with every air.launch in the module, folding multi-launch programs into the first gpu.launch and leaving the others empty. Fixed by performing the body move *inside* the first walk where the 1:1 pairing is established. 3. The body-move did not replace the air.launch's own block args (kernel operands) with their outer values before moving the body. After air.launch was later erased, the moved ops dangled into the destroyed block args (the actual use-after-free). Fixed with the same replaceAllUsesWith pattern that deleteAirHerd / deleteAirSegment use. The 1D-launch and pattern-reuse fixes from the prior commit on this branch are preserved. Tested: - Minimal reproducer (/tmp/multi_launch_repro.mlir): two air.launches with memref.store inside each herd. Now lowers cleanly to two distinct gpu.launch ops with correct operand routing. - air_hierarchy/cacheline.mlir: now progresses past air-to-rocdl all the way through to a GPU binary (after sed-stripping #air.symmetric_heap from the pre-mlir-opt IR; see Makefile + plan doc). Runtime hang remains; likely the herd's tile-id → block-dim mapping doesn't preserve the cooperative cache-line semantics the handwritten kernel relies on. Documented as next step in docs/MultiGPUPhase56Redesign.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] air_hierarchy/cacheline: use herd 64×1 (PE→thread), fix §2.3 doc The air_hierarchy/cacheline.mlir test was hanging at runtime because the herd was declared (1,1) PE but the body used gpu.thread_id x to address 32 lanes — incompatible with the standard AIR→GPU mapping where the herd iteration space becomes the GPU blockDim (PE → thread). Per AIRComputeModel.md §4.1/§4.5/§5 — and the convention used by test/gpu/4k_4k_mul/air_sync.mlir — a 1-PE herd lowers to 1 GPU thread. Rewrite producer + consumer herds as (64, 1) so blockDim = (64, 1, 1) (one full MI3xx wavefront), and use the herd tile id %tx in place of gpu.thread_id x. Lanes 0..31 do real work, lanes 32..63 stay idle but keep the wavefront full so the consumer's gpu.shuffle width=64 can see the producer's flag lane. Also fix the §2.3 statement that claimed "PE instances mapped to individual warps" on GPU — that contradicted the rest of the doc and the actual lowering in air-to-rocdl, which is PE→thread. Replace with a pointer to §4.1 for the precise mapping, plus guidance that any herd needing wave-cooperative ops (e.g. gpu.shuffle) should pick a herd size that's a multiple of the target wavefront width. Verified on MI350X (gfx950, 8 GPUs): producer + consumer both report PASS, with data[0]=100 and flag=1 transmitted across the symmetric heap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Phase 6: air-gpu-channel-to-cacheline pass + air_channel e2e Adds the redesigned phase-6 pass per docs/MultiGPUPhase56Redesign.md. Lowers air.channel.put/get ops on channels of type "gpu_symmetric_heap" into the kernel-driven cacheline pattern that test/gpu/multi_gpu/handwritten/cacheline.mlir writes by hand: put -> air.translate %src, %from, %to, %bases + cooperative memref.store (lanes 0..30 publish payload, lane 31 stores sync flag = 1) get -> scf.while spin loop + gpu.shuffle idx broadcast of lane 31's observation until the flag arrives; sink store keeps the spin alive past DCE in subsequent passes The pass: - Pairs puts/gets via existing air::getTheOtherChannelOpThroughSymbol util - Infers producer/consumer ranks from enclosing scf.if (cmpi eq %rid, %k) rank-dispatch context - Finds %bases by type-matching memref<?xindex, #air.symmetric_heap> in the put/get's enclosing herd kernel args (semantically-unique combo for the symmetric-heap base table; clear error on 0 or >1 matches) - Errors loudly if any precondition isn't met (herd scope, rank dispatch, bases arg, memref shape) - Erases the channel symbol after expansion Initial scope: 1-put / 1-get cacheline pattern with memref<32xi32, #air.symmetric_heap> source/destination. Allgather + multi-wire topologies are deferred. Verified end-to-end on MI350X (gfx950, 8 GPUs): produces "=== ALL 2 RANKS PASSED ===" with data[0]=100 and flag=1 published cross-rank, functionally equivalent to the handwritten reference and to air_hierarchy/cacheline.mlir. Files added: - mlir/include/air/Conversion/AIRGpuChannelToCachelinePass.h - mlir/lib/Conversion/AIRGpuChannelToCachelinePass.cpp - mlir/test/Conversion/AIRGpuChannelToCacheline/cacheline.mlir - test/gpu/multi_gpu/air_channel/{cacheline.mlir,Makefile} Wired into existing pass registration (GPUPasses.td, GPUPassDetail.h, CMakeLists.txt, Passes.cpp). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] cacheline spin: use upstream-idiomatic atomic_rmw shape Switch the cacheline spin loop in all three multi_gpu cacheline tests (handwritten/, air_hierarchy/, air_channel/) and in the air-gpu-channel-to-cacheline pass's get-side expansion from %final_v = scf.while (%dummy = %c0) : (i32) -> i32 { %v = scf.if %active -> i32 { memref.load %dst[%tx] : ... } ... scf.condition(%not_ready) %v : i32 } do { ^bb0(%vi : i32): scf.yield %vi : i32 } scf.if %active { memref.store %final_v, %dst[%tx] : ... } // sink to the upstream-idiomatic shape from mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir: scf.while : () -> () { %v = scf.if %active -> i32 { %loaded = memref.atomic_rmw addi %c0_i32, %dst[%tx] : (i32, memref<32xi32, ...>) -> i32 scf.yield %loaded : i32 } else { scf.yield %c0_i32 : i32 } %flag, _ = gpu.shuffle idx %v, %c31, %c64 : i32 %not_ready = arith.cmpi ne, %flag, %c1 : i32 scf.condition(%not_ready) } do { scf.yield } `memref.atomic_rmw addi %c0` is functionally a load (adds 0, returns the prior value) but carries both Read and Write effects in its MemoryEffectOpInterface. This: 1. Survives the DCE inside air-to-rocdl's `applyPatternsGreedily`. With the previous plain memref.load, MLIR's `wouldOpBeTriviallyDead` considered the spin's body to have only Read effects and the greedy driver killed the entire scf.while (verified in GreedyPatternRewriteDriver.cpp:483-490 + SideEffectInterfaces.cpp:84-95). The previous workaround was a sink store after the loop to make %final_v used; this commit removes the workaround. 2. Encodes "this read must be observable across producers" as an IR-level fact instead of relying on a plain memref.load happening to stay observable through the lowering chain. 3. Lets us drop the iter-arg/result plumbing (%dummy, %final_v) and the trailing scf.if/store sink — the spin loop becomes a self-contained zero-result scf.while. Verified on MI350X (gfx950, 8 GPUs): all three tiers (handwritten/cacheline, air_hierarchy/cacheline, air_channel/cacheline) still produce "=== ALL 2 RANKS PASSED ===" with data[0]=100 and flag=1 cross-rank. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Makefiles: auto-detect GFX_TARGET, drop hardcoded gfx942/gfx950 All four multi_gpu test Makefiles (handwritten, air_rank, air_hierarchy, air_channel) previously hardcoded the AMDGPU chip target in their mlir-opt pass pipeline (some at gfx942, the newer ones at gfx950), which meant running them on a node with a different MI3xx variant silently produced an unloadable binary (hipErrorNoBinaryForGpu) at runtime — and the tests reported MISMATCH instead of a clean precondition failure. Replace the hardcoded chip with a GFX_TARGET make variable, auto- detected in two layers: 1. amdgpu-arch (ROCm's clang tool; ships at /opt/rocm/llvm/bin/). Most direct — outputs the gfx name for the visible GPU(s) verbatim. 2. Fallback: parse /sys/class/kfd/kfd/topology/nodes/*/properties for the first GPU node's `gfx_target_version`, which encodes the chip as major*10000 + minor*100 + step. Decode to `gfx<major><minor><step>` (e.g., 90500 -> gfx950, 90402 -> gfx942). Holds for the MI3xx family these tests target. Doesn't depend on ROCm tooling, only the kernel module exposing the topology files. Users can override via `make GFX_TARGET=gfx<NNN>` for cross-compilation or on heterogeneous systems. check-preconditions now fails loudly with a clear message if neither detection layer produces a value. Verified on MI350X (gfx950, 8 GPUs): all four e2e tests detect gfx950 correctly and produce "=== ALL 2 RANKS PASSED ===". Manual override test (`make GFX_TARGET=gfx906 ...`) confirmed the variable propagates through the pipeline pass arguments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Drop docs/MultiGPUPhase56Redesign.md — was a local planning scratchpad This file was an internal planning document — in-flight PR statuses, "drop phase 5", "what is currently wrong", a TL;DR aimed at the author rather than the project. It got included in commit 4a6cd86 by mistake and should not have landed. The actual normative documentation lives in docs/AIRComputeModel.md (GPU §4 explains the AIR-on-GPU mapping that this PR's phase 6 pass lowers into). PR descriptions cover the rationale for the redesign. Removal only, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Align AIR-on-GPU to original §2.3 compute model (PE = wavefront) The AIR-on-GPU lowering had drifted to PE → thread over time, contradicting docs/AIRComputeModel.md §2.3 which states the herd PE maps to a wavefront (not a thread). The implementation drift, plus an earlier silent "fix" of §2.3 to match the drift, made the model inconsistent across §2.3 vs §4 vs the air-to-rocdl lowering vs the GPU tests. This commit restores PE → wavefront as the canonical model and aligns every GPU-side artifact to it: Doc (docs/AIRComputeModel.md): * §2.3: restored original PE → wavefront wording; added pointer to gpu.lane_id for lane access inside the herd body. * §4.1: rewrote the mapping table — blockDim = (Nx * wave_size, Ny, 1), tile_x → thread_id_x / wave_size, lane within PE → gpu.lane_id. Clarified that the GPU kernel boundary is at the (innermost) segment, not the herd; herd is an inline parallel block inside the kernel. * §4.3: clarified L1 (private) semantics under PE = wavefront. * §4.5: rewrote the matmul example to herd (4,1) — 4 PEs × 64 lanes = 256 effective work items — fitting the §2.3 wavefront-slot budget. * §5: summary table: "air.herd tile" → "Single GPU wavefront", "L1" → "Per-PE (per-warp) VGPRs / private scratch". Pass (mlir/lib/Conversion/AIRToROCDLPass.cpp, mlir/include/air/Conversion/GPUPasses.td): * Added `wave-size` option (default 64; configurable for other arches). * convertLaunchToGPULaunch: blockDim.x = herd.Nx * wave_size. * deleteAirHerd: herd block args remapped to warp-id within block: tile_x → thread_id_x / wave_size tile_y → thread_id_y size_x → block_dim_x / wave_size size_y → block_dim_y Pass (mlir/lib/Conversion/AIRGpuChannelToCachelinePass.cpp): * put / get expansions emit `gpu.lane_id` for the per-lane index instead of using the herd's tile_x (which is now a warp id, not a thread id). * Updated FileCheck unit test to expect gpu.lane_id. Tests: * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir: herd (64, 1) → herd (1, 1) — one PE = one wave; lane index from gpu.lane_id (lanes 0..30 publish payload, lane 31 publishes flag). * test/gpu/multi_gpu/air_channel/cacheline.mlir: same. * test/gpu/4k_4k_mul/air_sync.mlir matmul: herd (256, 1) → herd (4, 1) with %gtid = %tx * 64 + gpu.lane_id; same 256-thread workgroup, expressed as 4 wavefronts × 64 lanes. Honors §2.3's ≤32 PE per herd budget. * test/gpu/4k_4k_mul/run.sh: GFX_TARGET auto-detected (amdgpu-arch primary, KFD-topology fallback). Fixed a pre-existing bug where LLVM_LIB_DIR was set to LLVM_INSTALL_DIR (missing /lib) when LLVM_INSTALL_DIR was set in the environment. Verified on MI350X (gfx950, 8 GPUs): * FileCheck mlir/test/Conversion/AIRGpuChannelToCacheline/cacheline.mlir * multi_gpu/handwritten/cacheline → === ALL 2 RANKS PASSED === * multi_gpu/air_hierarchy/cacheline → === ALL 2 RANKS PASSED === * multi_gpu/air_channel/cacheline → === ALL 2 RANKS PASSED === * 4k_4k_mul matmul → "Output Matched!" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] docs §2.3: drop GPU dialect leak from abstract model section §2.3 ("Platform-specific iteration space semantics") is about herd shape and placement budgets — not how to access state inside a PE. The gpu.lane_id / gpu.shuffle sentence I added in the previous commit (about PE = wavefront) belonged in §4, not §2.3; §4.1 already carries that guidance in the mapping table. Removing the bolted-on sentence restores §2.3 to its original wording (three claims: herd ⊆ CU, PE → warp, ≤32 PE budget) with no dialect-op leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] docs §4: minimal diff — only the cell changes forced by §2.3 Previous commits expanded §4 with new intro paragraphs, an extra mapping-table row, an L1 description rewrite, and matmul-example elaboration — all of which were elaboration beyond what reverting §2.3 strictly required. Restore everything to the original wording except the cells that are *literally wrong* under PE = wavefront: §4.1 mapping table: - "blockDim = (bx, by, 1)" → "blockDim = (bx * wave_size, by, 1), with each PE materialised as one wavefront" - "Herd tile index → (threadIdx.x, threadIdx.y)" → "warp-id within block: (threadIdx.x / wave_size, threadIdx.y)" §4.1 prose: "number of threads per block" → "number of wavefronts per block"; "per-thread air.herd body" → "per-PE air.herd body". Two-word tweaks; no new paragraphs. §4.5 matmul example: - herd (256, 1) → herd (4, 1) (256 PEs violated §2.3's ≤32 budget; 4 PEs × 64 lanes = same 256 work items) - mapping summary updated to match §5 summary table: - "air.herd tile → Single GPU thread" → "Single GPU wavefront" §4.3 (memory space mapping) reverted entirely — original "Per-thread private (VGPRs/scratch)" is technically still accurate under PE = warp (MLIR's "private" space is per-thread regardless of how a PE is defined). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): drop unused --check-prefix=ANY RUN line The second RUN line in AIRGpuChannelToCacheline/cacheline.mlir invoked FileCheck with --check-prefix=ANY but the file has no ANY: patterns, so FileCheck errors with "no check strings found with prefix 'ANY'". Drop the RUN line; the remaining (default-prefix) RUN line covers the test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): guard blkIdx.getDefiningOp() null in convertLaunchToGPULaunch `blkIdx[i]` is allowed to be a BlockArgument — happens when an air.herd/air.launch size operand is passed in as an SSA value rather than declared as a constant in scope. In that case getDefiningOp() returns null and the unconditional `blockXValOp->moveBefore(launchOp)` would crash with a null deref. Add the obvious null guard: only move when there's a defining op (a BlockArgument already dominates launchOp from the enclosing scope, so no move is needed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): require identity layout in air-symmetric-alloc-to-mgpu buildMemrefDescriptor hard-codes row-major strides from the shape and offset=0; it can't faithfully represent strided / affine layouts. Without an explicit check, a non-identity layout (`memref<NxT, strided<...>, #air.symmetric_heap>`) would be silently miscompiled — the runtime allocation would be sized correctly but the descriptor strides would not match the user-declared layout. Add an `isIdentity()` precondition check in the AllocOp loop, before calling buildMemrefDescriptor. Emit a clear pass-failure diagnostic naming the requirement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): unify mlir:: prefix style in GPUPasses.td dependentDialects ConvertAIRToROCDL's dependentDialects mixed `gpu::GPUDialect` / `LLVM::LLVMDialect` (no `mlir::` prefix) with `mlir::arith::ArithDialect` (with prefix). The rest of this file (and the repo's convention) uses the bare style. Drop the `mlir::` prefix from the arith entry for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): tighten air-gpu-channel-to-cacheline (rank inference, bases shape, wave-size) Three related fixes to the channel-to-cacheline pass surfaced by Copilot review on Xilinx#1618. 1. inferRankFromEnclosingIf (Copilot #2): the pass walked up scf.if chains and accepted the first `cmpi eq %v, %const` as a rank dispatch. It never checked %v derives from %rid of the enclosing air.rank, so any other `eq %v, %const` (e.g. a lane predicate `cmpi eq %lane, %c0` inside the herd body) would mis-infer "rank 0". Add a helper `isRankIdOfEnclosingAirRank(v)` and require %v be a rank-id block arg of an enclosing air.rank before extracting the constant. Also handle the (rid, const) and (const, rid) orderings symmetrically via a small lambda instead of two near-identical branches. 2. findUniqueBasesArg (Copilot #3): the heap_bases arg search filtered only on element type (index) + memory_space (#air.symmetric_heap) but missed `memrefTy.getRank() == 1`. A 2-D index memref with the symmetric_heap memory_space would silently match, and downstream air.translate creation would fail with a poor message. Add the rank check. 3. Wave-size option (Copilot Xilinx#4): the spin loop's gpu.shuffle hard- coded `width = 64`, which is wrong on wave32 targets (NVIDIA). The sister pass air-to-rocdl already has a `wave-size` option; mirror it here. Default 64 (matching air-to-rocdl); user must set both options consistently. Updated td description spells out the must- match-air-to-rocdl + must-be->=32 (cacheline memref shape) constraints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] CI fix: clang-format-17 line-break adjustments in AIRGpuChannelToCachelinePass.cpp CI's clang-format-17 wants slightly different line breaks than what my local clang-format produced: - Break after `(Value rid,` in the lambda signature in inferRankFromEnclosingIf (line was 1 char too long without wrap). - Break after `gpu::ShuffleOp::create(` instead of after the second arg in expandGetToCachelineSpin (after the waveWidthI32 rename pushed total width over the limit). Pure formatting; no semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This was referenced May 3, 2026

[multi-gpu] Phase 5: air-cross-rank-dma-to-mgpu lowering pass #1580

Closed

[multi-gpu] Phase 6: air-gpu-channel-to-mgpu lowering pass #1581

Closed

[multi-gpu] Phase 7: aircc integration (--multi-gpu flag) #1582

Draft

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 618aa19 to d8bd421 Compare May 3, 2026 20:22

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from d8bd421 to f493e1a Compare May 3, 2026 20:27

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from f493e1a to cbe0a61 Compare May 5, 2026 18:36

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from cbe0a61 to 2a8cd18 Compare May 6, 2026 00:29

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 2a8cd18 to 5e3841d Compare May 6, 2026 01:01

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 5e3841d to 2a50ab5 Compare May 6, 2026 04:24

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 2a50ab5 to 087fcb7 Compare May 6, 2026 04:39

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 087fcb7 to a89b806 Compare May 6, 2026 04:49

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from a89b806 to 4a8d33a Compare May 6, 2026 05:17

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 99da48f to 996c830 Compare May 6, 2026 18:22

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 996c830 to bdefa38 Compare May 6, 2026 18:53

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from bdefa38 to 74acf56 Compare May 6, 2026 19:02

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 74acf56 to 7947aa7 Compare May 6, 2026 20:15

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch 2 times, most recently from 953eb43 to d1d9c39 Compare May 12, 2026 15:38

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from d1d9c39 to a41fac7 Compare May 12, 2026 16:19

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from a41fac7 to 689ca78 Compare May 12, 2026 17:20

erwei-xilinx mentioned this pull request May 12, 2026

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level #1613

Merged

5 tasks

erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 689ca78 to bc2b291 Compare May 12, 2026 23:52

erwei-xilinx mentioned this pull request May 19, 2026

[multi-gpu] Phases 4 + 6 + AIR-on-GPU realigned to PE=wavefront model #1618

Merged

5 tasks

erwei-xilinx closed this May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass#1579

[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass#1579
erwei-xilinx wants to merge 1 commit into
Xilinx:mainfrom
erwei-xilinx:multigpu-phase4-symmetric-alloc-pass

erwei-xilinx commented May 3, 2026 •

edited

Loading

Uh oh!

erwei-xilinx commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

erwei-xilinx commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What it does

Why this matters

What's new

Test plan

Followup

Uh oh!

erwei-xilinx commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

erwei-xilinx commented May 3, 2026 •

edited

Loading