Skip to content

[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass#1579

Closed
erwei-xilinx wants to merge 1 commit into
Xilinx:mainfrom
erwei-xilinx:multigpu-phase4-symmetric-alloc-pass
Closed

[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass#1579
erwei-xilinx wants to merge 1 commit into
Xilinx:mainfrom
erwei-xilinx:multigpu-phase4-symmetric-alloc-pass

Conversation

@erwei-xilinx
Copy link
Copy Markdown
Collaborator

@erwei-xilinx erwei-xilinx commented May 3, 2026

Summary

Phase 4 of the multi-GPU stack: a new air-symmetric-alloc-to-mgpu conversion pass that uplevels symmetric-heap allocations from the runtime-ABI level (mgpuSymmetricAlloc + a hand-built memref descriptor) to the standard memref dialect.

What it does

Replaces the runtime-ABI dance:

%ptr   = call @mgpuSymmetricAlloc(size, stream)
%bytes = wrap_bytes(%ptr, size)            // hand-built memref<?xi8>
%m     = memref.view %bytes[0][]           // retype to memref<T>
...
call @mgpuSymmetricFree(%ptr, stream)

with the MLIR-native form:

%m = memref.alloc() {air.symmetric} : memref<T>
...
memref.dealloc %m

The pass lowers each memref.alloc {air.symmetric} to a runtime call plus an LLVM memref descriptor build + unrealized_conversion_cast back to the original memref type (so downstream uses keep working through convert-to-llvm). Each memref.dealloc whose operand traces back through the cast becomes mgpuSymmetricFree. No-op when no air.symmetric allocations are present.

Why this matters

Every test in main today carries a 15-line wrap_bytes helper that builds a memref descriptor by hand from a runtime !llvm.ptr — leaking the LLVM-struct ABI into the test IR. Phase 4 replaces this with the standard memref.alloc op + an attribute, putting the symmetric allocation at the same dialect level as the rest of the IR.

This is the foundation for phases 5/6 (DMA + channel ops): they can operate on clean memref<T> values from the user's perspective, without each test having to reconstruct memrefs from raw pointers.

What's new

  • mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h / .cpp — pass implementation (~200 LOC)
  • mlir/include/air/Conversion/GPUPasses.tdair-symmetric-alloc-to-mgpu def
  • mlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir — FileCheck unit tests with // REQUIRES: gpu (the pass isn't registered in non-GPU builds):
    • 1D alloc + dealloc shape (size, descriptor, cast, free)
    • 2D alloc with row-major strides in descriptor
    • Element type byte-size: f32 (4B), f64 (8B), i32 (4B)
    • Multiple symmetric allocs share one decl pair
    • Pass is a no-op for non-symmetric allocs
    • Pass is a no-op when there are zero symmetric allocs
  • test/gpu/multi_gpu/air_alloc/cacheline.mlir — e2e test that wraps the producer/consumer cacheline reference in air.rank AND uses memref.alloc {air.symmetric} for the symmetric data buffer. 1:1 wrap of the air_rank/cacheline reference, with only the symmetric alloc/free swapped for the new abstraction. After lowering through -air-rank-to-mgpu -air-symmetric-alloc-to-mgpu -air-translate-to-llvm, functionally equivalent to handwritten/cacheline.mlir.
  • test/gpu/multi_gpu/air_alloc/Makefile — self-contained, follows the same pattern as air_rank/Makefile.

Test plan

  • FileCheck unit tests pass (6 cases above)
  • E2E on real 2x MI325X (rad-mi325x-1, NUM_RANKS=2): cache-line message PASS (data[0]=100, flag=1) — output structurally identical to INPUT=cacheline in handwritten/ and air_rank/, only distinguished by the [mlir/alloc] log tag. 3/3 stability runs.
  • Equivalence demonstrated: the same payload values (lane+100 for lanes 0..30, flag=1 at lane 31) and the same cache-line atomicity contract reach the consumer in all three variants (handwritten/cacheline, air_rank/cacheline, air_alloc/cacheline).
  • git clang-format origin/main applied; "Python and C/C++ Check Format" check should pass

Followup

The non-symmetric staging allocations (verify_buf, heap_bases device copy) still go through mgpuMemAlloc + wrap_bytes. A future pass for non-symmetric device allocs would remove that helper too.

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 618aa19 to d8bd421 Compare May 3, 2026 20:22
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 3, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 3, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from d8bd421 to f493e1a Compare May 3, 2026 20:27
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 3, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from f493e1a to cbe0a61 Compare May 5, 2026 18:36
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 5, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from cbe0a61 to 2a8cd18 Compare May 6, 2026 00:29
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 2a8cd18 to 5e3841d Compare May 6, 2026 01:01
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 5e3841d to 2a50ab5 Compare May 6, 2026 04:24
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 2a50ab5 to 087fcb7 Compare May 6, 2026 04:39
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 087fcb7 to a89b806 Compare May 6, 2026 04:49
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from a89b806 to 4a8d33a Compare May 6, 2026 05:17
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 99da48f to 996c830 Compare May 6, 2026 18:22
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 996c830 to bdefa38 Compare May 6, 2026 18:53
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from bdefa38 to 74acf56 Compare May 6, 2026 19:02
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 74acf56 to 7947aa7 Compare May 6, 2026 20:15
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch 2 times, most recently from 953eb43 to d1d9c39 Compare May 12, 2026 15:38
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 12, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from d1d9c39 to a41fac7 Compare May 12, 2026 16:19
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 12, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from a41fac7 to 689ca78 Compare May 12, 2026 17:20
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 12, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that uplevels symmetric-heap allocations from the
runtime-ABI level to the memref dialect level. Replaces:

  %ptr   = call @mgpuSymmetricAlloc(size, stream)
  %bytes = wrap_bytes(%ptr, size)            // hand-built memref<?xi8>
  %m     = memref.view %bytes[0][]           // retype to memref<T>
  ...
  call @mgpuSymmetricFree(%ptr, stream)

with the MLIR-native form:

  %m = memref.alloc() {air.symmetric} : memref<T>
  ...
  memref.dealloc %m

The pass lowers each `memref.alloc {air.symmetric}` to the runtime call
plus a descriptor build and an `unrealized_conversion_cast` back to the
original memref type so downstream uses keep working through
`convert-to-llvm`. Each `memref.dealloc` whose operand traces back
through the cast to a symmetric alloc becomes `mgpuSymmetricFree`.

The pass is a no-op when no `air.symmetric` allocations are present.

- `mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-symmetric-alloc-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRSYMMETRICALLOCTOMGPU`
- `mlir/lib/Conversion/AIRSymmetricAllocToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir`
  — FileCheck unit test (REQUIRES:gpu, like other GPU-only passes).
- `test/gpu/multi_gpu/air_alloc/cacheline.mlir` — e2e test that wraps
  the producer/consumer cacheline reference in `air.rank` AND uses
  `memref.alloc {air.symmetric}` for the symmetric data buffer.
  Functionally equivalent to handwritten/cacheline.mlir after lowering
  through `-air-rank-to-mgpu -air-symmetric-alloc-to-mgpu
  -air-translate-to-llvm`.
- `test/gpu/multi_gpu/air_alloc/Makefile` — self-contained, same
  preconditions + driver pattern as other multi_gpu/<level>/Makefile.

FileCheck unit tests cover:
- 1D alloc + dealloc shape (size, descriptor, cast, free)
- 2D alloc with row-major strides in descriptor
- Element type byte-size: f32 (4B), f64 (8B), i32 (4B)
- Multiple symmetric allocs share one decl pair
- Pass is a no-op for non-symmetric allocs
- Pass is a no-op when there are zero symmetric allocs

End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2): cache-line
message PASS (data[0]=100, flag=1) — output structurally identical to
INPUT=cacheline in handwritten/ and air_rank/, only distinguished by
the `[mlir/alloc]` log tag. 3/3 stability runs.

Why this matters: every test today has a 15-line `wrap_bytes` helper
that builds a memref descriptor by hand from a runtime `!llvm.ptr`,
leaking the descriptor ABI into the test. Phase 4 replaces this with
the standard `memref.alloc` op + an attribute, putting the symmetric
allocation at the same dialect level as the rest of the IR. This is
the foundation for phases 5/6 (DMA + channel ops) to operate on
clean memrefs rather than reconstructing them from raw pointers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase4-symmetric-alloc-pass branch from 689ca78 to bc2b291 Compare May 12, 2026 23:52
@erwei-xilinx
Copy link
Copy Markdown
Collaborator Author

Superseded by erwei-xilinx:multigpu-air-hierarchy, which carries a refactored phase-4 pass dispatching on #air.symmetric_heap memref memory_space instead of the {air.symmetric} op attribute (commit 4a6cd86).

Rationale for the refactor: the {air.symmetric} op-attribute approach required walking the defining-op chain from each memref.load/store site to know whether the memref was symmetric-heap-backed. Encoding the property as a memory_space attribute carries it on the memref type — verifiers, helper passes, and air.translate can all answer "is this on the symmetric heap?" by inspecting the type, without a chain walk. It also lets memrefs constructed via memref.view / wrap_bytes (not just memref.alloc) carry the tag, which is needed for the cacheline tests where buffers are wrapped from raw runtime pointers.

Part of a larger redesign documented in docs/MultiGPUPhase56Redesign.md that also reshapes phases 5 and 6 (see #1580, #1581). Consolidated PR forthcoming.

erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 19, 2026
…lti-launch fixes

Replaces the in-flight phase 4 PR (Xilinx#1579), reframing symmetric-heap
allocation as a memref memory_space attribute rather than an op-attribute
on memref.alloc. Foundation for the future channel-to-cacheline lowering
(see docs/MultiGPUPhase56Redesign.md).

Why memory_space (not op-attribute):
  - Travels with SSA values automatically; no need to trace defining-op
    chain when checking from a memref.load/store site.
  - The AIR herd verifier can reject/accept based purely on the memref
    type, not on op metadata of a possibly-distant alloc.
  - Lets memrefs constructed via memref.view / wrap_bytes / etc. (not
    just memref.alloc) carry the symmetric-heap tag — required for the
    in-flight cacheline tests where buffers are wrapped from runtime
    pointers.

Changes:

  * `#air.symmetric_heap` custom memref memory_space attribute, defined
    via TableGen AttrDef in mlir/include/air/Dialect/AIR/AIROpBase.td.
    Dialect plumbing in AIROpBase.td (let useDefaultAttributePrinterParser
    = 1), CMakeLists.txt (mlir_tablegen for AIRAttrs.{h,cpp}.inc),
    AIRDialect.h (#include the generated header), AIRDialect.cpp
    (addAttributes<> in initialize()).

  * AIR herd verifier (verifyComputeMemoryAccess in AIRDialect.cpp)
    skips the L1-or-better-only check for memrefs whose memory_space is
    #air.symmetric_heap. Same for verifyAllocMemorySpace (segment-level
    alloc check). Direct memref.load/store on symmetric-heap memrefs
    inside air.herd bodies is now legal — required for the kernel-driven
    cross-rank cacheline pattern on GPU.

  * AIRSymmetricAllocToMgpu pass dispatches on the memref result type's
    memory_space, replacing the op-attribute check `op->hasAttr(
    "air.symmetric")`. FileCheck unit tests rewritten to use
    `memref.alloc() : memref<..., #air.symmetric_heap>` instead of
    `memref.alloc() {air.symmetric}`. All 6 cases pass.

  * AIR-to-ROCDL pass: two fixes uncovered by the air_hierarchy/
    cacheline e2e baseline (kept in-tree as the target shape but
    currently blocked on a separate multi-launch use-after-free in this
    same pass — see docs/MultiGPUPhase56Redesign.md).
    1. 1D / N-D launch + herd shapes are now handled (previously assumed
       2D, would crash on getSizeOperands()[1] OOB for 1D). The new
       `sizeOrOne` helper materializes a `1` constant for missing dims.
    2. Pattern set is now frozen once and reused across multiple
       launches (previously moved into applyPatternsGreedily on first
       iteration, FrozenRewritePatternSet ctor crashed on the second).

  * Plan doc (docs/MultiGPUPhase56Redesign.md) updated with the
    infrastructure landed today, what's still blocked (air-to-rocdl
    multi-launch use-after-free), and recommended next steps.

  * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir + Makefile: the
    target IR shape that phase 6's redesigned lowering should emit. Kept
    as a draft baseline; will run e2e once the air-to-rocdl multi-launch
    issue is resolved separately.

Tested:
  - lit unit test for AIRSymmetricAllocToMgpu: 6/6 pass
  - air_hierarchy e2e: blocked at air-to-rocdl multi-launch crash; the
    1D and pattern-reuse fixes here unblock the first two errors but the
    third (use-after-free during block deletion) needs separate work

Supersedes:
  - PR Xilinx#1579 (phase 4 op-attribute approach) — close as superseded

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 19, 2026
…Xilinx#1618)

* [multi-gpu] Add #air.symmetric_heap memory_space + air-to-rocdl 1D/multi-launch fixes

Replaces the in-flight phase 4 PR (Xilinx#1579), reframing symmetric-heap
allocation as a memref memory_space attribute rather than an op-attribute
on memref.alloc. Foundation for the future channel-to-cacheline lowering
(see docs/MultiGPUPhase56Redesign.md).

Why memory_space (not op-attribute):
  - Travels with SSA values automatically; no need to trace defining-op
    chain when checking from a memref.load/store site.
  - The AIR herd verifier can reject/accept based purely on the memref
    type, not on op metadata of a possibly-distant alloc.
  - Lets memrefs constructed via memref.view / wrap_bytes / etc. (not
    just memref.alloc) carry the symmetric-heap tag — required for the
    in-flight cacheline tests where buffers are wrapped from runtime
    pointers.

Changes:

  * `#air.symmetric_heap` custom memref memory_space attribute, defined
    via TableGen AttrDef in mlir/include/air/Dialect/AIR/AIROpBase.td.
    Dialect plumbing in AIROpBase.td (let useDefaultAttributePrinterParser
    = 1), CMakeLists.txt (mlir_tablegen for AIRAttrs.{h,cpp}.inc),
    AIRDialect.h (#include the generated header), AIRDialect.cpp
    (addAttributes<> in initialize()).

  * AIR herd verifier (verifyComputeMemoryAccess in AIRDialect.cpp)
    skips the L1-or-better-only check for memrefs whose memory_space is
    #air.symmetric_heap. Same for verifyAllocMemorySpace (segment-level
    alloc check). Direct memref.load/store on symmetric-heap memrefs
    inside air.herd bodies is now legal — required for the kernel-driven
    cross-rank cacheline pattern on GPU.

  * AIRSymmetricAllocToMgpu pass dispatches on the memref result type's
    memory_space, replacing the op-attribute check `op->hasAttr(
    "air.symmetric")`. FileCheck unit tests rewritten to use
    `memref.alloc() : memref<..., #air.symmetric_heap>` instead of
    `memref.alloc() {air.symmetric}`. All 6 cases pass.

  * AIR-to-ROCDL pass: two fixes uncovered by the air_hierarchy/
    cacheline e2e baseline (kept in-tree as the target shape but
    currently blocked on a separate multi-launch use-after-free in this
    same pass — see docs/MultiGPUPhase56Redesign.md).
    1. 1D / N-D launch + herd shapes are now handled (previously assumed
       2D, would crash on getSizeOperands()[1] OOB for 1D). The new
       `sizeOrOne` helper materializes a `1` constant for missing dims.
    2. Pattern set is now frozen once and reused across multiple
       launches (previously moved into applyPatternsGreedily on first
       iteration, FrozenRewritePatternSet ctor crashed on the second).

  * Plan doc (docs/MultiGPUPhase56Redesign.md) updated with the
    infrastructure landed today, what's still blocked (air-to-rocdl
    multi-launch use-after-free), and recommended next steps.

  * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir + Makefile: the
    target IR shape that phase 6's redesigned lowering should emit. Kept
    as a draft baseline; will run e2e once the air-to-rocdl multi-launch
    issue is resolved separately.

Tested:
  - lit unit test for AIRSymmetricAllocToMgpu: 6/6 pass
  - air_hierarchy e2e: blocked at air-to-rocdl multi-launch crash; the
    1D and pattern-reuse fixes here unblock the first two errors but the
    third (use-after-free during block deletion) needs separate work

Supersedes:
  - PR Xilinx#1579 (phase 4 op-attribute approach) — close as superseded

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] AIRToROCDL: fix three multi-launch bugs (use-after-free, wrong pairing, missing arg replace)

Multi-launch programs (e.g., scf.if %is_producer { air.launch P } else
{ air.launch C }) crashed AIRToROCDLPass with "Cannot destroy a value
that still has uses!" during block destruction. Three compounding bugs
identified via a minimal reproducer (two empty air.launch ops with
air.segment + air.herd):

1. blkIdx / gridIdx are class-level vectors that accumulated across
   iterations of the launch walk. The second launch ended up using the
   first launch's herd-size operands, dangling values into the new
   gpu.launch op. Fixed by clearing them at the top of each (launch,
   segment) pair (along with gridXVal / gridYVal).

2. The post-conversion body-move loop used a *nested* walk:
     module.walk(gpuLaunchOp) { module.walk(airLaunchOp) { move } }
   which pairwise-matched every gpu.launch with every air.launch in the
   module, folding multi-launch programs into the first gpu.launch and
   leaving the others empty. Fixed by performing the body move *inside*
   the first walk where the 1:1 pairing is established.

3. The body-move did not replace the air.launch's own block args
   (kernel operands) with their outer values before moving the body.
   After air.launch was later erased, the moved ops dangled into the
   destroyed block args (the actual use-after-free). Fixed with the
   same replaceAllUsesWith pattern that deleteAirHerd /
   deleteAirSegment use.

The 1D-launch and pattern-reuse fixes from the prior commit on this
branch are preserved.

Tested:
- Minimal reproducer (/tmp/multi_launch_repro.mlir): two air.launches
  with memref.store inside each herd. Now lowers cleanly to two
  distinct gpu.launch ops with correct operand routing.
- air_hierarchy/cacheline.mlir: now progresses past air-to-rocdl all
  the way through to a GPU binary (after sed-stripping
  #air.symmetric_heap from the pre-mlir-opt IR; see Makefile + plan
  doc). Runtime hang remains; likely the herd's tile-id → block-dim
  mapping doesn't preserve the cooperative cache-line semantics the
  handwritten kernel relies on. Documented as next step in
  docs/MultiGPUPhase56Redesign.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] air_hierarchy/cacheline: use herd 64×1 (PE→thread), fix §2.3 doc

The air_hierarchy/cacheline.mlir test was hanging at runtime because the
herd was declared (1,1) PE but the body used gpu.thread_id x to address
32 lanes — incompatible with the standard AIR→GPU mapping where the
herd iteration space becomes the GPU blockDim (PE → thread).

Per AIRComputeModel.md §4.1/§4.5/§5 — and the convention used by
test/gpu/4k_4k_mul/air_sync.mlir — a 1-PE herd lowers to 1 GPU thread.
Rewrite producer + consumer herds as (64, 1) so blockDim = (64, 1, 1)
(one full MI3xx wavefront), and use the herd tile id %tx in place of
gpu.thread_id x. Lanes 0..31 do real work, lanes 32..63 stay idle but
keep the wavefront full so the consumer's gpu.shuffle width=64 can see
the producer's flag lane.

Also fix the §2.3 statement that claimed "PE instances mapped to
individual warps" on GPU — that contradicted the rest of the doc and
the actual lowering in air-to-rocdl, which is PE→thread. Replace with
a pointer to §4.1 for the precise mapping, plus guidance that any herd
needing wave-cooperative ops (e.g. gpu.shuffle) should pick a herd size
that's a multiple of the target wavefront width.

Verified on MI350X (gfx950, 8 GPUs): producer + consumer both report
PASS, with data[0]=100 and flag=1 transmitted across the symmetric
heap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Phase 6: air-gpu-channel-to-cacheline pass + air_channel e2e

Adds the redesigned phase-6 pass per docs/MultiGPUPhase56Redesign.md.
Lowers air.channel.put/get ops on channels of type "gpu_symmetric_heap"
into the kernel-driven cacheline pattern that
test/gpu/multi_gpu/handwritten/cacheline.mlir writes by hand:

  put -> air.translate %src, %from, %to, %bases + cooperative memref.store
         (lanes 0..30 publish payload, lane 31 stores sync flag = 1)
  get -> scf.while spin loop + gpu.shuffle idx broadcast of lane 31's
         observation until the flag arrives; sink store keeps the spin
         alive past DCE in subsequent passes

The pass:
- Pairs puts/gets via existing air::getTheOtherChannelOpThroughSymbol util
- Infers producer/consumer ranks from enclosing scf.if (cmpi eq %rid, %k)
  rank-dispatch context
- Finds %bases by type-matching memref<?xindex, #air.symmetric_heap> in
  the put/get's enclosing herd kernel args (semantically-unique combo
  for the symmetric-heap base table; clear error on 0 or >1 matches)
- Errors loudly if any precondition isn't met (herd scope, rank
  dispatch, bases arg, memref shape)
- Erases the channel symbol after expansion

Initial scope: 1-put / 1-get cacheline pattern with
memref<32xi32, #air.symmetric_heap> source/destination. Allgather +
multi-wire topologies are deferred.

Verified end-to-end on MI350X (gfx950, 8 GPUs): produces
"=== ALL 2 RANKS PASSED ===" with data[0]=100 and flag=1 published
cross-rank, functionally equivalent to the handwritten reference and
to air_hierarchy/cacheline.mlir.

Files added:
- mlir/include/air/Conversion/AIRGpuChannelToCachelinePass.h
- mlir/lib/Conversion/AIRGpuChannelToCachelinePass.cpp
- mlir/test/Conversion/AIRGpuChannelToCacheline/cacheline.mlir
- test/gpu/multi_gpu/air_channel/{cacheline.mlir,Makefile}

Wired into existing pass registration (GPUPasses.td, GPUPassDetail.h,
CMakeLists.txt, Passes.cpp).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] cacheline spin: use upstream-idiomatic atomic_rmw shape

Switch the cacheline spin loop in all three multi_gpu cacheline tests
(handwritten/, air_hierarchy/, air_channel/) and in the
air-gpu-channel-to-cacheline pass's get-side expansion from

    %final_v = scf.while (%dummy = %c0) : (i32) -> i32 {
      %v = scf.if %active -> i32 { memref.load %dst[%tx] : ... } ...
      scf.condition(%not_ready) %v : i32
    } do { ^bb0(%vi : i32): scf.yield %vi : i32 }
    scf.if %active { memref.store %final_v, %dst[%tx] : ... }   // sink

to the upstream-idiomatic shape from
mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir:

    scf.while : () -> () {
      %v = scf.if %active -> i32 {
        %loaded = memref.atomic_rmw addi %c0_i32, %dst[%tx]
            : (i32, memref<32xi32, ...>) -> i32
        scf.yield %loaded : i32
      } else { scf.yield %c0_i32 : i32 }
      %flag, _ = gpu.shuffle idx %v, %c31, %c64 : i32
      %not_ready = arith.cmpi ne, %flag, %c1 : i32
      scf.condition(%not_ready)
    } do { scf.yield }

`memref.atomic_rmw addi %c0` is functionally a load (adds 0, returns the
prior value) but carries both Read and Write effects in its
MemoryEffectOpInterface. This:

1. Survives the DCE inside air-to-rocdl's `applyPatternsGreedily`. With
   the previous plain memref.load, MLIR's `wouldOpBeTriviallyDead`
   considered the spin's body to have only Read effects and the
   greedy driver killed the entire scf.while (verified in
   GreedyPatternRewriteDriver.cpp:483-490 + SideEffectInterfaces.cpp:84-95).
   The previous workaround was a sink store after the loop to make
   %final_v used; this commit removes the workaround.

2. Encodes "this read must be observable across producers" as an
   IR-level fact instead of relying on a plain memref.load happening to
   stay observable through the lowering chain.

3. Lets us drop the iter-arg/result plumbing (%dummy, %final_v) and the
   trailing scf.if/store sink — the spin loop becomes a self-contained
   zero-result scf.while.

Verified on MI350X (gfx950, 8 GPUs): all three tiers
(handwritten/cacheline, air_hierarchy/cacheline, air_channel/cacheline)
still produce "=== ALL 2 RANKS PASSED ===" with data[0]=100 and flag=1
cross-rank.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Makefiles: auto-detect GFX_TARGET, drop hardcoded gfx942/gfx950

All four multi_gpu test Makefiles (handwritten, air_rank, air_hierarchy,
air_channel) previously hardcoded the AMDGPU chip target in their
mlir-opt pass pipeline (some at gfx942, the newer ones at gfx950),
which meant running them on a node with a different MI3xx variant
silently produced an unloadable binary (hipErrorNoBinaryForGpu) at
runtime — and the tests reported MISMATCH instead of a clean
precondition failure.

Replace the hardcoded chip with a GFX_TARGET make variable, auto-
detected in two layers:

  1. amdgpu-arch (ROCm's clang tool; ships at /opt/rocm/llvm/bin/).
     Most direct — outputs the gfx name for the visible GPU(s) verbatim.

  2. Fallback: parse /sys/class/kfd/kfd/topology/nodes/*/properties for
     the first GPU node's `gfx_target_version`, which encodes the chip
     as major*10000 + minor*100 + step. Decode to `gfx<major><minor><step>`
     (e.g., 90500 -> gfx950, 90402 -> gfx942). Holds for the MI3xx family
     these tests target. Doesn't depend on ROCm tooling, only the kernel
     module exposing the topology files.

Users can override via `make GFX_TARGET=gfx<NNN>` for cross-compilation
or on heterogeneous systems. check-preconditions now fails loudly with
a clear message if neither detection layer produces a value.

Verified on MI350X (gfx950, 8 GPUs): all four e2e tests detect gfx950
correctly and produce "=== ALL 2 RANKS PASSED ===". Manual override
test (`make GFX_TARGET=gfx906 ...`) confirmed the variable propagates
through the pipeline pass arguments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Drop docs/MultiGPUPhase56Redesign.md — was a local planning scratchpad

This file was an internal planning document — in-flight PR statuses,
"drop phase 5", "what is currently wrong", a TL;DR aimed at the author
rather than the project. It got included in commit 4a6cd86 by mistake
and should not have landed.

The actual normative documentation lives in docs/AIRComputeModel.md
(GPU §4 explains the AIR-on-GPU mapping that this PR's phase 6 pass
lowers into). PR descriptions cover the rationale for the redesign.

Removal only, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Align AIR-on-GPU to original §2.3 compute model (PE = wavefront)

The AIR-on-GPU lowering had drifted to PE → thread over time, contradicting
docs/AIRComputeModel.md §2.3 which states the herd PE maps to a wavefront
(not a thread). The implementation drift, plus an earlier silent "fix" of
§2.3 to match the drift, made the model inconsistent across §2.3 vs §4 vs
the air-to-rocdl lowering vs the GPU tests.

This commit restores PE → wavefront as the canonical model and aligns every
GPU-side artifact to it:

  Doc (docs/AIRComputeModel.md):
    * §2.3: restored original PE → wavefront wording; added pointer to
      gpu.lane_id for lane access inside the herd body.
    * §4.1: rewrote the mapping table — blockDim = (Nx * wave_size, Ny, 1),
      tile_x → thread_id_x / wave_size, lane within PE → gpu.lane_id.
      Clarified that the GPU kernel boundary is at the (innermost) segment,
      not the herd; herd is an inline parallel block inside the kernel.
    * §4.3: clarified L1 (private) semantics under PE = wavefront.
    * §4.5: rewrote the matmul example to herd (4,1) — 4 PEs × 64 lanes =
      256 effective work items — fitting the §2.3 wavefront-slot budget.
    * §5: summary table: "air.herd tile" → "Single GPU wavefront",
      "L1" → "Per-PE (per-warp) VGPRs / private scratch".

  Pass (mlir/lib/Conversion/AIRToROCDLPass.cpp,
        mlir/include/air/Conversion/GPUPasses.td):
    * Added `wave-size` option (default 64; configurable for other arches).
    * convertLaunchToGPULaunch: blockDim.x = herd.Nx * wave_size.
    * deleteAirHerd: herd block args remapped to warp-id within block:
        tile_x → thread_id_x / wave_size
        tile_y → thread_id_y
        size_x → block_dim_x / wave_size
        size_y → block_dim_y

  Pass (mlir/lib/Conversion/AIRGpuChannelToCachelinePass.cpp):
    * put / get expansions emit `gpu.lane_id` for the per-lane index
      instead of using the herd's tile_x (which is now a warp id, not a
      thread id).
    * Updated FileCheck unit test to expect gpu.lane_id.

  Tests:
    * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir:
      herd (64, 1) → herd (1, 1) — one PE = one wave; lane index from
      gpu.lane_id (lanes 0..30 publish payload, lane 31 publishes flag).
    * test/gpu/multi_gpu/air_channel/cacheline.mlir: same.
    * test/gpu/4k_4k_mul/air_sync.mlir matmul:
      herd (256, 1) → herd (4, 1) with %gtid = %tx * 64 + gpu.lane_id;
      same 256-thread workgroup, expressed as 4 wavefronts × 64 lanes.
      Honors §2.3's ≤32 PE per herd budget.
    * test/gpu/4k_4k_mul/run.sh: GFX_TARGET auto-detected (amdgpu-arch
      primary, KFD-topology fallback). Fixed a pre-existing bug where
      LLVM_LIB_DIR was set to LLVM_INSTALL_DIR (missing /lib) when
      LLVM_INSTALL_DIR was set in the environment.

Verified on MI350X (gfx950, 8 GPUs):
  * FileCheck mlir/test/Conversion/AIRGpuChannelToCacheline/cacheline.mlir
  * multi_gpu/handwritten/cacheline      → === ALL 2 RANKS PASSED ===
  * multi_gpu/air_hierarchy/cacheline    → === ALL 2 RANKS PASSED ===
  * multi_gpu/air_channel/cacheline      → === ALL 2 RANKS PASSED ===
  * 4k_4k_mul matmul                     → "Output Matched!"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] docs §2.3: drop GPU dialect leak from abstract model section

§2.3 ("Platform-specific iteration space semantics") is about herd shape
and placement budgets — not how to access state inside a PE. The
gpu.lane_id / gpu.shuffle sentence I added in the previous commit (about
PE = wavefront) belonged in §4, not §2.3; §4.1 already carries that
guidance in the mapping table. Removing the bolted-on sentence restores
§2.3 to its original wording (three claims: herd ⊆ CU, PE → warp,
≤32 PE budget) with no dialect-op leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] docs §4: minimal diff — only the cell changes forced by §2.3

Previous commits expanded §4 with new intro paragraphs, an extra
mapping-table row, an L1 description rewrite, and matmul-example
elaboration — all of which were elaboration beyond what reverting
§2.3 strictly required.

Restore everything to the original wording except the cells that are
*literally wrong* under PE = wavefront:

  §4.1 mapping table:
    - "blockDim = (bx, by, 1)" → "blockDim = (bx * wave_size, by, 1),
       with each PE materialised as one wavefront"
    - "Herd tile index → (threadIdx.x, threadIdx.y)" → "warp-id within
       block: (threadIdx.x / wave_size, threadIdx.y)"

  §4.1 prose: "number of threads per block" → "number of wavefronts per
    block"; "per-thread air.herd body" → "per-PE air.herd body". Two-word
    tweaks; no new paragraphs.

  §4.5 matmul example:
    - herd (256, 1) → herd (4, 1) (256 PEs violated §2.3's ≤32 budget;
      4 PEs × 64 lanes = same 256 work items)
    - mapping summary updated to match

  §5 summary table:
    - "air.herd tile → Single GPU thread" → "Single GPU wavefront"

§4.3 (memory space mapping) reverted entirely — original "Per-thread
private (VGPRs/scratch)" is technically still accurate under PE = warp
(MLIR's "private" space is per-thread regardless of how a PE is
defined).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Review Xilinx#1618 (Copilot): drop unused --check-prefix=ANY RUN line

The second RUN line in AIRGpuChannelToCacheline/cacheline.mlir invoked
FileCheck with --check-prefix=ANY but the file has no ANY: patterns,
so FileCheck errors with "no check strings found with prefix 'ANY'".
Drop the RUN line; the remaining (default-prefix) RUN line covers
the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Review Xilinx#1618 (Copilot): guard blkIdx.getDefiningOp() null in convertLaunchToGPULaunch

`blkIdx[i]` is allowed to be a BlockArgument — happens when an
air.herd/air.launch size operand is passed in as an SSA value rather
than declared as a constant in scope. In that case getDefiningOp()
returns null and the unconditional `blockXValOp->moveBefore(launchOp)`
would crash with a null deref.

Add the obvious null guard: only move when there's a defining op (a
BlockArgument already dominates launchOp from the enclosing scope,
so no move is needed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Review Xilinx#1618 (Copilot): require identity layout in air-symmetric-alloc-to-mgpu

buildMemrefDescriptor hard-codes row-major strides from the shape and
offset=0; it can't faithfully represent strided / affine layouts.
Without an explicit check, a non-identity layout (`memref<NxT, strided<...>,
#air.symmetric_heap>`) would be silently miscompiled — the runtime
allocation would be sized correctly but the descriptor strides would
not match the user-declared layout.

Add an `isIdentity()` precondition check in the AllocOp loop, before
calling buildMemrefDescriptor. Emit a clear pass-failure diagnostic
naming the requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Review Xilinx#1618 (Copilot): unify mlir:: prefix style in GPUPasses.td dependentDialects

ConvertAIRToROCDL's dependentDialects mixed `gpu::GPUDialect` /
`LLVM::LLVMDialect` (no `mlir::` prefix) with `mlir::arith::ArithDialect`
(with prefix). The rest of this file (and the repo's convention) uses
the bare style. Drop the `mlir::` prefix from the arith entry for
consistency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] Review Xilinx#1618 (Copilot): tighten air-gpu-channel-to-cacheline (rank inference, bases shape, wave-size)

Three related fixes to the channel-to-cacheline pass surfaced by
Copilot review on Xilinx#1618.

1. inferRankFromEnclosingIf (Copilot #2): the pass walked up scf.if
   chains and accepted the first `cmpi eq %v, %const` as a rank
   dispatch. It never checked %v derives from %rid of the enclosing
   air.rank, so any other `eq %v, %const` (e.g. a lane predicate
   `cmpi eq %lane, %c0` inside the herd body) would mis-infer "rank 0".
   Add a helper `isRankIdOfEnclosingAirRank(v)` and require %v be a
   rank-id block arg of an enclosing air.rank before extracting the
   constant. Also handle the (rid, const) and (const, rid) orderings
   symmetrically via a small lambda instead of two near-identical
   branches.

2. findUniqueBasesArg (Copilot #3): the heap_bases arg search filtered
   only on element type (index) + memory_space (#air.symmetric_heap)
   but missed `memrefTy.getRank() == 1`. A 2-D index memref with the
   symmetric_heap memory_space would silently match, and downstream
   air.translate creation would fail with a poor message. Add the
   rank check.

3. Wave-size option (Copilot Xilinx#4): the spin loop's gpu.shuffle hard-
   coded `width = 64`, which is wrong on wave32 targets (NVIDIA). The
   sister pass air-to-rocdl already has a `wave-size` option; mirror
   it here. Default 64 (matching air-to-rocdl); user must set both
   options consistently. Updated td description spells out the must-
   match-air-to-rocdl + must-be->=32 (cacheline memref shape)
   constraints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [multi-gpu] CI fix: clang-format-17 line-break adjustments in AIRGpuChannelToCachelinePass.cpp

CI's clang-format-17 wants slightly different line breaks than what
my local clang-format produced:

  - Break after `(Value rid,` in the lambda signature in
    inferRankFromEnclosingIf (line was 1 char too long without wrap).
  - Break after `gpu::ShuffleOp::create(` instead of after the second
    arg in expandGetToCachelineSpin (after the waveWidthI32 rename
    pushed total width over the limit).

Pure formatting; no semantic change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant