[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass#1579
[multi-gpu] Phase 4: air-symmetric-alloc-to-mgpu lowering pass#1579erwei-xilinx wants to merge 1 commit into
Conversation
618aa19 to
d8bd421
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d8bd421 to
f493e1a
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
f493e1a to
cbe0a61
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cbe0a61 to
2a8cd18
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2a8cd18 to
5e3841d
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5e3841d to
2a50ab5
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2a50ab5 to
087fcb7
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
087fcb7 to
a89b806
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a89b806 to
4a8d33a
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
99da48f to
996c830
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
996c830 to
bdefa38
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdefa38 to
74acf56
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
74acf56 to
7947aa7
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
953eb43 to
d1d9c39
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
## Restrictions (this initial version)
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
## Files
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
## Test plan
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d1d9c39 to
a41fac7
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
a41fac7 to
689ca78
Compare
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.
The peer pointer is computed at runtime as:
peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])
where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.
- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op
These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.
- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector
FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs
End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
own dst_buf; verification reads back 1.0.
Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New conversion pass that uplevels symmetric-heap allocations from the
runtime-ABI level to the memref dialect level. Replaces:
%ptr = call @mgpuSymmetricAlloc(size, stream)
%bytes = wrap_bytes(%ptr, size) // hand-built memref<?xi8>
%m = memref.view %bytes[0][] // retype to memref<T>
...
call @mgpuSymmetricFree(%ptr, stream)
with the MLIR-native form:
%m = memref.alloc() {air.symmetric} : memref<T>
...
memref.dealloc %m
The pass lowers each `memref.alloc {air.symmetric}` to the runtime call
plus a descriptor build and an `unrealized_conversion_cast` back to the
original memref type so downstream uses keep working through
`convert-to-llvm`. Each `memref.dealloc` whose operand traces back
through the cast to a symmetric alloc becomes `mgpuSymmetricFree`.
The pass is a no-op when no `air.symmetric` allocations are present.
- `mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-symmetric-alloc-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRSYMMETRICALLOCTOMGPU`
- `mlir/lib/Conversion/AIRSymmetricAllocToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir`
— FileCheck unit test (REQUIRES:gpu, like other GPU-only passes).
- `test/gpu/multi_gpu/air_alloc/cacheline.mlir` — e2e test that wraps
the producer/consumer cacheline reference in `air.rank` AND uses
`memref.alloc {air.symmetric}` for the symmetric data buffer.
Functionally equivalent to handwritten/cacheline.mlir after lowering
through `-air-rank-to-mgpu -air-symmetric-alloc-to-mgpu
-air-translate-to-llvm`.
- `test/gpu/multi_gpu/air_alloc/Makefile` — self-contained, same
preconditions + driver pattern as other multi_gpu/<level>/Makefile.
FileCheck unit tests cover:
- 1D alloc + dealloc shape (size, descriptor, cast, free)
- 2D alloc with row-major strides in descriptor
- Element type byte-size: f32 (4B), f64 (8B), i32 (4B)
- Multiple symmetric allocs share one decl pair
- Pass is a no-op for non-symmetric allocs
- Pass is a no-op when there are zero symmetric allocs
End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2): cache-line
message PASS (data[0]=100, flag=1) — output structurally identical to
INPUT=cacheline in handwritten/ and air_rank/, only distinguished by
the `[mlir/alloc]` log tag. 3/3 stability runs.
Why this matters: every test today has a 15-line `wrap_bytes` helper
that builds a memref descriptor by hand from a runtime `!llvm.ptr`,
leaking the descriptor ABI into the test. Phase 4 replaces this with
the standard `memref.alloc` op + an attribute, putting the symmetric
allocation at the same dialect level as the rest of the IR. This is
the foundation for phases 5/6 (DMA + channel ops) to operate on
clean memrefs rather than reconstructing them from raw pointers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
689ca78 to
bc2b291
Compare
|
Superseded by erwei-xilinx:multigpu-air-hierarchy, which carries a refactored phase-4 pass dispatching on Rationale for the refactor: the Part of a larger redesign documented in |
…lti-launch fixes Replaces the in-flight phase 4 PR (Xilinx#1579), reframing symmetric-heap allocation as a memref memory_space attribute rather than an op-attribute on memref.alloc. Foundation for the future channel-to-cacheline lowering (see docs/MultiGPUPhase56Redesign.md). Why memory_space (not op-attribute): - Travels with SSA values automatically; no need to trace defining-op chain when checking from a memref.load/store site. - The AIR herd verifier can reject/accept based purely on the memref type, not on op metadata of a possibly-distant alloc. - Lets memrefs constructed via memref.view / wrap_bytes / etc. (not just memref.alloc) carry the symmetric-heap tag — required for the in-flight cacheline tests where buffers are wrapped from runtime pointers. Changes: * `#air.symmetric_heap` custom memref memory_space attribute, defined via TableGen AttrDef in mlir/include/air/Dialect/AIR/AIROpBase.td. Dialect plumbing in AIROpBase.td (let useDefaultAttributePrinterParser = 1), CMakeLists.txt (mlir_tablegen for AIRAttrs.{h,cpp}.inc), AIRDialect.h (#include the generated header), AIRDialect.cpp (addAttributes<> in initialize()). * AIR herd verifier (verifyComputeMemoryAccess in AIRDialect.cpp) skips the L1-or-better-only check for memrefs whose memory_space is #air.symmetric_heap. Same for verifyAllocMemorySpace (segment-level alloc check). Direct memref.load/store on symmetric-heap memrefs inside air.herd bodies is now legal — required for the kernel-driven cross-rank cacheline pattern on GPU. * AIRSymmetricAllocToMgpu pass dispatches on the memref result type's memory_space, replacing the op-attribute check `op->hasAttr( "air.symmetric")`. FileCheck unit tests rewritten to use `memref.alloc() : memref<..., #air.symmetric_heap>` instead of `memref.alloc() {air.symmetric}`. All 6 cases pass. * AIR-to-ROCDL pass: two fixes uncovered by the air_hierarchy/ cacheline e2e baseline (kept in-tree as the target shape but currently blocked on a separate multi-launch use-after-free in this same pass — see docs/MultiGPUPhase56Redesign.md). 1. 1D / N-D launch + herd shapes are now handled (previously assumed 2D, would crash on getSizeOperands()[1] OOB for 1D). The new `sizeOrOne` helper materializes a `1` constant for missing dims. 2. Pattern set is now frozen once and reused across multiple launches (previously moved into applyPatternsGreedily on first iteration, FrozenRewritePatternSet ctor crashed on the second). * Plan doc (docs/MultiGPUPhase56Redesign.md) updated with the infrastructure landed today, what's still blocked (air-to-rocdl multi-launch use-after-free), and recommended next steps. * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir + Makefile: the target IR shape that phase 6's redesigned lowering should emit. Kept as a draft baseline; will run e2e once the air-to-rocdl multi-launch issue is resolved separately. Tested: - lit unit test for AIRSymmetricAllocToMgpu: 6/6 pass - air_hierarchy e2e: blocked at air-to-rocdl multi-launch crash; the 1D and pattern-reuse fixes here unblock the first two errors but the third (use-after-free during block deletion) needs separate work Supersedes: - PR Xilinx#1579 (phase 4 op-attribute approach) — close as superseded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Xilinx#1618) * [multi-gpu] Add #air.symmetric_heap memory_space + air-to-rocdl 1D/multi-launch fixes Replaces the in-flight phase 4 PR (Xilinx#1579), reframing symmetric-heap allocation as a memref memory_space attribute rather than an op-attribute on memref.alloc. Foundation for the future channel-to-cacheline lowering (see docs/MultiGPUPhase56Redesign.md). Why memory_space (not op-attribute): - Travels with SSA values automatically; no need to trace defining-op chain when checking from a memref.load/store site. - The AIR herd verifier can reject/accept based purely on the memref type, not on op metadata of a possibly-distant alloc. - Lets memrefs constructed via memref.view / wrap_bytes / etc. (not just memref.alloc) carry the symmetric-heap tag — required for the in-flight cacheline tests where buffers are wrapped from runtime pointers. Changes: * `#air.symmetric_heap` custom memref memory_space attribute, defined via TableGen AttrDef in mlir/include/air/Dialect/AIR/AIROpBase.td. Dialect plumbing in AIROpBase.td (let useDefaultAttributePrinterParser = 1), CMakeLists.txt (mlir_tablegen for AIRAttrs.{h,cpp}.inc), AIRDialect.h (#include the generated header), AIRDialect.cpp (addAttributes<> in initialize()). * AIR herd verifier (verifyComputeMemoryAccess in AIRDialect.cpp) skips the L1-or-better-only check for memrefs whose memory_space is #air.symmetric_heap. Same for verifyAllocMemorySpace (segment-level alloc check). Direct memref.load/store on symmetric-heap memrefs inside air.herd bodies is now legal — required for the kernel-driven cross-rank cacheline pattern on GPU. * AIRSymmetricAllocToMgpu pass dispatches on the memref result type's memory_space, replacing the op-attribute check `op->hasAttr( "air.symmetric")`. FileCheck unit tests rewritten to use `memref.alloc() : memref<..., #air.symmetric_heap>` instead of `memref.alloc() {air.symmetric}`. All 6 cases pass. * AIR-to-ROCDL pass: two fixes uncovered by the air_hierarchy/ cacheline e2e baseline (kept in-tree as the target shape but currently blocked on a separate multi-launch use-after-free in this same pass — see docs/MultiGPUPhase56Redesign.md). 1. 1D / N-D launch + herd shapes are now handled (previously assumed 2D, would crash on getSizeOperands()[1] OOB for 1D). The new `sizeOrOne` helper materializes a `1` constant for missing dims. 2. Pattern set is now frozen once and reused across multiple launches (previously moved into applyPatternsGreedily on first iteration, FrozenRewritePatternSet ctor crashed on the second). * Plan doc (docs/MultiGPUPhase56Redesign.md) updated with the infrastructure landed today, what's still blocked (air-to-rocdl multi-launch use-after-free), and recommended next steps. * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir + Makefile: the target IR shape that phase 6's redesigned lowering should emit. Kept as a draft baseline; will run e2e once the air-to-rocdl multi-launch issue is resolved separately. Tested: - lit unit test for AIRSymmetricAllocToMgpu: 6/6 pass - air_hierarchy e2e: blocked at air-to-rocdl multi-launch crash; the 1D and pattern-reuse fixes here unblock the first two errors but the third (use-after-free during block deletion) needs separate work Supersedes: - PR Xilinx#1579 (phase 4 op-attribute approach) — close as superseded Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] AIRToROCDL: fix three multi-launch bugs (use-after-free, wrong pairing, missing arg replace) Multi-launch programs (e.g., scf.if %is_producer { air.launch P } else { air.launch C }) crashed AIRToROCDLPass with "Cannot destroy a value that still has uses!" during block destruction. Three compounding bugs identified via a minimal reproducer (two empty air.launch ops with air.segment + air.herd): 1. blkIdx / gridIdx are class-level vectors that accumulated across iterations of the launch walk. The second launch ended up using the first launch's herd-size operands, dangling values into the new gpu.launch op. Fixed by clearing them at the top of each (launch, segment) pair (along with gridXVal / gridYVal). 2. The post-conversion body-move loop used a *nested* walk: module.walk(gpuLaunchOp) { module.walk(airLaunchOp) { move } } which pairwise-matched every gpu.launch with every air.launch in the module, folding multi-launch programs into the first gpu.launch and leaving the others empty. Fixed by performing the body move *inside* the first walk where the 1:1 pairing is established. 3. The body-move did not replace the air.launch's own block args (kernel operands) with their outer values before moving the body. After air.launch was later erased, the moved ops dangled into the destroyed block args (the actual use-after-free). Fixed with the same replaceAllUsesWith pattern that deleteAirHerd / deleteAirSegment use. The 1D-launch and pattern-reuse fixes from the prior commit on this branch are preserved. Tested: - Minimal reproducer (/tmp/multi_launch_repro.mlir): two air.launches with memref.store inside each herd. Now lowers cleanly to two distinct gpu.launch ops with correct operand routing. - air_hierarchy/cacheline.mlir: now progresses past air-to-rocdl all the way through to a GPU binary (after sed-stripping #air.symmetric_heap from the pre-mlir-opt IR; see Makefile + plan doc). Runtime hang remains; likely the herd's tile-id → block-dim mapping doesn't preserve the cooperative cache-line semantics the handwritten kernel relies on. Documented as next step in docs/MultiGPUPhase56Redesign.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] air_hierarchy/cacheline: use herd 64×1 (PE→thread), fix §2.3 doc The air_hierarchy/cacheline.mlir test was hanging at runtime because the herd was declared (1,1) PE but the body used gpu.thread_id x to address 32 lanes — incompatible with the standard AIR→GPU mapping where the herd iteration space becomes the GPU blockDim (PE → thread). Per AIRComputeModel.md §4.1/§4.5/§5 — and the convention used by test/gpu/4k_4k_mul/air_sync.mlir — a 1-PE herd lowers to 1 GPU thread. Rewrite producer + consumer herds as (64, 1) so blockDim = (64, 1, 1) (one full MI3xx wavefront), and use the herd tile id %tx in place of gpu.thread_id x. Lanes 0..31 do real work, lanes 32..63 stay idle but keep the wavefront full so the consumer's gpu.shuffle width=64 can see the producer's flag lane. Also fix the §2.3 statement that claimed "PE instances mapped to individual warps" on GPU — that contradicted the rest of the doc and the actual lowering in air-to-rocdl, which is PE→thread. Replace with a pointer to §4.1 for the precise mapping, plus guidance that any herd needing wave-cooperative ops (e.g. gpu.shuffle) should pick a herd size that's a multiple of the target wavefront width. Verified on MI350X (gfx950, 8 GPUs): producer + consumer both report PASS, with data[0]=100 and flag=1 transmitted across the symmetric heap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Phase 6: air-gpu-channel-to-cacheline pass + air_channel e2e Adds the redesigned phase-6 pass per docs/MultiGPUPhase56Redesign.md. Lowers air.channel.put/get ops on channels of type "gpu_symmetric_heap" into the kernel-driven cacheline pattern that test/gpu/multi_gpu/handwritten/cacheline.mlir writes by hand: put -> air.translate %src, %from, %to, %bases + cooperative memref.store (lanes 0..30 publish payload, lane 31 stores sync flag = 1) get -> scf.while spin loop + gpu.shuffle idx broadcast of lane 31's observation until the flag arrives; sink store keeps the spin alive past DCE in subsequent passes The pass: - Pairs puts/gets via existing air::getTheOtherChannelOpThroughSymbol util - Infers producer/consumer ranks from enclosing scf.if (cmpi eq %rid, %k) rank-dispatch context - Finds %bases by type-matching memref<?xindex, #air.symmetric_heap> in the put/get's enclosing herd kernel args (semantically-unique combo for the symmetric-heap base table; clear error on 0 or >1 matches) - Errors loudly if any precondition isn't met (herd scope, rank dispatch, bases arg, memref shape) - Erases the channel symbol after expansion Initial scope: 1-put / 1-get cacheline pattern with memref<32xi32, #air.symmetric_heap> source/destination. Allgather + multi-wire topologies are deferred. Verified end-to-end on MI350X (gfx950, 8 GPUs): produces "=== ALL 2 RANKS PASSED ===" with data[0]=100 and flag=1 published cross-rank, functionally equivalent to the handwritten reference and to air_hierarchy/cacheline.mlir. Files added: - mlir/include/air/Conversion/AIRGpuChannelToCachelinePass.h - mlir/lib/Conversion/AIRGpuChannelToCachelinePass.cpp - mlir/test/Conversion/AIRGpuChannelToCacheline/cacheline.mlir - test/gpu/multi_gpu/air_channel/{cacheline.mlir,Makefile} Wired into existing pass registration (GPUPasses.td, GPUPassDetail.h, CMakeLists.txt, Passes.cpp). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] cacheline spin: use upstream-idiomatic atomic_rmw shape Switch the cacheline spin loop in all three multi_gpu cacheline tests (handwritten/, air_hierarchy/, air_channel/) and in the air-gpu-channel-to-cacheline pass's get-side expansion from %final_v = scf.while (%dummy = %c0) : (i32) -> i32 { %v = scf.if %active -> i32 { memref.load %dst[%tx] : ... } ... scf.condition(%not_ready) %v : i32 } do { ^bb0(%vi : i32): scf.yield %vi : i32 } scf.if %active { memref.store %final_v, %dst[%tx] : ... } // sink to the upstream-idiomatic shape from mlir/test/Integration/GPU/CUDA/concurrent-kernels.mlir: scf.while : () -> () { %v = scf.if %active -> i32 { %loaded = memref.atomic_rmw addi %c0_i32, %dst[%tx] : (i32, memref<32xi32, ...>) -> i32 scf.yield %loaded : i32 } else { scf.yield %c0_i32 : i32 } %flag, _ = gpu.shuffle idx %v, %c31, %c64 : i32 %not_ready = arith.cmpi ne, %flag, %c1 : i32 scf.condition(%not_ready) } do { scf.yield } `memref.atomic_rmw addi %c0` is functionally a load (adds 0, returns the prior value) but carries both Read and Write effects in its MemoryEffectOpInterface. This: 1. Survives the DCE inside air-to-rocdl's `applyPatternsGreedily`. With the previous plain memref.load, MLIR's `wouldOpBeTriviallyDead` considered the spin's body to have only Read effects and the greedy driver killed the entire scf.while (verified in GreedyPatternRewriteDriver.cpp:483-490 + SideEffectInterfaces.cpp:84-95). The previous workaround was a sink store after the loop to make %final_v used; this commit removes the workaround. 2. Encodes "this read must be observable across producers" as an IR-level fact instead of relying on a plain memref.load happening to stay observable through the lowering chain. 3. Lets us drop the iter-arg/result plumbing (%dummy, %final_v) and the trailing scf.if/store sink — the spin loop becomes a self-contained zero-result scf.while. Verified on MI350X (gfx950, 8 GPUs): all three tiers (handwritten/cacheline, air_hierarchy/cacheline, air_channel/cacheline) still produce "=== ALL 2 RANKS PASSED ===" with data[0]=100 and flag=1 cross-rank. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Makefiles: auto-detect GFX_TARGET, drop hardcoded gfx942/gfx950 All four multi_gpu test Makefiles (handwritten, air_rank, air_hierarchy, air_channel) previously hardcoded the AMDGPU chip target in their mlir-opt pass pipeline (some at gfx942, the newer ones at gfx950), which meant running them on a node with a different MI3xx variant silently produced an unloadable binary (hipErrorNoBinaryForGpu) at runtime — and the tests reported MISMATCH instead of a clean precondition failure. Replace the hardcoded chip with a GFX_TARGET make variable, auto- detected in two layers: 1. amdgpu-arch (ROCm's clang tool; ships at /opt/rocm/llvm/bin/). Most direct — outputs the gfx name for the visible GPU(s) verbatim. 2. Fallback: parse /sys/class/kfd/kfd/topology/nodes/*/properties for the first GPU node's `gfx_target_version`, which encodes the chip as major*10000 + minor*100 + step. Decode to `gfx<major><minor><step>` (e.g., 90500 -> gfx950, 90402 -> gfx942). Holds for the MI3xx family these tests target. Doesn't depend on ROCm tooling, only the kernel module exposing the topology files. Users can override via `make GFX_TARGET=gfx<NNN>` for cross-compilation or on heterogeneous systems. check-preconditions now fails loudly with a clear message if neither detection layer produces a value. Verified on MI350X (gfx950, 8 GPUs): all four e2e tests detect gfx950 correctly and produce "=== ALL 2 RANKS PASSED ===". Manual override test (`make GFX_TARGET=gfx906 ...`) confirmed the variable propagates through the pipeline pass arguments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Drop docs/MultiGPUPhase56Redesign.md — was a local planning scratchpad This file was an internal planning document — in-flight PR statuses, "drop phase 5", "what is currently wrong", a TL;DR aimed at the author rather than the project. It got included in commit 4a6cd86 by mistake and should not have landed. The actual normative documentation lives in docs/AIRComputeModel.md (GPU §4 explains the AIR-on-GPU mapping that this PR's phase 6 pass lowers into). PR descriptions cover the rationale for the redesign. Removal only, no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Align AIR-on-GPU to original §2.3 compute model (PE = wavefront) The AIR-on-GPU lowering had drifted to PE → thread over time, contradicting docs/AIRComputeModel.md §2.3 which states the herd PE maps to a wavefront (not a thread). The implementation drift, plus an earlier silent "fix" of §2.3 to match the drift, made the model inconsistent across §2.3 vs §4 vs the air-to-rocdl lowering vs the GPU tests. This commit restores PE → wavefront as the canonical model and aligns every GPU-side artifact to it: Doc (docs/AIRComputeModel.md): * §2.3: restored original PE → wavefront wording; added pointer to gpu.lane_id for lane access inside the herd body. * §4.1: rewrote the mapping table — blockDim = (Nx * wave_size, Ny, 1), tile_x → thread_id_x / wave_size, lane within PE → gpu.lane_id. Clarified that the GPU kernel boundary is at the (innermost) segment, not the herd; herd is an inline parallel block inside the kernel. * §4.3: clarified L1 (private) semantics under PE = wavefront. * §4.5: rewrote the matmul example to herd (4,1) — 4 PEs × 64 lanes = 256 effective work items — fitting the §2.3 wavefront-slot budget. * §5: summary table: "air.herd tile" → "Single GPU wavefront", "L1" → "Per-PE (per-warp) VGPRs / private scratch". Pass (mlir/lib/Conversion/AIRToROCDLPass.cpp, mlir/include/air/Conversion/GPUPasses.td): * Added `wave-size` option (default 64; configurable for other arches). * convertLaunchToGPULaunch: blockDim.x = herd.Nx * wave_size. * deleteAirHerd: herd block args remapped to warp-id within block: tile_x → thread_id_x / wave_size tile_y → thread_id_y size_x → block_dim_x / wave_size size_y → block_dim_y Pass (mlir/lib/Conversion/AIRGpuChannelToCachelinePass.cpp): * put / get expansions emit `gpu.lane_id` for the per-lane index instead of using the herd's tile_x (which is now a warp id, not a thread id). * Updated FileCheck unit test to expect gpu.lane_id. Tests: * test/gpu/multi_gpu/air_hierarchy/cacheline.mlir: herd (64, 1) → herd (1, 1) — one PE = one wave; lane index from gpu.lane_id (lanes 0..30 publish payload, lane 31 publishes flag). * test/gpu/multi_gpu/air_channel/cacheline.mlir: same. * test/gpu/4k_4k_mul/air_sync.mlir matmul: herd (256, 1) → herd (4, 1) with %gtid = %tx * 64 + gpu.lane_id; same 256-thread workgroup, expressed as 4 wavefronts × 64 lanes. Honors §2.3's ≤32 PE per herd budget. * test/gpu/4k_4k_mul/run.sh: GFX_TARGET auto-detected (amdgpu-arch primary, KFD-topology fallback). Fixed a pre-existing bug where LLVM_LIB_DIR was set to LLVM_INSTALL_DIR (missing /lib) when LLVM_INSTALL_DIR was set in the environment. Verified on MI350X (gfx950, 8 GPUs): * FileCheck mlir/test/Conversion/AIRGpuChannelToCacheline/cacheline.mlir * multi_gpu/handwritten/cacheline → === ALL 2 RANKS PASSED === * multi_gpu/air_hierarchy/cacheline → === ALL 2 RANKS PASSED === * multi_gpu/air_channel/cacheline → === ALL 2 RANKS PASSED === * 4k_4k_mul matmul → "Output Matched!" Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] docs §2.3: drop GPU dialect leak from abstract model section §2.3 ("Platform-specific iteration space semantics") is about herd shape and placement budgets — not how to access state inside a PE. The gpu.lane_id / gpu.shuffle sentence I added in the previous commit (about PE = wavefront) belonged in §4, not §2.3; §4.1 already carries that guidance in the mapping table. Removing the bolted-on sentence restores §2.3 to its original wording (three claims: herd ⊆ CU, PE → warp, ≤32 PE budget) with no dialect-op leak. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] docs §4: minimal diff — only the cell changes forced by §2.3 Previous commits expanded §4 with new intro paragraphs, an extra mapping-table row, an L1 description rewrite, and matmul-example elaboration — all of which were elaboration beyond what reverting §2.3 strictly required. Restore everything to the original wording except the cells that are *literally wrong* under PE = wavefront: §4.1 mapping table: - "blockDim = (bx, by, 1)" → "blockDim = (bx * wave_size, by, 1), with each PE materialised as one wavefront" - "Herd tile index → (threadIdx.x, threadIdx.y)" → "warp-id within block: (threadIdx.x / wave_size, threadIdx.y)" §4.1 prose: "number of threads per block" → "number of wavefronts per block"; "per-thread air.herd body" → "per-PE air.herd body". Two-word tweaks; no new paragraphs. §4.5 matmul example: - herd (256, 1) → herd (4, 1) (256 PEs violated §2.3's ≤32 budget; 4 PEs × 64 lanes = same 256 work items) - mapping summary updated to match §5 summary table: - "air.herd tile → Single GPU thread" → "Single GPU wavefront" §4.3 (memory space mapping) reverted entirely — original "Per-thread private (VGPRs/scratch)" is technically still accurate under PE = warp (MLIR's "private" space is per-thread regardless of how a PE is defined). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): drop unused --check-prefix=ANY RUN line The second RUN line in AIRGpuChannelToCacheline/cacheline.mlir invoked FileCheck with --check-prefix=ANY but the file has no ANY: patterns, so FileCheck errors with "no check strings found with prefix 'ANY'". Drop the RUN line; the remaining (default-prefix) RUN line covers the test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): guard blkIdx.getDefiningOp() null in convertLaunchToGPULaunch `blkIdx[i]` is allowed to be a BlockArgument — happens when an air.herd/air.launch size operand is passed in as an SSA value rather than declared as a constant in scope. In that case getDefiningOp() returns null and the unconditional `blockXValOp->moveBefore(launchOp)` would crash with a null deref. Add the obvious null guard: only move when there's a defining op (a BlockArgument already dominates launchOp from the enclosing scope, so no move is needed). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): require identity layout in air-symmetric-alloc-to-mgpu buildMemrefDescriptor hard-codes row-major strides from the shape and offset=0; it can't faithfully represent strided / affine layouts. Without an explicit check, a non-identity layout (`memref<NxT, strided<...>, #air.symmetric_heap>`) would be silently miscompiled — the runtime allocation would be sized correctly but the descriptor strides would not match the user-declared layout. Add an `isIdentity()` precondition check in the AllocOp loop, before calling buildMemrefDescriptor. Emit a clear pass-failure diagnostic naming the requirement. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): unify mlir:: prefix style in GPUPasses.td dependentDialects ConvertAIRToROCDL's dependentDialects mixed `gpu::GPUDialect` / `LLVM::LLVMDialect` (no `mlir::` prefix) with `mlir::arith::ArithDialect` (with prefix). The rest of this file (and the repo's convention) uses the bare style. Drop the `mlir::` prefix from the arith entry for consistency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] Review Xilinx#1618 (Copilot): tighten air-gpu-channel-to-cacheline (rank inference, bases shape, wave-size) Three related fixes to the channel-to-cacheline pass surfaced by Copilot review on Xilinx#1618. 1. inferRankFromEnclosingIf (Copilot #2): the pass walked up scf.if chains and accepted the first `cmpi eq %v, %const` as a rank dispatch. It never checked %v derives from %rid of the enclosing air.rank, so any other `eq %v, %const` (e.g. a lane predicate `cmpi eq %lane, %c0` inside the herd body) would mis-infer "rank 0". Add a helper `isRankIdOfEnclosingAirRank(v)` and require %v be a rank-id block arg of an enclosing air.rank before extracting the constant. Also handle the (rid, const) and (const, rid) orderings symmetrically via a small lambda instead of two near-identical branches. 2. findUniqueBasesArg (Copilot #3): the heap_bases arg search filtered only on element type (index) + memory_space (#air.symmetric_heap) but missed `memrefTy.getRank() == 1`. A 2-D index memref with the symmetric_heap memory_space would silently match, and downstream air.translate creation would fail with a poor message. Add the rank check. 3. Wave-size option (Copilot Xilinx#4): the spin loop's gpu.shuffle hard- coded `width = 64`, which is wrong on wave32 targets (NVIDIA). The sister pass air-to-rocdl already has a `wave-size` option; mirror it here. Default 64 (matching air-to-rocdl); user must set both options consistently. Updated td description spells out the must- match-air-to-rocdl + must-be->=32 (cacheline memref shape) constraints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [multi-gpu] CI fix: clang-format-17 line-break adjustments in AIRGpuChannelToCachelinePass.cpp CI's clang-format-17 wants slightly different line breaks than what my local clang-format produced: - Break after `(Value rid,` in the lambda signature in inferRankFromEnclosingIf (line was 1 char too long without wrap). - Break after `gpu::ShuffleOp::create(` instead of after the second arg in expandGetToCachelineSpin (after the waveWidthI32 rename pushed total width over the limit). Pure formatting; no semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 4 of the multi-GPU stack: a new
air-symmetric-alloc-to-mgpuconversion pass that uplevels symmetric-heap allocations from the runtime-ABI level (mgpuSymmetricAlloc+ a hand-built memref descriptor) to the standard memref dialect.What it does
Replaces the runtime-ABI dance:
with the MLIR-native form:
The pass lowers each
memref.alloc {air.symmetric}to a runtime call plus an LLVM memref descriptor build +unrealized_conversion_castback to the original memref type (so downstream uses keep working throughconvert-to-llvm). Eachmemref.deallocwhose operand traces back through the cast becomesmgpuSymmetricFree. No-op when noair.symmetricallocations are present.Why this matters
Every test in main today carries a 15-line
wrap_byteshelper that builds a memref descriptor by hand from a runtime!llvm.ptr— leaking the LLVM-struct ABI into the test IR. Phase 4 replaces this with the standardmemref.allocop + an attribute, putting the symmetric allocation at the same dialect level as the rest of the IR.This is the foundation for phases 5/6 (DMA + channel ops): they can operate on clean
memref<T>values from the user's perspective, without each test having to reconstruct memrefs from raw pointers.What's new
mlir/include/air/Conversion/AIRSymmetricAllocToMgpuPass.h/.cpp— pass implementation (~200 LOC)mlir/include/air/Conversion/GPUPasses.td—air-symmetric-alloc-to-mgpudefmlir/test/Conversion/AIRSymmetricAllocToMgpu/symmetric_alloc.mlir— FileCheck unit tests with// REQUIRES: gpu(the pass isn't registered in non-GPU builds):test/gpu/multi_gpu/air_alloc/cacheline.mlir— e2e test that wraps the producer/consumer cacheline reference inair.rankAND usesmemref.alloc {air.symmetric}for the symmetric data buffer. 1:1 wrap of the air_rank/cacheline reference, with only the symmetric alloc/free swapped for the new abstraction. After lowering through-air-rank-to-mgpu -air-symmetric-alloc-to-mgpu -air-translate-to-llvm, functionally equivalent to handwritten/cacheline.mlir.test/gpu/multi_gpu/air_alloc/Makefile— self-contained, follows the same pattern asair_rank/Makefile.Test plan
data[0]=100, flag=1) — output structurally identical toINPUT=cachelinein handwritten/ and air_rank/, only distinguished by the[mlir/alloc]log tag. 3/3 stability runs.git clang-format origin/mainapplied; "Python and C/C++ Check Format" check should passFollowup
The non-symmetric staging allocations (verify_buf, heap_bases device copy) still go through
mgpuMemAlloc + wrap_bytes. A future pass for non-symmetric device allocs would remove that helper too.🤖 Generated with Claude Code