Skip to content

[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU#1577

Merged
erwei-xilinx merged 14 commits into
Xilinx:mainfrom
erwei-xilinx:multigpu-phase2-handwritten-e2e
May 12, 2026
Merged

[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU#1577
erwei-xilinx merged 14 commits into
Xilinx:mainfrom
erwei-xilinx:multigpu-phase2-handwritten-e2e

Conversation

@erwei-xilinx
Copy link
Copy Markdown
Collaborator

@erwei-xilinx erwei-xilinx commented May 3, 2026

Stacked on #1576 (Phase 1).

Why

Lock down the IR shape that the upstream multi-GPU lowering passes (Phases 3–7) should produce, by hand-writing the post-lowering reference. Per @mawad-amd's review feedback, this PR has been redesigned from a host-orchestrated mgpuMemcpy test into a kernel-driven producer/consumer pair that exercises the actual AIR multi-GPU programming model.

What's added

Three commits:

1. [multi-gpu] Phase 2: hand-written e2e test + [multi-gpu] Phase 2: remove SHARE_GPU; fail-fast precondition — set up the test harness:

  • test/gpu/symmetric_heap_dma/run.sh — multi-process driver. Refuses to colocate ranks on a single GPU (would silently bypass XGMI / produce false PASSes); pins each process to its own GPU at the OS level via HIP_VISIBLE_DEVICES=$i.
  • Initial air_sym_handwritten.mlir (host-orchestrated mgpuMemcpy form).

2. [multi-gpu] Phase 2: air.translate op + air-translate-to-llvm lowering — new AIR primitive:

%peer = air.translate %src, %from_rank, %to_rank, %heap_bases
    : memref<NxT, A>, !llvm.ptr

Pure / no side effects, folds when from == to. Lowering pass air-translate-to-llvm expands to bases[to] - bases[from] byte-arithmetic + a freshly-built LLVM memref descriptor over the peer pointer. The expansion is identical at host scope and inside gpu.func. Naming aligned with IRIS's __translate. FileCheck unit tests for parser, folder, and lowering shape (1D, 2D-addrspace, gpu.func body, no-op).

3. [multi-gpu] Phase 2: kernel-driven producer/consumer rewrite — addresses @mawad-amd's review:

  • air_sym_handwritten.mlir is rewritten with gpu.module @sym_kernels containing two gpu.funcs:
    • producer (rank 0): each thread writes 42.0 directly into rank 1's data HBM via XGMI, using air.translate to materialize the peer memref. Lane 0 of each warp signals via llvm.atomicrmw xchg release on rank 1's per-warp flag.
    • consumer (rank 1): lane 0 of each warp spins on its flag with llvm.load atomic acquire, then all 64 lanes copy their data slot to a verify buffer.
  • Host driver dispatches the producer or consumer kernel by rank via scf.if + gpu.launch_func. Heap_bases is copied to a device-resident buffer before kernel launch (workaround for mgpuGetHeapBases() returning a host pointer; documented as a TODO for the runtime).
  • run.sh adds the GPU compilation chain (rocdl-attach-targetconvert-gpu-to-rocdlgpu-module-to-binarygpu-async-regiongpu-to-llvm).

How the four reviewer comments are addressed

Comment Resolution
Add an AIR primitive for address translation New air.translate op
Move address translation inside the kernel Producer/consumer kernels call air.translate inside gpu.func
Drop the local D2D staging buffer Producer writes directly to peer HBM via memref.store; consumer reads its own (now-valid) data
Use compute units, not the copy engine No mgpuMemcpy on the cross-rank data path; transfers are memref.store/memref.load issued by GPU threads

Validation

Run on rad-mi325x-1 (8× MI325X, gfx942, ROCm 7.1.1, fully-connected XGMI), each rank pinned to its own GPU:

WORLD_SIZE Result
2 ✅ rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
4 ✅ ALL 4 RANKS PASSED (rank 0/1 active, ranks 2–3 idle)
8 ✅ ALL 8 RANKS PASSED (rank 0/1 active, ranks 2–7 idle)

The verify[0]=42.0 value is end-to-end evidence that:

  1. Rank 0's GPU compute units (not the HIP copy engine) wrote 42.0 into rank 1's HBM via XGMI peer-mapped stores.
  2. The cross-GPU release/acquire atomic flag handshake worked — the consumer's spin-loop terminated only after the producer's release-store became visible.
  3. The data was correctly read back on rank 1's GPU and verified on the host.

This is the first time GPU compute units have been observed driving cross-rank data movement over XGMI in this stack.

Open follow-ups (not blocking)

  • mgpuGetHeapBases() should return a device-accessible pointer (hipMallocManaged or hipHostMalloc(...,Mapped)), eliminating the host-side copy workaround. Tracked in the test as a TODO comment.
  • Future: extend the test from a 2-rank producer/consumer to an N-way mesh exercising every (i,j) producer/consumer pair.

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase2-handwritten-e2e branch from a1ee757 to 013c9f3 Compare May 3, 2026 20:21
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 3, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase2-handwritten-e2e branch from 013c9f3 to 38b7e10 Compare May 3, 2026 20:25
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 3, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 3, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@mawad-amd mawad-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let comments on the handwritten example.

Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir Outdated
Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir Outdated
Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir Outdated
Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir Outdated
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 5, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-phase2-handwritten-e2e branch from 800c533 to e9b4638 Compare May 6, 2026 01:01
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a
`src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side
`mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:
  peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])

where `local_ptr` is extracted from the local-side memref via
`memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]`
gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only — no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They
can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h` — header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp` — implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}` — registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir` — FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir` — high-level e2e
  combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh` — adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:
- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi,
  byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):
- INPUT=handwritten — PASS (Phase 2 baseline)
- INPUT=rank — PASS (Phase 3)
- INPUT=alloc — PASS (Phase 4)
- INPUT=dma — PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)
  Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their
  own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579 —
true multi-GPU re-validation is needed before declaring multi-GPU
production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 6, 2026
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx and others added 10 commits May 6, 2026 16:37
Before writing any lowering pass, prove the symmetric-heap runtime works
end-to-end from MLIR by hand-writing the IR that future passes should
emit. This locks down the lowered shape, surfaces ABI gaps early, and
provides a reference oracle for diff-testing the upcoming
air-rank-to-mgpu / cross-rank-DMA / channel-on-GPU passes.

- `test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir` — hand-written
  reference IR. Each rank: init heap, alloc symmetric buffer, fill with
  (rank+1).0, barrier, read peer's buffer via `mgpuGetHeapBases()[peer]`,
  D2D into local copy, D2H readback, verify, print PASS/FAIL.
- `test/gpu/symmetric_heap_dma/run.sh` — driver that lowers the IR with
  `mlir-opt`, then forks N processes with RANK/WORLD_SIZE/LOCAL_RANK env
  vars set and runs `mlir-runner` in each. `SHARE_GPU=1` env makes all
  ranks share GPU 0 for testing on single-GPU hosts.

- ✅ Verified end-to-end on rad-mi300a-sh5-1 (1×MI300A, ROCm 7.1.1) with
  `SHARE_GPU=1` and 2 ranks: rank 0 sees `2.0` from rank 1, rank 1 sees
  `1.0` from rank 0.
- ⚠️ rad-mi300x-1 (8×MI300X, ROCm 6.4.0) hits a runtime-side crash inside
  libamdhip64.so during `establishPeerAccess()`. Same crash reproduces
  with the existing C++ baseline `test/gpu/test_symmetric_heap.cpp` —
  pre-existing runtime/HIP issue unrelated to this change.

No runtime ABI gaps for Phases 3-7. The full lowering pipeline can be
built using only existing exports: `mgpuSymmetricHeapInit/Destroy`,
`mgpuGetRank/WorldSize`, `mgpuSymmetricAlloc/Free`, `mgpuGetHeapBases`,
`mgpuBarrier`, `mgpuMemcpy` (D2D for cross-rank reads — direct kernel
read from peer-VA isn't supported on some chipsets, so D2D-to-local-then-
read is the required pattern).

`docs/MultiGPUPlan.md` updated with Phase 2 status section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the SHARE_GPU=1 escape hatch from run.sh. Colocating ranks on a
single GPU silently bypasses the symmetric-heap / XGMI path and reports
false-positive PASSes — exactly what the test exists to validate.
Replace with a precondition check that exits non-zero when fewer GPUs
are visible than ranks were requested. Validated on rad-mi325x-1
(8x MI325X) at WORLD_SIZE=2,4,8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce an AIR primitive for the symmetric-heap pointer rebase, in
preparation for the kernel-driven producer/consumer redesign per
@mawad-amd's review feedback on PR Xilinx#1577.

  %peer = air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr

Signature:
- $source: memref on $from_rank's symmetric heap
- $from_rank, $to_rank: index-typed rank ids
- $heap_bases: !llvm.ptr to the per-rank base table from mgpuGetHeapBases()
- result: same memref type, addressing $to_rank's slice of the same
  collective allocation

The op is Pure and folds when from_rank == to_rank (statically equal SSA
values or matching constant attrs). Naming follows IRIS's `__translate`.

Lowering pass `air-translate-to-llvm` expands each op to the
peer-VA arithmetic plus a freshly-built LLVM memref descriptor:

  byte_diff = ptrtoint(bases[to]) - ptrtoint(bases[from])
  peer_aligned_ptr = src_aligned_ptr + byte_diff   (i8 GEP)
  build descriptor { peer_ptr, peer_ptr, 0, sizes, strides }
  unrealized_conversion_cast back to result memref type

The expansion is pure arithmetic (arith + memref + llvm dialect), no
runtime calls — therefore valid both at host scope and inside `gpu.func`,
provided heap_bases is threaded as a kernel argument.

Tests:
- mlir/test/Dialect/AIR/air_translate.mlir: parser/printer + folder
- mlir/test/Conversion/AIRToROCDL/air_translate_to_llvm.mlir: lowering
  shape on 1D, 2D-addrspace, gpu.func body, and no-op cases

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @mawad-amd's review feedback on PR Xilinx#1577: replace the host-orchestrated
mgpuMemcpy reference test with a kernel-driven producer/consumer pair.
Cross-rank data movement is now performed by GPU compute units issuing
loads/stores directly into peer HBM over XGMI, not by the HIP copy engine.

Changes:

- air_sym_handwritten.mlir is rewritten as one gpu.module with two
  gpu.func kernels:
    * producer (rank 0): each thread writes 42.0 into rank 1's `data`
      via memref.store on a peer memref produced by air.translate.
      Lane 0 of each warp signals the per-warp flag with a release
      atomicrmw on rank 1's `flags`.
    * consumer (rank 1): lane 0 of each warp spins on its flag with an
      acquire atomic load until producer signals; gpu.barrier then
      releases all 64 lanes to read their data slot and copy it into
      a verify buffer. Host D2H reads verify_buf and checks 42.0.
  The host driver (func.func @main) initializes the symmetric heap,
  copies heap_bases into a device-resident buffer (workaround for the
  fact that mgpuGetHeapBases returns a host pointer), and dispatches
  the producer or consumer kernel based on rank.

- run.sh adds the GPU compilation chain (rocdl-attach-target,
  convert-gpu-to-rocdl, gpu-module-to-binary, gpu-async-region,
  gpu-to-llvm) before mlir-runner.

- run.sh sets HIP_VISIBLE_DEVICES=$i + LOCAL_RANK=0 per process so each
  rank sees only its own GPU as device 0. This eliminates the
  device-binding ambiguity between airgpu's hipSetDevice and MLIR's
  built-in gpu.launch_func handling that would otherwise cause rank N>0
  to fail with hipErrorInvalidDevice when launching kernels.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED (rank 0/1 active, ranks 2-3 idle)
  W=8: ALL 8 RANKS PASSED (rank 0/1 active, ranks 2-7 idle)

This is the first time GPU compute units (not the HIP copy engine)
have been observed driving cross-rank data movement over XGMI in this
stack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI fixes:

1. air_translate_to_llvm.mlir: add `// REQUIRES: gpu`. The pass
   `--air-translate-to-llvm` is only registered when AIR_ENABLE_GPU=ON
   (it lives in the gpu-only conversion-pass set). Without the gate
   the test fails on non-GPU builds with
     air-opt: Unknown command line argument '--air-translate-to-llvm'
   This matches the pattern already used by the sibling tests
   air_to_rocdl.mlir and air_gpu_outlining.mlir.

2. AIRTranslateToLLVMPass.{h,cpp}: clang-format-17 reflow. The header
   banner had a too-long filename which clang-format wrapped into a
   broken two-line banner ("//===- ...PASS.h ----*- C++\n//-*-===//"),
   and a few function calls in the .cpp wanted slightly different
   wrapping. Match the surrounding header-banner convention (80 cols
   wide) and accept the .cpp reflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address layer-violation feedback: air.translate's $heap_bases operand
was typed as !llvm.ptr, mixing LLVM dialect into a high-level AIR op
signature (the only AIR op that did so). The right MLIR-native type for
"array of pointer-width values in memory" is memref<?xindex>:
  - memref expresses the "array in memory" semantic
  - index is the pointer-width integer type already used elsewhere
    (e.g. memref.extract_aligned_pointer_as_index)
  - the dynamic ?-dim matches the variable world_size

Op signature changes from:
  air.translate %src, %from, %to, %bases : memref<NxT, A>, !llvm.ptr
to:
  air.translate %src, %from, %to, %bases : memref<NxT, A>, memref<?xindex>

Lowering pass now does memref.load + arith.subi/addi (steps 1-3 below)
instead of llvm.getelementptr + llvm.load + llvm.ptrtoint + arith.subi
+ llvm.getelementptr-i8. The LLVM dialect only appears in step 4
(materialize peer address as !llvm.ptr) and step 5 (build memref
descriptor) — both unavoidable since memref descriptors *are* LLVM
structs.

Host-side wiring: a small wrap_bases(!llvm.ptr, i64) -> memref<?xindex>
helper builds a memref descriptor over the device-resident heap_bases
buffer once. From there it's a memref everywhere — through
gpu.launch_func, into the kernel, into air.translate.

The air_LLVMPtr type-predicate def in AIR.td is removed; AIR.td no
longer imports any LLVM-dialect type machinery. The
"#include mlir/Dialect/LLVMIR/LLVMTypes.h" in AIRDialect.h is dropped
(no AIR op signature uses LLVM types anymore).

Validated on rad-mi325x-1 (8x MI325X, gfx942, ROCm 7.1.1):
  W=2: rank 1 (consumer): cross-rank kernel write PASS (verify[0]=42.0)
  W=4: ALL 4 RANKS PASSED
  W=8: ALL 8 RANKS PASSED

FileCheck unit tests updated for both the dialect (parser/printer/
folder) and the conversion (lowering shape).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Consumer kernel never calls air.translate (it reads its OWN local
  data, which the producer wrote remotely from the producer side). So
  the %bases : memref<?xindex> arg in @consumer was unused. Drop it
  from both the kernel signature and the host-side gpu.launch_func arg
  list.
- Both kernels declared %c1 = arith.constant 1 : index but neither
  actually used it. Drop.

Verified on rad-mi325x-1 W=2/4/8 — consumer still PASSes with
verify[0]=42.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three pieces of review feedback on the handwritten test:

1. Validation theater. The verify branch only checked element 0 and
   only ever printed PASS — msg_fail was declared but never referenced.
   A bug that signalled flag[0] but failed to write warps 1..3's slice
   would still pass. Now: scf.for over all 256 elements counts
   mismatches, prints msg_fail with the first one, and on any failure
   calls exit(1) so run.sh sees a non-zero process exit and reports
   "SOME RANKS FAILED" (matches the saved no-green-without-validation
   convention).

2. Atomic syncscope is the silent contract that makes XGMI propagation
   work. Producer's atomicrmw release and consumer's atomic load
   acquire emit no syncscope keyword, relying on the LLVM IR default
   = System scope (cross-device on AMDGPU). New FileCheck test
   sym_atomic_syncscope.mlir asserts both ops survive
   convert-gpu-to-rocdl with no syncscope qualifier present, with a
   block comment explaining the AMDGPU LangRef behavior and linking
   to the relevant section. The handwritten file's atomic comment
   blocks now point at this test.

3. Comments throughout were too verbose. Sweeping trim of the file
   header, kernel sections, helpers, and main: 411 -> 348 lines.
   Substance unchanged; comments now state the why (or the contract),
   not the what.

Validated on rad-mi325x-1 (8x MI325X, ROCm 7.1.1):
  W=2/4/8 -> ALL N RANKS PASSED
  consumer reports verify[0]=42.0 with the full 256-element check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous wrap_data / wrap_flags / wrap_bases helpers each
hand-built an LLVM memref descriptor struct
(!llvm.struct<(ptr, ptr, i64, array<1 x i64>, array<1 x i64>)>),
hardcoding the in-flight memref-to-LLVM ABI three times. An upstream
descriptor-layout change would silently break all three.

Collapse to a single wrap_bytes(ptr, size_bytes) -> memref<?xi8> that
builds the descriptor once. Use sites do memref.view to retype:

  %data_bytes  = wrap_bytes(%data_ptr,  %c1024_bytes)
  %data_m      = memref.view %data_bytes[%c0][] : memref<?xi8> to memref<256xf32>
  %flags_bytes = wrap_bytes(%flags_ptr, %c16_bytes)
  %flags_m     = memref.view %flags_bytes[%c0][] : memref<?xi8> to memref<4xi32>
  %bases_bytes = wrap_bytes(%bases_devptr, %bases_size)
  %bases       = memref.view %bases_bytes[%c0][%world_idx]
                     : memref<?xi8> to memref<?xindex>
  ; verify_buf wrapped same way at the consumer

The struct-type literal now appears in exactly one place. memref.view
is a standard upstream op with its own well-tested lowering.

Validated on rad-mi325x-1: W=2/4/8 all PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mic gap

The 5-op extract_aligned_pointer_as_index -> index_cast -> inttoptr ->
index_cast -> getelementptr sequence was duplicated in producer and
consumer kernels. Factor into one private func.func @flag_slot_ptr
inside gpu.module @sym_kernels (gpu.module accepts non-kernel funcs;
the GPU compilation pipeline compiles them alongside the kernels).

Add a TODO comment explaining the upstream memref dialect gap that
forces this descent: memref.atomic_rmw and memref.generic_atomic_rmw
lack ordering and syncscope, and there is no memref.atomic_load /
memref.atomic_store at all. We need release/acquire + system scope
for the cross-XGMI flag handshake, which today only the LLVM dialect
exposes. When upstream memref grows ordering+syncscope on its atomic
ops, this helper goes away in favor of memref.atomic_rmw / load.

Producer and consumer atomic blocks each shrink from 9 ops to 1 + 1
helper call. Net diff: ~16 lines saved across the file.

Validated on rad-mi325x-1: W=2/4/8 all PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx
Copy link
Copy Markdown
Collaborator Author

Thanks for your reviews, @mawad-amd. The handwritten IR example is updated accordingly, and verified working on a multi-gpu MI325X cluster. Could you please kindly review again and confirm whether this PR can land now?

erwei-xilinx and others added 4 commits May 12, 2026 15:26
Change the producer's release-atomicrmw and consumer's acquire-atomic-load
in air_sym_handwritten.mlir from default (no syncscope qualifier) to
`syncscope("")`. The empty string is LLVM IR's canonical spelling of the
System scope; this makes the cross-device intent self-documenting at the
MLIR level rather than relying on a default-omitted contract.

Behavior is unchanged: `syncscope("")` lowers to LLVM IR identical to the
unqualified form (LLVM textual IR omits the `syncscope(...)` token when
scope == System), survives `convert-gpu-to-rocdl`, and runs e2e on 2x
MI325X (verified on rad-mi325x-1).

Update sym_atomic_syncscope.mlir FileCheck contract test accordingly:
assert `syncscope("")` is preserved through the pipeline instead of
asserting absence of any syncscope keyword.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The barrier after lane-0's spin-wait on the per-warp flag is unnecessary
on AMDGPU:

  - Within-wave control sync: lanes execute in SIMT lockstep, so lanes
    1..63 of each wave cannot leave the scf.if before lane 0 does.
  - Memory visibility: L1 is wave-shared, so lane 0's `syncscope("")
    acquire` load makes the producer's writes visible to the whole wave
    without needing a workgroup-level fence.

Verified e2e on 2x MI325X (rad-mi325x-1), 5/5 runs PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ariants

The Phase 2 reference test now ships two parallel kernel-driven examples
of the symmetric-heap producer/consumer pattern, each demonstrating a
different cross-rank synchronization mechanism on the same outer harness:

  air_sym_handwritten_atomic.mlir
    LLVM atomicrmw release (producer) + atomic load acquire (consumer),
    both with syncscope("") = LLVM System scope = cross-device per
    AMDGPUUsage. Spec-defined ordering contract; the lowering invariant
    is pinned by sym_atomic_syncscope.mlir.

  air_sym_handwritten_cacheline.mlir
    Cache-line atomicity: producer writes 32 i32 (one 128-byte line) in
    a single vec store with the flag in-band at lane 31; consumer spins
    via gpu.shuffle of lane 31 until flag==1. No atomics, no fences.
    Trades the LLVM-spec contract for a microarchitectural one (relies
    on gfx940 vec-store cache-line atomicity and XGMI publishing peer
    cache lines whole on MI300).

run.sh now accepts INPUT=atomic|cacheline (default cacheline). The two
files share the mgpu* host harness, the wrap_bytes helper, and the
heap-init / verify_buf D2H readback / fail-loud exit pattern; only the
cross-rank handoff differs. Both verified on 2x MI325X (rad-mi325x-1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The handwritten cross-rank symmetric-heap test fundamentally needs a
producer + a consumer process; world_size=1 has no peer to talk to. The
old %is_solo branch printed a "skipping" message and exited 0, which is
worse than useless now that we have real multi-GPU CI: a misconfigured
single-process launch would be reported as a green test even though
nothing was exercised.

Replace the graceful skip with a fail-loud precondition at the launcher
boundary (run.sh) and remove the corresponding MLIR-level branch:

  - run.sh now refuses NUM_RANKS < 2 with a clear ERROR + exit 1,
    matching the existing pattern for NUM_GPUS < NUM_RANKS.
  - Both air_sym_handwritten_{atomic,cacheline}.mlir lose the %is_solo
    if/else wrapping; rank-dispatch (producer/consumer/idle) is now at
    the top level. The @msg_only1 global is removed.

Verified on 2x MI325X: INPUT=atomic PASS, INPUT=cacheline PASS,
`bash run.sh 1` refused at the launcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants