[multi-gpu] Phase 3: air-rank-to-mgpu lowering pass by erwei-xilinx · Pull Request #1578 · Xilinx/mlir-air

erwei-xilinx · 2026-05-03T18:37:26Z

Summary

Phase 3 of the multi-GPU stack: a new air-rank-to-mgpu conversion pass that lowers high-level air.rank ops to the runtime dispatch model that Phase 2 (#1577) demonstrated by hand.

| Before pass | After pass |
| --- | --- |
| `air.rank %i = 0 to %N { body(%i) }` | Body inlined in place, with `%i` replaced by `delinearize(mgpuGetRank(), shape)` |
| Single-process placeholder via `air-rank-to-launch` (which serialized ranks with `scf.for`) | Multi-process: each process executes the body once, rank id resolved from runtime |
| Heap lifecycle handled by hand in test harness | Pass brackets the enclosing `func.func` with `mgpuSymmetricHeapInit(heap_size)` on entry and `mgpuSymmetricHeapDestroy()` before each `func.return` |
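The `delinearize(mgpuGetRank(), shape)` step can be illustrated with a small Python model of the arithmetic. This is only a sketch: the function name `delinearize` is borrowed from the table, and the dimension order (last dimension varies fastest) is an assumption, since the PR text does not say which order the pass uses. For non-negative operands, Python's `%` and `//` give the same results as `arith.remsi` and `arith.divsi`.

```python
from typing import List, Sequence


def delinearize(linear_rank: int, shape: Sequence[int]) -> List[int]:
    """Split a flat rank id into one index per dimension of `shape`.

    Assumption: the last dimension varies fastest (row-major order).
    """
    total = 1
    for extent in shape:
        total *= extent
    if not 0 <= linear_rank < total:
        raise ValueError(f"rank {linear_rank} is outside shape {tuple(shape)}")

    indices = [0] * len(shape)
    remaining = linear_rank
    for dim in reversed(range(len(shape))):
        indices[dim] = remaining % shape[dim]  # plays the role of arith.remsi
        remaining //= shape[dim]               # plays the role of arith.divsi
    return indices


if __name__ == "__main__":
    # Each process would evaluate this once with its own runtime rank id.
    for rank in range(6):
        print(rank, delinearize(rank, [2, 3]))
```

In the 1D case the result is just the rank id itself, which matches the two-rank e2e test where `%rid` stands in for `mgpuGetRank()`.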

What's new

- mlir/include/air/Conversion/AIRRankToMgpuPass.h / .cpp: pass implementation (~180 LOC)
- mlir/include/air/Conversion/GPUPasses.td: air-rank-to-mgpu def with heap-size option (default 256 MB)
- mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir: 10-case FileCheck unit test:
  - 1D / 2D rank delinearization (remsi/divsi)
  - Default + custom heap-size option
  - Async form (token replacement via wait_all)
  - Async dependencies (blocking wait_all insertion)
  - Multiple air.rank ops per function (init/destroy emitted once)
  - Multiple func.return paths (destroy before each)
  - Kernel operand mapping (block args replaced by SSA operands)
  - Idempotent extern decls across multiple functions
  - No-op when no air.rank is present (pass was unconditionally inserting decls; caught by audit, fixed)
- test/gpu/symmetric_heap_dma/air_sym_with_rank.mlir: high-level air.rank wrap of air_sym_handwritten_cacheline.mlir from [multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU #1577. Same kernels, same launch dispatch, same validation; only @main differs in being wrapped in air.rank (%rid) in (%rsize = %c2) { ... } and using %rid / %rsize where the handwritten test calls mgpuGetRank() / mgpuGetWorldSize().

  This 1:1 correspondence is the property the pass is supposed to establish: writing the multi-process world declaratively via air.rank and lowering through air-rank-to-mgpu should produce something functionally equivalent to the handwritten reference.
- test/gpu/symmetric_heap_dma/run.sh: INPUT=rank selector; uses the same full GPU compilation chain as INPUT=cacheline (the lowered output is structurally a superset).

Test plan

FileCheck unit tests pass (10 cases above)
E2E on real 2x MI325X (rad-mi325x-1, NUM_RANKS=2): both ranks PASS the cross-rank cache-line message (data[0]=100, flag=1) — output structurally identical to INPUT=cacheline, only distinguished by the [mlir/rank] log tag (vs [mlir]). 3/3 stability runs.
Equivalence demonstrated: the pass-output of this file lowered through the same backend pipeline produces the same runtime behavior as the handwritten cacheline reference. Cross-XGMI handoff verified by the same per-element validation (lanes 0..30 == lane+100, lane 31 == 1) with exit(1) on mismatch.
git clang-format origin/main applied; "Python and C/C++ Check Format" check passes
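The per-element validation mentioned in the test plan (lanes 0..30 == lane+100, lane 31 == 1) can be written out as a short Python check. This is a hypothetical stand-in for the harness logic, not code from the PR; the function name `validate_cache_line` and the 32-lane layout (31 payload lanes plus one flag lane) are assumptions read off the description.

```python
from typing import List, Sequence

NUM_LANES = 32   # assumed: lanes 0..30 carry data, lane 31 carries the flag
FLAG_LANE = 31


def validate_cache_line(lanes: Sequence[int]) -> List[int]:
    """Return the indices of mismatching lanes; an empty list means PASS."""
    if len(lanes) != NUM_LANES:
        raise ValueError(f"expected {NUM_LANES} lanes, got {len(lanes)}")
    mismatches = []
    for lane, value in enumerate(lanes):
        expected = 1 if lane == FLAG_LANE else lane + 100
        if value != expected:
            mismatches.append(lane)
    return mismatches


if __name__ == "__main__":
    received = [lane + 100 for lane in range(FLAG_LANE)] + [1]
    bad = validate_cache_line(received)
    # The real test exits with status 1 on mismatch; this sketch only reports.
    print("PASS" if not bad else f"FAIL at lanes {bad}")
```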

🤖 Generated with Claude Code

New conversion pass that lowers `air.dma_memcpy_nd` ops carrying a `src_rank` or `dst_rank` integer attribute (added in Phase 1) to host-side `mgpuMemcpy` calls with peer-VA addressing through `mgpuGetHeapBases()`.

The peer pointer is computed at runtime as:

`peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])`

where `local_ptr` is extracted from the local-side memref via `memref.extract_aligned_pointer_as_index` and `local_base = bases[my_rank]` gives this rank's symmetric heap base.

## Restrictions (this initial version)

- Both `src` and `dst` memrefs must be in `memory_space=0` (L3/global)
- The op must be at host scope (not inside a `gpu.launch` or `gpu.func`)
- "Entire memref" form only: no explicit `[offsets][sizes][strides]`
- Only one of `src_rank` / `dst_rank` may be set per op

These restrictions match the hand-written reference's Phase 2 pattern. They can be relaxed in follow-up work.

## Files

- `mlir/include/air/Conversion/AIRCrossRankDmaToMgpuPass.h`: header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-cross-rank-dma-to-mgpu` def
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRCROSSRANKDMATOMGPU`
- `mlir/lib/Conversion/AIRCrossRankDmaToMgpuPass.cpp`: implementation
- `mlir/lib/Conversion/{CMakeLists.txt,Passes.cpp}`: registration
- `mlir/test/Conversion/AIRCrossRankDmaToMgpu/cross_rank_dma.mlir`: FileCheck
- `test/gpu/symmetric_heap_dma/air_sym_with_dma.mlir`: high-level e2e combining Phase 1 attrs + Phase 3 + Phase 4 + Phase 5 lowering
- `test/gpu/symmetric_heap_dma/run.sh`: adds `INPUT=dma` selector

## Test plan

FileCheck unit tests cover:

- src_rank lowering shape (size, ptr extraction, bases, GEP, ptrtoint, subi, byte-stride GEP, mgpuMemcpy)
- dst_rank lowering (peer pointer becomes dst arg)
- 2D memref byte size
- f64 element type byte size
- Multiple cross-rank DMAs share extern decls
- Pass is a no-op for non-cross-rank DMAs

End-to-end on rad-mi300a-sh5-1 (SHARE_GPU=1, 2 ranks):

- INPUT=handwritten: PASS (Phase 2 baseline)
- INPUT=rank: PASS (Phase 3)
- INPUT=alloc: PASS (Phase 4)
- INPUT=dma: PASS (Phase 5: chains Phase 5 -> Phase 4 -> Phase 3)

Both ranks read rank 0's symmetric src_buf via cross-rank DMA into their own dst_buf; verification reads back 1.0.

Same SHARE_GPU=1 single-physical-GPU caveat as Xilinx#1577 / Xilinx#1578 / Xilinx#1579: true multi-GPU re-validation is needed before declaring multi-GPU production-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
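The peer-pointer formula and the byte-size cases from the FileCheck list can be modeled in a few lines of Python. This is a sketch of the arithmetic only, with addresses as plain integers: `peer_address`, `memref_bytes` and the `ELEMENT_BYTES` table are hypothetical names introduced here, and the example base addresses are made up.

```python
from typing import Sequence

# Hypothetical lookup of element sizes in bytes, keyed by MLIR type name.
ELEMENT_BYTES = {"i32": 4, "f32": 4, "f64": 8}


def peer_address(local_ptr: int, bases: Sequence[int],
                 my_rank: int, peer_rank: int) -> int:
    """peer_ptr = bases[peer_rank] + (local_ptr - bases[my_rank])."""
    offset = local_ptr - bases[my_rank]  # byte offset inside this rank's heap
    if offset < 0:
        raise ValueError("local_ptr lies below this rank's heap base")
    return bases[peer_rank] + offset


def memref_bytes(shape: Sequence[int], element_type: str) -> int:
    """Byte size of an 'entire memref' transfer: element count * element size."""
    count = 1
    for extent in shape:
        count *= extent
    return count * ELEMENT_BYTES[element_type]


if __name__ == "__main__":
    bases = [0x1000_0000, 0x9000_0000]  # made-up heap bases for ranks 0 and 1
    local = bases[1] + 0x40             # a buffer 64 bytes into rank 1's heap
    print(hex(peer_address(local, bases, my_rank=1, peer_rank=0)))
    print(memref_bytes([4, 8], "f32"), memref_bytes([16], "f64"))
```

The translation relies on the symmetric-heap property that a buffer sits at the same offset from the heap base on every rank, which is why only the two base addresses are needed.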


… group by IR level

Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (3-7) drop into their own subdir without touching anything else.

Directory rename: test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/

The old name was misleading: most of the tests in this directory don't do DMA in the conventional sense (the cacheline variant uses vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout:

    test/gpu/multi_gpu/
      README.md           # explains the layered structure
      handwritten/
        Makefile          # self-contained, no shared boilerplate
        cacheline.mlir    # was: air_sym_handwritten_cacheline.mlir
        atomic.mlir       # was: air_sym_handwritten_atomic.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.

Why per-subdir self-contained Makefile (no _common.mk):

- Each phase's PR touches only its own subdir; no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design.
- A shared include rots silently: one phase's edit can break another's pipeline. Duplicating ~30 lines of preconditions + multi-process driver is the cheaper failure mode.
- Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank adds air-rank-to-mgpu; air_alloc adds air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.

Verified on rad-mi325x-1 (2x MI325X):

- `make -C test/gpu/multi_gpu/handwritten` → ALL 2 RANKS PASSED
- `make -C test/gpu/multi_gpu/handwritten INPUT=atomic` → ALL 2 RANKS PASSED

In-flight phase 3 (Xilinx#1578) and allgather (Xilinx#1611) PRs will need to rebase onto this restructure to use the new paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds Phase 3 of the multi-GPU stack by introducing an air-rank-to-mgpu conversion pass that lowers air.rank into an mgpu runtime dispatch model, plus unit/E2E tests and harness updates to exercise the new lowering in the symmetric-heap DMA pipeline.

Changes:

Implement new air-rank-to-mgpu pass that inlines air.rank bodies, computes rank IDs from mgpuGetRank(), and brackets functions with mgpuSymmetricHeapInit/Destroy.
Register the new pass in the GPU conversion pass set and build system; add heap-size option (default 256MB).
Add FileCheck coverage and an end-to-end “rank-wrapped cacheline” MLIR test + run.sh selector (INPUT=rank).
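The heap-lifecycle bracketing described above (init once on entry, destroy before every return, nothing at all when no `air.rank` is present) can be pictured with a toy Python model. This is not the pass's C++ implementation: a function body is represented here as a list of basic blocks, each a list of op-name strings, and `bracket_heap_lifecycle` is a hypothetical helper name.

```python
from typing import List

HEAP_INIT = "call @mgpuSymmetricHeapInit"
HEAP_DESTROY = "call @mgpuSymmetricHeapDestroy"


def bracket_heap_lifecycle(blocks: List[List[str]]) -> List[List[str]]:
    """Return a copy of `blocks` with heap init/destroy calls inserted.

    `blocks[0]` is treated as the entry block of the enclosing func.func.
    """
    has_rank = any(op == "air.rank" for block in blocks for op in block)
    if not has_rank:
        # No air.rank in this function: leave it untouched.
        return [list(block) for block in blocks]

    result = []
    for index, block in enumerate(blocks):
        new_block = []
        if index == 0:
            new_block.append(HEAP_INIT)         # once, on function entry
        for op in block:
            if op == "func.return":
                new_block.append(HEAP_DESTROY)  # before each return path
            new_block.append(op)
        result.append(new_block)
    return result


if __name__ == "__main__":
    func = [["air.rank", "cf.cond_br"], ["air.rank", "func.return"], ["func.return"]]
    for block in bracket_heap_lifecycle(func):
        print(block)
```

Even with two `air.rank` ops and two return paths, the model emits a single init call and one destroy call per return, mirroring the "init/destroy emitted once" and "destroy before each" test cases.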

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| test/gpu/symmetric_heap_dma/run.sh | Adds `INPUT=rank` path that runs `air-rank-to-mgpu` before the existing GPU compilation chain. |
| test/gpu/symmetric_heap_dma/air_sym_with_rank.mlir | New E2E test input using `air.rank` wrapper equivalent to the handwritten cacheline test. |
| mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir | New FileCheck unit tests for rank lowering, heap-size option, async forms, multiple ranks/returns, and decl idempotence. |
| mlir/lib/Conversion/Passes.cpp | Wires in the new pass registration (GPU conversion passes). |
| mlir/lib/Conversion/CMakeLists.txt | Adds `AIRRankToMgpuPass.cpp` to the conversion library build. |
| mlir/lib/Conversion/AIRRankToMgpuPass.cpp | Implements the lowering pass logic (extern decls, heap init/destroy insertion, rank delinearization, region inlining, async token handling). |
| mlir/include/air/Conversion/GPUPasses.td | Declares the `air-rank-to-mgpu` pass and its `heap-size` option. |
| mlir/include/air/Conversion/GPUPassDetail.h | Adds generated pass base macro for `AIRRankToMgpu`. |
| mlir/include/air/Conversion/AIRRankToMgpuPass.h | New public pass factory header. |


New conversion pass that replaces each `air.rank` op by inlining its body in place, with rank IDs computed at runtime via `mgpuGetRank()` and delinearized into the rank's N-D iteration space.

Replaces `air-rank-to-launch` for the GPU pipeline (which serialized ranks via scf.for, a placeholder for single-process execution). After this pass each process executes the entire `air.rank` body once, with its rank id resolved dynamically from the runtime.

Heap lifecycle (`mgpuSymmetricHeapInit` / `mgpuSymmetricHeapDestroy`) is bracketed around the parent function once per function (not per rank).

- `mlir/include/air/Conversion/AIRRankToMgpuPass.h`: public header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-rank-to-mgpu` def with `heap-size` option (default 256 MB); declares `xilinx::air::airDialect` as a dependent dialect (the pass synthesizes `air.wait_all` ops in the async case).
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRRANKTOMGPU` + `air/Dialect/AIR/AIRDialect.h` include needed by the generated `getDependentDialects` body.
- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`: pass implementation. Materializes the heap-size constant via `IntegerAttr::get(i64Ty, APInt(64, heapSize))` so the full uint64_t value round-trips (a static_cast<int64_t> would silently wrap for values > INT64_MAX).
- `mlir/lib/Conversion/CMakeLists.txt`, `Passes.cpp`: registration.
- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir`: FileCheck unit tests (10 cases; see Test plan below). Marked `// REQUIRES: gpu` so it's skipped on non-GPU CI builds where the pass isn't registered.

Two e2e variants under test/gpu/symmetric_heap_dma/, each a 1:1 air.rank wrap of the corresponding handwritten reference from Xilinx#1577 / Xilinx#1611. After lowering, each is functionally equivalent to its handwritten sister, which is the property the pass is supposed to establish (write the multi-process world declaratively, get back the handwritten reference's runtime behavior).

- air_sym_with_rank_cacheline.mlir wraps air_sym_handwritten_cacheline
- air_sym_with_rank_allgather.mlir wraps air_sym_handwritten_allgather

run.sh's INPUT selector grew accordingly: INPUT=rank_cacheline | rank_allgather

FileCheck unit tests cover:

- 1D / 2D rank delinearization (remsi/divsi)
- Default + custom heap-size option
- Async form (token replacement via wait_all)
- Async dependencies (blocking wait_all insertion)
- Multiple `air.rank` ops per function (init/destroy emitted once)
- Multiple `func.return` paths (destroy before each)
- Kernel operand mapping (block args replaced by SSA operands)
- Idempotent extern decls across multiple functions
- No-op when no `air.rank` is present (pass was unconditionally inserting decls; caught by audit, fixed)

End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2):

- INPUT=rank_cacheline: cache-line message PASS (data[0]=100, flag=1)
- INPUT=rank_allgather: all-gather PASS (both peers' slots correct)

Output structurally identical to the handwritten variants, only distinguished by the `[mlir/rank]` log tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
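The note about the heap-size constant (a `static_cast<int64_t>` wrapping for values above INT64_MAX) can be illustrated numerically. The Python below only models 64-bit two's-complement reinterpretation; it does not call LLVM or MLIR, the helper names are hypothetical, and reading "256 MB" as 256 * 2^20 bytes is an assumption.

```python
MASK_64 = (1 << 64) - 1
INT64_MAX = (1 << 63) - 1

# Assumption: the "256 MB" default means 256 * 2**20 bytes.
DEFAULT_HEAP_SIZE = 256 * 1024 * 1024


def as_signed_64(value: int) -> int:
    """Reinterpret the low 64 bits of `value` as a two's-complement int64."""
    value &= MASK_64
    return value - (1 << 64) if value > INT64_MAX else value


def as_unsigned_64(value: int) -> int:
    """Reinterpret a (possibly negative) int64 as its uint64 bit pattern."""
    return value & MASK_64


if __name__ == "__main__":
    for heap_size in (DEFAULT_HEAP_SIZE, INT64_MAX, INT64_MAX + 1, MASK_64):
        signed = as_signed_64(heap_size)
        print(f"{heap_size:>20} -> signed view {signed:>21} "
              f"-> unsigned again {as_unsigned_64(signed)}")
```

Sizes up to INT64_MAX look identical in both views, which is why the default never exposes the issue; only values above that threshold read as negative once treated as signed.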

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from dc5af89 to fa3a16c Compare May 3, 2026 20:22

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from fa3a16c to cf17742 Compare May 3, 2026 20:25

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from cf17742 to a9c3e28 Compare May 3, 2026 20:27

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from a9c3e28 to f81efbb Compare May 5, 2026 18:36

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from f81efbb to c042b3f Compare May 6, 2026 00:29

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from c042b3f to d4d376d Compare May 6, 2026 01:01

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from d4d376d to 32b24f4 Compare May 6, 2026 04:24

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 32b24f4 to 2b97153 Compare May 6, 2026 04:39

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 2b97153 to 5102ff5 Compare May 6, 2026 04:48

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 5102ff5 to 1e084fa Compare May 6, 2026 05:17

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 04207b3 to 4514661 Compare May 6, 2026 19:02

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 4514661 to 6875ed3 Compare May 6, 2026 20:15

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch 2 times, most recently from 597f224 to 641a09f Compare May 12, 2026 15:38

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 641a09f to 7b40bf7 Compare May 12, 2026 16:19

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch from 7b40bf7 to 866a74c Compare May 12, 2026 17:20

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch 2 times, most recently from 5a123df to 9932566 Compare May 12, 2026 18:39

erwei-xilinx mentioned this pull request May 12, 2026

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level #1613

Merged

5 tasks

erwei-xilinx force-pushed the multigpu-phase3-rank-to-mgpu-pass branch 2 times, most recently from 897bc64 to dc864fe Compare May 12, 2026 21:21

erwei-xilinx marked this pull request as ready for review May 12, 2026 21:44

erwei-xilinx requested a review from fifield as a code owner May 12, 2026 21:44

Copilot AI review requested due to automatic review settings May 12, 2026 21:44

erwei-xilinx enabled auto-merge May 12, 2026 21:45

erwei-xilinx disabled auto-merge May 12, 2026 21:45

Copilot AI reviewed May 12, 2026

View reviewed changes

Comment thread mlir/lib/Conversion/AIRRankToMgpuPass.cpp

Comment thread mlir/include/air/Conversion/GPUPasses.td Outdated

Comment thread mlir/include/air/Conversion/AIRRankToMgpuPass.h Outdated

erwei-xilinx mentioned this pull request May 19, 2026

[multi-gpu] Phases 4 + 6 + AIR-on-GPU realigned to PE=wavefront model #1618

Merged

5 tasks

erwei-xilinx merged 1 commit into Xilinx:main from erwei-xilinx:multigpu-phase3-rank-to-mgpu-pass

2 participants
