[multi-gpu] handwritten all-gather e2e (cache-line, SIMD across ranks)#1611
Merged
erwei-xilinx merged 2 commits on May 12, 2026
…ross ranks)

Adds air_sym_handwritten_allgather.mlir, a sister to the existing producer/consumer cacheline test. Where the producer/consumer file demonstrates 1-to-1 cross-rank handoff with asymmetric kernels, this file demonstrates a many-to-many SIMD pattern: every rank runs the SAME kernel, with the only per-rank variation being the rank-id arg.

Layout: each rank R has a 32-i32 input slice (one 128-byte cache line) with payload R*1000 + lane+100 (lanes 0..30) and flag=1 (lane 31). The output buffer is W cache lines = slot[0] | slot[1] | ... | slot[W-1].

Phase 1 (publish): for each peer P in 0..W-1, R writes its slice into the sub-buffer P_output[R*32 .. (R+1)*32]. Self-write goes through the same air.translate code path (with from == to → no-op pointer-wise).

Phase 2 (collect): for each peer P, R spins on its LOCAL output[P*32 .. (P+1)*32] until lane 31 (flag) shows up via gpu.shuffle, then copies the validated cache line to verify_buf for host check.

Cache-line atomicity contract is identical to the cacheline producer/consumer file; see that file's header for gfx940 / MI300 reasoning.

run.sh wires INPUT=allgather alongside the existing atomic + cacheline modes.

Verified on 2x MI325X (rad-mi325x-1), 5/5 PASS, with each rank correctly observing both slot[0] = 100 (rank 0's payload) and slot[1] = 1100 (rank 1's payload).

Constraint: pinned to W=2 because air.translate today requires a static-shape source memref (AIRTranslateToLLVMPass.cpp:122). The KERNEL itself is W-agnostic: its peer loop is runtime-bounded; only the output memref type and a host-side W==2 precondition are concretized. Comment in the file's header points at the lift path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
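The two phases in the commit message can be illustrated with a small host-side model. This is only a sketch in Python, not the MLIR kernel: it runs the ranks sequentially, assumes zero-initialised output buffers, and replaces the `gpu.shuffle` spin with a plain assertion on the flag lane. The names `make_slice` and `all_gather` are hypothetical and do not appear in the test file.

```python
LANES = 32       # one 128-byte cache line of i32 values
FLAG_LANE = 31   # the last lane carries the ready flag


def make_slice(rank):
    """Input slice for `rank`: payload in lanes 0..30, flag=1 in lane 31."""
    return [rank * 1000 + lane + 100 for lane in range(FLAG_LANE)] + [1]


def all_gather(world):
    """Sequential model of the publish/collect phases for `world` ranks."""
    # outputs[p] models rank p's output buffer: `world` slots of LANES values.
    # Zero-initialisation is an assumption of this sketch.
    outputs = [[0] * (world * LANES) for _ in range(world)]

    # Phase 1 (publish): rank r writes its slice into slot r of every peer p,
    # including itself.
    for r in range(world):
        src = make_slice(r)
        for p in range(world):
            outputs[p][r * LANES:(r + 1) * LANES] = src

    # Phase 2 (collect): rank r inspects each LOCAL slot and copies it out.
    verify = []
    for r in range(world):
        buf = []
        for p in range(world):
            line = outputs[r][p * LANES:(p + 1) * LANES]
            # The real kernel spins until the flag lane is set; in this
            # sequential model the flag is already there.
            assert line[FLAG_LANE] == 1
            buf.extend(line)
        verify.append(buf)
    return verify


if __name__ == "__main__":
    v = all_gather(2)
    print(v[0][0], v[0][LANES])  # prints "100 1100"
```

The first payload lane of slot[0] and slot[1] comes out as 100 and 1100, matching the values reported in the verification note.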
Pull request overview
Adds a new symmetric-heap multi-GPU end-to-end MLIR test that demonstrates a cache-line-based, SIMD-across-ranks all-gather pattern (many-to-many) using air.translate inside a peer loop, and wires it into the existing run.sh harness via a new INPUT=allgather option.
Changes:
- Add `air_sym_handwritten_allgather.mlir`, a new cache-line atomicity all-gather e2e test (pinned to WORLD_SIZE=2 due to current `air.translate` static-shape constraints).
- Extend `test/gpu/symmetric_heap_dma/run.sh` to accept `INPUT=allgather` and select the new MLIR test.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| test/gpu/symmetric_heap_dma/run.sh | Adds allgather as a supported INPUT variant and updates help text/comments accordingly. |
| test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir | New MLIR e2e test implementing a symmetric-kernel cache-line all-gather across ranks using air.translate and per-slot spin/collect. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request on May 12, 2026
… group by IR level
Reorganize the multi-GPU e2e tests to match how the lowering stack is
layered. Each subdirectory hosts tests at one IR-abstraction level; future
phases (3-7) drop into their own subdir without touching anything else.
Directory rename:
  test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/

The old name was misleading: most of the tests in this directory don't do
DMA in the conventional sense (the cacheline variant uses vec-store +
gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real
DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout:
  test/gpu/multi_gpu/
    README.md             # explains the layered structure
    handwritten/
      Makefile            # self-contained, no shared boilerplate
      cacheline.mlir      # was: air_sym_handwritten_cacheline.mlir
      atomic.mlir         # was: air_sym_handwritten_atomic.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default
behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking
avoids re-running the lowering pipeline when only NUM_RANKS changes.
Why per-subdir self-contained Makefile (no _common.mk):
- Each phase's PR touches only its own subdir; no rebase conflicts on a
  shared file. Phases 2-7 had to re-resolve run.sh case-statement
  conflicts on every rebase under the old shared-script design.
- A shared include rots silently: one phase's edit can break another's
  pipeline. Duplicating ~30 lines of preconditions + multi-process
  driver is the cheaper failure mode.
- Pipelines genuinely differ per phase (handwritten goes through
  air-translate-to-llvm + GPU compile; air_rank adds air-rank-to-mgpu;
  air_alloc adds air-symmetric-alloc-to-mgpu; etc.). One unified case
  statement would already be hard to read by phase 4.
Verified on rad-mi325x-1 (2x MI325X):
- `make -C test/gpu/multi_gpu/handwritten` → ALL 2 RANKS PASSED
- `make -C test/gpu/multi_gpu/handwritten INPUT=atomic` → ALL 2 RANKS PASSED
In-flight phase 3 (Xilinx#1578) and allgather (Xilinx#1611) PRs will need to rebase
onto this restructure to use the new paths.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request on May 12, 2026

New conversion pass that replaces each `air.rank` op by inlining its body in place, with rank IDs computed at runtime via `mgpuGetRank()` and delinearized into the rank's N-D iteration space. Replaces `air-rank-to-launch` for the GPU pipeline (which serialized ranks via scf.for, a placeholder for single-process execution).

After this pass each process executes the entire `air.rank` body once, with its rank id resolved dynamically from the runtime. Heap lifecycle (`mgpuSymmetricHeapInit` / `mgpuSymmetricHeapDestroy`) is bracketed around the parent function once per function (not per rank).

- `mlir/include/air/Conversion/AIRRankToMgpuPass.h`: public header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-rank-to-mgpu` def with `heap-size` option (default 256 MB); declares `xilinx::air::airDialect` as a dependent dialect (the pass synthesizes `air.wait_all` ops in the async case).
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRRANKTOMGPU` + `air/Dialect/AIR/AIRDialect.h` include needed by the generated `getDependentDialects` body.
- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`: pass implementation. Materializes the heap-size constant via `IntegerAttr::get(i64Ty, APInt(64, heapSize))` so the full uint64_t value round-trips (a static_cast<int64_t> would silently wrap for values > INT64_MAX).
- `mlir/lib/Conversion/CMakeLists.txt`, `Passes.cpp`: registration.
- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir`: FileCheck unit tests (10 cases; see Test plan below). Marked `// REQUIRES: gpu` so it's skipped on non-GPU CI builds where the pass isn't registered.

Two e2e variants under test/gpu/symmetric_heap_dma/, each a 1:1 air.rank wrap of the corresponding handwritten reference from Xilinx#1577 / Xilinx#1611. After lowering, each is functionally equivalent to its handwritten sister, which is the property the pass is supposed to establish (write the multi-process world declaratively, get back the handwritten reference's runtime behavior).

  air_sym_with_rank_cacheline.mlir wraps air_sym_handwritten_cacheline
  air_sym_with_rank_allgather.mlir wraps air_sym_handwritten_allgather

run.sh's INPUT selector grew accordingly:
  INPUT=rank_cacheline | rank_allgather

FileCheck unit tests cover:
- 1D / 2D rank delinearization (remsi/divsi)
- Default + custom heap-size option
- Async form (token replacement via wait_all)
- Async dependencies (blocking wait_all insertion)
- Multiple `air.rank` ops per function (init/destroy emitted once)
- Multiple `func.return` paths (destroy before each)
- Kernel operand mapping (block args replaced by SSA operands)
- Idempotent extern decls across multiple functions
- No-op when no `air.rank` is present (pass was unconditionally inserting decls; caught by audit, fixed)

End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2):
- INPUT=rank_cacheline: cache-line message PASS (data[0]=100, flag=1)
- INPUT=rank_allgather: all-gather PASS (both peers' slots correct)

Output structurally identical to the handwritten variants, only distinguished by the `[mlir/rank]` log tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
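The "1D / 2D rank delinearization (remsi/divsi)" step mentioned in the commit message can be sketched as repeated remainder and division. The Python below is an illustration only: the function name `delinearize` is hypothetical, and the choice that the last dimension varies fastest is an assumption, since the commit message does not state which ordering the pass uses.

```python
def delinearize(rank, sizes):
    """Split a flat, non-negative rank id into one index per dimension.

    Assumes the LAST dimension varies fastest (row-major order); the
    actual pass may order dimensions differently.
    """
    ids = []
    for size in reversed(sizes):
        ids.append(rank % size)  # plays the role of arith.remsi
        rank //= size            # plays the role of arith.divsi
    return tuple(reversed(ids))


def linearize(ids, sizes):
    """Inverse of delinearize under the same row-major assumption."""
    flat = 0
    for idx, size in zip(ids, sizes):
        flat = flat * size + idx
    return flat


if __name__ == "__main__":
    print(delinearize(1, (2,)))    # 1D space with 2 ranks
    print(delinearize(5, (2, 3)))  # 2D space of 2 x 3 ranks
```

For non-negative operands Python's `%` and `//` agree with signed remainder and division, so the round trip `linearize(delinearize(r, sizes), sizes) == r` holds for every rank in the space.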
github-merge-queue Bot pushed a commit that referenced this pull request on May 12, 2026
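The heap-size note in the `air-rank-to-mgpu` commit message above (materializing the constant through a 64-bit `APInt` rather than `static_cast<int64_t>`) concerns two's-complement reinterpretation. A rough Python model of that wrap, with the hypothetical helper `cast_to_int64` and the assumption that "256 MB" means 256 MiB:

```python
INT64_MAX = 2**63 - 1
UINT64_MASK = 2**64 - 1


def cast_to_int64(value):
    """Reinterpret the low 64 bits of `value` as a signed integer.

    Models what converting a uint64_t to int64_t does on two's-complement
    targets; this helper is illustrative and not part of the pass.
    """
    value &= UINT64_MASK
    return value - 2**64 if value > INT64_MAX else value


if __name__ == "__main__":
    default_heap = 256 * 1024 * 1024       # assuming MB means MiB
    print(cast_to_int64(default_heap))     # small sizes survive unchanged
    print(cast_to_int64(INT64_MAX + 1))    # first value that turns negative
```

Sizes up to `INT64_MAX` pass through unchanged; anything larger comes back negative, which is the silent wrap the commit message says the `APInt` route avoids.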
Summary

Adds `air_sym_handwritten_allgather.mlir`, a sister to the producer/consumer cache-line e2e test landed in #1577. Where the producer/consumer file demonstrates 1-to-1 cross-rank handoff with asymmetric kernels, this file demonstrates a many-to-many SIMD pattern: every rank runs the SAME kernel, with the only per-rank variation being the rank-id arg.

What's different from the producer/consumer test

Both use the same cache-line atomicity contract from #1577 (no atomics, no fences, payload + flag in one 128-byte line; consumer spins via `gpu.shuffle` of lane 31).

Layout
Each rank R has a 32-i32 input slice (one cache line):
- payload `R*1000 + lane+100` (lanes 0..30)
- flag `1` (lane 31)

Output buffer: W cache lines = `slot[0] | slot[1] | ... | slot[W-1]`.

Phase 1 (publish): for each peer P in 0..W-1, R writes its slice into the sub-buffer `P_output[R*32 .. (R+1)*32]`. Self-write goes through the same `air.translate` code path (with `from == to` → no-op pointer-wise).

Phase 2 (collect): for each peer P, R spins on its LOCAL `output[P*32 .. (P+1)*32]` until lane 31 (flag) shows up via `gpu.shuffle`, then copies the validated cache line to `verify_buf` for host check.

Constraint
Pinned to W=2 because `air.translate` today requires a static-shape source memref (see `AIRTranslateToLLVMPass.cpp:122`, added in #1577). The KERNEL itself is W-agnostic: its peer loop is runtime-bounded; only the output memref type and a host-side `world == 2` precondition are concretized. The file's header comment points at the lift path (extend `buildPeerDescriptor` to thread dynamic dim values).

Test plan
- `slot[0] = {data=100, flag=1}` (rank-0 payload) and `slot[1] = {data=1100, flag=1}` (rank-1 payload), confirming both ranks correctly gathered both peers' slices via cross-XGMI cache-line transfer
- `slot*1000 + lane+100` (or 1 at lane 31); fail-loud `exit(1)` on any mismatch
- `bash run.sh 1` refused at the launcher (inherited from [multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU #1577); MLIR-level `world == 2` precondition `exit(1)` for any other world size

🤖 Generated with Claude Code
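The host check described in the test plan (every lane equals `slot*1000 + lane+100`, or 1 at lane 31, with a fail-loud exit on mismatch) could look roughly like the following. This is a Python sketch rather than the test's actual host code; `expected`, `first_mismatch`, and `check_or_exit` are hypothetical names.

```python
import sys

LANES = 32       # i32 values per cache line
FLAG_LANE = 31   # lane holding the ready flag


def expected(slot, lane):
    """Value the host expects at (slot, lane) after the all-gather."""
    return 1 if lane == FLAG_LANE else slot * 1000 + lane + 100


def first_mismatch(verify_buf, world):
    """Return (slot, lane, got) for the first wrong lane, or None if clean."""
    for slot in range(world):
        for lane in range(LANES):
            got = verify_buf[slot * LANES + lane]
            if got != expected(slot, lane):
                return (slot, lane, got)
    return None


def check_or_exit(verify_buf, world):
    """Fail loudly with exit status 1 on any mismatch."""
    bad = first_mismatch(verify_buf, world)
    if bad is not None:
        slot, lane, got = bad
        print(f"MISMATCH slot={slot} lane={lane}: got {got}, "
              f"want {expected(slot, lane)}", file=sys.stderr)
        sys.exit(1)


if __name__ == "__main__":
    good = [expected(s, l) for s in range(2) for l in range(LANES)]
    check_or_exit(good, 2)
    print("PASS")
```

Separating `first_mismatch` from `check_or_exit` keeps the comparison testable without terminating the process.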