[multi-gpu] handwritten all-gather e2e (cache-line, SIMD across ranks) by erwei-xilinx · Pull Request #1611 · Xilinx/mlir-air

erwei-xilinx · 2026-05-12T17:50:10Z

Summary

Adds air_sym_handwritten_allgather.mlir, a sister to the producer/consumer cache-line e2e test landed in #1577. Where the producer/consumer file demonstrates 1-to-1 cross-rank handoff with asymmetric kernels, this file demonstrates a many-to-many SIMD pattern: every rank runs the SAME kernel, with the only per-rank variation being the rank-id arg.

What's different from the producer/consumer test

	producer/consumer (cacheline)	all-gather (this PR)
Rank symmetry	Asymmetric (rank 0 = producer, rank 1 = consumer kernels)	Symmetric — same kernel on every rank, only the rank-id differs
Cache-line publishes	1 (one producer → one consumer)	N (each rank publishes its slice to N peers)
Cache-line spins	1 (consumer waits for producer)	N (each rank waits for N peers' slices)
Output validation	rank 1's verify_buf only	every rank's verify_buf (all should be identical after gather)

Both use the same cache-line atomicity contract from #1577 (no atomics, no fences, payload + flag in one 128-byte line; consumer spins via gpu.shuffle of lane 31).

Layout

Each rank R has a 32-i32 input slice (one cache line):

payload = R*1000 + lane + 100 (lanes 0..30)
flag = 1 (lane 31)
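As a quick host-side illustration of this layout, the slice can be modeled in a few lines of Python. This is only a sketch: `make_slice` and the constant names are hypothetical and do not appear in the test file. It also confirms that 32 i32 values occupy exactly one 128-byte cache line.

```python
import struct

LANES = 32              # i32 elements per slice
FLAG_LANE = 31          # the last lane carries the flag
CACHE_LINE_BYTES = 128  # payload + flag must fit in one line

def make_slice(rank):
    """Model of rank's input slice: payload in lanes 0..30, flag = 1 in lane 31."""
    data = [rank * 1000 + lane + 100 for lane in range(LANES)]
    data[FLAG_LANE] = 1
    return data

s = make_slice(1)
# 32 little-endian i32 values pack into exactly one cache line.
assert len(struct.pack("<32i", *s)) == CACHE_LINE_BYTES
print(s[0], s[30], s[31])  # 1100 1130 1
```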

Output buffer: W cache lines = slot[0] | slot[1] | ... | slot[W-1].

Phase 1 (publish): for each peer P in 0..W-1, R writes its slice into the sub-buffer P_output[R*32 .. (R+1)*32]. Self-write goes through the same air.translate code path (with from == to → no-op pointer-wise).

Phase 2 (collect): for each peer P, R spins on its LOCAL output[P*32 .. (P+1)*32] until lane 31 (flag) shows up via gpu.shuffle, then copies the validated cache line to verify_buf for host check.
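The two phases can be sketched as a sequential, single-process reference model. This is an assumption-laden simplification: it does not model concurrency, `air.translate`, or `gpu.shuffle`, and the names `all_gather` and `make_slice` are hypothetical. It only shows which slots each rank writes and reads.

```python
LANES = 32
FLAG_LANE = 31

def make_slice(rank):
    """Payload R*1000 + lane + 100 in lanes 0..30, flag = 1 in lane 31."""
    data = [rank * 1000 + lane + 100 for lane in range(LANES)]
    data[FLAG_LANE] = 1
    return data

def all_gather(world):
    """Return verify[r] for every rank r after a modeled publish + collect."""
    # One zero-initialized output buffer per rank: W slots of 32 i32 each.
    outputs = [[0] * (world * LANES) for _ in range(world)]

    # Phase 1 (publish): rank r writes its slice into slot r of every peer p,
    # including itself.
    for r in range(world):
        src = make_slice(r)
        for p in range(world):
            outputs[p][r * LANES:(r + 1) * LANES] = src

    # Phase 2 (collect): rank r reads slot p of its LOCAL buffer. The real
    # kernel spins until the flag lane is set; in this sequential model every
    # publish has already happened, so one check stands in for the spin.
    verify = []
    for r in range(world):
        buf = []
        for p in range(world):
            line = outputs[r][p * LANES:(p + 1) * LANES]
            assert line[FLAG_LANE] == 1, "flag not yet visible"
            buf.extend(line)
        verify.append(buf)
    return verify

verify = all_gather(2)
print(verify[0][0], verify[0][LANES])  # 100 1100
```

In this model every rank ends up with an identical `verify` buffer, which is the property the e2e test checks on hardware.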

Constraint

Pinned to W=2 because air.translate today requires a static-shape source memref (see AIRTranslateToLLVMPass.cpp:122, added in #1577). The KERNEL itself is W-agnostic: its peer loop is runtime-bounded, and only the output memref type and a host-side world == 2 precondition are concretized. The file's header comment points at the lift path (extend buildPeerDescriptor to thread dynamic dim values).

Test plan

Verified on 2x MI325X (rad-mi325x-1): 5/5 PASS pre-rebase and 3/3 PASS post-rebase onto current main.
Each rank's verify_buf shows slot[0] = {data=100, flag=1} (rank-0 payload) and slot[1] = {data=1100, flag=1} (rank-1 payload), confirming that both ranks gathered both peers' slices via cross-XGMI cache-line transfer.
Per-element validation: every (slot, lane) pair matches the expected slot*1000 + lane + 100 (or 1 at lane 31); any mismatch fails loudly with exit(1).
bash run.sh 1 is refused at the launcher (behavior inherited from #1577, "[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU"); the MLIR-level world == 2 precondition calls exit(1) for any other world size.
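The per-element check described in the test plan can be sketched as follows. The real check lives in the test's host code; the function names here (`expected`, `find_mismatches`, `check_or_exit`) are hypothetical and only illustrate the rule "slot*1000 + lane + 100, or 1 at lane 31".

```python
import sys

LANES = 32
FLAG_LANE = 31

def expected(slot, lane):
    """Expected value at (slot, lane): flag at lane 31, payload elsewhere."""
    return 1 if lane == FLAG_LANE else slot * 1000 + lane + 100

def find_mismatches(verify_buf, world):
    """Return (slot, lane, got, want) for every element that differs."""
    bad = []
    for slot in range(world):
        for lane in range(LANES):
            got = verify_buf[slot * LANES + lane]
            want = expected(slot, lane)
            if got != want:
                bad.append((slot, lane, got, want))
    return bad

def check_or_exit(verify_buf, world):
    """Fail loudly: report every mismatch, then exit(1) if there was any."""
    bad = find_mismatches(verify_buf, world)
    for slot, lane, got, want in bad:
        print(f"MISMATCH slot={slot} lane={lane} got={got} want={want}",
              file=sys.stderr)
    if bad:
        sys.exit(1)
```

Separating `find_mismatches` from `check_or_exit` keeps the comparison testable without terminating the process.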

🤖 Generated with Claude Code

…ross ranks)

Adds air_sym_handwritten_allgather.mlir, a sister to the existing producer/consumer cacheline test. Where the producer/consumer file demonstrates 1-to-1 cross-rank handoff with asymmetric kernels, this file demonstrates a many-to-many SIMD pattern: every rank runs the SAME kernel, with the only per-rank variation being the rank-id arg.

Layout: each rank R has a 32-i32 input slice (one 128-byte cache line) with payload R*1000 + lane+100 (lanes 0..30) and flag=1 (lane 31). The output buffer is W cache lines = slot[0] | slot[1] | ... | slot[W-1].

Phase 1 (publish): for each peer P in 0..W-1, R writes its slice into the sub-buffer P_output[R*32 .. (R+1)*32]. Self-write goes through the same air.translate code path (with from == to → no-op pointer-wise).

Phase 2 (collect): for each peer P, R spins on its LOCAL output[P*32 .. (P+1)*32] until lane 31 (flag) shows up via gpu.shuffle, then copies the validated cache line to verify_buf for host check.

Cache-line atomicity contract is identical to the cacheline producer/consumer file; see that file's header for gfx940 / MI300 reasoning.

run.sh wires INPUT=allgather alongside the existing atomic + cacheline modes.

Verified on 2x MI325X (rad-mi325x-1), 5/5 PASS, with each rank correctly observing both slot[0] = 100 (rank 0's payload) and slot[1] = 1100 (rank 1's payload).

Constraint: pinned to W=2 because air.translate today requires a static-shape source memref (AIRTranslateToLLVMPass.cpp:122). The KERNEL itself is W-agnostic: its peer loop is runtime-bounded, and only the output memref type and a host-side W==2 precondition are concretized. Comment in the file's header points at the lift path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new symmetric-heap multi-GPU end-to-end MLIR test that demonstrates a cache-line-based, SIMD-across-ranks all-gather pattern (many-to-many) using air.translate inside a peer loop, and wires it into the existing run.sh harness via a new INPUT=allgather option.

Changes:

Add air_sym_handwritten_allgather.mlir, a new cache-line atomicity all-gather e2e test (pinned to WORLD_SIZE=2 due to current air.translate static-shape constraints).
Extend test/gpu/symmetric_heap_dma/run.sh to accept INPUT=allgather and select the new MLIR test.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
test/gpu/symmetric_heap_dma/run.sh	Adds `allgather` as a supported `INPUT` variant and updates help text/comments accordingly.
test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir	New MLIR e2e test implementing a symmetric-kernel cache-line all-gather across ranks using `air.translate` and per-slot spin/collect.


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

… group by IR level

Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (3-7) drop into their own subdir without touching anything else.

Directory rename: test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/

The old name was misleading: most of the tests in this directory don't do DMA in the conventional sense (the cacheline variant uses vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout:

    test/gpu/multi_gpu/
      README.md          # explains the layered structure
      handwritten/
        Makefile         # self-contained, no shared boilerplate
        cacheline.mlir   # was: air_sym_handwritten_cacheline.mlir
        atomic.mlir      # was: air_sym_handwritten_atomic.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.

Why per-subdir self-contained Makefile (no _common.mk):
- Each phase's PR touches only its own subdir; no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design.
- A shared include rots silently: one phase's edit can break another's pipeline. Duplicating ~30 lines of preconditions + multi-process driver is the cheaper failure mode.
- Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank adds air-rank-to-mgpu; air_alloc adds air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.

Verified on rad-mi325x-1 (2x MI325X):
- `make -C test/gpu/multi_gpu/handwritten` → ALL 2 RANKS PASSED
- `make -C test/gpu/multi_gpu/handwritten INPUT=atomic` → ALL 2 RANKS PASSED

In-flight phase 3 (Xilinx#1578) and allgather (Xilinx#1611) PRs will need to rebase onto this restructure to use the new paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New conversion pass that replaces each `air.rank` op by inlining its body in place, with rank IDs computed at runtime via `mgpuGetRank()` and delinearized into the rank's N-D iteration space. Replaces `air-rank-to-launch` for the GPU pipeline (which serialized ranks via scf.for, a placeholder for single-process execution). After this pass each process executes the entire `air.rank` body once, with its rank id resolved dynamically from the runtime. Heap lifecycle (`mgpuSymmetricHeapInit` / `mgpuSymmetricHeapDestroy`) is bracketed around the parent function once per function (not per rank).

- `mlir/include/air/Conversion/AIRRankToMgpuPass.h`: public header
- `mlir/include/air/Conversion/GPUPasses.td`: `air-rank-to-mgpu` def with `heap-size` option (default 256 MB); declares `xilinx::air::airDialect` as a dependent dialect (the pass synthesizes `air.wait_all` ops in the async case).
- `mlir/include/air/Conversion/GPUPassDetail.h`: `GEN_PASS_DEF_AIRRANKTOMGPU` + `air/Dialect/AIR/AIRDialect.h` include needed by the generated `getDependentDialects` body.
- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`: pass implementation. Materializes the heap-size constant via `IntegerAttr::get(i64Ty, APInt(64, heapSize))` so the full uint64_t value round-trips (a static_cast<int64_t> would silently wrap for values > INT64_MAX).
- `mlir/lib/Conversion/CMakeLists.txt`, `Passes.cpp`: registration.
- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir`: FileCheck unit tests (10 cases; see Test plan below). Marked `// REQUIRES: gpu` so it's skipped on non-GPU CI builds where the pass isn't registered.

Two e2e variants under test/gpu/symmetric_heap_dma/, each a 1:1 air.rank wrap of the corresponding handwritten reference from Xilinx#1577 / Xilinx#1611. After lowering, each is functionally equivalent to its handwritten sister, which is the property the pass is supposed to establish (write the multi-process world declaratively, get back the handwritten reference's runtime behavior).

- air_sym_with_rank_cacheline.mlir wraps air_sym_handwritten_cacheline
- air_sym_with_rank_allgather.mlir wraps air_sym_handwritten_allgather

run.sh's INPUT selector grew accordingly: INPUT=rank_cacheline | rank_allgather

FileCheck unit tests cover:
- 1D / 2D rank delinearization (remsi/divsi)
- Default + custom heap-size option
- Async form (token replacement via wait_all)
- Async dependencies (blocking wait_all insertion)
- Multiple `air.rank` ops per function (init/destroy emitted once)
- Multiple `func.return` paths (destroy before each)
- Kernel operand mapping (block args replaced by SSA operands)
- Idempotent extern decls across multiple functions
- No-op when no `air.rank` is present (pass was unconditionally inserting decls; caught by audit, fixed)

End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2):
- INPUT=rank_cacheline: cache-line message PASS (data[0]=100, flag=1)
- INPUT=rank_allgather: all-gather PASS (both peers' slots correct)

Output structurally identical to the handwritten variants, only distinguished by the `[mlir/rank]` log tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
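The remsi/divsi delinearization mentioned above can be illustrated with a small Python model. This is a sketch under a stated assumption: it uses row-major order (last dimension varies fastest), which the commit message does not specify, and `delinearize` is a hypothetical name rather than anything in the pass.

```python
def delinearize(linear_rank, shape):
    """Split a linear rank id into N-D indices for an iteration space `shape`.

    Assumes row-major order (last dimension fastest); the ordering used by
    the actual pass is not stated in the commit message.
    """
    ids = []
    for extent in reversed(shape):
        ids.append(linear_rank % extent)  # remainder, like arith.remsi
        linear_rank //= extent            # quotient, like arith.divsi
    return tuple(reversed(ids))

print(delinearize(1, (2,)))    # (1,)
print(delinearize(5, (2, 3)))  # (1, 2)
```

For non-negative rank ids Python's `%` and `//` agree with signed remainder and division, so the model matches the arithmetic for every valid rank.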


erwei-xilinx force-pushed the multigpu-handwritten-allgather branch from f11ddf0 to 1956f1b Compare May 12, 2026 18:02

erwei-xilinx marked this pull request as ready for review May 12, 2026 18:04

Copilot AI review requested due to automatic review settings May 12, 2026 18:04

Copilot started reviewing on behalf of erwei-xilinx May 12, 2026 18:06

Copilot AI reviewed May 12, 2026

Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir

Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir Outdated

Potential fix for pull request finding

4c47480

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

erwei-xilinx enabled auto-merge May 12, 2026 18:33

erwei-xilinx added this pull request to the merge queue May 12, 2026

Merged via the queue into Xilinx:main with commit df28429 May 12, 2026
27 checks passed

erwei-xilinx deleted the multigpu-handwritten-allgather branch May 12, 2026 19:17

erwei-xilinx mentioned this pull request May 12, 2026

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level #1613

Merged

5 tasks

erwei-xilinx merged 2 commits into Xilinx:main from erwei-xilinx:multigpu-handwritten-allgather


2 participants
