[multi-gpu] handwritten all-gather e2e (cache-line, SIMD across ranks) #1611

Merged
erwei-xilinx merged 2 commits into Xilinx:main from erwei-xilinx:multigpu-handwritten-allgather
May 12, 2026
@erwei-xilinx erwei-xilinx commented May 12, 2026

Summary

Adds air_sym_handwritten_allgather.mlir, a sister to the producer/consumer cache-line e2e test landed in #1577. Where the producer/consumer file demonstrates 1-to-1 cross-rank handoff with asymmetric kernels, this file demonstrates a many-to-many SIMD pattern: every rank runs the SAME kernel, with the only per-rank variation being the rank-id arg.

What's different from the producer/consumer test

| | producer/consumer (cacheline) | all-gather (this PR) |
|---|---|---|
| Rank symmetry | Asymmetric (rank 0 = producer, rank 1 = consumer kernels) | Symmetric: same kernel on every rank, only the rank-id differs |
| Cache-line publishes | 1 (one producer → one consumer) | N (each rank publishes its slice to N peers) |
| Cache-line spins | 1 (consumer waits for producer) | N (each rank waits for N peers' slices) |
| Output validation | rank 1's verify_buf only | every rank's verify_buf (all should be identical after gather) |

Both use the same cache-line atomicity contract from #1577 (no atomics, no fences, payload + flag in one 128-byte line; consumer spins via gpu.shuffle of lane 31).

Layout

Each rank R has a 32-i32 input slice (one cache line):

  • payload = R*1000 + lane + 100 (lanes 0..30)
  • flag = 1 (lane 31)

Output buffer: W cache lines = slot[0] | slot[1] | ... | slot[W-1].

Phase 1 (publish): for each peer P in 0..W-1, R writes its slice into the sub-buffer P_output[R*32 .. (R+1)*32]. Self-write goes through the same air.translate code path (with from == to → no-op pointer-wise).

Phase 2 (collect): for each peer P, R spins on its LOCAL output[P*32 .. (P+1)*32] until lane 31 (flag) shows up via gpu.shuffle, then copies the validated cache line to verify_buf for host check.
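
The layout and the two phases can be modeled on the host with a minimal Python sketch. This is not the MLIR kernel: it runs the ranks sequentially, so the spin in phase 2 collapses to a single flag check, and the names `make_slice` and `all_gather` are hypothetical helpers introduced only for illustration.

```python
LANES = 32       # 32 x i32 = one 128-byte cache line
FLAG_LANE = 31   # last lane carries the ready flag


def make_slice(rank):
    """Input slice for one rank: payload in lanes 0..30, flag in lane 31."""
    line = [rank * 1000 + lane + 100 for lane in range(LANES)]
    line[FLAG_LANE] = 1
    return line


def all_gather(world):
    """Sequential model of both phases; returns each rank's verify_buf."""
    # One zero-initialized output buffer per rank, `world` cache lines long.
    outputs = [[0] * (world * LANES) for _ in range(world)]

    # Phase 1 (publish): rank r writes its slice into slot r of every peer p,
    # including itself (the self-write takes the same path).
    for r in range(world):
        line = make_slice(r)
        for p in range(world):
            outputs[p][r * LANES:(r + 1) * LANES] = line

    # Phase 2 (collect): rank r reads slot p of its LOCAL output buffer.
    verify = []
    for r in range(world):
        buf = []
        for p in range(world):
            slot = outputs[r][p * LANES:(p + 1) * LANES]
            # The kernel spins on this flag; here every publish has finished.
            assert slot[FLAG_LANE] == 1, "slot not yet published"
            buf.extend(slot)
        verify.append(buf)
    return verify


bufs = all_gather(world=2)
print(bufs[0][0], bufs[0][32], bufs[0] == bufs[1])  # prints "100 1100 True"
```

The model accepts any `world`, which mirrors the point that the kernel's peer loop is runtime-bounded; the W=2 pin described under Constraint is not reproduced here.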

Constraint

Pinned to W=2 because air.translate today requires a static-shape source memref (see AIRTranslateToLLVMPass.cpp:122, added in #1577). The KERNEL itself is W-agnostic: its peer loop is runtime-bounded, and only the output memref type and a host-side world == 2 precondition are concretized. The file's header comment points at the lift path (extend buildPeerDescriptor to thread dynamic dim values).

Test plan

  • Verified on 2x MI325X (rad-mi325x-1), 5/5 PASS pre-rebase + 3/3 PASS post-rebase onto current main
  • Each rank's verify_buf shows: slot[0] = {data=100, flag=1} (rank-0 payload) and slot[1] = {data=1100, flag=1} (rank-1 payload) — confirming both ranks correctly gathered both peers' slices via cross-XGMI cache-line transfer
  • Per-element validation: every (slot, lane) pair matches expected slot*1000 + lane+100 (or 1 at lane 31); fail-loud exit(1) on any mismatch
  • bash run.sh 1 is refused at the launcher (behavior inherited from #1577); the MLIR-level world == 2 precondition calls exit(1) for any other world size
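
The per-element check in the test plan can be illustrated with a short Python sketch. It assumes a flat `verify_buf` of `world * 32` integers; `expected` and `find_mismatches` are hypothetical names, and the actual check lives in the MLIR test, where any mismatch leads to exit(1).

```python
LANES = 32
FLAG_LANE = 31


def expected(slot, lane):
    """Value the test plan expects at (slot, lane)."""
    return 1 if lane == FLAG_LANE else slot * 1000 + lane + 100


def find_mismatches(verify_buf, world):
    """Return (slot, lane, got, want) for every element that is wrong."""
    bad = []
    for slot in range(world):
        for lane in range(LANES):
            got = verify_buf[slot * LANES + lane]
            want = expected(slot, lane)
            if got != want:
                bad.append((slot, lane, got, want))
    return bad


good = [expected(s, l) for s in range(2) for l in range(LANES)]
corrupt = list(good)
corrupt[32 + 5] = 0  # pretend slot 1, lane 5 never arrived

print(find_mismatches(good, world=2))     # prints []
print(find_mismatches(corrupt, world=2))  # prints [(1, 5, 0, 1105)]
```

A non-empty result corresponds to the fail-loud path; an empty one corresponds to PASS for that rank.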

🤖 Generated with Claude Code

…ross ranks)

Adds air_sym_handwritten_allgather.mlir, a sister to the existing
producer/consumer cacheline test. Where the producer/consumer file
demonstrates 1-to-1 cross-rank handoff with asymmetric kernels, this
file demonstrates a many-to-many SIMD pattern: every rank runs the
SAME kernel, with the only per-rank variation being the rank-id arg.

Layout: each rank R has a 32-i32 input slice (one 128-byte cache line)
with payload R*1000 + lane+100 (lanes 0..30) and flag=1 (lane 31). The
output buffer is W cache lines = slot[0] | slot[1] | ... | slot[W-1].

Phase 1 (publish): for each peer P in 0..W-1, R writes its slice into
the sub-buffer P_output[R*32 .. (R+1)*32]. Self-write goes through the
same air.translate code path (with from == to → no-op pointer-wise).

Phase 2 (collect): for each peer P, R spins on its LOCAL output[P*32
.. (P+1)*32] until lane 31 (flag) shows up via gpu.shuffle, then
copies the validated cache line to verify_buf for host check.

Cache-line atomicity contract is identical to the cacheline producer/
consumer file — see that file's header for gfx940 / MI300 reasoning.

run.sh wires INPUT=allgather alongside the existing atomic + cacheline
modes. Verified on 2x MI325X (rad-mi325x-1), 5/5 PASS, with each rank
correctly observing both slot[0] = 100 (rank 0's payload) and slot[1]
= 1100 (rank 1's payload).

Constraint: pinned to W=2 because air.translate today requires a
static-shape source memref (AIRTranslateToLLVMPass.cpp:122). The
KERNEL itself is W-agnostic — its peer loop is runtime-bounded, only
the output memref type and a host-side W==2 precondition are
concretized. Comment in the file's header points at the lift path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-handwritten-allgather branch from f11ddf0 to 1956f1b Compare May 12, 2026 18:02
@erwei-xilinx erwei-xilinx marked this pull request as ready for review May 12, 2026 18:04
Copilot AI review requested due to automatic review settings May 12, 2026 18:04

Copilot AI left a comment

Pull request overview

Adds a new symmetric-heap multi-GPU end-to-end MLIR test that demonstrates a cache-line-based, SIMD-across-ranks all-gather pattern (many-to-many) using air.translate inside a peer loop, and wires it into the existing run.sh harness via a new INPUT=allgather option.

Changes:

  • Add air_sym_handwritten_allgather.mlir, a new cache-line atomicity all-gather e2e test (pinned to WORLD_SIZE=2 due to current air.translate static-shape constraints).
  • Extend test/gpu/symmetric_heap_dma/run.sh to accept INPUT=allgather and select the new MLIR test.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| test/gpu/symmetric_heap_dma/run.sh | Adds allgather as a supported INPUT variant and updates help text/comments accordingly. |
| test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir | New MLIR e2e test implementing a symmetric-kernel cache-line all-gather across ranks using air.translate and per-slot spin/collect. |

Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir
Comment thread test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@erwei-xilinx erwei-xilinx enabled auto-merge May 12, 2026 18:33
@erwei-xilinx erwei-xilinx added this pull request to the merge queue May 12, 2026
Merged via the queue into Xilinx:main with commit df28429 May 12, 2026
27 checks passed
@erwei-xilinx erwei-xilinx deleted the multigpu-handwritten-allgather branch May 12, 2026 19:17
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 12, 2026
… group by IR level

Reorganize the multi-GPU e2e tests to match how the lowering stack is
layered. Each subdirectory hosts tests at one IR-abstraction level; future
phases (3-7) drop into their own subdir without touching anything else.

Directory rename:
  test/gpu/symmetric_heap_dma/  →  test/gpu/multi_gpu/

The old name was misleading — most of the tests in this directory don't do
DMA in the conventional sense (the cacheline variant uses vec-store +
gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real
DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout:
  test/gpu/multi_gpu/
    README.md                    # explains the layered structure
    handwritten/
      Makefile                   # self-contained, no shared boilerplate
      cacheline.mlir             # was: air_sym_handwritten_cacheline.mlir
      atomic.mlir                # was: air_sym_handwritten_atomic.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default
behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking
avoids re-running the lowering pipeline when only NUM_RANKS changes.

Why per-subdir self-contained Makefile (no _common.mk):
  - Each phase's PR touches only its own subdir; no rebase conflicts on a
    shared file. Phases 2-7 had to re-resolve run.sh case-statement
    conflicts on every rebase under the old shared-script design.
  - A shared include rots silently — one phase's edit can break another's
    pipeline. Duplicating ~30 lines of preconditions + multi-process
    driver is the cheaper failure mode.
  - Pipelines genuinely differ per phase (handwritten goes through
    air-translate-to-llvm + GPU compile; air_rank adds air-rank-to-mgpu;
    air_alloc adds air-symmetric-alloc-to-mgpu; etc.). One unified case
    statement would already be hard to read by phase 4.

Verified on rad-mi325x-1 (2x MI325X):
  - `make -C test/gpu/multi_gpu/handwritten` → ALL 2 RANKS PASSED
  - `make -C test/gpu/multi_gpu/handwritten INPUT=atomic` → ALL 2 RANKS PASSED

In-flight phase 3 (Xilinx#1578) and allgather (Xilinx#1611) PRs will need to rebase
onto this restructure to use the new paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx added a commit to erwei-xilinx/mlir-air-erwei that referenced this pull request May 12, 2026
New conversion pass that replaces each `air.rank` op by inlining its body
in place, with rank IDs computed at runtime via `mgpuGetRank()` and
delinearized into the rank's N-D iteration space. Replaces
`air-rank-to-launch` for the GPU pipeline (which serialized ranks via
scf.for — a placeholder for single-process execution).
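
As a rough illustration of the delinearization step, the Python sketch below splits a flat rank id into N-D coordinates with remainder and division. The dimension order (last dimension varies fastest) is an assumption made for this example, not something the commit message states, and `delinearize` is a hypothetical helper rather than code from the pass.

```python
def delinearize(rank, shape):
    """Split a flat rank id into N-D coordinates, last dimension fastest."""
    coords = []
    for extent in reversed(shape):
        coords.append(rank % extent)  # matches remsi for non-negative operands
        rank //= extent               # matches divsi for non-negative operands
    return tuple(reversed(coords))


# A 2x3 rank space: ranks 0..5 map to (row, col) pairs.
print([delinearize(r, (2, 3)) for r in range(6)])
# prints [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Python's `%` and `//` differ from signed remainder and division for negative operands; rank ids are non-negative, so the two agree in this model.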

After this pass each process executes the entire `air.rank` body once,
with its rank id resolved dynamically from the runtime. Heap lifecycle
(`mgpuSymmetricHeapInit` / `mgpuSymmetricHeapDestroy`) is bracketed
around the parent function once per function (not per rank).

- `mlir/include/air/Conversion/AIRRankToMgpuPass.h` — public header
- `mlir/include/air/Conversion/GPUPasses.td` — `air-rank-to-mgpu` def
  with `heap-size` option (default 256 MB); declares
  `xilinx::air::airDialect` as a dependent dialect (the pass synthesizes
  `air.wait_all` ops in the async case).
- `mlir/include/air/Conversion/GPUPassDetail.h` — `GEN_PASS_DEF_AIRRANKTOMGPU`
  + `air/Dialect/AIR/AIRDialect.h` include needed by the generated
  `getDependentDialects` body.
- `mlir/lib/Conversion/AIRRankToMgpuPass.cpp` — pass implementation.
  Materializes the heap-size constant via `IntegerAttr::get(i64Ty,
  APInt(64, heapSize))` so the full uint64_t value round-trips (a
  static_cast<int64_t> would silently wrap for values > INT64_MAX).
- `mlir/lib/Conversion/CMakeLists.txt`, `Passes.cpp` — registration.
- `mlir/test/Conversion/AIRRankToMgpu/rank_to_mgpu.mlir` — FileCheck
  unit tests (10 cases; see Test plan below). Marked `// REQUIRES: gpu`
  so it's skipped on non-GPU CI builds where the pass isn't registered.
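
The wrap-around concern behind the heap-size bullet can be reproduced in a few lines of Python. This only models the arithmetic and is not the C++ in the pass: `as_signed_i64` is a hypothetical helper standing in for a `static_cast<int64_t>`, and reading the 256 MB default as 256 * 1024 * 1024 bytes is an assumption.

```python
import ctypes

INT64_MAX = 2**63 - 1
U64_MASK = 2**64 - 1


def as_signed_i64(value):
    """Reinterpret the low 64 bits of value as a signed 64-bit integer."""
    return ctypes.c_int64(value).value


default_heap = 256 * 1024 * 1024  # assumed byte count for the 256 MB default
huge_heap = INT64_MAX + 1         # smallest uint64_t value that does not fit

print(as_signed_i64(default_heap) == default_heap)  # prints True
print(as_signed_i64(huge_heap))                     # prints -9223372036854775808
print(as_signed_i64(huge_heap) & U64_MASK == huge_heap)  # prints True
```

The last line shows that the 64 bits themselves survive; only the signed reading changes. That is consistent with the commit's choice to build the attribute from a 64-bit APInt rather than from a signed cast.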

Two e2e variants under test/gpu/symmetric_heap_dma/, each a 1:1
air.rank wrap of the corresponding handwritten reference from Xilinx#1577 /
Xilinx#1611. After lowering, each is functionally equivalent to its
handwritten sister — which is the property the pass is supposed to
establish (write the multi-process world declaratively, get back the
handwritten reference's runtime behavior).

  air_sym_with_rank_cacheline.mlir   wraps air_sym_handwritten_cacheline
  air_sym_with_rank_allgather.mlir   wraps air_sym_handwritten_allgather

run.sh's INPUT selector grew accordingly:
  INPUT=rank_cacheline | rank_allgather

FileCheck unit tests cover:
- 1D / 2D rank delinearization (remsi/divsi)
- Default + custom heap-size option
- Async form (token replacement via wait_all)
- Async dependencies (blocking wait_all insertion)
- Multiple `air.rank` ops per function (init/destroy emitted once)
- Multiple `func.return` paths (destroy before each)
- Kernel operand mapping (block args replaced by SSA operands)
- Idempotent extern decls across multiple functions
- No-op when no `air.rank` is present (pass was unconditionally
  inserting decls — caught by audit, fixed)

End-to-end on rad-mi325x-1 (real 2x MI325X, NUM_RANKS=2):
- INPUT=rank_cacheline: cache-line message PASS (data[0]=100, flag=1)
- INPUT=rank_allgather: all-gather PASS (both peers' slots correct)
Output structurally identical to the handwritten variants, only
distinguished by the `[mlir/rank]` log tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-merge-queue Bot pushed a commit that referenced this pull request May 12, 2026