[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level by erwei-xilinx · Pull Request #1613 · Xilinx/mlir-air

erwei-xilinx · 2026-05-12T20:53:57Z

Summary

Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (4-7) drop into their own subdir without touching anything else.

Directory rename

test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/

The old name was misleading — most of the tests don't do DMA in the conventional sense (the cacheline / allgather / rank variants use vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout

test/gpu/multi_gpu/
  README.md                   # explains the layered structure
  handwritten/                # Phase 2 reference
    Makefile                  # self-contained
    cacheline.mlir            # was: air_sym_handwritten_cacheline.mlir
    atomic.mlir               # was: air_sym_handwritten_atomic.mlir
    allgather.mlir            # was: air_sym_handwritten_allgather.mlir
  air_rank/                   # Phase 3 (air.rank wrapping)
    Makefile                  # self-contained
    cacheline.mlir            # was: air_sym_with_rank_cacheline.mlir
    allgather.mlir            # was: air_sym_with_rank_allgather.mlir

Future phases will add sibling subdirs:

Subdir	Phase	Adds
`air_alloc/`	4	`memref.alloc {air.symmetric}` declares symmetric-heap allocations.
`air_dma/`	5	`air.dma_memcpy_nd {src_rank/dst_rank}`.
`air_channel/`	6	`air.channel {channel_type = "gpu_symmetric_heap"}`.

Per-phase invocation

make instead of bash run.sh. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.

make -C test/gpu/multi_gpu/handwritten                    # default
make -C test/gpu/multi_gpu/handwritten INPUT=atomic
make -C test/gpu/multi_gpu/handwritten INPUT=allgather
make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
make -C test/gpu/multi_gpu/air_rank                        # default
make -C test/gpu/multi_gpu/air_rank INPUT=allgather
make -C test/gpu/multi_gpu/handwritten clean

Why per-subdir self-contained Makefile (no `_common.mk` / no `_common.sh`)

Each phase's PR touches only its own subdir — no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design (the cascade hit 4 conflicts during just the phase 2 rebase, and phase 3's rebase needed another).
A shared include rots silently — one phase's edit can break another's pipeline without obvious blame attribution. Duplicating ~30 lines of preconditions + multi-process driver per Makefile is the cheaper failure mode.
Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank prepends air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.

Test plan

make -C test/gpu/multi_gpu/handwritten (cacheline, default) → ALL 2 RANKS PASSED
make ... handwritten INPUT=atomic → ALL 2 RANKS PASSED
make ... handwritten INPUT=allgather → ALL 2 RANKS PASSED
make ... air_rank (cacheline, default) → ALL 2 RANKS PASSED
make ... air_rank INPUT=allgather → ALL 2 RANKS PASSED
All 5 variants verified on rad-mi325x-1 (real 2x MI325X)
make clean removes the per-input TMPDIR

Knock-on work

In-flight phase 4-7 PRs (#1579-#1582) will need to rebase onto this restructure and add their own subdir + Makefile when they reach turn-to-land.

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR reorganizes the symmetric-heap multi-GPU end-to-end GPU tests to align with the multi-phase lowering stack, renaming the old symmetric_heap_dma directory to multi_gpu and grouping tests by IR abstraction level (handwritten reference vs air.rank-wrapped).

Changes:

Renamed/restructured the e2e test directory into test/gpu/multi_gpu/ with a layered layout and new top-level README.
Replaced the prior run.sh harness with per-subdir self-contained Makefile drivers (handwritten/, air_rank/).
Updated test file naming/headers to match the new directory structure.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
test/gpu/symmetric_heap_dma/run.sh	Removes the old shared bash launcher in favor of per-subdir Makefile drivers.
test/gpu/multi_gpu/README.md	Documents the new layered multi-GPU e2e test layout and how to run each phase.
test/gpu/multi_gpu/handwritten/Makefile	Adds a Make-based build/run harness for the handwritten reference tests.
test/gpu/multi_gpu/handwritten/cacheline.mlir	Renames/retitles the handwritten cacheline variant under the new layout.
test/gpu/multi_gpu/handwritten/atomic.mlir	Renames/retitles the handwritten atomic variant under the new layout.
test/gpu/multi_gpu/handwritten/allgather.mlir	Renames/retitles the handwritten allgather variant under the new layout.
test/gpu/multi_gpu/air_rank/Makefile	Adds a Make-based build/run harness for the `air.rank`-wrapped tests.
test/gpu/multi_gpu/air_rank/cacheline.mlir	Renames/retitles the `air.rank` cacheline wrapper under the new layout.
test/gpu/multi_gpu/air_rank/allgather.mlir	Renames/retitles the `air.rank` allgather wrapper under the new layout.

Comments suppressed due to low confidence (2)

test/gpu/multi_gpu/handwritten/Makefile:117

The launcher always sets HIP_VISIBLE_DEVICES=$i per rank. If the user sets HIP_VISIBLE_DEVICES to a subset list (e.g. "2,3") to constrain GPU selection, this overrides it and can accidentally run on GPUs 0/1 instead. Suggest parsing the original HIP_VISIBLE_DEVICES list (when set) and selecting the i-th entry for each rank; otherwise fall back to $i when it’s unset.

	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
	  ( set -o pipefail;                                                  \
	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
	    HIP_VISIBLE_DEVICES=$$i                                           \
	    mlir-runner --entry-point-result=void                             \

test/gpu/multi_gpu/air_rank/Makefile:112

The launcher always sets HIP_VISIBLE_DEVICES=$i per rank. If the user sets HIP_VISIBLE_DEVICES to a subset list (e.g. "2,3") to constrain GPU selection, this overrides it and can accidentally run on GPUs 0/1 instead. Suggest parsing the original HIP_VISIBLE_DEVICES list (when set) and selecting the i-th entry for each rank; otherwise fall back to $i when it’s unset.

	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
	  ( set -o pipefail;                                                  \
	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
	    HIP_VISIBLE_DEVICES=$$i                                           \
	    mlir-runner --entry-point-result=void                             \

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

… group by IR level Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (4-7) drop into their own subdir without touching anything else. Directory rename: test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/ The old name was misleading — most of the tests don't do DMA in the conventional sense (the cacheline / allgather / rank variants use vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA. New layout: test/gpu/multi_gpu/ README.md # explains the layered structure handwritten/ # Phase 2 reference Makefile # self-contained; INPUT=cacheline|atomic|allgather cacheline.mlir # was: air_sym_handwritten_cacheline.mlir atomic.mlir # was: air_sym_handwritten_atomic.mlir allgather.mlir # was: air_sym_handwritten_allgather.mlir air_rank/ # Phase 3 (air.rank wrapping) Makefile # self-contained; INPUT=cacheline|allgather cacheline.mlir # was: air_sym_with_rank_cacheline.mlir allgather.mlir # was: air_sym_with_rank_allgather.mlir Per-phase invocation: `make` instead of `bash run.sh`. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes. make -C test/gpu/multi_gpu/handwritten # default make -C test/gpu/multi_gpu/handwritten INPUT=atomic make -C test/gpu/multi_gpu/handwritten INPUT=allgather make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4 make -C test/gpu/multi_gpu/air_rank # default make -C test/gpu/multi_gpu/air_rank INPUT=allgather make -C test/gpu/multi_gpu/handwritten clean Why per-subdir self-contained Makefile (no _common.mk / no _common.sh): - Each phase's PR touches only its own subdir; no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design (the cascade hit 4 conflicts during just the phase 2 rebase). - A shared include rots silently — one phase's edit can break another's pipeline without obvious blame attribution. Duplicating ~30 lines of preconditions + multi-process driver per Makefile is the cheaper failure mode. - Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank prepends air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4. Verified on rad-mi325x-1 (2x MI325X), all 5 variants PASS: - make -C test/gpu/multi_gpu/handwritten INPUT={cacheline,atomic,allgather} - make -C test/gpu/multi_gpu/air_rank INPUT={cacheline,allgather} Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

erwei-xilinx force-pushed the multigpu-test-restructure branch 2 times, most recently from 9d331c5 to 79c0888 Compare May 12, 2026 22:40

erwei-xilinx marked this pull request as ready for review May 12, 2026 22:43

Copilot AI review requested due to automatic review settings May 12, 2026 22:43

Copilot started reviewing on behalf of erwei-xilinx May 12, 2026 22:44 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

Comment thread test/gpu/multi_gpu/README.md Outdated

Comment thread test/gpu/multi_gpu/handwritten/Makefile

Comment thread test/gpu/multi_gpu/air_rank/Makefile

erwei-xilinx force-pushed the multigpu-test-restructure branch from 79c0888 to 75a3ca8 Compare May 12, 2026 22:51

erwei-xilinx enabled auto-merge May 12, 2026 22:57

erwei-xilinx added this pull request to the merge queue May 12, 2026

Merged via the queue into Xilinx:main with commit e18b051 May 12, 2026
27 checks passed

erwei-xilinx deleted the multigpu-test-restructure branch May 12, 2026 23:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level#1613

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level#1613
erwei-xilinx merged 1 commit into
Xilinx:mainfrom
erwei-xilinx:multigpu-test-restructure

erwei-xilinx commented May 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

erwei-xilinx commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Directory rename

New layout

Per-phase invocation

Why per-subdir self-contained Makefile (no _common.mk / no _common.sh)

Test plan

Knock-on work

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

erwei-xilinx commented May 12, 2026 •

edited

Loading

Why per-subdir self-contained Makefile (no `_common.mk` / no `_common.sh`)