[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level#1613
Merged
Merged
Conversation
9d331c5 to
79c0888
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR reorganizes the symmetric-heap multi-GPU end-to-end GPU tests to align with the multi-phase lowering stack, renaming the old symmetric_heap_dma directory to multi_gpu and grouping tests by IR abstraction level (handwritten reference vs air.rank-wrapped).
Changes:
- Renamed/restructured the e2e test directory into
test/gpu/multi_gpu/with a layered layout and new top-level README. - Replaced the prior
run.shharness with per-subdir self-containedMakefiledrivers (handwritten/,air_rank/). - Updated test file naming/headers to match the new directory structure.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/gpu/symmetric_heap_dma/run.sh | Removes the old shared bash launcher in favor of per-subdir Makefile drivers. |
| test/gpu/multi_gpu/README.md | Documents the new layered multi-GPU e2e test layout and how to run each phase. |
| test/gpu/multi_gpu/handwritten/Makefile | Adds a Make-based build/run harness for the handwritten reference tests. |
| test/gpu/multi_gpu/handwritten/cacheline.mlir | Renames/retitles the handwritten cacheline variant under the new layout. |
| test/gpu/multi_gpu/handwritten/atomic.mlir | Renames/retitles the handwritten atomic variant under the new layout. |
| test/gpu/multi_gpu/handwritten/allgather.mlir | Renames/retitles the handwritten allgather variant under the new layout. |
| test/gpu/multi_gpu/air_rank/Makefile | Adds a Make-based build/run harness for the air.rank-wrapped tests. |
| test/gpu/multi_gpu/air_rank/cacheline.mlir | Renames/retitles the air.rank cacheline wrapper under the new layout. |
| test/gpu/multi_gpu/air_rank/allgather.mlir | Renames/retitles the air.rank allgather wrapper under the new layout. |
Comments suppressed due to low confidence (2)
test/gpu/multi_gpu/handwritten/Makefile:117
- The launcher always sets
HIP_VISIBLE_DEVICES=$iper rank. If the user sets HIP_VISIBLE_DEVICES to a subset list (e.g. "2,3") to constrain GPU selection, this overrides it and can accidentally run on GPUs 0/1 instead. Suggest parsing the original HIP_VISIBLE_DEVICES list (when set) and selecting the i-th entry for each rank; otherwise fall back to$iwhen it’s unset.
for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
( set -o pipefail; \
RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
HIP_VISIBLE_DEVICES=$$i \
mlir-runner --entry-point-result=void \
test/gpu/multi_gpu/air_rank/Makefile:112
- The launcher always sets
HIP_VISIBLE_DEVICES=$iper rank. If the user sets HIP_VISIBLE_DEVICES to a subset list (e.g. "2,3") to constrain GPU selection, this overrides it and can accidentally run on GPUs 0/1 instead. Suggest parsing the original HIP_VISIBLE_DEVICES list (when set) and selecting the i-th entry for each rank; otherwise fall back to$iwhen it’s unset.
for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
( set -o pipefail; \
RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
HIP_VISIBLE_DEVICES=$$i \
mlir-runner --entry-point-result=void \
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… group by IR level
Reorganize the multi-GPU e2e tests to match how the lowering stack is
layered. Each subdirectory hosts tests at one IR-abstraction level;
future phases (4-7) drop into their own subdir without touching anything
else.
Directory rename:
test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/
The old name was misleading — most of the tests don't do DMA in the
conventional sense (the cacheline / allgather / rank variants use
vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6
will add real DMA later). The common thread is the symmetric-heap
fabric, not DMA.
New layout:
test/gpu/multi_gpu/
README.md # explains the layered structure
handwritten/ # Phase 2 reference
Makefile # self-contained; INPUT=cacheline|atomic|allgather
cacheline.mlir # was: air_sym_handwritten_cacheline.mlir
atomic.mlir # was: air_sym_handwritten_atomic.mlir
allgather.mlir # was: air_sym_handwritten_allgather.mlir
air_rank/ # Phase 3 (air.rank wrapping)
Makefile # self-contained; INPUT=cacheline|allgather
cacheline.mlir # was: air_sym_with_rank_cacheline.mlir
allgather.mlir # was: air_sym_with_rank_allgather.mlir
Per-phase invocation: `make` instead of `bash run.sh`. Same default
behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking
avoids re-running the lowering pipeline when only NUM_RANKS changes.
make -C test/gpu/multi_gpu/handwritten # default
make -C test/gpu/multi_gpu/handwritten INPUT=atomic
make -C test/gpu/multi_gpu/handwritten INPUT=allgather
make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
make -C test/gpu/multi_gpu/air_rank # default
make -C test/gpu/multi_gpu/air_rank INPUT=allgather
make -C test/gpu/multi_gpu/handwritten clean
Why per-subdir self-contained Makefile (no _common.mk / no _common.sh):
- Each phase's PR touches only its own subdir; no rebase conflicts on
a shared file. Phases 2-7 had to re-resolve run.sh case-statement
conflicts on every rebase under the old shared-script design (the
cascade hit 4 conflicts during just the phase 2 rebase).
- A shared include rots silently — one phase's edit can break another's
pipeline without obvious blame attribution. Duplicating ~30 lines of
preconditions + multi-process driver per Makefile is the cheaper
failure mode.
- Pipelines genuinely differ per phase (handwritten goes through
air-translate-to-llvm + GPU compile; air_rank prepends
air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu;
etc.). One unified case statement would already be hard to read by
phase 4.
Verified on rad-mi325x-1 (2x MI325X), all 5 variants PASS:
- make -C test/gpu/multi_gpu/handwritten INPUT={cacheline,atomic,allgather}
- make -C test/gpu/multi_gpu/air_rank INPUT={cacheline,allgather}
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
79c0888 to
75a3ca8
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (4-7) drop into their own subdir without touching anything else.
Directory rename
test/gpu/symmetric_heap_dma/→test/gpu/multi_gpu/The old name was misleading — most of the tests don't do DMA in the conventional sense (the cacheline / allgather / rank variants use vec-store +
gpu.shuffle; the atomic variant usesatomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.New layout
Future phases will add sibling subdirs:
air_alloc/memref.alloc {air.symmetric}declares symmetric-heap allocations.air_dma/air.dma_memcpy_nd {src_rank/dst_rank}.air_channel/air.channel {channel_type = "gpu_symmetric_heap"}.Per-phase invocation
makeinstead ofbash run.sh. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.Why per-subdir self-contained Makefile (no
_common.mk/ no_common.sh)run.shcase-statement conflicts on every rebase under the old shared-script design (the cascade hit 4 conflicts during just the phase 2 rebase, and phase 3's rebase needed another).air-translate-to-llvm+ GPU compile; air_rank prependsair-rank-to-mgpu; air_alloc will addair-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.Test plan
make -C test/gpu/multi_gpu/handwritten(cacheline, default) → ALL 2 RANKS PASSEDmake ... handwritten INPUT=atomic→ ALL 2 RANKS PASSEDmake ... handwritten INPUT=allgather→ ALL 2 RANKS PASSEDmake ... air_rank(cacheline, default) → ALL 2 RANKS PASSEDmake ... air_rank INPUT=allgather→ ALL 2 RANKS PASSEDmake cleanremoves the per-input TMPDIRKnock-on work
In-flight phase 4-7 PRs (#1579-#1582) will need to rebase onto this restructure and add their own subdir + Makefile when they reach turn-to-land.
🤖 Generated with Claude Code