Skip to content

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level#1613

Merged
erwei-xilinx merged 1 commit into
Xilinx:mainfrom
erwei-xilinx:multigpu-test-restructure
May 12, 2026
Merged

[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level#1613
erwei-xilinx merged 1 commit into
Xilinx:mainfrom
erwei-xilinx:multigpu-test-restructure

Conversation

@erwei-xilinx
Copy link
Copy Markdown
Collaborator

@erwei-xilinx erwei-xilinx commented May 12, 2026

Summary

Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (4-7) drop into their own subdir without touching anything else.

Directory rename

test/gpu/symmetric_heap_dma/test/gpu/multi_gpu/

The old name was misleading — most of the tests don't do DMA in the conventional sense (the cacheline / allgather / rank variants use vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout

test/gpu/multi_gpu/
  README.md                   # explains the layered structure
  handwritten/                # Phase 2 reference
    Makefile                  # self-contained
    cacheline.mlir            # was: air_sym_handwritten_cacheline.mlir
    atomic.mlir               # was: air_sym_handwritten_atomic.mlir
    allgather.mlir            # was: air_sym_handwritten_allgather.mlir
  air_rank/                   # Phase 3 (air.rank wrapping)
    Makefile                  # self-contained
    cacheline.mlir            # was: air_sym_with_rank_cacheline.mlir
    allgather.mlir            # was: air_sym_with_rank_allgather.mlir

Future phases will add sibling subdirs:

Subdir Phase Adds
air_alloc/ 4 memref.alloc {air.symmetric} declares symmetric-heap allocations.
air_dma/ 5 air.dma_memcpy_nd {src_rank/dst_rank}.
air_channel/ 6 air.channel {channel_type = "gpu_symmetric_heap"}.

Per-phase invocation

make instead of bash run.sh. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.

make -C test/gpu/multi_gpu/handwritten                    # default
make -C test/gpu/multi_gpu/handwritten INPUT=atomic
make -C test/gpu/multi_gpu/handwritten INPUT=allgather
make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
make -C test/gpu/multi_gpu/air_rank                        # default
make -C test/gpu/multi_gpu/air_rank INPUT=allgather
make -C test/gpu/multi_gpu/handwritten clean

Why per-subdir self-contained Makefile (no _common.mk / no _common.sh)

  • Each phase's PR touches only its own subdir — no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design (the cascade hit 4 conflicts during just the phase 2 rebase, and phase 3's rebase needed another).
  • A shared include rots silently — one phase's edit can break another's pipeline without obvious blame attribution. Duplicating ~30 lines of preconditions + multi-process driver per Makefile is the cheaper failure mode.
  • Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank prepends air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.

Test plan

  • make -C test/gpu/multi_gpu/handwritten (cacheline, default) → ALL 2 RANKS PASSED
  • make ... handwritten INPUT=atomic → ALL 2 RANKS PASSED
  • make ... handwritten INPUT=allgather → ALL 2 RANKS PASSED
  • make ... air_rank (cacheline, default) → ALL 2 RANKS PASSED
  • make ... air_rank INPUT=allgather → ALL 2 RANKS PASSED
  • All 5 variants verified on rad-mi325x-1 (real 2x MI325X)
  • make clean removes the per-input TMPDIR

Knock-on work

In-flight phase 4-7 PRs (#1579-#1582) will need to rebase onto this restructure and add their own subdir + Makefile when they reach turn-to-land.

🤖 Generated with Claude Code

@erwei-xilinx erwei-xilinx force-pushed the multigpu-test-restructure branch 2 times, most recently from 9d331c5 to 79c0888 Compare May 12, 2026 22:40
@erwei-xilinx erwei-xilinx marked this pull request as ready for review May 12, 2026 22:43
Copilot AI review requested due to automatic review settings May 12, 2026 22:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reorganizes the symmetric-heap multi-GPU end-to-end GPU tests to align with the multi-phase lowering stack, renaming the old symmetric_heap_dma directory to multi_gpu and grouping tests by IR abstraction level (handwritten reference vs air.rank-wrapped).

Changes:

  • Renamed/restructured the e2e test directory into test/gpu/multi_gpu/ with a layered layout and new top-level README.
  • Replaced the prior run.sh harness with per-subdir self-contained Makefile drivers (handwritten/, air_rank/).
  • Updated test file naming/headers to match the new directory structure.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/gpu/symmetric_heap_dma/run.sh Removes the old shared bash launcher in favor of per-subdir Makefile drivers.
test/gpu/multi_gpu/README.md Documents the new layered multi-GPU e2e test layout and how to run each phase.
test/gpu/multi_gpu/handwritten/Makefile Adds a Make-based build/run harness for the handwritten reference tests.
test/gpu/multi_gpu/handwritten/cacheline.mlir Renames/retitles the handwritten cacheline variant under the new layout.
test/gpu/multi_gpu/handwritten/atomic.mlir Renames/retitles the handwritten atomic variant under the new layout.
test/gpu/multi_gpu/handwritten/allgather.mlir Renames/retitles the handwritten allgather variant under the new layout.
test/gpu/multi_gpu/air_rank/Makefile Adds a Make-based build/run harness for the air.rank-wrapped tests.
test/gpu/multi_gpu/air_rank/cacheline.mlir Renames/retitles the air.rank cacheline wrapper under the new layout.
test/gpu/multi_gpu/air_rank/allgather.mlir Renames/retitles the air.rank allgather wrapper under the new layout.
Comments suppressed due to low confidence (2)

test/gpu/multi_gpu/handwritten/Makefile:117

  • The launcher always sets HIP_VISIBLE_DEVICES=$i per rank. If the user sets HIP_VISIBLE_DEVICES to a subset list (e.g. "2,3") to constrain GPU selection, this overrides it and can accidentally run on GPUs 0/1 instead. Suggest parsing the original HIP_VISIBLE_DEVICES list (when set) and selecting the i-th entry for each rank; otherwise fall back to $i when it’s unset.
	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
	  ( set -o pipefail;                                                  \
	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
	    HIP_VISIBLE_DEVICES=$$i                                           \
	    mlir-runner --entry-point-result=void                             \

test/gpu/multi_gpu/air_rank/Makefile:112

  • The launcher always sets HIP_VISIBLE_DEVICES=$i per rank. If the user sets HIP_VISIBLE_DEVICES to a subset list (e.g. "2,3") to constrain GPU selection, this overrides it and can accidentally run on GPUs 0/1 instead. Suggest parsing the original HIP_VISIBLE_DEVICES list (when set) and selecting the i-th entry for each rank; otherwise fall back to $i when it’s unset.
	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
	  ( set -o pipefail;                                                  \
	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
	    HIP_VISIBLE_DEVICES=$$i                                           \
	    mlir-runner --entry-point-result=void                             \

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/gpu/multi_gpu/README.md Outdated
Comment thread test/gpu/multi_gpu/handwritten/Makefile
Comment thread test/gpu/multi_gpu/air_rank/Makefile
… group by IR level

Reorganize the multi-GPU e2e tests to match how the lowering stack is
layered. Each subdirectory hosts tests at one IR-abstraction level;
future phases (4-7) drop into their own subdir without touching anything
else.

Directory rename:
  test/gpu/symmetric_heap_dma/  →  test/gpu/multi_gpu/

The old name was misleading — most of the tests don't do DMA in the
conventional sense (the cacheline / allgather / rank variants use
vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6
will add real DMA later). The common thread is the symmetric-heap
fabric, not DMA.

New layout:
  test/gpu/multi_gpu/
    README.md                   # explains the layered structure
    handwritten/                # Phase 2 reference
      Makefile                  # self-contained; INPUT=cacheline|atomic|allgather
      cacheline.mlir            # was: air_sym_handwritten_cacheline.mlir
      atomic.mlir               # was: air_sym_handwritten_atomic.mlir
      allgather.mlir            # was: air_sym_handwritten_allgather.mlir
    air_rank/                   # Phase 3 (air.rank wrapping)
      Makefile                  # self-contained; INPUT=cacheline|allgather
      cacheline.mlir            # was: air_sym_with_rank_cacheline.mlir
      allgather.mlir            # was: air_sym_with_rank_allgather.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default
behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking
avoids re-running the lowering pipeline when only NUM_RANKS changes.

  make -C test/gpu/multi_gpu/handwritten                    # default
  make -C test/gpu/multi_gpu/handwritten INPUT=atomic
  make -C test/gpu/multi_gpu/handwritten INPUT=allgather
  make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
  make -C test/gpu/multi_gpu/air_rank                        # default
  make -C test/gpu/multi_gpu/air_rank INPUT=allgather
  make -C test/gpu/multi_gpu/handwritten clean

Why per-subdir self-contained Makefile (no _common.mk / no _common.sh):
  - Each phase's PR touches only its own subdir; no rebase conflicts on
    a shared file. Phases 2-7 had to re-resolve run.sh case-statement
    conflicts on every rebase under the old shared-script design (the
    cascade hit 4 conflicts during just the phase 2 rebase).
  - A shared include rots silently — one phase's edit can break another's
    pipeline without obvious blame attribution. Duplicating ~30 lines of
    preconditions + multi-process driver per Makefile is the cheaper
    failure mode.
  - Pipelines genuinely differ per phase (handwritten goes through
    air-translate-to-llvm + GPU compile; air_rank prepends
    air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu;
    etc.). One unified case statement would already be hard to read by
    phase 4.

Verified on rad-mi325x-1 (2x MI325X), all 5 variants PASS:
  - make -C test/gpu/multi_gpu/handwritten INPUT={cacheline,atomic,allgather}
  - make -C test/gpu/multi_gpu/air_rank INPUT={cacheline,allgather}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@erwei-xilinx erwei-xilinx force-pushed the multigpu-test-restructure branch from 79c0888 to 75a3ca8 Compare May 12, 2026 22:51
@erwei-xilinx erwei-xilinx enabled auto-merge May 12, 2026 22:57
@erwei-xilinx erwei-xilinx added this pull request to the merge queue May 12, 2026
Merged via the queue into Xilinx:main with commit e18b051 May 12, 2026
27 checks passed
@erwei-xilinx erwei-xilinx deleted the multigpu-test-restructure branch May 12, 2026 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants