
Commit e18b051

erwei-xilinx and claude authored
[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level (#1613)
Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (4-7) drop into their own subdir without touching anything else.

Directory rename:

    test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/

The old name was misleading — most of the tests don't do DMA in the conventional sense (the cacheline / allgather / rank variants use vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout:

    test/gpu/multi_gpu/
      README.md          # explains the layered structure
      handwritten/       # Phase 2 reference
        Makefile         # self-contained; INPUT=cacheline|atomic|allgather
        cacheline.mlir   # was: air_sym_handwritten_cacheline.mlir
        atomic.mlir      # was: air_sym_handwritten_atomic.mlir
        allgather.mlir   # was: air_sym_handwritten_allgather.mlir
      air_rank/          # Phase 3 (air.rank wrapping)
        Makefile         # self-contained; INPUT=cacheline|allgather
        cacheline.mlir   # was: air_sym_with_rank_cacheline.mlir
        allgather.mlir   # was: air_sym_with_rank_allgather.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.

    make -C test/gpu/multi_gpu/handwritten                   # default
    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
    make -C test/gpu/multi_gpu/handwritten INPUT=allgather
    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
    make -C test/gpu/multi_gpu/air_rank                      # default
    make -C test/gpu/multi_gpu/air_rank INPUT=allgather
    make -C test/gpu/multi_gpu/handwritten clean

Why per-subdir self-contained Makefile (no _common.mk / no _common.sh):

- Each phase's PR touches only its own subdir; no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design (the cascade hit 4 conflicts during just the phase 2 rebase).
- A shared include rots silently — one phase's edit can break another's pipeline without obvious blame attribution. Duplicating ~30 lines of preconditions + multi-process driver per Makefile is the cheaper failure mode.
- Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank prepends air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.

Verified on rad-mi325x-1 (2x MI325X), all 5 variants PASS:

- make -C test/gpu/multi_gpu/handwritten INPUT={cacheline,atomic,allgather}
- make -C test/gpu/multi_gpu/air_rank INPUT={cacheline,allgather}

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 958bb96 commit e18b051
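The five verified variants listed in the commit message can be swept with a small loop. The sketch below is only an illustration: `variants_for` and `run_all_variants` are hypothetical helper names that do not exist in the repository, and the variant lists are copied from the commit message. By default it only prints the `make` commands; setting `DRY_RUN=0` would execute them, which assumes the environment described in the README is already set up.

```shell
# Hypothetical helper: INPUT values accepted by each subdirectory,
# as listed in the commit message.
variants_for() {
  case "$1" in
    handwritten) echo "cacheline atomic allgather" ;;
    air_rank)    echo "cacheline allgather" ;;
    *)           echo "unknown subdir: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print one make invocation per variant, or run it
# when DRY_RUN=0. Stops at the first failing variant.
run_all_variants() {
  local root="${1:-test/gpu/multi_gpu}" subdir input
  for subdir in handwritten air_rank; do
    for input in $(variants_for "$subdir"); do
      if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "make -C $root/$subdir INPUT=$input"
      else
        make -C "$root/$subdir" INPUT="$input" || return 1
      fi
    done
  done
}
```

Calling `run_all_variants` from the repository root lists the commands without running anything; `DRY_RUN=0 run_all_variants` runs them in sequence.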

9 files changed

Lines changed: 404 additions & 159 deletions


test/gpu/multi_gpu/README.md

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+# `multi_gpu` — symmetric-heap multi-GPU end-to-end tests
+
+End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches
+N processes — one per physical GPU — that coordinate via the symmetric heap
+(XGMI peer-mapped VMem buffers).
+
+The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants
+with FileCheck. The tests in this directory are the e2e counterparts: they
+build through the full lowering chain and run on real hardware.
+
+## Layout
+
+Tests are organized by IR-abstraction level. Each subdirectory holds tests
+written at one level. Lower levels (closer to LLVM dialect) are the lowering
+targets that higher levels reduce to.
+
+| Subdir | Phase | Abstraction added |
+|---|---|---|
+| `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. |
+| `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. |
+| `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. |
+| `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. |
+| `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. |
+
+A higher-level test should produce — after running its phase's lowering pass
+— IR functionally equivalent to one of the `handwritten/` references.
+
+## Running
+
+Each subdirectory has its own self-contained `Makefile`. There is no shared
+include or sourced helper — duplication is intentional, so that each phase's
+PR touches only its own subdir and there's no cross-phase coupling that can
+rot.
+
+Default invocation forks 2 processes:
+
+    make -C test/gpu/multi_gpu/handwritten
+
+Inside a subdirectory, common knobs:
+
+    make -C test/gpu/multi_gpu/handwritten INPUT=cacheline # default
+    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
+    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
+    make -C test/gpu/multi_gpu/handwritten clean
+
+Each `Makefile` documents its own `INPUT` choices in the header comment.
+
+## Preconditions
+
+Each `Makefile`'s `check-preconditions` target refuses to launch if either:
+
+- `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs
+  a peer; a single-process launch has nothing to talk to.
+- Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would
+  silently bypass XGMI/peer-VA (transparently falling back to local memory)
+  and report false-positive PASSes.
+
+## Required environment
+
+The Makefiles invoke `air-opt`, `mlir-opt`, and `mlir-runner` via PATH, plus dlopen `libairgpu.so` and the `libmlir_*.so` runtime libraries. There are three ways to satisfy this:
+
+1. **Source `utils/env_setup_gpu.sh`** (recommended) — sets `PATH`, `LD_LIBRARY_PATH`, `MLIR_AIR_INSTALL_DIR`, and `LLVM_INSTALL_DIR` in one go.
+2. **Pass install dirs on the make command line**:
+   ```
+   make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=…
+   ```
+   (PATH must still contain the binaries — these vars only affect `--shared-libs` paths.)
+3. **Have the binaries in `PATH` already** — the Makefile derives `LLVM_INSTALL_DIR` / `MLIR_AIR_INSTALL_DIR` from `dirname $(dirname $(command -v mlir-opt))` etc.
+
+The `check-preconditions` target validates that the resolved `LLVM_LIB_DIR` and `AIRGPU_LIB` paths actually exist before launching, so a missing env shows a clear error rather than a `dlopen` failure deep inside `mlir-runner`.
+
+## Why duplicated boilerplate per subdir
+
+A shared `_common.mk` or `_common.sh` would let one phase's edit silently
+break another phase's tests. The boilerplate is small (~30 lines of
+preconditions + driver per Makefile) and stable — phases differ in their
+compile pipeline, not in the multi-process driver. Duplication is the
+cheaper failure mode.
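The README's third option (deriving the install directories from binaries already on `PATH`) and the path validation it describes can be sketched in plain shell. This is a minimal illustration under stated assumptions, not the Makefile's actual code: `derive_install_dir` and `check_install_paths` are hypothetical function names, and the error messages only mirror the ones described above.

```shell
# Hypothetical helper: derive an install prefix from a tool on PATH,
# i.e. <prefix>/bin/<tool> -> <prefix>. Returns non-zero if not found.
derive_install_dir() {
  local tool_path
  tool_path=$(command -v "$1") || return 1
  dirname "$(dirname "$tool_path")"
}

# Hypothetical helper: fail early, with a readable message, if the
# resolved library locations do not exist.
check_install_paths() {
  local llvm_lib_dir="$1" airgpu_lib="$2"
  if [ ! -d "$llvm_lib_dir" ]; then
    echo "ERROR: LLVM_LIB_DIR=$llvm_lib_dir does not exist." >&2
    return 1
  fi
  if [ ! -f "$airgpu_lib" ]; then
    echo "ERROR: AIRGPU_LIB=$airgpu_lib does not exist." >&2
    return 1
  fi
}
```

A typical use would be `LLVM_INSTALL_DIR=$(derive_install_dir mlir-opt)` followed by `check_install_paths "$LLVM_INSTALL_DIR/lib" "$MLIR_AIR_INSTALL_DIR/lib/libairgpu.so"`. Checking up front trades a few lines of shell for an error that names the missing path directly.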
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+# Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests.
+#
+# These tests express the multi-process world declaratively via
+# `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu
+# pass (Phase 3) replaces the air.rank op with body-inlined IR that
+# resolves %rid from mgpuGetRank() at runtime and brackets the
+# enclosing function with mgpuSymmetricHeapInit / Destroy.
+#
+# Each variant in this dir is a 1:1 wrap of the corresponding test in
+# ../handwritten/. After lowering through air-rank-to-mgpu the IR is
+# functionally equivalent to the handwritten reference.
+#
+# Variants (selected via INPUT):
+#   cacheline   Wrap of ../handwritten/cacheline.mlir (producer/consumer,
+#               1-to-1, cache-line atomicity).
+#   allgather   Wrap of ../handwritten/allgather.mlir (many-to-many SIMD,
+#               cache-line atomicity).
+#
+# Usage:
+#   make                  # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=allgather
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (auto-detected when sourced via env_setup_gpu.sh):
+#   MLIR_AIR_INSTALL_DIR — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/<level>/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT ?= cacheline
+NUM_RANKS ?= 2
+TMPDIR ?= /tmp/air_multi_gpu_air_rank
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+# Derive install dirs from PATH if not explicitly provided. Matches the
+# original run.sh fallback (`dirname $(dirname $(which mlir-opt))`).
+LLVM_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
+MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
+LLVM_LIB_DIR ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+ifeq ($(filter $(INPUT),cacheline allgather),)
+$(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather')
+endif
+
+SRC_MLIR := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir
+LOWERED := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: lower air.rank to mgpu* runtime + expand air.translate.
+$(POST_RANK): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))"
+	air-opt $< -air-rank-to-mgpu --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same
+# pipeline as ../handwritten/Makefile (the lowered output is structurally
+# a superset of the corresponding handwritten test).
+$(LOWERED): $(POST_RANK)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	  --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	  -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU), or if the
+# install paths are missing (mlir-runner would fail at dlopen with a
+# more cryptic message).
+check-preconditions:
+	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then \
+	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2; \
+	  echo " Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR." \
+	    >&2; \
+	  exit 1; \
+	fi
+	@if [ ! -f "$(AIRGPU_LIB)" ]; then \
+	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2; \
+	  echo " Source utils/env_setup_gpu.sh or set" \
+	    "MLIR_AIR_INSTALL_DIR." >&2; \
+	  exit 1; \
+	fi
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +" \
+	    "consumer)." >&2; \
+	  exit 1; \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .); \
+	else \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]' \
+	    /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l); \
+	fi; \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI" \
+	    "traffic; found $$NUM_GPUS." >&2; \
+	  echo " This test refuses to colocate ranks on a single GPU" \
+	    "because it would silently" >&2; \
+	  echo " bypass the symmetric-heap path and report false PASSes." \
+	    >&2; \
+	  exit 1; \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
+# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
+# nested call into libmlir_rocm_runtime.so) only ever sees one device,
+# so it can't accidentally launch on the wrong one. Every rank still
+# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}"; \
+	PIDS=(); \
+	PASS=1; \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
+	  ( set -o pipefail; \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
+	    HIP_VISIBLE_DEVICES=$$i \
+	    mlir-runner --entry-point-result=void \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so" \
+	      --shared-libs="$(AIRGPU_LIB)" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so" \
+	      $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") & \
+	  PIDS+=($$!); \
+	done; \
+	for pid in "$${PIDS[@]}"; do \
+	  if ! wait "$$pid"; then PASS=0; fi; \
+	done; \
+	if [ $$PASS -eq 1 ]; then \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ==="; \
+	else \
+	  echo "=== SOME RANKS FAILED ==="; \
+	  exit 1; \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
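The GPU-count check and the fork/wait driver in the recipes above are easier to follow outside Make's `$$` escaping. The following is a standalone bash sketch of the same pattern, not a copy of the Makefile: the function names are hypothetical, the per-rank command is passed in as arguments instead of being hard-wired to `mlir-runner`, and the `HIP_VISIBLE_DEVICES` pinning is left out so the sketch runs on any machine. The comment about CPU nodes reporting `simd_count 0` reflects my understanding of the KFD topology files and should be treated as an assumption.

```shell
# Count devices listed in a HIP_VISIBLE_DEVICES-style string:
# one non-empty comma-separated entry per device.
count_visible_gpus() {
  echo "$1" | tr ',' '\n' | grep -c . || true
}

# Count KFD topology nodes under a directory whose properties file has a
# non-zero simd_count (assumption: CPU-only nodes report simd_count 0).
count_kfd_gpus() {
  grep -l '^simd_count [1-9]' "$1"/*/properties 2>/dev/null | wc -l | tr -d ' '
}

# Fork one background job per rank, prefix each output line with its
# rank, and succeed only if every job exits 0. The command to run is
# given after the rank count and sees RANK / WORLD_SIZE in its environment.
run_ranks() {
  local num_ranks="$1" pids="" pass=1 i=0 pid
  shift
  if [ "$num_ranks" -lt 2 ]; then
    echo "ERROR: NUM_RANKS=$num_ranks; requires >= 2 ranks." >&2
    return 1
  fi
  while [ "$i" -lt "$num_ranks" ]; do
    # pipefail makes the subshell fail when the rank fails, not just sed.
    ( set -o pipefail
      RANK=$i WORLD_SIZE=$num_ranks "$@" 2>&1 | sed "s/^/[rank $i] /" ) &
    pids="$pids $!"
    i=$((i + 1))
  done
  for pid in $pids; do
    wait "$pid" || pass=0
  done
  [ "$pass" -eq 1 ]
}
```

The `set -o pipefail` inside each subshell is the detail that matters most: without it, the exit status of the pipeline would be that of `sed`, and a crashing rank would be reported as a pass.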

test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir renamed to test/gpu/multi_gpu/air_rank/allgather.mlir

Lines changed: 6 additions & 6 deletions
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_allgather.mlir - air.rank wrap of allgather -----===//
+//===- air_rank/allgather.mlir - air.rank wrap of handwritten allgather --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_allgather.mlir.
+// High-level version of handwritten/allgather.mlir.
 //
 // This file is a 1:1 wrap of the SIMD-across-ranks all-gather test inside
 // an `air.rank` op:
@@ -20,9 +20,9 @@
 // - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_allgather.mlir (same kernel, same launch dispatch,
-// same validation). Sister file: air_sym_with_rank_cacheline.mlir does
-// the analogous wrap of the producer/consumer cacheline test.
+// handwritten/allgather.mlir (same kernel, same launch dispatch, same
+// validation). Sister file: air_rank/cacheline.mlir does the analogous
+// wrap of the producer/consumer cacheline test.
 //
 // The kernel and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the handwritten allgather. Only @main differs
@@ -33,7 +33,7 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten allgather.
 //
-// Launcher: run.sh with INPUT=rank_allgather forks 2 processes.
+// Launcher: `make INPUT=allgather` from this subdir forks 2 processes.
 //
 //===-----------------------------------------------------------------------===//

test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir renamed to test/gpu/multi_gpu/air_rank/cacheline.mlir

Lines changed: 8 additions & 8 deletions
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_cacheline.mlir - air.rank wrap of cacheline -----===//
+//===- air_rank/cacheline.mlir - air.rank wrap of handwritten cacheline --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_cacheline.mlir.
+// High-level version of handwritten/cacheline.mlir.
 //
 // This file is a 1:1 wrap of the cacheline producer/consumer test inside
 // an `air.rank` op:
@@ -20,10 +20,10 @@
 // - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_cacheline.mlir (same kernels, same launch
-// dispatch, same validation). This file's job is to demonstrate that
-// the user can write the multi-process world declaratively via air.rank
-// and have the pass produce the handwritten reference.
+// handwritten/cacheline.mlir (same kernels, same launch dispatch, same
+// validation). This file's job is to demonstrate that the user can
+// write the multi-process world declaratively via air.rank and have
+// the pass produce the handwritten reference.
 //
 // The kernels and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the cacheline test. Only @main differs in
@@ -34,8 +34,8 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten cacheline test.
 //
-// Launcher: run.sh with INPUT=rank forks 2 processes. The
-// air-rank-to-mgpu pass converts air.rank to runtime dispatch.
+// Launcher: `make INPUT=cacheline` from this subdir forks 2 processes.
+// The air-rank-to-mgpu pass converts air.rank to runtime dispatch.
 //
 //===-----------------------------------------------------------------------===//
