Commit 79c0888

erwei-xilinx and claude committed
[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level
Reorganize the multi-GPU e2e tests to match how the lowering stack is layered. Each subdirectory hosts tests at one IR-abstraction level; future phases (4-7) drop into their own subdir without touching anything else.

Directory rename: test/gpu/symmetric_heap_dma/ → test/gpu/multi_gpu/

The old name was misleading — most of the tests don't do DMA in the conventional sense (the cacheline / allgather / rank variants use vec-store + gpu.shuffle; the atomic variant uses atomicrmw; phases 5/6 will add real DMA later). The common thread is the symmetric-heap fabric, not DMA.

New layout:

    test/gpu/multi_gpu/
      README.md            # explains the layered structure
      handwritten/         # Phase 2 reference
        Makefile           # self-contained; INPUT=cacheline|atomic|allgather
        cacheline.mlir     # was: air_sym_handwritten_cacheline.mlir
        atomic.mlir        # was: air_sym_handwritten_atomic.mlir
        allgather.mlir     # was: air_sym_handwritten_allgather.mlir
      air_rank/            # Phase 3 (air.rank wrapping)
        Makefile           # self-contained; INPUT=cacheline|allgather
        cacheline.mlir     # was: air_sym_with_rank_cacheline.mlir
        allgather.mlir     # was: air_sym_with_rank_allgather.mlir

Per-phase invocation: `make` instead of `bash run.sh`. Same default behavior (NUM_RANKS=2, INPUT=cacheline). Make's dependency tracking avoids re-running the lowering pipeline when only NUM_RANKS changes.

    make -C test/gpu/multi_gpu/handwritten                   # default
    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
    make -C test/gpu/multi_gpu/handwritten INPUT=allgather
    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
    make -C test/gpu/multi_gpu/air_rank                      # default
    make -C test/gpu/multi_gpu/air_rank INPUT=allgather
    make -C test/gpu/multi_gpu/handwritten clean

Why per-subdir self-contained Makefile (no _common.mk / no _common.sh):

- Each phase's PR touches only its own subdir; no rebase conflicts on a shared file. Phases 2-7 had to re-resolve run.sh case-statement conflicts on every rebase under the old shared-script design (the cascade hit 4 conflicts during just the phase 2 rebase).
- A shared include rots silently — one phase's edit can break another's pipeline without obvious blame attribution. Duplicating ~30 lines of preconditions + multi-process driver per Makefile is the cheaper failure mode.
- Pipelines genuinely differ per phase (handwritten goes through air-translate-to-llvm + GPU compile; air_rank prepends air-rank-to-mgpu; air_alloc will add air-symmetric-alloc-to-mgpu; etc.). One unified case statement would already be hard to read by phase 4.

Verified on rad-mi325x-1 (2x MI325X), all 5 variants PASS:

- make -C test/gpu/multi_gpu/handwritten INPUT={cacheline,atomic,allgather}
- make -C test/gpu/multi_gpu/air_rank INPUT={cacheline,allgather}

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 958bb96 commit 79c0888

9 files changed

Lines changed: 363 additions & 159 deletions

test/gpu/multi_gpu/README.md

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+# `multi_gpu` — symmetric-heap multi-GPU end-to-end tests
+
+End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches
+N processes — one per physical GPU — that coordinate via the symmetric heap
+(XGMI peer-mapped VMem buffers).
+
+The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants
+with FileCheck. The tests in this directory are the e2e counterparts: they
+build through the full lowering chain and run on real hardware.
+
+## Layout
+
+Tests are organized by IR-abstraction level. Each subdirectory holds tests
+written at one level. Lower levels (closer to LLVM dialect) are the lowering
+targets that higher levels reduce to.
+
+| Subdir | Phase | Abstraction added |
+|---|---|---|
+| `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. |
+| `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. |
+| `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. |
+| `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. |
+| `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. |
+
+A higher-level test should produce — after running its phase's lowering pass
+— IR functionally equivalent to one of the `handwritten/` references.
+
+## Running
+
+Each subdirectory has its own self-contained `Makefile`. There is no shared
+include or sourced helper — duplication is intentional, so that each phase's
+PR touches only its own subdir and there's no cross-phase coupling that can
+rot.
+
+Default invocation forks 2 processes:
+
+    make -C test/gpu/multi_gpu/handwritten
+
+Inside a subdirectory, common knobs:
+
+    make -C test/gpu/multi_gpu/handwritten INPUT=cacheline   # default
+    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
+    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
+    make -C test/gpu/multi_gpu/handwritten clean
+
+Each `Makefile` documents its own `INPUT` choices in the header comment.
+
+## Preconditions
+
+Each `Makefile`'s `check-preconditions` target refuses to launch if either:
+
+- `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs
+  a peer; a single-process launch has nothing to talk to.
+- Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would
+  silently bypass XGMI/peer-VA (transparently falling back to local memory)
+  and report false-positive PASSes.
+
+## Required environment
+
+When `utils/env_setup_gpu.sh` is sourced these are auto-detected:
+
+- `MLIR_AIR_INSTALL_DIR` — path containing `lib/libairgpu.so`
+- `LLVM_INSTALL_DIR` — path containing `bin/mlir-opt` + `lib/libmlir_*.so`
+
+Otherwise pass them on the command line: `make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=… …`.
+
+## Why duplicated boilerplate per subdir
+
+A shared `_common.mk` or `_common.sh` would let one phase's edit silently
+break another phase's tests. The boilerplate is small (~30 lines of
+preconditions + driver per Makefile) and stable — phases differ in their
+compile pipeline, not in the multi-process driver. Duplication is the
+cheaper failure mode.
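
The two refusal rules in the README's Preconditions section can be sketched as a standalone Bash helper. This is a minimal sketch under stated assumptions, not the recipe the Makefiles ship: the function names `count_listed_gpus` and `check_preconditions` are hypothetical, and the GPU count is passed in as an argument rather than probed from the KFD topology in sysfs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the two refusal rules come from the README.

# Count the devices named in a HIP_VISIBLE_DEVICES-style comma-separated list.
count_listed_gpus() {
  local list="$1"
  # Empty fields (for example a trailing comma) are ignored.
  # grep -c exits 1 when it counts zero lines, so `|| true` keeps a zero
  # count from being treated as a failure under `set -e` / pipefail.
  printf '%s\n' "$list" | tr ',' '\n' | grep -c . || true
}

# Return 0 when a launch may proceed, 1 (with a message on stderr) otherwise.
check_preconditions() {
  local num_ranks="$1" num_gpus="$2"
  if [ "$num_ranks" -lt 2 ]; then
    echo "ERROR: NUM_RANKS=$num_ranks; a cross-rank test needs >= 2 ranks" >&2
    return 1
  fi
  if [ "$num_gpus" -lt "$num_ranks" ]; then
    echo "ERROR: need >= $num_ranks GPUs; found $num_gpus" >&2
    return 1
  fi
}
```

A caller would typically combine the two, for example `check_preconditions "$NUM_RANKS" "$(count_listed_gpus "$HIP_VISIBLE_DEVICES")"`, and abort the launch on a non-zero status.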

test/gpu/multi_gpu/air_rank/Makefile

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+# Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests.
+#
+# These tests express the multi-process world declaratively via
+# `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu
+# pass (Phase 3) replaces the air.rank op with body-inlined IR that
+# resolves %rid from mgpuGetRank() at runtime and brackets the
+# enclosing function with mgpuSymmetricHeapInit / Destroy.
+#
+# Each variant in this dir is a 1:1 wrap of the corresponding test in
+# ../handwritten/. After lowering through air-rank-to-mgpu the IR is
+# functionally equivalent to the handwritten reference.
+#
+# Variants (selected via INPUT):
+#   cacheline   Wrap of ../handwritten/cacheline.mlir (producer/consumer,
+#               1-to-1, cache-line atomicity).
+#   allgather   Wrap of ../handwritten/allgather.mlir (many-to-many SIMD,
+#               cache-line atomicity).
+#
+# Usage:
+#   make                    # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=allgather
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (auto-detected when sourced via env_setup_gpu.sh):
+#   MLIR_AIR_INSTALL_DIR — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/<level>/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT ?= cacheline
+NUM_RANKS ?= 2
+TMPDIR ?= /tmp/air_multi_gpu_air_rank
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+LLVM_LIB_DIR ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+ifeq ($(filter $(INPUT),cacheline allgather),)
+  $(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather')
+endif
+
+SRC_MLIR := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir
+LOWERED := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: lower air.rank to mgpu* runtime + expand air.translate.
+$(POST_RANK): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))"
+	air-opt $< -air-rank-to-mgpu --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same
+# pipeline as ../handwritten/Makefile (the lowered output is structurally
+# a superset of the corresponding handwritten test).
+$(LOWERED): $(POST_RANK)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	  --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	  -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to) or fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU).
+check-preconditions:
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +" \
+	       "consumer)." >&2; \
+	  exit 1; \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .); \
+	else \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]' \
+	    /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l); \
+	fi; \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI" \
+	       "traffic; found $$NUM_GPUS." >&2; \
+	  echo " This test refuses to colocate ranks on a single GPU" \
+	       "because it would silently" >&2; \
+	  echo " bypass the symmetric-heap path and report false PASSes." \
+	       >&2; \
+	  exit 1; \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
+# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
+# nested call into libmlir_rocm_runtime.so) only ever sees one device,
+# so it can't accidentally launch on the wrong one. Every rank still
+# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}"; \
+	PIDS=(); \
+	PASS=1; \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
+	  ( set -o pipefail; \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
+	    HIP_VISIBLE_DEVICES=$$i \
+	    mlir-runner --entry-point-result=void \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so" \
+	      --shared-libs="$(AIRGPU_LIB)" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so" \
+	      $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") & \
+	  PIDS+=($$!); \
+	done; \
+	for pid in "$${PIDS[@]}"; do \
+	  if ! wait "$$pid"; then PASS=0; fi; \
+	done; \
+	if [ $$PASS -eq 1 ]; then \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ==="; \
+	else \
+	  echo "=== SOME RANKS FAILED ==="; \
+	  exit 1; \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
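
The fork-and-wait pattern in the `run` recipe can be read more easily outside Make's `$$` escaping. The following is a minimal Bash sketch of that pattern, assuming a hypothetical function `run_ranks` whose trailing arguments stand in for the `mlir-runner` command line; it is not the recipe itself.

```shell
#!/usr/bin/env bash
# run_ranks NUM_RANKS CMD [ARGS...]   (hypothetical helper name)
# Launches CMD once per rank in the background, prefixes every output line
# with its rank, and returns non-zero if any rank fails.
run_ranks() {
  local num_ranks="$1"; shift
  local pids=() pass=1 i pid
  for ((i = 0; i < num_ranks; i++)); do
    (
      # With pipefail the subshell's status reflects CMD, not the sed filter.
      set -o pipefail
      # Each rank sees exactly one device, so LOCAL_RANK is always 0.
      RANK="$i" WORLD_SIZE="$num_ranks" LOCAL_RANK=0 HIP_VISIBLE_DEVICES="$i" \
        "$@" 2>&1 | sed "s/^/[rank $i] /"
    ) &
    pids+=("$!")
  done
  # Wait on every PID individually so that one failure does not hide another.
  for pid in "${pids[@]}"; do
    wait "$pid" || pass=0
  done
  [ "$pass" -eq 1 ]
}
```

For example, `run_ranks 2 sh -c 'echo "hello from $RANK"'` prints one prefixed line per rank. Waiting on each PID separately, rather than issuing a bare `wait`, is what lets the driver see each rank's individual exit status.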

test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir renamed to test/gpu/multi_gpu/air_rank/allgather.mlir

Lines changed: 6 additions & 6 deletions
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_allgather.mlir - air.rank wrap of allgather -----===//
+//===- air_rank/allgather.mlir - air.rank wrap of handwritten allgather --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_allgather.mlir.
+// High-level version of handwritten/allgather.mlir.
 //
 // This file is a 1:1 wrap of the SIMD-across-ranks all-gather test inside
 // an `air.rank` op:
@@ -20,9 +20,9 @@
 // - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_allgather.mlir (same kernel, same launch dispatch,
-// same validation). Sister file: air_sym_with_rank_cacheline.mlir does
-// the analogous wrap of the producer/consumer cacheline test.
+// handwritten/allgather.mlir (same kernel, same launch dispatch, same
+// validation). Sister file: air_rank/cacheline.mlir does the analogous
+// wrap of the producer/consumer cacheline test.
 //
 // The kernel and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the handwritten allgather. Only @main differs
@@ -33,7 +33,7 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten allgather.
 //
-// Launcher: run.sh with INPUT=rank_allgather forks 2 processes.
+// Launcher: `make INPUT=allgather` from this subdir forks 2 processes.
 //
 //===-----------------------------------------------------------------------===//
 

test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir renamed to test/gpu/multi_gpu/air_rank/cacheline.mlir

Lines changed: 8 additions & 8 deletions
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_cacheline.mlir - air.rank wrap of cacheline -----===//
+//===- air_rank/cacheline.mlir - air.rank wrap of handwritten cacheline --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_cacheline.mlir.
+// High-level version of handwritten/cacheline.mlir.
 //
 // This file is a 1:1 wrap of the cacheline producer/consumer test inside
 // an `air.rank` op:
@@ -20,10 +20,10 @@
 // - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_cacheline.mlir (same kernels, same launch
-// dispatch, same validation). This file's job is to demonstrate that
-// the user can write the multi-process world declaratively via air.rank
-// and have the pass produce the handwritten reference.
+// handwritten/cacheline.mlir (same kernels, same launch dispatch, same
+// validation). This file's job is to demonstrate that the user can
+// write the multi-process world declaratively via air.rank and have
+// the pass produce the handwritten reference.
 //
 // The kernels and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the cacheline test. Only @main differs in
@@ -34,8 +34,8 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten cacheline test.
 //
-// Launcher: run.sh with INPUT=rank forks 2 processes. The
-// air-rank-to-mgpu pass converts air.rank to runtime dispatch.
+// Launcher: `make INPUT=cacheline` from this subdir forks 2 processes.
+// The air-rank-to-mgpu pass converts air.rank to runtime dispatch.
 //
 //===-----------------------------------------------------------------------===//
 
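
Scripts or notes that still mention the pre-rename file names can be updated mechanically, because the rename follows a fixed prefix-to-subdirectory rule. The helper below is a hypothetical sketch of that rule as it appears in the commit message's layout listing (the name `new_test_path` is not part of the repository). It assumes any file carrying one of the two old prefixes follows the same pattern, which the commit confirms only for the five listed tests.

```shell
#!/usr/bin/env bash
# new_test_path OLD_BASENAME   (hypothetical helper name)
# Prints the post-rename path, relative to test/gpu/multi_gpu/.
new_test_path() {
  case "$1" in
    air_sym_handwritten_*.mlir)
      # air_sym_handwritten_<variant>.mlir -> handwritten/<variant>.mlir
      echo "handwritten/${1#air_sym_handwritten_}" ;;
    air_sym_with_rank_*.mlir)
      # air_sym_with_rank_<variant>.mlir -> air_rank/<variant>.mlir
      echo "air_rank/${1#air_sym_with_rank_}" ;;
    *)
      echo "unknown legacy test name: $1" >&2
      return 1 ;;
  esac
}
```

Calling `new_test_path air_sym_with_rank_allgather.mlir` prints `air_rank/allgather.mlir`, matching the rename shown in the diff above.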
