Xilinx · erwei-xilinx · May 12, 2026 · May 12, 2026
diff --git a/test/gpu/multi_gpu/README.md b/test/gpu/multi_gpu/README.md
@@ -0,0 +1,78 @@
+# `multi_gpu` — symmetric-heap multi-GPU end-to-end tests
+
+End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches
+N processes — one per physical GPU — that coordinate via the symmetric heap
+(XGMI peer-mapped VMem buffers).
+
+The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants
+with FileCheck. The tests in this directory are the e2e counterparts: they
+build through the full lowering chain and run on real hardware.
+
+## Layout
+
+Tests are organized by IR-abstraction level. Each subdirectory holds tests
+written at one level. Lower levels (closer to LLVM dialect) are the lowering
+targets that higher levels reduce to.
+
+| Subdir | Phase | Abstraction added |
+|---|---|---|
+| `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. |
+| `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. |
+| `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. |
+| `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. |
+| `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. |
+
+A higher-level test should produce — after running its phase's lowering pass
+— IR functionally equivalent to one of the `handwritten/` references.
+
+## Running
+
+Each subdirectory has its own self-contained `Makefile`. There is no shared
+include or sourced helper — duplication is intentional, so that each phase's
+PR touches only its own subdir and there's no cross-phase coupling that can
+rot.
+
+Default invocation forks 2 processes:
+
+    make -C test/gpu/multi_gpu/handwritten
+
+Common knobs (`handwritten/` shown; `INPUT` choices vary by subdirectory):
+
+    make -C test/gpu/multi_gpu/handwritten INPUT=cacheline   # default
+    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
+    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
+    make -C test/gpu/multi_gpu/handwritten clean
+
+Each `Makefile` documents its own `INPUT` choices in the header comment.
+
+## Preconditions
+
+Each `Makefile`'s `check-preconditions` target refuses to launch if either:
+
+- `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs
+  a peer; a single-process launch has nothing to talk to.
+- Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would
+  silently bypass XGMI/peer-VA (transparently falling back to local memory)
+  and report false-positive PASSes.
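The two launch checks above can be sketched as a pair of shell functions. This is a minimal illustration modeled on the `check-preconditions` recipe in `air_rank/Makefile`, not the recipe itself. The function names `count_gpus` and `check_ranks` and the `KFD_NODES_DIR` override are hypothetical, introduced here so the logic can be exercised on a machine without GPUs.

```shell
#!/usr/bin/env bash

# count_gpus: print the number of GPUs the launcher may use.
# KFD_NODES_DIR is a hypothetical override (not in the Makefile) for testing.
count_gpus() {
  local nodes_dir="${KFD_NODES_DIR:-/sys/class/kfd/kfd/topology/nodes}"
  if [ -n "${HIP_VISIBLE_DEVICES:-}" ]; then
    # One GPU per non-empty comma-separated entry.
    echo "$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
  else
    # Count topology nodes whose properties report a non-zero simd_count.
    grep -l '^simd_count [1-9]' "$nodes_dir"/*/properties 2>/dev/null \
      | wc -l | tr -d ' '
  fi
}

# check_ranks NUM_RANKS NUM_GPUS: return non-zero if the launch must be refused.
check_ranks() {
  local num_ranks="$1" num_gpus="$2"
  if [ "$num_ranks" -lt 2 ]; then
    echo "ERROR: NUM_RANKS=$num_ranks; requires >= 2 ranks." >&2
    return 1
  fi
  if [ "$num_gpus" -lt "$num_ranks" ]; then
    echo "ERROR: need >= $num_ranks GPUs; found $num_gpus." >&2
    return 1
  fi
}
```

A caller would run `check_ranks "$NUM_RANKS" "$(count_gpus)"` before forking any rank and abort on a non-zero status.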
+
+## Required environment
+
+The Makefiles invoke `air-opt`, `mlir-opt`, and `mlir-runner` from `PATH`, and `mlir-runner` loads `libairgpu.so` and the `libmlir_*.so` runtime libraries through `--shared-libs`. There are three ways to satisfy this:
+
+1. **Source `utils/env_setup_gpu.sh`** (recommended) — sets `PATH`, `LD_LIBRARY_PATH`, `MLIR_AIR_INSTALL_DIR`, and `LLVM_INSTALL_DIR` in one go.
+2. **Pass install dirs on the make command line**:
+   ```
+   make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=…
+   ```
+   (PATH must still contain the binaries — these vars only affect `--shared-libs` paths.)
+3. **Have the binaries in `PATH` already**: the Makefile derives `LLVM_INSTALL_DIR` from `dirname $(dirname $(command -v mlir-opt))` and `MLIR_AIR_INSTALL_DIR` from the same expression applied to `air-opt`.
+
+The `check-preconditions` target also verifies that the resolved `LLVM_LIB_DIR` and `AIRGPU_LIB` paths exist before launching, so a missing or misconfigured environment produces a clear error instead of a `dlopen` failure deep inside `mlir-runner`.
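The PATH-based derivation in option 3 walks two directory levels up from the tool's location. A minimal sketch, assuming a conventional `<prefix>/bin/<tool>` layout. The helper name `derive_install_dir` is hypothetical, and unlike the Makefile's `$(shell ...)` one-liner it returns a non-zero status when the tool is not found.

```shell
#!/usr/bin/env bash

# derive_install_dir TOOL: print the install prefix that contains bin/TOOL.
# Returns non-zero (and prints nothing) if TOOL is not on PATH.
derive_install_dir() {
  local tool_path
  # Declare first, assign second, so the status of `command -v` is not masked.
  tool_path="$(command -v "$1")" || return 1
  # <prefix>/bin/<tool> -> <prefix>/bin -> <prefix>
  dirname "$(dirname "$tool_path")"
}
```

Under these assumptions, `LLVM_LIB_DIR` corresponds to `$(derive_install_dir mlir-opt)/lib` and `AIRGPU_LIB` to `$(derive_install_dir air-opt)/lib/libairgpu.so`, matching the defaults in `air_rank/Makefile`.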
+
+## Why duplicated boilerplate per subdir
+
+A shared `_common.mk` or `_common.sh` would let one phase's edit silently
+break another phase's tests. The boilerplate is small (~30 lines of
+preconditions + driver per Makefile) and stable — phases differ in their
+compile pipeline, not in the multi-process driver. Duplication is the
+cheaper failure mode.
diff --git a/test/gpu/multi_gpu/air_rank/Makefile b/test/gpu/multi_gpu/air_rank/Makefile
@@ -0,0 +1,149 @@
+# Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests.
+#
+# These tests express the multi-process world declaratively via
+# `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu
+# pass (Phase 3) replaces the air.rank op with body-inlined IR that
+# resolves %rid from mgpuGetRank() at runtime and brackets the
+# enclosing function with mgpuSymmetricHeapInit / Destroy.
+#
+# Each variant in this dir is a 1:1 wrap of the corresponding test in
+# ../handwritten/. After lowering through air-rank-to-mgpu the IR is
+# functionally equivalent to the handwritten reference.
+#
+# Variants (selected via INPUT):
+#   cacheline  Wrap of ../handwritten/cacheline.mlir (producer/consumer,
+#              1-to-1, cache-line atomicity).
+#   allgather  Wrap of ../handwritten/allgather.mlir (many-to-many SIMD,
+#              cache-line atomicity).
+#
+# Usage:
+#   make                       # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=allgather
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (set by utils/env_setup_gpu.sh, else derived from PATH):
+#   MLIR_AIR_INSTALL_DIR  — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR      — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/<level>/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL       := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT      ?= cacheline
+NUM_RANKS  ?= 2
+TMPDIR     := /tmp/air_multi_gpu_air_rank
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+# Derive install dirs from PATH if not explicitly provided. Matches the
+# original run.sh fallback (`dirname $(dirname $(which mlir-opt))`).
+LLVM_INSTALL_DIR     ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
+MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
+LLVM_LIB_DIR         ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB           ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+ifeq ($(filter $(INPUT),cacheline allgather),)
+$(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather')
+endif
+
+SRC_MLIR  := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir
+LOWERED   := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: lower air.rank to mgpu* runtime + expand air.translate.
+$(POST_RANK): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))"
+	air-opt $< --air-rank-to-mgpu --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same
+# pipeline as ../handwritten/Makefile (the lowered output is structurally
+# a superset of the corresponding handwritten test).
+$(LOWERED): $(POST_RANK)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	    --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	    -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU), or if the
+# install paths are missing (mlir-runner would fail at dlopen with a
+# more cryptic message).
+check-preconditions:
+	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then                                       \
+	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2;         \
+	  echo "       Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR."    \
+	       >&2;                                                                \
+	  exit 1;                                                                  \
+	fi
+	@if [ ! -f "$(AIRGPU_LIB)" ]; then                                         \
+	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2;             \
+	  echo "       Source utils/env_setup_gpu.sh or set"                      \
+	       "MLIR_AIR_INSTALL_DIR." >&2;                                        \
+	  exit 1;                                                                  \
+	fi
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then                                         \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +"   \
+	       "consumer)." >&2;                                                   \
+	  exit 1;                                                                  \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then                               \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .);     \
+	else                                                                       \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]'                                  \
+	      /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l); \
+	fi;                                                                        \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then                               \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI"     \
+	       "traffic; found $$NUM_GPUS." >&2;                                   \
+	  echo "       This test refuses to colocate ranks on a single GPU"      \
+	       "because it would silently" >&2;                                    \
+	  echo "       bypass the symmetric-heap path and report false PASSes." \
+	       >&2;                                                                \
+	  exit 1;                                                                  \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU by setting
+# HIP_VISIBLE_DEVICES=<rank> (this overrides any value inherited from the
+# caller). mlir-runner's gpu.launch_func handler, and any nested call into
+# libmlir_rocm_runtime.so, then sees exactly one device and cannot launch
+# on the wrong one. That device is index 0, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}";                     \
+	PIDS=();                                                              \
+	PASS=1;                                                               \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
+	  ( set -o pipefail;                                                  \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
+	    HIP_VISIBLE_DEVICES=$$i                                           \
+	    mlir-runner --entry-point-result=void                             \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so"       \
+	        --shared-libs="$(AIRGPU_LIB)"                                 \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so"       \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so"     \
+	        $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") &                   \
+	  PIDS+=($$!);                                                        \
+	done;                                                                 \
+	for pid in "$${PIDS[@]}"; do                                          \
+	  if ! wait "$$pid"; then PASS=0; fi;                                 \
+	done;                                                                 \
+	if [ $$PASS -eq 1 ]; then                                             \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ===";                       \
+	else                                                                  \
+	  echo "=== SOME RANKS FAILED ===";                                   \
+	  exit 1;                                                             \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
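The fork-and-wait pattern in the `run` recipe can be isolated as a bash function. A minimal sketch, assuming bash. `run_ranks` is a hypothetical name, and the command to launch is passed as arguments instead of being hard-wired to `mlir-runner`.

```shell
#!/usr/bin/env bash

# run_ranks NUM_RANKS CMD [ARGS...]
# Forks one process per rank, prefixes each rank's output, and returns
# non-zero if any rank exits non-zero.
run_ranks() {
  local num_ranks="$1"; shift
  local pids="" pass=1 i pid
  for i in $(seq 0 $((num_ranks - 1))); do
    # pipefail makes the subshell report CMD's failure, not sed's success.
    ( set -o pipefail
      RANK=$i WORLD_SIZE=$num_ranks LOCAL_RANK=0 HIP_VISIBLE_DEVICES=$i \
        "$@" 2>&1 | sed "s/^/[rank $i] /" ) &
    pids="$pids $!"
  done
  # Wait on every rank so one early failure does not orphan the others.
  for pid in $pids; do
    wait "$pid" || pass=0
  done
  [ "$pass" -eq 1 ]
}
```

Passing the recipe's `mlir-runner` command line as the arguments would reproduce Step 2. Each rank sees only the device named by its own `HIP_VISIBLE_DEVICES`, which is why `LOCAL_RANK` stays 0 for every rank.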
diff --git a/...heap_dma/air_sym_with_rank_allgather.mlir → test/gpu/multi_gpu/air_rank/allgather.mlir b/...heap_dma/air_sym_with_rank_allgather.mlir → test/gpu/multi_gpu/air_rank/allgather.mlir
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_allgather.mlir - air.rank wrap of allgather -----===//
+//===- air_rank/allgather.mlir - air.rank wrap of handwritten allgather --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_allgather.mlir.
+// High-level version of handwritten/allgather.mlir.
 //
 // This file is a 1:1 wrap of the SIMD-across-ranks all-gather test inside
 // an `air.rank` op:
@@ -20,9 +20,9 @@
 //   - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_allgather.mlir (same kernel, same launch dispatch,
-// same validation). Sister file: air_sym_with_rank_cacheline.mlir does
-// the analogous wrap of the producer/consumer cacheline test.
+// handwritten/allgather.mlir (same kernel, same launch dispatch, same
+// validation). Sister file: air_rank/cacheline.mlir does the analogous
+// wrap of the producer/consumer cacheline test.
 //
 // The kernel and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the handwritten allgather. Only @main differs
@@ -33,7 +33,7 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten allgather.
 //
-// Launcher: run.sh with INPUT=rank_allgather forks 2 processes.
+// Launcher: `make INPUT=allgather` from this subdir forks 2 processes.
 //
 //===-----------------------------------------------------------------------===//
 

diff --git a/...heap_dma/air_sym_with_rank_cacheline.mlir → test/gpu/multi_gpu/air_rank/cacheline.mlir b/...heap_dma/air_sym_with_rank_cacheline.mlir → test/gpu/multi_gpu/air_rank/cacheline.mlir
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_cacheline.mlir - air.rank wrap of cacheline -----===//
+//===- air_rank/cacheline.mlir - air.rank wrap of handwritten cacheline --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_cacheline.mlir.
+// High-level version of handwritten/cacheline.mlir.
 //
 // This file is a 1:1 wrap of the cacheline producer/consumer test inside
 // an `air.rank` op:
@@ -20,10 +20,10 @@
 //   - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_cacheline.mlir (same kernels, same launch
-// dispatch, same validation). This file's job is to demonstrate that
-// the user can write the multi-process world declaratively via air.rank
-// and have the pass produce the handwritten reference.
+// handwritten/cacheline.mlir (same kernels, same launch dispatch, same
+// validation). This file's job is to demonstrate that the user can
+// write the multi-process world declaratively via air.rank and have
+// the pass produce the handwritten reference.
 //
 // The kernels and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the cacheline test. Only @main differs in
@@ -34,8 +34,8 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten cacheline test.
 //
-// Launcher: run.sh with INPUT=rank forks 2 processes. The
-// air-rank-to-mgpu pass converts air.rank to runtime dispatch.
+// Launcher: `make INPUT=cacheline` from this subdir forks 2 processes.
+// The air-rank-to-mgpu pass converts air.rank to runtime dispatch.
 //
 //===-----------------------------------------------------------------------===//