diff --git a/test/gpu/multi_gpu/README.md b/test/gpu/multi_gpu/README.md
new file mode 100644
index 000000000..ecad1dee1
--- /dev/null
+++ b/test/gpu/multi_gpu/README.md
@@ -0,0 +1,78 @@
+# `multi_gpu` — symmetric-heap multi-GPU end-to-end tests
+
+End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches
+N processes — one per physical GPU — that coordinate via the symmetric heap
+(XGMI peer-mapped VMem buffers).
+
+The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants
+with FileCheck. The tests in this directory are the e2e counterparts: they
+build through the full lowering chain and run on real hardware.
+
+## Layout
+
+Tests are organized by IR-abstraction level. Each subdirectory holds tests
+written at one level. Lower levels (closer to LLVM dialect) are the lowering
+targets that higher levels reduce to.
+
+| Subdir | Phase | Abstraction added |
+|---|---|---|
+| `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. |
+| `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. |
+| `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. |
+| `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. |
+| `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. |
+
+A higher-level test should produce — after running its phase's lowering pass
+— IR functionally equivalent to one of the `handwritten/` references.
+
+## Running
+
+Each subdirectory has its own self-contained `Makefile`. There is no shared
+include or sourced helper — duplication is intentional, so that each phase's
+PR touches only its own subdir and there's no cross-phase coupling that can
+rot.
+
+Default invocation forks 2 processes:
+
+    make -C test/gpu/multi_gpu/handwritten
+
+Common knobs, shown here for `handwritten/`:
+
+    make -C test/gpu/multi_gpu/handwritten INPUT=cacheline   # default
+    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
+    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
+    make -C test/gpu/multi_gpu/handwritten clean
+
+Each `Makefile` documents its own `INPUT` choices in the header comment.
+
+## Preconditions
+
+Each `Makefile`'s `check-preconditions` target refuses to launch if either:
+
+- `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs
+  a peer; a single-process launch has nothing to talk to.
+- Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would
+  silently bypass XGMI/peer-VA (transparently falling back to local memory)
+  and report false-positive PASSes.
+
+## Required environment
+
+The Makefiles invoke `air-opt`, `mlir-opt`, and `mlir-runner` via `PATH`, and `mlir-runner` loads `libairgpu.so` and the `libmlir_*.so` runtime libraries through `--shared-libs`. There are three ways to satisfy this:
+
+1. **Source `utils/env_setup_gpu.sh`** (recommended) — sets `PATH`, `LD_LIBRARY_PATH`, `MLIR_AIR_INSTALL_DIR`, and `LLVM_INSTALL_DIR` in one go.
+2. **Pass install dirs on the make command line**:
+   ```
+   make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=…
+   ```
+   (PATH must still contain the binaries — these vars only affect `--shared-libs` paths.)
+3. **Have the binaries in `PATH` already** — the Makefile derives `LLVM_INSTALL_DIR` / `MLIR_AIR_INSTALL_DIR` from `dirname $(dirname $(command -v mlir-opt))` etc.
+
+The `check-preconditions` target validates that the resolved `LLVM_LIB_DIR` and `AIRGPU_LIB` paths actually exist before launching, so a missing environment produces a clear error rather than a `dlopen` failure deep inside `mlir-runner`.
+
+## Why duplicated boilerplate per subdir
+
+A shared `_common.mk` or `_common.sh` would let one phase's edit silently
+break another phase's tests. The boilerplate is small (~30 lines of
+preconditions + driver per Makefile) and stable — phases differ in their
+compile pipeline, not in the multi-process driver. Duplication is the
+cheaper failure mode.
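
The GPU-count precondition described in the README's "Preconditions" section can be sketched in isolation as below. This is a minimal illustration, not the Makefile's actual recipe: the function names (`count_listed_devices`, `count_kfd_gpus`, `have_enough_gpus`) are hypothetical, and the topology directory is made a parameter only so the logic can be exercised without real hardware. The `|| true` guards are an assumption about running under `set -e -o pipefail`, where a `grep` with zero matches would otherwise abort the shell before the count is printed.

```shell
#!/usr/bin/env bash
# Sketch of the GPU-count precondition. All function names are hypothetical.

# Count the entries in a comma-separated device list such as "0,1,3".
# grep -c exits 1 when it counts zero lines, so "|| true" keeps the
# function usable under "set -e -o pipefail".
count_listed_devices() {
  local list="$1"
  echo "$list" | tr ',' '\n' | grep -c . || true
}

# Count KFD topology nodes whose simd_count is nonzero, which is the same
# pattern the Makefiles grep for. The directory is a parameter so the
# function can be pointed at a fake tree.
count_kfd_gpus() {
  local topo="${1:-/sys/class/kfd/kfd/topology/nodes}"
  { grep -l '^simd_count [1-9]' "$topo"/*/properties 2>/dev/null || true; } | wc -l
}

# Succeed only if at least num_ranks GPUs are available. An explicit
# HIP_VISIBLE_DEVICES list takes precedence over the sysfs count.
have_enough_gpus() {
  local num_ranks="$1" num_gpus
  if [ -n "${HIP_VISIBLE_DEVICES:-}" ]; then
    num_gpus=$(count_listed_devices "$HIP_VISIBLE_DEVICES")
  else
    num_gpus=$(count_kfd_gpus)
  fi
  [ "$num_gpus" -ge "$num_ranks" ]
}
```

A caller would gate the launch with something like `have_enough_gpus "$NUM_RANKS" || exit 1`, printing its own error message first.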
diff --git a/test/gpu/multi_gpu/air_rank/Makefile b/test/gpu/multi_gpu/air_rank/Makefile
new file mode 100644
index 000000000..6d9b4bd66
--- /dev/null
+++ b/test/gpu/multi_gpu/air_rank/Makefile
@@ -0,0 +1,149 @@
+# Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests.
+#
+# These tests express the multi-process world declaratively via
+# `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu
+# pass (Phase 3) replaces the air.rank op with body-inlined IR that
+# resolves %rid from mgpuGetRank() at runtime and brackets the
+# enclosing function with mgpuSymmetricHeapInit / Destroy.
+#
+# Each variant in this dir is a 1:1 wrap of the corresponding test in
+# ../handwritten/. After lowering through air-rank-to-mgpu the IR is
+# functionally equivalent to the handwritten reference.
+#
+# Variants (selected via INPUT):
+#   cacheline  Wrap of ../handwritten/cacheline.mlir (producer/consumer,
+#              1-to-1, cache-line atomicity).
+#   allgather  Wrap of ../handwritten/allgather.mlir (many-to-many SIMD,
+#              cache-line atomicity).
+#
+# Usage:
+#   make                       # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=allgather
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (set by sourcing env_setup_gpu.sh, else derived from PATH):
+#   MLIR_AIR_INSTALL_DIR  — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR      — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/<level>/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL       := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT      ?= cacheline
+NUM_RANKS  ?= 2
+TMPDIR     ?= /tmp/air_multi_gpu_air_rank
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+# Derive install dirs from PATH if not explicitly provided: each prefix is
+# the directory two levels above the tool (i.e. <prefix>/bin/mlir-opt).
+LLVM_INSTALL_DIR     ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
+MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
+LLVM_LIB_DIR         ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB           ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+ifeq ($(filter $(INPUT),cacheline allgather),)
+$(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather')
+endif
+
+SRC_MLIR  := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir
+LOWERED   := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: lower air.rank to mgpu* runtime + expand air.translate.
+$(POST_RANK): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))"
+	air-opt $< --air-rank-to-mgpu --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same
+# pipeline as ../handwritten/Makefile (the lowered output is structurally
+# a superset of the corresponding handwritten test).
+$(LOWERED): $(POST_RANK)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	    --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	    -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU), or if the
+# install paths are missing (mlir-runner would fail at dlopen with a
+# more cryptic message).
+check-preconditions:
+	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then                                       \
+	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2;         \
+	  echo "       Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR."    \
+	       >&2;                                                                \
+	  exit 1;                                                                  \
+	fi
+	@if [ ! -f "$(AIRGPU_LIB)" ]; then                                         \
+	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2;             \
+	  echo "       Source utils/env_setup_gpu.sh or set"                      \
+	       "MLIR_AIR_INSTALL_DIR." >&2;                                        \
+	  exit 1;                                                                  \
+	fi
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then                                         \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +"   \
+	       "consumer)." >&2;                                                   \
+	  exit 1;                                                                  \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then                               \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .);     \
+	else                                                                       \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]'                                  \
+	      /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l || true); \
+	fi;                                                                        \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then                               \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI"     \
+	       "traffic; found $$NUM_GPUS." >&2;                                   \
+	  echo "       This test refuses to colocate ranks on a single GPU"      \
+	       "because it would silently" >&2;                                    \
+	  echo "       bypass the symmetric-heap path and report false PASSes." \
+	       >&2;                                                                \
+	  exit 1;                                                                  \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
+# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
+# nested call into libmlir_rocm_runtime.so) only ever sees one device,
+# so it can't accidentally launch on the wrong one. Every rank still
+# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}";                     \
+	PIDS=();                                                              \
+	PASS=1;                                                               \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
+	  ( set -o pipefail;                                                  \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
+	    HIP_VISIBLE_DEVICES=$$i                                           \
+	    mlir-runner --entry-point-result=void                             \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so"       \
+	        --shared-libs="$(AIRGPU_LIB)"                                 \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so"       \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so"     \
+	        $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") &                   \
+	  PIDS+=($$!);                                                        \
+	done;                                                                 \
+	for pid in "$${PIDS[@]}"; do                                          \
+	  if ! wait "$$pid"; then PASS=0; fi;                                 \
+	done;                                                                 \
+	if [ $$PASS -eq 1 ]; then                                             \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ===";                       \
+	else                                                                  \
+	  echo "=== SOME RANKS FAILED ===";                                   \
+	  exit 1;                                                             \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
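
The Step 2 driver in the Makefile above follows a fork, collect PIDs, wait, aggregate pattern. The sketch below isolates that pattern with a stand-in command in place of `mlir-runner`. The function name `run_ranks` is hypothetical, and the sketch deliberately omits the `HIP_VISIBLE_DEVICES` pinning and `--shared-libs` flags, so it shows only how per-rank exit codes are combined.

```shell
#!/usr/bin/env bash
# Sketch of the fork/wait driver. run_ranks and its arguments are hypothetical.

# Launch "$@" once per rank with RANK and WORLD_SIZE in its environment,
# prefix each rank's output with its rank number, and return nonzero if
# any rank exited nonzero.
run_ranks() {
  local num_ranks="$1"; shift
  local pids=() pass=1 i pid
  for i in $(seq 0 $((num_ranks - 1))); do
    # pipefail inside the subshell makes the rank's own exit status
    # survive the pipe into sed.
    ( set -o pipefail
      RANK=$i WORLD_SIZE=$num_ranks "$@" 2>&1 | sed "s/^/[rank $i] /" ) &
    pids+=($!)
  done
  # Wait on every PID individually so one failing rank cannot be masked
  # by a later rank that succeeds.
  for pid in "${pids[@]}"; do
    wait "$pid" || pass=0
  done
  [ "$pass" -eq 1 ]
}
```

For example, `run_ranks 2 sh -c 'echo "hello from $RANK"'` prints one prefixed line per rank (in no guaranteed order) and succeeds, while `run_ranks 2 sh -c '[ "$RANK" -eq 0 ]'` fails because rank 1 exits nonzero.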
diff --git a/test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir b/test/gpu/multi_gpu/air_rank/allgather.mlir
similarity index 97%
rename from test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir
rename to test/gpu/multi_gpu/air_rank/allgather.mlir
index 46861105c..5ae93d50d 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir
+++ b/test/gpu/multi_gpu/air_rank/allgather.mlir
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_allgather.mlir - air.rank wrap of allgather -----===//
+//===- air_rank/allgather.mlir - air.rank wrap of handwritten allgather --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_allgather.mlir.
+// High-level version of handwritten/allgather.mlir.
 //
 // This file is a 1:1 wrap of the SIMD-across-ranks all-gather test inside
 // an `air.rank` op:
@@ -20,9 +20,9 @@
 //   - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_allgather.mlir (same kernel, same launch dispatch,
-// same validation). Sister file: air_sym_with_rank_cacheline.mlir does
-// the analogous wrap of the producer/consumer cacheline test.
+// handwritten/allgather.mlir (same kernel, same launch dispatch, same
+// validation). Sister file: air_rank/cacheline.mlir does the analogous
+// wrap of the producer/consumer cacheline test.
 //
 // The kernel and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the handwritten allgather. Only @main differs
@@ -33,7 +33,7 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten allgather.
 //
-// Launcher: run.sh with INPUT=rank_allgather forks 2 processes.
+// Launcher: `make INPUT=allgather` from this subdir forks 2 processes.
 //
 //===-----------------------------------------------------------------------===//
 
diff --git a/test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir b/test/gpu/multi_gpu/air_rank/cacheline.mlir
similarity index 96%
rename from test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir
rename to test/gpu/multi_gpu/air_rank/cacheline.mlir
index e4cec3f8f..2950b0d0b 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir
+++ b/test/gpu/multi_gpu/air_rank/cacheline.mlir
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_cacheline.mlir - air.rank wrap of cacheline -----===//
+//===- air_rank/cacheline.mlir - air.rank wrap of handwritten cacheline --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_cacheline.mlir.
+// High-level version of handwritten/cacheline.mlir.
 //
 // This file is a 1:1 wrap of the cacheline producer/consumer test inside
 // an `air.rank` op:
@@ -20,10 +20,10 @@
 //   - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_cacheline.mlir (same kernels, same launch
-// dispatch, same validation). This file's job is to demonstrate that
-// the user can write the multi-process world declaratively via air.rank
-// and have the pass produce the handwritten reference.
+// handwritten/cacheline.mlir (same kernels, same launch dispatch, same
+// validation). This file's job is to demonstrate that the user can
+// write the multi-process world declaratively via air.rank and have
+// the pass produce the handwritten reference.
 //
 // The kernels and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the cacheline test. Only @main differs in
@@ -34,8 +34,8 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten cacheline test.
 //
-// Launcher: run.sh with INPUT=rank forks 2 processes. The
-// air-rank-to-mgpu pass converts air.rank to runtime dispatch.
+// Launcher: `make INPUT=cacheline` from this subdir forks 2 processes.
+// The air-rank-to-mgpu pass converts air.rank to runtime dispatch.
 //
 //===-----------------------------------------------------------------------===//
 
diff --git a/test/gpu/multi_gpu/handwritten/Makefile b/test/gpu/multi_gpu/handwritten/Makefile
new file mode 100644
index 000000000..245592a71
--- /dev/null
+++ b/test/gpu/multi_gpu/handwritten/Makefile
@@ -0,0 +1,154 @@
+# Multi-process symmetric-heap multi-GPU e2e — hand-written reference.
+#
+# Compiles and runs the hand-written MLIR test as N processes. Each
+# process executes the full IR; processes coordinate via the symmetric
+# heap (XGMI peer-mapped VMem buffers).
+#
+# Variants (selected via INPUT):
+#   cacheline  Producer/consumer (1-to-1) using cache-line atomicity:
+#              producer writes 32 i32 (one 128-byte line) in a single vec
+#              store with the flag in-band at lane 31; consumer spins via
+#              gpu.shuffle of lane 31. Trades a spec-defined LLVM contract
+#              for a microarchitectural one (relies on the XGMI fabric
+#              publishing peer cache lines whole on gfx940 / MI300).
+#   atomic     Producer/consumer (1-to-1) using LLVM atomicrmw release /
+#              atomic load acquire with syncscope("") (= AMDGPUUsage
+#              System scope = cross-device). Spec-defined ordering
+#              contract; pinned by
+#              mlir/test/Conversion/AIRToROCDL/sym_atomic_syncscope.mlir.
+#   allgather  Many-to-many SIMD: every rank runs the SAME kernel and
+#              writes its slice into slot[my_rank] of every peer's output,
+#              then spins on each peer's slot. Cache-line atomicity (same
+#              mechanism as 'cacheline'), generalized.
+#
+# Usage:
+#   make                       # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=atomic
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (set by sourcing env_setup_gpu.sh, else derived from PATH):
+#   MLIR_AIR_INSTALL_DIR  — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR      — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/<level>/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL       := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT      ?= cacheline
+NUM_RANKS  ?= 2
+TMPDIR     ?= /tmp/air_multi_gpu_handwritten
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+# Derive install dirs from PATH if not explicitly provided: each prefix is
+# the directory two levels above the tool (i.e. <prefix>/bin/mlir-opt).
+LLVM_INSTALL_DIR     ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
+MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
+LLVM_LIB_DIR         ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB           ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+# Reject unknown INPUT at parse time so a typo errors immediately
+# instead of running through a half-broken pipeline.
+ifeq ($(filter $(INPUT),cacheline atomic allgather),)
+$(error Unknown INPUT=$(INPUT); expected 'cacheline', 'atomic', or 'allgather')
+endif
+
+SRC_MLIR       := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_TRANSLATE := $(TMPDIR)/$(INPUT)_post_translate.mlir
+LOWERED        := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: expand air.translate ops to memref descriptor rebases.
+$(POST_TRANSLATE): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Expand air.translate ops ($(INPUT) variant)"
+	air-opt $< --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host.
+$(LOWERED): $(POST_TRANSLATE)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	    --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	    -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU), or if the
+# install paths are missing (mlir-runner would fail at dlopen with a
+# more cryptic message).
+check-preconditions:
+	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then                                       \
+	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2;         \
+	  echo "       Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR."    \
+	       >&2;                                                                \
+	  exit 1;                                                                  \
+	fi
+	@if [ ! -f "$(AIRGPU_LIB)" ]; then                                         \
+	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2;             \
+	  echo "       Source utils/env_setup_gpu.sh or set"                      \
+	       "MLIR_AIR_INSTALL_DIR." >&2;                                        \
+	  exit 1;                                                                  \
+	fi
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then                                         \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +"   \
+	       "consumer)." >&2;                                                   \
+	  exit 1;                                                                  \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then                               \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .);     \
+	else                                                                       \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]'                                  \
+	      /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l || true); \
+	fi;                                                                        \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then                               \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI"     \
+	       "traffic; found $$NUM_GPUS." >&2;                                   \
+	  echo "       This test refuses to colocate ranks on a single GPU"      \
+	       "because it would silently" >&2;                                    \
+	  echo "       bypass the symmetric-heap path and report false PASSes." \
+	       >&2;                                                                \
+	  exit 1;                                                                  \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
+# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
+# nested call into libmlir_rocm_runtime.so) only ever sees one device,
+# so it can't accidentally launch on the wrong one. Every rank still
+# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}";                     \
+	PIDS=();                                                              \
+	PASS=1;                                                               \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do                         \
+	  ( set -o pipefail;                                                  \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0                     \
+	    HIP_VISIBLE_DEVICES=$$i                                           \
+	    mlir-runner --entry-point-result=void                             \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so"       \
+	        --shared-libs="$(AIRGPU_LIB)"                                 \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so"       \
+	        --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so"     \
+	        $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") &                   \
+	  PIDS+=($$!);                                                        \
+	done;                                                                 \
+	for pid in "$${PIDS[@]}"; do                                          \
+	  if ! wait "$$pid"; then PASS=0; fi;                                 \
+	done;                                                                 \
+	if [ $$PASS -eq 1 ]; then                                             \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ===";                       \
+	else                                                                  \
+	  echo "=== SOME RANKS FAILED ===";                                   \
+	  exit 1;                                                             \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
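
Both Makefiles fall back to deriving `LLVM_INSTALL_DIR` and `MLIR_AIR_INSTALL_DIR` from the location of a tool on `PATH`. A minimal sketch of that derivation follows, assuming the conventional `<prefix>/bin/<tool>` install layout. The function name `install_prefix_of` is hypothetical and is not part of the Makefiles.

```shell
#!/usr/bin/env bash
# Sketch of the PATH-based install-dir fallback. install_prefix_of is a
# hypothetical helper name.

# Print the install prefix of a tool found on PATH, assuming the tool
# lives at <prefix>/bin/<tool>. Returns nonzero if the tool is not found.
install_prefix_of() {
  local tool_path
  tool_path=$(command -v "$1") || return 1
  # First dirname strips the tool name, second strips "bin".
  dirname "$(dirname "$tool_path")"
}
```

With this helper, `install_prefix_of mlir-opt` would print the directory that contains `bin/mlir-opt`, and appending `/lib` gives the kind of path the Makefiles use for `LLVM_LIB_DIR`. Failing explicitly when the tool is missing differs from the Makefile, which silences errors and leaves the existence check to `check-preconditions`.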
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir b/test/gpu/multi_gpu/handwritten/allgather.mlir
similarity index 98%
rename from test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir
rename to test/gpu/multi_gpu/handwritten/allgather.mlir
index 120a9b46d..e067d7ed9 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir
+++ b/test/gpu/multi_gpu/handwritten/allgather.mlir
@@ -1,4 +1,4 @@
-//===- air_sym_handwritten_allgather.mlir - multi-GPU all-gather (cache line) ===//
+//===- handwritten/allgather.mlir - multi-GPU all-gather (cache-line) ----===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
@@ -10,9 +10,9 @@
 // rank ID, which determines both (a) the payload value and (b) which
 // slot of each peer's output buffer to write into.
 //
-// Sister file: air_sym_handwritten_cacheline.mlir is the producer/
-// consumer (1-to-1) version of the same cache-line atomicity mechanism.
-// This file generalizes it to a many-to-many collective.
+// Sister file: handwritten/cacheline.mlir is the producer/consumer
+// (1-to-1) version of the same cache-line atomicity mechanism. This
+// file generalizes it to a many-to-many collective.
 //
 // Layout
 // ======
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten_atomic.mlir b/test/gpu/multi_gpu/handwritten/atomic.mlir
similarity index 99%
rename from test/gpu/symmetric_heap_dma/air_sym_handwritten_atomic.mlir
rename to test/gpu/multi_gpu/handwritten/atomic.mlir
index a0743e60c..f6c54640a 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_handwritten_atomic.mlir
+++ b/test/gpu/multi_gpu/handwritten/atomic.mlir
@@ -1,4 +1,4 @@
-//===- air_sym_handwritten_atomic.mlir - multi-GPU e2e (atomic flag) ------===//
+//===- handwritten/atomic.mlir - multi-GPU e2e (atomic-flag variant) ------===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
@@ -6,7 +6,7 @@
 //===------------------------------------------------------------------===//
 //
 // Symmetric-heap producer/consumer e2e (WORLD_SIZE=2), atomic-flag variant.
-// Sister file: air_sym_handwritten_cacheline.mlir uses cache-line atomicity
+// Sister file: handwritten/cacheline.mlir uses cache-line atomicity
 // instead of LLVM atomics for the cross-rank handoff.
 //
 //   rank 0 launches @producer; rank 1 launches @consumer.
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten_cacheline.mlir b/test/gpu/multi_gpu/handwritten/cacheline.mlir
similarity index 98%
rename from test/gpu/symmetric_heap_dma/air_sym_handwritten_cacheline.mlir
rename to test/gpu/multi_gpu/handwritten/cacheline.mlir
index 5c65a6bd0..ee70b14e3 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_handwritten_cacheline.mlir
+++ b/test/gpu/multi_gpu/handwritten/cacheline.mlir
@@ -1,4 +1,4 @@
-//===- air_sym_handwritten_cacheline.mlir - multi-GPU e2e (cache line) ----===//
+//===- handwritten/cacheline.mlir - multi-GPU e2e (cache-line variant) ----===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
@@ -6,8 +6,8 @@
 //===------------------------------------------------------------------===//
 //
 // Symmetric-heap producer/consumer e2e (WORLD_SIZE=2), cache-line variant.
-// Sister file: air_sym_handwritten_atomic.mlir uses LLVM atomicrmw / atomic
-// load with syncscope("") for the cross-rank handoff.
+// Sister file: handwritten/atomic.mlir uses LLVM atomicrmw / atomic load
+// with syncscope("") for the cross-rank handoff.
 //
 //   rank 0 launches @producer; rank 1 launches @consumer.
 //
diff --git a/test/gpu/symmetric_heap_dma/run.sh b/test/gpu/symmetric_heap_dma/run.sh
deleted file mode 100755
index b99d9598c..000000000
--- a/test/gpu/symmetric_heap_dma/run.sh
+++ /dev/null
@@ -1,136 +0,0 @@
-#!/usr/bin/env bash
-#===- run.sh - Multi-process symmetric-heap DMA e2e test --*-
-#
-# Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
-# SPDX-License-Identifier: MIT
-#
-#===------------------------------------------------------------------===//
-#
-# Compile and run the hand-written symmetric-heap MLIR test as N processes.
-# Each process executes the full IR; processes coordinate via the symmetric
-# heap (XGMI peer-mapped VMem buffers).
-#
-# Usage: run.sh [num_ranks]   (default: 2)
-#
-# Required environment (auto-detected when sourced via env_setup_gpu.sh):
-#   MLIR_AIR_INSTALL_DIR  - path containing lib/libairgpu.so
-#   LLVM_INSTALL_DIR      - path containing bin/mlir-opt + lib/libmlir_*.so
-#
-
-set -e
-
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-NUM_RANKS=${1:-2}
-TMPDIR="${TMPDIR:-/tmp/air_sym_dma}"
-mkdir -p "$TMPDIR"
-
-# Cross-rank symmetric-heap test fundamentally requires a producer + a
-# consumer process. Refuse single-process launches loudly rather than
-# letting the kernel silently no-op or hang.
-if [ "$NUM_RANKS" -lt 2 ]; then
-  echo "ERROR: NUM_RANKS=$NUM_RANKS; this test requires >= 2 ranks (producer + consumer)." >&2
-  exit 1
-fi
-
-# Refuse to run if there aren't enough physically distinct GPUs for one
-# rank per GPU. Colocating ranks on a single GPU would make XGMI/peer-VA
-# transparently fall back to local memory and produce false-positive PASSes.
-if [ -n "${HIP_VISIBLE_DEVICES:-}" ]; then
-  NUM_GPUS=$(echo "$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
-else
-  NUM_GPUS=$(grep -l '^simd_count [1-9]' /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l)
-fi
-if [ "$NUM_GPUS" -lt "$NUM_RANKS" ]; then
-  echo "ERROR: need >= $NUM_RANKS GPUs to validate cross-rank XGMI traffic; found $NUM_GPUS." >&2
-  echo "       This test refuses to colocate ranks on a single GPU because it would" >&2
-  echo "       silently bypass the symmetric-heap path and report false PASSes." >&2
-  exit 1
-fi
-
-LLVM_LIB_DIR="${LLVM_INSTALL_DIR:-$(dirname "$(which mlir-opt)")/..}/lib"
-AIRGPU_LIB="${MLIR_AIR_INSTALL_DIR:-$(dirname "$(which air-opt)")/..}/lib/libairgpu.so"
-
-# Four parallel kernel-driven examples — same outer test harness:
-#   atomic    — producer/consumer (1-to-1), LLVM atomicrmw release /
-#               atomic load acquire with syncscope("") (= AMDGPUUsage
-#               System scope = cross-device). Spec-defined ordering
-#               contract; pinned by sym_atomic_syncscope.mlir.
-#   cacheline — producer/consumer (1-to-1), cache-line atomicity:
-#               producer writes 32 i32 (one 128-byte line) in a single
-#               vec store with the flag in-band at lane 31; consumer
-#               spins via gpu.shuffle of lane 31.
-#   allgather — many-to-many SIMD: every rank runs the SAME kernel and
-#               writes its slice into slot[my_rank] of every peer's
-#               output, then spins on each peer's slot. Cache-line
-#               atomicity (same mechanism as 'cacheline'), generalized.
-#   rank_cacheline — Phase 3: air.rank wrap of cacheline; lowered by
-#                    air-rank-to-mgpu before the GPU compilation chain.
-#   rank_allgather — Phase 3: air.rank wrap of allgather; same lowering.
-INPUT="${INPUT:-cacheline}"
-case "$INPUT" in
-  atomic|cacheline|allgather)
-    SRC_MLIR="$SCRIPT_DIR/air_sym_handwritten_${INPUT}.mlir"
-    echo "Step 1a: Expand air.translate ops ($INPUT variant)"
-    air-opt "$SRC_MLIR" --air-translate-to-llvm \
-        -o "$TMPDIR/sym_post_translate.mlir"
-    echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
-    mlir-opt "$TMPDIR/sym_post_translate.mlir" \
-        --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
-        -o "$TMPDIR/sym_lowered.mlir"
-    ;;
-  rank_cacheline|rank_allgather)
-    # High-level air.rank wrap of the cacheline / allgather handwritten
-    # test. Lower air.rank to mgpu* runtime + expand air.translate to
-    # memref rebase, then run the same GPU compilation chain as the
-    # corresponding handwritten variant.
-    SRC_MLIR="$SCRIPT_DIR/air_sym_with_${INPUT}.mlir"
-    echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($INPUT)"
-    air-opt "$SRC_MLIR" \
-        --air-rank-to-mgpu --air-translate-to-llvm \
-        -o "$TMPDIR/post_rank.mlir"
-    echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
-    mlir-opt "$TMPDIR/post_rank.mlir" \
-        --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
-        -o "$TMPDIR/sym_lowered.mlir"
-    ;;
-  *)
-    echo "Unknown INPUT=$INPUT; expected 'atomic', 'cacheline', 'allgather', 'rank_cacheline', or 'rank_allgather'" >&2
-    exit 1
-    ;;
-esac
-
-echo "Step 2: Run as ${NUM_RANKS} processes"
-export AIRGPU_JOB_ID="${AIRGPU_JOB_ID:-$$}"
-
-PIDS=()
-PASS=1
-
-for i in $(seq 0 $((NUM_RANKS - 1))); do
-  (set -o pipefail
-   # Pin each process to its own GPU at the OS / HIP-visibility level.
-   # mlir-runner's built-in gpu.launch_func handler (and any nested call
-   # into libmlir_rocm_runtime.so) only ever sees one device, so it can't
-   # accidentally launch on the wrong one. Every rank still sees device 0
-   # internally, so airgpu uses LOCAL_RANK=0.
-   RANK=$i WORLD_SIZE=$NUM_RANKS LOCAL_RANK=0 HIP_VISIBLE_DEVICES=$i \
-   mlir-runner --entry-point-result=void \
-       --shared-libs="$LLVM_LIB_DIR/libmlir_rocm_runtime.so" \
-       --shared-libs="$AIRGPU_LIB" \
-       --shared-libs="$LLVM_LIB_DIR/libmlir_runner_utils.so" \
-       --shared-libs="$LLVM_LIB_DIR/libmlir_c_runner_utils.so" \
-       "$TMPDIR/sym_lowered.mlir" 2>&1 | sed "s/^/[rank $i] /") &
-  PIDS+=($!)
-done
-
-for pid in "${PIDS[@]}"; do
-  if ! wait "$pid"; then
-    PASS=0
-  fi
-done
-
-if [ $PASS -eq 1 ]; then
-  echo "=== ALL ${NUM_RANKS} RANKS PASSED ==="
-else
-  echo "=== SOME RANKS FAILED ==="
-  exit 1
-fi