diff --git a/test/gpu/multi_gpu/README.md b/test/gpu/multi_gpu/README.md
new file mode 100644
index 000000000..ecad1dee1
--- /dev/null
+++ b/test/gpu/multi_gpu/README.md
@@ -0,0 +1,78 @@
+# `multi_gpu` — symmetric-heap multi-GPU end-to-end tests
+
+End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches
+N processes — one per physical GPU — that coordinate via the symmetric heap
+(XGMI peer-mapped VMem buffers).
+
+The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants
+with FileCheck. The tests in this directory are the e2e counterparts: they
+build through the full lowering chain and run on real hardware.
+
+## Layout
+
+Tests are organized by IR-abstraction level. Each subdirectory holds tests
+written at one level. Lower levels (closer to LLVM dialect) are the lowering
+targets that higher levels reduce to.
+
+| Subdir | Phase | Abstraction added |
+|---|---|---|
+| `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. |
+| `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. |
+| `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. |
+| `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. |
+| `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. |
+
+A higher-level test should produce — after running its phase's lowering pass
+— IR functionally equivalent to one of the `handwritten/` references.
+
+## Running
+
+Each subdirectory has its own self-contained `Makefile`. There is no shared
+include or sourced helper — duplication is intentional, so that each phase's
+PR touches only its own subdir and there's no cross-phase coupling that can
+rot.
+
+Default invocation forks 2 processes:
+
+    make -C test/gpu/multi_gpu/handwritten
+
+Inside a subdirectory, common knobs:
+
+    make -C test/gpu/multi_gpu/handwritten INPUT=cacheline # default
+    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
+    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
+    make -C test/gpu/multi_gpu/handwritten clean
+
+Each `Makefile` documents its own `INPUT` choices in the header comment.
+
+## Preconditions
+
+Each `Makefile`'s `check-preconditions` target refuses to launch if either:
+
+- `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs
+  a peer; a single-process launch has nothing to talk to.
+- Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would
+  silently bypass XGMI/peer-VA (transparently falling back to local memory)
+  and report false-positive PASSes.
+
+## Required environment
+
+The Makefiles invoke `air-opt`, `mlir-opt`, and `mlir-runner` via PATH, plus dlopen `libairgpu.so` and the `libmlir_*.so` runtime libraries. There are three ways to satisfy this:
+
+1. **Source `utils/env_setup_gpu.sh`** (recommended) — sets `PATH`, `LD_LIBRARY_PATH`, `MLIR_AIR_INSTALL_DIR`, and `LLVM_INSTALL_DIR` in one go.
+2. **Pass install dirs on the make command line**:
+   ```
+   make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=…
+   ```
+   (PATH must still contain the binaries — these vars only affect `--shared-libs` paths.)
+3. **Have the binaries in `PATH` already** — the Makefile derives `LLVM_INSTALL_DIR` / `MLIR_AIR_INSTALL_DIR` from `dirname $(dirname $(command -v mlir-opt))` etc.
+
+The `check-preconditions` target validates that the resolved `LLVM_LIB_DIR` and `AIRGPU_LIB` paths actually exist before launching, so a missing env shows a clear error rather than a `dlopen` failure deep inside `mlir-runner`.
+
+## Why duplicated boilerplate per subdir
+
+A shared `_common.mk` or `_common.sh` would let one phase's edit silently
+break another phase's tests. The boilerplate is small (~30 lines of
+preconditions + driver per Makefile) and stable — phases differ in their
+compile pipeline, not in the multi-process driver. Duplication is the
+cheaper failure mode.
diff --git a/test/gpu/multi_gpu/air_rank/Makefile b/test/gpu/multi_gpu/air_rank/Makefile
new file mode 100644
index 000000000..6d9b4bd66
--- /dev/null
+++ b/test/gpu/multi_gpu/air_rank/Makefile
@@ -0,0 +1,149 @@
+# Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests.
+#
+# These tests express the multi-process world declaratively via
+# `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu
+# pass (Phase 3) replaces the air.rank op with body-inlined IR that
+# resolves %rid from mgpuGetRank() at runtime and brackets the
+# enclosing function with mgpuSymmetricHeapInit / Destroy.
+#
+# Each variant in this dir is a 1:1 wrap of the corresponding test in
+# ../handwritten/. After lowering through air-rank-to-mgpu the IR is
+# functionally equivalent to the handwritten reference.
+#
+# Variants (selected via INPUT):
+#   cacheline   Wrap of ../handwritten/cacheline.mlir (producer/consumer,
+#               1-to-1, cache-line atomicity).
+#   allgather   Wrap of ../handwritten/allgather.mlir (many-to-many SIMD,
+#               cache-line atomicity).
+#
+# Usage:
+#   make                  # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=allgather
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (auto-detected when sourced via env_setup_gpu.sh):
+#   MLIR_AIR_INSTALL_DIR — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/*/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT ?= cacheline
+NUM_RANKS ?= 2
+TMPDIR ?= /tmp/air_multi_gpu_air_rank
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+# Derive install dirs from PATH if not explicitly provided. Matches the
+# original run.sh fallback (`dirname $(dirname $(which mlir-opt))`).
+LLVM_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
+MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
+LLVM_LIB_DIR ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+ifeq ($(filter $(INPUT),cacheline allgather),)
+$(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather')
+endif
+
+SRC_MLIR := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir
+LOWERED := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: lower air.rank to mgpu* runtime + expand air.translate.
+$(POST_RANK): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))"
+	air-opt $< -air-rank-to-mgpu --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same
+# pipeline as ../handwritten/Makefile (the lowered output is structurally
+# a superset of the corresponding handwritten test).
+$(LOWERED): $(POST_RANK)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	  --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	  -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU), or if the
+# install paths are missing (mlir-runner would fail at dlopen with a
+# more cryptic message).
+check-preconditions:
+	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then \
+	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2; \
+	  echo " Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR." \
+	    >&2; \
+	  exit 1; \
+	fi
+	@if [ ! -f "$(AIRGPU_LIB)" ]; then \
+	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2; \
+	  echo " Source utils/env_setup_gpu.sh or set" \
+	    "MLIR_AIR_INSTALL_DIR." >&2; \
+	  exit 1; \
+	fi
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +" \
+	    "consumer)." >&2; \
+	  exit 1; \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .); \
+	else \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]' \
+	    /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l); \
+	fi; \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI" \
+	    "traffic; found $$NUM_GPUS." >&2; \
+	  echo " This test refuses to colocate ranks on a single GPU" \
+	    "because it would silently" >&2; \
+	  echo " bypass the symmetric-heap path and report false PASSes."
\
+	    >&2; \
+	  exit 1; \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
+# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
+# nested call into libmlir_rocm_runtime.so) only ever sees one device,
+# so it can't accidentally launch on the wrong one. Every rank still
+# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}"; \
+	PIDS=(); \
+	PASS=1; \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
+	  ( set -o pipefail; \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
+	    HIP_VISIBLE_DEVICES=$$i \
+	    mlir-runner --entry-point-result=void \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so" \
+	      --shared-libs="$(AIRGPU_LIB)" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so" \
+	      $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") & \
+	  PIDS+=($$!); \
+	done; \
+	for pid in "$${PIDS[@]}"; do \
+	  if ! wait "$$pid"; then PASS=0; fi; \
+	done; \
+	if [ $$PASS -eq 1 ]; then \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ==="; \
+	else \
+	  echo "=== SOME RANKS FAILED ==="; \
+	  exit 1; \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
diff --git a/test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir b/test/gpu/multi_gpu/air_rank/allgather.mlir
similarity index 97%
rename from test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir
rename to test/gpu/multi_gpu/air_rank/allgather.mlir
index 46861105c..5ae93d50d 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_with_rank_allgather.mlir
+++ b/test/gpu/multi_gpu/air_rank/allgather.mlir
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_allgather.mlir - air.rank wrap of allgather -----===//
+//===- air_rank/allgather.mlir - air.rank wrap of handwritten allgather --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_allgather.mlir.
+// High-level version of handwritten/allgather.mlir.
 //
 // This file is a 1:1 wrap of the SIMD-across-ranks all-gather test inside
 // an `air.rank` op:
@@ -20,9 +20,9 @@
 // - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_allgather.mlir (same kernel, same launch dispatch,
-// same validation). Sister file: air_sym_with_rank_cacheline.mlir does
-// the analogous wrap of the producer/consumer cacheline test.
+// handwritten/allgather.mlir (same kernel, same launch dispatch, same
+// validation). Sister file: air_rank/cacheline.mlir does the analogous
+// wrap of the producer/consumer cacheline test.
 //
 // The kernel and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the handwritten allgather. Only @main differs
@@ -33,7 +33,7 @@
 // source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
 // the handwritten allgather.
 //
-// Launcher: run.sh with INPUT=rank_allgather forks 2 processes.
+// Launcher: `make INPUT=allgather` from this subdir forks 2 processes.
 //
 //===-----------------------------------------------------------------------===//
diff --git a/test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir b/test/gpu/multi_gpu/air_rank/cacheline.mlir
similarity index 96%
rename from test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir
rename to test/gpu/multi_gpu/air_rank/cacheline.mlir
index e4cec3f8f..2950b0d0b 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_with_rank_cacheline.mlir
+++ b/test/gpu/multi_gpu/air_rank/cacheline.mlir
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_cacheline.mlir - air.rank wrap of cacheline -----===//
+//===- air_rank/cacheline.mlir - air.rank wrap of handwritten cacheline --===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
 //
 //===-----------------------------------------------------------------------===//
 //
-// High-level version of air_sym_handwritten_cacheline.mlir.
+// High-level version of handwritten/cacheline.mlir.
 //
 // This file is a 1:1 wrap of the cacheline producer/consumer test inside
 // an `air.rank` op:
@@ -20,10 +20,10 @@
 // - mgpuSymmetricHeapDestroy before each func.return
 //
 // After lowering the IR is functionally equivalent to
-// air_sym_handwritten_cacheline.mlir (same kernels, same launch
-// dispatch, same validation). This file's job is to demonstrate that
-// the user can write the multi-process world declaratively via air.rank
-// and have the pass produce the handwritten reference.
+// handwritten/cacheline.mlir (same kernels, same launch dispatch, same
+// validation). This file's job is to demonstrate that the user can
+// write the multi-process world declaratively via air.rank and have
+// the pass produce the handwritten reference.
 //
 // The kernels and helpers (gpu.module @sym_kernels, @wrap_bytes) are
 // duplicated verbatim from the cacheline test. Only @main differs in
@@ -34,8 +34,8 @@
 // source memref (see AIRTranslateToLLVMPass.cpp).
Same constraint as
 // the handwritten cacheline test.
 //
-// Launcher: run.sh with INPUT=rank forks 2 processes. The
-// air-rank-to-mgpu pass converts air.rank to runtime dispatch.
+// Launcher: `make INPUT=cacheline` from this subdir forks 2 processes.
+// The air-rank-to-mgpu pass converts air.rank to runtime dispatch.
 //
 //===-----------------------------------------------------------------------===//
diff --git a/test/gpu/multi_gpu/handwritten/Makefile b/test/gpu/multi_gpu/handwritten/Makefile
new file mode 100644
index 000000000..245592a71
--- /dev/null
+++ b/test/gpu/multi_gpu/handwritten/Makefile
@@ -0,0 +1,154 @@
+# Multi-process symmetric-heap multi-GPU e2e — hand-written reference.
+#
+# Compiles and runs the hand-written MLIR test as N processes. Each
+# process executes the full IR; processes coordinate via the symmetric
+# heap (XGMI peer-mapped VMem buffers).
+#
+# Variants (selected via INPUT):
+#   cacheline   Producer/consumer (1-to-1) using cache-line atomicity:
+#               producer writes 32 i32 (one 128-byte line) in a single vec
+#               store with the flag in-band at lane 31; consumer spins via
+#               gpu.shuffle of lane 31. Trades a spec-defined LLVM contract
+#               for a microarchitectural one (relies on the XGMI fabric
+#               publishing peer cache lines whole on gfx940 / MI300).
+#   atomic      Producer/consumer (1-to-1) using LLVM atomicrmw release /
+#               atomic load acquire with syncscope("") (= AMDGPUUsage
+#               System scope = cross-device). Spec-defined ordering
+#               contract; pinned by
+#               mlir/test/Conversion/AIRToROCDL/sym_atomic_syncscope.mlir.
+#   allgather   Many-to-many SIMD: every rank runs the SAME kernel and
+#               writes its slice into slot[my_rank] of every peer's output,
+#               then spins on each peer's slot. Cache-line atomicity (same
+#               mechanism as 'cacheline'), generalized.
+#
+# Usage:
+#   make                  # default: INPUT=cacheline NUM_RANKS=2
+#   make INPUT=atomic
+#   make NUM_RANKS=4
+#   make clean
+#
+# Required environment (auto-detected when sourced via env_setup_gpu.sh):
+#   MLIR_AIR_INSTALL_DIR — path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR — path containing bin/mlir-opt + lib/libmlir_*.so
+#
+# This Makefile is intentionally self-contained — no included files, no
+# sourced helpers. Other multi_gpu/*/ subdirs each have their own
+# complete Makefile so that each phase's PR touches only its own dir.
+
+SHELL := /bin/bash
+.SHELLFLAGS := -eu -o pipefail -c
+
+INPUT ?= cacheline
+NUM_RANKS ?= 2
+TMPDIR ?= /tmp/air_multi_gpu_handwritten
+
+SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))
+
+# Derive install dirs from PATH if not explicitly provided. Matches the
+# original run.sh fallback (`dirname $(dirname $(which mlir-opt))`).
+LLVM_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
+MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
+LLVM_LIB_DIR ?= $(LLVM_INSTALL_DIR)/lib
+AIRGPU_LIB ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so
+
+# Reject unknown INPUT at parse time so a typo errors immediately
+# instead of running through a half-broken pipeline.
+ifeq ($(filter $(INPUT),cacheline atomic allgather),)
+$(error Unknown INPUT=$(INPUT); expected 'cacheline', 'atomic', or 'allgather')
+endif
+
+SRC_MLIR := $(SCRIPT_DIR)/$(INPUT).mlir
+POST_TRANSLATE := $(TMPDIR)/$(INPUT)_post_translate.mlir
+LOWERED := $(TMPDIR)/$(INPUT)_lowered.mlir
+
+.PHONY: run clean check-preconditions
+.DEFAULT_GOAL := run
+
+$(TMPDIR):
+	@mkdir -p $@
+
+# Step 1a: expand air.translate ops to memref descriptor rebases.
+$(POST_TRANSLATE): $(SRC_MLIR) | $(TMPDIR)
+	@echo "Step 1a: Expand air.translate ops ($(INPUT) variant)"
+	air-opt $< --air-translate-to-llvm -o $@
+
+# Step 1b: compile gpu.module to AMDGPU binary + finalize host.
+$(LOWERED): $(POST_TRANSLATE)
+	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
+	mlir-opt $< \
+	  --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
+	  -o $@
+
+# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
+# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
+# false-positive PASSes by colocating ranks on one GPU), or if the
+# install paths are missing (mlir-runner would fail at dlopen with a
+# more cryptic message).
+check-preconditions:
+	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then \
+	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2; \
+	  echo " Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR." \
+	    >&2; \
+	  exit 1; \
+	fi
+	@if [ ! -f "$(AIRGPU_LIB)" ]; then \
+	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2; \
+	  echo " Source utils/env_setup_gpu.sh or set" \
+	    "MLIR_AIR_INSTALL_DIR." >&2; \
+	  exit 1; \
+	fi
+	@if [ "$(NUM_RANKS)" -lt 2 ]; then \
+	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +" \
+	    "consumer)." >&2; \
+	  exit 1; \
+	fi
+	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then \
+	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .); \
+	else \
+	  NUM_GPUS=$$(grep -l '^simd_count [1-9]' \
+	    /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l); \
+	fi; \
+	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then \
+	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI" \
+	    "traffic; found $$NUM_GPUS." >&2; \
+	  echo " This test refuses to colocate ranks on a single GPU" \
+	    "because it would silently" >&2; \
+	  echo " bypass the symmetric-heap path and report false PASSes."
\
+	    >&2; \
+	  exit 1; \
+	fi
+
+# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
+# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
+# nested call into libmlir_rocm_runtime.so) only ever sees one device,
+# so it can't accidentally launch on the wrong one. Every rank still
+# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
+run: check-preconditions $(LOWERED)
+	@echo "Step 2: Run as $(NUM_RANKS) processes"
+	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}"; \
+	PIDS=(); \
+	PASS=1; \
+	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
+	  ( set -o pipefail; \
+	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
+	    HIP_VISIBLE_DEVICES=$$i \
+	    mlir-runner --entry-point-result=void \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so" \
+	      --shared-libs="$(AIRGPU_LIB)" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so" \
+	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so" \
+	      $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") & \
+	  PIDS+=($$!); \
+	done; \
+	for pid in "$${PIDS[@]}"; do \
+	  if ! wait "$$pid"; then PASS=0; fi; \
+	done; \
+	if [ $$PASS -eq 1 ]; then \
+	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ==="; \
+	else \
+	  echo "=== SOME RANKS FAILED ==="; \
+	  exit 1; \
+	fi
+
+clean:
+	rm -rf $(TMPDIR)
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir b/test/gpu/multi_gpu/handwritten/allgather.mlir
similarity index 98%
rename from test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir
rename to test/gpu/multi_gpu/handwritten/allgather.mlir
index 120a9b46d..e067d7ed9 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_handwritten_allgather.mlir
+++ b/test/gpu/multi_gpu/handwritten/allgather.mlir
@@ -1,4 +1,4 @@
-//===- air_sym_handwritten_allgather.mlir - multi-GPU all-gather (cache line) ===//
+//===- handwritten/allgather.mlir - multi-GPU all-gather (cache-line) ----===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
@@ -10,9 +10,9 @@
 // rank ID, which determines both (a) the payload value and (b) which
 // slot of each peer's output buffer to write into.
 //
-// Sister file: air_sym_handwritten_cacheline.mlir is the producer/
-// consumer (1-to-1) version of the same cache-line atomicity mechanism.
-// This file generalizes it to a many-to-many collective.
+// Sister file: handwritten/cacheline.mlir is the producer/consumer
+// (1-to-1) version of the same cache-line atomicity mechanism. This
+// file generalizes it to a many-to-many collective.
 //
 // Layout
 // ======
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten_atomic.mlir b/test/gpu/multi_gpu/handwritten/atomic.mlir
similarity index 99%
rename from test/gpu/symmetric_heap_dma/air_sym_handwritten_atomic.mlir
rename to test/gpu/multi_gpu/handwritten/atomic.mlir
index a0743e60c..f6c54640a 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_handwritten_atomic.mlir
+++ b/test/gpu/multi_gpu/handwritten/atomic.mlir
@@ -1,4 +1,4 @@
-//===- air_sym_handwritten_atomic.mlir - multi-GPU e2e (atomic flag) ------===//
+//===- handwritten/atomic.mlir - multi-GPU e2e (atomic-flag variant) ------===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
@@ -6,7 +6,7 @@
 //===------------------------------------------------------------------===//
 //
 // Symmetric-heap producer/consumer e2e (WORLD_SIZE=2), atomic-flag variant.
-// Sister file: air_sym_handwritten_cacheline.mlir uses cache-line atomicity
+// Sister file: handwritten/cacheline.mlir uses cache-line atomicity
 // instead of LLVM atomics for the cross-rank handoff.
 //
 // rank 0 launches @producer; rank 1 launches @consumer.
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten_cacheline.mlir b/test/gpu/multi_gpu/handwritten/cacheline.mlir
similarity index 98%
rename from test/gpu/symmetric_heap_dma/air_sym_handwritten_cacheline.mlir
rename to test/gpu/multi_gpu/handwritten/cacheline.mlir
index 5c65a6bd0..ee70b14e3 100644
--- a/test/gpu/symmetric_heap_dma/air_sym_handwritten_cacheline.mlir
+++ b/test/gpu/multi_gpu/handwritten/cacheline.mlir
@@ -1,4 +1,4 @@
-//===- air_sym_handwritten_cacheline.mlir - multi-GPU e2e (cache line) ----===//
+//===- handwritten/cacheline.mlir - multi-GPU e2e (cache-line variant) ----===//
 //
 // Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
 // SPDX-License-Identifier: MIT
@@ -6,8 +6,8 @@
 //===------------------------------------------------------------------===//
 //
 // Symmetric-heap producer/consumer e2e (WORLD_SIZE=2), cache-line variant.
-// Sister file: air_sym_handwritten_atomic.mlir uses LLVM atomicrmw / atomic
-// load with syncscope("") for the cross-rank handoff.
+// Sister file: handwritten/atomic.mlir uses LLVM atomicrmw / atomic load
+// with syncscope("") for the cross-rank handoff.
 //
 // rank 0 launches @producer; rank 1 launches @consumer.
 //
diff --git a/test/gpu/symmetric_heap_dma/run.sh b/test/gpu/symmetric_heap_dma/run.sh
deleted file mode 100755
index b99d9598c..000000000
--- a/test/gpu/symmetric_heap_dma/run.sh
+++ /dev/null
@@ -1,136 +0,0 @@
-#!/usr/bin/env bash
-#===- run.sh - Multi-process symmetric-heap DMA e2e test --*-
-#
-# Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
-# SPDX-License-Identifier: MIT
-#
-#===------------------------------------------------------------------===//
-#
-# Compile and run the hand-written symmetric-heap MLIR test as N processes.
-# Each process executes the full IR; processes coordinate via the symmetric
-# heap (XGMI peer-mapped VMem buffers).
-#
-# Usage: run.sh [num_ranks] (default: 2)
-#
-# Required environment (auto-detected when sourced via env_setup_gpu.sh):
-#   MLIR_AIR_INSTALL_DIR - path containing lib/libairgpu.so
-#   LLVM_INSTALL_DIR - path containing bin/mlir-opt + lib/libmlir_*.so
-#
-
-set -e
-
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-NUM_RANKS=${1:-2}
-TMPDIR="${TMPDIR:-/tmp/air_sym_dma}"
-mkdir -p "$TMPDIR"
-
-# Cross-rank symmetric-heap test fundamentally requires a producer + a
-# consumer process. Refuse single-process launches loudly rather than
-# letting the kernel silently no-op or hang.
-if [ "$NUM_RANKS" -lt 2 ]; then
-  echo "ERROR: NUM_RANKS=$NUM_RANKS; this test requires >= 2 ranks (producer + consumer)." >&2
-  exit 1
-fi
-
-# Refuse to run if there aren't enough physically distinct GPUs for one
-# rank per GPU. Colocating ranks on a single GPU would make XGMI/peer-VA
-# transparently fall back to local memory and produce false-positive PASSes.
-if [ -n "${HIP_VISIBLE_DEVICES:-}" ]; then
-  NUM_GPUS=$(echo "$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
-else
-  NUM_GPUS=$(grep -l '^simd_count [1-9]' /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l)
-fi
-if [ "$NUM_GPUS" -lt "$NUM_RANKS" ]; then
-  echo "ERROR: need >= $NUM_RANKS GPUs to validate cross-rank XGMI traffic; found $NUM_GPUS." >&2
-  echo " This test refuses to colocate ranks on a single GPU because it would" >&2
-  echo " silently bypass the symmetric-heap path and report false PASSes." >&2
-  exit 1
-fi
-
-LLVM_LIB_DIR="${LLVM_INSTALL_DIR:-$(dirname "$(which mlir-opt)")/..}/lib"
-AIRGPU_LIB="${MLIR_AIR_INSTALL_DIR:-$(dirname "$(which air-opt)")/..}/lib/libairgpu.so"
-
-# Four parallel kernel-driven examples — same outer test harness:
-#   atomic — producer/consumer (1-to-1), LLVM atomicrmw release /
-#            atomic load acquire with syncscope("") (= AMDGPUUsage
-#            System scope = cross-device). Spec-defined ordering
-#            contract; pinned by sym_atomic_syncscope.mlir.
-#   cacheline — producer/consumer (1-to-1), cache-line atomicity:
-#            producer writes 32 i32 (one 128-byte line) in a single
-#            vec store with the flag in-band at lane 31; consumer
-#            spins via gpu.shuffle of lane 31.
-#   allgather — many-to-many SIMD: every rank runs the SAME kernel and
-#            writes its slice into slot[my_rank] of every peer's
-#            output, then spins on each peer's slot. Cache-line
-#            atomicity (same mechanism as 'cacheline'), generalized.
-#   rank_cacheline — Phase 3: air.rank wrap of cacheline; lowered by
-#            air-rank-to-mgpu before the GPU compilation chain.
-#   rank_allgather — Phase 3: air.rank wrap of allgather; same lowering.
-INPUT="${INPUT:-cacheline}"
-case "$INPUT" in
-  atomic|cacheline|allgather)
-    SRC_MLIR="$SCRIPT_DIR/air_sym_handwritten_${INPUT}.mlir"
-    echo "Step 1a: Expand air.translate ops ($INPUT variant)"
-    air-opt "$SRC_MLIR" --air-translate-to-llvm \
-      -o "$TMPDIR/sym_post_translate.mlir"
-    echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
-    mlir-opt "$TMPDIR/sym_post_translate.mlir" \
-      --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
-      -o "$TMPDIR/sym_lowered.mlir"
-    ;;
-  rank_cacheline|rank_allgather)
-    # High-level air.rank wrap of the cacheline / allgather handwritten
-    # test. Lower air.rank to mgpu* runtime + expand air.translate to
-    # memref rebase, then run the same GPU compilation chain as the
-    # corresponding handwritten variant.
-    SRC_MLIR="$SCRIPT_DIR/air_sym_with_${INPUT}.mlir"
-    echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($INPUT)"
-    air-opt "$SRC_MLIR" \
-      -air-rank-to-mgpu --air-translate-to-llvm \
-      -o "$TMPDIR/post_rank.mlir"
-    echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
-    mlir-opt "$TMPDIR/post_rank.mlir" \
-      --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
-      -o "$TMPDIR/sym_lowered.mlir"
-    ;;
-  *)
-    echo "Unknown INPUT=$INPUT; expected 'atomic', 'cacheline', 'allgather', 'rank_cacheline', or 'rank_allgather'" >&2
-    exit 1
-    ;;
-esac
-
-echo "Step 2: Run as ${NUM_RANKS} processes"
-export AIRGPU_JOB_ID="${AIRGPU_JOB_ID:-$$}"
-
-PIDS=()
-PASS=1
-
-for i in $(seq 0 $((NUM_RANKS - 1))); do
-  (set -o pipefail
-   # Pin each process to its own GPU at the OS / HIP-visibility level.
-   # mlir-runner's built-in gpu.launch_func handler (and any nested call
-   # into libmlir_rocm_runtime.so) only ever sees one device, so it can't
-   # accidentally launch on the wrong one. Every rank still sees device 0
-   # internally, so airgpu uses LOCAL_RANK=0.
-   RANK=$i WORLD_SIZE=$NUM_RANKS LOCAL_RANK=0 HIP_VISIBLE_DEVICES=$i \
-   mlir-runner --entry-point-result=void \
-     --shared-libs="$LLVM_LIB_DIR/libmlir_rocm_runtime.so" \
-     --shared-libs="$AIRGPU_LIB" \
-     --shared-libs="$LLVM_LIB_DIR/libmlir_runner_utils.so" \
-     --shared-libs="$LLVM_LIB_DIR/libmlir_c_runner_utils.so" \
-     "$TMPDIR/sym_lowered.mlir" 2>&1 | sed "s/^/[rank $i] /") &
-  PIDS+=($!)
-done
-
-for pid in "${PIDS[@]}"; do
-  if ! wait "$pid"; then
-    PASS=0
-  fi
-done
-
-if [ $PASS -eq 1 ]; then
-  echo "=== ALL ${NUM_RANKS} RANKS PASSED ==="
-else
-  echo "=== SOME RANKS FAILED ==="
-  exit 1
-fi