Merged
78 changes: 78 additions & 0 deletions test/gpu/multi_gpu/README.md
@@ -0,0 +1,78 @@
# `multi_gpu` — symmetric-heap multi-GPU end-to-end tests

End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches
N processes — one per physical GPU — that coordinate via the symmetric heap
(XGMI peer-mapped VMem buffers).

The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants
with FileCheck. The tests in this directory are the e2e counterparts: they
build through the full lowering chain and run on real hardware.

## Layout

Tests are organized by IR-abstraction level. Each subdirectory holds tests
written at one level. Lower levels (closer to LLVM dialect) are the lowering
targets that higher levels reduce to.

| Subdir | Phase | Abstraction added |
|---|---|---|
| `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. |
| `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. |
| `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. |
| `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. |
| `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. |

A higher-level test should produce — after running its phase's lowering pass
— IR functionally equivalent to one of the `handwritten/` references.

## Running

Each subdirectory has its own self-contained `Makefile`. There is no shared
include or sourced helper — duplication is intentional, so that each phase's
PR touches only its own subdir and there's no cross-phase coupling that can
rot.

Default invocation forks 2 processes:

    make -C test/gpu/multi_gpu/handwritten

Common knobs, set on the `make` command line:

    make -C test/gpu/multi_gpu/handwritten INPUT=cacheline   # default
    make -C test/gpu/multi_gpu/handwritten INPUT=atomic
    make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4
    make -C test/gpu/multi_gpu/handwritten clean

Each `Makefile` documents its own `INPUT` choices in the header comment.
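
The fork-and-wait pattern behind the default invocation can be sketched in plain shell. This is a simplified illustration under stated assumptions, not the Makefile's recipe: `run_rank` is a hypothetical stand-in for the real per-rank `mlir-runner` command, and `FAIL_RANK` is a hypothetical knob that exists only to simulate a failing rank.

```shell
# Hypothetical per-rank workload; the real tests run mlir-runner here.
run_rank() {
  echo "hello from rank $1 of $WORLD_SIZE"
  # Simulate a failure on the rank named by FAIL_RANK (hypothetical knob).
  [ "$1" != "${FAIL_RANK:-}" ]
}

# Fork one background subshell per rank, then wait for every one of them.
# Returns non-zero if any rank exited non-zero.
launch_ranks() {
  num_ranks=$1
  pids=""
  pass=1
  i=0
  while [ "$i" -lt "$num_ranks" ]; do
    # Each rank runs in its own subshell with its own RANK / WORLD_SIZE.
    ( export RANK="$i" WORLD_SIZE="$num_ranks"; run_rank "$i" ) &
    pids="$pids $!"
    i=$((i + 1))
  done
  for pid in $pids; do
    wait "$pid" || pass=0
  done
  [ "$pass" -eq 1 ]
}
```

Calling `launch_ranks 2` mirrors the default two-process launch. Waiting on every PID before reporting means one failing rank does not hide the output of the others.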

## Preconditions

Each `Makefile`'s `check-preconditions` target refuses to launch if either:

- `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs
a peer; a single-process launch has nothing to talk to.
- Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would
silently bypass XGMI/peer-VA (transparently falling back to local memory)
and report false-positive PASSes.
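
The two refusal conditions above can be sketched as a pair of shell functions. This is a minimal sketch, assuming the GPU count comes from a `HIP_VISIBLE_DEVICES`-style comma-separated list; the function names are hypothetical, and the sysfs-based count the Makefiles fall back to is not covered.

```shell
# Count entries in a comma-separated device list, ignoring empty fields.
count_visible_devices() {
  printf '%s\n' "$1" | tr ',' '\n' | grep -c .
}

# Succeed only when the launch is allowed: at least 2 ranks, and at least
# as many GPUs as ranks. Arguments: <num_ranks> <num_gpus>.
check_rank_count() {
  num_ranks=$1
  num_gpus=$2
  if [ "$num_ranks" -lt 2 ]; then
    echo "ERROR: NUM_RANKS=$num_ranks; requires >= 2 ranks." >&2
    return 1
  fi
  if [ "$num_gpus" -lt "$num_ranks" ]; then
    echo "ERROR: need >= $num_ranks GPUs; found $num_gpus." >&2
    return 1
  fi
}
```

For example, `check_rank_count 4 "$(count_visible_devices "0,1")"` fails, because four ranks cannot each get their own GPU from a two-device list.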

## Required environment

The Makefiles invoke `air-opt`, `mlir-opt`, and `mlir-runner` via `PATH`, and `mlir-runner` loads `libairgpu.so` and the `libmlir_*.so` runtime libraries through `--shared-libs`. There are three ways to satisfy both requirements:

1. **Source `utils/env_setup_gpu.sh`** (recommended) — sets `PATH`, `LD_LIBRARY_PATH`, `MLIR_AIR_INSTALL_DIR`, and `LLVM_INSTALL_DIR` in one go.
2. **Pass install dirs on the make command line**:
   ```
   make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=…
   ```
   (PATH must still contain the binaries; these vars only affect `--shared-libs` paths.)
3. **Have the binaries in `PATH` already** — the Makefile derives `LLVM_INSTALL_DIR` / `MLIR_AIR_INSTALL_DIR` from `dirname $(dirname $(command -v mlir-opt))` etc.

The `check-preconditions` target validates that the resolved `LLVM_LIB_DIR` and `AIRGPU_LIB` paths actually exist before launching, so a missing environment produces a clear error rather than a `dlopen` failure deep inside `mlir-runner`.
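
The `dirname $(dirname …)` derivation from option 3 can be tried in isolation. This is a minimal sketch: the function names and the `/opt/llvm` path in the usage note are hypothetical, chosen only for illustration.

```shell
# Strip the trailing "bin/<tool>" from a binary path to get the install prefix.
install_dir_from_binary() {
  dirname "$(dirname "$1")"
}

# Resolve a tool on PATH and print its install prefix.
# Fails (non-zero, no output) if the tool is not on PATH.
install_dir_for_tool() {
  tool_path=$(command -v "$1") || return 1
  install_dir_from_binary "$tool_path"
}
```

`install_dir_from_binary /opt/llvm/bin/mlir-opt` prints `/opt/llvm`. If the tool is absent from `PATH`, `install_dir_for_tool` returns non-zero instead of printing a bogus prefix, which is the situation the `check-preconditions` existence checks are meant to catch.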

## Why duplicated boilerplate per subdir

A shared `_common.mk` or `_common.sh` would let one phase's edit silently
break another phase's tests. The boilerplate is small (~30 lines of
preconditions + driver per Makefile) and stable — phases differ in their
compile pipeline, not in the multi-process driver. Duplication is the
cheaper failure mode.
149 changes: 149 additions & 0 deletions test/gpu/multi_gpu/air_rank/Makefile
@@ -0,0 +1,149 @@
# Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests.
#
# These tests express the multi-process world declaratively via
# `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu
# pass (Phase 3) replaces the air.rank op with body-inlined IR that
# resolves %rid from mgpuGetRank() at runtime and brackets the
# enclosing function with mgpuSymmetricHeapInit / Destroy.
#
# Each variant in this dir is a 1:1 wrap of the corresponding test in
# ../handwritten/. After lowering through air-rank-to-mgpu the IR is
# functionally equivalent to the handwritten reference.
#
# Variants (selected via INPUT):
# cacheline Wrap of ../handwritten/cacheline.mlir (producer/consumer,
# 1-to-1, cache-line atomicity).
# allgather Wrap of ../handwritten/allgather.mlir (many-to-many SIMD,
# cache-line atomicity).
#
# Usage:
# make # default: INPUT=cacheline NUM_RANKS=2
# make INPUT=allgather
# make NUM_RANKS=4
# make clean
#
# Required environment (set by sourcing utils/env_setup_gpu.sh, else derived from PATH):
# MLIR_AIR_INSTALL_DIR — path containing lib/libairgpu.so
# LLVM_INSTALL_DIR — path containing bin/mlir-opt + lib/libmlir_*.so
#
# This Makefile is intentionally self-contained — no included files, no
# sourced helpers. Other multi_gpu/<level>/ subdirs each have their own
# complete Makefile so that each phase's PR touches only its own dir.

SHELL := /bin/bash
.SHELLFLAGS := -eu -o pipefail -c

INPUT ?= cacheline
NUM_RANKS ?= 2
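# Default scratch dir. An inherited environment TMPDIR (often /tmp) is ignored
# so that `make clean` can never `rm -rf` a shared temp directory; override
# it on the command line (make TMPDIR=...) if needed.
ifneq ("$(origin TMPDIR)", "command line")
TMPDIR := /tmp/air_multi_gpu_air_rank
endif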

SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST)))))

# Derive install dirs from PATH if not explicitly provided. Matches the
# original run.sh fallback (`dirname $(dirname $(which mlir-opt))`).
LLVM_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null)
MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null)
LLVM_LIB_DIR ?= $(LLVM_INSTALL_DIR)/lib
AIRGPU_LIB ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so

ifeq ($(filter $(INPUT),cacheline allgather),)
$(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather')
endif

SRC_MLIR := $(SCRIPT_DIR)/$(INPUT).mlir
POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir
LOWERED := $(TMPDIR)/$(INPUT)_lowered.mlir

.PHONY: run clean check-preconditions
.DEFAULT_GOAL := run

$(TMPDIR):
	@mkdir -p $@

# Step 1a: lower air.rank to mgpu* runtime + expand air.translate.
$(POST_RANK): $(SRC_MLIR) | $(TMPDIR)
	@echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))"
	air-opt $< -air-rank-to-mgpu --air-translate-to-llvm -o $@

# Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same
# pipeline as ../handwritten/Makefile (the lowered output is structurally
# a superset of the corresponding handwritten test).
$(LOWERED): $(POST_RANK)
	@echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host"
	mlir-opt $< \
	  --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \
	  -o $@

# Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer
# physical GPUs than NUM_RANKS (would silently bypass XGMI and report
# false-positive PASSes by colocating ranks on one GPU), or if the
# install paths are missing (mlir-runner would fail at dlopen with a
# more cryptic message).
check-preconditions:
	@if [ ! -d "$(LLVM_LIB_DIR)" ]; then \
	  echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2; \
	  echo " Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR." \
	    >&2; \
	  exit 1; \
	fi
	@if [ ! -f "$(AIRGPU_LIB)" ]; then \
	  echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2; \
	  echo " Source utils/env_setup_gpu.sh or set" \
	    "MLIR_AIR_INSTALL_DIR." >&2; \
	  exit 1; \
	fi
	@if [ "$(NUM_RANKS)" -lt 2 ]; then \
	  echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +" \
	    "consumer)." >&2; \
	  exit 1; \
	fi
	@if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then \
	  NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .); \
	else \
	  NUM_GPUS=$$(grep -l '^simd_count [1-9]' \
	    /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l); \
	fi; \
	if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then \
	  echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI" \
	    "traffic; found $$NUM_GPUS." >&2; \
	  echo " This test refuses to colocate ranks on a single GPU" \
	    "because it would silently" >&2; \
	  echo " bypass the symmetric-heap path and report false PASSes." \
	    >&2; \
	  exit 1; \
	fi

# Step 2: fork NUM_RANKS processes, each pinned to its own GPU via
# HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any
# nested call into libmlir_rocm_runtime.so) only ever sees one device,
# so it can't accidentally launch on the wrong one. Every rank still
# sees device 0 internally, so airgpu uses LOCAL_RANK=0.
run: check-preconditions $(LOWERED)
	@echo "Step 2: Run as $(NUM_RANKS) processes"
	@export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}"; \
	PIDS=(); \
	PASS=1; \
	for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \
	  ( set -o pipefail; \
	    RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \
	    HIP_VISIBLE_DEVICES=$$i \
	    mlir-runner --entry-point-result=void \
	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so" \
	      --shared-libs="$(AIRGPU_LIB)" \
	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so" \
	      --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so" \
	      $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") & \
	  PIDS+=($$!); \
	done; \
	for pid in "$${PIDS[@]}"; do \
	  if ! wait "$$pid"; then PASS=0; fi; \
	done; \
	if [ $$PASS -eq 1 ]; then \
	  echo "=== ALL $(NUM_RANKS) RANKS PASSED ==="; \
	else \
	  echo "=== SOME RANKS FAILED ==="; \
	  exit 1; \
	fi

clean:
	rm -rf $(TMPDIR)
@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_allgather.mlir - air.rank wrap of allgather -----===//
+//===- air_rank/allgather.mlir - air.rank wrap of handwritten allgather --===//
//
// Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
// SPDX-License-Identifier: MIT
//
//===-----------------------------------------------------------------------===//
//
-// High-level version of air_sym_handwritten_allgather.mlir.
+// High-level version of handwritten/allgather.mlir.
//
// This file is a 1:1 wrap of the SIMD-across-ranks all-gather test inside
// an `air.rank` op:
@@ -20,9 +20,9 @@
// - mgpuSymmetricHeapDestroy before each func.return
//
// After lowering the IR is functionally equivalent to
-// air_sym_handwritten_allgather.mlir (same kernel, same launch dispatch,
-// same validation). Sister file: air_sym_with_rank_cacheline.mlir does
-// the analogous wrap of the producer/consumer cacheline test.
+// handwritten/allgather.mlir (same kernel, same launch dispatch, same
+// validation). Sister file: air_rank/cacheline.mlir does the analogous
+// wrap of the producer/consumer cacheline test.
//
// The kernel and helpers (gpu.module @sym_kernels, @wrap_bytes) are
// duplicated verbatim from the handwritten allgather. Only @main differs
@@ -33,7 +33,7 @@
// source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
// the handwritten allgather.
//
-// Launcher: run.sh with INPUT=rank_allgather forks 2 processes.
+// Launcher: `make INPUT=allgather` from this subdir forks 2 processes.
//
//===-----------------------------------------------------------------------===//

@@ -1,11 +1,11 @@
-//===- air_sym_with_rank_cacheline.mlir - air.rank wrap of cacheline -----===//
+//===- air_rank/cacheline.mlir - air.rank wrap of handwritten cacheline --===//
//
// Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
// SPDX-License-Identifier: MIT
//
//===-----------------------------------------------------------------------===//
//
-// High-level version of air_sym_handwritten_cacheline.mlir.
+// High-level version of handwritten/cacheline.mlir.
//
// This file is a 1:1 wrap of the cacheline producer/consumer test inside
// an `air.rank` op:
@@ -20,10 +20,10 @@
// - mgpuSymmetricHeapDestroy before each func.return
//
// After lowering the IR is functionally equivalent to
-// air_sym_handwritten_cacheline.mlir (same kernels, same launch
-// dispatch, same validation). This file's job is to demonstrate that
-// the user can write the multi-process world declaratively via air.rank
-// and have the pass produce the handwritten reference.
+// handwritten/cacheline.mlir (same kernels, same launch dispatch, same
+// validation). This file's job is to demonstrate that the user can
+// write the multi-process world declaratively via air.rank and have
+// the pass produce the handwritten reference.
//
// The kernels and helpers (gpu.module @sym_kernels, @wrap_bytes) are
// duplicated verbatim from the cacheline test. Only @main differs in
@@ -34,8 +34,8 @@
// source memref (see AIRTranslateToLLVMPass.cpp). Same constraint as
// the handwritten cacheline test.
//
-// Launcher: run.sh with INPUT=rank forks 2 processes. The
-// air-rank-to-mgpu pass converts air.rank to runtime dispatch.
+// Launcher: `make INPUT=cacheline` from this subdir forks 2 processes.
+// The air-rank-to-mgpu pass converts air.rank to runtime dispatch.
//
//===-----------------------------------------------------------------------===//
