[multi-gpu] restructure tests: rename symmetric_heap_dma → multi_gpu, group by IR level #1613
Merged: erwei-xilinx merged 1 commit into Xilinx:main from erwei-xilinx:multigpu-test-restructure on May 12, 2026
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| # `multi_gpu` — symmetric-heap multi-GPU end-to-end tests | ||
|
|
||
| End-to-end tests for the symmetric-heap multi-GPU stack. Each test launches | ||
| N processes — one per physical GPU — that coordinate via the symmetric heap | ||
| (XGMI peer-mapped VMem buffers). | ||
|
|
||
| The `mlir/test/Conversion/AIR*ToMgpu/` lit tests pin pass-level invariants | ||
| with FileCheck. The tests in this directory are the e2e counterparts: they | ||
| build through the full lowering chain and run on real hardware. | ||
|
|
||
| ## Layout | ||
|
|
||
| Tests are organized by IR-abstraction level. Each subdirectory holds tests | ||
| written at one level. Lower levels (closer to LLVM dialect) are the lowering | ||
| targets that higher levels reduce to. | ||
|
|
||
| | Subdir | Phase | Abstraction added | | ||
| |---|---|---| | ||
| | `handwritten/` | 2 | none — raw MLIR with hand-written GPU kernels and direct `mgpuSymmetricAlloc` / `mgpuGetRank` calls. The reference target. Variants: `cacheline`, `atomic`, `allgather`. | | ||
| | `air_rank/` | 3 | `air.rank` declares the multi-process world; replaces hand-written `mgpuGetRank` / heap init/destroy plumbing. Lowered by `air-rank-to-mgpu`. Variants: `cacheline`, `allgather` — each a 1:1 wrap of the corresponding `handwritten/` test. | | ||
| | `air_alloc/` | 4 (TBD) | `memref.alloc {air.symmetric}` declares symmetric-heap allocations. Lowered by `air-symmetric-alloc-to-mgpu`. | | ||
| | `air_dma/` | 5 (TBD) | `air.dma_memcpy_nd {src_rank/dst_rank}` declares cross-rank DMAs. Lowered by `air-cross-rank-dma-to-mgpu`. | | ||
| | `air_channel/` | 6 (TBD) | `air.channel {channel_type = "gpu_symmetric_heap"}` declares cross-rank channels. Lowered by `air-gpu-channel-to-mgpu`. | | ||
|
|
||
| A higher-level test should produce — after running its phase's lowering pass | ||
| — IR functionally equivalent to one of the `handwritten/` references. | ||
|
|
||
| ## Running | ||
|
|
||
| Each subdirectory has its own self-contained `Makefile`. There is no shared | ||
| include or sourced helper — duplication is intentional, so that each phase's | ||
| PR touches only its own subdir and there's no cross-phase coupling that can | ||
| rot. | ||
|
|
||
| Default invocation forks 2 processes: | ||
|
|
||
| make -C test/gpu/multi_gpu/handwritten | ||
|
|
||
| Common knobs for a subdirectory's `Makefile`: | ||
|
|
||
| make -C test/gpu/multi_gpu/handwritten INPUT=cacheline # default | ||
| make -C test/gpu/multi_gpu/handwritten INPUT=atomic | ||
| make -C test/gpu/multi_gpu/handwritten NUM_RANKS=4 | ||
| make -C test/gpu/multi_gpu/handwritten clean | ||
|
|
||
| Each `Makefile` documents its own `INPUT` choices in the header comment. | ||
|
|
||
| ## Preconditions | ||
|
|
||
| Each `Makefile`'s `check-preconditions` target refuses to launch if either: | ||
|
|
||
| - `NUM_RANKS < 2` — the cross-rank symmetric-heap test fundamentally needs | ||
| a peer; a single-process launch has nothing to talk to. | ||
| - Fewer physical GPUs than `NUM_RANKS` — colocating ranks on one GPU would | ||
| silently bypass XGMI/peer-VA (transparently falling back to local memory) | ||
| and report false-positive PASSes. | ||
|
|
||
| ## Required environment | ||
|
|
||
| The Makefiles invoke `air-opt`, `mlir-opt`, and `mlir-runner` via PATH, and `mlir-runner` loads `libairgpu.so` and the `libmlir_*.so` runtime libraries through `--shared-libs`. There are three ways to satisfy this: | ||
|
|
||
| 1. **Source `utils/env_setup_gpu.sh`** (recommended) — sets `PATH`, `LD_LIBRARY_PATH`, `MLIR_AIR_INSTALL_DIR`, and `LLVM_INSTALL_DIR` in one go. | ||
| 2. **Pass install dirs on the make command line**: | ||
| ``` | ||
| make MLIR_AIR_INSTALL_DIR=… LLVM_INSTALL_DIR=… | ||
| ``` | ||
| (PATH must still contain the binaries — these vars only affect `--shared-libs` paths.) | ||
| 3. **Have the binaries in `PATH` already**: the Makefile derives `LLVM_INSTALL_DIR` from `dirname $(dirname $(command -v mlir-opt))` and `MLIR_AIR_INSTALL_DIR` from the same expression applied to `air-opt`. | ||
|
|
||
| The `check-preconditions` target validates that the resolved `LLVM_LIB_DIR` and `AIRGPU_LIB` paths actually exist before launching, so a missing env shows a clear error rather than a `dlopen` failure deep inside `mlir-runner`. | ||
|
|
||
| ## Why duplicated boilerplate per subdir | ||
|
|
||
| A shared `_common.mk` or `_common.sh` would let one phase's edit silently | ||
| break another phase's tests. The boilerplate is small (~30 lines of | ||
| preconditions + driver per Makefile) and stable — phases differ in their | ||
| compile pipeline, not in the multi-process driver. Duplication is the | ||
| cheaper failure mode. |
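
The two launch preconditions described in the README can be sketched as standalone shell functions. This is only an illustration, not the code in the Makefiles: `count_listed_gpus` and `check_ranks` are hypothetical names, and the sketch covers only the case where the GPU count comes from a comma-separated `HIP_VISIBLE_DEVICES`-style list (the Makefile also falls back to the KFD topology files when that variable is unset).

```shell
#!/bin/sh
# Hypothetical helpers illustrating the README's preconditions.

# Count the devices named in a comma-separated list such as "0,1,3".
# Empty entries (for example a trailing comma) are ignored.
count_listed_gpus() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    printf '%s\n' "$1" | tr ',' '\n' | grep -c . || true
}

# Return 0 only when a launch with num_ranks ranks may proceed.
check_ranks() {
    num_ranks=$1
    num_gpus=$2
    if [ "$num_ranks" -lt 2 ]; then
        echo "ERROR: NUM_RANKS=$num_ranks; requires >= 2 ranks." >&2
        return 1
    fi
    if [ "$num_gpus" -lt "$num_ranks" ]; then
        echo "ERROR: need >= $num_ranks GPUs; found $num_gpus." >&2
        return 1
    fi
    return 0
}
```

A caller might combine the two as `check_ranks "$NUM_RANKS" "$(count_listed_gpus "$HIP_VISIBLE_DEVICES")"` and abort the launch on a non-zero status.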
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,149 @@ | ||
| # Multi-process symmetric-heap multi-GPU e2e — air.rank wrapped tests. | ||
| # | ||
| # These tests express the multi-process world declaratively via | ||
| # `air.rank (%rid) in (%rsize = %c2) { ... }`. The air-rank-to-mgpu | ||
| # pass (Phase 3) replaces the air.rank op with body-inlined IR that | ||
| # resolves %rid from mgpuGetRank() at runtime and brackets the | ||
| # enclosing function with mgpuSymmetricHeapInit / Destroy. | ||
| # | ||
| # Each variant in this dir is a 1:1 wrap of the corresponding test in | ||
| # ../handwritten/. After lowering through air-rank-to-mgpu the IR is | ||
| # functionally equivalent to the handwritten reference. | ||
| # | ||
| # Variants (selected via INPUT): | ||
| # cacheline Wrap of ../handwritten/cacheline.mlir (producer/consumer, | ||
| # 1-to-1, cache-line atomicity). | ||
| # allgather Wrap of ../handwritten/allgather.mlir (many-to-many SIMD, | ||
| # cache-line atomicity). | ||
| # | ||
| # Usage: | ||
| # make # default: INPUT=cacheline NUM_RANKS=2 | ||
| # make INPUT=allgather | ||
| # make NUM_RANKS=4 | ||
| # make clean | ||
| # | ||
| # Required environment (auto-detected when sourced via env_setup_gpu.sh): | ||
| # MLIR_AIR_INSTALL_DIR — path containing lib/libairgpu.so | ||
| # LLVM_INSTALL_DIR — path containing bin/mlir-opt + lib/libmlir_*.so | ||
| # | ||
| # This Makefile is intentionally self-contained — no included files, no | ||
| # sourced helpers. Other multi_gpu/<level>/ subdirs each have their own | ||
| # complete Makefile so that each phase's PR touches only its own dir. | ||
|
|
||
| SHELL := /bin/bash | ||
| .SHELLFLAGS := -eu -o pipefail -c | ||
|
|
||
| INPUT ?= cacheline | ||
| NUM_RANKS ?= 2 | ||
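| # Assigned with ':=' so that a TMPDIR exported in the environment (e.g. /tmp) | ||
| # cannot become the target of 'rm -rf' in clean; 'make TMPDIR=...' still overrides. | ||
| TMPDIR := /tmp/air_multi_gpu_air_rank | ||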
|
|
||
| SCRIPT_DIR := $(patsubst %/,%,$(dir $(realpath $(firstword $(MAKEFILE_LIST))))) | ||
|
|
||
| # Derive install dirs from PATH if not explicitly provided. Matches the | ||
| # original run.sh fallback (`dirname $(dirname $(which mlir-opt))`). | ||
| LLVM_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v mlir-opt)")" 2>/dev/null) | ||
| MLIR_AIR_INSTALL_DIR ?= $(shell dirname "$$(dirname "$$(command -v air-opt)")" 2>/dev/null) | ||
| LLVM_LIB_DIR ?= $(LLVM_INSTALL_DIR)/lib | ||
| AIRGPU_LIB ?= $(MLIR_AIR_INSTALL_DIR)/lib/libairgpu.so | ||
|
|
||
| ifeq ($(filter $(INPUT),cacheline allgather),) | ||
| $(error Unknown INPUT=$(INPUT); expected 'cacheline' or 'allgather') | ||
| endif | ||
|
|
||
| SRC_MLIR := $(SCRIPT_DIR)/$(INPUT).mlir | ||
| POST_RANK := $(TMPDIR)/$(INPUT)_post_rank.mlir | ||
| LOWERED := $(TMPDIR)/$(INPUT)_lowered.mlir | ||
|
|
||
| .PHONY: run clean check-preconditions | ||
| .DEFAULT_GOAL := run | ||
|
|
||
| $(TMPDIR): | ||
| @mkdir -p $@ | ||
|
|
||
| # Step 1a: lower air.rank to mgpu* runtime + expand air.translate. | ||
| $(POST_RANK): $(SRC_MLIR) | $(TMPDIR) | ||
| @echo "Step 1a: Lower air.rank to mgpu* + expand air.translate ($(INPUT))" | ||
| air-opt $< -air-rank-to-mgpu --air-translate-to-llvm -o $@ | ||
|
|
||
| # Step 1b: compile gpu.module to AMDGPU binary + finalize host. Same | ||
| # pipeline as ../handwritten/Makefile (the lowered output is structurally | ||
| # a superset of the corresponding handwritten test). | ||
| $(LOWERED): $(POST_RANK) | ||
| @echo "Step 1b: Compile gpu.module to AMDGPU binary + finalize host" | ||
| mlir-opt $< \ | ||
| --pass-pipeline='builtin.module(rocdl-attach-target{chip=gfx942 O=3},gpu.module(convert-scf-to-cf,convert-gpu-to-rocdl{chipset=gfx942 runtime=HIP},reconcile-unrealized-casts),gpu-module-to-binary,func.func(gpu-async-region,convert-scf-to-cf),gpu-to-llvm,convert-to-llvm,reconcile-unrealized-casts)' \ | ||
| -o $@ | ||
|
|
||
| # Refuse to launch if NUM_RANKS < 2 (no peer to talk to), if fewer | ||
| # physical GPUs than NUM_RANKS (would silently bypass XGMI and report | ||
| # false-positive PASSes by colocating ranks on one GPU), or if the | ||
| # install paths are missing (mlir-runner would fail at dlopen with a | ||
| # more cryptic message). | ||
| check-preconditions: | ||
| @if [ ! -d "$(LLVM_LIB_DIR)" ]; then \ | ||
| echo "ERROR: LLVM_LIB_DIR=$(LLVM_LIB_DIR) does not exist." >&2; \ | ||
| echo " Source utils/env_setup_gpu.sh or set LLVM_INSTALL_DIR." \ | ||
| >&2; \ | ||
| exit 1; \ | ||
| fi | ||
| @if [ ! -f "$(AIRGPU_LIB)" ]; then \ | ||
| echo "ERROR: AIRGPU_LIB=$(AIRGPU_LIB) does not exist." >&2; \ | ||
| echo " Source utils/env_setup_gpu.sh or set" \ | ||
| "MLIR_AIR_INSTALL_DIR." >&2; \ | ||
| exit 1; \ | ||
| fi | ||
| @if [ "$(NUM_RANKS)" -lt 2 ]; then \ | ||
| echo "ERROR: NUM_RANKS=$(NUM_RANKS); requires >= 2 ranks (producer +" \ | ||
| "consumer)." >&2; \ | ||
| exit 1; \ | ||
| fi | ||
| @if [ -n "$${HIP_VISIBLE_DEVICES:-}" ]; then \ | ||
| NUM_GPUS=$$(echo "$$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .); \ | ||
| else \ | ||
| NUM_GPUS=$$( { grep -l '^simd_count [1-9]' \ | ||
| /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null || true; } | wc -l); \ | ||
| fi; \ | ||
| if [ "$$NUM_GPUS" -lt "$(NUM_RANKS)" ]; then \ | ||
| echo "ERROR: need >= $(NUM_RANKS) GPUs to validate cross-rank XGMI" \ | ||
| "traffic; found $$NUM_GPUS." >&2; \ | ||
| echo " This test refuses to colocate ranks on a single GPU" \ | ||
| "because it would silently" >&2; \ | ||
| echo " bypass the symmetric-heap path and report false PASSes." \ | ||
| >&2; \ | ||
| exit 1; \ | ||
| fi | ||
|
|
||
| # Step 2: fork NUM_RANKS processes, each pinned to its own GPU via | ||
| # HIP_VISIBLE_DEVICES. mlir-runner's gpu.launch_func handler (and any | ||
| # nested call into libmlir_rocm_runtime.so) only ever sees one device, | ||
| # so it can't accidentally launch on the wrong one. Every rank still | ||
| # sees device 0 internally, so airgpu uses LOCAL_RANK=0. | ||
| run: check-preconditions $(LOWERED) | ||
| @echo "Step 2: Run as $(NUM_RANKS) processes" | ||
| @export AIRGPU_JOB_ID="$${AIRGPU_JOB_ID:-$$$$}"; \ | ||
| PIDS=(); \ | ||
| PASS=1; \ | ||
| for i in $$(seq 0 $$(($(NUM_RANKS) - 1))); do \ | ||
| ( set -o pipefail; \ | ||
| RANK=$$i WORLD_SIZE=$(NUM_RANKS) LOCAL_RANK=0 \ | ||
| HIP_VISIBLE_DEVICES=$$i \ | ||
| mlir-runner --entry-point-result=void \ | ||
| --shared-libs="$(LLVM_LIB_DIR)/libmlir_rocm_runtime.so" \ | ||
| --shared-libs="$(AIRGPU_LIB)" \ | ||
| --shared-libs="$(LLVM_LIB_DIR)/libmlir_runner_utils.so" \ | ||
| --shared-libs="$(LLVM_LIB_DIR)/libmlir_c_runner_utils.so" \ | ||
| $(LOWERED) 2>&1 | sed "s/^/[rank $$i] /") & \ | ||
| PIDS+=($$!); \ | ||
| done; \ | ||
| for pid in "$${PIDS[@]}"; do \ | ||
| if ! wait "$$pid"; then PASS=0; fi; \ | ||
| done; \ | ||
| if [ $$PASS -eq 1 ]; then \ | ||
| echo "=== ALL $(NUM_RANKS) RANKS PASSED ==="; \ | ||
| else \ | ||
| echo "=== SOME RANKS FAILED ==="; \ | ||
| exit 1; \ | ||
| fi | ||
|
|
||
| clean: | ||
| rm -rf $(TMPDIR) | ||
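
The fork-and-wait driver in the `run` target can be tried without GPUs by substituting any command for `mlir-runner`. The sketch below is an assumption-laden simplification, not the Makefile's recipe: `run_ranks` is a hypothetical name, the per-rank `sed` output prefix is omitted, and the command passed in stands in for the real `mlir-runner` invocation.

```shell
#!/bin/sh
# Hypothetical stand-in for the Makefile's "run" recipe.
# Usage: run_ranks NUM_RANKS CMD [ARGS...]
# Starts CMD once per rank in the background and returns non-zero
# if any rank exits with a failure.
run_ranks() {
    rr_num_ranks=$1
    shift
    rr_pids=""
    rr_rank=0
    while [ "$rr_rank" -lt "$rr_num_ranks" ]; do
        # Each child gets its own RANK and sees a single device,
        # so LOCAL_RANK is always 0 (as in the Makefile).
        RANK=$rr_rank WORLD_SIZE=$rr_num_ranks LOCAL_RANK=0 \
        HIP_VISIBLE_DEVICES=$rr_rank "$@" &
        rr_pids="$rr_pids $!"
        rr_rank=$((rr_rank + 1))
    done
    rr_status=0
    # Wait for every child, even after a failure, so none is left running.
    for rr_pid in $rr_pids; do
        wait "$rr_pid" || rr_status=1
    done
    return "$rr_status"
}
```

For example, `run_ranks 2 sh -c 'echo "rank $RANK of $WORLD_SIZE"'` starts two child shells, each with its own `RANK` value.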