Commit a1ee757

erwei-xilinx and claude committed
[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU
Before writing any lowering pass, prove the symmetric-heap runtime works end-to-end from MLIR by hand-writing the IR that future passes should emit. This locks down the lowered shape, surfaces ABI gaps early, and provides a reference oracle for diff-testing the upcoming air-rank-to-mgpu / cross-rank-DMA / channel-on-GPU passes.

## Files

- `test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir` — hand-written reference IR. Each rank: init heap, alloc symmetric buffer, fill with (rank+1).0, barrier, read peer's buffer via `mgpuGetHeapBases()[peer]`, D2D into local copy, D2H readback, verify, print PASS/FAIL.
- `test/gpu/symmetric_heap_dma/run.sh` — driver that lowers the IR with `mlir-opt`, then forks N processes with RANK/WORLD_SIZE/LOCAL_RANK env vars set and runs `mlir-runner` in each. `SHARE_GPU=1` env makes all ranks share GPU 0 for testing on single-GPU hosts.

## Validation

- ✅ Verified end-to-end on rad-mi300a-sh5-1 (1×MI300A, ROCm 7.1.1) with `SHARE_GPU=1` and 2 ranks: rank 0 sees `2.0` from rank 1, rank 1 sees `1.0` from rank 0.
- ⚠️ rad-mi300x-1 (8×MI300X, ROCm 6.4.0) hits a runtime-side crash inside libamdhip64.so during `establishPeerAccess()`. Same crash reproduces with the existing C++ baseline `test/gpu/test_symmetric_heap.cpp` — pre-existing runtime/HIP issue unrelated to this change.

## Findings

No runtime ABI gaps for Phases 3-7. The full lowering pipeline can be built using only existing exports: `mgpuSymmetricHeapInit/Destroy`, `mgpuGetRank/WorldSize`, `mgpuSymmetricAlloc/Free`, `mgpuGetHeapBases`, `mgpuBarrier`, `mgpuMemcpy` (D2D for cross-rank reads — direct kernel read from peer-VA isn't supported on some chipsets, so D2D-to-local-then-read is the required pattern).

`docs/MultiGPUPlan.md` updated with Phase 2 status section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
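The validation numbers in the commit message follow from how the test picks its peer. As a minimal sketch (not part of the commit), assuming each rank reads from `(rank + 1) % world_size` and each buffer holds `rank + 1`, as the hand-written IR does, the expected readback per rank can be tabulated in shell. The function name `expected_from_peer` is hypothetical and exists only for this illustration.

```bash
#!/usr/bin/env bash
# Sketch: which value each rank should read back from its peer.
# Assumes peer = (rank + 1) % world and that rank r fills its buffer with r + 1.
expected_from_peer() {
  local rank=$1 world=$2
  local peer=$(( (rank + 1) % world ))
  echo "rank ${rank} reads peer ${peer}, expects $(( peer + 1 )).0"
}

# Two ranks, as in the SHARE_GPU=1 validation run.
for r in 0 1; do
  expected_from_peer "$r" 2
done
```

With two ranks this reproduces the reported result: rank 0 expects `2.0` and rank 1 expects `1.0`.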
1 parent abbc586 commit a1ee757

3 files changed

Lines changed: 325 additions & 13 deletions

docs/MultiGPUPlan.md

Lines changed: 66 additions & 13 deletions
@@ -97,15 +97,64 @@ LOCAL_RANK=i` set, all loading the same compiled binary linked against `libairgp

 ## Implementation Phases

-### Phase 1 — Op extensions
-- Add `channel_type = "symmetric_heap"` to `ChannelOp` verifier
-  (`AIRDialect.cpp:3296`)
-- Add optional `src_rank` / `dst_rank` operand or attribute to
+### Phase 1 — Op extensions ✅ (landed in PR #1576)
+- Add `channel_type = "gpu_symmetric_heap"` to `ChannelOp` verifier
+  (`AIRDialect.cpp:3296`); rename existing values to `npu_*`
+- Add optional `src_rank` / `dst_rank` integer attributes to
   `air.dma_memcpy_nd` (`AIR.td:458-501`)
-- Add `air.symmetric` memref attribute / address-space tag
+- Add `air.symmetric` memref attribute convention
 - Document semantics in `docs/AIRComputeModel.md`

-### Phase 2 — `air-rank-to-mgpu` pass
+### Phase 2 — Hand-written reference IR + e2e validation 🆕
+
+**Rationale:** before writing any lowering pass, prove the runtime works
+end-to-end from MLIR by writing the lowered IR by hand. This:
+1. Locks down the **exact** lowered shape that future passes must produce
+2. Surfaces any runtime ABI gaps **before** sinking time into a pass that
+   targets a broken target
+3. Provides a reference oracle for diff-testing the lowering passes
+
+Add `test/gpu/symmetric_heap_dma/`:
+- `air_sym_handwritten.mlir` — already-lowered IR (no `air.rank` /
+  `air.symmetric` / cross-rank DMA — just direct `func.call` to the
+  `mgpu*` runtime ABI plus inline pointer arithmetic for cross-rank
+  load/store)
+- `run.sh` — driver that forks N processes with `RANK` / `WORLD_SIZE` /
+  `LOCAL_RANK` set, runs `mlir-runner` in each, waits for all to finish
+
+**Behavior under test:**
+1. `mgpuSymmetricHeapInit(heap_size)`
+2. `mgpuSymmetricAlloc(size, stream)` → symmetric buffer at offset O
+3. Each rank writes its rank value to its own buffer
+4. `mgpuBarrier()`
+5. Each rank reads from peer's buffer via `mgpuGetHeapBases()[peer] + O`
+6. Verify expected value, print PASS/FAIL
+7. `mgpuSymmetricHeapDestroy()`
+
+If any step requires an ABI extension (e.g., per-channel notify-flag for
+`gpu_symmetric_heap` channel synchronization beyond the global barrier),
+add it to `runtime_lib/airgpu/gpu_runtime.cpp` and validate first via the
+existing C++ test (`test/gpu/test_symmetric_heap.cpp`) before exposing it
+to MLIR.
+
+#### Phase 2 status
+
+- ✅ **Single-GPU multi-process** (`SHARE_GPU=1`) verified end-to-end on
+  `rad-mi300a-sh5-1` (1× MI300A, ROCm 7.1.1): both ranks PASS — rank 0 reads
+  rank 1's buffer (sees `2.0`), rank 1 reads rank 0's (sees `1.0`).
+- ⚠️ **Multi-GPU multi-process** on `rad-mi300x-1` (8× MI300X, ROCm 6.4.0)
+  is currently blocked by a runtime-side crash inside `libamdhip64.so`
+  during `SymmetricHeap::establishPeerAccess()` (HIP VMem
+  `hipMemImportFromShareableHandle` / `hipMemMap` path). The **same crash
+  reproduces with the existing C++ baseline** `test/gpu/test_symmetric_heap.cpp`
+  on this node, so it is a pre-existing runtime/HIP issue unrelated to the
+  MLIR layer. Likely fixes:
+  1. Upgrade the multi-GPU node to ROCm 7.x where the path is known to work.
+  2. Or root-cause and patch `runtime_lib/airgpu/symmetric_heap.cpp`
+     `establishPeerAccess()` against ROCm 6.4 HIP VMem behavior.
+  Either is independent of Phases 3-7 (which only call the existing ABI).
+
+### Phase 3 — `air-rank-to-mgpu` pass
 Replaces `air-rank-to-launch` in the GPU pipeline. New file
 `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`:
 - Lowers `air.rank.id` → `mgpuGetRank()`
@@ -114,14 +163,14 @@ Replaces `air-rank-to-launch` in the GPU pipeline. New file
   `mgpuSymmetricHeapDestroy()` at exit (or assumes the launcher does this)
 - Body of `air.rank` is moved to per-process function (no `scf.for` wrapping)

-### Phase 3 — Symmetric alloc lowering
+### Phase 4 — Symmetric alloc lowering
 - Extend `hoistAlloc` (or add a sibling pass) in `AIRToROCDLPass.cpp:570-601` to
   recognize `air.symmetric`-tagged `memref.alloc` and lower them to
   `mgpuSymmetricAlloc` calls (host-side, not GPU workgroup attribution)
 - Memrefs from symmetric alloc remain in address space 0 (global) but carry an
   attribute that downstream lowering uses to detect peer-access addressing

-### Phase 4 — Cross-rank DMA lowering
+### Phase 5 — Cross-rank DMA lowering
 Extend `convertDMAToGPUMemcpy` in `AIRToROCDLPass.cpp:737-853`:
 - Detect peer-tagged operand (rank attribute / op operand)
 - Resolve `mgpuGetHeapBases()` once at kernel launch (host-side), pass the peer
@@ -131,25 +180,29 @@ Extend `convertDMAToGPUMemcpy` in `AIRToROCDLPass.cpp:737-853`:
 - Insert `mgpuBarrier()` (host-side) at synchronization points before/after the
   cross-rank transfer

-### Phase 5 — `air.channel` on GPU
+### Phase 6 — `air.channel` on GPU
 Add a new pattern (parallel to `convertDMAToGPUMemcpy`) for `air.channel.put` /
-`air.channel.get` with `channel_type = "symmetric_heap"`:
+`air.channel.get` with `channel_type = "gpu_symmetric_heap"`:
 - Producer (put): same thread-cooperative loop as DMA, writing into the
   symmetric-heap slot at `bases[my_rank] + slot_offset`
 - Consumer (get): cooperative loop reading from `bases[peer_rank] + slot_offset`
 - Synchronization: per-channel notify-flag word in the symmetric heap, polled by
   the consumer, set by the producer (matches `depth = 1` rendezvous semantics)
 - For `depth > 1`, allocate `depth` slots and a head/tail index per channel

-### Phase 6 — `aircc` launcher integration
+### Phase 7 — `aircc` launcher integration
 - Add a runner mode (e.g. `--multi-rank=N`) that forks `N` processes with
   `RANK`/`WORLD_SIZE`/`LOCAL_RANK` env vars set
 - Each child execs the same compiled binary linked with `libairgpu.so`
 - Host-side cleanup waits for all children
 - Reuse the pattern from `test/gpu/run_symmetric_heap_test.sh`

-### Phase 7 — End-to-end test
-Add `test/gpu/symmetric_heap_dma/`:
+### Phase 8 — High-level e2e test (lowering parity)
+
+After Phase 3-7 land, write a high-level `air_rank.mlir`-style test (one
+that uses `air.rank`, `air.symmetric`, `src_rank`/`dst_rank`) and assert
+the post-lowering IR matches the hand-written reference from Phase 2. This
+closes the loop:

 ```mlir
 %c2 = arith.constant 2 : index
test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
+//===- air_sym_handwritten.mlir - hand-written multi-GPU e2e test --------===//
+//
+// Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
+// SPDX-License-Identifier: MIT
+//
+//===------------------------------------------------------------------===//
+//
+// Hand-written reference IR exercising the symmetric-heap multi-GPU runtime
+// from MLIR. This is what the (future) air-rank-to-mgpu + cross-rank-DMA
+// lowering passes should produce.
+//
+// Each process executes this main once. With WORLD_SIZE=2:
+// 1. Init symmetric heap.
+// 2. Allocate a 1024xf32 symmetric buffer.
+// 3. Each rank fills its buffer with (rank + 1).0 from host.
+// 4. Barrier.
+// 5. Each rank reads peer's buffer via mgpuGetHeapBases()[peer]+offset,
+// copies it D2D into a local hipMalloc-style buffer, then D2H into a
+// host buffer, and verifies every element == (peer + 1).0.
+// 6. Print PASS / FAIL.
+//
+// Launcher: run.sh forks N processes with RANK / WORLD_SIZE / LOCAL_RANK.
+//
+//===------------------------------------------------------------------===//
+
+module {
+  // ---- mgpu* C ABI declarations -----------------------------------------
+  func.func private @mgpuSymmetricHeapInit(i64)
+  func.func private @mgpuSymmetricHeapDestroy()
+  func.func private @mgpuGetRank() -> i32
+  func.func private @mgpuGetWorldSize() -> i32
+  func.func private @mgpuSymmetricAlloc(i64, !llvm.ptr) -> !llvm.ptr
+  func.func private @mgpuSymmetricFree(!llvm.ptr, !llvm.ptr)
+  func.func private @mgpuGetHeapBase(i32) -> !llvm.ptr
+  func.func private @mgpuGetHeapBases() -> !llvm.ptr
+  func.func private @mgpuBarrier()
+  func.func private @mgpuMemAlloc(i64, !llvm.ptr, i1) -> !llvm.ptr
+  func.func private @mgpuMemFree(!llvm.ptr, !llvm.ptr)
+  func.func private @mgpuMemcpy(!llvm.ptr, !llvm.ptr, i64, !llvm.ptr)
+
+  // libc helpers
+  func.func private @malloc(i64) -> !llvm.ptr
+  func.func private @free(!llvm.ptr)
+  llvm.func @printf(!llvm.ptr, ...) -> i32
+
+  llvm.mlir.global internal constant @msg_init("[mlir] rank %d / world %d, init OK\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_pass("[mlir] rank %d: cross-rank read PASS (peer=%d, expected=%.1f)\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_fail("[mlir] rank %d: MISMATCH at idx=%ld got=%.1f expected=%.1f\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_only1("[mlir] rank %d: world_size=1, skipping cross-rank read\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_done("[mlir] rank %d: ALL PASSED\0A\00") {addr_space = 0 : i32}
+
+  // ---- main -------------------------------------------------------------
+  func.func @main() {
+    // Constants
+    %c0_i32 = arith.constant 0 : i32
+    %c1_i32 = arith.constant 1 : i32
+    %c0_i64 = arith.constant 0 : i64
+    %c1_i64 = arith.constant 1 : i64
+    %c4_i64 = arith.constant 4 : i64 // sizeof(f32)
+    %c1024_i64 = arith.constant 1024 : i64 // N
+    %c4096_i64 = arith.constant 4096 : i64 // N * sizeof(f32)
+    %heap_size = arith.constant 268435456 : i64 // 256 MB
+    %nullptr = llvm.mlir.zero : !llvm.ptr
+    %false = arith.constant false
+
+    // Init symmetric heap (collective)
+    func.call @mgpuSymmetricHeapInit(%heap_size) : (i64) -> ()
+    %rank = func.call @mgpuGetRank() : () -> i32
+    %world = func.call @mgpuGetWorldSize() : () -> i32
+
+    // printf("[mlir] rank %d / world %d, init OK\n", rank, world)
+    %fmt_init = llvm.mlir.addressof @msg_init : !llvm.ptr
+    llvm.call @printf(%fmt_init, %rank, %world) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32, i32) -> i32
+
+    // Symmetric alloc 1024 floats
+    %buf = func.call @mgpuSymmetricAlloc(%c4096_i64, %nullptr) : (i64, !llvm.ptr) -> !llvm.ptr
+
+    // Allocate host buffer of 1024 floats and fill with (rank + 1).0
+    %hostbuf = func.call @malloc(%c4096_i64) : (i64) -> !llvm.ptr
+    %rank_plus1_i32 = arith.addi %rank, %c1_i32 : i32
+    %rank_plus1_f32 = arith.sitofp %rank_plus1_i32 : i32 to f32
+    %c0 = arith.constant 0 : index
+    %c1 = arith.constant 1 : index
+    %c1024 = arith.constant 1024 : index
+    scf.for %i = %c0 to %c1024 step %c1 {
+      %i_i64 = arith.index_cast %i : index to i64
+      %addr = llvm.getelementptr %hostbuf[%i_i64] : (!llvm.ptr, i64) -> !llvm.ptr, f32
+      llvm.store %rank_plus1_f32, %addr : f32, !llvm.ptr
+    }
+
+    // mgpuMemcpy(buf, hostbuf, 4096, nullptr) // H2D
+    func.call @mgpuMemcpy(%buf, %hostbuf, %c4096_i64, %nullptr) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
+
+    // Barrier so all ranks have written before any reads
+    func.call @mgpuBarrier() : () -> ()
+
+    // If world_size > 1, read from peer = (rank + 1) % world
+    %is_multi = arith.cmpi sgt, %world, %c1_i32 : i32
+    scf.if %is_multi {
+      %sum = arith.addi %rank, %c1_i32 : i32
+      %peer = arith.remsi %sum, %world : i32
+
+      // bases = mgpuGetHeapBases()
+      %bases = func.call @mgpuGetHeapBases() : () -> !llvm.ptr
+
+      // peer_base = bases[peer]
+      %peer_i64 = arith.extsi %peer : i32 to i64
+      %peer_base_addr = llvm.getelementptr %bases[%peer_i64] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.ptr
+      %peer_base = llvm.load %peer_base_addr : !llvm.ptr -> !llvm.ptr
+
+      // local_base = bases[rank]
+      %rank_i64 = arith.extsi %rank : i32 to i64
+      %local_base_addr = llvm.getelementptr %bases[%rank_i64] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.ptr
+      %local_base = llvm.load %local_base_addr : !llvm.ptr -> !llvm.ptr
+
+      // local_offset = (uintptr_t)buf - (uintptr_t)local_base
+      %buf_int = llvm.ptrtoint %buf : !llvm.ptr to i64
+      %local_base_int = llvm.ptrtoint %local_base : !llvm.ptr to i64
+      %offset = arith.subi %buf_int, %local_base_int : i64
+
+      // peer_buf = (char*)peer_base + offset
+      %peer_buf = llvm.getelementptr %peer_base[%offset] : (!llvm.ptr, i64) -> !llvm.ptr, i8
+
+      // Allocate a local D2D-target buffer via mgpuMemAlloc(N*sizeof(f32))
+      %local_copy = func.call @mgpuMemAlloc(%c4096_i64, %nullptr, %false) : (i64, !llvm.ptr, i1) -> !llvm.ptr
+
+      // mgpuMemcpy(local_copy, peer_buf, 4096, nullptr) // D2D
+      func.call @mgpuMemcpy(%local_copy, %peer_buf, %c4096_i64, %nullptr) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
+
+      // Allocate host readback and copy D2H
+      %host_rb = func.call @malloc(%c4096_i64) : (i64) -> !llvm.ptr
+      func.call @mgpuMemcpy(%host_rb, %local_copy, %c4096_i64, %nullptr) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
+
+      // Verify: every element == (peer + 1).0
+      %peer_plus1_i32 = arith.addi %peer, %c1_i32 : i32
+      %expected = arith.sitofp %peer_plus1_i32 : i32 to f32
+
+      %nfail_init = arith.constant 0 : i32
+      %nfail = scf.for %i = %c0 to %c1024 step %c1
+          iter_args(%nfail_acc = %nfail_init) -> (i32) {
+        %i_i64 = arith.index_cast %i : index to i64
+        %addr = llvm.getelementptr %host_rb[%i_i64] : (!llvm.ptr, i64) -> !llvm.ptr, f32
+        %v = llvm.load %addr : !llvm.ptr -> f32
+        %ne = arith.cmpf une, %v, %expected : f32
+        %new_nfail = scf.if %ne -> i32 {
+          // Print first few mismatches
+          %fmt_fail = llvm.mlir.addressof @msg_fail : !llvm.ptr
+          %v64 = arith.extf %v : f32 to f64
+          %e64 = arith.extf %expected : f32 to f64
+          llvm.call @printf(%fmt_fail, %rank, %i_i64, %v64, %e64) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32, i64, f64, f64) -> i32
+          %inc = arith.addi %nfail_acc, %c1_i32 : i32
+          scf.yield %inc : i32
+        } else {
+          scf.yield %nfail_acc : i32
+        }
+        scf.yield %new_nfail : i32
+      }
+
+      // If no failures, print PASS
+      %ok = arith.cmpi eq, %nfail, %c0_i32 : i32
+      scf.if %ok {
+        %fmt_pass = llvm.mlir.addressof @msg_pass : !llvm.ptr
+        %e64 = arith.extf %expected : f32 to f64
+        llvm.call @printf(%fmt_pass, %rank, %peer, %e64) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32, i32, f64) -> i32
+      }
+
+      // Cleanup
+      func.call @free(%host_rb) : (!llvm.ptr) -> ()
+      func.call @mgpuMemFree(%local_copy, %nullptr) : (!llvm.ptr, !llvm.ptr) -> ()
+    } else {
+      %fmt_only1 = llvm.mlir.addressof @msg_only1 : !llvm.ptr
+      llvm.call @printf(%fmt_only1, %rank) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32) -> i32
+    }
+
+    func.call @mgpuBarrier() : () -> ()
+
+    // Cleanup
+    func.call @free(%hostbuf) : (!llvm.ptr) -> ()
+    func.call @mgpuSymmetricFree(%buf, %nullptr) : (!llvm.ptr, !llvm.ptr) -> ()
+    func.call @mgpuSymmetricHeapDestroy() : () -> ()
+
+    %fmt_done = llvm.mlir.addressof @msg_done : !llvm.ptr
+    llvm.call @printf(%fmt_done, %rank) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32) -> i32
+
+    return
+  }
+}
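
The peer address in the IR above comes from one piece of arithmetic: take the buffer's byte offset within the local heap and apply it to the peer's heap base. The sketch below replays that computation in shell with made-up addresses. The `bases` array stands in for the table returned by `mgpuGetHeapBases()`, and both the values and the helper name `peer_address` are hypothetical, chosen only to make the arithmetic visible.

```bash
#!/usr/bin/env bash
# Sketch of the symmetric-offset address computation with illustrative addresses.
# bases[r] stands in for entry r of the heap-base table; real values come from the runtime.
bases=(0x7f0000000000 0x7f1000000000)

peer_address() {
  local rank=$1 peer=$2 buf=$3
  local offset=$(( buf - bases[rank] ))        # byte offset of buf within the local heap
  printf '0x%x\n' $(( bases[peer] + offset ))  # same offset applied to the peer's heap base
}

# A buffer 0x1000 bytes into rank 0's heap maps to the same offset in rank 1's heap.
peer_address 0 1 $(( 0x7f0000000000 + 0x1000 ))
```

This only works because the allocation is symmetric: every rank is assumed to receive the same offset for the same allocation, so the offset computed locally is also valid on the peer.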

test/gpu/symmetric_heap_dma/run.sh

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+#===- run.sh - Multi-process symmetric-heap DMA e2e test --*-
+#
+# Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
+# SPDX-License-Identifier: MIT
+#
+#===------------------------------------------------------------------===//
+#
+# Compile and run the hand-written symmetric-heap MLIR test as N processes.
+# Each process executes the full IR; processes coordinate via the symmetric
+# heap (XGMI peer-mapped VMem buffers).
+#
+# Usage: run.sh [num_ranks] (default: 2)
+#
+# Required environment (auto-detected when sourced via env_setup_gpu.sh):
+# MLIR_AIR_INSTALL_DIR - path containing lib/libairgpu.so
+# LLVM_INSTALL_DIR - path containing bin/mlir-opt + lib/libmlir_*.so
+#
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+NUM_RANKS=${1:-2}
+# Set SHARE_GPU=1 to make all ranks use GPU 0 (single-GPU test machines).
+# Default: each rank uses its own GPU (LOCAL_RANK=$i).
+SHARE_GPU=${SHARE_GPU:-0}
+TMPDIR="${TMPDIR:-/tmp/air_sym_dma}"
+mkdir -p "$TMPDIR"
+
+LLVM_LIB_DIR="${LLVM_INSTALL_DIR:-$(dirname "$(which mlir-opt)")/..}/lib"
+AIRGPU_LIB="${MLIR_AIR_INSTALL_DIR:-$(dirname "$(which air-opt)")/..}/lib/libairgpu.so"
+
+echo "Step 1: Lower hand-written IR to LLVM dialect"
+mlir-opt "$SCRIPT_DIR/air_sym_handwritten.mlir" \
+  --pass-pipeline='builtin.module(func.func(convert-scf-to-cf),convert-to-llvm,reconcile-unrealized-casts)' \
+  -o "$TMPDIR/sym_lowered.mlir"
+
+echo "Step 2: Run as ${NUM_RANKS} processes"
+export AIRGPU_JOB_ID="${AIRGPU_JOB_ID:-$$}"
+
+PIDS=()
+PASS=1
+
+for i in $(seq 0 $((NUM_RANKS - 1))); do
+  if [ "$SHARE_GPU" = "1" ]; then
+    LR=0
+  else
+    LR=$i
+  fi
+  (set -o pipefail
+   RANK=$i WORLD_SIZE=$NUM_RANKS LOCAL_RANK=$LR \
+     mlir-runner --entry-point-result=void \
+       --shared-libs="$LLVM_LIB_DIR/libmlir_rocm_runtime.so" \
+       --shared-libs="$AIRGPU_LIB" \
+       --shared-libs="$LLVM_LIB_DIR/libmlir_runner_utils.so" \
+       --shared-libs="$LLVM_LIB_DIR/libmlir_c_runner_utils.so" \
+       "$TMPDIR/sym_lowered.mlir" 2>&1 | sed "s/^/[rank $i] /") &
+  PIDS+=($!)
+done
+
+for pid in "${PIDS[@]}"; do
+  if ! wait "$pid"; then
+    PASS=0
+  fi
+done
+
+if [ $PASS -eq 1 ]; then
+  echo "=== ALL ${NUM_RANKS} RANKS PASSED ==="
+else
+  echo "=== SOME RANKS FAILED ==="
+  exit 1
+fi
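
The fork-and-wait structure of the script above can be exercised without a GPU or an MLIR toolchain. The following is a minimal sketch under that assumption: `fake_rank` is a hypothetical stand-in for the `mlir-runner` invocation, and `FAIL_RANK` is an illustrative variable (not used by `run.sh`) that forces one rank to fail so the status aggregation can be observed.

```bash
#!/usr/bin/env bash
# Sketch of the fork-and-wait pattern with a stand-in worker instead of mlir-runner.
# fake_rank is hypothetical: it fails only for the rank named in FAIL_RANK.
fake_rank() {
  [ "$1" != "${FAIL_RANK:-none}" ]
}

run_ranks() {
  local num_ranks=$1 pids=() pass=1 i pid
  for i in $(seq 0 $((num_ranks - 1))); do
    # pipefail makes the subshell's status reflect the worker, not the trailing sed.
    (set -o pipefail
     RANK=$i WORLD_SIZE=$num_ranks fake_rank "$i" 2>&1 | sed "s/^/[rank $i] /") &
    pids+=($!)
  done
  # Wait for every child so one failure does not leave the others unreaped.
  for pid in "${pids[@]}"; do
    wait "$pid" || pass=0
  done
  [ "$pass" -eq 1 ]
}

run_ranks 2 && echo "all ranks passed"
FAIL_RANK=1 run_ranks 2 || echo "some ranks failed"
```

As in `run.sh`, a rank counts as failed only when its process exits non-zero, so the pass/fail summary depends on the worker's exit status rather than on what it prints.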
