[multi-gpu] Phase 2: hand-written e2e test for symmetric-heap multi-GPU

erwei-xilinx · claude · erwei-xilinx · commit a1ee757dc8b8 · 2026-05-03T18:02:48.000Z
Before writing any lowering pass, prove the symmetric-heap runtime works
end-to-end from MLIR by hand-writing the IR that future passes should emit.
This locks down the lowered shape, surfaces ABI gaps early, and provides a
reference oracle for diff-testing the upcoming air-rank-to-mgpu /
cross-rank-DMA / channel-on-GPU passes.

## Files

- `test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir`: hand-written
  reference IR. Each rank: init heap, alloc symmetric buffer, fill with
  (rank+1).0, barrier, read peer's buffer via `mgpuGetHeapBases()[peer]`,
  D2D into local copy, D2H readback, verify, print PASS/FAIL.
- `test/gpu/symmetric_heap_dma/run.sh`: driver that lowers the IR with
  `mlir-opt`, then forks N processes with RANK/WORLD_SIZE/LOCAL_RANK env
  vars set and runs `mlir-runner` in each. `SHARE_GPU=1` env makes all ranks
  share GPU 0 for testing on single-GPU hosts.

## Validation

- ✅ Verified end-to-end on rad-mi300a-sh5-1 (1×MI300A, ROCm 7.1.1) with
  `SHARE_GPU=1` and 2 ranks: rank 0 sees `2.0` from rank 1, rank 1 sees
  `1.0` from rank 0.
- ⚠️ rad-mi300x-1 (8×MI300X, ROCm 6.4.0) hits a runtime-side crash inside
  libamdhip64.so during `establishPeerAccess()`. Same crash reproduces with
  the existing C++ baseline `test/gpu/test_symmetric_heap.cpp`, so it is a
  pre-existing runtime/HIP issue unrelated to this change.

## Findings

No runtime ABI gaps for Phases 3-7. The full lowering pipeline can be built
using only existing exports: `mgpuSymmetricHeapInit/Destroy`,
`mgpuGetRank/WorldSize`, `mgpuSymmetricAlloc/Free`, `mgpuGetHeapBases`,
`mgpuBarrier`, `mgpuMemcpy` (D2D for cross-rank reads; direct kernel read
from peer-VA isn't supported on some chipsets, so D2D-to-local-then-read is
the required pattern).

`docs/MultiGPUPlan.md` updated with Phase 2 status section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
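
The cross-rank read in the hand-written IR depends on a symmetric allocation sitting at the same byte offset in every rank's heap, so the peer's copy is found at `bases[peer] + (buf - bases[rank])`. A minimal sketch of that arithmetic in shell follows. The function names and all addresses are made up for illustration and do not come from a real run.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the pointer arithmetic in air_sym_handwritten.mlir.
# peer_of, peer_buf_addr, and every address below are hypothetical.

# peer = (rank + 1) % world_size
peer_of() {
  local rank=$1 world=$2
  echo $(( (rank + 1) % world ))
}

# peer_buf = peer_base + (buf - local_base)
peer_buf_addr() {
  local buf=$1 local_base=$2 peer_base=$3
  local offset=$(( buf - local_base ))
  printf '0x%x\n' $(( peer_base + offset ))
}

# Example with made-up heap bases for a 2-rank job, seen from rank 0.
bases=(0x7f0000000000 0x7e0000000000)
rank=0
buf=0x7f0000001000   # hypothetical symmetric allocation on rank 0
peer=$(peer_of "$rank" 2)
peer_buf_addr "$buf" "${bases[$rank]}" "${bases[$peer]}"
```

With these assumed values the buffer sits 0x1000 bytes into rank 0's heap, so the script prints the address 0x1000 bytes into rank 1's heap. In the IR the same offset is computed with `llvm.ptrtoint` / `arith.subi` and applied with a byte-typed `llvm.getelementptr`.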
diff --git a/docs/MultiGPUPlan.md b/docs/MultiGPUPlan.md
@@ -97,15 +97,64 @@ LOCAL_RANK=i` set, all loading the same compiled binary linked against `libairgp
 
 ## Implementation Phases
 
-### Phase 1 — Op extensions
-- Add `channel_type = "symmetric_heap"` to `ChannelOp` verifier
-  (`AIRDialect.cpp:3296`)
-- Add optional `src_rank` / `dst_rank` operand or attribute to
+### Phase 1 — Op extensions ✅ (landed in PR #1576)
+- Add `channel_type = "gpu_symmetric_heap"` to `ChannelOp` verifier
+  (`AIRDialect.cpp:3296`); rename existing values to `npu_*`
+- Add optional `src_rank` / `dst_rank` integer attributes to
   `air.dma_memcpy_nd` (`AIR.td:458-501`)
-- Add `air.symmetric` memref attribute / address-space tag
+- Add `air.symmetric` memref attribute convention
 - Document semantics in `docs/AIRComputeModel.md`
 
-### Phase 2 — `air-rank-to-mgpu` pass
+### Phase 2 — Hand-written reference IR + e2e validation 🆕
+
+**Rationale:** before writing any lowering pass, prove the runtime works
+end-to-end from MLIR by writing the lowered IR by hand. This:
+1. Locks down the **exact** lowered shape that future passes must produce
+2. Surfaces any runtime ABI gaps **before** sinking time into a pass that
+   targets a broken runtime
+3. Provides a reference oracle for diff-testing the lowering passes
+
+Add `test/gpu/symmetric_heap_dma/`:
+- `air_sym_handwritten.mlir` — already-lowered IR (no `air.rank` /
+  `air.symmetric` / cross-rank DMA — just direct `func.call` to the
+  `mgpu*` runtime ABI plus inline pointer arithmetic for cross-rank
+  load/store)
+- `run.sh` — driver that forks N processes with `RANK` / `WORLD_SIZE` /
+  `LOCAL_RANK` set, runs `mlir-runner` in each, waits for all to finish
+
+**Behavior under test:**
+1. `mgpuSymmetricHeapInit(heap_size)`
+2. `mgpuSymmetricAlloc(size, stream)` → symmetric buffer at offset O
+3. Each rank fills its own buffer with `(rank + 1).0`
+4. `mgpuBarrier()`
+5. Each rank reads from peer's buffer via `mgpuGetHeapBases()[peer] + O`
+6. Verify expected value, print PASS/FAIL
+7. `mgpuSymmetricHeapDestroy()`
+
+If any step requires an ABI extension (e.g., per-channel notify-flag for
+`gpu_symmetric_heap` channel synchronization beyond the global barrier),
+add it to `runtime_lib/airgpu/gpu_runtime.cpp` and validate first via the
+existing C++ test (`test/gpu/test_symmetric_heap.cpp`) before exposing it
+to MLIR.
+
+#### Phase 2 status
+
+- ✅ **Single-GPU multi-process** (`SHARE_GPU=1`) verified end-to-end on
+  `rad-mi300a-sh5-1` (1× MI300A, ROCm 7.1.1): both ranks PASS — rank 0 reads
+  rank 1's buffer (sees `2.0`), rank 1 reads rank 0's (sees `1.0`).
+- ⚠️ **Multi-GPU multi-process** on `rad-mi300x-1` (8× MI300X, ROCm 6.4.0)
+  is currently blocked by a runtime-side crash inside `libamdhip64.so`
+  during `SymmetricHeap::establishPeerAccess()` (HIP VMem
+  `hipMemImportFromShareableHandle` / `hipMemMap` path). The **same crash
+  reproduces with the existing C++ baseline** `test/gpu/test_symmetric_heap.cpp`
+  on this node, so it is a pre-existing runtime/HIP issue unrelated to the
+  MLIR layer. Likely fixes:
+  1. Upgrade the multi-GPU node to ROCm 7.x where the path is known to work.
+  2. Or root-cause and patch `runtime_lib/airgpu/symmetric_heap.cpp`
+     `establishPeerAccess()` against ROCm 6.4 HIP VMem behavior.
+  Either is independent of Phases 3-7 (which only call the existing ABI).
+
+### Phase 3 — `air-rank-to-mgpu` pass
 Replaces `air-rank-to-launch` in the GPU pipeline. New file
 `mlir/lib/Conversion/AIRRankToMgpuPass.cpp`:
 - Lowers `air.rank.id` → `mgpuGetRank()`
@@ -114,14 +163,14 @@ Replaces `air-rank-to-launch` in the GPU pipeline. New file
   `mgpuSymmetricHeapDestroy()` at exit (or assumes the launcher does this)
 - Body of `air.rank` is moved to per-process function (no `scf.for` wrapping)
 
-### Phase 3 — Symmetric alloc lowering
+### Phase 4 — Symmetric alloc lowering
 - Extend `hoistAlloc` (or add a sibling pass) in `AIRToROCDLPass.cpp:570-601` to
   recognize `air.symmetric`-tagged `memref.alloc` and lower them to
   `mgpuSymmetricAlloc` calls (host-side, not GPU workgroup attribution)
 - Memrefs from symmetric alloc remain in address space 0 (global) but carry an
   attribute that downstream lowering uses to detect peer-access addressing
 
-### Phase 4 — Cross-rank DMA lowering
+### Phase 5 — Cross-rank DMA lowering
 Extend `convertDMAToGPUMemcpy` in `AIRToROCDLPass.cpp:737-853`:
 - Detect peer-tagged operand (rank attribute / op operand)
 - Resolve `mgpuGetHeapBases()` once at kernel launch (host-side), pass the peer
@@ -131,25 +180,29 @@ Extend `convertDMAToGPUMemcpy` in `AIRToROCDLPass.cpp:737-853`:
 - Insert `mgpuBarrier()` (host-side) at synchronization points before/after the
   cross-rank transfer
 
-### Phase 5 — `air.channel` on GPU
+### Phase 6 — `air.channel` on GPU
 Add a new pattern (parallel to `convertDMAToGPUMemcpy`) for `air.channel.put` /
-`air.channel.get` with `channel_type = "symmetric_heap"`:
+`air.channel.get` with `channel_type = "gpu_symmetric_heap"`:
 - Producer (put): same thread-cooperative loop as DMA, writing into the
   symmetric-heap slot at `bases[my_rank] + slot_offset`
 - Consumer (get): cooperative loop reading from `bases[peer_rank] + slot_offset`
 - Synchronization: per-channel notify-flag word in the symmetric heap, polled by
   the consumer, set by the producer (matches `depth = 1` rendezvous semantics)
 - For `depth > 1`, allocate `depth` slots and a head/tail index per channel
 
-### Phase 6 — `aircc` launcher integration
+### Phase 7 — `aircc` launcher integration
 - Add a runner mode (e.g. `--multi-rank=N`) that forks `N` processes with
   `RANK`/`WORLD_SIZE`/`LOCAL_RANK` env vars set
 - Each child execs the same compiled binary linked with `libairgpu.so`
 - Host-side cleanup waits for all children
 - Reuse the pattern from `test/gpu/run_symmetric_heap_test.sh`
 
-### Phase 7 — End-to-end test
-Add `test/gpu/symmetric_heap_dma/`:
+### Phase 8 — High-level e2e test (lowering parity)
+
+After Phases 3-7 land, write a high-level `air_rank.mlir`-style test (one
+that uses `air.rank`, `air.symmetric`, `src_rank`/`dst_rank`) and assert
+the post-lowering IR matches the hand-written reference from Phase 2. This
+closes the loop:
 
 ```mlir
 %c2 = arith.constant 2 : index
diff --git a/test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir b/test/gpu/symmetric_heap_dma/air_sym_handwritten.mlir
@@ -0,0 +1,187 @@
+//===- air_sym_handwritten.mlir - hand-written multi-GPU e2e test --------===//
+//
+// Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
+// SPDX-License-Identifier: MIT
+//
+//===------------------------------------------------------------------===//
+//
+// Hand-written reference IR exercising the symmetric-heap multi-GPU runtime
+// from MLIR. This is what the (future) air-rank-to-mgpu + cross-rank-DMA
+// lowering passes should produce.
+//
+// Each process executes this main once. With WORLD_SIZE=2:
+//   1. Init symmetric heap.
+//   2. Allocate a 1024xf32 symmetric buffer.
+//   3. Each rank fills its buffer with (rank + 1).0 from host.
+//   4. Barrier.
+//   5. Each rank reads peer's buffer via mgpuGetHeapBases()[peer]+offset,
+//      copies it D2D into a local hipMalloc-style buffer, then D2H into a
+//      host buffer, and verifies every element == (peer + 1).0.
+//   6. Print PASS / FAIL.
+//
+// Launcher: run.sh forks N processes with RANK / WORLD_SIZE / LOCAL_RANK.
+//
+//===------------------------------------------------------------------===//
+
+module {
+  // ---- mgpu* C ABI declarations -----------------------------------------
+  func.func private @mgpuSymmetricHeapInit(i64)
+  func.func private @mgpuSymmetricHeapDestroy()
+  func.func private @mgpuGetRank() -> i32
+  func.func private @mgpuGetWorldSize() -> i32
+  func.func private @mgpuSymmetricAlloc(i64, !llvm.ptr) -> !llvm.ptr
+  func.func private @mgpuSymmetricFree(!llvm.ptr, !llvm.ptr)
+  func.func private @mgpuGetHeapBase(i32) -> !llvm.ptr
+  func.func private @mgpuGetHeapBases() -> !llvm.ptr
+  func.func private @mgpuBarrier()
+  func.func private @mgpuMemAlloc(i64, !llvm.ptr, i1) -> !llvm.ptr
+  func.func private @mgpuMemFree(!llvm.ptr, !llvm.ptr)
+  func.func private @mgpuMemcpy(!llvm.ptr, !llvm.ptr, i64, !llvm.ptr)
+
+  // libc helpers
+  func.func private @malloc(i64) -> !llvm.ptr
+  func.func private @free(!llvm.ptr)
+  llvm.func @printf(!llvm.ptr, ...) -> i32
+
+  llvm.mlir.global internal constant @msg_init("[mlir] rank %d / world %d, init OK\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_pass("[mlir] rank %d: cross-rank read PASS (peer=%d, expected=%.1f)\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_fail("[mlir] rank %d: MISMATCH at idx=%ld got=%.1f expected=%.1f\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_only1("[mlir] rank %d: world_size=1, skipping cross-rank read\0A\00") {addr_space = 0 : i32}
+  llvm.mlir.global internal constant @msg_done("[mlir] rank %d: ALL PASSED\0A\00") {addr_space = 0 : i32}
+
+  // ---- main -------------------------------------------------------------
+  func.func @main() {
+    // Constants
+    %c0_i32 = arith.constant 0 : i32
+    %c1_i32 = arith.constant 1 : i32
+    %c0_i64 = arith.constant 0 : i64
+    %c1_i64 = arith.constant 1 : i64
+    %c4_i64 = arith.constant 4 : i64                    // sizeof(f32)
+    %c1024_i64 = arith.constant 1024 : i64              // N
+    %c4096_i64 = arith.constant 4096 : i64              // N * sizeof(f32)
+    %heap_size = arith.constant 268435456 : i64         // 256 MB
+    %nullptr = llvm.mlir.zero : !llvm.ptr
+    %false = arith.constant false
+
+    // Init symmetric heap (collective)
+    func.call @mgpuSymmetricHeapInit(%heap_size) : (i64) -> ()
+    %rank = func.call @mgpuGetRank() : () -> i32
+    %world = func.call @mgpuGetWorldSize() : () -> i32
+
+    // printf("[mlir] rank %d / world %d, init OK\n", rank, world)
+    %fmt_init = llvm.mlir.addressof @msg_init : !llvm.ptr
+    llvm.call @printf(%fmt_init, %rank, %world) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32, i32) -> i32
+
+    // Symmetric alloc 1024 floats
+    %buf = func.call @mgpuSymmetricAlloc(%c4096_i64, %nullptr) : (i64, !llvm.ptr) -> !llvm.ptr
+
+    // Allocate host buffer of 1024 floats and fill with (rank + 1).0
+    %hostbuf = func.call @malloc(%c4096_i64) : (i64) -> !llvm.ptr
+    %rank_plus1_i32 = arith.addi %rank, %c1_i32 : i32
+    %rank_plus1_f32 = arith.sitofp %rank_plus1_i32 : i32 to f32
+    %c0 = arith.constant 0 : index
+    %c1 = arith.constant 1 : index
+    %c1024 = arith.constant 1024 : index
+    scf.for %i = %c0 to %c1024 step %c1 {
+      %i_i64 = arith.index_cast %i : index to i64
+      %addr = llvm.getelementptr %hostbuf[%i_i64] : (!llvm.ptr, i64) -> !llvm.ptr, f32
+      llvm.store %rank_plus1_f32, %addr : f32, !llvm.ptr
+    }
+
+    // mgpuMemcpy(buf, hostbuf, 4096, nullptr)  // H2D
+    func.call @mgpuMemcpy(%buf, %hostbuf, %c4096_i64, %nullptr) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
+
+    // Barrier so all ranks have written before any reads
+    func.call @mgpuBarrier() : () -> ()
+
+    // If world_size > 1, read from peer = (rank + 1) % world
+    %is_multi = arith.cmpi sgt, %world, %c1_i32 : i32
+    scf.if %is_multi {
+      %sum = arith.addi %rank, %c1_i32 : i32
+      %peer = arith.remsi %sum, %world : i32
+
+      // bases = mgpuGetHeapBases()
+      %bases = func.call @mgpuGetHeapBases() : () -> !llvm.ptr
+
+      // peer_base = bases[peer]
+      %peer_i64 = arith.extsi %peer : i32 to i64
+      %peer_base_addr = llvm.getelementptr %bases[%peer_i64] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.ptr
+      %peer_base = llvm.load %peer_base_addr : !llvm.ptr -> !llvm.ptr
+
+      // local_base = bases[rank]
+      %rank_i64 = arith.extsi %rank : i32 to i64
+      %local_base_addr = llvm.getelementptr %bases[%rank_i64] : (!llvm.ptr, i64) -> !llvm.ptr, !llvm.ptr
+      %local_base = llvm.load %local_base_addr : !llvm.ptr -> !llvm.ptr
+
+      // local_offset = (uintptr_t)buf - (uintptr_t)local_base
+      %buf_int = llvm.ptrtoint %buf : !llvm.ptr to i64
+      %local_base_int = llvm.ptrtoint %local_base : !llvm.ptr to i64
+      %offset = arith.subi %buf_int, %local_base_int : i64
+
+      // peer_buf = (char*)peer_base + offset
+      %peer_buf = llvm.getelementptr %peer_base[%offset] : (!llvm.ptr, i64) -> !llvm.ptr, i8
+
+      // Allocate a local D2D-target buffer via mgpuMemAlloc(N*sizeof(f32))
+      %local_copy = func.call @mgpuMemAlloc(%c4096_i64, %nullptr, %false) : (i64, !llvm.ptr, i1) -> !llvm.ptr
+
+      // mgpuMemcpy(local_copy, peer_buf, 4096, nullptr)  // D2D
+      func.call @mgpuMemcpy(%local_copy, %peer_buf, %c4096_i64, %nullptr) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
+
+      // Allocate host readback and copy D2H
+      %host_rb = func.call @malloc(%c4096_i64) : (i64) -> !llvm.ptr
+      func.call @mgpuMemcpy(%host_rb, %local_copy, %c4096_i64, %nullptr) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
+
+      // Verify: every element == (peer + 1).0
+      %peer_plus1_i32 = arith.addi %peer, %c1_i32 : i32
+      %expected = arith.sitofp %peer_plus1_i32 : i32 to f32
+
+      %nfail_init = arith.constant 0 : i32
+      %nfail = scf.for %i = %c0 to %c1024 step %c1
+                      iter_args(%nfail_acc = %nfail_init) -> (i32) {
+        %i_i64 = arith.index_cast %i : index to i64
+        %addr = llvm.getelementptr %host_rb[%i_i64] : (!llvm.ptr, i64) -> !llvm.ptr, f32
+        %v = llvm.load %addr : !llvm.ptr -> f32
+        %ne = arith.cmpf une, %v, %expected : f32
+        %new_nfail = scf.if %ne -> i32 {
+          // Print each mismatch
+          %fmt_fail = llvm.mlir.addressof @msg_fail : !llvm.ptr
+          %v64 = arith.extf %v : f32 to f64
+          %e64 = arith.extf %expected : f32 to f64
+          llvm.call @printf(%fmt_fail, %rank, %i_i64, %v64, %e64) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32, i64, f64, f64) -> i32
+          %inc = arith.addi %nfail_acc, %c1_i32 : i32
+          scf.yield %inc : i32
+        } else {
+          scf.yield %nfail_acc : i32
+        }
+        scf.yield %new_nfail : i32
+      }
+
+      // If no failures, print PASS
+      %ok = arith.cmpi eq, %nfail, %c0_i32 : i32
+      scf.if %ok {
+        %fmt_pass = llvm.mlir.addressof @msg_pass : !llvm.ptr
+        %e64 = arith.extf %expected : f32 to f64
+        llvm.call @printf(%fmt_pass, %rank, %peer, %e64) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32, i32, f64) -> i32
+      }
+
+      // Cleanup
+      func.call @free(%host_rb) : (!llvm.ptr) -> ()
+      func.call @mgpuMemFree(%local_copy, %nullptr) : (!llvm.ptr, !llvm.ptr) -> ()
+    } else {
+      %fmt_only1 = llvm.mlir.addressof @msg_only1 : !llvm.ptr
+      llvm.call @printf(%fmt_only1, %rank) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32) -> i32
+    }
+
+    func.call @mgpuBarrier() : () -> ()
+
+    // Cleanup
+    func.call @free(%hostbuf) : (!llvm.ptr) -> ()
+    func.call @mgpuSymmetricFree(%buf, %nullptr) : (!llvm.ptr, !llvm.ptr) -> ()
+    func.call @mgpuSymmetricHeapDestroy() : () -> ()
+
+    %fmt_done = llvm.mlir.addressof @msg_done : !llvm.ptr
+    llvm.call @printf(%fmt_done, %rank) vararg(!llvm.func<i32 (ptr, ...)>) : (!llvm.ptr, i32) -> i32
+
+    return
+  }
+}
diff --git a/test/gpu/symmetric_heap_dma/run.sh b/test/gpu/symmetric_heap_dma/run.sh
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+#===- run.sh - Multi-process symmetric-heap DMA e2e test --*-
+#
+# Copyright (C) 2026, Advanced Micro Devices, Inc. All rights reserved.
+# SPDX-License-Identifier: MIT
+#
+#===------------------------------------------------------------------===//
+#
+# Compile and run the hand-written symmetric-heap MLIR test as N processes.
+# Each process executes the full IR; processes coordinate via the symmetric
+# heap (XGMI peer-mapped VMem buffers).
+#
+# Usage: run.sh [num_ranks]   (default: 2)
+#
+# Required environment (auto-detected when sourced via env_setup_gpu.sh):
+#   MLIR_AIR_INSTALL_DIR  - path containing lib/libairgpu.so
+#   LLVM_INSTALL_DIR      - path containing bin/mlir-opt + lib/libmlir_*.so
+#
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+NUM_RANKS=${1:-2}
+# Set SHARE_GPU=1 to make all ranks use GPU 0 (single-GPU test machines).
+# Default: each rank uses its own GPU (LOCAL_RANK=$i).
+SHARE_GPU=${SHARE_GPU:-0}
+TMPDIR="${TMPDIR:-/tmp/air_sym_dma}"
+mkdir -p "$TMPDIR"
+
+LLVM_LIB_DIR="${LLVM_INSTALL_DIR:-$(dirname "$(which mlir-opt)")/..}/lib"
+AIRGPU_LIB="${MLIR_AIR_INSTALL_DIR:-$(dirname "$(which air-opt)")/..}/lib/libairgpu.so"
+
+echo "Step 1: Lower hand-written IR to LLVM dialect"
+mlir-opt "$SCRIPT_DIR/air_sym_handwritten.mlir" \
+    --pass-pipeline='builtin.module(func.func(convert-scf-to-cf),convert-to-llvm,reconcile-unrealized-casts)' \
+    -o "$TMPDIR/sym_lowered.mlir"
+
+echo "Step 2: Run as ${NUM_RANKS} processes"
+export AIRGPU_JOB_ID="${AIRGPU_JOB_ID:-$$}"
+
+PIDS=()
+PASS=1
+
+for i in $(seq 0 $((NUM_RANKS - 1))); do
+  if [ "$SHARE_GPU" = "1" ]; then
+    LR=0
+  else
+    LR=$i
+  fi
+  (set -o pipefail
+   RANK=$i WORLD_SIZE=$NUM_RANKS LOCAL_RANK=$LR \
+   mlir-runner --entry-point-result=void \
+       --shared-libs="$LLVM_LIB_DIR/libmlir_rocm_runtime.so" \
+       --shared-libs="$AIRGPU_LIB" \
+       --shared-libs="$LLVM_LIB_DIR/libmlir_runner_utils.so" \
+       --shared-libs="$LLVM_LIB_DIR/libmlir_c_runner_utils.so" \
+       "$TMPDIR/sym_lowered.mlir" 2>&1 | sed "s/^/[rank $i] /") &
+  PIDS+=($!)
+done
+
+for pid in "${PIDS[@]}"; do
+  if ! wait "$pid"; then
+    PASS=0
+  fi
+done
+
+if [ $PASS -eq 1 ]; then
+  echo "=== ALL ${NUM_RANKS} RANKS PASSED ==="
+else
+  echo "=== SOME RANKS FAILED ==="
+  exit 1
+fi
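
The rank-to-environment mapping that the driver applies before launching each `mlir-runner` can be looked at in isolation. The sketch below uses a hypothetical helper (`rank_env` is not part of `run.sh`) that prints the variables a given rank would receive, including the `SHARE_GPU=1` case where every rank is pinned to GPU 0.

```bash
#!/usr/bin/env bash
# Sketch only: rank_env is a hypothetical helper, not part of run.sh.
# usage: rank_env <rank> <num_ranks> [share_gpu]
rank_env() {
  local rank=$1 num_ranks=$2 share_gpu=${3:-0}
  local local_rank=$rank          # default: one GPU per rank
  if [ "$share_gpu" = "1" ]; then
    local_rank=0                  # single-GPU host: all ranks use GPU 0
  fi
  echo "RANK=$rank WORLD_SIZE=$num_ranks LOCAL_RANK=$local_rank"
}

# Two ranks sharing one GPU, as in the single-GPU validation run.
for i in 0 1; do
  rank_env "$i" 2 1
done
```

Keeping `RANK` distinct while collapsing `LOCAL_RANK` is what lets two processes exercise the cross-rank path on a single device.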