Commit 7e629e1

erwei-xilinx and claude committed
[multi-gpu] Phase 2: remove SHARE_GPU; fail-fast precondition
Drop the SHARE_GPU=1 escape hatch from run.sh. Colocating ranks on a single GPU silently bypasses the symmetric-heap / XGMI path, which is exactly what the test exists to validate, and reports false-positive PASSes.

Replace it with a precondition check that exits non-zero when fewer GPUs are visible than ranks were requested.

Validated on rad-mi325x-1 (8x MI325X) at WORLD_SIZE=2,4,8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
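The fail-fast contract described in the commit message can be sketched as a small function. This is only an illustration: `require_gpus_for_ranks` is a hypothetical name that does not appear in run.sh, and it uses `return 1` where the script itself calls `exit 1`, so that it can be sourced and exercised without terminating the shell.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of run.sh): succeed only when there is at
# least one GPU per requested rank. Arguments are plain integers.
require_gpus_for_ranks() {
    local num_gpus="$1" num_ranks="$2"
    if [ "$num_gpus" -lt "$num_ranks" ]; then
        # Diagnostics go to stderr so they are not mistaken for test output.
        echo "ERROR: need >= $num_ranks GPUs; found $num_gpus." >&2
        return 1
    fi
    return 0
}
```

A caller would pass the detected GPU count and the requested rank count, for example `require_gpus_for_ranks "$NUM_GPUS" "$NUM_RANKS" || exit 1`.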
1 parent 904d279 commit 7e629e1

1 file changed

Lines changed: 16 additions & 9 deletions

test/gpu/symmetric_heap_dma/run.sh

@@ -21,12 +21,24 @@ set -e
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 NUM_RANKS=${1:-2}
-# Set SHARE_GPU=1 to make all ranks use GPU 0 (single-GPU test machines).
-# Default: each rank uses its own GPU (LOCAL_RANK=$i).
-SHARE_GPU=${SHARE_GPU:-0}
 TMPDIR="${TMPDIR:-/tmp/air_sym_dma}"
 mkdir -p "$TMPDIR"
 
+# Refuse to run if there aren't enough physically distinct GPUs for one
+# rank per GPU. Colocating ranks on a single GPU would make XGMI/peer-VA
+# transparently fall back to local memory and produce false-positive PASSes.
+if [ -n "${HIP_VISIBLE_DEVICES:-}" ]; then
+  NUM_GPUS=$(echo "$HIP_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .)
+else
+  NUM_GPUS=$(grep -l '^simd_count [1-9]' /sys/class/kfd/kfd/topology/nodes/*/properties 2>/dev/null | wc -l)
+fi
+if [ "$NUM_GPUS" -lt "$NUM_RANKS" ]; then
+  echo "ERROR: need >= $NUM_RANKS GPUs to validate cross-rank XGMI traffic; found $NUM_GPUS." >&2
+  echo " This test refuses to colocate ranks on a single GPU because it would" >&2
+  echo " silently bypass the symmetric-heap path and report false PASSes." >&2
+  exit 1
+fi
+
 LLVM_LIB_DIR="${LLVM_INSTALL_DIR:-$(dirname "$(which mlir-opt)")/..}/lib"
 AIRGPU_LIB="${MLIR_AIR_INSTALL_DIR:-$(dirname "$(which air-opt)")/..}/lib/libairgpu.so"
 
@@ -42,13 +54,8 @@ PIDS=()
 PASS=1
 
 for i in $(seq 0 $((NUM_RANKS - 1))); do
-  if [ "$SHARE_GPU" = "1" ]; then
-    LR=0
-  else
-    LR=$i
-  fi
   (set -o pipefail
-   RANK=$i WORLD_SIZE=$NUM_RANKS LOCAL_RANK=$LR \
+   RANK=$i WORLD_SIZE=$NUM_RANKS LOCAL_RANK=$i \
    mlir-runner --entry-point-result=void \
      --shared-libs="$LLVM_LIB_DIR/libmlir_rocm_runtime.so" \
      --shared-libs="$AIRGPU_LIB" \
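The two GPU-counting branches in the first hunk can be pulled into functions and exercised without real hardware. The sketch below is not the committed code, and both function names are hypothetical. It makes two assumptions. First, the KFD nodes directory is passed as an argument so that a fake tree can stand in for `/sys/class/kfd/kfd/topology/nodes`. Second, duplicate entries in the device list are collapsed, whereas the committed `grep -c .` counts every non-empty entry. The `simd_count [1-9]` filter is kept as in the diff: it selects nodes reporting a non-zero SIMD count, which appears to be how the script tells GPU nodes from CPU-only nodes.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count KFD topology nodes whose properties file reports
# a non-zero simd_count. The directory is a parameter (assumption) so the
# logic can be tested against a fake tree.
count_kfd_gpus() {
    local nodes_dir="${1:-/sys/class/kfd/kfd/topology/nodes}"
    local n=0 f
    for f in "$nodes_dir"/*/properties; do
        # An unmatched glob stays literal and is skipped here.
        [ -r "$f" ] || continue
        if grep -q '^simd_count [1-9]' "$f"; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Hypothetical helper: count distinct, non-empty entries in a comma-separated
# device list such as the value of HIP_VISIBLE_DEVICES. De-duplication is an
# assumption of this sketch, not behavior of the committed script.
count_visible_devices() {
    printf '%s\n' "$1" | tr ',' '\n' |
        awk 'NF && !seen[$0]++ { n++ } END { print n + 0 }'
}
```

With these in place, the precondition would compare either `count_visible_devices "$HIP_VISIBLE_DEVICES"` or `count_kfd_gpus` against `NUM_RANKS`, mirroring the if/else in the diff.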
