Commit 7e629e1
[multi-gpu] Phase 2: remove SHARE_GPU; fail-fast precondition
Drop the SHARE_GPU=1 escape hatch from run.sh. Colocating ranks on a
single GPU silently bypasses the symmetric-heap / XGMI path, which is
exactly what the test exists to validate, and so reports false-positive
PASSes.
Replace it with a precondition check that exits non-zero when fewer
GPUs are visible than ranks were requested. Validated on rad-mi325x-1
(8x MI325X) at WORLD_SIZE=2,4,8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 904d279 commit 7e629e1
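The fail-fast precondition described in the commit message could look roughly like the sketch below. This is not the code from this commit: the function names `count_visible_gpus` and `require_gpus` are hypothetical, and the detection logic is an assumption (it reads the `HIP_VISIBLE_DEVICES` list when set, and otherwise counts DRM render nodes under `/dev/dri`, which only approximates the GPU count on a machine with mixed hardware). The actual run.sh may detect GPUs differently.

```shell
# Illustrative sketch only; names and detection logic are assumptions.

# Print the number of GPUs this process can see.
# Simplification: an empty HIP_VISIBLE_DEVICES is treated like an unset one.
count_visible_gpus() {
    local old_ifs n node
    if [ -n "${HIP_VISIBLE_DEVICES:-}" ]; then
        # Split the comma-separated device list into positional parameters.
        old_ifs=$IFS
        IFS=','
        set -- $HIP_VISIBLE_DEVICES
        IFS=$old_ifs
        echo "$#"
    else
        # Fallback assumption: one DRM render node per GPU.
        n=0
        for node in /dev/dri/renderD*; do
            if [ -e "$node" ]; then
                n=$((n + 1))
            fi
        done
        echo "$n"
    fi
}

# Return non-zero when fewer GPUs are visible than ranks requested.
# $1 = requested rank count (WORLD_SIZE), $2 = visible GPU count.
require_gpus() {
    local world_size="$1" visible="$2"
    if [ "$visible" -lt "$world_size" ]; then
        echo "ERROR: WORLD_SIZE=${world_size} but only ${visible} GPU(s) visible; refusing to colocate ranks" >&2
        return 1
    fi
}

# Possible call site near the top of a launcher script:
#   require_gpus "${WORLD_SIZE:-2}" "$(count_visible_gpus)" || exit 1
```

Keeping the comparison in `require_gpus` separate from detection means the check can be exercised with explicit numbers on a machine that has no GPUs at all.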
1 file changed
Lines changed: 16 additions & 9 deletions
| Hunk | Original file lines | New file lines | Deletions | Additions |
|---|---|---|---|---|
| 1 | 21-32 | 21-44 | 3 (original lines 24-26) | 15 (new lines 27-41) |
| 2 | 42-54 | 54-61 | 6 (original lines 45-49, 51) | 1 (new line 58) |