Commit d79a2a8
AZP: TEMP DEBUG - pytest -n 1 to avoid GPU Exclusive_Process contention
Run M (MPS bypass via CUDA_MPS_PIPE_DIRECTORY override) was a big
breakthrough: the 240s hangs are gone, tests now finish in ~50s and
fail with legible errors.
wheel-tests summary line: `30 failed, 236 passed, 2 skipped, 13
warnings in 49.94s`. All 30 failures share one error:
cupy_backends.cuda.api.runtime.CUDARuntimeError:
cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or
unavailable
Diagnosis: the MLNX A40 GPUs are in `Exclusive_Process` compute mode.
With MPS in front, multiple processes can share the device. With MPS
bypassed, only one process at a time can claim it. xdist `-n 4`
spawns four pytest workers; each tries `cupy.empty()` -> `cudaMalloc`;
three lose with `cudaErrorDevicesUnavailable`.
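The diagnosis above can be turned into a worker-count check. This is a minimal sketch, not part of this commit: `pick_xdist_workers` is a hypothetical helper, and the fallback value `unknown` and the hardcoded `no` for MPS (matching the bypass in Run M) are assumptions. The `nvidia-smi --query-gpu=compute_mode` query is only run when the tool is present.

```shell
# Hypothetical helper (not in ci/run_python.sh): pick an xdist worker count.
#   $1 = GPU compute mode string, e.g. "Default" or "Exclusive_Process"
#   $2 = "yes" if MPS sits in front of the device, anything else otherwise
pick_xdist_workers() {
    mode="$1"
    mps_active="$2"
    if [ "$mode" = "Exclusive_Process" ] && [ "$mps_active" != "yes" ]; then
        echo 1    # only one process at a time can claim the device
    else
        echo 4    # the count run_python.sh hardcodes
    fi
}

# Read the compute mode of the first GPU when nvidia-smi is available.
# "unknown" is an assumed placeholder for hosts without the tool.
gpu_mode="unknown"
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_mode=$(nvidia-smi --query-gpu=compute_mode --format=csv,noheader 2>/dev/null | head -n 1) || gpu_mode="unknown"
fi

# MPS is bypassed in this scenario, so "no" is passed for the second argument.
echo "compute mode: ${gpu_mode}, workers: $(pick_xdist_workers "$gpu_mode" no)"
```

On the A40 runners described here, the expectation would be a worker count of 1 while MPS stays bypassed.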
Fix: re-introduce the `sed -i 's/pytest -n 4/pytest -n 1/g'
ci/run_python.sh` patch we had in earlier runs (dropped in Run L
during cleanup). Single worker -> no GPU race.
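One caveat with the sed patch: `sed` exits 0 even when the pattern matches nothing, so a change to the hardcoded `-n 4` would silently bring the race back. A minimal sketch of a guarded version follows, assuming GNU sed. `patch_xdist_workers` is a hypothetical wrapper, and the temp file with its single `pytest -n 4 tests/` line is a stand-in, since the real contents of ci/run_python.sh are not shown in this commit.

```shell
# Hypothetical wrapper around the sed patch from this commit (GNU sed -i).
# Returns non-zero if the file has no 'pytest -n 1' after patching,
# which means the pattern never matched.
patch_xdist_workers() {
    target="$1"
    sed -i 's/pytest -n 4/pytest -n 1/g' "$target"
    if ! grep -q 'pytest -n 1' "$target"; then
        echo "no 'pytest -n 1' in ${target} after patching" >&2
        return 1
    fi
}

# Demo on a stand-in file; this line is illustrative, not the real script.
demo=$(mktemp)
printf '%s\n' 'pytest -n 4 tests/' > "$demo"
patch_xdist_workers "$demo" && cat "$demo"   # prints: pytest -n 1 tests/
rm -f "$demo"
```

In CI the call would be `patch_xdist_workers ci/run_python.sh`, so a no-op patch fails the step instead of passing unnoticed.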
Applied to:
- test_python phase (under IS_GPU)
- test_wheel_ucxx / test_wheel_distributed_ucxx phases (run_python.sh
  is the script that hardcodes -n 4; sed-patching there covers both
  the cython and wheel test paths)
Not applied to:
- test_python_distributed phase: run_python_distributed.sh
  invocations don't pass -n at all (single process by default)
Distributed-ucxx failed Run M at test_ucxx_localcluster[True-ucx]
with "UCX is not initialized". Separate bug, not addressed here -
expect that leg to still fail.
1 parent c3346a7 commit d79a2a8
1 file changed
Lines changed: 8 additions & 1 deletion
Diff (line content not preserved in this capture):
- New lines 98-103: 6 lines added after original line 97.
- Original line 146 replaced by new line 152.
- New line 155 added after original line 148.