Commit d79a2a8

AZP: TEMP DEBUG - pytest -n 1 to avoid GPU Exclusive_Process contention
Run M (MPS bypass via CUDA_MPS_PIPE_DIRECTORY override) was a big breakthrough: the 240s hangs are gone, tests now finish in ~50s and fail with legible errors. wheel-tests summary line:

    `30 failed, 236 passed, 2 skipped, 13 warnings in 49.94s`

All 30 failures share one error:

    cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or unavailable

Diagnosis: the MLNX A40 GPUs are in `Exclusive_Process` compute mode. With MPS in front, multiple processes can share the device. With MPS bypassed, only one process at a time can claim it. xdist `-n 4` spawns four pytest workers; each tries `cupy.empty()` -> `cudaMalloc`; three lose with `cudaErrorDevicesUnavailable`.

Fix: re-introduce the `sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh` patch we had in earlier runs (dropped in Run L during cleanup). Single worker -> no GPU race.

Applied to:
- test_python phase (under IS_GPU)
- test_wheel_ucxx / test_wheel_distributed_ucxx phases (run_python.sh is the script that hardcodes -n 4; sed-patching there covers both the cython and wheel test paths).
- test_python_distributed phase: NOT patched - run_python_distributed.sh invocations don't pass -n at all (single process by default).

Distributed-ucxx failed Run M at test_ucxx_localcluster[True-ucx] with "UCX is not initialized". Separate bug, not addressed here - expect that leg to still fail.
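The compute-mode diagnosis could be confirmed directly on a runner. The following is a minimal sketch, not part of this commit: it assumes `nvidia-smi` is on PATH and that it reports the mode as the string `Exclusive_Process`. `any_exclusive_process` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: any_exclusive_process is a hypothetical helper, not in the commit.
# It reads one compute mode per line on stdin and succeeds if any line
# mentions Exclusive_Process.
any_exclusive_process() {
    grep -q 'Exclusive_Process'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # compute_mode is queried per GPU; one line of output per device.
    if nvidia-smi --query-gpu=compute_mode --format=csv,noheader | any_exclusive_process; then
        echo "[diag] Exclusive_Process GPU present: use pytest -n 1 while MPS is bypassed"
    else
        echo "[diag] no Exclusive_Process GPU reported"
    fi
else
    echo "[diag] nvidia-smi not found; skipping compute-mode check"
fi
```

Keeping the parsing in a function that reads stdin means the check can be exercised with canned input on a machine without a GPU.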
1 parent c3346a7 commit d79a2a8

1 file changed

Lines changed: 8 additions & 1 deletion

buildlib/tools/test_ucxx.sh
@@ -95,6 +95,12 @@ EOF
 # regardless of MPS.
 export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
 echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
+# Run M with MPS bypass produced 30 real test failures with:
+# `cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorDevicesUnavailable`
+# GPUs are in Exclusive_Process compute mode; with MPS bypassed, only
+# one process at a time can claim the device. xdist -n 4 means four
+# workers race for cupy.empty() and three lose. Force single-worker.
+sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
 export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
 export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
 export UCX_LOG_FILE=/tmp/ucx_%P.log
@@ -143,9 +149,10 @@ EOF
 fi
 chmod +x "$HOME/.local/bin/rapids-download-from-github"
 diag_container_identity
-# MPS bypass for diagnostic - see test_python phase comment.
+# MPS bypass + single-worker pytest (see test_python phase comments).
 export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
 echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
+sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
 export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
 export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
 export UCX_LOG_FILE=/tmp/ucx_%P.log
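
One caveat with the `sed -i` patch: sed exits 0 even when the pattern matches nothing, so if ci/run_python.sh ever changes its pytest invocation the patch would silently stop applying. A guard along the following lines might make that visible. This is a sketch assuming GNU sed; `force_single_worker` and the throwaway demo file are hypothetical and not part of this commit.

```shell
#!/usr/bin/env bash
# Sketch only: force_single_worker is a hypothetical helper, not in the commit.
force_single_worker() {
    local script="$1"
    # sed -i returns success even with zero substitutions, so check first.
    if ! grep -q 'pytest -n 4' "$script"; then
        echo "[diag] no 'pytest -n 4' in ${script}; patch not applied" >&2
        return 1
    fi
    sed -i 's/pytest -n 4/pytest -n 1/g' "$script"
    echo "[diag] patched ${script}: pytest -n 4 -> pytest -n 1"
}

# Demo against a temp file standing in for ci/run_python.sh
# (the real script's contents are not shown in this commit).
demo_script="$(mktemp)"
printf 'pytest -n 4 tests/\n' > "$demo_script"
force_single_worker "$demo_script"
cat "$demo_script"
```

In test_ucxx.sh the call would be `force_single_worker ci/run_python.sh`; whether a missing match should abort the phase or only log is a judgment call for a temporary debug commit.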
