AZP: TEMP DEBUG - pytest -n 1 to avoid GPU Exclusive_Process contention

Alexey-Rivkin · Alexey-Rivkin · commit d79a2a8b5f65 · 2026-05-28T21:07:14.000+03:00
Run M (MPS bypass via CUDA_MPS_PIPE_DIRECTORY override) was a big
breakthrough: the 240s hangs are gone, tests now finish in ~50s and
fail with legible errors.

wheel-tests summary line: `30 failed, 236 passed, 2 skipped, 13
warnings in 49.94s`. All 30 failures share one error:

  cupy_backends.cuda.api.runtime.CUDARuntimeError:
  cudaErrorDevicesUnavailable: CUDA-capable device(s) is/are busy or
  unavailable

Diagnosis: the MLNX A40 GPUs are in `Exclusive_Process` compute mode.
With MPS in front, multiple processes can share the device. With MPS
bypassed, only one process at a time can claim it. xdist `-n 4`
spawns four pytest workers; each tries `cupy.empty()` -> `cudaMalloc`;
three lose with `cudaErrorDevicesUnavailable`.
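
To confirm the compute mode on a runner before blaming xdist, something
like the sketch below may help. `exclusive_gpus` is a hypothetical helper,
not part of this repo; it assumes `nvidia-smi --query-gpu=index,compute_mode
--format=csv,noheader` prints rows such as `0, Exclusive_Process`.

```shell
# Hypothetical helper (not in buildlib/tools): reads "index, compute_mode"
# CSV rows on stdin and prints the index of every GPU whose compute mode
# is Exclusive_Process.
exclusive_gpus() {
  awk -F', *' '$2 == "Exclusive_Process" { print $1 }'
}

# On a GPU host the rows would come from nvidia-smi (assumed output format):
#   nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader | exclusive_gpus
```

Any index printed here is a device that, without MPS in front, accepts
only one CUDA process at a time.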

Fix: re-introduce the `sed -i 's/pytest -n 4/pytest -n 1/g'
ci/run_python.sh` patch we had in earlier runs (dropped in Run L
during cleanup). Single worker -> no GPU race.
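
One caveat with the bare `sed -i`: it exits 0 even when the pattern is
absent, so an upstream change to `ci/run_python.sh` could silently bring
back `-n 4`. A guarded variant could look like the sketch below.
`force_single_worker` is a hypothetical name, not what this commit adds,
and it assumes GNU sed as found on the Linux runners.

```shell
# Hypothetical wrapper around the sed patch: reports failure if the
# script no longer contains "pytest -n 4" instead of silently doing nothing.
force_single_worker() {
  script="$1"
  if ! grep -q 'pytest -n 4' "$script"; then
    echo "[diag] no 'pytest -n 4' in ${script}; nothing patched" >&2
    return 1
  fi
  # Same substitution as in the commit (GNU sed in-place edit).
  sed -i 's/pytest -n 4/pytest -n 1/g' "$script"
  echo "[diag] patched ${script}: $(grep -c 'pytest -n 1' "$script") line(s) now use -n 1"
}
```

Usage would be `force_single_worker ci/run_python.sh || exit 1` in place
of the plain `sed` line. Note that a second call on an already patched
file returns 1 by design.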

Applied to:
- test_python phase (under IS_GPU)
- test_wheel_ucxx / test_wheel_distributed_ucxx phases (run_python.sh
  is the script that hardcodes -n 4; sed-patching there covers both
  the cython and wheel test paths).
- test_python_distributed phase: NOT patched - run_python_distributed.sh
  invocations don't pass -n at all (single process by default).

Distributed-ucxx failed Run M at test_ucxx_localcluster[True-ucx]
with "UCX is not initialized". Separate bug, not addressed here -
expect that leg to still fail.
diff --git a/buildlib/tools/test_ucxx.sh b/buildlib/tools/test_ucxx.sh
@@ -95,6 +95,12 @@ EOF
       # regardless of MPS.
       export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
       echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
+      # Run M with MPS bypass produced 30 real test failures with:
+      # `cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorDevicesUnavailable`
+      # GPUs are in Exclusive_Process compute mode; with MPS bypassed, only
+      # one process at a time can claim the device. xdist -n 4 means four
+      # workers race for cupy.empty() and three lose. Force single-worker.
+      sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
       export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
       export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
       export UCX_LOG_FILE=/tmp/ucx_%P.log
@@ -143,9 +149,10 @@ EOF
     fi
     chmod +x "$HOME/.local/bin/rapids-download-from-github"
     diag_container_identity
-    # MPS bypass for diagnostic - see test_python phase comment.
+    # MPS bypass + single-worker pytest (see test_python phase comments).
     export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
     echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
+    sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
     export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
     export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
     export UCX_LOG_FILE=/tmp/ucx_%P.log