AZP: TEMP DEBUG - probe hypothesis A: single-process pytest under MPS

Alexey-Rivkin · Alexey-Rivkin · commit a06846fe9dec · 2026-05-28T21:07:14.000+03:00
Run O probes whether the UCXX-Python GPU failures are caused by xdist
spawning multiple worker processes that each cuInit through MPS in quick
succession (overflowing the daemon's accept queue / per-uid client
limit). control.log on the host shows `Unable to accept connection` in a
loop, which matches an MPS server unable to keep up with rapid-fire
client connects.

Evidence base for this hypothesis:
- UCX C++ gtest works on the same MLNX hosts under MPS - single process,
  one cuInit, one MPS handshake.
- UCXX test_python uses pytest + xdist `-n 4`: pytest_main + 4 worker
  processes = 5 cuInit attempts in seconds.
- Build 123304 (first UCXX_tests GPU run on PR #11473, day one): same
  failure fingerprint as today - multiple workers go "node down: Not
  properly terminated" simultaneously on cupy variants of test_arr.py +
  test_ucx_address; numpy variants pass in parallel.
- Earlier `-n 1` runs (Run C) still failed: but `-n 1` is pytest_main +
  1 worker = 2 cuInit attempts, still multi-process.

Probe:
- Remove the MPS bypass (CUDA_MPS_PIPE_DIRECTORY override) from
  test_python, test_python_distributed, and the wheel test phases. MPS
  is active in this run.
- Sed-patch `ci/run_python.sh`: replace `pytest -n 4` with
  `pytest -p no:xdist` (disables the xdist plugin entirely, single
  process pytest, single cuInit).
- run_python_distributed.sh already invokes pytest without `-n`, no
  patch needed there.

Pass criteria:
- If single-process pytest under MPS gets through the test_arr cupy
  variants + test_ucx_address without 240s hang, hypothesis A is
  confirmed. The real fix becomes: keep pytest single-process for GPU
  slices, or set CUDA_VISIBLE_DEVICES so workers don't all cuInit
  through MPS.
- If the same hang pattern repeats with one pytest process, A is wrong -
  look at the rapidsai-ci-conda image / cuda stack.
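
Note on the sed patch: `sed -i` exits 0 even when the pattern matches nothing, so if `ci/run_python.sh` ever stops containing the literal `pytest -n 4`, the probe would silently run with xdist still enabled. A minimal sketch of a guard follows. It runs against a temporary stand-in file, because the real contents of `ci/run_python.sh` are not shown in this commit; the sample pytest line and the variable names `tmp_script` and `patched_line` are assumptions for illustration only.

```shell
# Hypothetical stand-in for ci/run_python.sh (the real script is not shown here).
tmp_script="$(mktemp)"
printf 'timeout 10m pytest -n 4 --cache-clear "$@"\n' > "$tmp_script"

# Same substitution the probe applies (GNU sed in-place edit).
sed -i 's/pytest -n 4/pytest -p no:xdist/g' "$tmp_script"

# Confirm the substitution actually landed; sed alone will not report a miss.
if grep -q -- '-p no:xdist' "$tmp_script"; then
  patched_line="$(cat "$tmp_script")"
  echo "patched: ${patched_line}"
else
  echo "sed pattern did not match; xdist is still enabled" >&2
fi

rm -f "$tmp_script"
```

In the CI script the same `grep -q` check could follow the real `sed -i` call, so a miss shows up in the log rather than as a misleading probe result.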
diff --git a/buildlib/tools/test_ucxx.sh b/buildlib/tools/test_ucxx.sh
@@ -84,23 +84,14 @@ EOF
   test_python)
     if [ "$IS_GPU" = "true" ]; then
       diag_container_identity
-      # Run L proved container uid=61206 already (Azure auto-injects --user
-      # mapping the host agent uid). MPS uid match is satisfied yet cuMalloc
-      # on cupy buffers hangs 240s while non-cupy/numpy tests in the same
-      # session pass. Suspect host MPS daemon is stuck. Diagnostic: point
-      # the CUDA client at a non-existent MPS pipe dir; per CUDA docs the
-      # client then falls back to direct GPU access (no MPS). If tests pass
-      # under bypass, MPS daemon on the host is the culprit and needs a
-      # restart (host-side fix). If they still hang, cupy/conda is broken
-      # regardless of MPS.
-      export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
-      echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
-      # Run M with MPS bypass produced 30 real test failures with:
-      # `cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorDevicesUnavailable`
-      # GPUs are in Exclusive_Process compute mode; with MPS bypassed, only
-      # one process at a time can claim the device. xdist -n 4 means four
-      # workers race for cupy.empty() and three lose. Force single-worker.
-      sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
+      # Run O probe (hypothesis A): MPS active (no CUDA_MPS_PIPE_DIRECTORY
+      # override). Drop xdist entirely so pytest runs in a single process,
+      # makes one cuInit through MPS, no fork/spawn fan-out. UCX gtest works
+      # under MPS as a single-process client; if single-process pytest also
+      # works, hypothesis A is confirmed: MPS daemon can't sustain
+      # multi-process cuInit fan-out from xdist workers (control.log shows
+      # `Unable to accept connection` loop).
+      sed -i 's/pytest -n 4/pytest -p no:xdist/g' ci/run_python.sh
       export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
       export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
       export UCX_LOG_FILE=/tmp/ucx_%P.log
@@ -115,9 +106,10 @@ EOF
     printf '#!/bin/bash\necho "%s"\n' "$PYTHON_CHANNEL_DIR" > "$HOME/.local/bin/rapids-download-from-github"
     chmod +x "$HOME/.local/bin/rapids-download-conda-from-github" "$HOME/.local/bin/rapids-download-from-github"
     diag_container_identity
-    # MPS bypass for diagnostic - see test_python phase comment.
-    export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
-    echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
+    # No MPS bypass for Run O. run_python_distributed.sh does not use xdist
+    # in the test invocation - already single-process - so no patch needed
+    # here. Failure under MPS for the distributed leg is a separate signal
+    # (test_ucxx_localcluster spawns dask workers; those are extra procs).
     export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
     export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
     export UCX_LOG_FILE=/tmp/ucx_%P.log
@@ -149,10 +141,9 @@ EOF
     fi
     chmod +x "$HOME/.local/bin/rapids-download-from-github"
     diag_container_identity
-    # MPS bypass + single-worker pytest (see test_python phase comments).
-    export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
-    echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
-    sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
+    # Run O probe (hypothesis A): MPS active, drop xdist for single-process
+    # pytest. See test_python phase comment for rationale.
+    sed -i 's/pytest -n 4/pytest -p no:xdist/g' ci/run_python.sh
     export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
     export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
     export UCX_LOG_FILE=/tmp/ucx_%P.log