Commit a06846f

AZP: TEMP DEBUG - probe hypothesis A: single-process pytest under MPS
Run O probes whether the UCXX-Python GPU failures are caused by xdist spawning multiple worker processes that each cuInit through MPS in quick succession (overflowing the daemon's accept queue / per-uid client limit). control.log on the host shows `Unable to accept connection` in a loop, which matches an MPS server unable to keep up with rapid-fire client connects.

Evidence base for this hypothesis:
- UCX C++ gtest works on the same MLNX hosts under MPS - single process, one cuInit, one MPS handshake.
- UCXX test_python uses pytest + xdist `-n 4`: pytest_main + 4 worker processes = 5 cuInit attempts in seconds.
- Build 123304 (first UCXX_tests GPU run on PR #11473, day one): same failure fingerprint as today - multiple workers go "node down: Not properly terminated" simultaneously on cupy variants of test_arr.py + test_ucx_address; numpy variants pass in parallel.
- Earlier `-n 1` runs (Run C) still failed: but `-n 1` is pytest_main + 1 worker = 2 cuInit attempts, still multi-process.

Probe:
- Remove the MPS bypass (CUDA_MPS_PIPE_DIRECTORY override) from test_python, test_python_distributed, and the wheel test phases. MPS is active in this run.
- Sed-patch `ci/run_python.sh`: replace `pytest -n 4` with `pytest -p no:xdist` (disables the xdist plugin entirely, single process pytest, single cuInit).
- run_python_distributed.sh already invokes pytest without `-n`, no patch needed there.

Pass criteria:
- If single-process pytest under MPS gets through the test_arr cupy variants + test_ucx_address without 240s hang, hypothesis A is confirmed. The real fix becomes: keep pytest single-process for GPU slices, or set CUDA_VISIBLE_DEVICES so workers don't all cuInit through MPS.
- If the same hang pattern repeats with one pytest process, A is wrong - look at the rapidsai-ci-conda image / cuda stack.
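One risk with the sed step in the probe is that `sed -i` exits successfully even when the pattern matches nothing, so a renamed or reworded pytest invocation would leave xdist enabled without any warning. A minimal sketch of a guarded version follows. The helper name `patch_disable_xdist` and the demo file contents are hypothetical and not part of the commit; GNU sed is assumed, as on a Linux CI image.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): apply the xdist-disabling patch,
# but fail loudly if the expected `pytest -n 4` text is not present.
patch_disable_xdist() {
    local script="$1"
    if ! grep -q 'pytest -n 4' "$script"; then
        echo "[diag] 'pytest -n 4' not found in ${script}; patch would be a no-op" >&2
        return 1
    fi
    # Same substitution as the commit: disable the xdist plugin entirely.
    sed -i 's/pytest -n 4/pytest -p no:xdist/g' "$script"
    # Print the patched lines so they show up in the CI log.
    grep -n 'pytest -p no:xdist' "$script"
}

# Demo on a stand-in file; the real ci/run_python.sh contents are not shown
# in the commit, so this line is only an illustration.
demo="$(mktemp)"
printf 'pytest -n 4 tests/\n' > "$demo"
patch_disable_xdist "$demo"
```

Called as `patch_disable_xdist ci/run_python.sh || exit 1`, this would stop the run before a probe that silently tested the wrong configuration.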
1 parent d79a2a8 commit a06846f

1 file changed

Lines changed: 15 additions & 24 deletions

buildlib/tools/test_ucxx.sh

@@ -84,23 +84,14 @@ EOF
 test_python)
 if [ "$IS_GPU" = "true" ]; then
 diag_container_identity
-# Run L proved container uid=61206 already (Azure auto-injects --user
-# mapping the host agent uid). MPS uid match is satisfied yet cuMalloc
-# on cupy buffers hangs 240s while non-cupy/numpy tests in the same
-# session pass. Suspect host MPS daemon is stuck. Diagnostic: point
-# the CUDA client at a non-existent MPS pipe dir; per CUDA docs the
-# client then falls back to direct GPU access (no MPS). If tests pass
-# under bypass, MPS daemon on the host is the culprit and needs a
-# restart (host-side fix). If they still hang, cupy/conda is broken
-# regardless of MPS.
-export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
-echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
-# Run M with MPS bypass produced 30 real test failures with:
-# `cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorDevicesUnavailable`
-# GPUs are in Exclusive_Process compute mode; with MPS bypassed, only
-# one process at a time can claim the device. xdist -n 4 means four
-# workers race for cupy.empty() and three lose. Force single-worker.
-sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
+# Run O probe (hypothesis A): MPS active (no CUDA_MPS_PIPE_DIRECTORY
+# override). Drop xdist entirely so pytest runs in a single process,
+# makes one cuInit through MPS, no fork/spawn fan-out. UCX gtest works
+# under MPS as a single-process client; if single-process pytest also
+# works, hypothesis A is confirmed: MPS daemon can't sustain
+# multi-process cuInit fan-out from xdist workers (control.log shows
+# `Unable to accept connection` loop).
+sed -i 's/pytest -n 4/pytest -p no:xdist/g' ci/run_python.sh
 export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
 export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
 export UCX_LOG_FILE=/tmp/ucx_%P.log
@@ -115,9 +106,10 @@ EOF
 printf '#!/bin/bash\necho "%s"\n' "$PYTHON_CHANNEL_DIR" > "$HOME/.local/bin/rapids-download-from-github"
 chmod +x "$HOME/.local/bin/rapids-download-conda-from-github" "$HOME/.local/bin/rapids-download-from-github"
 diag_container_identity
-# MPS bypass for diagnostic - see test_python phase comment.
-export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
-echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
+# No MPS bypass for Run O. run_python_distributed.sh does not use xdist
+# in the test invocation - already single-process - so no patch needed
+# here. Failure under MPS for the distributed leg is a separate signal
+# (test_ucxx_localcluster spawns dask workers; those are extra procs).
 export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
 export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
 export UCX_LOG_FILE=/tmp/ucx_%P.log
@@ -149,10 +141,9 @@ EOF
 fi
 chmod +x "$HOME/.local/bin/rapids-download-from-github"
 diag_container_identity
-# MPS bypass + single-worker pytest (see test_python phase comments).
-export CUDA_MPS_PIPE_DIRECTORY="/tmp/no-mps-svc-${RANDOM}"
-echo "[diag] CUDA_MPS_PIPE_DIRECTORY=${CUDA_MPS_PIPE_DIRECTORY} (MPS bypass)"
-sed -i 's/pytest -n 4/pytest -n 1/g' ci/run_python.sh
+# Run O probe (hypothesis A): MPS active, drop xdist for single-process
+# pytest. See test_python phase comment for rationale.
+sed -i 's/pytest -n 4/pytest -p no:xdist/g' ci/run_python.sh
 export UCX_LOG_LEVEL=${UCX_LOG_LEVEL:-DEBUG}
 export UCXPY_LOG_LEVEL=${UCXPY_LOG_LEVEL:-DEBUG}
 export UCX_LOG_FILE=/tmp/ucx_%P.log
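
Hypothesis A rests on the `Unable to accept connection` loop seen in control.log, so comparing the number of such lines before and after a run could show whether single-process pytest actually stops the accept failures. The sketch below is one way to take that count. The function name `count_accept_failures` and the demo log lines are hypothetical, and the real log location is an assumption: MPS writes control.log under the directory named by `CUDA_MPS_LOG_DIRECTORY`, which is believed to default to `/var/log/nvidia-mps` but should be checked on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): count MPS accept failures in a log.
count_accept_failures() {
    local log="$1"
    if [ ! -r "$log" ]; then
        echo "[diag] cannot read ${log}" >&2
        return 2
    fi
    # grep -c still prints 0 when nothing matches but exits non-zero;
    # keep the printed count and discard that status.
    grep -c 'Unable to accept connection' "$log" || true
}

# Demo on fabricated sample lines; real control.log entries carry
# timestamps and other fields that are not reproduced here.
demo_log="$(mktemp)"
printf '%s\n' \
    'sample line: Unable to accept connection' \
    'sample line: unrelated message' \
    'sample line: Unable to accept connection' > "$demo_log"
count_accept_failures "$demo_log"
```

On the host this might be invoked as `count_accept_failures "${CUDA_MPS_LOG_DIRECTORY:-/var/log/nvidia-mps}/control.log"` once before the test phase and once after, with the difference echoed as a `[diag]` line.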
