Commit a06846f
AZP: TEMP DEBUG - probe hypothesis A: single-process pytest under MPS

Run O probes whether the UCXX-Python GPU failures are caused by xdist
spawning multiple worker processes that each cuInit through MPS in
quick succession (overflowing the daemon's accept queue / per-uid
client limit). control.log on the host shows
`Unable to accept connection` in a loop, which matches an MPS server
unable to keep up with rapid-fire client connects.
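One way to put a number on the `Unable to accept connection` loop is to count matching lines in control.log. The sketch below is illustrative only: the helper name `count_accept_failures` and the sample lines are made up for the example, and only the quoted message string comes from the host log.

```python
def count_accept_failures(log_text: str,
                          needle: str = "Unable to accept connection") -> int:
    """Count log lines that contain the MPS accept-failure message."""
    return sum(1 for line in log_text.splitlines() if needle in line)


# Hypothetical excerpt; real control.log lines carry timestamps and more detail.
sample_log = "\n".join([
    "line A: Unable to accept connection",
    "line B: some other message",
    "line C: Unable to accept connection",
])

print(count_accept_failures(sample_log))
```

Comparing this count between a `-n 4` run and a single-process run would show whether the accept failures track the number of client processes.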

Evidence base for this hypothesis:
- UCX C++ gtest works on the same MLNX hosts under MPS - single
process, one cuInit, one MPS handshake.
- UCXX test_python uses pytest + xdist `-n 4`: pytest_main +
4 worker processes = 5 cuInit attempts in seconds.
- Build 123304 (first UCXX_tests GPU run on PR #11473, day one):
same failure fingerprint as today - multiple workers go
"node down: Not properly terminated" simultaneously on cupy
variants of test_arr.py + test_ucx_address; numpy variants pass
in parallel.
- Earlier `-n 1` runs (Run C) still failed, but `-n 1` is
pytest_main + 1 worker = 2 cuInit attempts, still multi-process.
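The process counts used in the evidence above follow one rule: the pytest main process plus one process per xdist worker. A minimal sketch of that arithmetic, assuming (as this commit message does) that every one of those processes calls cuInit; the function name `cuinit_attempts` is hypothetical.

```python
def cuinit_attempts(xdist_workers: int) -> int:
    """Processes assumed to call cuInit for a given xdist worker count.

    Assumption taken from the commit message: the pytest main process and
    every xdist worker each initialize CUDA once. Pass 0 for a run with
    xdist disabled (`-p no:xdist`).
    """
    if xdist_workers < 0:
        raise ValueError("worker count cannot be negative")
    return 1 + xdist_workers


for label, workers in [("-n 4", 4), ("-n 1", 1), ("-p no:xdist", 0)]:
    print(f"{label}: {cuinit_attempts(workers)} cuInit attempt(s)")
```

Only the last configuration reaches a single MPS handshake, which is why Run C (`-n 1`) does not rule out hypothesis A.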

Probe:
- Remove the MPS bypass (CUDA_MPS_PIPE_DIRECTORY override) from
test_python, test_python_distributed, and the wheel test phases.
MPS is active in this run.
- Sed-patch `ci/run_python.sh`: replace `pytest -n 4` with
`pytest -p no:xdist` (disables the xdist plugin entirely, single
process pytest, single cuInit).
- run_python_distributed.sh already invokes pytest without `-n`, no
patch needed there.
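The substitution in the second probe step can be sketched as a plain string replacement. This is an illustration in Python rather than the sed command the commit actually uses; the helper name `disable_xdist` and the sample script line are hypothetical and do not reflect the real contents of `ci/run_python.sh`.

```python
def disable_xdist(script_text: str) -> str:
    """Swap `pytest -n 4` for `pytest -p no:xdist` in a script's text.

    Raises if the pattern is absent, so a silent no-op patch cannot be
    mistaken for a single-process run.
    """
    old, new = "pytest -n 4", "pytest -p no:xdist"
    if old not in script_text:
        raise ValueError(f"pattern {old!r} not found; nothing was patched")
    return script_text.replace(old, new)


# Hypothetical script line, for illustration only.
sample_script = "pytest -n 4 -v tests/\n"
print(disable_xdist(sample_script), end="")
```

Failing loudly on a missing pattern is a deliberate choice here: a sed substitution that matches nothing exits successfully, and the probe would then still run with four workers.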

Pass criteria:
- If single-process pytest under MPS gets through the test_arr cupy
variants + test_ucx_address without the 240s hang, hypothesis A is
confirmed. The real fix becomes: keep pytest single-process for
GPU slices, or set CUDA_VISIBLE_DEVICES so workers don't all
cuInit through MPS.
- If the same hang pattern repeats with one pytest process, A is
wrong - look at the rapidsai-ci-conda image / cuda stack.
1 parent d79a2a8 commit a06846f
1 file changed
Lines changed: 15 additions & 24 deletions
| Hunk | Original file lines removed | New file lines added |
|---|---|---|
| 1 | 87-103 (17 lines) | 87-94 (8 lines) |
| 2 | 118-120 (3 lines) | 109-112 (4 lines) |
| 3 | 152-155 (4 lines) | 144-146 (3 lines) |