Description
Through some bizarre combination of circumstances, exactly one job has been failing since merging #318. The last passing run was dfadf15, after merging #331.
Stacktrace
+ OMP_NUM_THREADS=4
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_reentrant_parent_error_on_cpu_cuda) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: $SRC_DIR
plugins: rerunfailures-15.0, hypothesis-6.125.1, flakefinder-1.1.0, xdist-3.6.1
created: 2/2 workers
2 workers [8992 items]
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/_pytest/main.py", line 283, in wrap_session
INTERNALERROR> session.exitstatus = doit(config, session) or 0
INTERNALERROR> ^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/_pytest/main.py", line 337, in _main
INTERNALERROR> config.hook.pytest_runtestloop(session=session)
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
INTERNALERROR> return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
INTERNALERROR> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
INTERNALERROR> return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 139, in _multicall
INTERNALERROR> raise exception.with_traceback(exception.__traceback__)
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR> teardown.throw(exception) # type: ignore[union-attr]
INTERNALERROR> ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/_pytest/logging.py", line 803, in pytest_runtestloop
INTERNALERROR> return (yield) # Run all the tests.
INTERNALERROR> ^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR> teardown.throw(exception) # type: ignore[union-attr]
INTERNALERROR> ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/_pytest/terminal.py", line 673, in pytest_runtestloop
INTERNALERROR> result = yield
INTERNALERROR> ^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
INTERNALERROR> res = hook_impl.function(*args)
INTERNALERROR> ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
INTERNALERROR> self.loop_once()
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/dsession.py", line 163, in loop_once
INTERNALERROR> call(**kwargs)
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/dsession.py", line 306, in worker_collectionfinish
INTERNALERROR> self.sched.schedule()
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/scheduler/load.py", line 295, in schedule
INTERNALERROR> self._send_tests(node, node_chunksize)
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/scheduler/load.py", line 307, in _send_tests
INTERNALERROR> node.send_runtest_some(tests_per_node)
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/workermanage.py", line 355, in send_runtest_some
INTERNALERROR> self.sendcommand("runtests", indices=indices)
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/xdist/workermanage.py", line 374, in sendcommand
INTERNALERROR> self.channel.send((name, kwargs))
INTERNALERROR> File "$PREFIX/lib/python3.11/site-packages/execnet/gateway_base.py", line 911, in send
INTERNALERROR> raise OSError(f"cannot send to {self!r}")
INTERNALERROR> OSError: cannot send to <Channel id=3 closed>
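For context on the error: the traceback shows the xdist controller failing in worker_collectionfinish -> _send_tests, i.e. right after collection finishes, when the controller first hands a chunk of tests to a worker. A channel that is already closed at that point usually means the worker process on the other end has already died. A minimal standalone sketch of how execnet produces this error (hypothetical example, not the feedstock's code):

import execnet

# Spawn a local Python subprocess, much like pytest-xdist does for
# each worker.
gw = execnet.makegateway()
channel = gw.remote_exec("channel.send(channel.receive() + 1)")

# Once the channel is closed -- whether explicitly like here, or
# because the remote worker crashed -- any further send() fails
# exactly as in the log above.
channel.close()
channel.send(41)  # OSError: cannot send to <Channel id=... closed>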
Since then, several other fixes have landed, and various attempts have been made to unbreak the build:
- Go green #344: reduce the diff to dfadf15 by reverting 9fcb3a7 & e1f50ac -- didn't work ❌
- Test if last passing run can be reproduced #345: go back to dfadf15 entirely -- worked ✅
- Test if last passing run can be reproduced, including test environment #347: go back to dfadf15 and avoid new dependencies -- worked ✅ (though given the previous point, the ambient dependency changes are evidently not directly responsible for the failure anyway)
- [v2.5.x] Fix stray bracket breaking pytest; fix include-patch for cross-compilation #346: disable pytest-xdist (see the sketch below) -- ❔ not yet verified, but should work
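A minimal sketch of what disabling pytest-xdist amounts to (a hypothetical reduced invocation -- the real run passes the full file list and -k selection shown above): without -n 2, pytest-xdist never spawns workers, so execnet channels are never involved.

import pytest

# Hypothetical reduced invocation: same tests, but no "-n 2", so
# pytest-xdist spawns no workers and execnet never comes into play.
ret = pytest.main([
    "test/test_autograd.py",
    "test/test_torch.py",
    "-m", "not hypothesis",
    "--durations=50",
])
raise SystemExit(int(ret))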
For now I'm planning to work around this, but I'm opening this issue to concentrate the discussion in one place.
I double-checked for any changes in the pytest-related package versions, but both the passing and the failing runs had:
execnet: 2.1.1-pyhd8ed1ab_1 conda-forge
[...]
pytest: 8.3.4-pyhd8ed1ab_1 conda-forge
pytest-flakefinder: 1.1.0-pyh29332c3_2 conda-forge
pytest-rerunfailures: 15.0-pyhd8ed1ab_1 conda-forge
pytest-xdist: 3.6.1-pyhd8ed1ab_1 conda-forge
python: 3.12.8-h9e4cc4f_1_cpython conda-forge
python-dateutil: 2.9.0.post0-pyhff2d567_1 conda-forge
python_abi: 3.12-5_cp312 conda-forge
pytorch: 2.5.1-cuda126_mkl_py312_hdbe889e_310 local
Another observation
Incredibly, this really seems to be MKL-specific somehow, as the openblas builds in #326 passed, while the MKL builds ran into the pytest error (same situation as the CI after merging #340).
Whatever "Channels" execnet is trying to use might somehow be getting occupied by MKL?
OSError: cannot send to <Channel id=3 closed>
(Win+CUDA+MKL is fine, Linux+CUDA+openblas is fine, Linux+CPU+MKL is fine -- only Linux+CUDA+MKL fails.)
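To confirm per build which BLAS backend a given torch package actually links against, a quick check via standard torch APIs (a diagnostic sketch, not specific to this feedstock):

import torch

# Check whether this build links against MKL, and inspect the full
# build configuration (look for BLAS_INFO in the output).
print(torch.backends.mkl.is_available())
print(torch.__config__.show())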