pytest crashes for linux64+CUDA+MKL #348

Closed
@h-vetinari

Description


Under some bizarre combination of circumstances, exactly one job has been failing since #318 was merged. The last passing run was dfadf15, after merging #331.

Stack trace:
+ OMP_NUM_THREADS=4
+ python -m pytest -n 2 test/test_autograd.py test/test_autograd_fallback.py test/test_custom_ops.py test/test_linalg.py test/test_mkldnn.py test/test_modules.py test/test_nn.py test/test_torch.py test/test_xnnpack_integration.py -k 'not ((TestTorch and test_print) or (TestAutograd and test_profiler_seq_nr) or (TestAutograd and test_profiler_propagation) or test_mutable_custom_op_fixed_layout or test_BCELoss_weights_no_reduce_cuda or test_ctc_loss_cudnn_tensor_cuda  or (TestTorch and test_index_add_correctness) or test_sdpa_inference_mode_aot_compile or (TestNN and test_grid_sample) or test_indirect_device_assert or (GPUTests and test_scatter_reduce2) or (TestLinalgCPU and test_inverse_errors_large_cpu) or test_reentrant_parent_error_on_cpu_cuda) or test_base_does_not_require_grad_mode_nothing or test_base_does_not_require_grad_mode_warn or test_composite_registered_to_cpu_mode_nothing)' -m 'not hypothesis' --durations=50
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: $SRC_DIR
plugins: rerunfailures-15.0, hypothesis-6.125.1, flakefinder-1.1.0, xdist-3.6.1
created: 2/2 workers
workers [8992 items]

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/_pytest/main.py", line 283, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>                          ^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/_pytest/main.py", line 337, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/pluggy/_hooks.py", line 513, in __call__
INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/pluggy/_manager.py", line 120, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
INTERNALERROR>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 139, in _multicall
INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/_pytest/logging.py", line 803, in pytest_runtestloop
INTERNALERROR>     return (yield)  # Run all the tests.
INTERNALERROR>             ^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 122, in _multicall
INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
INTERNALERROR>     ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/_pytest/terminal.py", line 673, in pytest_runtestloop
INTERNALERROR>     result = yield
INTERNALERROR>              ^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/pluggy/_callers.py", line 103, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>           ^^^^^^^^^^^^^^^^^^^^^^^^^
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
INTERNALERROR>     self.loop_once()
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/dsession.py", line 163, in loop_once
INTERNALERROR>     call(**kwargs)
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/dsession.py", line 306, in worker_collectionfinish
INTERNALERROR>     self.sched.schedule()
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/scheduler/load.py", line 295, in schedule
INTERNALERROR>     self._send_tests(node, node_chunksize)
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/scheduler/load.py", line 307, in _send_tests
INTERNALERROR>     node.send_runtest_some(tests_per_node)
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/workermanage.py", line 355, in send_runtest_some
INTERNALERROR>     self.sendcommand("runtests", indices=indices)
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/xdist/workermanage.py", line 374, in sendcommand
INTERNALERROR>     self.channel.send((name, kwargs))
INTERNALERROR>   File "$PREFIX/lib/python3.11/site-packages/execnet/gateway_base.py", line 911, in send
INTERNALERROR>     raise OSError(f"cannot send to {self!r}")
INTERNALERROR> OSError: cannot send to <Channel id=3 closed>
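
The bottom of the trace is execnet refusing to send on a channel that has already been closed (presumably because the worker process went away). A minimal sketch of that failure mode outside pytest-xdist, using plain execnet with a local "popen" worker; the payload only mimics what xdist's sendcommand passes along:

    import execnet

    gw = execnet.makegateway("popen")                 # local subprocess worker
    ch = gw.remote_exec("channel.send('ready')")      # remote code finishes after one send
    print(ch.receive())                               # 'ready'
    ch.close()                                        # channel now closed, like the dead xdist worker
    try:
        ch.send(("runtests", {"indices": [0, 1]}))    # roughly what xdist's sendcommand sends
    except OSError as exc:
        print(exc)                                    # cannot send to <Channel id=... closed>
    gw.exit()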

In the meantime, several other fixes have landed and various attempts at unbreaking the build have been made.

For now I'm planning to work around this, but I'm opening this issue to concentrate the discussion in one place.

I double-checked for any changes in the pytest-related package versions, but both the passing and failing runs had:

    execnet:                     2.1.1-pyhd8ed1ab_1                   conda-forge
    [...]
    pytest:                      8.3.4-pyhd8ed1ab_1                   conda-forge
    pytest-flakefinder:          1.1.0-pyh29332c3_2                   conda-forge
    pytest-rerunfailures:        15.0-pyhd8ed1ab_1                    conda-forge
    pytest-xdist:                3.6.1-pyhd8ed1ab_1                   conda-forge
    python:                      3.12.8-h9e4cc4f_1_cpython            conda-forge
    python-dateutil:             2.9.0.post0-pyhff2d567_1             conda-forge
    python_abi:                  3.12-5_cp312                         conda-forge
    pytorch:                     2.5.1-cuda126_mkl_py312_hdbe889e_310 local

Another observation

Incredibly, this really does seem to be MKL-specific somehow: the openblas builds in #326 passed, while the MKL builds ran into the pytest error (the same situation as in CI after merging #340).

Could whatever "channels" execnet is trying to use somehow be getting occupied by MKL?

OSError: cannot send to <Channel id=3 closed>

(Win+CUDA+MKL is fine, linux+CUDA+openblas is fine, linux+CPU+MKL is fine)
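
For reference, a quick way to confirm which BLAS backend a particular local build actually picked up (standard torch introspection APIs; this only verifies the MKL-vs-openblas split above, it says nothing about the channel failure itself):

    import torch

    print(torch.backends.mkl.is_available())     # expected True on the MKL builds, False on openblas
    print(torch.backends.mkldnn.is_available())  # oneDNN / mkldnn support
    print(torch.__config__.show())               # compile-time config, incl. BLAS/LAPACK backend
    print(torch.__config__.parallel_info())      # runtime OpenMP/MKL thread settings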
