BUG: CI can hang in an obscure way #4114

Open
@ksagiyam

Description

Describe the bug
CI times out if I use 3 processes instead of 4 in tests/firedrake/regression/test_matrix_free.py::test_matrix_free_split_communicators, which is currently marked with @pytest.mark.parallel(nprocs=4).

Error message

See the test PR #4112 and its CI log (run with nprocs=3): https://github.com/firedrakeproject/firedrake/actions/runs/13809425042/job/38627368088#logs

4	Freeing comms in list (length 4)
4	Freeing 75_DUP_COMPILATION, with index 17, which has refcount 2
4	Freeing 75_DUP, with index 16, which has refcount 1
4	Deleting compilationcomm keyval on 75_DUP
4	Traceback (most recent call last):
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 159, in mpi4py.MPI.__pyx_fuse_1PyMPI_attr_delete_cb
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 121, in mpi4py.MPI.PyMPI_attr_delete
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 67, in mpi4py.MPI.PyMPI_attr_call
4	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pyop2/mpi.py", line 243, in delcomm_outer
4	    ocomm = icomm.Get_attr(outercomm_keyval)
4	            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4	  File "src/mpi4py/MPI.src/Comm.pyx", line 1781, in mpi4py.MPI.Comm.Get_attr
4	mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator
4	Freeing 20_DUP_COMPILATION, with index 15, which has refcount 2
4	Freeing 20_DUP, with index 14, which has refcount 1
4	Deleting compilationcomm keyval on 20_DUP
4	Traceback (most recent call last):
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 159, in mpi4py.MPI.__pyx_fuse_1PyMPI_attr_delete_cb
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 121, in mpi4py.MPI.PyMPI_attr_delete
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 67, in mpi4py.MPI.PyMPI_attr_call
4	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pyop2/mpi.py", line 243, in delcomm_outer
4	    ocomm = icomm.Get_attr(outercomm_keyval)
4	            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4	  File "src/mpi4py/MPI.src/Comm.pyx", line 1781, in mpi4py.MPI.Comm.Get_attr
4	mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator
4	STATE2
4	PYOP2 Communicator reference counts:
4	| Communicator name                      | Count |
4	==================================================
4	| 20_DUP                                 |     1 |
4	| 20_DUP_COMPILATION                     |     2 |
4	| 75_DUP                                 |     1 |
4	| 75_DUP_COMPILATION                     |     2 |
4	
4	Freeing comms in list (length 4)
4	Freeing 75_DUP_COMPILATION, with index 18, which has refcount 2
4	Freeing 75_DUP, with index 17, which has refcount 1
4	Deleting compilationcomm keyval on 75_DUP
4	Traceback (most recent call last):
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 159, in mpi4py.MPI.__pyx_fuse_1PyMPI_attr_delete_cb
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 121, in mpi4py.MPI.PyMPI_attr_delete
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 67, in mpi4py.MPI.PyMPI_attr_call
4	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pyop2/mpi.py", line 243, in delcomm_outer
4	    ocomm = icomm.Get_attr(outercomm_keyval)
4	            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4	  File "src/mpi4py/MPI.src/Comm.pyx", line 1781, in mpi4py.MPI.Comm.Get_attr
4	mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator
4	Freeing 20_DUP_COMPILATION, with index 16, which has refcount 2
4	Freeing 20_DUP, with index 15, which has refcount 1
4	Deleting compilationcomm keyval on 20_DUP
4	Traceback (most recent call last):
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 159, in mpi4py.MPI.__pyx_fuse_1PyMPI_attr_delete_cb
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 121, in mpi4py.MPI.PyMPI_attr_delete
4	  File "src/mpi4py/MPI.src/attrimpl.pxi", line 67, in mpi4py.MPI.PyMPI_attr_call
4	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pyop2/mpi.py", line 243, in delcomm_outer
4	    ocomm = icomm.Get_attr(outercomm_keyval)
4	            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4	  File "src/mpi4py/MPI.src/Comm.pyx", line 1781, in mpi4py.MPI.Comm.Get_attr
4	mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator
3	firedrake-repo/tests/firedrake/regression/test_netgen.py::test_firedrake_integral_sphere_high_order_netgen_parallel FAILED [ 15%]+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
3	~~~~~~~~~~~~~~~~~~~~ Stack of MainThread (138454370914432) ~~~~~~~~~~~~~~~~~~~~~
3	+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
3	~~~~~~~~~~~~~~~~~~~~ Stack of MainThread (132152508522624) ~~~~~~~~~~~~~~~~~~~~~
3	  File "<frozen runpy>", line 198, in _run_module_as_main
3	  File "<frozen runpy>", line 88, in _run_code
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pytest/__main__.py", line 9, in <module>
3	    raise SystemExit(pytest.console_main())
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 201, in console_main
3	    code = main()
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 175, in main
3	    ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
3	    return wrap_session(config, _main)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 283, in wrap_session
3	    session.exitstatus = doit(config, session) or 0
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 337, in _main
3	    config.hook.pytest_runtestloop(session=session)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 362, in pytest_runtestloop
3	    item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 113, in pytest_runtest_protocol
3	    runtestprotocol(item, nextitem=nextitem)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 132, in runtestprotocol
3	    reports.append(call_and_report(item, "call", log))
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 241, in call_and_report
3	    call = CallInfo.from_call(
+ set +x
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 341, in from_call
3	    result: TResult | None = func()
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 242, in <lambda>
3	    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", l  File "<frozen runpy>", line 198, in _run_module_as_main
3	  File "<frozen runpy>", line 88, in _run_code
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pytest/__main__.py", line 9, in <module>
3	    raise SystemExit(pytest.console_main())
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 201, in console_main
3	    code = main()
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/config/__init__.py", line 175, in main
3	    ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
3	    return wrap_session(config, _main)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 283, in wrap_session
3	    session.exitstatus = doit(config, session) or 0
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 337, in _main
3	    config.hook.pytest_runtestloop(session=session)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/main.py", line 362, in pytest_runtestloop
3	    item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 113, in pytest_runtest_protocol
3	    runtestprotocol(item, nextitem=nextitem)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 132, in runtestprotocol
3	    reports.append(call_and_report(item, "call", log))
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 241, in call_and_report
3	    call = CallInfo.from_call(
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 341, in from_call
3	    result: TResult | None = func()
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 242, in <lambda>
3	    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 174, in pytest_runtest_call
3	    item.runtest()
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/python.py", line 1627, in runtest
3	    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/python.py", line 159, in pytest_pyfunc_call
3	    result = testfunction(**testargs)
3	  File "/__w/firedrake/firedrake/firedrake-repo/tests/firedrake/regression/test_netgen.py", line 234, in test_firedrake_integral_sphere_high_order_netgen_parallel
3	    homsh = Mesh(msh.curve_field(2))
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/ngsPETSc/utils/firedrake/meshes.py", line 216, in curveField
3	    new_coordinates.dat.data[pyop2_index] = curved_space_points.reshape(-1, geom_dim)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pyop2/mpi.py", line 203, in wrapper
3	    comm.Barrier()
3	+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
3	ine 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/runner.py", line 174, in pytest_runtest_call
3	    item.runtest()
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/python.py", line 1627, in runtest
3	    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
3	    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
3	    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
3	    res = hook_impl.function(*args)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/_pytest/python.py", line 159, in pytest_pyfunc_call
3	    result = testfunction(**testargs)
3	  File "/__w/firedrake/firedrake/firedrake-repo/tests/firedrake/regression/test_netgen.py", line 234, in test_firedrake_integral_sphere_high_order_netgen_parallel
3	    homsh = Mesh(msh.curve_field(2))
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/ngsPETSc/utils/firedrake/meshes.py", line 216, in curveField
3	    new_coordinates.dat.data[pyop2_index] = curved_space_points.reshape(-1, geom_dim)
3	  File "/__w/firedrake/firedrake/venv/lib/python3.12/site-packages/pyop2/mpi.py", line 203, in wrapper
3	    comm.Barrier()
3	+++++++++++++++++++++++++++++++++++ Timeout ++++++++++++++++++++++++++++++++++++
3	--------------------------------------------------------------------------
3	Primary job  terminated normally, but 1 process returned
3	a non-zero exit code. Per user-direction, the job has been aborted.
3	--------------------------------------------------------------------------
3	--------------------------------------------------------------------------
3	mpiexec detected that one or more processes exited with non-zero status, thus causing
3	the job to be terminated. The first process to do so was:
3	
3	  Process name: [[48673,1],2]
3	  Exit code:    1
3	--------------------------------------------------------------------------
Job 1 passed
Job 2 passed
Job 3 failed, inspect the logs in pytest_nprocs3_job3.log
Job 4 passed
Cleaning up
Done
Error: Process completed with exit code 1.
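
For readers less familiar with collectives: the timeout stacks above end in comm.Barrier(), and a barrier only returns once every rank in the communicator has entered it. The sketch below imitates that behaviour without MPI, using threading.Barrier from the Python standard library as a stand-in. The function name run_ranks, the "failing rank" and the timeout value are all hypothetical and chosen for illustration only. Whether a missing rank is what actually stalls the nprocs=3 run is an assumption, not something the log confirms.

```python
import threading


def run_ranks(nranks, failing_rank=None, timeout=0.5):
    """Simulate `nranks` ranks meeting at a barrier.

    Hypothetical stand-in for comm.Barrier(): if `failing_rank` never
    reaches the barrier, the other ranks wait until `timeout` expires.
    """
    barrier = threading.Barrier(nranks)
    results = {}

    def rank_main(rank):
        if rank == failing_rank:
            # This rank leaves before the collective, e.g. after an error.
            results[rank] = "exited early"
            return
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "passed barrier"
        except threading.BrokenBarrierError:
            # Raised when the wait times out or another waiter timed out.
            results[rank] = "timed out in barrier"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_ranks(3))                  # every rank reaches the barrier
    print(run_ranks(3, failing_rank=2))  # remaining ranks time out
```

Unlike threading.Barrier, a real MPI barrier has no timeout of its own, so in CI the stall only ends when the pytest timeout fires, which would be consistent with the Timeout banners in the log.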

Additional Info
The above does not seem to be the only way to make CI hang.

We seem to have had the following error for some time:

4	    ocomm = icomm.Get_attr(outercomm_keyval)
4	            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4	  File "src/mpi4py/MPI.src/Comm.pyx", line 1781, in mpi4py.MPI.Comm.Get_attr
4	mpi4py.MPI.Exception: MPI_ERR_COMM: invalid communicator
