Open
Description
The following test deadlocks for all schedulers:
import pytest_parallel
from time import sleep

@pytest_parallel.mark.parallel(2)
def test_raise_then_coll(comm):
    if comm.rank == 0:
        sleep(1)
        # rank 0 fails before reaching the collective
        raise RuntimeError('my exception message')
    # rank 1 blocks here, waiting for rank 0
    comm.allreduce(42)

What happens:
Rank 0
- during test execution, raises the exception and never calls allreduce
- the exception is caught by pytest and added to the report
- pytest waits for rank 1 to send its report
Rank 1
- no exception raised
- the test executes allreduce and waits for rank 0
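The sequence above can be mimicked without MPI, which may help when reasoning about fixes. The sketch below is only an analogy: two threads stand in for the two ranks and a `threading.Barrier` stands in for `allreduce`. The names `run_rank` and `simulate` are made up for this illustration, and the barrier timeout exists only so the simulation terminates instead of hanging.

```python
import threading


def run_rank(rank, barrier, results, timeout):
    """Hypothetical stand-in for one rank running the test body."""
    try:
        if rank == 0:
            # Rank 0 fails before reaching the collective.
            raise RuntimeError('my exception message')
        # Stand-in for comm.allreduce: needs both ranks to arrive.
        barrier.wait(timeout=timeout)
        results[rank] = 'passed'
    except threading.BrokenBarrierError:
        # Must be caught first: BrokenBarrierError subclasses RuntimeError.
        results[rank] = 'stuck in collective'
    except RuntimeError as exc:
        # What pytest does on rank 0: catch the error and record it.
        results[rank] = f'failed: {exc}'


def simulate(timeout=0.2):
    """Run both fake ranks and return what each one ended up reporting."""
    barrier = threading.Barrier(2)
    results = {}
    threads = [
        threading.Thread(target=run_rank, args=(rank, barrier, results, timeout))
        for rank in range(2)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With a real MPI collective there is no such timeout, so rank 1 never leaves the call and its report never arrives.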
How can we solve, or at least improve, the problem:
- We can crash pytest if a test encounters an exception (through pytest_exception_interact maybe?). Not ideal because not all tests will be run.
- We can use a timeout parameter when we wait for test reports (suggested by @cbenazet).
  - But then we need a mechanism to signal the other ranks that they should cancel their current test
  - We still need to handle the case where the proc that does the report gathering is the one stuck in the allreduce
  - Note that the rank which raised the exception is not stuck in the test and will send its report (but there is no guarantee it will be received)
- Maybe we can hook mpi4py blocking functions to return an error if they time out
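As a rough illustration of the timeout idea, here is a minimal sketch of waiting for a report with a deadline. It assumes the report is received through a non-blocking request whose `test()` method returns a `(done, payload)` tuple, which is how the object returned by mpi4py's `comm.irecv` behaves. The helper name `wait_for_report` and its parameters are hypothetical, not part of pytest_parallel.

```python
import time


def wait_for_report(request, timeout, poll_interval=0.01):
    """Poll a non-blocking receive until it completes or the timeout expires.

    `request` is assumed to expose `test() -> (done, payload)`, like the
    request returned by mpi4py's `comm.irecv(...)`.
    Returns `(True, payload)` on success and `(False, None)` on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        done, payload = request.test()
        if done:
            return True, payload
        if time.monotonic() >= deadline:
            # The sender is probably stuck in a collective (or dead).
            return False, None
        time.sleep(poll_interval)
```

On the gathering rank this could be used as `ok, report = wait_for_report(comm.irecv(source=1), timeout=30)`. It only covers the waiting side: it does not tell the stuck rank to cancel its test, and it does not help if the gathering rank itself is the one blocked in the allreduce.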