Handle deadlocks #15

@BerengerBerthoul

The following test deadlocks for all schedulers:

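import pytest_parallel

from time import sleep

@pytest_parallel.mark.parallel(2)
def test_raise_then_coll(comm):
  if comm.rank==0:
    sleep(1)
    # rank 0 fails here and never reaches the collective below
    raise RuntimeError('my exception message')
  # rank 1 blocks here, waiting for rank 0 to join the collective
  comm.allreduce(42)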

What happens:
Rank 0

  • during test execution, raises the exception and never calls allreduce
  • the exception is caught by pytest, and added to the report
  • pytest waits for rank 1 to send its report

Rank 1

  • no exception raised
  • the test executes allreduce and waits for rank 0, which never joins the collective
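
To make the wait cycle concrete, here is a small model of it that needs no MPI. This is only a sketch: two threads stand in for the two ranks, and a `threading.Barrier` stands in for `allreduce` (both are my assumptions for illustration, not how pytest_parallel works). The barrier is given a timeout only so that the demo terminates. The real collective has none, which is why rank 1 hangs.

```python
import threading

def simulate(timeout=0.5):
    """Model the two ranks with threads; a Barrier stands in for allreduce."""
    collective = threading.Barrier(2)  # both "ranks" must arrive to proceed
    outcome = {}

    def rank0():
        try:
            # Fails before the collective, like rank 0 in the test.
            raise RuntimeError('my exception message')
        except RuntimeError as exc:
            # Plays the role of pytest catching the exception for the report.
            outcome[0] = f'failed: {exc}'

    def rank1():
        try:
            # Without the timeout, this wait would block forever.
            collective.wait(timeout=timeout)
            outcome[1] = 'passed'
        except threading.BrokenBarrierError:
            outcome[1] = 'stuck in collective'

    threads = [threading.Thread(target=f) for f in (rank0, rank1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

Rank 0 finishes with a failure report, while rank 1 only gets out of the collective because the stand-in barrier times out.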

How can we solve, or at least mitigate, the problem:

  • We can crash pytest if a test encounters an exception (through pytest_exception_interact maybe?). Not ideal because not all tests will be run.
  • We can use a timeout parameter when we wait for test reports (suggested by @cbenazet).
    • But then we need a mechanism to signal the other ranks that they should cancel their current test
    • We still need to handle the case where the proc that does the report gathering is the one stuck in the allreduce
    • Note that the rank which raised the exception is not stuck in the test and will send its report (but there is no guarantee that it will be received)
  • Maybe we can hook mpi4py blocking functions to return an error if they timeout
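
As a rough sketch of the timeout idea, the report gathering could poll for each rank's report instead of blocking on it. Everything named here (`ReportTimeout`, `wait_with_timeout`, `gather_reports`) is hypothetical and not part of pytest_parallel. The `poll` callables are assumed to return a `(done, value)` pair without blocking. With mpi4py, the `test` method of a request returned by `comm.irecv(...)` should have that shape, but I have not tried it here.

```python
import time

class ReportTimeout(Exception):
    """Raised when a report does not arrive in time (hypothetical name)."""

def wait_with_timeout(poll, timeout, interval=0.01):
    """Call poll() until it reports completion or `timeout` seconds elapse.

    `poll` must return a (done, value) pair and must not block.
    """
    deadline = time.monotonic() + timeout
    while True:
        done, value = poll()
        if done:
            return value
        if time.monotonic() >= deadline:
            raise ReportTimeout(f'no report after {timeout} s')
        time.sleep(interval)

def gather_reports(polls, timeout):
    """Collect one report per rank; flag ranks that time out instead of hanging."""
    reports = {}
    for rank, poll in polls.items():
        try:
            reports[rank] = wait_with_timeout(poll, timeout)
        except ReportTimeout:
            reports[rank] = 'timeout'
    return reports
```

For example, `gather_reports({0: lambda: (True, 'failed'), 1: lambda: (False, None)}, timeout=0.05)` returns a report for rank 0 and marks rank 1 as timed out. This only covers the waiting side: it does not tell the stuck rank to cancel its test, and it does not help if the gathering rank is itself the one blocked in the allreduce. The timeouts also add up across ranks because they are waited on one after another.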
