Issue with pytest-xdist Handling Out of Memory Errors (IndexError) #1155

Open
@loveleenamar9

Hi,
I am using pytest-xdist to run a test suite that includes subgraph tests. Sporadically, loading a large model causes a worker process to be terminated for running out of memory (OOM), and the run then fails with an IndexError. pytest-xdist handles other worker crashes gracefully, but it appears to struggle with crashes caused by OOM. The worker crash itself is expected; the problem is that the crashed worker is not replaced properly in this case, which leads to the IndexError.

Below is an example of the error log:

2024-10-27T21:28:18Z  tensorflow	[gw13] [ 70%] FAILED layerwise/Mistral7b/test_model_layers_0.py::test_model_layers_0 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	replacing crashed worker gw13
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> def worker_internal_error(
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         self, node: WorkerController, formatted_error: str
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     ) -> None:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         pytest_internalerror() was called on the worker.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         pytest_internalerror() arguments are an excinfo and an excrepr, which can't
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         be serialized, so we go with a poor man's solution of raising an exception
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         here ourselves using the formatted message.
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         """
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         self._active_nodes.remove(node)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>         try:
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> >           assert False, formatted_error
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E           AssertionError: Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 271, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 session.exitstatus = doit(config, session) or 0
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/_pytest/main.py", line 325, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 config.hook.pytest_runtestloop(session=session)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 182, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 return outcome.get_result()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_result.py", line 100, in get_result
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 raise exc.with_traceback(exc.__traceback__)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 res = hook_impl.function(*args)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 174, in pytest_runtestloop
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 self.run_one_test()
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E               File "/root/.local/lib/python3.10/site-packages/xdist/remote.py", line 185, in run_one_test
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E                 item = items[self.item_index]
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E             IndexError: list index out of range
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> E           assert False
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> 
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> /root/.local/lib/python3.10/site-packages/xdist/dsession.py:232: AssertionError
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> Traceback (most recent call last):
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 273, in wrap_session
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/main.py", line 327, in _main
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_hooks.py", line 513, in __call__
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
[2024-10-27T21:28:26.324Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_manager.py", line 120, in _hookexec
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 139, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     raise exception.with_traceback(exception.__traceback__)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 122, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     teardown.throw(exception)  # type: ignore[union-attr]
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/_pytest/logging.py", line 796, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/usr/local/lib/python3.10/dist-packages/pluggy/_callers.py", line 103, in _multicall
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     res = hook_impl.function(*args)
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 138, in pytest_runtestloop
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     self.loop_once()
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>   File "/root/.local/lib/python3.10/site-packages/xdist/dsession.py", line 152, in loop_once
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR>     raise RuntimeError("Unexpectedly no active workers available")
[2024-10-27T21:28:26.325Z] 
2024-10-27T21:28:18Z  tensorflow	INTERNALERROR> RuntimeError: Unexpectedly no active workers available

The issue can be reproduced by creating a dummy test that allocates a large amount of memory:

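def test_oom():
    # Keep every allocation referenced so that memory use only grows.
    large_memory_allocation = []
    for _ in range(175):
        # Each list holds 1024**3 // 4 references, roughly 2 GiB on 64-bit CPython.
        large_memory_allocation.append([0] * (1024**3 // 4))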
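
The same crash path can probably be exercised without exhausting the host's memory by simulating the kill. The sketch below assumes a POSIX system and assumes the OOM kill arrives as SIGKILL, which is what the Linux OOM killer sends. The helper names are hypothetical, and this is not confirmed to reproduce the IndexError, since a simulated kill may not match the timing of a real OOM.

```python
import os
import signal
import subprocess
import sys


def kill_current_process() -> None:
    """Terminate this process with SIGKILL, as the Linux OOM killer would."""
    os.kill(os.getpid(), signal.SIGKILL)


def exit_code_of_killed_child() -> int:
    """Run the same kill in a child interpreter and return its exit code.

    On POSIX, subprocess reports a child killed by signal N as -N.
    """
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    completed = subprocess.run([sys.executable, "-c", child_code])
    return completed.returncode


# Hypothetical use inside a test module run with `pytest -n 2`:
#
#     def test_simulated_oom():
#         kill_current_process()
```

Calling `exit_code_of_killed_child()` shows what a parent process observes when a child dies this way, which is useful for checking the assumption before wiring the kill into a real test.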

I suspect that the worker and the master process do not synchronize correctly after the crash, which leaves the communication between them incomplete.

Note: This issue is observed only with a large test suite.

Could you please help identify what is causing this IndexError and how to resolve it, so that pytest-xdist can handle OOM errors gracefully?
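
As a possible stopgap, and not a fix for the IndexError itself, each worker's address space could be capped so that an oversized allocation raises MemoryError inside the test instead of the kernel killing the worker. The sketch below assumes Linux, where RLIMIT_AS is enforced. The function names and the byte values are hypothetical, and frameworks that reserve large virtual address ranges may need a generous cap.

```python
import os
import resource


def cap_address_space(max_bytes: int) -> None:
    """Lower the soft RLIMIT_AS of the current process to at most max_bytes."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        # The soft limit may not exceed the existing hard limit.
        max_bytes = min(max_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))


def allocation_raises_memory_error(cap_bytes: int, alloc_bytes: int) -> bool:
    """Apply the cap in a forked child and report whether a large allocation fails.

    Forking keeps the limit from affecting the calling process.
    """
    pid = os.fork()
    if pid == 0:
        code = 1  # allocation unexpectedly succeeded
        try:
            cap_address_space(cap_bytes)
            bytearray(alloc_bytes)
        except MemoryError:
            code = 0  # allocation failed cleanly, as intended
        except BaseException:
            code = 2  # some other problem, e.g. the limit could not be set
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

One way to apply it would be to call `cap_address_space` from a session-scoped autouse fixture in `conftest.py`, so that it runs once in each worker process; that placement is an assumption, not something taken from the pytest-xdist documentation.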

Thanks!
Loveleen.
