Flaky test_upload_directory: assert self._state.pid is not None, "can only join a started process" #6973

Open

Description

@gjoseph92

Closely related to #6972 (though that won't fix it).

In this test, upload_directory tries to restart the worker. The restart fails for some reason, which then leads to another failure in WorkerProcess.kill. Somehow the AsyncProcess instance gets swapped out for a fresh one that has no PID?

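For illustration only, here is a minimal, self-contained sketch of the suspected sequence. The Fake* classes below are hypothetical stand-ins, not distributed's actual code; the only pieces taken from the traceback are the join() assertion (distributed/process.py:311) and the fact that WorkerProcess.kill ultimately awaits process.join(...).

import asyncio


class FakeAsyncProcess:
    """Hypothetical stand-in for distributed.process.AsyncProcess."""

    def __init__(self):
        self.pid = None  # a fresh instance has no PID until it is started

    async def start(self):
        self.pid = 12345  # pretend a subprocess was spawned

    async def join(self, timeout=None):
        # Mirrors the assertion from the traceback (distributed/process.py:311)
        assert self.pid is not None, "can only join a started process"


class FakeWorkerProcess:
    """Hypothetical stand-in for distributed.nanny.WorkerProcess."""

    def __init__(self):
        self.process = FakeAsyncProcess()

    async def kill(self):
        # kill() ends up awaiting join() on whatever self.process currently is
        await self.process.join(timeout=1)


async def main():
    wp = FakeWorkerProcess()
    await wp.process.start()

    # Suspected race: the restart path swaps in a fresh, never-started
    # AsyncProcess before kill() runs...
    wp.process = FakeAsyncProcess()

    try:
        await wp.kill()
    except AssertionError as e:
        # ...so join() fails exactly like in the CI log below.
        print(f"AssertionError: {e}")


asyncio.run(main())

If that ordering is what happens inside the Nanny, it would explain how kill() can hit "can only join a started process" even though a worker process was clearly running earlier in the test.
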
____________________________ test_upload_directory _____________________________

c = <Client: 'tcp://127.0.0.1:55556' processes=0 threads=0, memory=0 B>
s = <Scheduler 'tcp://127.0.0.1:55556', workers: 0, cores: 0, tasks: 0>
a = <Nanny: None, threads: 1>, b = <Nanny: None, threads: 2>
tmp_path = PosixPath('/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/test_upload_directory0')

    @gen_cluster(client=True, Worker=Nanny)
    async def test_upload_directory(c, s, a, b, tmp_path):
        from dask.distributed import UploadDirectory
    
        # Be sure to exclude code coverage reports
        files_start = {f for f in os.listdir() if not f.startswith(".coverage")}
    
        with open(tmp_path / "foo.py", "w") as f:
            f.write("x = 123")
        with open(tmp_path / "bar.py", "w") as f:
            f.write("from foo import x")
    
        plugin = UploadDirectory(tmp_path, restart=True, update_path=True)
        await c.register_worker_plugin(plugin)
    
        [name] = a.plugins
        assert os.path.split(tmp_path)[-1] in name
    
        def f():
            import bar
    
            return bar.x
    
        results = await c.run(f)
>       assert results[a.worker_address] == 123
E       KeyError: 'tcp://127.0.0.1:55563'

distributed/tests/test_client.py:7046: KeyError
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_upload_directory.yaml
----------------------------- Captured stderr call -----------------------------
2022-08-29 22:49:40,038 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:55563
2022-08-29 22:49:40,039 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:55563
2022-08-29 22:49:40,039 - distributed.worker - INFO -           Worker name:                          0
2022-08-29 22:49:40,039 - distributed.worker - INFO -          dashboard at:            127.0.0.1:55565
2022-08-29 22:49:40,039 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,039 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,040 - distributed.worker - INFO -               Threads:                          1
2022-08-29 22:49:40,040 - distributed.worker - INFO -                Memory:                  14.00 GiB
2022-08-29 22:49:40,040 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/dask-worker-space/worker-7tby39oh
2022-08-29 22:49:40,040 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,048 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:55564
2022-08-29 22:49:40,049 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:55564
2022-08-29 22:49:40,049 - distributed.worker - INFO -           Worker name:                          1
2022-08-29 22:49:40,049 - distributed.worker - INFO -          dashboard at:            127.0.0.1:55567
2022-08-29 22:49:40,049 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,049 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,049 - distributed.worker - INFO -               Threads:                          2
2022-08-29 22:49:40,049 - distributed.worker - INFO -                Memory:                  14.00 GiB
2022-08-29 22:49:40,050 - distributed.worker - INFO -       Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/dask-worker-space/worker-xvsbrqvh
2022-08-29 22:49:40,050 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,937 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,938 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,939 - distributed.core - INFO - Starting established connection
2022-08-29 22:49:40,944 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:55556
2022-08-29 22:49:40,945 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,947 - distributed.core - INFO - Starting established connection
2022-08-29 22:49:41,020 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:55563
2022-08-29 22:49:41,021 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:55564
2022-08-29 22:49:41,024 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-682279b9-b660-4e16-9224-900996085633 Address tcp://127.0.0.1:55564 Status: Status.closing
2022-08-29 22:49:41,024 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-afb76a22-adab-416a-9920-40dd76207d8d Address tcp://127.0.0.1:55563 Status: Status.closing
2022-08-29 22:49:42,301 - distributed.nanny - WARNING - Worker process still alive after 1.2799989318847658 seconds, killing
2022-08-29 22:49:44,265 - distributed.nanny - WARNING - Worker process still alive after 1.279999237060547 seconds, killing
2022-08-29 22:49:44,272 - distributed.nanny - ERROR - Restart timed out after 30s; returning before finished
2022-08-29 22:49:44,273 - distributed.nanny - ERROR - Restart timed out after 30s; returning before finished
2022-08-29 22:49:44,365 - distributed.nanny - WARNING - Restarting worker
2022-08-29 22:49:44,368 - distributed.nanny - WARNING - Restarting worker
2022-08-29 22:49:44,453 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 796, in kill
    await process.join(wait_timeout)
  File "/Users/runner/work/distributed/distributed/distributed/process.py", line 311, in join
    assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process
2022-08-29 22:49:44,455 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 595, in close
    await self.kill(timeout=timeout)
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 386, in kill
    await self.process.kill(timeout=0.8 * (deadline - time()))
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 796, in kill
    await process.join(wait_timeout)
  File "/Users/runner/work/distributed/distributed/distributed/process.py", line 311, in join
    assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process

Labels: flaky test (Intermittent failures on CI)