Very related to #6972 (though that won't fix it).

In this test, `UploadDirectory` tries to restart the worker. The restart fails for some reason, which then leads to another failure in `WorkerProcess.kill`. Somehow the `AsyncProcess` instance gets swapped out for a fresh one that has no PID?
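For context, the `AssertionError` in the traceback below is the generic guard in `AsyncProcess.join`. Here is a minimal sketch (hypothetical, not taken from the test) that trips the same guard by joining a process that was never started, which is what a fresh instance with no PID would look like; it assumes that constructing an `AsyncProcess` without calling `start()` leaves the PID unset, as the assert in `distributed/process.py:311` implies:

```python
import asyncio

from distributed.process import AsyncProcess


async def main():
    # Constructed but never started, mimicking the suspected state where
    # the Nanny ends up holding a fresh AsyncProcess with no PID.
    proc = AsyncProcess(target=lambda: None)
    assert proc.pid is None
    # Raises AssertionError: can only join a started process
    await proc.join(timeout=1)


asyncio.run(main())
```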
```
____________________________ test_upload_directory _____________________________

c = <Client: 'tcp://127.0.0.1:55556' processes=0 threads=0, memory=0 B>
s = <Scheduler 'tcp://127.0.0.1:55556', workers: 0, cores: 0, tasks: 0>
a = <Nanny: None, threads: 1>, b = <Nanny: None, threads: 2>
tmp_path = PosixPath('/private/var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/pytest-of-runner/pytest-0/test_upload_directory0')

    @gen_cluster(client=True, Worker=Nanny)
    async def test_upload_directory(c, s, a, b, tmp_path):
        from dask.distributed import UploadDirectory

        # Be sure to exclude code coverage reports
        files_start = {f for f in os.listdir() if not f.startswith(".coverage")}

        with open(tmp_path / "foo.py", "w") as f:
            f.write("x = 123")
        with open(tmp_path / "bar.py", "w") as f:
            f.write("from foo import x")

        plugin = UploadDirectory(tmp_path, restart=True, update_path=True)
        await c.register_worker_plugin(plugin)

        [name] = a.plugins
        assert os.path.split(tmp_path)[-1] in name

        def f():
            import bar

            return bar.x

        results = await c.run(f)
>       assert results[a.worker_address] == 123
E       KeyError: 'tcp://127.0.0.1:55563'

distributed/tests/test_client.py:7046: KeyError
----------------------------- Captured stdout call -----------------------------
Dumped cluster state to test_cluster_dump/test_upload_directory.yaml
----------------------------- Captured stderr call -----------------------------
2022-08-29 22:49:40,038 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:55563
2022-08-29 22:49:40,039 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:55563
2022-08-29 22:49:40,039 - distributed.worker - INFO - Worker name: 0
2022-08-29 22:49:40,039 - distributed.worker - INFO - dashboard at: 127.0.0.1:55565
2022-08-29 22:49:40,039 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:55556
2022-08-29 22:49:40,039 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,040 - distributed.worker - INFO - Threads: 1
2022-08-29 22:49:40,040 - distributed.worker - INFO - Memory: 14.00 GiB
2022-08-29 22:49:40,040 - distributed.worker - INFO - Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/dask-worker-space/worker-7tby39oh
2022-08-29 22:49:40,040 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,048 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:55564
2022-08-29 22:49:40,049 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:55564
2022-08-29 22:49:40,049 - distributed.worker - INFO - Worker name: 1
2022-08-29 22:49:40,049 - distributed.worker - INFO - dashboard at: 127.0.0.1:55567
2022-08-29 22:49:40,049 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:55556
2022-08-29 22:49:40,049 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,049 - distributed.worker - INFO - Threads: 2
2022-08-29 22:49:40,049 - distributed.worker - INFO - Memory: 14.00 GiB
2022-08-29 22:49:40,050 - distributed.worker - INFO - Local Directory: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/dask-worker-space/worker-xvsbrqvh
2022-08-29 22:49:40,050 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,937 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:55556
2022-08-29 22:49:40,938 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,939 - distributed.core - INFO - Starting established connection
2022-08-29 22:49:40,944 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:55556
2022-08-29 22:49:40,945 - distributed.worker - INFO - -------------------------------------------------
2022-08-29 22:49:40,947 - distributed.core - INFO - Starting established connection
2022-08-29 22:49:41,020 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:55563
2022-08-29 22:49:41,021 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:55564
2022-08-29 22:49:41,024 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-682279b9-b660-4e16-9224-900996085633 Address tcp://127.0.0.1:55564 Status: Status.closing
2022-08-29 22:49:41,024 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-afb76a22-adab-416a-9920-40dd76207d8d Address tcp://127.0.0.1:55563 Status: Status.closing
2022-08-29 22:49:42,301 - distributed.nanny - WARNING - Worker process still alive after 1.2799989318847658 seconds, killing
2022-08-29 22:49:44,265 - distributed.nanny - WARNING - Worker process still alive after 1.279999237060547 seconds, killing
2022-08-29 22:49:44,272 - distributed.nanny - ERROR - Restart timed out after 30s; returning before finished
2022-08-29 22:49:44,273 - distributed.nanny - ERROR - Restart timed out after 30s; returning before finished
2022-08-29 22:49:44,365 - distributed.nanny - WARNING - Restarting worker
2022-08-29 22:49:44,368 - distributed.nanny - WARNING - Restarting worker
2022-08-29 22:49:44,453 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 595, in close
await self.kill(timeout=timeout)
File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 386, in kill
await self.process.kill(timeout=0.8 * (deadline - time()))
File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 796, in kill
await process.join(wait_timeout)
File "/Users/runner/work/distributed/distributed/distributed/process.py", line 311, in join
assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process
2022-08-29 22:49:44,455 - distributed.nanny - ERROR - Error in Nanny killing Worker subprocess
Traceback (most recent call last):
File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 595, in close
await self.kill(timeout=timeout)
File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 386, in kill
await self.process.kill(timeout=0.8 * (deadline - time()))
File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 796, in kill
await process.join(wait_timeout)
File "/Users/runner/work/distributed/distributed/distributed/process.py", line 311, in join
assert self._state.pid is not None, "can only join a started process"
AssertionError: can only join a started process
```
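Not a patch proposal, just to make the hypothesis concrete: a sketch of the kind of guard the kill path could use around the failing `join()`. The `process` and `wait_timeout` names follow the traceback above; the guard itself is a hypothetical assumption, not existing distributed code:

```python
async def join_if_started(process, wait_timeout: float) -> None:
    # AsyncProcess.join() asserts a PID exists ("can only join a started
    # process"). If the Nanny's AsyncProcess was swapped for a fresh,
    # never-started instance, as suspected above, skip the join rather
    # than tripping the assertion while the Nanny is closing.
    if process.pid is None:
        return
    await process.join(wait_timeout)
```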