GCPCluster timeout on exit #220
Did you close the client and cluster before exiting?

Also sorry about the manual clean up. Trying to address this in dask/distributed#4263, but it's a longer-term goal.

I did not, I just sent a SIGQUIT to the process. FWIW this worked ~30 times in the last week or so and only failed once.

Did you create the cluster within a context manager?

I did not, though I'm not sure how to do that given how I'm trying to use this. It would be easy to close the cluster before killing the REPL, though.

Thanks. The thing I'm getting at is
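For reference, the context-manager pattern suggested above guarantees cleanup even if the session body raises. This is a minimal sketch using a stand-in class (a real `GCPCluster` needs GCP credentials); in practice it would be `with GCPCluster(n_workers=1) as cluster:`.

```python
# Sketch of the context-manager pattern. DummyCluster stands in for
# GCPCluster, which needs real GCP credentials to instantiate.

class DummyCluster:
    def __init__(self, n_workers=1):
        self.n_workers = n_workers
        self.closed = False

    def close(self):
        # In Dask this tears down the worker and scheduler VMs.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the with-body raises, so VMs always get cleaned up.
        self.close()

with DummyCluster(n_workers=1) as cluster:
    print("inside with block, closed =", cluster.closed)

print("after with block, closed =", cluster.closed)
```

Because `__exit__` always runs, exiting the REPL after the `with` block leaves nothing for a shutdown hook to do.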
Oh interesting, I just assumed there was some kind of shutdown hook, based on sessions like this:

> ipython
Python 3.8.6 | packaged by conda-forge | (default, Oct 7 2020, 19:08:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from dask_cloudprovider.gcp.instances import GCPCluster
In [2]: cluster = GCPCluster(n_workers=1)
Launching cluster with the following configuration:
Source Image: projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014
Docker Image: eczech/ukb-gwas-pipeline-nealelab:v2.30.0
Machine Type: n1-standard-8
Filesytsem Size: 50
N-GPU Type:
Zone: us-east1-c
Creating scheduler instance
dask-0fcd5080-scheduler
Internal IP: 10.142.0.2
External IP: None
Waiting for scheduler to run
Scheduler is running
/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/contextlib.py:120: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight!
next(self.gen)
Creating worker instance
dask-0fcd5080-worker-11be452e
Internal IP: 10.142.0.4
External IP: None
In [3]: # Press ctrl^d
Do you really want to exit ([y]/n)? y
Closing Instance: dask-0fcd5080-worker-11be452e
Closing Instance: dask-0fcd5080-scheduler
> # back at shell

Maybe it's not SIGQUIT? In either case the VMs are cleaned up.
After poking around a little it appears there are shutdown hooks. I must confess I wasn't aware these had been upstreamed. We chose to explicitly not implement them for the newer cluster managers.

From the original error you shared, it would appear that the cluster had already shut down and the hooks resulted in a timeout. I'm a little surprised the exceptions are not being suppressed by the shutdown hooks.
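The shutdown hooks discussed here are typically built on Python's `atexit` module, which runs registered callbacks when the interpreter exits normally. A minimal sketch, observing the hook fire in a subprocess (the `cleanup` function and its message are illustrative, not the actual Dask hook):

```python
import subprocess
import sys

# Run a tiny script in a subprocess so we can observe its atexit hook
# firing after the main program finishes.
script = """
import atexit

def cleanup():
    # Called automatically when the interpreter exits normally.
    print("closing cluster resources")

atexit.register(cleanup)
print("work done")
"""

out = subprocess.run(
    [sys.executable, "-c", script],
    capture_output=True, text=True,
).stdout
print(out)
```

The hook runs after the script body, which is why a cluster already deleted out-of-band makes the hook's `close()` call time out at exit.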
I am able to reproduce this with the following steps:

>>> from dask_cloudprovider.gcp.instances import GCPCluster
>>> cluster = GCPCluster(n_workers=1)
>>> # Manually delete instances in GCP control panel
>>> exit
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x11560a940>>, <Task finished name='Task-55' coro=<SpecCluster._correct_state_internal() done, defined at /Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py:320> exception=OSError('Timed out trying to connect to tcp://35.190.158.139:8786 after 10 s')>)
Traceback (most recent call last):
File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 286, in connect
comm = await asyncio.wait_for(
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 401, in _close
await self._correct_state()
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 328, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 805, in send_recv_from_rpc
comm = await self.live_comm()
File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 763, in live_comm
comm = await connect(
File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 308, in connect
raise IOError(
OSError: Timed out trying to connect to tcp://35.190.158.139:8786 after 10 s
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 286, in connect
comm = await asyncio.wait_for(
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 641, in close_clusters
cluster.close(timeout=10)
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/cluster.py", line 104, in close
return self.sync(self._close, callback_timeout=timeout)
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/cluster.py", line 183, in sync
return sync(self.loop, func, *args, **kwargs)
File "/Users/jtomlinson/Projects/dask/distributed/distributed/utils.py", line 340, in sync
raise exc.with_traceback(tb)
File "/Users/jtomlinson/Projects/dask/distributed/distributed/utils.py", line 324, in f
result[0] = yield future
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/gen.py", line 735, in run
value = future.result()
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 401, in _close
await self._correct_state()
File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 328, in _correct_state_internal
await self.scheduler_comm.retire_workers(workers=list(to_close))
File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 805, in send_recv_from_rpc
comm = await self.live_comm()
File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 763, in live_comm
comm = await connect(
File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 308, in connect
raise IOError(
OSError: Timed out trying to connect to tcp://35.190.158.139:8786 after 10 s
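Until the shutdown hooks suppress these exceptions themselves, a user-side workaround is to close the cluster explicitly before exiting and tolerate an unreachable scheduler. A hedged sketch (`DummyCluster` simulates a cluster whose scheduler VM was already deleted; `safe_close` is a hypothetical helper, not part of the Dask API):

```python
# Sketch: tolerate a close() that times out because the scheduler VM
# is already gone. DummyCluster simulates that failure mode.

class DummyCluster:
    def close(self, timeout=10):
        # Mimics the OSError seen in the traceback above.
        raise OSError("Timed out trying to connect to scheduler")

def safe_close(cluster):
    try:
        cluster.close(timeout=10)
        return True
    except (OSError, TimeoutError):
        # Scheduler unreachable: nothing left to shut down cleanly.
        return False

closed_cleanly = safe_close(DummyCluster())
print("closed cleanly:", closed_cleanly)
```

Calling something like this before exiting the REPL avoids the noisy traceback from `atexit`, though orphaned VMs still need manual deletion.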
@jacobtomlinson I'm experiencing a similar problem on Fargate.

@rpanai could you please raise a separate issue, as this
@jacobtomlinson No problem. Do you want me to delete the comments here?

No, you're fine.

Original issue description:

I received this error after creating a cluster, running jobs with it for ~5 hours, and then trying to exit that process. This resulted in my having to delete all the cluster's instances myself. This is similar to #179, but the timeout here occurs when trying to communicate with the cluster, not when using the GCP API.