GCPCluster timeout on exit #220

Open
eric-czech opened this issue Dec 13, 2020 · 13 comments
Labels
bug (Something isn't working) · provider/gcp/vm (Cluster provider for GCP Instances)

Comments

@eric-czech (Contributor)

I received this error after creating a cluster, running jobs with it for ~5 hours, and then trying to exit the process.

As a result, I had to delete all of the cluster's instances myself.

# create cluster, let jobs run, and then ~5 hours later:
In [3]: exit
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fb69c986760>>, <Task finished name='Task-130' coro=<SpecCluster._correct_state_internal() done, defined at /home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py:320> exception=OSError('Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s')>)
Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 768, in live_comm
    comm = await connect(
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/asyncio/tasks.py", line 498, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 641, in close_clusters
    cluster.close(timeout=10)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 104, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 183, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 810, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/core.py", line 768, in live_comm
    comm = await connect(
  File "/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise IOError(
OSError: Timed out during handshake while connecting to tcp://10.142.0.3:8786 after 10 s

This is similar to #179, but the timeout here occurs when trying to communicate with the cluster, not when using the GCP API.

@jacobtomlinson (Member)

Did you close the client and cluster before exiting?

@jacobtomlinson (Member)

Also, sorry about the manual clean-up. I'm trying to address this in dask/distributed#4263, but it's a longer-term goal.

@eric-czech (Contributor, Author)

> Did you close the client and cluster before exiting?

I did not; I just sent a SIGQUIT to the process. FWIW, this has worked probably ~30 times over the last week or so and only failed once.

@jacobtomlinson (Member)

Did you create the cluster within a context manager?

@eric-czech (Contributor, Author)

I did not, though I'm not sure how to do that given how I'm using this. It would be easy to close the cluster before killing the REPL, though.

@jacobtomlinson (Member) commented Dec 15, 2020

Thanks. The thing I'm getting at is that GCPCluster never cleans up your GCP resources automatically on SIGQUIT. The only way to clean up is to explicitly call cluster.close() or to use the cluster within a context manager, as in the sketch below.
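
For reference, here is a minimal sketch of both cleanup patterns (the n_workers=1 argument simply mirrors the examples elsewhere in this thread):

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

# Pattern 1: explicit close in a finally block, so the instances are
# deleted even if the work in between raises.
cluster = GCPCluster(n_workers=1)
client = Client(cluster)
try:
    pass  # submit work here
finally:
    client.close()
    cluster.close()  # deletes the scheduler and worker VMs

# Pattern 2: context managers, which call close() automatically on exit.
with GCPCluster(n_workers=1) as cluster:
    with Client(cluster) as client:
        pass  # submit work here
# the GCP instances are cleaned up once the blocks exit

Either pattern avoids relying on interpreter shutdown hooks firing at all.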

@eric-czech (Contributor, Author)

Oh, interesting. I had just assumed there was some kind of shutdown hook, based on sessions like this:

> ipython
Python 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.19.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from dask_cloudprovider.gcp.instances import GCPCluster

In [2]: cluster = GCPCluster(n_workers=1)
Launching cluster with the following configuration:
  Source Image: projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014
  Docker Image: eczech/ukb-gwas-pipeline-nealelab:v2.30.0
  Machine Type: n1-standard-8
  Filesytsem Size: 50
  N-GPU Type:
  Zone: us-east1-c
Creating scheduler instance
dask-0fcd5080-scheduler
	Internal IP: 10.142.0.2
	External IP: None
Waiting for scheduler to run
Scheduler is running
/home/eczech/miniconda3/envs/cloudprovider/lib/python3.8/contextlib.py:120: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight!
  next(self.gen)
Creating worker instance
dask-0fcd5080-worker-11be452e
	Internal IP: 10.142.0.4
	External IP: None

In [3]: # Press ctrl^d
Do you really want to exit ([y]/n)? y
Closing Instance: dask-0fcd5080-worker-11be452e
Closing Instance: dask-0fcd5080-scheduler
> # back at shell

Maybe it's not SIGQUIT? In either case, the VMs are cleaned up.
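
One way to see why the signal matters: Python's atexit hooks run on normal interpreter exit (including EOF/Ctrl-D in a REPL), but a signal whose default action kills the process skips them. A small illustration, independent of dask:

import atexit
import os
import signal

atexit.register(lambda: print("cleanup ran"))

# A normal exit (e.g. Ctrl-D in a REPL, or falling off the end of a
# script) runs the hook and prints "cleanup ran".

# Uncommenting the line below instead terminates the process via the
# default SIGQUIT action (core dump), and the atexit hook never runs:
# os.kill(os.getpid(), signal.SIGQUIT)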

@jacobtomlinson (Member) commented Dec 15, 2020

After poking around a little, it appears there are shutdown hooks. I must confess I wasn't aware these had been upstreamed. We explicitly chose not to implement them for the newer dask_cloudprovider cluster managers, as they can be unreliable, and with expensive cloud resources the stakes can be rather high. This is one of my motivations for dask/distributed#4263.

From the original error you shared, it would appear that the cluster had already shut down and the hooks resulted in a timeout. I'm a little surprised the exceptions are not being suppressed by the shutdown hooks.
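
For what it's worth, here is a hedged sketch (an assumption for illustration, not distributed's actual implementation) of how such a shutdown hook could suppress these timeouts when the VMs are already gone:

import atexit
from distributed.deploy.spec import SpecCluster

def close_clusters_quietly():
    # Iterate over a copy of the live-cluster registry; the _instances
    # attribute and the exception handling here are assumptions.
    for cluster in list(SpecCluster._instances):
        try:
            cluster.close(timeout=10)
        except (OSError, TimeoutError):
            # The scheduler may already be unreachable (e.g. the VM was
            # deleted), so a handshake timeout at interpreter exit is
            # not actionable.
            pass

atexit.register(close_clusters_quietly)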

@jacobtomlinson added the bug and provider/gcp/vm labels on Dec 15, 2020
@jacobtomlinson (Member)

I am able to reproduce this with the following steps.

>>> from dask_cloudprovider.gcp.instances import GCPCluster
>>> cluster = GCPCluster(n_workers=1)

>>> # Manually delete instances in GCP control panel

>>> exit
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x11560a940>>, <Task finished name='Task-55' coro=<SpecCluster._correct_state_internal() done, defined at /Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py:320> exception=OSError('Timed out trying to connect to tcp://35.190.158.139:8786 after 10 s')>)
Traceback (most recent call last):
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 805, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 763, in live_comm
    comm = await connect(
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://35.190.158.139:8786 after 10 s
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/asyncio/tasks.py", line 490, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 641, in close_clusters
    cluster.close(timeout=10)
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/cluster.py", line 104, in close
    return self.sync(self._close, callback_timeout=timeout)
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/cluster.py", line 183, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/Users/jtomlinson/miniconda3/envs/dask-cloudprovider-test/lib/python3.8/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 401, in _close
    await self._correct_state()
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/deploy/spec.py", line 328, in _correct_state_internal
    await self.scheduler_comm.retire_workers(workers=list(to_close))
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 805, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/core.py", line 763, in live_comm
    comm = await connect(
  File "/Users/jtomlinson/Projects/dask/distributed/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://35.190.158.139:8786 after 10 s

@rpanai commented Jun 3, 2021

@jacobtomlinson I'm seeing a similar problem on Fargate. On top of that, I see the error UnboundLocalError: local variable 'status' referenced before assignment.
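
As background on that error class (this illustrates the general Python failure mode, not the dask code path; get_status is a hypothetical name):

def get_status(response):
    if "status" in response:
        status = response["status"]
    # If the branch above is skipped, `status` was never assigned, so
    # the next line raises:
    # UnboundLocalError: local variable 'status' referenced before assignment
    return status

get_status({})  # triggers the error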

@jacobtomlinson (Member)

@rpanai could you please raise a separate issue, as FargateCluster and GCPCluster are two separate cluster manager implementations?

@rpanai commented Jun 3, 2021

@jacobtomlinson No problem. Do you want me to delete the comments here?

@jacobtomlinson (Member)

No, you're fine.
