Description
I am trying to run a data analysis on 9,900 Parquet files with a total size of 100 GB.
The workers logged 70K garbage collection warnings like this one:
distributed.utils_perf - WARNING - full garbage collections took 60% CPU time recently (threshold: 10%)
Then my job was killed with the following error:
distributed.utils_perf - WARNING - full garbage collections took 60% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 60% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 62% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 61% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 59% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 61% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 56% CPU time recently (threshold: 10%)
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ab42ba925d0>>, <Task finished coro=<Worker.heartbeat() done, defined at /galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py:883> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 920, in heartbeat
raise e
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 893, in heartbeat
metrics=await self.get_metrics(),
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
operation=operation,
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
return await coro()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
response = await comm.read(deserializers=deserializers)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
convert_stream_closed_error(self, e)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b3465022590>>, <Task finished coro=<Worker.heartbeat() done, defined at /galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py:883> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 920, in heartbeat
raise e
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 893, in heartbeat
metrics=await self.get_metrics(),
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
operation=operation,
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
return await coro()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
response = await comm.read(deserializers=deserializers)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
convert_stream_closed_error(self, e)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2adcf6fabb50>>, <Task finished coro=<Worker.heartbeat() done, defined at /galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py:883> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 920, in heartbeat
raise e
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 893, in heartbeat
metrics=await self.get_metrics(),
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
operation=operation,
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
return await coro()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
response = await comm.read(deserializers=deserializers)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
convert_stream_closed_error(self, e)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ba64a584990>>, <Task finished coro=<Worker.heartbeat() done, defined at /galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py:883> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 920, in heartbeat
raise e
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py", line 893, in heartbeat
metrics=await self.get_metrics(),
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
operation=operation,
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
return await coro()
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
response = await comm.read(deserializers=deserializers)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
convert_stream_closed_error(self, e)
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ac978e74f90>>, <Task finished coro=<Worker.heartbeat() done, defined at /galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/worker.py:883> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
File "/galileo/home/userexternal/mseyedka/miniconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed