2023-09-14 14:24:39,905 - distributed.core - INFO - Starting established connection to tcp://...166.214:46011
slurmstepd-dlcgpu16: error: *** JOB 9277005 ON dlcgpu16 CANCELLED AT 2023-09-14T14:24:41 ***
2023-09-14 14:24:41,045 - distributed.worker - INFO - Stopping worker at tcp://...166.176:40901. Reason: scheduler-close
2023-09-14 14:24:41,046 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...5.166.214:46011>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 316, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 327, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...166.214:46011>: Stream is closed
2023-09-14 14:24:41,051 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://...166.176:33175'. Reason: scheduler-close
2023-09-14 14:24:41,053 - distributed.core - INFO - Received 'close-stream' from tcp://...166.214:46011; closing.
2023-09-14 14:24:41,053 - distributed.nanny - INFO - Worker closed
I had inserted the following code at the top of submit_trial() to avoid a timeout from the scheduler. This may be quite central, because SMAC3 apparently expects the scheduler to launch the compute nodes instantly:
import asyncio

and

try:
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(
        f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
        "Try increasing `worker_timeout`."
    )
    raise error
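For reference, here is a minimal standalone sketch of the same wait-before-submit pattern outside SMAC3, using a plain dask.distributed Client. The helper name submit_with_worker_wait, the LocalCluster stand-in (a SLURMCluster from dask_jobqueue in the real setup), and the timeout value are illustrative assumptions, not SMAC3's actual code:

# Sketch of waiting for at least one worker before submitting a task.
# Names and values here are assumptions for illustration only.
import asyncio
import logging

from dask.distributed import Client, LocalCluster  # SLURMCluster in the real SLURM setup

logger = logging.getLogger(__name__)


def submit_with_worker_wait(client: Client, func, *args, worker_timeout: float = 1200):
    """Block until at least one worker has joined the scheduler, then submit."""
    try:
        # wait_for_workers() raises a timeout error if no worker registers
        # within `timeout` seconds.
        client.wait_for_workers(n_workers=1, timeout=worker_timeout)
    except asyncio.exceptions.TimeoutError:
        logger.debug(
            f"No worker could be scheduled in time after {worker_timeout}s on the cluster. "
            "Try increasing `worker_timeout`."
        )
        raise
    return client.submit(func, *args)


if __name__ == "__main__":
    cluster = LocalCluster(n_workers=1)  # stand-in for the SLURM cluster
    client = Client(cluster)
    future = submit_with_worker_wait(client, lambda x: x * 2, 21)
    print(future.result())  # 42

The idea is simply that submission is deferred until the batch system has actually started a worker, instead of letting the scheduler give up while the SLURM job is still queued.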