
ConnectionRefusedError #614

Open
@mens-artis

Description

2023-09-14 14:24:39,905 - distributed.core - INFO - Starting established connection to tcp://...166.214:46011
slurmstepd-dlcgpu16: error: *** JOB 9277005 ON dlcgpu16 CANCELLED AT 2023-09-14T14:24:41 ***
2023-09-14 14:24:41,045 - distributed.worker - INFO - Stopping worker at tcp://...166.176:40901. Reason: scheduler-close
2023-09-14 14:24:41,046 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...5.166.214:46011>
Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 316, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/tornado/gen.py", line 767, in run
    value = future.result()
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 327, in write
    convert_stream_closed_error(self, e)
  File "/home/username/.python3.10.6/lib/python3.10/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://...166.176:58820 remote=tcp://...166.214:46011>: Stream is closed
2023-09-14 14:24:41,051 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://...166.176:33175'. Reason: scheduler-close
2023-09-14 14:24:41,053 - distributed.core - INFO - Received 'close-stream' from tcp://...166.214:46011; closing.
2023-09-14 14:24:41,053 - distributed.nanny - INFO - Worker closed

I had inserted the following code at the top of submit_trial() to avoid a timeout from the scheduler. This may be quite central, because SMAC3 apparently expects the scheduler to launch the compute nodes instantly:

import asyncio

and

try:
    # Block until at least one Dask worker has connected to the scheduler,
    # or give up after 1200 s.
    self._client.wait_for_workers(n_workers=1, timeout=1200)
except asyncio.exceptions.TimeoutError as error:
    logger.debug(f"No worker could be scheduled in time after {self._worker_timeout}s on the cluster. "
                 "Try increasing `worker_timeout`.")
    raise error
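
For reference, here is a minimal standalone sketch of the same wait-for-workers pattern against a plain dask.distributed Client, outside of SMAC3. The scheduler address, the 1200 s timeout, and the logger setup are assumptions for illustration only:

import asyncio
import logging

from distributed import Client

logger = logging.getLogger(__name__)

# Connect to an already-running Dask scheduler (placeholder address).
client = Client("tcp://scheduler-address:8786")

try:
    # Block until at least one worker has registered with the scheduler,
    # or give up after 1200 seconds.
    client.wait_for_workers(n_workers=1, timeout=1200)
except (asyncio.exceptions.TimeoutError, TimeoutError):
    # Depending on the distributed version, the timeout may surface as either
    # asyncio.TimeoutError or the builtin TimeoutError.
    logger.debug("No worker could be scheduled in time. Try increasing the timeout.")
    raise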
