Cannot spin up ECS GPU worker with current versions #384

Open
cdc97 opened this issue Oct 13, 2022 · 3 comments
Labels: bug (Something isn't working), provider/aws/ecs (Cluster provider for AWS ECS)

Comments


cdc97 commented Oct 13, 2022

Describe the issue:
My GPU worker cannot start with the command

"command": [
                            "dask-cuda-worker" if self._worker_gpu else "dask-worker",
                            "--nthreads",
                            "{}".format(
                                max(int(self._worker_cpu / 1024), 1)
                                if self._worker_nthreads is None
                                else self._worker_nthreads
                            ),
                            "--memory-limit",
                            "{}MB".format(int(self._worker_mem)),
                            "--death-timeout",
                            "60",
                        ]

that gets passed in from ecs.py. Dask-cuda seems to have removed the --death-timeout option, so on worker startup I see:

Usage: dask-cuda-worker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Try 'dask-cuda-worker --help' for help.

Error: Got unexpected extra argument: (60)

I'm unfortunately running this from Prefect, so I can't pin dask-cuda and distributed to versions old enough to still have this argument. When I try pinning the scheduler/worker container to an older version, the more recent distributed on the Prefect agent container doesn't play nicely with the scheduler/worker, and the agent reports:

2022-10-13 21:25:50,708 - distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/distributed/protocol/core.py", line 158, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 205, in msgpack._cmsgpack.unpackb
ValueError: Unpack failed: incomplete input
21:22:17.913 | INFO | prefect.task_runner.dask - Creating a new Dask cluster with `__prefect_loader__.<lambda>`

Minimal Complete Verifiable Example:
Run a Docker image with the following, and you should see the error.

RUN pip install prefect distributed dask-cuda dask
CMD ["dask-cuda-worker", "--nthreads", "1", "--death-timeout", "60"]
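
For a quicker check outside Docker, the sketch below (just an illustration; it assumes dask-cuda-worker is on PATH in the current environment) inspects the CLI's help text for the flag:

import subprocess

# Sketch: confirm whether the installed dask-cuda-worker still advertises
# --death-timeout (assumes the executable is on PATH).
help_text = subprocess.run(
    ["dask-cuda-worker", "--help"],
    capture_output=True,
    text=True,
).stdout

# Prints False on dask-cuda 22.10.0, consistent with the error above.
print("--death-timeout" in help_text)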

Anything else we need to know?:

Environment:

dask                          2022.9.2
dask-cuda                     22.10.0
distributed                   2022.9.2
prefect                       2.6.0
  • Dask version: 2022.9.2
  • Python version: 3.8.0
  • Operating System:
  • Install method (conda, pip, source): pip
@jacobtomlinson (Member)

Looks like this was removed in rapidsai/dask-cuda#563. cc @charlesbluca @pentschev

I'm surprised this hasn't come up until now.

jacobtomlinson added the bug and provider/aws/ecs labels on Oct 14, 2022
@pentschev (Member)

I'm not familiar with dask-cloudprovider; is --death-timeout something generally important, or something it can live without?

@jacobtomlinson (Member)

I would say it is an important feature. One of dask-cloudprovider's goals is to fail cheaply, so if a worker cannot connect to a scheduler within a timeout it should shut down/terminate to save money.
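
For illustration only (not the fix adopted by dask-cloudprovider, and build_worker_command is a made-up name), one way the excerpt quoted above could stop passing the unsupported flag is to guard it on the GPU path. Note this drops the death-timeout behaviour for GPU workers entirely, which is exactly the trade-off discussed above.

def build_worker_command(worker_gpu, worker_cpu, worker_mem, worker_nthreads=None):
    # Mirrors the ecs.py excerpt quoted in the issue, except --death-timeout
    # is only passed to the plain dask-worker CLI, since dask-cuda removed
    # the flag in rapidsai/dask-cuda#563.
    nthreads = (
        max(int(worker_cpu / 1024), 1) if worker_nthreads is None else worker_nthreads
    )
    cmd = [
        "dask-cuda-worker" if worker_gpu else "dask-worker",
        "--nthreads",
        str(nthreads),
        "--memory-limit",
        "{}MB".format(int(worker_mem)),
    ]
    if not worker_gpu:
        cmd += ["--death-timeout", "60"]
    return cmd

# e.g. build_worker_command(worker_gpu=1, worker_cpu=4096, worker_mem=16384)
# -> ['dask-cuda-worker', '--nthreads', '4', '--memory-limit', '16384MB']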
