
No exception handling for the DaskCluster 'already exists' error in DaskJob creation #940

Closed
@guozhans

Description


Describe the issue:
Hi
We use a Flyte task to deploy a DaskJob, and we encountered an issue where the job runner was sometimes missing while cluster creation reported an 'already exists' error.

It looks like the await cluster.create() call in daskjob_create_components (line 774 in the traceback below) raised an 'already exists' exception.

Since the Dask cluster already exists, why not handle the 'already exists' exception when creating the DaskCluster and then continue to create the runner?
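A minimal sketch of the handling suggested above, assuming that an 'already exists' error on the cluster means an earlier attempt (for example one the operator is retrying, as the "Will retry" in the log suggests) already created it. The real code calls `await cluster.create()` on a kr8s object and receives `kr8s._exceptions.ServerError`; to keep the sketch runnable without a cluster, kr8s is replaced by hypothetical stand-ins (`FakeAPI`, `AlreadyExistsError`). `create_or_adopt` and `create_components` are illustrative names, not functions in dask-kubernetes, and modeling the runner as a Pod is also an assumption.

```python
import asyncio


class AlreadyExistsError(Exception):
    """Hypothetical stand-in for the kr8s ServerError raised on 409 AlreadyExists."""


class FakeAPI:
    """Hypothetical in-memory stand-in for the Kubernetes API server (not kr8s)."""

    def __init__(self):
        self.objects = {}  # maps (kind, name) -> spec

    async def create(self, kind, name, spec):
        key = (kind, name)
        if key in self.objects:
            # Like the real API server, refuse a second object with the same name.
            raise AlreadyExistsError(f'{kind} "{name}" already exists')
        self.objects[key] = spec

    async def get(self, kind, name):
        # Raises KeyError if the object does not exist.
        return self.objects[(kind, name)]


async def create_or_adopt(api, kind, name, spec):
    """Create an object, or reuse it if it already exists. Returns 'created' or 'existing'."""
    try:
        await api.create(kind, name, spec)
        return "created"
    except AlreadyExistsError:
        # Left over from an earlier attempt: confirm it is still there and carry on
        # instead of failing the whole handler.
        await api.get(kind, name)
        return "existing"


async def create_components(api, job_name):
    """Hypothetical DaskJob handler: create the cluster first, then the runner."""
    cluster = await create_or_adopt(api, "DaskCluster", job_name, {"job": job_name})
    runner = await create_or_adopt(api, "Pod", f"{job_name}-runner", {"job": job_name})
    return cluster, runner


async def main():
    api = FakeAPI()
    name = "fn2oaqa4432x5o-n0-0-dn7-0"
    # Simulate an earlier attempt that created the cluster but never reached the runner.
    await api.create("DaskCluster", name, {"job": name})
    # The retried handler no longer stops at the cluster; it goes on to create the runner.
    print(await create_components(api, name))  # ('existing', 'created')


if __name__ == "__main__":
    asyncio.run(main())
```

One design caveat: adopting an existing object assumes it was created from the same spec. A real fix would probably also compare the existing cluster against the desired spec, and catch only the AlreadyExists case rather than every server error.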

Error message:

Handler 'daskjob_create_components/status.jobStatus' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 168, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 763, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '409 Conflict' for url 'https://10.0.0.1/apis/kubernetes.dask.org/v1...
  ...", line 774, in daskjob_create_components
    await cluster.create()
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 320, in create
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 186, in call_api
    raise ServerError(
kr8s._exceptions.ServerError: daskclusters.kubernetes.dask.org "fn2oaqa4432x5o-n0-0-dn7-0" already exists
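The traceback shows kr8s wrapping the httpx 409 Conflict in a ServerError whose message is the API server's text. Kubernetes also uses 409 for update conflicts, so a handler that swallows this error should check more than the status code or the English message. Below is a small sketch of such a check. It assumes the decoded Kubernetes Status body is available as a dict; kr8s appears to attach it to ServerError as its status attribute, but that should be verified against the installed kr8s version. is_already_exists is a hypothetical helper name, and the example Status bodies are constructed for illustration, not copied from the reporter's logs.

```python
def is_already_exists(status: dict) -> bool:
    """Return True only for the 'name already taken' kind of 409.

    Kubernetes error responses are Status objects with an HTTP `code` and a
    machine-readable `reason`. A 409 can mean "AlreadyExists" (create with a
    name that is taken) or "Conflict" (e.g. an update with a stale
    resourceVersion); only the former is safe to treat as success on create.
    """
    return status.get("code") == 409 and status.get("reason") == "AlreadyExists"


# Illustrative Status bodies (constructed for this example, not taken from the report).
already_exists = {
    "kind": "Status",
    "status": "Failure",
    "message": 'daskclusters.kubernetes.dask.org "fn2oaqa4432x5o-n0-0-dn7-0" already exists',
    "reason": "AlreadyExists",
    "code": 409,
}
stale_update = {"kind": "Status", "status": "Failure", "reason": "Conflict", "code": 409}

print(is_already_exists(already_exists))  # True
print(is_already_exists(stale_update))  # False
```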

Environment:

  • Dask version: 2024.10.0
  • Python version: 3.12
  • Operating System:
  • Install method (conda, pip, source):
