Missing idleTimeout key in daskcluster_autoshutdown #882

Open
@timomaier

Description

Describe the issue:

My KubeClusters are sometimes not shut down properly on Kubernetes once their work is done. The Kubernetes logs show an exception in a kopf handler that is retried indefinitely, apparently because the spec dict passed to daskcluster_autoshutdown has no idleTimeout key:

  Timer 'daskcluster_autoshutdown' failed with an exception. Will retry.
  Traceback (most recent call last):
    File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
      result = await invoke_handler(
    File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
      result = await invocation.invoke(
    File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
      result = await fn(**kwargs)  # type: ignore
    File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 852, in daskcluster_autoshutdown
      if spec["idleTimeout"]:
    File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/structs/dicts.py", line 297, in __getitem__
      return resolve(self._src, self._path + (item,))
    File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/structs/dicts.py", line 121, in resolve
      result = result[key]
  KeyError: 'idleTimeout'

When I remove these lines from the DaskCluster resource YAML in Kubernetes, the problem goes away:

    finalizers:
      - kopf.zalando.org/KopfFinalizerMarker

Is it correct that daskcluster_autoshutdown, shown below, receives spec as a specification dict, e.g. from make_cluster_spec(..., idle_timeout=5)? I tried explicitly adding idle_timeout, but the problem persists.

@kopf.timer("daskcluster.kubernetes.dask.org", interval=5.0)
async def daskcluster_autoshutdown(spec, name, namespace, logger, **kwargs):
    if spec["idleTimeout"]:
        try:
            idle_since = await check_scheduler_idle(
                scheduler_service_name=f"{name}-scheduler",
                namespace=namespace,
                logger=logger,
            )
        except Exception:
            logger.warn("Unable to connect to scheduler, skipping autoshutdown check.")
            return
        if idle_since and time.time() > idle_since + spec["idleTimeout"]:
            cluster = await DaskCluster.get(name, namespace=namespace)
            await cluster.delete()
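
For illustration, here is a minimal sketch of how the check could tolerate a missing key. It is not the operator's actual code or an agreed fix: `should_autoshutdown` is a hypothetical helper, and it assumes the spec object supports the usual mapping `.get()` method, as a plain dict does.

```python
import time
from typing import Any, Mapping, Optional


def should_autoshutdown(
    spec: Mapping[str, Any],
    idle_since: Optional[float],
    now: Optional[float] = None,
) -> bool:
    """Hypothetical helper: True if the cluster has been idle past its timeout."""
    # .get() falls back to 0 instead of raising KeyError when the key is absent,
    # so a spec without idleTimeout is treated as "autoshutdown disabled".
    idle_timeout = spec.get("idleTimeout", 0)
    if not idle_timeout or not idle_since:
        return False
    if now is None:
        now = time.time()
    return now > idle_since + idle_timeout


# A spec without the key no longer raises; it just never triggers a shutdown.
print(should_autoshutdown({}, idle_since=100.0, now=200.0))
print(should_autoshutdown({"idleTimeout": 5}, idle_since=100.0, now=200.0))
```

With a lookup like this, a DaskCluster resource whose spec lacks idleTimeout would skip the check rather than fail on every timer tick.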

I'm not sure whether this is a genuine bug, an issue with kopf, or a misconfiguration on my end. I'd appreciate any help.
I'd also be fine with just removing the timer/finalizer, if that's possible.

Anything else we need to know?:

Environment:

  • Dask version: 2024.4.1
  • Dask operator version: 2024.4.0
  • Python version: 3.10.12
  • kopf python version: 1.37.1
  • Operating System: Linux
  • Install method (conda, pip, source): pip
