Description
Describe the issue:
My KubeClusters sometimes do not get shut down properly on Kubernetes when they're done with their work. Kubernetes logs state that there's an exception in a kopf finalizer which is retried indefinitely, apparently due to the spec dict given to daskcluster_autoshutdown:
Timer 'daskcluster_autoshutdown' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs) # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 852, in daskcluster_autoshutdown
    if spec["idleTimeout"]:
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/structs/dicts.py", line 297, in __getitem__
    return resolve(self._src, self._path + (item,))
  File "/usr/local/lib/python3.10/site-packages/kopf/_cogs/structs/dicts.py", line 121, in resolve
    result = result[key]
KeyError: 'idleTimeout'
When I remove these lines from the DaskCluster resource YAML in Kubernetes, the problem goes away:
finalizers:
- kopf.zalando.org/KopfFinalizerMarker
Is it correct that daskcluster_autoshutdown (shown below) receives spec as a specification dict, e.g. from make_cluster_spec(..., idle_timeout=5)? I tried explicitly adding the idle_timeout, but the problem persists:
@kopf.timer("daskcluster.kubernetes.dask.org", interval=5.0)
async def daskcluster_autoshutdown(spec, name, namespace, logger, **kwargs):
    if spec["idleTimeout"]:
        try:
            idle_since = await check_scheduler_idle(
                scheduler_service_name=f"{name}-scheduler",
                namespace=namespace,
                logger=logger,
            )
        except Exception:
            logger.warn("Unable to connect to scheduler, skipping autoshutdown check.")
            return
        if idle_since and time.time() > idle_since + spec["idleTimeout"]:
            cluster = await DaskCluster.get(name, namespace=namespace)
            await cluster.delete()
Not sure if this is a proper bug, an issue with kopf, or something misconfigured on my end. Appreciate any help.
I'd also be fine with just removing the timer/finalizer if that's possible.
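For illustration, here is a minimal sketch of how the idle check could tolerate a missing idleTimeout key instead of raising KeyError. It assumes the spec object passed by kopf supports the standard mapping .get() method; should_autoshutdown is a hypothetical helper name, not part of dask_kubernetes, and this is not necessarily how the operator would choose to fix it.

```python
import time
from typing import Any, Mapping, Optional


def should_autoshutdown(
    spec: Mapping[str, Any],
    idle_since: Optional[float],
    now: Optional[float] = None,
) -> bool:
    """Hypothetical helper: True if the cluster has been idle past its timeout."""
    # .get() returns None when the key is absent, where spec["idleTimeout"]
    # would raise KeyError as in the traceback above.
    idle_timeout = spec.get("idleTimeout")
    # A missing or zero timeout means autoshutdown is disabled; a falsy
    # idle_since means the scheduler is not reported as idle.
    if not idle_timeout or not idle_since:
        return False
    current = time.time() if now is None else now
    return current > idle_since + idle_timeout
```

With a guard like this, a DaskCluster resource whose spec lacks idleTimeout would simply skip the autoshutdown check rather than fail on every timer tick.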
Anything else we need to know?:
Environment:
- Dask version: 2024.4.1
- Dask operator version: 2024.4.0
- Python version: 3.10.12
- kopf python version: 1.37.1
- Operating System: Linux
- Install method (conda, pip, source): pip