Thanks for the great work on the dask-operator!
Three quick notes from trying it on our cluster:
- The scaling of workers assumes that each worker pod runs only one dask process. With multiple processes per pod, the worker names (in dask) look like `...-ef8274c-0`, `...-ef8274c-1` to mark which process they belong to, while the pod name is `...-ef8274c`. As a result, daskworkergroup scale-up works but scale-down fails. A note in the documentation that one process per pod is required would be helpful.
- If you use NodePort (as in the examples), the default service account created during the latest helm install doesn't have permission to list nodes, which can lead to errors. ClusterIP works great.
- The kubeflow patch script (`kubectl patch clusterrole kubeflow-kubernetes-edit --patch '{"rules": [{"apiGroups": ["kubernetes.dask.org"],"resources": ["*"],"verbs": ["*"]}]}'`) overwrites the kubeflow permissions rather than adding the dask permissions. I'm not sure what the right way to patch-by-adding is.
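
On the last point, one possible way to add the rule rather than replace the list is a JSON Patch (`--type=json`) whose `add` operation targets `/rules/-`, which appends to the existing array. The default patch type is a strategic merge, which appears to replace the whole `rules` list. The sketch below is untested against a Kubeflow install. It assumes the ClusterRole already has a `rules` list and that its rules are not managed through an `aggregationRule` (in which case Kubernetes would overwrite manual edits). The `PATCH` variable and the `append_dask_rule` function are names made up for illustration.

```shell
# JSON Patch (RFC 6902): "add" with the path "/rules/-" appends one element
# to the existing rules array instead of replacing the array.
PATCH='[{"op": "add", "path": "/rules/-", "value": {"apiGroups": ["kubernetes.dask.org"], "resources": ["*"], "verbs": ["*"]}}]'

# Hypothetical helper: wrapped in a function so nothing touches the cluster
# until it is called explicitly.
append_dask_rule() {
  kubectl patch clusterrole kubeflow-kubernetes-edit --type=json --patch "$PATCH"
}
```

Calling `append_dask_rule` and then inspecting the result with `kubectl get clusterrole kubeflow-kubernetes-edit -o yaml` should show whether the original kubeflow rules are still present alongside the new `kubernetes.dask.org` rule.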