Description
As far as I can tell, we don't have a comprehensive doc which covers the expected lifecycle of nodes in Kubernetes.
Specifically, we have lots of intersecting, async things which involve nodes. For example:
- Many environments have VMs "behind" Nodes. Those VMs can be deleted without telling k8s. Then someone comes along and deletes the node "in response", but this is racy and causes confusion.
- Many environments have subsystems which cross-reference things which need to coordinate with node lifecycle. E.g. the service controller puts VMs into LBs, but does so by enumerating Nodes (ignorant of the VM lifecycle).
- Some components manage nodes directly (e.g. Cluster Autoscaler, Karpenter).
For an example of the sort of thing that I think is "weird" for lack of docs, look at kubernetes/autoscaler#5201 (comment). Cluster Autoscaler defines a taint which it uses to prevent work from landing on "draining" nodes (even though we already have the `unschedulable` field for exactly this). The service LB controller currently uses that taint to manage LBs. Cluster Autoscaler removes the VM from the cloud and leaves the Node object around for someone else to clean up.
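To make the two signals concrete, here's a minimal client-go sketch of both: cordoning via `spec.unschedulable` versus applying a CA-style "draining" taint. The taint key shown is `ToBeDeletedByClusterAutoscaler`, which is my understanding of what CA uses today; the node name and kubeconfig path are placeholders.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	node, err := client.CoreV1().Nodes().Get(ctx, "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Signal 1: the built-in field -- cordon the node so the scheduler
	// stops placing new pods on it.
	node.Spec.Unschedulable = true

	// Signal 2: the CA-style taint meaning "this node is draining".
	// Key/value shown here reflect my reading of CA's behavior (value is a
	// timestamp), not a documented contract -- which is exactly the problem.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "ToBeDeletedByClusterAutoscaler",
		Value:  fmt.Sprint(metav1.Now().Unix()),
		Effect: corev1.TaintEffectNoSchedule,
	})

	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```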
The discussion there is about the MEANING of the taint, when it happens, and how to be more graceful. What we want is a clear signal that "this node is going away" and a way for 3rd parties to indicate they have work to do when that happens. It strikes me that we HAVE such a mechanism - deletion and finalizers (sketched below). But CA doesn't do that. I don't know why, but I suspect there are reasons. Should it evolve?
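For comparison, here's a minimal sketch of what the finalizer-based pattern could look like for a third party (e.g. an LB controller) that has cleanup to do when a Node goes away. The finalizer key `example.com/lb-cleanup` and the cleanup helper are hypothetical illustrations, not anything CA or the service controller does today.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// nodeFinalizer is a hypothetical finalizer key owned by a third-party
// controller with per-node cleanup work (e.g. LB deregistration).
const nodeFinalizer = "example.com/lb-cleanup"

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()

	node, err := client.CoreV1().Nodes().Get(ctx, "node-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	if node.DeletionTimestamp == nil {
		// Node is alive: register our finalizer so a future delete
		// blocks until we have acknowledged it.
		if !hasFinalizer(node.Finalizers, nodeFinalizer) {
			node.Finalizers = append(node.Finalizers, nodeFinalizer)
			if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
				panic(err)
			}
		}
		return
	}

	// Node is being deleted: do our cleanup, then drop the finalizer so
	// the API server can finish removing the object.
	removeFromLoadBalancers(node.Name)

	node.Finalizers = removeFinalizer(node.Finalizers, nodeFinalizer)
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}

func hasFinalizer(list []string, f string) bool {
	for _, s := range list {
		if s == f {
			return true
		}
	}
	return false
}

func removeFinalizer(list []string, f string) []string {
	out := make([]string, 0, len(list))
	for _, s := range list {
		if s != f {
			out = append(out, s)
		}
	}
	return out
}

func removeFromLoadBalancers(nodeName string) {
	// Placeholder: a real controller would deregister the backing VM here.
}
```

The point of the sketch is that "this node is going away" becomes an API-visible state (a non-nil `deletionTimestamp`) that any number of parties can hook, instead of a taint whose meaning each consumer has to guess at.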
I'd like to see a sig-node (or sig-arch?) owned statement of the node lifecycle. E.g. if the "right" way to signal "this node is going away" is to delete the node, this would say that. Then we can at least say that we think CA should adopt that pattern. If we think it needs to be more sophisticated (aka complicated) then we should express that.