
Add node lifecycle documentation #45074

Open
@thockin


As far as I can tell, we don't have a comprehensive doc which covers the expected lifecycle of nodes in Kubernetes.

Specifically, we have lots of intersecting, async things which involve nodes. For example:

  • Many environments have VMs "behind" Nodes. Those VMs can be deleted without telling k8s. Then someone comes along and deletes the node "in response", but this is racy and causes confusion.
  • Many environments have subsystems which cross-reference things which need to coordinate with node lifecycle. E.g. the service controller puts VMs into LBs, but does so by enumerating Nodes (ignorant of the VM lifecycle).
  • Some components manage nodes directly (e.g. Cluster Autoscaler, Karpenter).

For an example of things that I think are "weird" for lack of docs, look at kubernetes/autoscaler#5201 (comment). Cluster Autoscaler defines a taint which it uses to prevent work from landing on "draining" nodes (even though we already have the unschedulable field). The service LB controller currently uses that taint to manage LBs. Cluster Autoscaler removes the VM from the cloud, and leaves the Node object around for someone else to clean up.

The discussion is about the MEANING of the taint, when it happens, and how to be more graceful. What we want is a clear signal that "this node is going away" and a way for 3rd parties to indicate they have work to do when that happens. It strikes me that we HAVE such a mechanism - delete and finalizers. But CA doesn't do that. I don't know why, but I suspect there are reasons. Should it evolve?
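To make the "delete and finalizers" idea concrete, here is a toy model of how I understand the flow. It is only a sketch: `Store`, `Node`, and the finalizer name `example.com/lb-drain` are hypothetical stand-ins written for illustration, not the real API types or client-go calls. The point is just the ordering: a delete marks the node as terminating, and the object only disappears once every interested party has removed its finalizer.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Node is a toy stand-in for a Node object's metadata (hypothetical,
// not the real Kubernetes API type).
type Node struct {
	Name              string
	DeletionTimestamp *time.Time // nil until a delete is requested
	Finalizers        []string   // names of parties with cleanup still pending
}

// Store is a hypothetical in-memory replacement for the API server.
type Store struct {
	nodes map[string]*Node
}

func NewStore() *Store { return &Store{nodes: map[string]*Node{}} }

// Create registers a node together with any finalizers third parties want.
func (s *Store) Create(name string, finalizers ...string) *Node {
	n := &Node{Name: name, Finalizers: append([]string(nil), finalizers...)}
	s.nodes[name] = n
	return n
}

// Delete requests deletion. With finalizers present the node is only
// marked as terminating; this is the "node is going away" signal.
func (s *Store) Delete(name string) error {
	n, ok := s.nodes[name]
	if !ok {
		return errors.New("node not found: " + name)
	}
	if n.DeletionTimestamp == nil {
		now := time.Now()
		n.DeletionTimestamp = &now
	}
	s.collect(n)
	return nil
}

// RemoveFinalizer is what a controller would call once its cleanup is done
// (for example, after taking the backing VM out of a load balancer).
func (s *Store) RemoveFinalizer(name, finalizer string) error {
	n, ok := s.nodes[name]
	if !ok {
		return errors.New("node not found: " + name)
	}
	kept := n.Finalizers[:0]
	for _, f := range n.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	n.Finalizers = kept
	s.collect(n)
	return nil
}

// collect drops the node once it is terminating and no finalizers remain.
func (s *Store) collect(n *Node) {
	if n.DeletionTimestamp != nil && len(n.Finalizers) == 0 {
		delete(s.nodes, n.Name)
	}
}

// Exists reports whether the node object is still stored.
func (s *Store) Exists(name string) bool {
	_, ok := s.nodes[name]
	return ok
}

// Terminating reports whether deletion was requested but not yet completed.
func (s *Store) Terminating(name string) bool {
	n, ok := s.nodes[name]
	return ok && n.DeletionTimestamp != nil
}

func main() {
	s := NewStore()
	s.Create("node-a", "example.com/lb-drain")

	_ = s.Delete("node-a")
	fmt.Println("terminating:", s.Terminating("node-a")) // terminating: true

	_ = s.RemoveFinalizer("node-a", "example.com/lb-drain")
	fmt.Println("exists:", s.Exists("node-a")) // exists: false
}
```

In this model a component like CA would call `Delete` first and wait for the object to vanish before removing the VM. Whether that ordering is actually workable for CA is exactly the open question here.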

I'd like to see a sig-node (or sig-arch?) owned statement of the node lifecycle. E.g. if the "right" way to signal "this node is going away" is to delete the node, this would say that. Then we can at least say that we think CA should adopt that pattern. If we think it needs to be more sophisticated (aka complicated) then we should express that.

Metadata

Labels

  • kind/documentation: Categorizes issue or PR as related to documentation.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
  • sig/architecture: Categorizes an issue or PR as relevant to SIG Architecture.
  • sig/docs: Categorizes an issue or PR as relevant to SIG Docs.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
