
Endless nodes are created after expireAfter elapses on a node in some scenarios #1842

Closed
@otoupin-nsesi

Description


Observed Behavior:

After expireAfter elapses on a node, its pods start getting evicted, and endless new nodes are created to try to schedule those pods. Also, pods that don't have PDBs are NOT evicted.

Expected Behavior:

After expireAfter elapses on a node, its pods start getting evicted, and at most one node is created to schedule those pods. Also, pods that don't have PDBs are evicted. There may be the odd pod whose PDB prevents the node from getting recycled, but in that case we can set terminationGracePeriod (see the sketch below).
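For reference, this is roughly the configuration we have in mind. It is a minimal sketch only, assuming the v1 NodePool API where expireAfter and terminationGracePeriod sit on the NodeClaim template; the pool name, durations, and node class below are made up for illustration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: dev                          # hypothetical pool name
spec:
  template:
    spec:
      expireAfter: 24h               # recycle nodes after 24h (our dev setting)
      terminationGracePeriod: 1h     # after 1h of draining, proceed even past blocking PDBs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # hypothetical node class
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```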

Reproduction Steps:

  1. Have one CloudNativePG database in the cluster (or a similar workload => single replica & a PDB; a minimal stand-in manifest is sketched after this list).
  2. CloudNativePG will add a PDB to the primary.
  3. Have a nodepool with a relatively short expiry (expireAfter). In our case dev environments are set to 24h, so we caught this early.
  4. Once a node expires, a weird behaviour is triggered.
    1. As expected, in v1 expiries are now forceful, so Karpenter begins to evict the pods.
    2. As expected, a new node is spun up to take up the slack.
    3. But then the problems start:
      1. Since there is a PDB on a single replica (there is only one PG primary at a time), eviction of that pod is blocked. So far so good (this is also the old behaviour: in v0.37.x the node just can't expire until we restart the database manually or kill the primary).
      2. However, any other pods on this node are not evicted either, although the documentation and the log messages suggest they should be.
      3. The new node from earlier is nominated for those pods, but they never transfer to that node, as they are never evicted.
      4. Then at the next batch of pod scheduling, we get "found provisionable pod(s)" again, and a new nodeclaim is added (for the same pods as earlier).
      5. And again
      6. And again
      7. And again
    4. So we end up in a situation where we have a lot of unused nodes, containing only daemonsets and new workloads.
  5. At that point, I restarted the database, the primary moved, the PDB was removed, and everything could then slowly heal. However, there was no sign of the "infinite nodeclaim creation" ever ending before that.
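For anyone trying to reproduce step 1 without CloudNativePG, a minimal stand-in with the same shape might look like the sketch below. The names, labels, and image are made up; CloudNativePG creates and manages its own PDB for the primary, this just mimics a single replica guarded by minAvailable: 1:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-db            # hypothetical stand-in for the PG primary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: single-replica-db
  template:
    metadata:
      labels:
        app: single-replica-db
    spec:
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-db
spec:
  minAvailable: 1                    # with replicas: 1, every eviction is blocked
  selector:
    matchLabels:
      app: single-replica-db
```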

We believe this is a bug; we couldn't find a workaround (aside from removing expireAfter), and have reverted to the v0.37.x series for now.

A few clues:
The state of the cluster 30m-45m after expiry: node 53-23 is the one that expired. Any nodes younger than 30min are mostly empty (aside from daemonsets).

[Screenshot: node-create-hell-clean]

On the expired node, the pods are nominated to be scheduled on a different node, but as you can see this never happens.

NOTE: I don't recall 100% whether this screenshot shows the CloudNativePG primary itself or one of its neighbouring pods, but I think it's the primary.

[Screenshot: node-should-schedule]

And finally, the log line that appears after every scheduling event saying it found provisionable pod(s); each occurrence precedes a new "unnecessary" nodeclaim.

karpenter-5d967c944c-k8xb8 {"level":"INFO","time":"2024-11-13T22:47:24.148Z","logger":"controller","message":"found provisionable pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"7c981fa7-3071-4de8-87b3-370a15664ba7","Pods":"monitoring/monitoring-grafana-pg-1, kube-system/coredns-58745b69fb-sd222, cnpg-system/cnpg-cloudnative-pg-7667bd696d-lrqvb, kube-system/aws-load-balancer-controller-74b584c6df-fckdn, harbor/harbor-container-webhook-78657f5698-kmmrz","duration":"87.726672ms"}

Versions:

  • Chart Version: 1.0.7
  • Kubernetes Version (kubectl version): v1.29.10

Extra:

  • I would like to build or modify a test case to prove/diagnose this behaviour; any pointers? I've looked at the source code, but I wanted to post this report first to gather feedback.
  • Is there any other workaround aside from disabling expireAfter on the node pool (our current stopgap, sketched below)?
  • Finally, in our context this bug is triggered by CloudNativePG primaries, but it would apply to any workload with a single replica and a PDB with minAvailable: 1.
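The workaround we refer to above amounts to something like this on the NodePool (a sketch only; assuming expireAfter accepts Never, as the v1 API appears to):

```yaml
# Stopgap on the NodePool until this is resolved (field placement per the v1 API):
spec:
  template:
    spec:
      expireAfter: Never   # disables expiry-driven node disruption entirely
```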

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
