
Endless nodes are created after expireAfter elapses on a node in some scenarios #1842

Closed
@otoupin-nsesi

Description


Observed Behavior:

After expireAfter elapses on a node, its pods start getting evicted, and endless new nodes are created to try to schedule those pods. Also, pods that don't have PDBs are NOT evicted.

Expected Behavior:

After expireAfter elapses on a node, its pods start getting evicted, and at most one node is created to schedule those pods. Also, pods that don't have PDBs are evicted. There may be the odd pod whose PDB prevents the node from getting recycled, but in that case we can set terminationGracePeriod (see the sketch below).
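For reference, this is roughly the configuration we have in mind. It is a minimal sketch only, assuming the v1 NodePool API where expireAfter and terminationGracePeriod sit on the NodeClaim template; the pool name, durations, and node class below are made up for illustration:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: dev                          # hypothetical pool name
spec:
  template:
    spec:
      expireAfter: 24h               # recycle nodes after 24h (our dev setting)
      terminationGracePeriod: 1h     # after 1h of draining, proceed even past blocking PDBs
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # hypothetical node class
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```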

Reproduction Steps:

  1. Have one CloudNativePG database in the cluster (or a similar workload => single replica & a PDB; a minimal stand-in manifest is sketched after this list).
  2. CloudNativePG will add a PDB to the primary.
  3. Have a nodepool with a relatively short expiry (expireAfter). In our case dev environments are set to 24h, so we caught this early.
  4. Once a node expires, a weird behaviour is triggered.
    1. As expected, in v1 expiries are now forceful, so Karpenter begins to evict the pods.
    2. As expected, a new node is spun up to take up the slack.
    3. But then the problems start:
      1. Since there is a PDB on a single replica (there is only one PG primary at a time), eviction of that pod is blocked. So far so good (this is also the old behaviour: in v0.37.x the node just can't expire until we restart the database manually or kill the primary).
      2. However, any other pods on this node are not evicted either, although the documentation and the log messages suggest they should be.
      3. The new node from earlier is nominated for those pods, but they never transfer to that node, as they are never evicted.
      4. Then at the next batch of pod scheduling, we get "found provisionable pod(s)" again, and a new nodeclaim is added (for the same pods as earlier).
      5. And again
      6. And again
      7. And again
    4. So we end up in a situation where we have a lot of unused nodes, containing only daemonsets and new workloads.
  5. At that point, I restarted the database, the primary moved, the PDB was removed, and everything could then slowly heal. However, there was no sign of the "infinite nodeclaim creation" ever ending before that.
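For anyone trying to reproduce step 1 without CloudNativePG, a minimal stand-in with the same shape might look like the sketch below. The names, labels, and image are made up; CloudNativePG creates and manages its own PDB for the primary, this just mimics a single replica guarded by minAvailable: 1:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-db            # hypothetical stand-in for the PG primary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: single-replica-db
  template:
    metadata:
      labels:
        app: single-replica-db
    spec:
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-db
spec:
  minAvailable: 1                    # with replicas: 1, every eviction is blocked
  selector:
    matchLabels:
      app: single-replica-db
```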

We believe this is a bug; we couldn't find a workaround (aside from removing expireAfter), and have reverted to the v0.37.x series for now.

A few clues:
The state of the cluster 30m-45m after expiry: node 53-23 is the one that expired. Any nodes younger than 30min are mostly empty (aside from daemonsets).

[Screenshot: node-create-hell-clean]

On the expired node, the pods are nominated to be scheduled on a different node, but as you can see this never happens.

NOTE: I don't recall 100% whether this screenshot shows the CloudNativePG primary itself or one of its neighbouring pods, but I think it's the primary.

[Screenshot: node-should-schedule]

And finally, the log line that appears after every scheduling event saying it found provisionable pod(s); each occurrence precedes a new "unnecessary" nodeclaim.

karpenter-5d967c944c-k8xb8 {"level":"INFO","time":"2024-11-13T22:47:24.148Z","logger":"controller","message":"found provisionable pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"7c981fa7-3071-4de8-87b3-370a15664ba7","Pods":"monitoring/monitoring-grafana-pg-1, kube-system/coredns-58745b69fb-sd222, cnpg-system/cnpg-cloudnative-pg-7667bd696d-lrqvb, kube-system/aws-load-balancer-controller-74b584c6df-fckdn, harbor/harbor-container-webhook-78657f5698-kmmrz","duration":"87.726672ms"}

Versions:

  • Chart Version: 1.0.7
  • Kubernetes Version (kubectl version): v1.29.10

Extra:

  • I would like to build or modify a test case to prove/diagnose this behaviour; any pointers? I've looked at the source code, but I wanted to post this report first to gather feedback.
  • Is there any other workaround aside from disabling expireAfter on the node pool (our current stopgap, sketched below)?
  • Finally, in our context this bug is triggered by CloudNativePG primaries, but it would apply to any workload with a single replica and a PDB with minAvailable: 1.
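The workaround we refer to above amounts to something like this on the NodePool (a sketch only; assuming expireAfter accepts Never, as the v1 API appears to):

```yaml
# Stopgap on the NodePool until this is resolved (field placement per the v1 API):
spec:
  template:
    spec:
      expireAfter: Never   # disables expiry-driven node disruption entirely
```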

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
