Description
What problem are you trying to solve?
Today, drift is only observable per-NodeClaim, via the Drifted status condition. There is no aggregate, NodePool-level signal that tells you how far a NodePool is through reconciling drift across the nodes it owns.
This makes it hard for external systems to gate on "this NodePool has finished rolling out a change" without listing and aggregating NodeClaims themselves. Concretely, we run Karpenter under Argo CD (GitOps). When a NodePool/EC2NodeClass change triggers drift, we want Argo CD to report the Application as Progressing until the drift-driven node replacement is substantially complete, and Healthy once it is.
Argo CD's health is evaluated per resource with a sandboxed Lua check that has no access to other resources. So a NodePool health check can only read the NodePool's own status. A NodeClaim health check could in principle read each NodeClaim's Drifted condition, but:
- It would require NodeClaims to appear as children of the NodePool in Argo's resource tree (which, in our multi-cluster setup, they currently do not), and
- It forces all-or-nothing (100%) semantics, with no notion of a tolerance threshold or a settling window.
As a result we maintain a PostSync Job that polls NodeClaims, groups them by karpenter.sh/nodepool, computes the percentage no longer Drifted, and blocks until each NodePool crosses a threshold (e.g. 90%). This is exactly the kind of aggregation we'd expect Karpenter itself to be able to expose, since it already owns the NodeClaims and tracks their drift state.
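The aggregation the PostSync Job performs can be sketched roughly as follows. This is a minimal illustration, not our actual Job. It assumes NodeClaims have already been fetched as plain Kubernetes-style dicts, and it treats a NodeClaim as drifted only when it carries a `Drifted` condition with status `"True"`; an absent condition counts as not drifted. The helper names (`drift_completion`, `pools_below_threshold`, `make_nodeclaim`) and the 90% default are hypothetical.

```python
from collections import defaultdict

# Label on each NodeClaim that records the owning NodePool.
NODEPOOL_LABEL = "karpenter.sh/nodepool"


def is_drifted(nodeclaim: dict) -> bool:
    """True if the NodeClaim has a Drifted condition with status "True"."""
    for cond in nodeclaim.get("status", {}).get("conditions", []):
        if cond.get("type") == "Drifted":
            return cond.get("status") == "True"
    return False  # no Drifted condition at all: treat as not drifted


def drift_completion(nodeclaims: list) -> dict:
    """Map each NodePool name to the percentage of its NodeClaims not Drifted."""
    totals = defaultdict(int)
    settled = defaultdict(int)
    for nc in nodeclaims:
        pool = nc.get("metadata", {}).get("labels", {}).get(NODEPOOL_LABEL)
        if pool is None:
            continue  # NodeClaim without the NodePool label: ignore it
        totals[pool] += 1
        if not is_drifted(nc):
            settled[pool] += 1
    return {pool: 100.0 * settled[pool] / totals[pool] for pool in totals}


def pools_below_threshold(completion: dict, threshold: float = 90.0) -> list:
    """Sorted names of NodePools whose completion is still under the threshold."""
    return sorted(pool for pool, pct in completion.items() if pct < threshold)


def make_nodeclaim(pool: str, drifted: bool) -> dict:
    """Hypothetical helper that builds a minimal NodeClaim-shaped dict for examples."""
    return {
        "metadata": {"labels": {NODEPOOL_LABEL: pool}},
        "status": {
            "conditions": [{"type": "Drifted", "status": "True" if drifted else "False"}]
        },
    }


if __name__ == "__main__":
    # Ten NodeClaims in "default" (one drifted) and two in "gpu" (one drifted).
    claims = [make_nodeclaim("default", i == 0) for i in range(10)]
    claims += [make_nodeclaim("gpu", True), make_nodeclaim("gpu", False)]
    completion = drift_completion(claims)
    print(completion)
    print(pools_below_threshold(completion))
```

A real Job would wrap this in a polling loop and exit once `pools_below_threshold` returns an empty list. That loop is the part a NodePool-level status field would make unnecessary.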
Proposal
Expose drift/rollout progress at the NodePool level, in NodePool.status. Any of the following would be sufficient for our use case (in rough order of preference):
- Counts in NodePool.status, e.g. status.driftedNodeClaims / status.nodeClaims (analogous to the existing status.nodes and status.resources), so consumers can compute the completion percentage directly.
- A NodePool-level condition such as Drifted (status: "True" while any owned NodeClaim is drifted, "False" once reconciliation is complete), mirroring how the per-NodeClaim Drifted condition works.
- Both: a condition for a simple boolean gate plus counts for threshold-based logic.
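To make the proposed semantics concrete, here is a minimal sketch of how the counts and the condition could be derived together from the NodeClaims owned by one NodePool. The field names `nodeClaims` and `driftedNodeClaims` come from the proposal and are hypothetical: they are not part of the current NodePool API. The function name `nodepool_drift_status` and the `message` text are likewise illustrative. The sketch operates on plain dicts rather than real Karpenter types.

```python
def is_drifted(nodeclaim: dict) -> bool:
    """True if the NodeClaim has a Drifted condition with status "True"."""
    for cond in nodeclaim.get("status", {}).get("conditions", []):
        if cond.get("type") == "Drifted":
            return cond.get("status") == "True"
    return False


def nodepool_drift_status(nodeclaims: list) -> dict:
    """Aggregate one NodePool's NodeClaims into the proposed (hypothetical) status fields."""
    total = len(nodeclaims)
    drifted = sum(1 for nc in nodeclaims if is_drifted(nc))
    return {
        "nodeClaims": total,
        "driftedNodeClaims": drifted,
        "conditions": [
            {
                "type": "Drifted",
                # "True" while any owned NodeClaim is drifted, "False" once none are.
                "status": "True" if drifted > 0 else "False",
                "message": f"{drifted} of {total} NodeClaims drifted",
            }
        ],
    }


if __name__ == "__main__":
    owned = [
        {"status": {"conditions": [{"type": "Drifted", "status": "True"}]}},
        {"status": {"conditions": []}},
        {"status": {"conditions": []}},
    ]
    print(nodepool_drift_status(owned))
```

Deriving both the counts and the condition in a single pass guarantees they never disagree. A boolean consumer can read the condition, and a threshold-based consumer can read the counts.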
This would let any GitOps/automation system gate on a NodePool's own status without re-implementing NodeClaim aggregation, and it would make rollout progress visible in kubectl get nodepool and dashboards.
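On the consumer side, the gate then needs nothing but the NodePool's own status, which fits the per-resource constraint of Argo CD health checks. In Argo CD itself a custom health check is written in Lua. The Python below only mirrors the decision logic, assuming the hypothetical `nodeClaims` / `driftedNodeClaims` fields and a `Drifted` condition as proposed. The 90% threshold is an example value, and the fallback to `Progressing` when nothing is reported is a design choice, not established behavior.

```python
def nodepool_health(nodepool: dict, threshold: float = 90.0) -> tuple:
    """Return an Argo CD-style (health status, message) from NodePool.status alone."""
    status = nodepool.get("status", {})
    total = status.get("nodeClaims")  # hypothetical field from the proposal
    drifted = status.get("driftedNodeClaims")  # hypothetical field from the proposal

    if total is None or drifted is None:
        # No counts exposed: fall back to the (hypothetical) NodePool Drifted condition.
        for cond in status.get("conditions", []):
            if cond.get("type") == "Drifted":
                if cond.get("status") == "True":
                    return "Progressing", "NodePool reports Drifted=True"
                return "Healthy", "NodePool reports Drifted=False"
        # Nothing reported yet; treating this as Progressing is a policy choice.
        return "Progressing", "drift status not yet reported"

    if total == 0:
        return "Healthy", "NodePool owns no NodeClaims"

    pct = 100.0 * (total - drifted) / total
    if pct >= threshold:
        return "Healthy", f"{pct:.0f}% of NodeClaims reconciled"
    return "Progressing", f"{pct:.0f}% of NodeClaims reconciled (threshold {threshold:.0f}%)"


if __name__ == "__main__":
    print(nodepool_health({"status": {"nodeClaims": 10, "driftedNodeClaims": 2}}))
```

The threshold lives in the consumer, not in Karpenter, so each team can pick its own tolerance. Karpenter would only need to report the raw counts.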
How important is this feature to you?
Moderately important. We have a working PostSync Job that does the aggregation today, so we are not blocked, but it adds operational surface (a Job + ServiceAccount + ClusterRole + a maintained image per cluster) purely to compute information Karpenter already has internally. A NodePool-level status field/condition would let us replace that machinery with a standard Argo CD custom health check and would benefit anyone integrating Karpenter rollouts with external orchestration.