Surface NodeClaim drift/rollout progress in NodePool status #3071

Description

@javanthropus

What problem are you trying to solve?

Today, drift is only observable per-NodeClaim, via the Drifted status condition. There is no aggregate, NodePool-level signal that tells you how far a NodePool is through reconciling drift across the nodes it owns.

This makes it hard for external systems to gate on "this NodePool has finished rolling out a change" without listing and aggregating NodeClaims themselves. Concretely, we run Karpenter under Argo CD (GitOps). When a NodePool/EC2NodeClass change triggers drift, we want Argo CD to report the Application as Progressing until the drift-driven node replacement is substantially complete, and Healthy once it is.

Argo CD evaluates health per resource with a sandboxed Lua check that has no access to other resources, so a NodePool health check can read only the NodePool's own status. A NodeClaim health check could in principle read each NodeClaim's Drifted condition, but:

  • It would require NodeClaims to appear as children of the NodePool in Argo's resource tree (which, in our multi-cluster setup, they currently do not), and
  • It forces all-or-nothing (100%) semantics, with no notion of a tolerance threshold or a settling window.

As a result we maintain a PostSync Job that polls NodeClaims, groups them by karpenter.sh/nodepool, computes the percentage no longer Drifted, and blocks until each NodePool crosses a threshold (e.g. 90%). This is exactly the kind of aggregation we'd expect Karpenter itself to be able to expose, since it already owns the NodeClaims and tracks their drift state.
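For reference, the aggregation our Job performs can be sketched roughly as follows. This is a minimal illustration, not the Job's actual code: it assumes the NodeClaims have already been fetched as plain dicts (for example the `items` of `kubectl get nodeclaims -o json`), and the function names `is_drifted`, `rollout_progress`, and `pools_below_threshold` are hypothetical.

```python
from collections import defaultdict

# Label Karpenter sets on NodeClaims to identify the owning NodePool.
NODEPOOL_LABEL = "karpenter.sh/nodepool"


def is_drifted(nodeclaim: dict) -> bool:
    """Return True if the NodeClaim has a Drifted condition with status "True"."""
    conditions = nodeclaim.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Drifted" and c.get("status") == "True"
        for c in conditions
    )


def rollout_progress(nodeclaims: list) -> dict:
    """Map each NodePool name to the fraction of its NodeClaims that are not Drifted."""
    totals = defaultdict(int)
    drifted = defaultdict(int)
    for nc in nodeclaims:
        pool = nc.get("metadata", {}).get("labels", {}).get(NODEPOOL_LABEL)
        if pool is None:
            continue  # no NodePool label; nothing to attribute this NodeClaim to
        totals[pool] += 1
        if is_drifted(nc):
            drifted[pool] += 1
    return {pool: (totals[pool] - drifted[pool]) / totals[pool] for pool in totals}


def pools_below_threshold(nodeclaims: list, threshold: float = 0.9) -> list:
    """Return the sorted names of NodePools whose progress is still under the threshold."""
    progress = rollout_progress(nodeclaims)
    return sorted(pool for pool, frac in progress.items() if frac < threshold)
```

The Job keeps polling while `pools_below_threshold(...)` is non-empty. The 0.9 default mirrors the 90% example above; the real threshold is a per-deployment choice.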

Proposal

Expose drift/rollout progress at the NodePool level, in NodePool.status. Any of the following would be sufficient for our use case (in rough order of preference):

  1. Counts in NodePool.status, e.g. status.driftedNodeClaims and status.nodeClaims (analogous to the existing status.nodes and status.resources), so consumers can compute completion percentage directly.
  2. A NodePool-level condition such as Drifted (status: "True" while any owned NodeClaim is drifted, "False" once reconciliation is complete), mirroring how the per-NodeClaim Drifted condition works.
  3. Both: a condition for a simple boolean gate plus counts for threshold-based logic.

This would let any GitOps/automation system gate on a NodePool's own status without re-implementing NodeClaim aggregation, and it would make rollout progress visible in kubectl get nodepool and dashboards.
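To make the intended consumption concrete, here is a rough sketch of the gate a consumer could apply if the counts from option 1 existed. The field names `status.nodeClaims` and `status.driftedNodeClaims` are the hypothetical ones proposed above and do not exist in Karpenter today. The function `nodepool_health` is likewise hypothetical, and the real Argo CD check would be written in Lua; Python is used here only to show the logic.

```python
def nodepool_health(nodepool: dict, threshold: float = 0.9) -> str:
    """Derive an Argo CD-style health status from a NodePool's own status.

    Assumes the proposed (not yet existing) fields status.nodeClaims and
    status.driftedNodeClaims are present on the NodePool object.
    """
    status = nodepool.get("status", {})
    total = status.get("nodeClaims")
    drifted = status.get("driftedNodeClaims")

    if total is None or drifted is None:
        # Counts not reported (e.g. an older Karpenter version): cannot decide.
        return "Unknown"
    if total == 0:
        # No NodeClaims means nothing is left to roll out.
        return "Healthy"

    fraction_done = (total - drifted) / total
    return "Healthy" if fraction_done >= threshold else "Progressing"
```

Because the check reads only the NodePool itself, it fits Argo CD's per-resource health model, and the threshold stays on the consumer side rather than in Karpenter.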

How important is this feature to you?

Moderately important. We have a working PostSync Job that does the aggregation today, so we are not blocked, but it adds operational surface (a Job + ServiceAccount + ClusterRole + a maintained image per cluster) purely to compute information Karpenter already has internally. A NodePool-level status field/condition would let us replace that machinery with a standard Argo CD custom health check and would benefit anyone integrating Karpenter rollouts with external orchestration.

