Description
/kind bug
What happened?
We upgraded some clusters from EKS 1.32 with ebs-csi v1.42.0-eksbuild.1 to EKS 1.33 with ebs-csi v1.45.0-eksbuild.2. Since then, a small number of nodes have become stuck with the ebs-csi taint never being cleared. This leaves the node in a state where workload pods are not scheduled, and Karpenter will not remove the node because it has not finished initializing.
What you expected to happen?
Taint to be removed.
How to reproduce it (as minimally and precisely as possible)?
Unfortunately this is a rare race condition that we can't reproduce reliably.
Anything else we need to know?:
Looking through the audit logs it looks like the ebs-csi pod is not refreshing the node's current taints when the update request fails. It keeps retrying with the same outdated state.
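To illustrate the behaviour we would expect instead, here is a minimal Go sketch of a retry loop that re-reads the taints before every attempt. It does not use the driver's code or client-go: `fakeNode`, `patch`, `without` and `removeTaintWithRefresh` are hypothetical names, and the in-memory `patch` only imitates a "test then replace" patch on /spec/taints.

```go
package main

import (
	"errors"
	"fmt"
)

const ebsKey = "ebs.csi.aws.com/agent-not-ready"

// Taint is a reduced, hypothetical stand-in for the Kubernetes taint object.
type Taint struct {
	Key    string
	Effect string
}

// fakeNode is a hypothetical in-memory stand-in for a Node on the API server.
type fakeNode struct {
	taints []Taint
}

var errTestFailed = errors.New("test op failed: taints changed since they were read")

// get returns a copy of the node's current taints.
func (n *fakeNode) get() []Taint {
	return append([]Taint(nil), n.taints...)
}

// equal reports whether two taint lists match element by element, in order.
func equal(a, b []Taint) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// patch imitates "test" followed by "replace" on /spec/taints: the replace
// is applied only if expected matches the current list exactly.
func (n *fakeNode) patch(expected, desired []Taint) error {
	if !equal(expected, n.taints) {
		return errTestFailed
	}
	n.taints = append([]Taint(nil), desired...)
	return nil
}

// without returns taints minus every entry with the given key.
func without(taints []Taint, key string) []Taint {
	out := []Taint{}
	for _, t := range taints {
		if t.Key != key {
			out = append(out, t)
		}
	}
	return out
}

// removeTaintWithRefresh re-reads the taints before every attempt, so a
// failed test op is retried against fresh state, not a stale snapshot.
func removeTaintWithRefresh(n *fakeNode, key string, attempts int) error {
	err := errors.New("no attempts made")
	for i := 0; i < attempts; i++ {
		current := n.get() // refresh on every attempt
		if err = n.patch(current, without(current, key)); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	node := &fakeNode{taints: []Taint{
		{Key: ebsKey, Effect: "NoExecute"},
		{Key: "efs.csi.aws.com/agent-not-ready", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/not-ready", Effect: "NoExecute"},
	}}

	stale := node.get() // snapshot taken before another writer changes the node
	node.taints = without(node.taints, "node.kubernetes.io/not-ready")

	// Retrying with the stale snapshot fails every time.
	for i := 0; i < 3; i++ {
		fmt.Println("stale attempt", i+1, "error:", node.patch(stale, without(stale, ebsKey)))
	}

	// Retrying with a fresh read succeeds.
	fmt.Println("refreshing retry error:", removeTaintWithRefresh(node, ebsKey, 3))
	fmt.Println("remaining taints:", node.get())
}
```

In this sketch the only difference between the failing and the succeeding path is where the taint list is read: once outside the loop, or again on every attempt.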
Sample from CloudWatch logs:
Initial taints on the node:
responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
responseObject.spec.taints.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
responseObject.spec.taints.2 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
node-controller removes its taint, leaving just the two CSI driver taints:
@timestamp 1753186659667
requestObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.spec.taints.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestReceivedTimestamp 2025-07-22T12:17:39.545267Z
responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
responseObject.spec.taints.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
verb patch
ebs-csi fails to update:
@timestamp 1753186660233
requestObject.0.op test
requestObject.0.path /spec/taints
requestObject.0.value.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.0.value.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.0.value.2 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestObject.1.op replace
requestObject.1.path /spec/taints
requestObject.1.value.0 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.1.value.1 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestReceivedTimestamp 2025-07-22T12:17:40.096491Z
responseObject.code 422
responseObject.kind Status
responseObject.message the server rejected our request due to an error in our request
responseObject.reason Invalid
responseObject.status Failure
responseStatus.code 422
responseStatus.message the server rejected our request due to an error in our request
responseStatus.reason Invalid
responseStatus.status Failure
verb patch
ebs-csi keeps retrying with the same input and gets the same result:
@timestamp 1753186660483
requestReceivedTimestamp 2025-07-22T12:17:40.194728Z
@timestamp 1753186663240
requestReceivedTimestamp 2025-07-22T12:17:43.202173Z
@timestamp 1753186663240
requestReceivedTimestamp 2025-07-22T12:17:43.104818Z
@timestamp 1753186667752
requestReceivedTimestamp 2025-07-22T12:17:47.711254Z
@timestamp 1753186667752
requestReceivedTimestamp 2025-07-22T12:17:47.614086Z
@timestamp 1753186674520
requestReceivedTimestamp 2025-07-22T12:17:54.470557Z
@timestamp 1753186674520
requestReceivedTimestamp 2025-07-22T12:17:54.379558Z
@timestamp 1753186674520
requestReceivedTimestamp 2025-07-22T12:17:54.372631Z
@timestamp 1753186676525
requestReceivedTimestamp 2025-07-22T12:17:56.387603Z
efs-csi removes its taint:
@timestamp 1753186678278
requestObject.0.op test
requestObject.0.path /spec/taints
requestObject.0.value.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.0.value.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.1.op replace
requestObject.1.path /spec/taints
requestObject.1.value.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestReceivedTimestamp 2025-07-22T12:17:58.113754Z
responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
verb patch
ebs-csi continues trying to replace old taints:
@timestamp 1753186679531
requestObject.0.op test
requestObject.0.path /spec/taints
requestObject.0.value.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.0.value.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.0.value.2 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestObject.1.op replace
requestObject.1.path /spec/taints
requestObject.1.value.0 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.1.value.1 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestReceivedTimestamp 2025-07-22T12:17:59.395297Z
responseObject.code 422
responseObject.kind Status
responseObject.message the server rejected our request due to an error in our request
responseObject.reason Invalid
responseObject.status Failure
responseStatus.code 422
responseStatus.message the server rejected our request due to an error in our request
responseStatus.reason Invalid
responseStatus.status Failure
The condition does not resolve itself. We've seen nodes stuck in this state that are more than 4 days old.
Environment
- Kubernetes version (use kubectl version): v1.33.1-eks-595af52
- Driver version: v1.45.0-eksbuild.2