1.45.0 sometimes fails to remove taint #2575

@dpiddock

/kind bug

What happened?
We upgraded some clusters from EKS 1.32 with ebs-csi v1.42.0-eksbuild.1 to EKS 1.33 with ebs-csi v1.45.0-eksbuild.2. Since then, a small number of nodes have gotten stuck with the ebs-csi taint never being cleared. This leaves the node in a state where workload pods are not scheduled and Karpenter does not remove the node, because the node has not finished initializing.

What you expected to happen?
Taint to be removed.

How to reproduce it (as minimally and precisely as possible)?
Unfortunately this is a rare race condition that we can't reproduce reliably.

Anything else we need to know?:
The audit logs suggest that the ebs-csi pod does not re-read the node's current taints after an update request fails. It keeps retrying with the same outdated state.

Sample from CloudWatch logs:
Initial taints on the node:

responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
responseObject.spec.taints.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
responseObject.spec.taints.2 {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}

node-controller removes the not-ready taint, leaving just the two CSI driver taints:

@timestamp                   1753186659667
requestObject.spec.taints.0  {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.spec.taints.1  {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestReceivedTimestamp     2025-07-22T12:17:39.545267Z
responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
responseObject.spec.taints.1 {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
verb                         patch

ebs-csi fails to update:

@timestamp               1753186660233
requestObject.0.op       test
requestObject.0.path     /spec/taints
requestObject.value.0    {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.value.1    {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.value.2    {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestObject.1.op       replace
requestObject.1.path     /spec/taints
requestObject.value.0    {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.value.1    {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestReceivedTimestamp 2025-07-22T12:17:40.096491Z
responseObject.code      422
responseObject.kind      Status
responseObject.message   the server rejected our request due to an error in our request
responseObject.reason    Invalid
responseObject.status    Failure
responseStatus.code      422
responseStatus.message   the server rejected our request due to an error in our request
responseStatus.reason    Invalid
responseStatus.status    Failure
verb                     patch
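
The 422 appears consistent with JSON Patch semantics (RFC 6902): a `test` operation succeeds only if the value at the path equals the supplied value, and a failed `test` aborts the whole patch. The following is a minimal Python model of that behaviour, not the API server's or the driver's actual code. `apply_taint_patch` is a hypothetical helper, and the taint values are copied from the log entries above.

```python
import copy

EBS = {"key": "ebs.csi.aws.com/agent-not-ready", "effect": "NoExecute"}
EFS = {"key": "efs.csi.aws.com/agent-not-ready", "effect": "NoExecute"}
NOT_READY = {
    "key": "node.kubernetes.io/not-ready",
    "effect": "NoExecute",
    "timeAdded": "2025-07-22T12:17:19Z",
}


def apply_taint_patch(current_taints, patch_ops):
    """Hypothetical model of a JSON Patch limited to /spec/taints.

    Returns (status_code, taints_after). A failed `test` aborts the patch
    and leaves the taints unchanged, which is modelled here as a 422.
    """
    taints = copy.deepcopy(current_taints)
    for op in patch_ops:
        if op["path"] != "/spec/taints":
            raise ValueError("this sketch only models /spec/taints")
        if op["op"] == "test":
            # Arrays compare element by element, in order.
            if taints != op["value"]:
                return 422, copy.deepcopy(current_taints)
        elif op["op"] == "replace":
            taints = copy.deepcopy(op["value"])
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return 200, taints


# State after node-controller removed node.kubernetes.io/not-ready.
current = [EBS, EFS]

# The patch ebs-csi sent, built from the older three-taint view.
stale_patch = [
    {"op": "test", "path": "/spec/taints", "value": [EBS, EFS, NOT_READY]},
    {"op": "replace", "path": "/spec/taints", "value": [EFS, NOT_READY]},
]

status, after = apply_taint_patch(current, stale_patch)
print(status, [t["key"] for t in after])
```

Because the `test` value is built from a view of the node that no longer exists, resending the same patch fails the same way every time.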

ebs-csi keeps retrying with the same input and gets the same result:

@timestamp 1753186660483
requestReceivedTimestamp 2025-07-22T12:17:40.194728Z

@timestamp 1753186663240
requestReceivedTimestamp 2025-07-22T12:17:43.202173Z

@timestamp 1753186663240
requestReceivedTimestamp 2025-07-22T12:17:43.104818Z

@timestamp 1753186667752
requestReceivedTimestamp 2025-07-22T12:17:47.711254Z

@timestamp 1753186667752
requestReceivedTimestamp 2025-07-22T12:17:47.614086Z

@timestamp 1753186674520
requestReceivedTimestamp 2025-07-22T12:17:54.470557Z

@timestamp 1753186674520
requestReceivedTimestamp 2025-07-22T12:17:54.379558Z

@timestamp 1753186674520
requestReceivedTimestamp 2025-07-22T12:17:54.372631Z

@timestamp 1753186676525
requestReceivedTimestamp 2025-07-22T12:17:56.387603Z

efs-csi removes its taint:

@timestamp                   1753186678278
requestObject.0.op           test
requestObject.0.path         /spec/taints
requestObject.0.value.0      {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.0.value.1      {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.1.op           replace
requestObject.1.path         /spec/taints
requestObject.1.value.0      {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestReceivedTimestamp     2025-07-22T12:17:58.113754Z
responseObject.spec.taints.0 {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
verb                         patch

ebs-csi continues trying to replace old taints:

@timestamp               1753186679531
requestObject.0.op       test
requestObject.0.path     /spec/taints
requestObject.value.0    {"key":"ebs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.value.1    {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.value.2    {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestObject.1.op       replace
requestObject.1.path     /spec/taints
requestObject.value.0    {"key":"efs.csi.aws.com/agent-not-ready","effect":"NoExecute"}
requestObject.value.1    {"key":"node.kubernetes.io/not-ready","effect":"NoExecute","timeAdded":"2025-07-22T12:17:19Z"}
requestReceivedTimestamp 2025-07-22T12:17:59.395297Z
responseObject.code      422
responseObject.kind      Status
responseObject.message   the server rejected our request due to an error in our request
responseObject.reason    Invalid
responseObject.status    Failure
responseStatus.code      422
responseStatus.message   the server rejected our request due to an error in our request
responseStatus.reason    Invalid
responseStatus.status    Failure

The condition does not resolve itself. We've seen affected nodes that are more than 4 days old.
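
For comparison, a retry loop that re-reads the node's taints before each attempt should not get stuck this way. The sketch below is an illustration under assumptions, not the driver's code: `FakeNodeAPI`, `RacyNodeAPI`, and `remove_taint` are hypothetical names, and the in-memory "API" only models the `test` + `replace` behaviour seen in the audit logs.

```python
EBS_KEY = "ebs.csi.aws.com/agent-not-ready"
NOT_READY_KEY = "node.kubernetes.io/not-ready"

EBS = {"key": EBS_KEY, "effect": "NoExecute"}
EFS = {"key": "efs.csi.aws.com/agent-not-ready", "effect": "NoExecute"}
NOT_READY = {"key": NOT_READY_KEY, "effect": "NoExecute",
             "timeAdded": "2025-07-22T12:17:19Z"}


class FakeNodeAPI:
    """In-memory stand-in for the API server (hypothetical)."""

    def __init__(self, taints):
        self.taints = list(taints)
        self.patch_calls = 0

    def get_taints(self):
        return list(self.taints)

    def patch_taints(self, expected, replacement):
        """Model `test` + `replace`: apply only if `expected` still matches."""
        self.patch_calls += 1
        if self.taints != expected:
            return 422
        self.taints = list(replacement)
        return 200


class RacyNodeAPI(FakeNodeAPI):
    """Simulates another writer removing a taint right after the first read."""

    def __init__(self, taints, removed_by_other_writer):
        super().__init__(taints)
        self._pending_removal = removed_by_other_writer

    def get_taints(self):
        snapshot = super().get_taints()
        if self._pending_removal is not None:
            self.taints = [t for t in self.taints
                           if t["key"] != self._pending_removal]
            self._pending_removal = None
        return snapshot


def remove_taint(api, key, max_attempts=5):
    """Remove `key`, re-reading the node's taints before every attempt."""
    for _ in range(max_attempts):
        observed = api.get_taints()  # fresh read, never a cached copy
        remaining = [t for t in observed if t["key"] != key]
        if len(remaining) == len(observed):
            return True  # taint is already gone
        if api.patch_taints(observed, remaining) == 200:
            return True
        # 422: another writer changed the taints; loop and read again.
    return False


api = RacyNodeAPI([EBS, EFS, NOT_READY], NOT_READY_KEY)
ok = remove_taint(api, EBS_KEY)
print(ok, api.patch_calls, [t["key"] for t in api.taints])
```

In this simulation the first patch is rejected because the not-ready taint disappears between the read and the write, and the second attempt succeeds because it is built from a fresh read.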

Environment

  • Kubernetes version (use kubectl version): v1.33.1-eks-595af52
  • Driver version: v1.45.0-eksbuild.2
