
Karpenter doesn't update EC2NodeClass status after Failed to detect the cluster CIDR error #7875

@obervinov

Description

Observed Behavior:
We renamed the IAM role of the karpenter controller and noticed extremely strange behavior: karpenter stops updating the EC2NodeClass status if it receives the Failed to detect the cluster CIDR error.
After renaming the IAM role, karpenter kept using the already deleted role for a few more seconds and, accordingly, could not extract the cluster CIDR via DescribeCluster:

getting amis, getting AMI queries, failed to discover AMIs for alias "al2023@latest"; getting subnets, describing subnets [{"Name":"tag:Name","Values":["private-us-east-1a"]}], operation error EC2: DescribeSubnets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; getting security groups, describing security groups [[{Name:0x4003199980 Values:[karpenter-node] noSmithyDocumentSerde:{}}]], operation error EC2: DescribeSecurityGroups, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; creating instance profile, getting instance profile "eks1_16253362336087885606", operation error IAM: GetInstanceProfile, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; validating ec2:CreateFleet authorization, operation error EC2: CreateFleet, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; failed to detect the cluster CIDR, operation error EKS: DescribeCluster, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity

After that, the EC2NodeClass was stuck in the NotReady status, looking something like this:

apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1
  kind: EC2NodeClass
  metadata:
    name: generic
  spec:
    amiFamily: AL2023
    amiSelectorTerms:
    - alias: al2023@latest
    blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 25Gi
        volumeType: gp3
    metadataOptions:
      httpEndpoint: enabled
      httpProtocolIPv6: disabled
      httpPutResponseHopLimit: 1
      httpTokens: required
    role: Karpenter
    securityGroupSelectorTerms:
    - tags:
        Name: karpenter-node
    subnetSelectorTerms:
    - tags:
        Name: private-us-east-1a
  status:
    amis:
    - id: ami-006321160784caf3d
      name: amazon-eks-node-al2023-arm64-standard-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: karpenter.k8s.aws/instance-gpu-count
        operator: DoesNotExist
      - key: karpenter.k8s.aws/instance-accelerator-count
        operator: DoesNotExist
    - id: ami-0a89f636458f0aa4e
      name: amazon-eks-node-al2023-x86_64-nvidia-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-gpu-count
        operator: Exists
    - id: ami-0b94752294befca8a
      name: amazon-eks-node-al2023-x86_64-standard-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-gpu-count
        operator: DoesNotExist
      - key: karpenter.k8s.aws/instance-accelerator-count
        operator: DoesNotExist
    - id: ami-0ebe659725d068d62
      name: amazon-eks-node-al2023-x86_64-neuron-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-accelerator-count
        operator: Exists
    conditions:
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: AMIsReady
      status: "True"
      type: AMIsReady
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: SubnetsReady
      status: "True"
      type: SubnetsReady
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: SecurityGroupsReady
      status: "True"
      type: SecurityGroupsReady
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: InstanceProfileReady
      status: "True"
      type: InstanceProfileReady
    - lastTransitionTime: "2025-02-07T12:50:00Z"
      message: ""
      observedGeneration: 4
      reason: ValidationSucceeded
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2025-03-10T16:02:04Z"
      message: Failed to detect the cluster CIDR
      observedGeneration: 4
      reason: NodeClassNotReady
      status: "False"
      type: Ready
    instanceProfile: karpenter_16253362336087885606
    securityGroups:
    - id: sg-***
      name: karpenter-node-20240411124135217800000003
    subnets:
    - id: subnet-***
      zone: us-east-1a
      zoneID: use1-az1

After the karpenter controller pods were restarted with the new, correct IAM role, the logs show that the controller correctly reads the cluster CIDR for the EC2NodeClass:

{"level":"DEBUG","time":"2025-03-11T13:57:31.984Z","logger":"controller","caller":"nodeclass/securitygroup.go:42","message":"discovered security groups","commit":"1c39126","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic"},"namespace":"","name":"generic","reconcileID":"57602e9a-1a3a-43f9-862e-84ca738d4192","security-groups":["sg-***"]}
{"level":"DEBUG","time":"2025-03-11T13:57:32.845Z","logger":"controller","caller":"nodeclass/readiness.go:45","message":"discovered cluster CIDR","commit":"1c39126","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic"},"namespace":"","name":"generic","reconcileID":"57602e9a-1a3a-43f9-862e-84ca738d4192","cluster-cidr":"172.20.0.0/16"}

But the EC2NodeClass kept the NotReady status and the same error message for several days. The only ways to fix the situation were to recreate the EC2NodeClass or to change its specification in any way (for example, by adding spec.tags; see the snippet below). After that, the EC2NodeClass became Ready again.
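
For reference, this is the kind of no-op spec change that forced a fresh reconciliation in our case (the tag key and value below are hypothetical; any spec edit had the same effect):

spec:
  tags:
    # hypothetical key/value, used only to change the spec and trigger reconciliation
    reconcile-nudge: "2025-03-11"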

Expected Behavior:
If an error occurs while resolving EC2NodeClass data (AMIs, subnets, security groups, cluster CIDR, and so on), karpenter should retry automatically instead of leaving the status stuck.
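
Concretely, once the controller re-reads the cluster CIDR successfully (as in the debug logs above), the Ready condition would be expected to flip back on its own, without any manual change to the resource. A sketch of the expected condition (the timestamp and reason string are illustrative):

conditions:
- lastTransitionTime: "2025-03-11T13:57:32Z"
  message: ""
  observedGeneration: 4
  reason: Ready      # illustrative; the exact reason string may differ
  status: "True"
  type: Ready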

Reproduction Steps (Please include YAML):
Helm chart values (the eks.amazonaws.com/role-arn annotation, ${module.karpenter.iam_role_arn}, is the value that was replaced during the role rename):

logLevel: debug
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: ${module.karpenter.iam_role_arn}
serviceMonitor:
  enabled: true
replicas: 2
strategy:
  rollingUpdate:
    maxUnavailable: 1
controller:
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 1Gi
settings:
  clusterName: ${local.cluster_name}
  interruptionQueue: ${module.karpenter.queue_name}

EC2NodeClass manifest:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        Name: private-us-east-1a
  securityGroupSelectorTerms:
    - tags:
        Name: eks1-node
  role: Karpenter-eks1
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 25Gi
        volumeType: gp3
        encrypted: true
        deleteOnTermination: true

Versions:

  • Chart Version: 1.3.1
  • Kubernetes Version (kubectl version): 1.31
