Description
Observed Behavior:
We renamed the IAM role of the Karpenter controller and noticed extremely strange behavior: Karpenter stops updating the EC2NodeClass status once it receives the "Failed to detect the cluster CIDR" error.
After renaming the IAM role, Karpenter kept working with the already-deleted role for a few more seconds and, accordingly, could not correctly resolve the CIDR via DescribeCluster:
getting amis, getting AMI queries, failed to discover AMIs for alias "al2023@latest"; getting subnets, describing subnets [{"Name":"tag:Name","Values":["private-us-east-1a"]}], operation error EC2: DescribeSubnets, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; getting security groups, describing security groups [[{Name:0x4003199980 Values:[karpenter-node] noSmithyDocumentSerde:{}}]], operation error EC2: DescribeSecurityGroups, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; creating instance profile, getting instance profile "eks1_16253362336087885606", operation error IAM: GetInstanceProfile, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; validating ec2:CreateFleet authorization, operation error EC2: CreateFleet, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity; failed to detect the cluster CIDR, operation error EKS: DescribeCluster, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ***, api 
error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity

After that, the EC2NodeClass was stuck in the NotReady status, looking something like this:
apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1
  kind: EC2NodeClass
  metadata:
    name: generic
  spec:
    amiFamily: AL2023
    amiSelectorTerms:
    - alias: al2023@latest
    blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 25Gi
        volumeType: gp3
    metadataOptions:
      httpEndpoint: enabled
      httpProtocolIPv6: disabled
      httpPutResponseHopLimit: 1
      httpTokens: required
    role: Karpenter
    securityGroupSelectorTerms:
    - tags:
        Name: karpenter-node
    subnetSelectorTerms:
    - tags:
        Name: private-us-east-1a
  status:
    amis:
    - id: ami-006321160784caf3d
      name: amazon-eks-node-al2023-arm64-standard-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
      - key: karpenter.k8s.aws/instance-gpu-count
        operator: DoesNotExist
      - key: karpenter.k8s.aws/instance-accelerator-count
        operator: DoesNotExist
    - id: ami-0a89f636458f0aa4e
      name: amazon-eks-node-al2023-x86_64-nvidia-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-gpu-count
        operator: Exists
    - id: ami-0b94752294befca8a
      name: amazon-eks-node-al2023-x86_64-standard-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-gpu-count
        operator: DoesNotExist
      - key: karpenter.k8s.aws/instance-accelerator-count
        operator: DoesNotExist
    - id: ami-0ebe659725d068d62
      name: amazon-eks-node-al2023-x86_64-neuron-1.31-v20250228
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.k8s.aws/instance-accelerator-count
        operator: Exists
    conditions:
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: AMIsReady
      status: "True"
      type: AMIsReady
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: SubnetsReady
      status: "True"
      type: SubnetsReady
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: SecurityGroupsReady
      status: "True"
      type: SecurityGroupsReady
    - lastTransitionTime: "2025-02-07T11:10:35Z"
      message: ""
      observedGeneration: 4
      reason: InstanceProfileReady
      status: "True"
      type: InstanceProfileReady
    - lastTransitionTime: "2025-02-07T12:50:00Z"
      message: ""
      observedGeneration: 4
      reason: ValidationSucceeded
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2025-03-10T16:02:04Z"
      message: Failed to detect the cluster CIDR
      observedGeneration: 4
      reason: NodeClassNotReady
      status: "False"
      type: Ready
    instanceProfile: karpenter_16253362336087885606
    securityGroups:
    - id: sg-***
      name: karpenter-node-20240411124135217800000003
    subnets:
    - id: subnet-***
      zone: us-east-1a
      zoneID: use1-az1

After the Karpenter controller pods were restarted with the new, correct IAM role, I see in the logs that the controller correctly reads the CIDR for the EC2NodeClass:
{"level":"DEBUG","time":"2025-03-11T13:57:31.984Z","logger":"controller","caller":"nodeclass/securitygroup.go:42","message":"discovered security groups","commit":"1c39126","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic"},"namespace":"","name":"generic","reconcileID":"57602e9a-1a3a-43f9-862e-84ca738d4192","security-groups":["sg-***"]}
{"level":"DEBUG","time":"2025-03-11T13:57:32.845Z","logger":"controller","caller":"nodeclass/readiness.go:45","message":"discovered cluster CIDR","commit":"1c39126","controller":"nodeclass","controllerGroup":"karpenter.k8s.aws","controllerKind":"EC2NodeClass","EC2NodeClass":{"name":"generic"},"namespace":"","name":"generic","reconcileID":"57602e9a-1a3a-43f9-862e-84ca738d4192","cluster-cidr":"172.20.0.0/16"}

But the EC2NodeClass kept the NotReady status and the same error message for several days. The only way to fix the situation was to recreate the EC2NodeClass or make any change to its specification (for example, add spec.tags). After that, the EC2NodeClass was ready to work.
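The spec-change workaround can be applied with a strategic merge patch; a minimal sketch, where the force-reconcile tag key is an arbitrary placeholder (any spec change works, per the report) and the kubectl invocation is shown as a comment:

```python
import json

# Any change to spec forces the nodeclass controller to reconcile again;
# the report used spec.tags. The tag key below is a hypothetical placeholder.
patch = {"spec": {"tags": {"force-reconcile": "1"}}}
patch_json = json.dumps(patch)
print(patch_json)
# Apply it to the stuck resource (named "generic" in this report) with:
#   kubectl patch ec2nodeclass generic --type merge -p '{"spec":{"tags":{"force-reconcile":"1"}}}'
```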
Expected Behavior:
If an error occurs while fetching EC2NodeClass data, Karpenter should retry automatically.
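The expected retry behavior can be illustrated with a minimal, language-agnostic sketch (Karpenter itself is written in Go; the function names and the requeue result below are hypothetical, not Karpenter's actual API): a transient cloud error should requeue the reconcile rather than leave the Ready condition stuck at False.

```python
class TransientCloudError(Exception):
    """e.g. STS AccessDenied while the IAM role is being swapped."""

def reconcile(detect_cidr):
    """Sketch of the expected controller behavior: a transient failure
    requeues the EC2NodeClass instead of leaving Ready=False forever."""
    try:
        cidr = detect_cidr()
    except TransientCloudError:
        return {"requeue": True}  # retry later instead of sticking
    return {"requeue": False, "cidr": cidr}

# Simulate credentials that recover once the pod picks up the new role.
attempts = 0
def detect():
    global attempts
    attempts += 1
    if attempts < 3:
        raise TransientCloudError("failed to detect the cluster CIDR")
    return "172.20.0.0/16"

result = reconcile(detect)
while result["requeue"]:
    result = reconcile(detect)
print(result["cidr"], "after", attempts, "attempts")
# → 172.20.0.0/16 after 3 attempts
```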
Reproduction Steps (Please include YAML):
Helm chart values (the eks.amazonaws.com/role-arn annotation, ${module.karpenter.iam_role_arn}, is the value that was replaced):
logLevel: debug
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: ${module.karpenter.iam_role_arn}
serviceMonitor:
  enabled: true
replicas: 2
strategy:
  rollingUpdate:
    maxUnavailable: 1
controller:
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 1Gi
settings:
  clusterName: ${local.cluster_name}
  interruptionQueue: ${module.karpenter.queue_name}

EC2NodeClass manifest:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic
spec:
  amiFamily: AL2023
  amiSelectorTerms:
  - alias: al2023@latest
  subnetSelectorTerms:
  - tags:
      Name: private-us-east-1a
  securityGroupSelectorTerms:
  - tags:
      Name: eks1-node
  role: Karpenter-eks1
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 25Gi
      volumeType: gp3
      encrypted: true
      deleteOnTermination: true
Versions:
- Chart Version: 1.3.1
- Kubernetes Version (kubectl version): 1.31
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment