karpenter pods CrashLoopBackOff - Readiness probe failed read: connection reset by peer / Liveness probe failed connect: connection refused #7256

Open
@iamsaurabhgupt

Description


Observed Behavior:

kubectl get pods -n kube-system

NAME                           READY   STATUS    RESTARTS   AGE
aws-node-dc9hb                 2/2     Running   0          109m
aws-node-pzbww                 2/2     Running   0          109m
coredns-789f8477df-8r5zd       1/1     Running   0          114m
coredns-789f8477df-tc5pt       1/1     Running   0          114m
eks-pod-identity-agent-gqwrz   1/1     Running   0          109m
eks-pod-identity-agent-sbng9   1/1     Running   0          109m
karpenter-df9d8f6dd-xbz9d      0/1     Running   0          118s
karpenter-df9d8f6dd-znnjw      0/1     Pending   0          118s
kube-proxy-l8bcp               1/1     Running   0          109m
kube-proxy-mnw6n               1/1     Running   0          109m

kubectl describe pod karpenter-df9d8f6dd-xbz9d -n kube-system
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  aws-iam-token:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   86400
  kube-api-access-n9sbj:
    Type:                     Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:   3607
    ConfigMapName:            kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:              true
QoS Class:                    Guaranteed
Node-Selectors:               kubernetes.io/os=linux
Tolerations:                  CriticalAddonsOnly op=Exists
                              node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  Normal   Scheduled  3m15s                default-scheduler  Successfully assigned kube-system/karpenter-df9d8f6dd-xbz9d to ip-10-110-164-199.ec2.internal
  Normal   Pulled     75s (x2 over 3m15s)  kubelet            Container image "public.ecr.aws/karpenter/controller:1.0.5@sha256:f2df98735b232b143d37f0c6819a6cae2be4740e3c8b38297bceb365cf3f668b" already present on machine
  Normal   Created    75s (x2 over 3m15s)  kubelet            Created container controller
  Normal   Killing    75s                  kubelet            Container controller failed liveness probe, will be restarted
  Warning  Unhealthy  75s                  kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": read tcp 10.xxx.1x4.1x9:33238->10.xxx.1x5.1x3:8081: read: connection reset by peer
  Warning  Unhealthy  75s (x2 over 75s)    kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": dial tcp 10.xxx.1x5.153:8081: connect: connection refused
  Normal   Started    74s (x2 over 3m14s)  kubelet            Started container controller
  Warning  Unhealthy  5s (x5 over 2m35s)   kubelet            Readiness probe failed: Get "http://10.xxx.1x5.153:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  5s (x4 over 2m15s)   kubelet            Liveness probe failed: Get "http://10.xxx.1x5.153:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
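When the probes flap between connection refused, connection reset, and timeout like this, it helps to hit the health endpoint from another pod to separate a pod-networking problem from a controller that never starts listening. A minimal sketch (not from the report); the pod IP below is a placeholder for the masked address in the events:

```shell
# Curl Karpenter's readiness endpoint from a throwaway pod in the same
# cluster. POD_IP is a placeholder; take the real value from:
#   kubectl get pod -n kube-system -o wide
POD_IP=10.110.165.153   # placeholder
kubectl run probe-check -n kube-system --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -sv --max-time 5 "http://${POD_IP}:8081/readyz"
```

With a healthy CNI, connection refused suggests the controller is not listening yet, while a timeout points toward security groups or routing.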

Expected Behavior:
The Karpenter pods should become Ready (1/1 Running).

Reproduction Steps (Please include YAML):
EKS cluster (version 1.31) created with eksctl. I followed both https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/
and https://karpenter.sh/docs/getting-started/migrating-from-cas/, but neither worked.

eksctl create cluster -f - <<EOF
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: xxxxx
  region: us-east-1
  version: "1.31"
  tags:
    karpenter.sh/discovery: xxxxx

privateCluster:
  enabled: true
  skipEndpointCreation: true

iam:
  withOIDC: true
  podIdentityAssociations:
    - namespace: "kube-system"
      serviceAccountName: karpenter
      roleName: xxxx-karpenter
      permissionPolicyARNs:
        - arn:aws:iam::xxxxxxxx:policy/KarpenterControllerPolicy-xxxx

iamIdentityMappings:
  - arn: "arn:aws:iam::xxxxxxx:role/KarpenterNodeRole-xxxx"
    username: system:node:{{EC2PrivateDNSName}}
    groups:
      - system:bootstrappers
      - system:nodes

managedNodeGroups:
  - instanceType: m5d.large
    amiFamily: AmazonLinux2
    name: xxxxx-ng
    desiredCapacity: 2
    minSize: 1
    maxSize: 10
    privateNetworking: true

addons:
  - name: eks-pod-identity-agent
  - name: coredns
  - name: vpc-cni
  - name: kube-proxy
EOF

KARPENTER_VERSION=1.0.5   # tried 1.0.6 as well; same result

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set "settings.isolatedVPC=true" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
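One way to tell a slow-starting controller apart from a hard networking failure is to temporarily loosen the kubelet probe timeouts on the rendered deployment. This is a diagnostic sketch only, not from the report; it assumes the chart's default deployment name `karpenter` and that the controller is container index 0:

```shell
# Diagnostic only: give the probes more headroom so a controller that is
# slow to reach AWS APIs in an isolated VPC is not killed mid-startup.
# Assumes the default deployment name "karpenter" in kube-system.
kubectl patch deployment karpenter -n kube-system --type=json -p='[
  {"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/timeoutSeconds","value":30},
  {"op":"replace","path":"/spec/template/spec/containers/0/readinessProbe/timeoutSeconds","value":30}
]'
```

If the pods then become Ready, the root cause is startup latency (often unreachable AWS endpoints in the private VPC) rather than the probes themselves.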

Setting dnsPolicy=Default did not help either.
kubectl logs karpenter-df9d8f6dd-xbz9d -n kube-system
{"level":"DEBUG","time":"2024-10-21T00:04:42.255Z","logger":"controller","caller":"operator/operator.go:149","message":"discovered karpenter version","commit":"652e6aa","version":"1.0.5"}

kubectl get events -A --field-selector source=karpenter --sort-by='.lastTimestamp' -n 100
No resources found

Setting DISABLE_WEBHOOK=true did not help either.
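Since this is a fully private cluster (`privateCluster.enabled: true` with `skipEndpointCreation: true`), a common cause of this symptom is the controller blocking at startup on unreachable AWS APIs. A hedged check, with a placeholder VPC ID, that lists which interface endpoints actually exist:

```shell
# List the VPC endpoints available to the isolated cluster. Karpenter
# generally needs reachable endpoints for EC2, ECR (api and dkr), STS, and
# SQS (for the interruption queue), among others.
# vpc-0123456789abcdef0 is a placeholder for the real VPC ID.
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'VpcEndpoints[].ServiceName' \
  --output table
```

If required service endpoints are missing, the controller's AWS SDK calls hang until the probe deadlines fire, matching the `context deadline exceeded` events above.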

Versions:

  • Chart Version: 1.0.5 and 1.0.6 (both fail)
  • Kubernetes Version (kubectl version): 1.31

Community note from the issue template:

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Labels: bug (Something isn't working), lifecycle/stale, triage/solved (marked solved by a Karpenter maintainer; gives the issue author time to confirm)