
Karpenter unable to create new nodes for scheduling pods #7839

Open
@narayan-autodesk

Description


Observed Behavior:
We are running Trino on EKS, installed via a Helm chart, and use Karpenter to autoscale the cluster. Because Trino is a bit different from other applications, we run one pod per node alongside a few daemonsets. We use r5a.8xlarge instances, which have 32 vCPUs and 256 GiB of memory. We see the following errors when Karpenter tries to schedule pods for our deployment.

  Warning  FailedScheduling  106s (x2 over 6m46s)  karpenter          Failed to schedule pod, incompatible with nodepool "default", daemonset overhead={"cpu":"500m","memory":"640Mi","pods":"5"}, no instance type satisfied resources {"cpu":"29500m","memory":"246400Mi","pods":"6"} and requirements karpenter.k8s.aws/ec2nodeclass In [default], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [default], node.kubernetes.io/instance-type In [r5a.8xlarge] (no instance type has enough resources)
  Warning  FailedScheduling  81s (x2 over 6m47s)   default-scheduler  0/10 nodes are available: 10 Insufficient cpu, 10 Insufficient memory. preemption: 0/10 nodes are available: 10 No preemption victims found for incoming pod.
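For reference, the totals in the first event add up from the reported daemonset overhead plus the single Trino pod. A quick sanity check of that arithmetic (the numbers are taken from the event message; the breakdown is our reading of it, not something Karpenter prints separately):

```python
# Reconstruct the resource totals from the FailedScheduling event above.
# Daemonset overhead: cpu 500m, memory 640Mi, 5 pods (from the event message).
daemonset = {"cpu_m": 500, "mem_mi": 640, "pods": 5}

# Trino worker pod: requests cpu "29" (29000m) and memory "240Gi" (240 * 1024 Mi).
pod = {"cpu_m": 29 * 1000, "mem_mi": 240 * 1024, "pods": 1}

total_cpu_m = daemonset["cpu_m"] + pod["cpu_m"]    # matches "cpu":"29500m"
total_mem_mi = daemonset["mem_mi"] + pod["mem_mi"]  # matches "memory":"246400Mi"
total_pods = daemonset["pods"] + pod["pods"]        # matches "pods":"6"

print(total_cpu_m, total_mem_mi, total_pods)
```

So the 246400Mi Karpenter is trying to place is the 240Gi pod request plus the 640Mi daemonset overhead.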

We never faced this issue with Cluster Autoscaler, and the resource requirements of our deployments and daemonsets have not changed.

The pods are currently scheduled and running on nodes that Karpenter itself provisioned; the issue only appears when we try to update the deployment via Helm.

This is what resource usage looks like on one of the nodes:

[Screenshot: node resource usage]

Not all of the node's resources are in use; there is headroom.

We do see the following event on the node:

Normal DisruptionBlocked Karpenter Not all pods would schedule, ....-6f89d94b57-phrx6 => incompatible with nodepool "default", daemonset overhead={"cpu":"500m","memory":"640Mi","pods":"5"}, no instance type satisfied resources {"cpu":"29500m","memory":"246400Mi","pods":"6"} and requirements karpenter.k8s.aws/ec2nodeclass In [default], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [default], node.kubernetes.io/instance-type In [r5a.8xlarge] (no instance type has enough resources) 

Expected Behavior:
Karpenter should be able to create new nodes and schedule the pod on them.

Reproduction Steps (Please include YAML):

  • Using r5a.8xlarge instances
  • Creating a deployment with the following requests:
  resources:
    limits:
      cpu: "29"
      memory: "240Gi"
    requests:
      cpu: "29"
      memory: "240Gi"
  • Creating another deployment that requests 500m CPU and 640Mi of memory.
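A rough capacity comparison may explain the gap between "there is room on the node" and "no instance type has enough resources". The discount factor below is Karpenter's VM_MEMORY_OVERHEAD_PERCENT, which defaults to 0.075; applying it here to r5a.8xlarge is our assumption for illustration, not something confirmed in this issue:

```python
# Illustrative check: does 246400Mi fit on an r5a.8xlarge?
# ASSUMPTION: Karpenter discounts advertised instance memory by
# VM_MEMORY_OVERHEAD_PERCENT (default 0.075) before comparing requests.
RAW_MEM_MI = 256 * 1024      # 262144 MiB advertised for r5a.8xlarge
OVERHEAD = 0.075             # Karpenter's default VM memory overhead factor
REQUESTED_MEM_MI = 246400    # pod (240Gi) + daemonset overhead (640Mi)

discounted_mem_mi = RAW_MEM_MI * (1 - OVERHEAD)

print(REQUESTED_MEM_MI <= RAW_MEM_MI)         # fits the raw capacity
print(REQUESTED_MEM_MI <= discounted_mem_mi)  # but not the discounted figure
```

Under this assumption the request fits the instance's raw 262144Mi but exceeds the ~242483Mi Karpenter would budget with, which would produce exactly this "no instance type has enough resources" message even though the node itself has headroom.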

Versions:

  • Chart Version: 1.2.1
  • Kubernetes Version (kubectl version):
Client Version: v1.31.0
Kustomize Version: v5.4.2
Server Version: v1.30.9-eks-8cce635

NodePool YAML:

apiVersion: v1
items:
- apiVersion: karpenter.sh/v1
  kind: NodePool
  metadata:
    annotations:
      karpenter.sh/nodepool-hash: "6821555240594823858"
      karpenter.sh/nodepool-hash-version: v3
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1","kind":"NodePool","metadata":{"annotations":{},"name":"default"},"spec":{"disruption":{"consolidateAfter":"60s","consolidationPolicy":"WhenEmpty"},"limits":{"cpu":"40000","memory":"1000Ti"},"template":{"spec":{"nodeClassRef":{"group":"karpenter.k8s.aws","kind":"EC2NodeClass","name":"default"},"requirements":[{"key":"karpenter.sh/capacity-type","operator":"In","values":["on-demand"]},{"key":"node.kubernetes.io/instance-type","operator":"In","values":["r5a.8xlarge"]}]}}}}
    creationTimestamp: "2025-02-13T21:56:08Z"
    generation: 2
    name: default
    resourceVersion: "42022597"
    uid: 936bb3dc-6923-4e8a-a325-5bddd0351958
  spec:
    disruption:
      budgets:
      - nodes: 10%
      consolidateAfter: 60s
      consolidationPolicy: WhenEmpty
    limits:
      cpu: "40000"
      memory: 1000Ti
    template:
      spec:
        expireAfter: 720h
        nodeClassRef:
          group: karpenter.k8s.aws
          kind: EC2NodeClass
          name: default
        requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - on-demand
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - r5a.8xlarge
  status:
    conditions:
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 2
      reason: NodeClassReady
      status: "True"
      type: NodeClassReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 2
      reason: ValidationSucceeded
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2025-02-25T14:15:05Z"
      message: ""
      observedGeneration: 2
      reason: Ready
      status: "True"
      type: Ready
    resources:
      cpu: "224"
      ephemeral-storage: 3758010228Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 1832551800Ki
      nodes: "7"
      pods: "1638"
kind: List
metadata:
  resourceVersion: ""

EC2NodeClass YAML:

apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1
  kind: EC2NodeClass
  metadata:
    annotations:
      karpenter.k8s.aws/ec2nodeclass-hash: "11437525108660720326"
      karpenter.k8s.aws/ec2nodeclass-hash-version: v4
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.k8s.aws/v1","kind":"EC2NodeClass","metadata":{"annotations":{},"name":"default"},"spec":{"amiFamily":"AL2","amiSelectorTerms":[{"id":"ami-0b7fffc35083cdb51"}],"blockDeviceMappings":[{"deviceName":"/dev/xvda","ebs":{"deleteOnTermination":true,"encrypted":true,"volumeSize":"512Gi","volumeType":"gp3"}}],"role":"presto-eks-nodes","securityGroupSelectorTerms":[{"id":"sg-0a37e15082ff8061c"}],"subnetSelectorTerms":[{"id":"subnet-0a8fac299db77af7a"}],"tags":{"karpenter.sh/discovery":"presto"}}}
    creationTimestamp: "2025-02-13T21:56:07Z"
    finalizers:
    - karpenter.k8s.aws/termination
    generation: 3
    name: default
    resourceVersion: "40163904"
    uid: 0dcd6328-7c6c-4f75-b341-2240934852c0
  spec:
    amiFamily: AL2
    amiSelectorTerms:
    - id: ami-0b7fffc35083cdb51
    blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 512Gi
        volumeType: gp3
    metadataOptions:
      httpEndpoint: enabled
      httpProtocolIPv6: disabled
      httpPutResponseHopLimit: 1
      httpTokens: required
    role: presto-eks-nodes
    securityGroupSelectorTerms:
    - id: sg-0a37e15082ff8061c
    subnetSelectorTerms:
    - id: subnet-0a8fac299db77af7a
    tags:
      karpenter.sh/discovery: presto
  status:
    amis:
    - id: ami-0b7fffc35083cdb51
      name: .....
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
    conditions:
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: AMIsReady
      status: "True"
      type: AMIsReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: SubnetsReady
      status: "True"
      type: SubnetsReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: SecurityGroupsReady
      status: "True"
      type: SecurityGroupsReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: InstanceProfileReady
      status: "True"
      type: InstanceProfileReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: ValidationSucceeded
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2025-02-28T16:52:09Z"
      message: ""
      observedGeneration: 3
      reason: Ready
      status: "True"
      type: Ready
    instanceProfile: presto_15843455441266977890
    securityGroups:
    - id: sg-0a37e15082ff8061c
      name: eks-cluster-sg-presto-980097877
    subnets:
    - id: subnet-0a8fac299db77af7a
      zone: us-east-1a
      zoneID: use1-az1
kind: List
metadata:
  resourceVersion: ""

Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions