Description
Observed Behavior:
We are running Trino on EKS, installed with a Helm chart, and we use Karpenter to autoscale the EKS cluster. Since Trino is a bit different from other applications, we have to run one pod per node alongside some daemonsets. We are using r5a.8xlarge instances, which have 32 vCPUs and 256 GiB of memory. We see the following errors when Karpenter tries to schedule pods for our deployment:
Warning FailedScheduling 106s (x2 over 6m46s) karpenter Failed to schedule pod, incompatible with nodepool "default", daemonset overhead={"cpu":"500m","memory":"640Mi","pods":"5"}, no instance type satisfied resources {"cpu":"29500m","memory":"246400Mi","pods":"6"} and requirements karpenter.k8s.aws/ec2nodeclass In [default], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [default], node.kubernetes.io/instance-type In [r5a.8xlarge] (no instance type has enough resources)
Warning FailedScheduling 81s (x2 over 6m47s) default-scheduler 0/10 nodes are available: 10 Insufficient cpu, 10 Insufficient memory. preemption: 0/10 nodes are available: 10 No preemption victims found for incoming pod.
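For context, breaking down the numbers in these events (the 29500m / 246400Mi figure already includes the daemonset overhead):

  deployment pod request:    cpu 29000m   memory 245760Mi (240Gi)
  daemonset overhead:        cpu   500m   memory    640Mi
  total to place on a node:  cpu 29500m   memory 246400Mi
  r5a.8xlarge raw capacity:  cpu 32000m   memory 262144Mi (256Gi)

So the total fits within the instance's raw capacity, but, as far as we understand, it also has to fit within the allocatable resources Karpenter computes for the instance type (capacity minus reservations for system daemons and eviction thresholds), which is lower than raw capacity.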
We never faced this issue with Cluster Autoscaler, and the resource requirements of the deployments and daemonsets have not changed.
The pods are currently scheduled and running on nodes that Karpenter provisioned; we only hit this issue when trying to update the deployment using Helm.
This is what the resource usage looks like on a node:

[screenshot of node resource usage]

Not all of the node's resources are in use; there is headroom.
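(For this comparison, the relevant fields on the Node object are status.capacity and status.allocatable; the sketch below only shows the field layout, with placeholder values rather than output from this cluster.)

status:
  capacity:          # raw instance resources, e.g. cpu: "32" for r5a.8xlarge
    cpu: ...
    memory: ...
    pods: ...
  allocatable:       # capacity minus kube-reserved, system-reserved and eviction thresholds
    cpu: ...
    memory: ...
    pods: ...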
We do see the following event on the node:
Normal DisruptionBlocked Karpenter Not all pods would schedule, ....-6f89d94b57-phrx6 => incompatible with nodepool "default", daemonset overhead={"cpu":"500m","memory":"640Mi","pods":"5"}, no instance type satisfied resources {"cpu":"29500m","memory":"246400Mi","pods":"6"} and requirements karpenter.k8s.aws/ec2nodeclass In [default], karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [default], node.kubernetes.io/instance-type In [r5a.8xlarge] (no instance type has enough resources)
Expected Behavior:
Karpenter should be able to create new nodes and schedule the pods on them.
Reproduction Steps (Please include YAML):
- Use an r5a.8xlarge instance.
- Create a deployment with the following requests (a minimal manifest sketch follows this list):
  resources:
    limits:
      cpu: "29"
      memory: "240Gi"
    requests:
      cpu: "29"
      memory: "240Gi"
- Create another deployment that consumes 500m CPU and 640Mi of memory.
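Since the template asks for YAML, here is a minimal Deployment sketch along these lines; the name, labels, and image are placeholders, and only the resources block matches our actual deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trino-worker            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trino-worker         # placeholder label
  template:
    metadata:
      labels:
        app: trino-worker
    spec:
      containers:
      - name: worker
        image: trinodb/trino    # placeholder image
        resources:
          requests:
            cpu: "29"
            memory: "240Gi"
          limits:
            cpu: "29"
            memory: "240Gi"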
Versions:
- Chart Version: 1.2.1
- Kubernetes Version (kubectl version):
  Client Version: v1.31.0
  Kustomize Version: v5.4.2
  Server Version: v1.30.9-eks-8cce635
NodePool YAML:
apiVersion: v1
items:
- apiVersion: karpenter.sh/v1
  kind: NodePool
  metadata:
    annotations:
      karpenter.sh/nodepool-hash: "6821555240594823858"
      karpenter.sh/nodepool-hash-version: v3
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.sh/v1","kind":"NodePool","metadata":{"annotations":{},"name":"default"},"spec":{"disruption":{"consolidateAfter":"60s","consolidationPolicy":"WhenEmpty"},"limits":{"cpu":"40000","memory":"1000Ti"},"template":{"spec":{"nodeClassRef":{"group":"karpenter.k8s.aws","kind":"EC2NodeClass","name":"default"},"requirements":[{"key":"karpenter.sh/capacity-type","operator":"In","values":["on-demand"]},{"key":"node.kubernetes.io/instance-type","operator":"In","values":["r5a.8xlarge"]}]}}}}
    creationTimestamp: "2025-02-13T21:56:08Z"
    generation: 2
    name: default
    resourceVersion: "42022597"
    uid: 936bb3dc-6923-4e8a-a325-5bddd0351958
  spec:
    disruption:
      budgets:
      - nodes: 10%
      consolidateAfter: 60s
      consolidationPolicy: WhenEmpty
    limits:
      cpu: "40000"
      memory: 1000Ti
    template:
      spec:
        expireAfter: 720h
        nodeClassRef:
          group: karpenter.k8s.aws
          kind: EC2NodeClass
          name: default
        requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
          - on-demand
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - r5a.8xlarge
  status:
    conditions:
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 2
      reason: NodeClassReady
      status: "True"
      type: NodeClassReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 2
      reason: ValidationSucceeded
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2025-02-25T14:15:05Z"
      message: ""
      observedGeneration: 2
      reason: Ready
      status: "True"
      type: Ready
    resources:
      cpu: "224"
      ephemeral-storage: 3758010228Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 1832551800Ki
      nodes: "7"
      pods: "1638"
kind: List
metadata:
  resourceVersion: ""
EC2NodeClass YAML:
apiVersion: v1
items:
- apiVersion: karpenter.k8s.aws/v1
  kind: EC2NodeClass
  metadata:
    annotations:
      karpenter.k8s.aws/ec2nodeclass-hash: "11437525108660720326"
      karpenter.k8s.aws/ec2nodeclass-hash-version: v4
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"karpenter.k8s.aws/v1","kind":"EC2NodeClass","metadata":{"annotations":{},"name":"default"},"spec":{"amiFamily":"AL2","amiSelectorTerms":[{"id":"ami-0b7fffc35083cdb51"}],"blockDeviceMappings":[{"deviceName":"/dev/xvda","ebs":{"deleteOnTermination":true,"encrypted":true,"volumeSize":"512Gi","volumeType":"gp3"}}],"role":"presto-eks-nodes","securityGroupSelectorTerms":[{"id":"sg-0a37e15082ff8061c"}],"subnetSelectorTerms":[{"id":"subnet-0a8fac299db77af7a"}],"tags":{"karpenter.sh/discovery":"presto"}}}
    creationTimestamp: "2025-02-13T21:56:07Z"
    finalizers:
    - karpenter.k8s.aws/termination
    generation: 3
    name: default
    resourceVersion: "40163904"
    uid: 0dcd6328-7c6c-4f75-b341-2240934852c0
  spec:
    amiFamily: AL2
    amiSelectorTerms:
    - id: ami-0b7fffc35083cdb51
    blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 512Gi
        volumeType: gp3
    metadataOptions:
      httpEndpoint: enabled
      httpProtocolIPv6: disabled
      httpPutResponseHopLimit: 1
      httpTokens: required
    role: presto-eks-nodes
    securityGroupSelectorTerms:
    - id: sg-0a37e15082ff8061c
    subnetSelectorTerms:
    - id: subnet-0a8fac299db77af7a
    tags:
      karpenter.sh/discovery: presto
  status:
    amis:
    - id: ami-0b7fffc35083cdb51
      name: .....
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
    conditions:
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: AMIsReady
      status: "True"
      type: AMIsReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: SubnetsReady
      status: "True"
      type: SubnetsReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: SecurityGroupsReady
      status: "True"
      type: SecurityGroupsReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: InstanceProfileReady
      status: "True"
      type: InstanceProfileReady
    - lastTransitionTime: "2025-02-13T21:56:08Z"
      message: ""
      observedGeneration: 3
      reason: ValidationSucceeded
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2025-02-28T16:52:09Z"
      message: ""
      observedGeneration: 3
      reason: Ready
      status: "True"
      type: Ready
    instanceProfile: presto_15843455441266977890
    securityGroups:
    - id: sg-0a37e15082ff8061c
      name: eks-cluster-sg-presto-980097877
    subnets:
    - id: subnet-0a8fac299db77af7a
      zone: us-east-1a
      zoneID: use1-az1
kind: List
metadata:
  resourceVersion: ""
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment