Description
Note that I am cross-posting this from aws/karpenter-provider-aws#7254 as the more I look into the issue, the more it seems to be related to core Karpenter logic rather than something on AWS's end.
Observed Behavior:
Occasionally, Karpenter will provision a node that is far larger than what is being requested.
For example, the node provisioned below is 10x larger than the request. Moreover, the generated nodeclaim has only a single entry for instance-types, despite the NodePool (manifest below) having many instance types that would fit the scheduling request (which it normally does).
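For anyone trying to confirm the same symptom, the requirement actually recorded on the nodeclaim can be inspected with something like the following (node.kubernetes.io/instance-type is the well-known key Karpenter records it under):

kubectl get nodeclaim spot-arm-9468ed6c-ckz95 \
  -o jsonpath='{.spec.requirements[?(@.key=="node.kubernetes.io/instance-type")].values}'

In the failing case this presumably shows only c6a.12xlarge rather than the full set allowed by the NodePool.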
{
  "level": "INFO",
  "time": "2024-10-19T15:04:32.809Z",
  "logger": "controller",
  "message": "created nodeclaim",
  "commit": "62a726c",
  "controller": "provisioner",
  "namespace": "",
  "name": "",
  "reconcileID": "e438aaaa-f5dd-4ac9-8fd3-c8d5d4ddb230",
  "NodePool": {
    "name": "spot-arm-9468ed6c"
  },
  "NodeClaim": {
    "name": "spot-arm-9468ed6c-ckz95"
  },
  "requests": {
    "cpu": "1263m",
    "ephemeral-storage": "50Mi",
    "memory": "8289507076",
    "pods": "19"
  },
  "instance-types": "c6a.12xlarge"
}
{
  "level": "INFO",
  "time": "2024-10-19T15:04:34.710Z",
  "logger": "controller",
  "message": "launched nodeclaim",
  "commit": "62a726c",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "spot-arm-9468ed6c-ckz95"
  },
  "namespace": "",
  "name": "spot-arm-9468ed6c-ckz95",
  "reconcileID": "cb294d4a-ffb6-4ed4-a5f0-caede430e7de",
  "provider-id": "aws:///us-east-2b/i-0485d92ce19cea74e",
  "instance-type": "c6a.12xlarge",
  "zone": "us-east-2b",
  "capacity-type": "spot",
  "allocatable": {
    "cpu": "47810m",
    "ephemeral-storage": "35Gi",
    "memory": "77078Mi",
    "pods": "110",
    "vpc.amazonaws.com/pod-eni": "114"
  }
}
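To put numbers on the mismatch: the pods requested 1263m of CPU and ~7.7Gi of memory (8289507076 bytes), while the launched c6a.12xlarge has 47810m of CPU and ~75Gi (77078Mi) allocatable, i.e. roughly 38x the requested CPU and 10x the requested memory.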
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "1709207863625532397"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-09-04T23:12:01Z"
  generation: 7
  labels:
    panfactum.com/environment: production
    panfactum.com/local: "false"
    panfactum.com/module: kube_karpenter_node_pools
    panfactum.com/region: us-east-2
    panfactum.com/root-module: kube_karpenter_node_pools
    panfactum.com/stack-commit: local
    panfactum.com/stack-version: local
    test.1/2.3.4.5: test.1.2.3.4.5
    test1: foo
    test2: bar
    test3: baz
    test4: "42"
  name: spot-arm-9468ed6c
  resourceVersion: "249793223"
  uid: f43f92a5-c202-4b83-892a-4838375de78e
spec:
  disruption:
    budgets: []
    consolidateAfter: 10s
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    metadata:
      labels:
        panfactum.com/class: spot
    spec:
      expireAfter: 24h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spot-3008ed27
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - m8g
        - m7g
        - m7i
        - m7a
        - m6g
        - m6i
        - m6a
        - c8g
        - c7g
        - c7i
        - c7a
        - c6g
        - c6gn
        - c6i
        - c6a
        - r8g
        - r7g
        - r7i
        - r7iz
        - r7a
        - r6g
        - r6i
        - r6a
      - key: karpenter.k8s.aws/instance-size
        operator: NotIn
        values:
        - metal
        - metal-24xl
        - metal-48xl
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.k8s.aws/instance-memory
        operator: Gt
        values:
        - "2500"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
      - key: kubernetes.io/arch
        operator: In
        values:
        - arm64
        - amd64
      startupTaints:
      - effect: NoSchedule
        key: node.cilium.io/agent-not-ready
        value: "true"
      taints:
      - effect: NoSchedule
        key: spot
        value: "true"
      - effect: NoSchedule
        key: arm64
        value: "true"
      terminationGracePeriod: 2m0s
  weight: 20
status:
  conditions:
  - lastTransitionTime: "2024-09-04T23:12:01Z"
    message: ""
    reason: NodeClassReady
    status: "True"
    type: NodeClassReady
  - lastTransitionTime: "2024-09-04T23:12:02Z"
    message: ""
    reason: Ready
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-09-04T23:12:02Z"
    message: ""
    reason: ValidationSucceeded
    status: "True"
    type: ValidationSucceeded
  resources:
    cpu: "8"
    ephemeral-storage: 40894Mi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    hugepages-32Mi: "0"
    hugepages-64Ki: "0"
    memory: 32247340Ki
    nodes: "1"
    pods: "110"
Expected Behavior:
When a set of pods is pending and needs a new node, the generated nodeclaim includes all applicable instance-types and an appropriately sized node is created.
This normally works correctly and generates logs as follows:
{
  "level": "INFO",
  "time": "2024-10-19T14:57:53.863Z",
  "logger": "controller",
  "message": "created nodeclaim",
  "commit": "62a726c",
  "controller": "provisioner",
  "namespace": "",
  "name": "",
  "reconcileID": "826e7b10-5052-4dcf-8688-40bdbbc4283a",
  "NodePool": {
    "name": "spot-arm-9468ed6c"
  },
  "NodeClaim": {
    "name": "spot-arm-9468ed6c-fqvlb"
  },
  "requests": {
    "cpu": "1310m",
    "memory": "3214608968",
    "pods": "6"
  },
  "instance-types": "c6g.12xlarge, c6g.16xlarge, c6g.2xlarge, c6g.4xlarge, c6g.8xlarge and 55 other(s)"
}
{
  "level": "INFO",
  "time": "2024-10-19T14:57:56.076Z",
  "logger": "controller",
  "message": "launched nodeclaim",
  "commit": "62a726c",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "spot-arm-9468ed6c-fqvlb"
  },
  "namespace": "",
  "name": "spot-arm-9468ed6c-fqvlb",
  "reconcileID": "1e545dea-1bc7-4bbb-82f9-b98a29a79c96",
  "provider-id": "aws:///us-east-2a/i-08d9e7d0c1aead853",
  "instance-type": "m8g.large",
  "zone": "us-east-2a",
  "capacity-type": "spot",
  "allocatable": {
    "cpu": "1930m",
    "ephemeral-storage": "35Gi",
    "memory": "4124Mi",
    "pods": "110"
  }
}
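Here the launched m8g.large (1930m CPU, 4124Mi memory allocatable) is a snug fit for the 1310m / ~3Gi request, which is what I would expect.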
Reproduction Steps (Please include YAML):
It is unclear to me how to reproduce this. I have tried all the obvious things and am not able to reliably re-trigger the behavior (it seems to occur somewhat randomly):
- Created sets of pending pods with higher cpu, memory, and pod count requirements than the above requests
- Updated the NodePool to trigger drift detection (e.g. by patching a template label, as sketched after this list)
- Upgraded Karpenter
- Used various NodePools with different requirement settings
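For reference, the drift trigger in the second item above was done along these lines (the label key here is purely illustrative):

kubectl patch nodepool spot-arm-9468ed6c --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"drift-test":"1"}}}}}'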
I have also verified that the pods do not have any scheduling constraints that would limit them to a single instance type.
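For anyone wanting to double-check the same thing, the scheduling-relevant fields of the pending pods can be dumped with something like:

kubectl get pods -A --field-selector=status.phase=Pending \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.nodeSelector}{"\t"}{.spec.affinity}{"\n"}{end}'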
In fact, which particular type is chosen for instance-types seems somewhat random. Sometimes it is appropriately sized, sometimes it is 10x too large, sometimes it is 100x too large. The instance families also differ. However, what is consistent is that the nodeclaim is (a) created by the provisioner controller and (b) generated with just a single type rather than the full expected set.
After the node is created, Karpenter will usually disrupt it shortly afterward and replace it with a smaller node. However, we have sometimes had PDBs prevent this, which is how we noticed this behavior was occurring.
Additionally, all of the NodePools where we have observed this behavior allow spot instances, but I do not know if that is relevant (all of our NodePools are spot-enabled).
Finally, we only started noticing this issue after upgrading to Karpenter v1, or at least it seems far more prevalent now.
Versions:
- Chart Version: 1.0.1
- Kubernetes Version (kubectl version): v1.29.8-eks-a737599
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment