What happened?
Recently, a couple of our applications have had pods fail to schedule; the PodGroup is repeatedly killed and recreated for several hours with the message:
Operation cannot be fulfilled on podgroups.scheduling.run.ai "pg-my-app-b9cb66595-s25q6-d6454e9b-4ad9-459b-aeeb-2f3cfa0b153a": the object has been modified; please apply your changes to the latest version and try again
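For anyone hitting the same symptom, watching the PodGroup's events should show the repeated recreation. A minimal sketch, assuming the CRD name from the error above and using our placeholder names:

# List the PodGroups managed by KAI in the affected namespace (placeholder names)
kubectl get podgroups.scheduling.run.ai -n my-namespace
# Describe the affected PodGroup to see its events and the conflicting updates
kubectl describe podgroups.scheduling.run.ai pg-my-app-b9cb66595-s25q6-d6454e9b-4ad9-459b-aeeb-2f3cfa0b153a -n my-namespace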
The binder has the following error message:
2025-10-29T23:48:58.612Z ERROR Failed to reserve GPUs resources for pod {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"my-app-b9cb66595-rwxq6","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "reconcileID": "cbf3a793-739f-4f50-8bda-56d005e6286e", "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "error": "cluster is scaling up, could not create reservation pod"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).Bind
    /local/pkg/binder/binding/binder.go:55
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
    /local/pkg/binder/controllers/bindrequest_controller.go:155
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
2025-10-29T23:48:58.612Z ERROR Failed to bind pod to node {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"my-app-b9cb66595-rwxq6","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "reconcileID": "cbf3a793-739f-4f50-8bda-56d005e6286e", "pod": "my-app-b9cb66595-rwxq6", "namespace": "my-namespace", "node": "ip-10-107-3-14.us-west-2.compute.internal", "error": "cluster is scaling up, could not create reservation pod"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
    /local/pkg/binder/controllers/bindrequest_controller.go:164
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
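The corresponding BindRequest objects carry the same failure in their status. A hedged sketch for inspecting them, with the resource group and kind inferred from the controllerGroup and controllerKind fields in the log above (names are placeholders):

# List BindRequests in the affected namespace and inspect the failing one
kubectl get bindrequests.scheduling.run.ai -n my-namespace
kubectl describe bindrequests.scheduling.run.ai my-app-b9cb66595-rwxq6 -n my-namespace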
When we disable KAI-Scheduler for these applications, Karpenter creates a NodeClaim and the pods are scheduled successfully.
UPDATE:
There was a pod in the kai-scale-adjust namespace that was stuck in the Terminating state. Manually forcing the deletion of this pod fixed scheduling for all of our workloads (a sketch of the command is after the logs below).
The logs from the nodescaleadjuster show that it was trying to delete this pod multiple times per second:
2025/10/30 09:32:36 Deleted scaling pod: pod-name
2025/10/30 09:32:36 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:36 Deleted scaling pod: pod-name
2025/10/30 09:32:36 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
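For reference, the manual force deletion was along these lines (the pod name is a placeholder; --force --grace-period=0 skips graceful termination, so it should only be used on pods that are already wedged):

# Force-delete the stuck scaling pod so the nodescaleadjuster stops looping
kubectl delete pod pod-name -n kai-scale-adjust --force --grace-period=0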
What did you expect to happen?
NodeClaims to be created and pods to be scheduled successfully.
Environment
- Kubernetes version: 1.33
- KAI Scheduler version: 0.9.4
- Cloud provider or hardware configuration: AWS
- Tools that you are using KAI together with: Karpenter and KEDA