Pods failing to schedule: blocked by terminating scaling-pod #611

@georgef-flwls

Description

What happened?

Recently, a couple of our applications have had pods fail to schedule; the PodGroup is repeatedly killed and recreated for several hours with the following message:

Operation cannot be fulfilled on podgroups.scheduling.run.ai 
"pg-my-app-b9cb66595-s25q6-d6454e9b-4ad9-459b-aeeb-2f3cfa0b153a":
 the object has been modified; please apply your changes to the latest version and try again
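
To confirm the churn, watching the PodGroup objects in the affected namespace shows them being deleted and recreated over and over. A minimal check (the namespace is the placeholder used throughout this report):

kubectl -n my-namespace get podgroups.scheduling.run.ai -w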

The binder has the following error message:

2025-10-29T23:48:58.612Z    ERROR    Failed to reserve GPUs resources for pod    {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"my-app-b9cb66595-rwxq6","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "reconcileID": "cbf3a793-739f-4f50-8bda-56d005e6286e", "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "error": "cluster is scaling up, could not create reservation pod"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).Bind
    /local/pkg/binder/binding/binder.go:55
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
    /local/pkg/binder/controllers/bindrequest_controller.go:155
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
2025-10-29T23:48:58.612Z    ERROR    Failed to bind pod to node    {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"my-app-b9cb66595-rwxq6","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "reconcileID": "cbf3a793-739f-4f50-8bda-56d005e6286e", "pod": "my-app-b9cb66595-rwxq6", "namespace": "my-namespace", "node": "ip-10-107-3-14.us-west-2.compute.internal", "error": "cluster is scaling up, could not create reservation pod"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
    /local/pkg/binder/controllers/bindrequest_controller.go:164
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202

When we disable KAI-Scheduler for these applications (pointing them back at the default scheduler, roughly as sketched below), Karpenter creates a NodeClaim and the pods are scheduled successfully.
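
A minimal sketch of that opt-out, assuming the workloads were opted in via schedulerName: kai-scheduler in their pod template (the deployment and namespace names are placeholders from this report):

# Point the workload back at the default scheduler (placeholder names;
# assumes the pod template previously set schedulerName: kai-scheduler)
kubectl -n my-namespace patch deployment my-app --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/schedulerName", "value": "default-scheduler"}]'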

UPDATE:
There was a pod in the kai-scale-adjust namespace that was stuck in the Terminating state. Manually forcing the deletion of this pod fixed the scheduling of all of our workloads.
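
For anyone hitting the same thing, this is roughly what we ran to find and force-delete the stuck pod (pod-name is the placeholder used in the logs below):

# Look for the scaling pod stuck in Terminating
kubectl -n kai-scale-adjust get pods
# Force-remove it, bypassing the graceful termination that never completes
kubectl -n kai-scale-adjust delete pod pod-name --grace-period=0 --force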

The logs from the nodescaleadjuster show that it was trying to delete this pod multiple times a second.

2025/10/30 09:32:36 Deleted scaling pod: pod-name
2025/10/30 09:32:36 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:36 Deleted scaling pod: pod-name
2025/10/30 09:32:36 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
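
A pod that stays in Terminating while delete calls keep landing on it usually means a finalizer is not being cleared (or its node is unreachable), so checks along these lines may help anyone debugging the same loop (pod name is the placeholder from the logs above):

# Show any finalizers still set on the stuck scaling pod
kubectl -n kai-scale-adjust get pod pod-name -o jsonpath='{.metadata.finalizers}'
# Inspect its deletion timestamp and recent events
kubectl -n kai-scale-adjust describe pod pod-name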

What did you expect to happen?

NodeClaims to be created and pods to be scheduled successfully.

Environment

  • Kubernetes version: 1.33
  • KAI Scheduler version: 0.9.4
  • Cloud provider or hardware configuration: AWS
  • Tools that you are using KAI together with: Karpenter and KEDA
