Pods failing to schedule: blocked by terminating scaling-pod #611

@georgef-flwls

Description

What happened?

Recently, a couple of our applications have had pods fail to schedule; the PodGroup is repeatedly killed and recreated for several hours with the following message:

Operation cannot be fulfilled on podgroups.scheduling.run.ai 
"pg-my-app-b9cb66595-s25q6-d6454e9b-4ad9-459b-aeeb-2f3cfa0b153a":
 the object has been modified; please apply your changes to the latest version and try again
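
To confirm the churn, watching the PodGroup objects in the affected namespace shows them being deleted and recreated over and over. A minimal check (the namespace is the placeholder used throughout this report):

kubectl -n my-namespace get podgroups.scheduling.run.ai -w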

The binder has the following error message:

2025-10-29T23:48:58.612Z    ERROR    Failed to reserve GPUs resources for pod    {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"my-app-b9cb66595-rwxq6","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "reconcileID": "cbf3a793-739f-4f50-8bda-56d005e6286e", "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "error": "cluster is scaling up, could not create reservation pod"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding.(*Binder).Bind
    /local/pkg/binder/binding/binder.go:55
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
    /local/pkg/binder/controllers/bindrequest_controller.go:155
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202
2025-10-29T23:48:58.612Z    ERROR    Failed to bind pod to node    {"controller": "bindrequest", "controllerGroup": "scheduling.run.ai", "controllerKind": "BindRequest", "BindRequest": {"name":"my-app-b9cb66595-rwxq6","namespace":"my-namespace"}, "namespace": "my-namespace", "name": "my-app-b9cb66595-rwxq6", "reconcileID": "cbf3a793-739f-4f50-8bda-56d005e6286e", "pod": "my-app-b9cb66595-rwxq6", "namespace": "my-namespace", "node": "ip-10-107-3-14.us-west-2.compute.internal", "error": "cluster is scaling up, could not create reservation pod"}
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers.(*BindRequestReconciler).Reconcile
    /local/pkg/binder/controllers/bindrequest_controller.go:164
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202

When we disable KAI-Scheduler for these applications (pointing them back at the default scheduler, roughly as sketched below), Karpenter creates a NodeClaim and the pods are scheduled successfully.
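
A minimal sketch of that opt-out, assuming the workloads were opted in via schedulerName: kai-scheduler in their pod template (the deployment and namespace names are placeholders from this report):

# Point the workload back at the default scheduler (placeholder names;
# assumes the pod template previously set schedulerName: kai-scheduler)
kubectl -n my-namespace patch deployment my-app --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/schedulerName", "value": "default-scheduler"}]'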

UPDATE:
There was a pod in the kai-scale-adjust namespace that was stuck in the Terminating state. Manually forcing the deletion of this pod fixed the scheduling of all of our workloads.
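
For anyone hitting the same thing, this is roughly what we ran to find and force-delete the stuck pod (pod-name is the placeholder used in the logs below):

# Look for the scaling pod stuck in Terminating
kubectl -n kai-scale-adjust get pods
# Force-remove it, bypassing the graceful termination that never completes
kubectl -n kai-scale-adjust delete pod pod-name --grace-period=0 --force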

The logs from the nodescaleadjuster show that it was trying to delete this pod multiple times a second.

2025/10/30 09:32:36 Deleted scaling pod: pod-name
2025/10/30 09:32:36 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:36 Deleted scaling pod: pod-name
2025/10/30 09:32:36 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
2025/10/30 09:32:37 Deleting scaling pod: kai-scale-adjust/pod-name
2025/10/30 09:32:37 Deleted scaling pod: pod-name
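
A pod that stays in Terminating while delete calls keep landing on it usually means a finalizer is not being cleared (or its node is unreachable), so checks along these lines may help anyone debugging the same loop (pod name is the placeholder from the logs above):

# Show any finalizers still set on the stuck scaling pod
kubectl -n kai-scale-adjust get pod pod-name -o jsonpath='{.metadata.finalizers}'
# Inspect its deletion timestamp and recent events
kubectl -n kai-scale-adjust describe pod pod-name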

What did you expect to happen?

NodeClaims to be created and pods to be scheduled successfully.

Environment

  • Kubernetes version: 1.33
  • KAI Scheduler version: 0.9.4
  • Cloud provider or hardware configuration: AWS
  • Tools that you are using KAI together with: Karpenter and KEDA
