Infinite preemption loop in fair sharing #3779

@yuvalaz99

Description

What happened:

An infinite preemption loop occurs in a hierarchical cohort scenario with Kueue's Pod integration.
When a higher-priority workload preempts a lower-priority one from a different queue, the Deployment controller recreates the preempted pod, and the resulting workload is readmitted.
Once readmitted, it is preempted again, and the loop repeats.

Below I'll show an edge case where this infinite preemption behavior occurs; not all preemption operations using hierarchical cohorts behave this way.

Cohort Structure (NM = nominal quota)
--- Cohort Root (NM: 0m)
-------- Cohort Guaranteed (NM: 300m)
------------- ClusterQueue Guaranteed (NM: 0m)
-------- ClusterQueue BestEffortTenantA (NM: 0m)
-------- ClusterQueue BestEffortTenantB (NM: 0m)

Event Sequence Leading to the Issue

  1. BestEffortTenantA1 (100m) admitted
  2. BestEffortTenantA2 (100m) admitted
  3. BestEffortTenantB1 (50m) admitted
  4. BestEffortTenantB2 (50m) admitted
  5. Guaranteed1 (50m) created → preempts BestEffortTenantA2
  6. Guaranteed1 (50m) admitted
  7. Guaranteed2 (100m) created → triggers infinite loop:
    • BestEffortTenantB2 preempted
    • Guaranteed2 remains SchedulingGated
    • BestEffortTenantB2 recreated by Deployment
    • BestEffortTenantB2 admitted
    • Loop repeats
  8. After a few minutes, BestEffortTenantB2 resumes running without being preempted, while Guaranteed2 remains in the SchedulingGated state.
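
For context, here is my rough quota accounting for the sequence above, assuming the best-effort queues run entirely on quota borrowed from the guaranteed cohort's 300m nominal:

After step 4: A1 (100m) + A2 (100m) + B1 (50m) + B2 (50m) = 300m → cohort full
After step 6: A1 (100m) + B1 (50m) + B2 (50m) + Guaranteed1 (50m) = 250m → 50m free
Step 7: Guaranteed2 (100m) preempts BestEffortTenantB2 (50m) to reach 100m free, but the recreated BestEffortTenantB2 is readmitted into that gap first, leaving only 50m free again, so Guaranteed2 preempts once more.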

Pod status

> kubectl get pods
NAME                                        READY   STATUS            RESTARTS   AGE
best-effort-tenant-a-1-5576db6f68-nms98     1/1     Running           0          55s
best-effort-tenant-a-2-695bd7b4c-b5n4l      0/1     SchedulingGated   0          43s
best-effort-tenant-b-1-68596f9f8b-52p5z     1/1     Running           0          50s
best-effort-tenant-b-2-69dc4f5797-l4vq6     0/1     Terminating       0          5s
best-effort-tenant-b-2-69dc4f5797-l4vq6     0/1     SchedulingGated   0          2s
guaranteed-1-7c96f8db5-n6575                1/1     Running           0          43s
guaranteed-2-79f59d8b74-7ft4b               0/1     SchedulingGated   0          39s
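
The same churn is visible at the Workload level. A suggested way to observe it (not part of the original capture) is to watch the Workload objects; the entry for best-effort-tenant-b-2 is repeatedly deleted and recreated under a new generated name:

> kubectl get workloads -w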

What you expected to happen:

Preemption should complete successfully without entering an infinite loop.

How to reproduce it (as minimally and precisely as possible):

  1. Apply core configuration (cohorts, queues, and priority classes):
---
# Root Cohort
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Cohort
metadata:
  name: root
---
# Guaranteed Cohort
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Cohort
metadata:
  name: guaranteed
spec:
  parent: root
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "default"
      resources:
      - name: "cpu"
        nominalQuota: "300m"
---
# Guaranteed ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "guaranteed"
spec:
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: LowerPriority
    borrowWithinCohort:
      policy: LowerPriority
      maxPriorityThreshold: 100
    withinClusterQueue: Never
  cohort: guaranteed
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "default"
      resources:
      - name: "cpu"
        nominalQuota: "0m"
---
# Best-effort ClusterQueues
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "best-effort-tenant-a"
spec:
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Never
    borrowWithinCohort:
      policy: Never
    withinClusterQueue: Never
  cohort: root
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "default"
      resources:
      - name: "cpu"
        nominalQuota: "0m"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "best-effort-tenant-b"
spec:
  namespaceSelector: {}
  preemption:
    reclaimWithinCohort: Never
    borrowWithinCohort:
      policy: Never
    withinClusterQueue: Never
  cohort: root
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "default"
      resources:
      - name: "cpu"
        nominalQuota: "0m"
---
# LocalQueues
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: guaranteed
spec:
  clusterQueue: guaranteed
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: best-effort-tenant-a
spec:
  clusterQueue: best-effort-tenant-a
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: best-effort-tenant-b
spec:
  clusterQueue: best-effort-tenant-b
---
# Priority Classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 70
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium
value: 102
globalDefault: false
---
# Resource Flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default
---
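
Assuming the manifests above are saved to a single file (the filename here is illustrative), apply them and confirm the ClusterQueues exist:

> kubectl apply -f kueue-setup.yaml
> kubectl get clusterqueues
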
  2. Create best-effort workloads:
# Workloads - Best Effort A
apiVersion: apps/v1
kind: Deployment
metadata:
  name: best-effort-tenant-a-1
  labels: {app: best-effort-tenant-a-1, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-a}
spec:
  replicas: 1
  selector:
    matchLabels: {app: best-effort-tenant-a-1}
  template:
    metadata:
      labels: {app: best-effort-tenant-a-1, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-a}
    spec:
      priorityClassName: low
      terminationGracePeriodSeconds: 1
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "100m"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: best-effort-tenant-a-2
  labels: {app: best-effort-tenant-a-2, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-a}
spec:
  replicas: 1
  selector:
    matchLabels: {app: best-effort-tenant-a-2}
  template:
    metadata:
      labels: {app: best-effort-tenant-a-2, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-a}
    spec:
      priorityClassName: low
      terminationGracePeriodSeconds: 1
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "100m"
---
# Workloads - Best Effort B
apiVersion: apps/v1
kind: Deployment
metadata:
  name: best-effort-tenant-b-1
  labels: {app: best-effort-tenant-b-1, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-b}
spec:
  replicas: 1
  selector:
    matchLabels: {app: best-effort-tenant-b-1}
  template:
    metadata:
      labels: {app: best-effort-tenant-b-1, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-b}
    spec:
      priorityClassName: low
      terminationGracePeriodSeconds: 1
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "50m"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: best-effort-tenant-b-2
  labels: {app: best-effort-tenant-b-2, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-b}
spec:
  replicas: 1
  selector:
    matchLabels: {app: best-effort-tenant-b-2}
  template:
    metadata:
      labels: {app: best-effort-tenant-b-2, kueue-job: "true", kueue.x-k8s.io/queue-name: best-effort-tenant-b}
    spec:
      priorityClassName: low
      terminationGracePeriodSeconds: 1
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "50m"
  3. Create first guaranteed workload (triggers normal preemption):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guaranteed-1
  labels: {app: guaranteed-1, kueue-job: "true", kueue.x-k8s.io/queue-name: guaranteed}
spec:
  replicas: 1
  selector:
    matchLabels: {app: guaranteed-1}
  template:
    metadata:
      labels: {app: guaranteed-1, kueue-job: "true", kueue.x-k8s.io/queue-name: guaranteed}
    spec:
      priorityClassName: medium
      terminationGracePeriodSeconds: 1
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "50m"
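
As a sanity check for this step (suggested commands; the labels come from the manifests above), guaranteed-1 should end up Running while best-effort-tenant-a-2 stays SchedulingGated:

> kubectl get pods -l app=guaranteed-1
> kubectl get pods -l app=best-effort-tenant-a-2
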
  4. Create second guaranteed workload (triggers infinite loop):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guaranteed-2
  labels: {app: guaranteed-2, kueue-job: "true", kueue.x-k8s.io/queue-name: guaranteed}
spec:
  replicas: 1
  selector:
    matchLabels: {app: guaranteed-2}
  template:
    metadata:
      labels: {app: guaranteed-2, kueue-job: "true", kueue.x-k8s.io/queue-name: guaranteed}
    spec:
      priorityClassName: medium
      terminationGracePeriodSeconds: 1
      containers:
      - name: main
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "100m"
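
Once guaranteed-2 is created, the loop can be observed by watching the pods: the best-effort-tenant-b-2 pod cycles through Running → Terminating → SchedulingGated while guaranteed-2 stays gated (suggested commands, not from the original capture):

> kubectl get pods -w
> kubectl get events -w | grep -i preempt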

Anything else we need to know?:

kueue-manager logs

{"level":"Level(-2)","ts":"2024-12-09T21:21:39.997044887Z","logger":"workload-reconciler","caller":"queue/manager.go:528","msg":"Attempting to move workloads","workload":{"name":"pod-best-effort-tenant-b-2-69dc4f5797-gbndl-be03a","namespace":"default"},"queue":"best-effort-tenant-b","status":"admitted","cohort":"root","root":"root"}
{"level":"Level(-2)","ts":"2024-12-09T21:21:39.997158095Z","logger":"scheduler","caller":"scheduler/scheduler.go:543","msg":"Workload assumed in the cache","attemptCount":115,"workload":{"name":"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea","namespace":"default"},"clusterQueue":{"name":"best-effort-tenant-b"}}
{"level":"info","ts":"2024-12-09T21:21:39.997385095Z","logger":"scheduler","caller":"scheduler/scheduler.go:365","msg":"Workload skipped from admission because it's already assumed or admitted","attemptCount":116,"workload":{"name":"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea","namespace":"default"},"clusterQueue":{"name":"best-effort-tenant-b"},"workload":{"name":"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea","namespace":"default"}}
{"level":"error","ts":"2024-12-09T21:21:40.00190772Z","logger":"scheduler","caller":"scheduler/scheduler.go:263","msg":"Failed to preempt workloads","attemptCount":116,"workload":{"name":"pod-guranateed-2-788b9dd5bb-zbvmc-667aa","namespace":"default"},"clusterQueue":{"name":"guranateed"},"error":"Operation cannot be fulfilled on workloads.kueue.x-k8s.io \"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea\": the object has been modified; please apply your changes to the latest version and try again","stacktrace":"sigs.k8s.io/kueue/pkg/scheduler.(*Scheduler).schedule\n\t/workspace/pkg/scheduler/scheduler.go:263\nsigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff.func1\n\t/workspace/pkg/util/wait/backoff.go:43\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/workspace/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227\nsigs.k8s.io/kueue/pkg/util/wait.untilWithBackoff\n\t/workspace/pkg/util/wait/backoff.go:42\nsigs.k8s.io/kueue/pkg/util/wait.UntilWithBackoff\n\t/workspace/pkg/util/wait/backoff.go:34"}
{"level":"Level(-2)","ts":"2024-12-09T21:21:40.002000345Z","logger":"scheduler","caller":"scheduler/scheduler.go:645","msg":"Workload re-queued","attemptCount":116,"workload":{"name":"pod-best-effort-tenant-a-1-5576db6f68-hjsv2-9fc95","namespace":"default"},"clusterQueue":{"name":"best-effort-tenant-a"},"queue":{"name":"best-effort-tenant-a","namespace":"default"},"requeueReason":"","added":true,"status":""}
{"level":"Level(-2)","ts":"2024-12-09T21:21:40.006699845Z","logger":"scheduler","caller":"scheduler/scheduler.go:567","msg":"Workload successfully admitted and assigned flavors","attemptCount":115,"workload":{"name":"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea","namespace":"default"},"clusterQueue":{"name":"best-effort-tenant-b"},"assignments":[{"name":"main","flavors":{"cpu":"default"},"resourceUsage":{"cpu":"50m"},"count":1}]}
{"level":"debug","ts":"2024-12-09T21:21:40.006734262Z","logger":"events","caller":"recorder/recorder.go:104","msg":"Quota reserved in ClusterQueue best-effort-tenant-b, wait time since queued was 2s","type":"Normal","object":{"kind":"Workload","namespace":"default","name":"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea","uid":"df8be6f5-0bba-4bd4-9f00-39b3209aa82c","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"556812"},"reason":"QuotaReserved"}
{"level":"debug","ts":"2024-12-09T21:21:40.006753928Z","logger":"events","caller":"recorder/recorder.go:104","msg":"Admitted by ClusterQueue best-effort-tenant-b, wait time since reservation was 0s","type":"Normal","object":{"kind":"Workload","namespace":"default","name":"pod-best-effort-tenant-b-2-69dc4f5797-2jpd6-289ea","uid":"df8be6f5-0bba-4bd4-9f00-39b3209aa82c","apiVersion":"kueue.x-k8s.io/v1beta1","resourceVersion":"556812"},"reason":"Admitted"}

Environment:

  • Kubernetes version: v1.31.0
  • Kueue version: v0.10.0-rc.3-20-g6df4a225-dirty
  • Cloud provider or hardware configuration: local setup (minikube) & GKE
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
