
Capacity plugin blocks job enqueue when cluster is at capacity, preventing reclaim action from working #4679

@mtian29

Description

Current Behavior

When the cluster's capacity is fully allocated, the capacity plugin blocks new jobs from transitioning from Pending to Inqueue state. This prevents the reclaim action from ever considering these jobs for resource reclamation, even when:

  • Queues are marked as reclaimable: true
  • Some queues are over their deserved allocation and should yield resources

This creates a chicken-and-egg situation:

  1. Job is in Pending state
  2. Capacity plugin blocks enqueue because cluster is at capacity (root queue fully allocated)
  3. Job remains in Pending state
  4. Reclaim action skips Pending jobs
  5. No reclamation happens, job stays stuck

Queue Hierarchy:

root (capacity: 320 H100 = cluster total)
├── child-queue-a (capacity: 320, deserved: 160)
│   ├── subchild-queue-a1 (capacity: 160, deserved: 80)
│   └── subchild-queue-a2 (capacity: 160, deserved: 80)
└── child-queue-b (capacity: 320, deserved: 160)

All queues have reclaimable: true and each job member requests 8 H100 GPUs.

Current State:

| NAME                                | STATUS  | MINMEMBER | RUNNINGS | AGE   |
| :---------------------------------- | :------ | :-------- | :------- | :---- |
| ray-kwok-raycluster-h100-q2-a1-0-pg | Pending | 1         | 0        | 119m  |
| ray-kwok-raycluster-h100-q2-a1-pg   | Running | 8         | 8        | 119m  |
| ray-kwok-raycluster-h100-q2-a2-pg   | Running | 11        | 11       | 175m  |
| ray-kwok-raycluster-h100-q2-b-pg    | Running | 21        | 21       | 174m  |

Total cluster allocation: 320 H100s (8×8 + 11×8 + 21×8 = 40 members × 8 GPUs)

Submitting a new job, ray-kwok-raycluster-h100-q2-a1-0-pg (8 GPUs), gets stuck:

I1017 21:10:21.430616       1 enqueue.go:45] Enter Enqueue ...
I1017 21:10:21.430622       1 enqueue.go:63] Added Queue  for Job <volcano/ray-kwok-raycluster-h100-q2-a2-pg>
I1017 21:10:21.430626       1 enqueue.go:63] Added Queue  for Job <volcano/ray-kwok-raycluster-h100-q2-b-pg>
I1017 21:10:21.430633       1 enqueue.go:63] Added Queue  for Job <volcano/ray-kwok-raycluster-h100-q2-a1-0-pg>
I1017 21:10:21.430639       1 enqueue.go:74] Added Job <volcano/ray-kwok-raycluster-h100-q2-a1-0-pg> into Queue 
I1017 21:10:21.430646       1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1017 21:10:21.430690       1 capacity.go:772] job ray-kwok-raycluster-h100-q2-a1-0-pg min resource <cpu 8000.00, memory 8589934592.00, nvidia.com/h100 8000.00>, queue root capability <cpu 751370.00, memory 1122232659968.00, nvidia.com/a100 320000.00, nvidia.com/h100 320000.00> allocated <cpu 320000.00, memory 343597383680.00, nvidia.com/h100 320000.00> inqueue <cpu 0.00, memory 0.00> elastic <cpu 0.00, memory 0.00>
I1017 21:10:21.430715       1 enqueue.go:104] Leaving Enqueue 

Expected Behavior

The job should:

  1. Transition to Inqueue state even when the cluster is at capacity
  2. Allow the reclaim action to evaluate if resources can be reclaimed from queues exceeding their deserved allocation
Allocation at the time the new job is submitted:

root: 320/320 H100 (FULL)
├── child-queue-a: 152/320 H100 (deserved: 160)
│   ├── subchild-queue-a1: 64/160 H100 (deserved: 80) ← job wants to go here
│   └── subchild-queue-a2: 88/160 H100 (deserved: 80) ← OVER deserved!
└── child-queue-b: 168/320 H100 (deserved: 160) ← OVER deserved!
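
In this state, subchild-queue-a2 and child-queue-b together hold 16 H100s above their deserved shares, which is more than the 8 H100s the pending job needs. A minimal standalone sketch of that arithmetic (plain Go with made-up types; it does not use the plugin's actual queueAttr or api.Resource code):

package main

import "fmt"

// queueState is a simplified stand-in for the scheduler's per-queue attributes.
type queueState struct {
    name        string
    allocated   int // H100s currently allocated
    deserved    int // H100s the queue deserves
    reclaimable bool
}

func main() {
    queues := []queueState{
        {"subchild-queue-a1", 64, 80, true},
        {"subchild-queue-a2", 88, 80, true},
        {"child-queue-b", 168, 160, true},
    }

    // Sum the GPUs held beyond each reclaimable queue's deserved share.
    surplus := 0
    for _, q := range queues {
        if q.reclaimable && q.allocated > q.deserved {
            surplus += q.allocated - q.deserved
        }
    }

    const jobRequest = 8 // the stuck job asks for 8 H100s
    fmt.Printf("reclaimable surplus=%d H100, job request=%d H100, enough=%v\n",
        surplus, jobRequest, surplus >= jobRequest)
    // Prints: reclaimable surplus=16 H100, job request=8 H100, enough=true
}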

Root Cause

The capacity plugin's jobEnqueueable check (https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L884-L894) uses the formula:

r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
return r.LessEqualWithDimension(attr.realCapability, minReq)

When queues are checked hierarchically, this formula is also applied to the root queue (the cluster total). When allocated == capability (the cluster is fully allocated) and elastic == 0 (all jobs are running at their minimum resources), the check fails even though:

  • Resources exist in other queues that could be reclaimed
  • Some queues are over their deserved share
  • All queues are marked reclaimable

The elastic metric only tracks resources within a job that exceed its minimum requirement; it does not account for reclaimable resources across the queue hierarchy.
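
For example, plugging the root queue's H100 numbers from the enqueue log above into that formula shows why the check fails (a standalone sketch with plain integers instead of the scheduler's api.Resource type):

package main

import "fmt"

func main() {
    // H100 milli-GPU values from the capacity.go log line above.
    minReq := 8000           // job min resource: 8 H100
    allocated := 320000      // root queue allocated: 320 H100
    inqueue := 0             // no other jobs are Inqueue
    elastic := 0             // every job runs at its minimum, so nothing is elastic
    realCapability := 320000 // root queue capability: 320 H100

    // r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
    r := minReq + allocated + inqueue - elastic

    // return r.LessEqualWithDimension(attr.realCapability, minReq)
    fmt.Printf("r=%d, capability=%d, enqueueable=%v\n", r, realCapability, r <= realCapability)
    // Prints: r=328000, capability=320000, enqueueable=false
}

The extra 8 H100s of minReq push r past the root capability, so the job never reaches Inqueue, even though 16 H100s sit above the deserved shares of subchild-queue-a2 and child-queue-b.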

Fix

Option 1: In the capacity plugin, still allow enqueue when the cluster is at capacity but resources could be reclaimed:

func (cp *capacityPlugin) jobEnqueueable(queue *api.QueueInfo, job *api.JobInfo) bool {
    // ... existing capacity check; return true if it already passes ...

    // If the cluster is at capacity, still allow enqueue when enough resources
    // could be reclaimed from queues that exceed their deserved share.
    if cp.hasReclaimableResourcesInCluster(queue, job) {
        return true
    }
    return false
}
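
A rough sketch of what such a helper could compute, assuming it walks the reclaimable queues and sums allocation above deserved; hasReclaimableResourcesInCluster, the simpleQueue type, and this traversal are hypothetical simplifications, not existing Volcano code:

// simpleQueue is a hypothetical, flattened view of a leaf queue's attributes.
type simpleQueue struct {
    allocated   map[string]int // resource name -> amount allocated
    deserved    map[string]int // resource name -> deserved amount
    reclaimable bool
}

// hasReclaimableResources reports whether reclaimable queues that exceed their
// deserved share collectively hold at least minReq of every requested resource.
func hasReclaimableResources(queues []simpleQueue, minReq map[string]int) bool {
    surplus := map[string]int{}
    for _, q := range queues {
        if !q.reclaimable {
            continue
        }
        for res, alloc := range q.allocated {
            if over := alloc - q.deserved[res]; over > 0 {
                surplus[res] += over
            }
        }
    }
    for res, req := range minReq {
        if surplus[res] < req {
            return false
        }
    }
    return true
}

In the scenario above this would find 16 H100s of surplus against an 8 H100 request, let the job reach Inqueue, and leave the actual eviction decisions to the reclaim action.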

Option 2: Skip the check at the root queue level (#4680).

Steps to reproduce the issue

Describe the results you received and expected

See description above.

What version of Volcano are you using?

1.12.0

Any other relevant information

No response

Labels

kind/bug