Capacity plugin blocks job enqueue when cluster is at capacity, preventing reclaim action from working
Description
Current Behavior
When the cluster's capacity is fully allocated, the capacity plugin blocks new jobs from transitioning from Pending to Inqueue state. This prevents the reclaim action from ever considering these jobs for resource reclamation, even when:
- Queues are marked as reclaimable: true
- Some queues are over their deserved allocation and should yield resources
This creates a chicken-and-egg situation (see the sketch after this list):
- Job is in Pending state
- Capacity plugin blocks enqueue because cluster is at capacity (root queue fully allocated)
- Job remains in Pending state
- Reclaim action skips Pending jobs
- No reclamation happens, so the job stays stuck
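To make the deadlock concrete, here is a toy, self-contained sketch of the two decision points described above. It is not Volcano's actual code; it just reduces "enqueue admits only if capacity allows" and "reclaim considers only non-Pending jobs" to a single GPU dimension:

```go
package main

import "fmt"

// Toy model of the deadlock, not Volcano's real types: a job is either
// Pending or Inqueue, and the cluster has a fixed H100 capacity.
type job struct {
	name   string
	phase  string // "Pending" or "Inqueue"
	minGPU int
}

func main() {
	const capacity = 320
	allocated := 320 // 40 running members x 8 H100 each
	stuck := job{name: "ray-kwok-raycluster-h100-q2-a1-0-pg", phase: "Pending", minGPU: 8}

	// Enqueue step: the capacity check only admits the job if its minimum
	// request still fits under the root queue's capability.
	if allocated+stuck.minGPU <= capacity {
		stuck.phase = "Inqueue"
	}

	// Reclaim step: only non-Pending jobs are considered, so the stuck job
	// never triggers reclamation even though queues are reclaimable.
	if stuck.phase == "Pending" {
		fmt.Println("reclaim skips", stuck.name, "- still Pending, nothing reclaimed")
	}
}
```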
Queue Hierarchy:
root (capacity: 320 H100 = cluster total)
├── child-queue-a (capacity: 320, deserved: 160)
│ ├── subchild-queue-a1 (capacity: 160, deserved: 80)
│ └── subchild-queue-a2 (capacity: 160, deserved: 80)
└── child-queue-b (capacity: 320, deserved: 160)
All queues have reclaimable: true and each job member requests 8 H100 GPUs.
Current State:
| NAME | STATUS | MINMEMBER | RUNNINGS | AGE |
| :---------------------------------- | :------ | :-------- | :------- | :---- |
| ray-kwok-raycluster-h100-q2-a1-0-pg | Pending | 1 | 0 | 119m |
| ray-kwok-raycluster-h100-q2-a1-pg | Running | 8 | 8 | 119m |
| ray-kwok-raycluster-h100-q2-a2-pg | Running | 11 | 11 | 175m |
| ray-kwok-raycluster-h100-q2-b-pg | Running | 21 | 21 | 174m |
Total cluster allocation: 320 H100s (8×8 + 11×8 + 21×8 = 40 members × 8 GPUs)
Submitting a new job, ray-kwok-raycluster-h100-q2-a1-0-pg (8 GPUs), gets stuck in Pending; the enqueue logs show the capacity plugin rejecting it:
I1017 21:10:21.430616 1 enqueue.go:45] Enter Enqueue ...
I1017 21:10:21.430622 1 enqueue.go:63] Added Queue for Job <volcano/ray-kwok-raycluster-h100-q2-a2-pg>
I1017 21:10:21.430626 1 enqueue.go:63] Added Queue for Job <volcano/ray-kwok-raycluster-h100-q2-b-pg>
I1017 21:10:21.430633 1 enqueue.go:63] Added Queue for Job <volcano/ray-kwok-raycluster-h100-q2-a1-0-pg>
I1017 21:10:21.430639 1 enqueue.go:74] Added Job <volcano/ray-kwok-raycluster-h100-q2-a1-0-pg> into Queue
I1017 21:10:21.430646 1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1017 21:10:21.430690 1 capacity.go:772] job ray-kwok-raycluster-h100-q2-a1-0-pg min resource <cpu 8000.00, memory 8589934592.00, nvidia.com/h100 8000.00>, queue root capability <cpu 751370.00, memory 1122232659968.00, nvidia.com/a100 320000.00, nvidia.com/h100 320000.00> allocated <cpu 320000.00, memory 343597383680.00, nvidia.com/h100 320000.00> inqueue <cpu 0.00, memory 0.00> elastic <cpu 0.00, memory 0.00>
I1017 21:10:21.430715 1 enqueue.go:104] Leaving Enqueue
Expected Behavior
The job should:
- Transition to Inqueue state even when the cluster is at capacity
- Allow the reclaim action to evaluate whether resources can be reclaimed from queues exceeding their deserved allocation
root: 320/320 H100 (FULL)
├── child-queue-a: 152/320 H100 (deserved: 160)
│ ├── subchild-queue-a1: 64/160 H100 (deserved: 80) ← Job wants to go here.
│ └── subchild-queue-a2: 88/160 H100 (deserved: 80) ← OVER deserved!
└── child-queue-b: 168/320 H100 (deserved: 160) ← OVER deserved!
Root Cause
The capacity plugin's jobEnqueueable check https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L884-L894 uses the formula:
r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
return r.LessEqualWithDimension(attr.realCapability, minReq)
When checking hierarchically, this is applied to the root queue (cluster total). When allocated == capability (the cluster is fully allocated) and elastic == 0 (jobs are running at their minimum resources), this check fails even though:
- Resources exist in other queues that could be reclaimed
- Some queues are over their deserved share
- All queues are marked reclaimable
The elastic metric only tracks resources within a job that exceed its minimum requirements, not reclaimable resources across the queue hierarchy.
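Plugging the numbers from the capacity.go log line above into that formula, restricted to the nvidia.com/h100 dimension and converted from the scheduler's milli-units (320000.00 = 320 GPUs) to plain GPU counts, shows why the check can never pass while the cluster stays fully allocated. The snippet below is a standalone re-derivation for illustration, not the plugin's actual code:

```go
package main

import "fmt"

func main() {
	// Root-queue values from the capacity.go log line, nvidia.com/h100 only.
	minReq := 8.0          // new job's minimum request
	allocated := 320.0     // everything already running
	inqueue := 0.0         // nothing waiting in Inqueue
	elastic := 0.0         // all jobs run at their minimum, nothing elastic
	realCapability := 320.0

	// r := minReq.Clone().Add(allocated).Add(inqueue).Sub(elastic), one dimension
	r := minReq + allocated + inqueue - elastic

	// r.LessEqualWithDimension(realCapability, minReq) for this dimension:
	enqueueable := r <= realCapability
	fmt.Printf("r = %.0f, capability = %.0f, enqueueable = %v\n", r, realCapability, enqueueable)
	// Output: r = 328, capability = 320, enqueueable = false
	// Reclaimable headroom in over-deserved queues never enters this formula.
}
```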
Fix
Option 1: In the capacity plugin, allow enqueue when reclaimable resources exist elsewhere in the hierarchy:
func (cp *capacityPlugin) jobEnqueueable(queue *api.QueueInfo, job *api.JobInfo) bool {
    // ... existing capacity check; if it passes, return true as today ...

    // If the cluster is at capacity, still allow enqueue when resources could be
    // reclaimed from reclaimable queues that exceed their deserved share.
    if cp.hasReclaimableResourcesInCluster(queue, job) {
        return true
    }
    return false
}
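For illustration, a minimal sketch of what such a helper could do, using a toy single-dimension stand-in for the plugin's per-queue bookkeeping rather than its real *api.Resource attributes; hasReclaimableResourcesInCluster is a name proposed in this issue, not an existing function:

```go
package main

import "fmt"

// Toy stand-in for the plugin's per-queue bookkeeping; the real plugin tracks
// these as *api.Resource values per queue.
type queueAttr struct {
	name        string
	reclaimable bool
	allocated   float64 // H100 GPUs, single dimension for illustration
	deserved    float64
}

// hasReclaimableResourcesInCluster (hypothetical): the cluster can make room for
// minReq if reclaimable queues collectively hold more than their deserved share
// by at least that amount.
func hasReclaimableResourcesInCluster(queues []queueAttr, minReq float64) bool {
	surplus := 0.0
	for _, q := range queues {
		if q.reclaimable && q.allocated > q.deserved {
			surplus += q.allocated - q.deserved
		}
	}
	return surplus >= minReq
}

func main() {
	queues := []queueAttr{
		{"subchild-queue-a1", true, 64, 80},
		{"subchild-queue-a2", true, 88, 80}, // 8 GPUs over deserved
		{"child-queue-b", true, 168, 160},   // 8 GPUs over deserved
	}
	fmt.Println(hasReclaimableResourcesInCluster(queues, 8)) // true: 16 GPUs reclaimable
}
```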
Option 2 (#4680): Skip the check at the root queue level.
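The rough idea behind Option 2, again sketched over toy types rather than the plugin's real hierarchy walk (the actual change is in #4680): apply the capacity formula to every ancestor queue except the root, so a fully allocated cluster no longer blocks enqueue by itself while per-queue capacities are still checked.

```go
package main

import "fmt"

// Toy queue node, single H100 dimension for illustration.
type queue struct {
	name           string
	parent         *queue
	allocated      float64
	realCapability float64
}

// enqueueableHierarchically walks from the job's queue up to (but excluding)
// the root, applying the capacity check at each level.
func enqueueableHierarchically(q *queue, minReq float64) bool {
	for cur := q; cur.parent != nil; cur = cur.parent { // stop before the root
		if cur.allocated+minReq > cur.realCapability {
			return false
		}
	}
	return true
}

func main() {
	root := &queue{name: "root", allocated: 320, realCapability: 320}
	childA := &queue{name: "child-queue-a", parent: root, allocated: 152, realCapability: 320}
	a1 := &queue{name: "subchild-queue-a1", parent: childA, allocated: 64, realCapability: 160}

	// 8-GPU job targeting subchild-queue-a1: allowed, because only the non-root
	// ancestors are checked and they still have headroom.
	fmt.Println(enqueueableHierarchically(a1, 8)) // true
}
```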
Steps to reproduce the issue
Describe the results you received and expected
See description.
What version of Volcano are you using?
1.12.0
Any other relevant information
No response