
Capacity plugin blocks job enqueue when cluster is at capacity, preventing reclaim action from working #4679

@mtian29

Description

Current Behavior

When the cluster's capacity is fully allocated, the capacity plugin blocks new jobs from transitioning from Pending to Inqueue state. This prevents the reclaim action from ever considering these jobs for resource reclamation, even when:

  • Queues are marked as reclaimable: true
  • Some queues are over their deserved allocation and should yield resources

This creates a chicken-and-egg situation:

  1. Job is in Pending state
  2. Capacity plugin blocks enqueue because cluster is at capacity (root queue fully allocated)
  3. Job remains in Pending state
  4. Reclaim action skips Pending jobs
  5. No reclamation happens, job stays stuck

Queue Hierarchy:

root (capacity: 320 H100 = cluster total)
├── child-queue-a (capacity: 320, deserved: 160)
│   ├── subchild-queue-a1 (capacity: 160, deserved: 80)
│   └── subchild-queue-a2 (capacity: 160, deserved: 80)
└── child-queue-b (capacity: 320, deserved: 160)

All queues have reclaimable: true and each job member requests 8 H100 GPUs.

Current State:

| NAME                                | STATUS  | MINMEMBER | RUNNINGS | AGE   |
| :---------------------------------- | :------ | :-------- | :------- | :---- |
| ray-kwok-raycluster-h100-q2-a1-0-pg | Pending | 1         | 0        | 119m  |
| ray-kwok-raycluster-h100-q2-a1-pg   | Running | 8         | 8        | 119m  |
| ray-kwok-raycluster-h100-q2-a2-pg   | Running | 11        | 11       | 175m  |
| ray-kwok-raycluster-h100-q2-b-pg    | Running | 21        | 21       | 174m  |

Total cluster allocation: 320 H100s (8×8 + 11×8 + 21×8 = 40 members × 8 GPUs)

Submitting a new job, ray-kwok-raycluster-h100-q2-a1-0-pg (8 GPUs), gets stuck:

I1017 21:10:21.430616       1 enqueue.go:45] Enter Enqueue ...
I1017 21:10:21.430622       1 enqueue.go:63] Added Queue  for Job <volcano/ray-kwok-raycluster-h100-q2-a2-pg>
I1017 21:10:21.430626       1 enqueue.go:63] Added Queue  for Job <volcano/ray-kwok-raycluster-h100-q2-b-pg>
I1017 21:10:21.430633       1 enqueue.go:63] Added Queue  for Job <volcano/ray-kwok-raycluster-h100-q2-a1-0-pg>
I1017 21:10:21.430639       1 enqueue.go:74] Added Job <volcano/ray-kwok-raycluster-h100-q2-a1-0-pg> into Queue 
I1017 21:10:21.430646       1 enqueue.go:79] Try to enqueue PodGroup to 1 Queues
I1017 21:10:21.430690       1 capacity.go:772] job ray-kwok-raycluster-h100-q2-a1-0-pg min resource <cpu 8000.00, memory 8589934592.00, nvidia.com/h100 8000.00>, queue root capability <cpu 751370.00, memory 1122232659968.00, nvidia.com/a100 320000.00, nvidia.com/h100 320000.00> allocated <cpu 320000.00, memory 343597383680.00, nvidia.com/h100 320000.00> inqueue <cpu 0.00, memory 0.00> elastic <cpu 0.00, memory 0.00>
I1017 21:10:21.430715       1 enqueue.go:104] Leaving Enqueue 

Expected Behavior

The job should:

  1. Transition to Inqueue state even when the cluster is at capacity
  2. Allow the reclaim action to evaluate if resources can be reclaimed from queues exceeding their deserved allocation
Allocation at the time the new job is submitted:

root: 320/320 H100 (FULL)
├── child-queue-a: 152/320 H100 (deserved: 160)
│   ├── subchild-queue-a1: 64/160 H100 (deserved: 80) ← job wants to go here
│   └── subchild-queue-a2: 88/160 H100 (deserved: 80) ← OVER deserved!
└── child-queue-b: 168/320 H100 (deserved: 160) ← OVER deserved!
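
In this state, subchild-queue-a2 and child-queue-b together hold 16 H100s above their deserved shares, which is more than the 8 H100s the pending job needs. A minimal standalone sketch of that arithmetic (plain Go with made-up types; it does not use the plugin's actual queueAttr or api.Resource code):

package main

import "fmt"

// queueState is a simplified stand-in for the scheduler's per-queue attributes.
type queueState struct {
    name        string
    allocated   int // H100s currently allocated
    deserved    int // H100s the queue deserves
    reclaimable bool
}

func main() {
    queues := []queueState{
        {"subchild-queue-a1", 64, 80, true},
        {"subchild-queue-a2", 88, 80, true},
        {"child-queue-b", 168, 160, true},
    }

    // Sum the GPUs held beyond each reclaimable queue's deserved share.
    surplus := 0
    for _, q := range queues {
        if q.reclaimable && q.allocated > q.deserved {
            surplus += q.allocated - q.deserved
        }
    }

    const jobRequest = 8 // the stuck job asks for 8 H100s
    fmt.Printf("reclaimable surplus=%d H100, job request=%d H100, enough=%v\n",
        surplus, jobRequest, surplus >= jobRequest)
    // Prints: reclaimable surplus=16 H100, job request=8 H100, enough=true
}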

Root Cause

The capacity plugin's jobEnqueueable check (https://github.com/volcano-sh/volcano/blob/master/pkg/scheduler/plugins/capacity/capacity.go#L884-L894) uses the formula:

r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
return r.LessEqualWithDimension(attr.realCapability, minReq)

When queues are checked hierarchically, this formula is also applied to the root queue (the cluster total). When allocated == capability (the cluster is fully allocated) and elastic == 0 (all jobs are running at their minimum resources), the check fails even though:

  • Resources exist in other queues that could be reclaimed
  • Some queues are over their deserved share
  • All queues are marked reclaimable

The elastic metric only tracks resources within a job that exceed its minimum requirement; it does not account for reclaimable resources across the queue hierarchy.
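
For example, plugging the root queue's H100 numbers from the enqueue log above into that formula shows why the check fails (a standalone sketch with plain integers instead of the scheduler's api.Resource type):

package main

import "fmt"

func main() {
    // H100 milli-GPU values from the capacity.go log line above.
    minReq := 8000           // job min resource: 8 H100
    allocated := 320000      // root queue allocated: 320 H100
    inqueue := 0             // no other jobs are Inqueue
    elastic := 0             // every job runs at its minimum, so nothing is elastic
    realCapability := 320000 // root queue capability: 320 H100

    // r := minReq.Clone().Add(attr.allocated).Add(attr.inqueue).Sub(attr.elastic)
    r := minReq + allocated + inqueue - elastic

    // return r.LessEqualWithDimension(attr.realCapability, minReq)
    fmt.Printf("r=%d, capability=%d, enqueueable=%v\n", r, realCapability, r <= realCapability)
    // Prints: r=328000, capability=320000, enqueueable=false
}

The extra 8 H100s of minReq push r past the root capability, so the job never reaches Inqueue, even though 16 H100s sit above the deserved shares of subchild-queue-a2 and child-queue-b.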

Fix

Option 1: In the capacity plugin, still allow enqueue when the cluster is at capacity but resources could be reclaimed:

func (cp *capacityPlugin) jobEnqueueable(queue *api.QueueInfo, job *api.JobInfo) bool {
    // ... existing capacity check; return true if it already passes ...

    // If the cluster is at capacity, still allow enqueue when enough resources
    // could be reclaimed from queues that exceed their deserved share.
    if cp.hasReclaimableResourcesInCluster(queue, job) {
        return true
    }
    return false
}
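
A rough sketch of what such a helper could compute, assuming it walks the reclaimable queues and sums allocation above deserved; hasReclaimableResourcesInCluster, the simpleQueue type, and this traversal are hypothetical simplifications, not existing Volcano code:

// simpleQueue is a hypothetical, flattened view of a leaf queue's attributes.
type simpleQueue struct {
    allocated   map[string]int // resource name -> amount allocated
    deserved    map[string]int // resource name -> deserved amount
    reclaimable bool
}

// hasReclaimableResources reports whether reclaimable queues that exceed their
// deserved share collectively hold at least minReq of every requested resource.
func hasReclaimableResources(queues []simpleQueue, minReq map[string]int) bool {
    surplus := map[string]int{}
    for _, q := range queues {
        if !q.reclaimable {
            continue
        }
        for res, alloc := range q.allocated {
            if over := alloc - q.deserved[res]; over > 0 {
                surplus[res] += over
            }
        }
    }
    for res, req := range minReq {
        if surplus[res] < req {
            return false
        }
    }
    return true
}

In the scenario above this would find 16 H100s of surplus against an 8 H100 request, let the job reach Inqueue, and leave the actual eviction decisions to the reclaim action.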

Option 2: Skip the check at the root queue level (#4680).

Steps to reproduce the issue

Describe the results you received and expected

See description above.

What version of Volcano are you using?

1.12.0

Any other relevant information

No response

Labels

kind/bug