What happened:
We have a ClusterQueue configured with queueingStrategy: StrictFIFO and multiple ResourceFlavors (B200 and B300 GPU pools) with independent quotas. A workload requesting B300 GPUs (position 0 in the queue) cannot be admitted because B300 is fully utilized. However, this single stuck workload blocks all subsequent workloads requesting B200 GPUs, even though B200 has 40 available GPUs.
Observed queue state:
- Position 0: job-1 (B300-raid0, 8 GPUs) - Cannot admit (B300: 16/16 used)
- Position 1: job-2 (B200-raid0, 8 GPUs) - BLOCKED, never evaluated
- Position 2: job-3 (B200-raid0) - BLOCKED, never evaluated
- Position 3: job-4 (B200-raid0, 2 GPUs) - BLOCKED, never evaluated
Evidence:
- Only the head workload (position 0) appears in Kueue controller logs
- Zero log entries exist for positions 1-3 workloads
- B200 has 0/40 GPUs used, B300 has 16/16 GPUs used
Controller log shows only the head workload:
{"msg":"couldn't assign flavors to pod set main: insufficient unused quota for nvidia.com/gpu in flavor pool-b300, 8 more needed", "object":{"name":"job-1"}}What you expected to happen:
- Option A (per-flavor FIFO): Workloads requesting B200 should be evaluated and admitted even when the B300 workload at the head cannot be admitted, since they're using completely independent resource pools.
- Option B (current global FIFO): If cross-flavor blocking is intentional, this should be clearly documented as it has significant operational implications.
How to reproduce it (as minimally and precisely as possible):
- Create a ClusterQueue with StrictFIFO and multiple flavors:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: test-cq
spec:
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: flavor-a
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"
    - name: flavor-b
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "40"
- Submit a workload requesting flavor-a (will consume all 16 GPUs); illustrative submission manifests are sketched after these steps
- Submit workload-1 requesting flavor-a (8 GPUs) - will be blocked due to no capacity
- Submit workload-2 requesting flavor-b (8 GPUs) - should fit but will be blocked
- Observe that workload-2 never gets evaluated despite flavor-b having 40 available GPUs
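For concreteness, the submissions in the steps above might look roughly like the following. This is only a sketch: the ResourceFlavor node labels, namespace, image, and object names are assumptions, and flavor targeting here relies on the Job's nodeSelector matching flavor-b's nodeLabels.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: flavor-b
spec:
  nodeLabels:
    gpu-pool: b200            # assumed node label for the B200 pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: test-lq               # assumed name
  namespace: default          # assumed namespace
spec:
  clusterQueue: test-cq
---
apiVersion: batch/v1
kind: Job
metadata:
  name: workload-2            # the flavor-b job from the steps above
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: test-lq
spec:
  suspend: true               # created suspended so Kueue controls admission
  template:
    spec:
      nodeSelector:
        gpu-pool: b200        # steers flavor assignment to flavor-b (assuming flavor-a's nodeLabels differ)
      containers:
      - name: main
        image: registry.example.com/gpu-job:latest   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: "8"
      restartPolicy: Never
workload-1 would have the same shape with the flavor-a node selector; regardless of which LocalQueue is used, all of these land in test-cq's single StrictFIFO queue.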
Code reference:
From pkg/cache/queue/manager.go:688-710 (simplified excerpt), the scheduler pops ONE workload per ClusterQueue per cycle:
func (m *Manager) heads() []workload.Info {
    // Simplified: bookkeeping details omitted
    var workloads []workload.Info
    for _, cq := range m.hm.ClusterQueues() {
        if wl := cq.Pop(); wl != nil { // pops only one workload (the head) per CQ per cycle
            workloads = append(workloads, *wl)
        }
    }
    return workloads
}
With StrictFIFO, Pop() returns workloads in strict creation order, regardless of ResourceFlavor.
Anything else we need to know?:
Impact:
- Head-of-line blocking across independent resource pools
- Poor resource utilization (40 idle B200 GPUs while jobs wait)
- Defeats the purpose of multiple flavors with separate quotas
Documentation gap:
- The current docs state: "Older workloads that can't be admitted will block newer workloads, even if the newer workloads fit in the available quota"
This is ambiguous regarding whether blocking applies:
- Within the same flavor only, or
- Across all flavors in the ClusterQueue
Questions:
- Is cross-flavor blocking the intended behavior of StrictFIFO?
- If yes, what's the recommended architecture for multi-flavor setups requiring FIFO ordering?
- Should separate ClusterQueues be created per flavor family to avoid this?
- Would a per-flavor FIFO mode be considered as a feature enhancement?
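To make the last question concrete, one purely hypothetical shape for such a knob could be the following; the field name is invented here for illustration only and is not part of any existing Kueue API:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: test-cq
spec:
  queueingStrategy: StrictFIFO
  # HYPOTHETICAL field -- does not exist in Kueue today; shown only to
  # illustrate the idea of limiting head-of-line blocking to each flavor
  strictFIFOScope: PerFlavor
  # resourceGroups: unchanged from the reproduction above
The intended semantics would be that a stuck head workload only blocks later workloads whose requests could be assigned to the same flavor.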
Workarounds considered:
- BestEffortFIFO: Loses strict ordering guarantees
- Separate ClusterQueues per flavor: Management overhead, prevents inter-flavor borrowing (sketched after this list)
- Delete blocking workload: Not operationally sustainable
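For reference, the "separate ClusterQueues per flavor" workaround (second bullet above) would look roughly like this; the names are assumptions, and per-GPU-type LocalQueues would also be needed:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-b300               # assumed name
spec:
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: flavor-a
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-b200               # assumed name
spec:
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: flavor-b
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "40"
Each ClusterQueue then has its own StrictFIFO head, so a stuck B300 workload no longer blocks B200 workloads, at the cost of users having to pick the right queue per GPU type.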
Environment:
- Kubernetes version: v1.31.13-eks-ecaa3a6
- Kueue version: v1beta1 (ClusterQueue API version)
- Cloud provider: AWS EKS
- Hardware: p6-b200.48xlarge (B200 GPUs), p6-b300.48xlarge (B300 GPUs)
- OS: Amazon Linux 2023
- Install tools: Deployed via ArgoCD