StrictFIFO blocks workloads requesting different ResourceFlavors in same ClusterQueue #8309

@sfc-gh-raravena

Description

What happened:

We have a ClusterQueue configured with queueingStrategy: StrictFIFO and multiple ResourceFlavors (B200 and B300 GPU pools) with independent quotas. A workload requesting B300 GPUs (position 0 in queue) cannot be admitted because B300 is fully utilized. However, this single stuck workload is blocking all subsequent workloads requesting B200 GPUs, even though B200 has 40 available GPUs.

Observed queue state:

  • Position 0: job-1 (B300-raid0, 8 GPUs) - Cannot admit (B300: 16/16 used)
  • Position 1: job-2 (B200-raid0, 8 GPUs) - BLOCKED, never evaluated
  • Position 2: job-3 (B200-raid0) - BLOCKED, never evaluated
  • Position 3: job-4 (B200-raid0, 2 GPUs) - BLOCKED, never evaluated

Evidence:

  • Only the head workload (position 0) appears in Kueue controller logs
  • Zero log entries exist for positions 1-3 workloads
  • B200 has 0/40 GPUs used, B300 has 16/16 GPUs used

Controller log shows only the head workload:

{"msg":"couldn't assign flavors to pod set main: insufficient unused quota for nvidia.com/gpu in flavor pool-b300, 8 more needed", "object":{"name":"job-1"}}

What you expected to happen:

  • Option A (per-flavor FIFO): Workloads requesting B200 should be evaluated and admitted even when the B300 workload at the head cannot be admitted, since they're using completely independent resource pools.
  • Option B (current global FIFO): If cross-flavor blocking is intentional, this should be clearly documented as it has significant operational implications.

How to reproduce it (as minimally and precisely as possible):

  1. Create a ClusterQueue with StrictFIFO and multiple flavors:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: test-cq
spec:
  namespaceSelector: {}   # match all namespaces
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: flavor-a
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"
    - name: flavor-b
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "40"
  2. Submit a workload requesting flavor-a (will consume all 16 GPUs); example manifests for the flavors, LocalQueue, and Jobs are sketched after this list
  3. Submit workload-1 requesting flavor-a (8 GPUs) - will be blocked due to no capacity
  4. Submit workload-2 requesting flavor-b (8 GPUs) - should fit but will be blocked
  5. Observe that workload-2 never gets evaluated despite flavor-b having 40 available GPUs
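
For completeness, here is a minimal sketch of the supporting objects the repro assumes: the two ResourceFlavors, a LocalQueue pointing at test-cq, and the Job shape used for the workloads in steps 2-5. Names such as team-queue, the gpu-pool node label, and the default namespace are illustrative, not taken from our cluster.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: flavor-a
spec:
  nodeLabels:
    gpu-pool: pool-a            # illustrative label for the 16-GPU pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: flavor-b
spec:
  nodeLabels:
    gpu-pool: pool-b            # illustrative label for the 40-GPU pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-queue              # illustrative name
  namespace: default
spec:
  clusterQueue: test-cq
---
# Step 2: 2 pods x 8 GPUs fills flavor-a's 16-GPU quota; vary parallelism,
# the GPU count, and the nodeSelector for steps 3-5.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: gpu-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: team-queue
spec:
  parallelism: 2
  completions: 2
  suspend: true                 # Kueue unsuspends the Job once quota is reserved
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        gpu-pool: pool-a        # steers flavor assignment; use pool-b for flavor-b jobs
      containers:
      - name: main
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["sleep", "3600"]
        resources:
          limits:
            nvidia.com/gpu: "8"

The nodeSelector is what steers flavor assignment here: a ResourceFlavor whose nodeLabels conflict with the pod's nodeSelector is skipped, so jobs selecting gpu-pool: pool-b can only be counted against flavor-b.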

Code reference:
From pkg/cache/queue/manager.go:688-710 (simplified below), the scheduler pops only one workload per ClusterQueue per scheduling cycle:

// Simplified: at most one head workload is taken per ClusterQueue per cycle.
func (m *Manager) heads() []workload.Info {
    var workloads []workload.Info
    for _, cq := range m.hm.ClusterQueues() {
        if wl := cq.Pop(); wl != nil { // pops only the head workload of this CQ
            workloads = append(workloads, *wl)
        }
    }
    return workloads
}

With StrictFIFO, Pop() returns workloads strictly in priority and creation-timestamp order, regardless of which ResourceFlavor they request.

Anything else we need to know?:

Impact:

  • Head-of-line blocking across independent resource pools
  • Poor resource utilization (40 idle B200 GPUs while jobs wait)
  • Defeats the purpose of multiple flavors with separate quotas

Documentation gap:

  • The current docs state: "Older workloads that can't be admitted will block newer workloads, even if the newer workloads fit in the available quota"

This is ambiguous regarding whether blocking applies:

  • Within the same flavor only, or
  • Across all flavors in the ClusterQueue

Questions:

  1. Is cross-flavor blocking the intended behavior of StrictFIFO?
  2. If yes, what's the recommended architecture for multi-flavor setups requiring FIFO ordering?
  3. Should separate ClusterQueues be created per flavor family to avoid this?
  4. Would a per-flavor FIFO mode be considered as a feature enhancement?

Workarounds considered:

  • BestEffortFIFO: Loses strict ordering guarantees
  • Separate ClusterQueues per flavor: Management overhead, prevents inter-flavor borrowing (a minimal sketch of this layout follows after this list)
  • Delete blocking workload: Not operationally sustainable
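
For reference, a minimal sketch of the separate-ClusterQueues layout mentioned above (queue names cq-flavor-a/cq-flavor-b are illustrative, reusing the flavors from the repro). Each queue keeps StrictFIFO ordering, but only among workloads of its own flavor, so a stuck flavor-a workload no longer blocks flavor-b:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-flavor-a             # one ClusterQueue per flavor family
spec:
  namespaceSelector: {}
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: flavor-a
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "16"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-flavor-b
spec:
  namespaceSelector: {}
  queueingStrategy: StrictFIFO
  resourceGroups:
  - coveredResources:
    - nvidia.com/gpu
    flavors:
    - name: flavor-b
      resources:
      - name: nvidia.com/gpu
        nominalQuota: "40"

The cost is the one noted above: each ClusterQueue needs its own LocalQueue, users have to target the right queue per flavor, and a workload admitted through cq-flavor-a can never fall back to flavor-b capacity.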

Environment:

  • Kubernetes version: v1.31.13-eks-ecaa3a6
  • Kueue version: v1beta1 (ClusterQueue API version)
  • Cloud provider: AWS EKS
  • Hardware: p6-b200.48xlarge (B200 GPUs), p6-b300.48xlarge (B300 GPUs)
  • OS: Amazon Linux 2023
  • Install tools: Deployed via ArgoCD

Labels: kind/bug, priority/important-longterm
