
Gang Scheduling Race Condition During Rolling Updates #316

@gflarity

Summary

Rolling updates can become permanently stuck when pods are created during the brief window in which PodGang/PodClique resources are being recreated. The affected pods remain in SchedulingGated state indefinitely because the gang scheduler cannot associate them with their gang resources. The test Test_RU19_RollingUpdateWithPodCliqueScaleOutBeforeUpdate fails intermittently with a 4-minute timeout.

e2e-failure-analysis-RU19.diag.txt

What Should Happen

During a rolling update:

  1. Rolling update is triggered when PodCliqueSet spec changes (detected via PodTemplateHash comparison)
  2. PodCliqueSet controller reconciles:
    • prepareSyncFlow() snapshots all existing pods into the sync context
    • Excess PodGangs are deleted (if scale changed)
    • PodGangs are created/updated via CreateOrPatch, with PodReferences populated from the sync context snapshot
  3. PodClique resources are updated in-place via CreateOrPatch with new specs (UIDs remain stable since resources aren't deleted)
  4. Old pods (with outdated template hash) are marked for deletion one at a time
  5. New replacement pods are created with LabelPodGang label and store the current PodClique UID for validation
  6. PodCliqueSet controller reconciles again, captures new pods in sync context, and updates PodGang's PodReferences to include them
  7. PodClique controller sees pods are in PodReferences, removes their scheduling gates
  8. Scheduler plugin validates that the pod's stored UID matches the current PodClique UID (see the sketch after this list) ✓
  9. Pods get scheduled and become ready
  10. Once minAvailable is satisfied, rolling update progresses to the next pod

What Happens (Intermittently)

The system ends up in a deadlocked state with the following observed problems:

  1. Empty subgroup in PodGang: Subgroup workload1-0-sg-x-0-pc-b has podReferences: [] but requires minReplicas: 1

  2. Pods stuck in SchedulingGated: 7 pods remain permanently in SchedulingGated state with their scheduling gates never removed

  3. Gang cannot be scheduled: The PodGroup reports Job is not ready for scheduling. Waiting for 1 pods for SubGroup...

  4. UID mismatch warnings: Some pods received PodGrouperWarning events of the form failed to find PodClique... uid: <old-uid>

  5. Rolling update deadlocked: The controller repeatedly requeues with available replicas 1 lesser than minAvailable 2 until timeout

Evidence from Diagnostics

1. Pod Status at Timeout

All 7 pods for replica index 0 (workload1-0) are stuck in SchedulingGated state, while replica index 1 (workload1-1) pods are running normally:

NAME                                     PHASE        READY      NODE                   CONDITIONS
workload1-0-pc-a-k7f5r                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-pc-a-kzqls                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-pc-a-r8xzv                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-pc-a-sq7n9                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-8d44l            Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-kjsx5            Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-r554q            Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-1-pc-a-75dcs                   Running      1/1        k3d-shared-e2e-test-cluster-agent-18    OK
workload1-1-pc-a-95tt9                   Running      1/1        k3d-shared-e2e-test-cluster-agent-19    OK
... (workload1-1 pods all running)

Source: Lines 2953-2977 of e2e-failure-analysis-RU19.diag.txt

2. PodClique Status Shows All Pods Gated

The PodClique workload1-0-pc-a status confirms the stuck state:

status:
  conditions:
  - message: 'Insufficient scheduled pods. expected at least: 2, found: 0'
    reason: InsufficientScheduledPods
    status: "False"
    type: PodCliqueScheduled
  readyReplicas: 0
  replicas: 4
  scheduleGatedReplicas: 4    # ← All 4 pods are gated
  scheduledReplicas: 0        # ← None have been scheduled
  updatedReplicas: 4          # ← Pods have new spec but can't be scheduled
  rollingUpdateProgress:
    updateStartedAt: "2026-01-12T18:48:38Z"
    readyPodsSelectedToUpdate:
      current: workload1-0-pc-a-bvwnb

Source: Lines 935-965 of e2e-failure-analysis-RU19.diag.txt

3. PodGrouperWarning Events

The Kubernetes events show the race condition during the rolling update phase (starting at 18:48:38):

2026-01-12T18:48:39  Normal   PodCliqueScalingGroupR...  PodCliqueSet/workload1     Deleted PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0
2026-01-12T18:48:39  Normal   PodCreateSuccessful       PodClique/workload1-0-sg-x-0-pc-b   Created Pod: workload1-0-sg-x-0-pc-b-nctvs
2026-01-12T18:48:39  Normal   PodCliqueCreateSuccessful PodCliqueScalingGroup/...   PodClique default/workload1-0-sg-x-0-pc-c created successfully
2026-01-12T18:48:39  Normal   PodCliqueCreateSuccessful PodCliqueScalingGroup/...   PodClique default/workload1-0-sg-x-0-pc-b created successfully
2026-01-12T18:48:41  Warning  PodGrouperWarning         Pod/workload1-0-sg-x-0-pc-b-nctvs   failed to find PodClique: <default/workload1-0-sg-x-0-pc-b>, uid: <bc4a45c2-1...
2026-01-12T18:48:42  Warning  PodGrouperWarning         Pod/workload1-0-pc-a-kzqls          error assigning pods to subgroup: pods "workload1-0-sg-x-0-pc-c-gpbhk" not found
2026-01-12T18:48:42  Warning  PodGrouperWarning         Pod/workload1-0-sg-x-0-pc-c-gpbhk   failed to find PodClique: <default/workload1-0-sg-x-0-pc-c>, uid: <68db0da5-c...
2026-01-12T18:48:44  Normal   NotReady                  PodGroup/pg-workload1-0-...         Job is not ready for scheduling. Waiting for 1 pods for SubGroup workload1-0-...

Key observations:

  • At 18:48:39: PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0 is deleted, then new PodCliques are recreated with new UIDs
  • At 18:48:41: Pod workload1-0-sg-x-0-pc-b-nctvs has a UID mismatch — it stored the old PodClique UID
  • At 18:48:42: Pod workload1-0-pc-a-kzqls fails with error assigning pods to subgroup: pods "workload1-0-sg-x-0-pc-c-gpbhk" not found — the gang scheduler can't find a referenced pod
  • At 18:48:44: The PodGroup reports it's waiting for pods that will never come

Source: Lines 4685-4719 of e2e-failure-analysis-RU19.diag.txt

4. PodGang Has Empty Subgroup

The PodGang workload1-0 shows that subgroup workload1-0-sg-x-0-pc-b has no pods despite requiring minReplicas: 1:

spec:
  podgroups:
  - minReplicas: 2
    name: workload1-0-pc-a
    podReferences:
    - name: workload1-0-pc-a-k7f5r      # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-kzqls      # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-r8xzv      # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-sq7n9      # ← Stuck in SchedulingGated
  - minReplicas: 1
    name: workload1-0-sg-x-0-pc-b
    podReferences: []                    # ← EMPTY! Needs 1 pod, has 0
  - minReplicas: 3
    name: workload1-0-sg-x-0-pc-c
    podReferences:
    - name: workload1-0-sg-x-0-pc-c-8d44l  # ← Stuck in SchedulingGated
    - name: workload1-0-sg-x-0-pc-c-kjsx5  # ← Stuck in SchedulingGated
    - name: workload1-0-sg-x-0-pc-c-r554q  # ← Stuck in SchedulingGated

The gang scheduler requires ALL subgroups to have their minReplicas satisfied. Since workload1-0-sg-x-0-pc-b has 0 pods (but needs 1), the entire gang is blocked forever.

Source: Lines 2824-2848 of e2e-failure-analysis-RU19.diag.txt

5. Controller Stuck Waiting for minAvailable

The controller repeatedly logs that it can't proceed because availableReplicas < minAvailable:

{
  "level": "info",
  "ts": "2026-01-12T18:51:49.470Z",
  "logger": "podcliquescalinggroup-controller",
  "msg": "components has registered a request to requeue post completion of all components syncs",
  "PodCliqueScalingGroup": {"name": "workload1-0-sg-x", "namespace": "default"},
  "kind": "PodClique",
  "message": "[Operation: Sync, Code: ERR_CONTINUE_RECONCILE_AND_REQUEUE] message: available replicas 1 lesser than minAvailable 2, requeuing"
}

This message repeats every ~5 seconds from 18:51:49 until the test timeout at 18:52:38, showing the system is stuck in an infinite requeue loop.

Source: Lines 99, 148, 197, 246, 293 of e2e-failure-analysis-RU19.diag.txt

6. PodCliqueSet Shows Rolling Update Stuck

The PodCliqueSet status shows the rolling update is blocked on replica index 0:

status:
  availableReplicas: 1
  replicas: 2
  updatedReplicas: 0
  rollingUpdateProgress:
    currentlyUpdating:
      replicaIndex: 0                           # ← Stuck on replica 0
      updateStartedAt: "2026-01-12T18:48:38Z"
    updateStartedAt: "2026-01-12T18:48:38Z"

The rolling update started at 18:48:38 and was still stuck at 18:52:38 (4 minutes later) when the test timed out.

Source: Lines 770-780 of e2e-failure-analysis-RU19.diag.txt

Timeline Summary

Time         Event                          State
18:48:38     Rolling update starts          updateStartedAt set, old pods being deleted
18:48:39     PodCliqueScalingGroup deleted  workload1-0-sg-x replicaIndex: 0 deleted
18:48:39     New PodCliques created         New UIDs assigned to recreated PodCliques
18:48:39-41  New pods created               Some pods store old PodClique UIDs
18:48:41     UID mismatch errors            failed to find PodClique... uid: <old>
18:48:42     Gang assignment fails          error assigning pods to subgroup: pods "X" not found
18:48:44     Gang blocked                   Job is not ready for scheduling. Waiting for 1 pods...
18:51:49+    Controller requeue loop        "available replicas 1 lesser than minAvailable 2"
18:52:38     Test timeout                   4 minutes with no progress

Update: Additional Evidence

PodClique IS Recreated

The diagnostics show the PodClique IS recreated:

2026-01-12T18:48:39  Normal   PodCliqueCreateSuccessful PodCliqueScalingGroup/...   PodClique default/workload1-0-sg-x-0-pc-b created successfully

Source: Line 4695 of diagnostics

New PodClique Status at Test Failure

The PodClique workload1-0-sg-x-0-pc-b exists at test failure time with the following status:

metadata:
  creationTimestamp: "2026-01-12T18:48:39Z"
  generation: 1
  uid: 7e97287e-5b8f-4b9f-935c-d4b66bb974ad    # NEW UID
spec:
  replicas: 1                                   # Should have 1 pod
status:
  observedGeneration: 1                         # Controller DID process it
  readyReplicas: 0
  scheduleGatedReplicas: 0                      # NO pods are even gated
  scheduledReplicas: 0
  updatedReplicas: 0                            # NO pods exist at all

Source: Lines 968-1150 of diagnostics

Note: The observedGeneration: 1 matches generation: 1, indicating the PodClique controller reconciled this resource.

Pod Counts at Test Failure

The pod list at test failure shows zero pods for workload1-0-sg-x-0-pc-b:

Pod Pattern                  Count   Status
workload1-0-pc-a-*           4       SchedulingGated
workload1-0-sg-x-0-pc-b-*    0       NONE EXIST
workload1-0-sg-x-0-pc-c-*    3       SchedulingGated
workload1-0-sg-x-1-pc-b-*    1       Running

Source: Lines 2951-2975 of diagnostics

Event Sequence at 18:48:39

Events within the same second (see the sketch after this list):

  1. PCSG deleted: Deleted PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0
  2. Pod created by OLD PodClique: Created Pod: workload1-0-sg-x-0-pc-b-nctvs (stores OLD UID bc4a45c2...)
  3. NEW PodClique created: PodClique default/workload1-0-sg-x-0-pc-b created successfully (NEW UID 7e97287e...)
  4. UID mismatch reported (18:48:41): Pod nctvs fails validation with failed to find PodClique... uid: <bc4a45c2...>
  5. At test failure: Pod nctvs does not exist; NEW PodClique has updatedReplicas: 0

Diagnostic Gap: Missing PodClique Controller Logs

The diagnostics contain no logs from podclique-controller. Only these controllers are logged:

  • podcliqueset-controller
  • podcliquescalinggroup-controller
