Summary
Rolling updates can get permanently stuck when pods are created during the brief window in which PodGang/PodClique resources are being recreated. The affected pods remain in the `SchedulingGated` state indefinitely because the gang scheduler cannot associate them with their gang resources. The test `Test_RU19_RollingUpdateWithPodCliqueScaleOutBeforeUpdate` fails intermittently, timing out after 4 minutes.
e2e-failure-analysis-RU19.diag.txt
What Should Happen
During a rolling update:
- A rolling update is triggered when the PodCliqueSet spec changes (detected via PodTemplateHash comparison)
- The PodCliqueSet controller reconciles:
  - `prepareSyncFlow()` snapshots all existing pods into the sync context
  - Excess PodGangs are deleted (if the scale changed)
  - PodGangs are created/updated via `CreateOrPatch`, with `PodReferences` populated from the sync context snapshot (see the sketch after this list)
  - PodClique resources are updated in-place via `CreateOrPatch` with new specs (UIDs remain stable since the resources aren't deleted)
- Old pods (with the outdated template hash) are marked for deletion one at a time
- New replacement pods are created with the `LabelPodGang` label and store the current PodClique UID for validation
- The PodCliqueSet controller reconciles again, captures the new pods in the sync context, and updates the PodGang's `PodReferences` to include them
- The PodClique controller sees the pods are in `PodReferences` and removes their scheduling gates
- The scheduler plugin validates that the pod's stored UID matches the current PodClique UID ✓
- Pods get scheduled and become ready
- Once minAvailable is satisfied, the rolling update progresses to the next pod
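Two details in this flow matter for the bug described below: `CreateOrPatch` updates an existing object in place, so its UID is preserved, and a new UID only appears if the object is deleted and recreated. A minimal sketch of that pattern using controller-runtime; the helper name is illustrative and the actual operator code is not shown here:

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// syncInPlace creates obj if it is missing, otherwise patches it in place.
// Applied to a PodClique, this is why its UID should stay stable across a
// rolling update: a new UID only appears when the object has been deleted
// and must be created again.
func syncInPlace(ctx context.Context, c client.Client, obj client.Object, mutate func() error) error {
	_, err := controllerutil.CreateOrPatch(ctx, c, obj, mutate)
	return err
}
```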
What Happens (Intermittently)
The system ends up in a deadlocked state with the following observed problems:
- **Empty subgroup in PodGang**: Subgroup `workload1-0-sg-x-0-pc-b` has `podReferences: []` but requires `minReplicas: 1`
- **Pods stuck in SchedulingGated**: 7 pods remain permanently in the `SchedulingGated` state with their scheduling gates never removed (see the gate-removal sketch after this list)
- **Gang cannot be scheduled**: The PodGroup reports `Job is not ready for scheduling. Waiting for 1 pods for SubGroup...`
- **UID mismatch warnings**: Some pods received `PodGrouperWarning: failed to find PodClique... uid: <old-uid>` errors
- **Rolling update deadlocked**: The controller repeatedly requeues with `available replicas 1 lesser than minAvailable 2` until timeout
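For context on the `SchedulingGated` symptom: a gated pod stays Pending with `PodScheduled=False` (reason `SchedulingGated`) until a controller removes the gate from `pod.spec.schedulingGates`; if that never happens, the scheduler never considers the pod. A minimal sketch of the removal step, using a hypothetical gate name rather than the operator's real one:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// gateName is a hypothetical gate; the real operator uses its own gate name.
const gateName = "example.io/podgang-pending"

// removeSchedulingGate strips the gate from the pod spec and updates the pod.
// Until an update like this succeeds, the pod remains Pending with
// PodScheduled=False, reason SchedulingGated.
func removeSchedulingGate(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	kept := pod.Spec.SchedulingGates[:0]
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != gateName {
			kept = append(kept, g)
		}
	}
	pod.Spec.SchedulingGates = kept
	return c.Update(ctx, pod)
}
```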
Evidence from Diagnostics
1. Pod Status at Timeout
All 7 pods for replica index 0 (workload1-0) are stuck in SchedulingGated state, while replica index 1 (workload1-1) pods are running normally:
```
NAME                            PHASE    READY  NODE                                  CONDITIONS
workload1-0-pc-a-k7f5r          Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-0-pc-a-kzqls          Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-0-pc-a-r8xzv          Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-0-pc-a-sq7n9          Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-8d44l   Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-kjsx5   Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-r554q   Pending  0/1    <unscheduled>                         PodScheduled:SchedulingGated
workload1-1-pc-a-75dcs          Running  1/1    k3d-shared-e2e-test-cluster-agent-18  OK
workload1-1-pc-a-95tt9          Running  1/1    k3d-shared-e2e-test-cluster-agent-19  OK
... (workload1-1 pods all running)
```
Source: Lines 2953-2977 of e2e-failure-analysis-RU19.diag.txt
2. PodClique Status Shows All Pods Gated
The PodClique workload1-0-pc-a status confirms the stuck state:
```yaml
status:
  conditions:
  - message: 'Insufficient scheduled pods. expected at least: 2, found: 0'
    reason: InsufficientScheduledPods
    status: "False"
    type: PodCliqueScheduled
  readyReplicas: 0
  replicas: 4
  scheduleGatedReplicas: 4   # ← All 4 pods are gated
  scheduledReplicas: 0       # ← None have been scheduled
  updatedReplicas: 4         # ← Pods have new spec but can't be scheduled
  rollingUpdateProgress:
    updateStartedAt: "2026-01-12T18:48:38Z"
    readyPodsSelectedToUpdate:
      current: workload1-0-pc-a-bvwnb
```
Source: Lines 935-965 of e2e-failure-analysis-RU19.diag.txt
3. PodGrouperWarning Events
The Kubernetes events show the race condition during the rolling update phase (starting at 18:48:38):
```
2026-01-12T18:48:39 Normal  PodCliqueScalingGroupR...  PodCliqueSet/workload1              Deleted PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0
2026-01-12T18:48:39 Normal  PodCreateSuccessful        PodClique/workload1-0-sg-x-0-pc-b   Created Pod: workload1-0-sg-x-0-pc-b-nctvs
2026-01-12T18:48:39 Normal  PodCliqueCreateSuccessful  PodCliqueScalingGroup/...           PodClique default/workload1-0-sg-x-0-pc-c created successfully
2026-01-12T18:48:39 Normal  PodCliqueCreateSuccessful  PodCliqueScalingGroup/...           PodClique default/workload1-0-sg-x-0-pc-b created successfully
2026-01-12T18:48:41 Warning PodGrouperWarning          Pod/workload1-0-sg-x-0-pc-b-nctvs   failed to find PodClique: <default/workload1-0-sg-x-0-pc-b>, uid: <bc4a45c2-1...
2026-01-12T18:48:42 Warning PodGrouperWarning          Pod/workload1-0-pc-a-kzqls          error assigning pods to subgroup: pods "workload1-0-sg-x-0-pc-c-gpbhk" not found
2026-01-12T18:48:42 Warning PodGrouperWarning          Pod/workload1-0-sg-x-0-pc-c-gpbhk   failed to find PodClique: <default/workload1-0-sg-x-0-pc-c>, uid: <68db0da5-c...
2026-01-12T18:48:44 Normal  NotReady                   PodGroup/pg-workload1-0-...         Job is not ready for scheduling. Waiting for 1 pods for SubGroup workload1-0-...
```
Key observations:
- At 18:48:39: `PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0` is deleted, then new PodCliques are recreated with new UIDs
- At 18:48:41: Pod `workload1-0-sg-x-0-pc-b-nctvs` has a UID mismatch — it stored the old PodClique UID
- At 18:48:42: Pod `workload1-0-pc-a-kzqls` fails with `error assigning pods to subgroup: pods "workload1-0-sg-x-0-pc-c-gpbhk" not found` — the gang scheduler can't find a referenced pod
- At 18:48:44: The PodGroup reports it's waiting for pods that will never come
Source: Lines 4685-4719 of e2e-failure-analysis-RU19.diag.txt
4. PodGang Has Empty Subgroup
The PodGang workload1-0 shows that subgroup workload1-0-sg-x-0-pc-b has no pods despite requiring minReplicas: 1:
```yaml
spec:
  podgroups:
  - minReplicas: 2
    name: workload1-0-pc-a
    podReferences:
    - name: workload1-0-pc-a-k7f5r   # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-kzqls   # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-r8xzv   # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-sq7n9   # ← Stuck in SchedulingGated
  - minReplicas: 1
    name: workload1-0-sg-x-0-pc-b
    podReferences: []                # ← EMPTY! Needs 1 pod, has 0
  - minReplicas: 3
    name: workload1-0-sg-x-0-pc-c
    podReferences:
    - name: workload1-0-sg-x-0-pc-c-8d44l   # ← Stuck in SchedulingGated
    - name: workload1-0-sg-x-0-pc-c-kjsx5   # ← Stuck in SchedulingGated
    - name: workload1-0-sg-x-0-pc-c-r554q   # ← Stuck in SchedulingGated
```
The gang scheduler requires ALL subgroups to have their minReplicas satisfied. Since workload1-0-sg-x-0-pc-b has 0 pods (but needs 1), the entire gang is blocked forever.
Source: Lines 2824-2848 of e2e-failure-analysis-RU19.diag.txt
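The gating behaviour described above is an all-or-nothing invariant over the gang's subgroups. The snippet below illustrates that invariant only; the types and field names are stand-ins, not the actual PodGang or scheduler API:

```go
// Illustration of the gang invariant: every subgroup must meet its
// minReplicas before any pod in the gang is released for scheduling.
type subGroup struct {
	Name          string
	MinReplicas   int
	PodReferences []string
}

// gangReadyForScheduling returns false and the first blocking subgroup if any
// subgroup has fewer pod references than it requires. With
// workload1-0-sg-x-0-pc-b at 0 of 1, the whole gang stays blocked.
func gangReadyForScheduling(subGroups []subGroup) (bool, string) {
	for _, sg := range subGroups {
		if len(sg.PodReferences) < sg.MinReplicas {
			return false, sg.Name
		}
	}
	return true, ""
}
```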
5. Controller Stuck Waiting for minAvailable
The controller repeatedly logs that it can't proceed because availableReplicas < minAvailable:
```json
{
  "level": "info",
  "ts": "2026-01-12T18:51:49.470Z",
  "logger": "podcliquescalinggroup-controller",
  "msg": "components has registered a request to requeue post completion of all components syncs",
  "PodCliqueScalingGroup": {"name": "workload1-0-sg-x", "namespace": "default"},
  "kind": "PodClique",
  "message": "[Operation: Sync, Code: ERR_CONTINUE_RECONCILE_AND_REQUEUE] message: available replicas 1 lesser than minAvailable 2, requeuing"
}
```
This message repeats every ~5 seconds from 18:51:49 until the test timeout at 18:52:38, showing the system is stuck in an infinite requeue loop.
Source: Lines 99, 148, 197, 246, 293 of e2e-failure-analysis-RU19.diag.txt
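The repeating message matches the usual controller-runtime pattern of returning a delayed requeue when a precondition is unmet; if the precondition can never become true (the missing pods are never created), the loop runs until the test times out. A hedged sketch of that pattern, paraphrased from the log rather than taken from the controller source:

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueUntilAvailable illustrates the requeue-on-unmet-minAvailable pattern
// implied by the log: when too few replicas are available, ask to be
// reconciled again after a short delay. If the missing pods can never be
// created (as in this bug), this requeues indefinitely.
func requeueUntilAvailable(available, minAvailable int) (ctrl.Result, error) {
	if available < minAvailable {
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```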
6. PodCliqueSet Shows Rolling Update Stuck
The PodCliqueSet status shows the rolling update is blocked on replica index 0:
```yaml
status:
  availableReplicas: 1
  replicas: 2
  updatedReplicas: 0
  rollingUpdateProgress:
    currentlyUpdating:
      replicaIndex: 0                          # ← Stuck on replica 0
      updateStartedAt: "2026-01-12T18:48:38Z"
    updateStartedAt: "2026-01-12T18:48:38Z"
```
The rolling update started at 18:48:38 and was still stuck at 18:52:38 (4 minutes later) when the test timed out.
Source: Lines 770-780 of e2e-failure-analysis-RU19.diag.txt
Timeline Summary
| Time | Event | State |
|---|---|---|
| 18:48:38 | Rolling update starts | updateStartedAt set, old pods being deleted |
| 18:48:39 | PodCliqueScalingGroup deleted | workload1-0-sg-x replicaIndex: 0 deleted |
| 18:48:39 | New PodCliques created | New UIDs assigned to recreated PodCliques |
| 18:48:39-41 | New pods created | Some pods store old PodClique UIDs |
| 18:48:41 | UID mismatch errors | failed to find PodClique... uid: <old> |
| 18:48:42 | Gang assignment fails | error assigning pods to subgroup: pods "X" not found |
| 18:48:44 | Gang blocked | Job is not ready for scheduling. Waiting for 1 pods... |
| 18:51:49+ | Controller requeue loop | "available replicas 1 lesser than minAvailable 2" |
| 18:52:38 | Test timeout | 4 minutes with no progress |
Update: Additional Evidence
PodClique IS Recreated
The diagnostics show the PodClique IS recreated:
```
2026-01-12T18:48:39 Normal PodCliqueCreateSuccessful PodCliqueScalingGroup/... PodClique default/workload1-0-sg-x-0-pc-b created successfully
```
Source: Line 4695 of diagnostics
New PodClique Status at Test Failure
The PodClique workload1-0-sg-x-0-pc-b exists at test failure time with the following status:
```yaml
metadata:
  creationTimestamp: "2026-01-12T18:48:39Z"
  generation: 1
  uid: 7e97287e-5b8f-4b9f-935c-d4b66bb974ad   # NEW UID
spec:
  replicas: 1                                  # Should have 1 pod
status:
  observedGeneration: 1                        # Controller DID process it
  readyReplicas: 0
  scheduleGatedReplicas: 0                     # NO pods are even gated
  scheduledReplicas: 0
  updatedReplicas: 0                           # NO pods exist at all
```
Source: Lines 968-1150 of diagnostics
Note: The observedGeneration: 1 matches generation: 1, indicating the PodClique controller reconciled this resource.
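That inference rests on the common convention, sketched below with stand-in types, of copying `metadata.generation` into `status.observedGeneration` only after a reconcile has processed the current spec:

```go
// Stand-in types; not the actual PodClique API.
type cliqueStatus struct {
	ObservedGeneration int64
}

type clique struct {
	Generation int64 // metadata.generation, bumped on every spec change
	Status     cliqueStatus
}

// markReconciled is called at the end of a successful reconcile, so
// observedGeneration == generation implies the controller saw the current spec.
func markReconciled(pc *clique) {
	pc.Status.ObservedGeneration = pc.Generation
}
```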
Pod Counts at Test Failure
The pod list at test failure shows zero pods for workload1-0-sg-x-0-pc-b:
| Pod Pattern | Count | Status |
|---|---|---|
| `workload1-0-pc-a-*` | 4 | SchedulingGated |
| `workload1-0-sg-x-0-pc-b-*` | 0 | NONE EXIST |
| `workload1-0-sg-x-0-pc-c-*` | 3 | SchedulingGated |
| `workload1-0-sg-x-1-pc-b-*` | 1 | Running |
Source: Lines 2951-2975 of diagnostics
Event Sequence at 18:48:39
Events within the same second:
- **PCSG deleted**: `Deleted PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0`
- **Pod created by OLD PodClique**: `Created Pod: workload1-0-sg-x-0-pc-b-nctvs` (stores OLD UID `bc4a45c2...`)
- **NEW PodClique created**: `PodClique default/workload1-0-sg-x-0-pc-b created successfully` (NEW UID `7e97287e...`)
- **UID mismatch reported (18:48:41)**: Pod `nctvs` fails validation with `failed to find PodClique... uid: <bc4a45c2...>` (the kind of check that fails here is sketched below)
- **At test failure**: Pod `nctvs` does not exist; the NEW PodClique has `updatedReplicas: 0`
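The UID mismatch in this sequence is the kind of check sketched below: the pod carries the UID of the PodClique that created it, and that value is compared against the PodClique currently in the cluster. Once the PodClique has been deleted and recreated, the stored UID can never match again. The label key and function are hypothetical; the real grouper's lookup and error handling differ:

```go
package sketch

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// labelPodCliqueUID is a hypothetical label key; the real operator stores the
// creating PodClique's UID on the pod under its own key.
const labelPodCliqueUID = "example.io/podclique-uid"

// validateStoredUID compares the UID stored on the pod at creation time with
// the UID of the PodClique that currently exists (currentUID would come from
// a live read). After a delete-and-recreate, this fails permanently.
func validateStoredUID(pod *corev1.Pod, cliqueName string, currentUID types.UID) error {
	stored := pod.Labels[labelPodCliqueUID]
	if stored != string(currentUID) {
		// Mirrors the observed warning: "failed to find PodClique: <ns/name>, uid: <old-uid>"
		return fmt.Errorf("failed to find PodClique: <%s/%s>, uid: <%s>",
			pod.Namespace, cliqueName, stored)
	}
	return nil
}
```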
Diagnostic Gap: Missing PodClique Controller Logs
The diagnostics contain no logs from `podclique-controller`. Only these controllers are logged:
- `podcliqueset-controller`
- `podcliquescalinggroup-controller`