Description
Summary
Rolling updates can get permanently stuck due to a race condition between when `UpdateEndedAt` is set and when pods are counted. The test `Test_RU9_RollingUpdateAllPodCliques` fails intermittently, hitting its 4-minute timeout while waiting for the rolling update to complete.
What Should Happen
1. Update in progress: old pods are deleted, new pods are created.
   - `CurrentPodTemplateHash` = OLD hash
   - `RollingUpdateProgress.PodTemplateHash` = NEW hash
   - `UpdateEndedAt` = nil → `mutateUpdatedReplica()` counts against the NEW hash ✓
   - `UpdatedReplicas` increases as new pods become Ready
2. All new pods become Ready:
   - `UpdatedReplicas` = 2, `Replicas` = 2
   - All old pods have been deleted
3. `markRollingUpdateEnd()` sets `UpdateEndedAt`:
   - `UpdateEndedAt` is now set → `IsPCLQUpdateInProgress()` returns `false`
   - At this moment: `UpdatedReplicas` = 2, `Replicas` = 2 ✓
4. Next status reconciliation: the hash switch happens safely (see the sketch after this list).
   - `mutateCurrentHashes()` runs first:
     - `IsPCLQUpdateInProgress()`? → NO (because `UpdateEndedAt` is set)
     - `UpdatedReplicas == Replicas`? → YES (2 == 2) ✓
     - Updates `CurrentPodTemplateHash` to the NEW hash
   - `mutateUpdatedReplica()` runs next:
     - `IsPCLQUpdateInProgress()`? → NO (because `UpdateEndedAt` is set)
     - Uses `CurrentPodTemplateHash` (now the NEW hash!)
     - Counts pods matching the NEW hash → 2 pods match ✓
     - `UpdatedReplicas` = 2
5. Update completes successfully.
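To make the ordering and guard conditions concrete, here is a minimal, self-contained Go sketch of the two status mutators as they are described above. The function and field names (`mutateCurrentHashes`, `mutateUpdatedReplica`, `IsPCLQUpdateInProgress`, `CurrentPodTemplateHash`, `UpdateEndedAt`) come from this issue; the types, signatures, and bodies are simplified stand-ins for illustration, not the actual operator code.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the PodClique status fields referenced in this
// issue; the real Grove API types look different.
type pclqStatus struct {
	Replicas               int32
	UpdatedReplicas        int32
	CurrentPodTemplateHash string
	RollingUpdateProgress  *rollingUpdateProgress
}

type rollingUpdateProgress struct {
	PodTemplateHash string
	UpdateEndedAt   *time.Time
}

// IsPCLQUpdateInProgress (simplified): an update is considered in progress
// only while UpdateEndedAt is unset.
func IsPCLQUpdateInProgress(s *pclqStatus) bool {
	return s.RollingUpdateProgress != nil && s.RollingUpdateProgress.UpdateEndedAt == nil
}

// mutateCurrentHashes (simplified): once the update has ended, flip
// CurrentPodTemplateHash to the new hash, but only if every replica has
// already been counted as updated.
func mutateCurrentHashes(s *pclqStatus) {
	if IsPCLQUpdateInProgress(s) {
		return
	}
	if s.UpdatedReplicas == s.Replicas {
		s.CurrentPodTemplateHash = s.RollingUpdateProgress.PodTemplateHash
	}
}

// mutateUpdatedReplica (simplified): count ready pods against the new hash
// while the update is in progress, and against CurrentPodTemplateHash after.
func mutateUpdatedReplica(s *pclqStatus, readyPodHashes []string) {
	target := s.CurrentPodTemplateHash
	if IsPCLQUpdateInProgress(s) {
		target = s.RollingUpdateProgress.PodTemplateHash
	}
	var updated int32
	for _, hash := range readyPodHashes {
		if hash == target {
			updated++
		}
	}
	s.UpdatedReplicas = updated
}

func main() {
	// Happy path: UpdateEndedAt was set only after both new pods were Ready,
	// so UpdatedReplicas is already 2 when the hash switch is attempted.
	ended := time.Now()
	s := &pclqStatus{
		Replicas:               2,
		UpdatedReplicas:        2,
		CurrentPodTemplateHash: "old-hash",
		RollingUpdateProgress: &rollingUpdateProgress{
			PodTemplateHash: "new-hash",
			UpdateEndedAt:   &ended,
		},
	}
	mutateCurrentHashes(s)                                    // 2 == 2 → hash flips to "new-hash"
	mutateUpdatedReplica(s, []string{"new-hash", "new-hash"}) // counts 2 against "new-hash"
	fmt.Println(s.CurrentPodTemplateHash, s.UpdatedReplicas)  // new-hash 2
}
```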
What Happened (The Bug)
1. Update in progress: old pods are deleted, new pods are created.
   - `CurrentPodTemplateHash` = OLD hash
   - `RollingUpdateProgress.PodTemplateHash` = NEW hash
   - `UpdateEndedAt` = nil → `mutateUpdatedReplica()` counts against the NEW hash ✓
   - `UpdatedReplicas` = 1 (one new pod is Ready)
2. Last old pod deleted, but one new pod is still becoming Ready:
   - `markRollingUpdateEnd()` is called (no old pods remain)
   - Sets `UpdateEndedAt` → `IsPCLQUpdateInProgress()` now returns `false`
   - At this moment: `UpdatedReplicas` = 1, `Replicas` = 2
3. Next status reconciliation: the breaking moment.
   - `mutateCurrentHashes()` runs first:
     - `IsPCLQUpdateInProgress()`? → NO (because `UpdateEndedAt` is set)
     - `UpdatedReplicas == Replicas`? → NO (1 != 2)
     - Refuses to update `CurrentPodTemplateHash`; it stays as the OLD hash
   - `mutateUpdatedReplica()` runs next:
     - `IsPCLQUpdateInProgress()`? → NO (because `UpdateEndedAt` is set)
     - Uses `CurrentPodTemplateHash` (the OLD hash!)
     - Counts pods matching the OLD hash → 0 pods match (all pods have the NEW hash)
     - Sets `UpdatedReplicas = 0`
4. Deadlock: every subsequent reconciliation repeats the cycle.
   - `mutateCurrentHashes()`: `UpdatedReplicas (0) != Replicas (2)` → refuses to update the hash
   - `mutateUpdatedReplica()`: uses the OLD hash → counts 0 pods → `UpdatedReplicas = 0`
   - `UpdatedReplicas` stays at 0 forever
The system is permanently stuck: `CurrentPodTemplateHash` will never be updated because `UpdatedReplicas != Replicas`, and `UpdatedReplicas` will never be correct because pods are being counted against the wrong hash.
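To see why this is a fixpoint, the following compact Go sketch replays the broken reconciliations using the same simplified guards as the sketch in "What Should Happen" (an illustration under those assumptions, not the real reconciler). Starting from the state where `UpdateEndedAt` was set while `UpdatedReplicas` = 1, `UpdatedReplicas` drops to 0 and stays there, and the hash never flips:

```go
package main

import "fmt"

func main() {
	replicas := int32(2)
	updatedReplicas := int32(1) // only one new pod was Ready when the update was marked ended
	currentHash := "old-hash"   // CurrentPodTemplateHash
	newHash := "new-hash"       // RollingUpdateProgress.PodTemplateHash
	// By the first broken reconciliation, every remaining pod carries the new hash.
	readyPodHashes := []string{newHash, newHash}

	for i := 1; i <= 3; i++ {
		// mutateCurrentHashes (simplified): the update has ended, so only the
		// UpdatedReplicas == Replicas guard matters.
		if updatedReplicas == replicas {
			currentHash = newHash
		}
		// mutateUpdatedReplica (simplified): the update has ended, so pods are
		// counted against CurrentPodTemplateHash, which is still the old hash.
		var n int32
		for _, h := range readyPodHashes {
			if h == currentHash {
				n++
			}
		}
		updatedReplicas = n
		fmt.Printf("reconcile %d: currentHash=%s updatedReplicas=%d\n", i, currentHash, updatedReplicas)
	}
	// Output:
	// reconcile 1: currentHash=old-hash updatedReplicas=0
	// reconcile 2: currentHash=old-hash updatedReplicas=0
	// reconcile 3: currentHash=old-hash updatedReplicas=0
}
```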
Evidence from Diagnostics
1. PodClique Shows Contradictory State
```yaml
# From diagnostic dump
status:
  replicas: 2
  updatedReplicas: 0                                 # ← Should be 2!
  currentPodCliqueSetGenerationHash: f6c5fd949cb444c6558   # ← OLD hash
  currentPodTemplateHash: 9f945b7b659c5f4cb8f        # ← OLD hash (never updated)
  rollingUpdateProgress:
    podCliqueSetGenerationHash: c947687c4d79f9cfb59  # ← NEW target hash
    podTemplateHash: 574f6fb86d49bfdfbd9d            # ← NEW target hash
    updateStartedAt: "2026-01-13T00:38:19Z"
    updateEndedAt: "2026-01-13T00:38:19Z"            # ← Update marked COMPLETE!
```

The contradiction:

- `updateEndedAt` is set → the rolling update component thinks it's done
- `currentPodTemplateHash` has the OLD value → the hash was never updated
- `updatedReplicas: 0` → pods are being counted against the wrong hash
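To state the violated expectation explicitly, here is a small, hypothetical Go check over the values from the dump. The struct is a hand transcription of the YAML keys, not the real Grove API type; it only encodes the expectation from "What Should Happen": once `updateEndedAt` is set, the status should converge to matching hashes and `updatedReplicas == replicas`.

```go
package main

import "fmt"

// Hand transcription of the dumped status fields above (not the real API type).
type dumpedStatus struct {
	Replicas               int32
	UpdatedReplicas        int32
	CurrentPodTemplateHash string
	TargetPodTemplateHash  string // rollingUpdateProgress.podTemplateHash
	UpdateEnded            bool   // updateEndedAt is set
}

func main() {
	s := dumpedStatus{
		Replicas:               2,
		UpdatedReplicas:        0,
		CurrentPodTemplateHash: "9f945b7b659c5f4cb8f",
		TargetPodTemplateHash:  "574f6fb86d49bfdfbd9d",
		UpdateEnded:            true,
	}
	// Once the update is marked ended, the status should converge to
	// CurrentPodTemplateHash == TargetPodTemplateHash and
	// UpdatedReplicas == Replicas. The dump violates both.
	if s.UpdateEnded &&
		(s.CurrentPodTemplateHash != s.TargetPodTemplateHash || s.UpdatedReplicas != s.Replicas) {
		fmt.Println("contradictory state: update marked ended but status never converged")
	}
}
```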
2. Controller Logs Show the Loop
```json
{
  "level": "info",
  "ts": "2026-01-13T00:38:29.260Z",
  "logger": "podclique-controller",
  "msg": "PodClique is currently updating, cannot set PodCliqueSet CurrentGenerationHash yet",
  "PodClique": {"name": "workload1-0-pc-a", "namespace": "default"}
}
```

This was logged 10 seconds after `updateEndedAt` was set. The controller is stuck refusing to update the hash because `UpdatedReplicas != Replicas`.
3. Timeline of Events
| Time | Event | State |
|---|---|---|
| 00:38:18 | Rolling update triggered | `updateStartedAt` set |
| 00:38:19 | `updateEndedAt` set | Old pods gone, but `UpdatedReplicas` may not equal `Replicas` |
| 00:38:19+ | First broken reconciliation | `mutateUpdatedReplica()` switches to the OLD hash, counts 0 |
| 00:38:29 | Controller logs "cannot set hash" | Deadlock confirmed |
| 00:42:18 | Test timeout | 4 minutes with no progress |