Rolling Update Deadlock in PodClique Controller #315

@gflarity

Summary

Rolling updates can get permanently stuck due to a race condition between when UpdateEndedAt is set and when pods are counted. The test Test_RU9_RollingUpdateAllPodCliques fails intermittently with a 4-minute timeout waiting for the rolling update to complete.

What Should Happen

  1. Update in progress: Old pods are deleted, new pods are created.

    • CurrentPodTemplateHash = OLD hash
    • RollingUpdateProgress.PodTemplateHash = NEW hash
    • UpdateEndedAt = nil → mutateUpdatedReplica() counts against NEW hash ✓
    • UpdatedReplicas increases as new pods become Ready
  2. All new pods become Ready:

    • UpdatedReplicas = 2, Replicas = 2
    • All old pods have been deleted
  3. markRollingUpdateEnd() sets UpdateEndedAt:

    • UpdateEndedAt is now set → IsPCLQUpdateInProgress() returns false
    • At this moment: UpdatedReplicas = 2, Replicas = 2 ✓
  4. Next status reconciliation — hash switch happens safely:

    mutateCurrentHashes() runs first:

    • IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    • UpdatedReplicas == Replicas? → YES (2 == 2) ✓
    • Updates CurrentPodTemplateHash to the NEW hash

    mutateUpdatedReplica() runs next:

    • IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    • Uses CurrentPodTemplateHash (now the NEW hash!)
    • Counts pods matching NEW hash → 2 pods match
    • UpdatedReplicas = 2
  5. Update completes successfully.
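The intended ordering can be sketched as a small Go model. This is a simplified illustration built only from the behaviour described in the steps above, not the controller's actual code: the `status` struct, `updateInProgress`, and `reconcileStatus` are hypothetical names, and `UpdateEnded` stands in for "`UpdateEndedAt` is set".

```go
package main

import "fmt"

// status is a hypothetical, simplified stand-in for the PodClique status
// fields named in this issue. It is not the real API type.
type status struct {
	Replicas               int
	UpdatedReplicas        int
	CurrentPodTemplateHash string
	TargetPodTemplateHash  string // models RollingUpdateProgress.PodTemplateHash
	UpdateEnded            bool   // models "UpdateEndedAt is set"
}

// updateInProgress models the described IsPCLQUpdateInProgress() check.
func updateInProgress(s status) bool { return !s.UpdateEnded }

// mutateCurrentHashes switches to the target hash only when the update has
// ended and every replica is counted as updated.
func mutateCurrentHashes(s *status) {
	if updateInProgress(*s) || s.UpdatedReplicas != s.Replicas {
		return
	}
	s.CurrentPodTemplateHash = s.TargetPodTemplateHash
}

// mutateUpdatedReplica counts Ready pods against the target hash while an
// update is in progress, and against the current hash otherwise.
func mutateUpdatedReplica(s *status, readyPodHashes []string) {
	want := s.CurrentPodTemplateHash
	if updateInProgress(*s) {
		want = s.TargetPodTemplateHash
	}
	n := 0
	for _, h := range readyPodHashes {
		if h == want {
			n++
		}
	}
	s.UpdatedReplicas = n
}

// reconcileStatus runs the two steps in the order described above.
func reconcileStatus(s *status, readyPodHashes []string) {
	mutateCurrentHashes(s)
	mutateUpdatedReplica(s, readyPodHashes)
}

func main() {
	// Step 3 of the happy path: UpdateEndedAt is set while 2 == 2.
	s := status{Replicas: 2, UpdatedReplicas: 2,
		CurrentPodTemplateHash: "OLD", TargetPodTemplateHash: "NEW", UpdateEnded: true}
	reconcileStatus(&s, []string{"NEW", "NEW"})
	fmt.Println(s.CurrentPodTemplateHash, s.UpdatedReplicas) // prints "NEW 2"
}
```

In this model the hash switch succeeds because `UpdatedReplicas == Replicas` already holds at the moment the update is marked as ended.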

What Happened (The Bug)

  1. Update in progress: Old pods are deleted, new pods are created.

    • CurrentPodTemplateHash = OLD hash
    • RollingUpdateProgress.PodTemplateHash = NEW hash
    • UpdateEndedAt = nil → mutateUpdatedReplica() counts against NEW hash ✓
    • UpdatedReplicas = 1 (one new pod is Ready)
  2. Last old pod deleted, but one new pod is still becoming Ready:

    • markRollingUpdateEnd() is called (no old pods remain)
    • Sets UpdateEndedAt → IsPCLQUpdateInProgress() now returns false
    • At this moment: UpdatedReplicas = 1, Replicas = 2
  3. Next status reconciliation — the breaking moment:

    mutateCurrentHashes() runs first:

    • IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    • UpdatedReplicas == Replicas? → NO (1 != 2)
    • Refuses to update CurrentPodTemplateHash — it stays as the OLD hash

    mutateUpdatedReplica() runs next:

    • IsPCLQUpdateInProgress()? → NO (because UpdateEndedAt is set)
    • Uses CurrentPodTemplateHash (the OLD hash!)
    • Counts pods matching OLD hash → 0 pods match (all pods have NEW hash)
    • Sets UpdatedReplicas = 0
  4. Deadlock — every subsequent reconciliation:

    • mutateCurrentHashes(): UpdatedReplicas (0) != Replicas (2) → refuses to update hash
    • mutateUpdatedReplica(): uses OLD hash → counts 0 pods → UpdatedReplicas = 0
    • UpdatedReplicas stays at 0 forever

The system is permanently stuck. CurrentPodTemplateHash will never be updated because UpdatedReplicas != Replicas, but UpdatedReplicas will never be correct because it's counting against the wrong hash.
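The fixed point can be reproduced with a minimal Go simulation. It is a sketch under stated assumptions: the `pclq` struct and the `reconcile` and `stuckAfter` helpers are hypothetical, they model only the two status steps as described in this issue, and `updateEnded` stands in for "`UpdateEndedAt` is set".

```go
package main

import "fmt"

// pclq is a hypothetical, trimmed-down model of the PodClique status fields
// discussed above. The real controller types are not reproduced here.
type pclq struct {
	replicas, updatedReplicas int
	currentHash, targetHash   string
	updateEnded               bool // stands in for "UpdateEndedAt is set"
}

// reconcile applies the two status steps in the order described above:
// first the hash switch, then the updated-replica count.
func reconcile(p *pclq, readyPodHashes []string) {
	// Hash switch: only once the update has ended AND the counts agree.
	if p.updateEnded && p.updatedReplicas == p.replicas {
		p.currentHash = p.targetHash
	}
	// Replica count: against the target hash during an update,
	// against the current hash once the update is marked as ended.
	countAgainst := p.currentHash
	if !p.updateEnded {
		countAgainst = p.targetHash
	}
	p.updatedReplicas = 0
	for _, h := range readyPodHashes {
		if h == countAgainst {
			p.updatedReplicas++
		}
	}
}

// stuckAfter returns the state after n reconciliations, starting from the
// moment the update was marked as ended with only one of two new pods Ready.
func stuckAfter(n int) pclq {
	p := pclq{replicas: 2, updatedReplicas: 1,
		currentHash: "OLD", targetHash: "NEW", updateEnded: true}
	pods := []string{"NEW", "NEW"} // both new pods are Ready by now
	for i := 0; i < n; i++ {
		reconcile(&p, pods)
	}
	return p
}

func main() {
	for _, n := range []int{1, 5, 50} {
		p := stuckAfter(n)
		fmt.Printf("after %d reconciles: updatedReplicas=%d currentHash=%s\n",
			n, p.updatedReplicas, p.currentHash)
	}
}
```

Every line printed by `main` reports `updatedReplicas=0 currentHash=OLD`: no number of further reconciliations moves the model out of this state, even though both pods carry the NEW hash and are Ready.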

Evidence

e2eRU9-v2.diag.txt

Evidence from Diagnostics

1. PodClique Shows Contradictory State

# From diagnostic dump
status:
  replicas: 2
  updatedReplicas: 0                                    # ← Should be 2!
  currentPodCliqueSetGenerationHash: f6c5fd949cb444c6558  # ← OLD hash
  currentPodTemplateHash: 9f945b7b659c5f4cb8f             # ← OLD hash (never updated)
  rollingUpdateProgress:
    podCliqueSetGenerationHash: c947687c4d79f9cfb59       # ← NEW target hash
    podTemplateHash: 574f6fb86d49bfdfbd9d                 # ← NEW target hash
    updateStartedAt: "2026-01-13T00:38:19Z"
    updateEndedAt: "2026-01-13T00:38:19Z"                 # ← Update marked COMPLETE!

The contradiction:

  • updateEndedAt is set → Rolling update component thinks it's done
  • currentPodTemplateHash has OLD value → Hash was never updated
  • updatedReplicas: 0 → Pods being counted against wrong hash
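As an illustration, the three conditions above can be combined into one check over a status snapshot. This is a hypothetical diagnostic helper, not part of the controller: `statusSnapshot` and `looksDeadlocked` are made-up names, and the field values in `main` are copied from the dump above. A single snapshot could also match briefly before a reconciliation runs, so the check is only meaningful if the state persists.

```go
package main

import "fmt"

// statusSnapshot is a hypothetical struct holding only the fields from the
// diagnostic dump that matter for this check.
type statusSnapshot struct {
	Replicas               int
	UpdatedReplicas        int
	CurrentPodTemplateHash string
	TargetPodTemplateHash  string // rollingUpdateProgress.podTemplateHash
	UpdateEndedAt          string // empty means "not set"
}

// looksDeadlocked reports whether a snapshot shows the contradiction
// described above: the update is marked as ended, the current hash was
// never switched to the target hash, and the replica counts disagree.
func looksDeadlocked(s statusSnapshot) bool {
	ended := s.UpdateEndedAt != ""
	hashStale := s.CurrentPodTemplateHash != s.TargetPodTemplateHash
	countsOff := s.UpdatedReplicas != s.Replicas
	return ended && hashStale && countsOff
}

func main() {
	// Values taken from the diagnostic dump in this issue.
	dump := statusSnapshot{
		Replicas:               2,
		UpdatedReplicas:        0,
		CurrentPodTemplateHash: "9f945b7b659c5f4cb8f",
		TargetPodTemplateHash:  "574f6fb86d49bfdfbd9d",
		UpdateEndedAt:          "2026-01-13T00:38:19Z",
	}
	fmt.Println(looksDeadlocked(dump)) // prints "true"
}
```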

2. Controller Logs Show the Loop

{
  "level": "info",
  "ts": "2026-01-13T00:38:29.260Z",
  "logger": "podclique-controller",
  "msg": "PodClique is currently updating, cannot set PodCliqueSet CurrentGenerationHash yet",
  "PodClique": {"name": "workload1-0-pc-a", "namespace": "default"}
}

This was logged 10 seconds after updateEndedAt was set. The controller is stuck refusing to update the hash because UpdatedReplicas != Replicas.

3. Timeline of Events

| Time | Event | State |
|------|-------|-------|
| 00:38:18 | Rolling update triggered | updateStartedAt set |
| 00:38:19 | updateEndedAt set | Old pods gone, but UpdatedReplicas may not equal Replicas |
| 00:38:19+ | First broken reconciliation | mutateUpdatedReplica() switches to OLD hash, counts 0 |
| 00:38:29 | Controller logs "cannot set hash" | Deadlock confirmed |
| 00:42:18 | Test timeout | 4 minutes with no progress |
