
Gang Scheduling Race Condition During Rolling Updates #316

@gflarity

Summary

Rolling updates can become permanently stuck when pods are created during the brief window in which PodGang/PodClique resources are being recreated. The affected pods remain in SchedulingGated state indefinitely because the gang scheduler cannot associate them with their gang resources. The test Test_RU19_RollingUpdateWithPodCliqueScaleOutBeforeUpdate fails intermittently with a 4-minute timeout.

e2e-failure-analysis-RU19.diag.txt

What Should Happen

During a rolling update:

  1. Rolling update is triggered when PodCliqueSet spec changes (detected via PodTemplateHash comparison)
  2. PodCliqueSet controller reconciles:
    • prepareSyncFlow() snapshots all existing pods into the sync context
    • Excess PodGangs are deleted (if scale changed)
    • PodGangs are created/updated via CreateOrPatch, with PodReferences populated from the sync context snapshot
  3. PodClique resources are updated in-place via CreateOrPatch with new specs (UIDs remain stable since resources aren't deleted)
  4. Old pods (with outdated template hash) are marked for deletion one at a time
  5. New replacement pods are created with LabelPodGang label and store the current PodClique UID for validation
  6. PodCliqueSet controller reconciles again, captures new pods in sync context, and updates PodGang's PodReferences to include them
  7. PodClique controller sees pods are in PodReferences, removes their scheduling gates
  8. Scheduler plugin validates that the pod's stored UID matches the current PodClique UID (see the sketch after this list) ✓
  9. Pods get scheduled and become ready
  10. Once minAvailable is satisfied, rolling update progresses to the next pod

What Happens (Intermittently)

The system ends up in a deadlocked state with the following observed problems:

  1. Empty subgroup in PodGang: Subgroup workload1-0-sg-x-0-pc-b has podReferences: [] but requires minReplicas: 1

  2. Pods stuck in SchedulingGated: 7 pods remain permanently in SchedulingGated state with their scheduling gates never removed

  3. Gang cannot be scheduled: The PodGroup reports Job is not ready for scheduling. Waiting for 1 pods for SubGroup...

  4. UID mismatch warnings: Some pods received PodGrouperWarning events of the form failed to find PodClique... uid: <old-uid>

  5. Rolling update deadlocked: The controller repeatedly requeues with available replicas 1 lesser than minAvailable 2 until timeout

Evidence from Diagnostics

1. Pod Status at Timeout

All 7 pods for replica index 0 (workload1-0) are stuck in SchedulingGated state, while replica index 1 (workload1-1) pods are running normally:

NAME                                     PHASE        READY      NODE                   CONDITIONS
workload1-0-pc-a-k7f5r                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-pc-a-kzqls                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-pc-a-r8xzv                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-pc-a-sq7n9                   Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-8d44l            Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-kjsx5            Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-0-sg-x-0-pc-c-r554q            Pending      0/1        <unscheduled>          PodScheduled:SchedulingGated
workload1-1-pc-a-75dcs                   Running      1/1        k3d-shared-e2e-test-cluster-agent-18    OK
workload1-1-pc-a-95tt9                   Running      1/1        k3d-shared-e2e-test-cluster-agent-19    OK
... (workload1-1 pods all running)

Source: Lines 2953-2977 of e2e-failure-analysis-RU19.diag.txt

2. PodClique Status Shows All Pods Gated

The PodClique workload1-0-pc-a status confirms the stuck state:

status:
  conditions:
  - message: 'Insufficient scheduled pods. expected at least: 2, found: 0'
    reason: InsufficientScheduledPods
    status: "False"
    type: PodCliqueScheduled
  readyReplicas: 0
  replicas: 4
  scheduleGatedReplicas: 4    # ← All 4 pods are gated
  scheduledReplicas: 0        # ← None have been scheduled
  updatedReplicas: 4          # ← Pods have new spec but can't be scheduled
  rollingUpdateProgress:
    updateStartedAt: "2026-01-12T18:48:38Z"
    readyPodsSelectedToUpdate:
      current: workload1-0-pc-a-bvwnb

Source: Lines 935-965 of e2e-failure-analysis-RU19.diag.txt

3. PodGrouperWarning Events

The Kubernetes events show the race condition during the rolling update phase (starting at 18:48:38):

2026-01-12T18:48:39  Normal   PodCliqueScalingGroupR...  PodCliqueSet/workload1     Deleted PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0
2026-01-12T18:48:39  Normal   PodCreateSuccessful       PodClique/workload1-0-sg-x-0-pc-b   Created Pod: workload1-0-sg-x-0-pc-b-nctvs
2026-01-12T18:48:39  Normal   PodCliqueCreateSuccessful PodCliqueScalingGroup/...   PodClique default/workload1-0-sg-x-0-pc-c created successfully
2026-01-12T18:48:39  Normal   PodCliqueCreateSuccessful PodCliqueScalingGroup/...   PodClique default/workload1-0-sg-x-0-pc-b created successfully
2026-01-12T18:48:41  Warning  PodGrouperWarning         Pod/workload1-0-sg-x-0-pc-b-nctvs   failed to find PodClique: <default/workload1-0-sg-x-0-pc-b>, uid: <bc4a45c2-1...
2026-01-12T18:48:42  Warning  PodGrouperWarning         Pod/workload1-0-pc-a-kzqls          error assigning pods to subgroup: pods "workload1-0-sg-x-0-pc-c-gpbhk" not found
2026-01-12T18:48:42  Warning  PodGrouperWarning         Pod/workload1-0-sg-x-0-pc-c-gpbhk   failed to find PodClique: <default/workload1-0-sg-x-0-pc-c>, uid: <68db0da5-c...
2026-01-12T18:48:44  Normal   NotReady                  PodGroup/pg-workload1-0-...         Job is not ready for scheduling. Waiting for 1 pods for SubGroup workload1-0-...

Key observations:

  • At 18:48:39: PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0 is deleted, then new PodCliques are recreated with new UIDs
  • At 18:48:41: Pod workload1-0-sg-x-0-pc-b-nctvs has a UID mismatch — it stored the old PodClique UID
  • At 18:48:42: Pod workload1-0-pc-a-kzqls fails with error assigning pods to subgroup: pods "workload1-0-sg-x-0-pc-c-gpbhk" not found — the gang scheduler can't find a referenced pod
  • At 18:48:44: The PodGroup reports it's waiting for pods that will never come

Source: Lines 4685-4719 of e2e-failure-analysis-RU19.diag.txt

4. PodGang Has Empty Subgroup

The PodGang workload1-0 shows that subgroup workload1-0-sg-x-0-pc-b has no pods despite requiring minReplicas: 1:

spec:
  podgroups:
  - minReplicas: 2
    name: workload1-0-pc-a
    podReferences:
    - name: workload1-0-pc-a-k7f5r      # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-kzqls      # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-r8xzv      # ← Stuck in SchedulingGated
    - name: workload1-0-pc-a-sq7n9      # ← Stuck in SchedulingGated
  - minReplicas: 1
    name: workload1-0-sg-x-0-pc-b
    podReferences: []                    # ← EMPTY! Needs 1 pod, has 0
  - minReplicas: 3
    name: workload1-0-sg-x-0-pc-c
    podReferences:
    - name: workload1-0-sg-x-0-pc-c-8d44l  # ← Stuck in SchedulingGated
    - name: workload1-0-sg-x-0-pc-c-kjsx5  # ← Stuck in SchedulingGated
    - name: workload1-0-sg-x-0-pc-c-r554q  # ← Stuck in SchedulingGated

The gang scheduler requires ALL subgroups to have their minReplicas satisfied. Since workload1-0-sg-x-0-pc-b has 0 pods (but needs 1), the entire gang is blocked forever.

Source: Lines 2824-2848 of e2e-failure-analysis-RU19.diag.txt

5. Controller Stuck Waiting for minAvailable

The controller repeatedly logs that it can't proceed because availableReplicas < minAvailable:

{
  "level": "info",
  "ts": "2026-01-12T18:51:49.470Z",
  "logger": "podcliquescalinggroup-controller",
  "msg": "components has registered a request to requeue post completion of all components syncs",
  "PodCliqueScalingGroup": {"name": "workload1-0-sg-x", "namespace": "default"},
  "kind": "PodClique",
  "message": "[Operation: Sync, Code: ERR_CONTINUE_RECONCILE_AND_REQUEUE] message: available replicas 1 lesser than minAvailable 2, requeuing"
}

This message repeats every ~5 seconds from 18:51:49 until the test timeout at 18:52:38, showing the system is stuck in an infinite requeue loop.

Source: Lines 99, 148, 197, 246, 293 of e2e-failure-analysis-RU19.diag.txt

6. PodCliqueSet Shows Rolling Update Stuck

The PodCliqueSet status shows the rolling update is blocked on replica index 0:

status:
  availableReplicas: 1
  replicas: 2
  updatedReplicas: 0
  rollingUpdateProgress:
    currentlyUpdating:
      replicaIndex: 0                           # ← Stuck on replica 0
      updateStartedAt: "2026-01-12T18:48:38Z"
    updateStartedAt: "2026-01-12T18:48:38Z"

The rolling update started at 18:48:38 and was still stuck at 18:52:38 (4 minutes later) when the test timed out.

Source: Lines 770-780 of e2e-failure-analysis-RU19.diag.txt

Timeline Summary

Time         Event                          State
18:48:38     Rolling update starts          updateStartedAt set, old pods being deleted
18:48:39     PodCliqueScalingGroup deleted  workload1-0-sg-x replicaIndex: 0 deleted
18:48:39     New PodCliques created         New UIDs assigned to recreated PodCliques
18:48:39-41  New pods created               Some pods store old PodClique UIDs
18:48:41     UID mismatch errors            failed to find PodClique... uid: <old>
18:48:42     Gang assignment fails          error assigning pods to subgroup: pods "X" not found
18:48:44     Gang blocked                   Job is not ready for scheduling. Waiting for 1 pods...
18:51:49+    Controller requeue loop        "available replicas 1 lesser than minAvailable 2"
18:52:38     Test timeout                   4 minutes with no progress

Update: Additional Evidence

PodClique IS Recreated

The diagnostics show the PodClique IS recreated:

2026-01-12T18:48:39  Normal   PodCliqueCreateSuccessful PodCliqueScalingGroup/...   PodClique default/workload1-0-sg-x-0-pc-b created successfully

Source: Line 4695 of diagnostics

New PodClique Status at Test Failure

The PodClique workload1-0-sg-x-0-pc-b exists at test failure time with the following status:

metadata:
  creationTimestamp: "2026-01-12T18:48:39Z"
  generation: 1
  uid: 7e97287e-5b8f-4b9f-935c-d4b66bb974ad    # NEW UID
spec:
  replicas: 1                                   # Should have 1 pod
status:
  observedGeneration: 1                         # Controller DID process it
  readyReplicas: 0
  scheduleGatedReplicas: 0                      # NO pods are even gated
  scheduledReplicas: 0
  updatedReplicas: 0                            # NO pods exist at all

Source: Lines 968-1150 of diagnostics

Note: The observedGeneration: 1 matches generation: 1, indicating the PodClique controller reconciled this resource.

Pod Counts at Test Failure

The pod list at test failure shows zero pods for workload1-0-sg-x-0-pc-b:

Pod Pattern                  Count   Status
workload1-0-pc-a-*           4       SchedulingGated
workload1-0-sg-x-0-pc-b-*    0       NONE EXIST
workload1-0-sg-x-0-pc-c-*    3       SchedulingGated
workload1-0-sg-x-1-pc-b-*    1       Running

Source: Lines 2951-2975 of diagnostics

Event Sequence at 18:48:39

Events within the same second (see the sketch after this list):

  1. PCSG deleted: Deleted PodCliqueScalingGroup workload1-0-sg-x replicaIndex: 0
  2. Pod created by OLD PodClique: Created Pod: workload1-0-sg-x-0-pc-b-nctvs (stores OLD UID bc4a45c2...)
  3. NEW PodClique created: PodClique default/workload1-0-sg-x-0-pc-b created successfully (NEW UID 7e97287e...)
  4. UID mismatch reported (18:48:41): Pod nctvs fails validation with failed to find PodClique... uid: <bc4a45c2...>
  5. At test failure: Pod nctvs does not exist; NEW PodClique has updatedReplicas: 0

Diagnostic Gap: Missing PodClique Controller Logs

The diagnostics contain no logs from podclique-controller. Only these controllers are logged:

  • podcliqueset-controller
  • podcliquescalinggroup-controller
