KeeperCluster StatefulSet infinite restart loop due to K8s API server default field drift #131

@brightsparc

Description

Company or project name

Introspection (introspection.dev) — AI observability platform using self-hosted ClickHouse via the operator.

Describe what's wrong

The KeeperCluster reconciler enters an infinite restart loop because the desired StatefulSet spec omits fields that the Kubernetes API server fills with defaults. On each reconcile cycle, the operator detects a diff between the desired state (nil/zero values) and the actual state (K8s-defaulted values), concludes config has changed, and force-restarts the keeper pod via the kubectl.kubernetes.io/restartedAt annotation. This annotation change itself creates a new diff on the next reconcile, making the loop self-reinforcing.

Does it reproduce on the most recent release?

Yes

How to reproduce

Deploy any KeeperCluster and watch its pods: the keeper pod is killed and recreated every ~10-30 seconds, indefinitely.

The operator logs show:

  INFO  keeper  forcing Pod restart, because of config changes
  INFO  keeper  updating replica StatefulSet

Expected behavior

Once the keeper is running and config hasn't actually changed, the pod should remain stable.

Error message and/or stacktrace

The operator's templateStatefulSet() and templatePodSpec() functions build a desired spec with several fields left as nil/zero:

  Field                                              Desired (operator)   Actual (K8s-defaulted)
  spec.template.spec.terminationGracePeriodSeconds   nil                  30
  spec.template.spec.schedulerName                   ""                   "default-scheduler"
  spec.template.spec.securityContext                 nil                  {}
  spec.updateStrategy.rollingUpdate.partition        nil                  0
  spec.updateStrategy.rollingUpdate.maxUnavailable   nil                  1
  spec.persistentVolumeClaimRetentionPolicy          nil                  {whenDeleted: Retain, whenScaled: Retain}
  liveness probe successThreshold                    0                    1

When DeepHashObject() hashes the desired spec vs the actual spec from K8s, the hashes never match. This triggers the config-change detection at resources.go:371, which sets restartedAt to time.Now(), which changes the pod template hash, which triggers another update on the next reconcile.

Additional context

Affected files

  • internal/controller/keeper/templates.go — templateStatefulSet() and templatePodSpec()
  • internal/controller/clickhouse/templates.go — same functions (same pattern)
  • internal/controller/constants.go — DefaultLivenessProbeSettings is missing SuccessThreshold: 1
  • internal/controller/resources.go — ReconcileReplicaResources(), where the diff is detected

Suggested fix

Explicitly set K8s-defaulted fields in the desired spec so they match what the API server returns:

  1. Set SuccessThreshold: 1 on DefaultLivenessProbeSettings
  2. In templatePodSpec(), default terminationGracePeriodSeconds to 30, schedulerName to "default-scheduler", and securityContext to &PodSecurityContext{}
  3. In templateStatefulSet(), set RollingUpdate.Partition to 0, RollingUpdate.MaxUnavailable to 1, and PersistentVolumeClaimRetentionPolicy to Retain/Retain

Labels: potential bug — To be reviewed by developers and confirmed/rejected.