Company or project name
Introspection (introspection.dev) — AI observability platform using self-hosted ClickHouse via the operator.
Describe what's wrong
The KeeperCluster reconciler enters an infinite restart loop because the desired StatefulSet spec omits fields that the Kubernetes API server fills with defaults. On each reconcile cycle, the operator detects a diff between the desired state (nil/zero values) and the actual state (K8s-defaulted values), concludes config has changed, and force-restarts the keeper pod via the kubectl.kubernetes.io/restartedAt annotation. This annotation change itself creates a new diff on the next reconcile, making the loop self-reinforcing.
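The self-reinforcing loop can be reproduced in miniature without any Kubernetes machinery. This is a sketch with simplified stand-in types, not the operator's code: `spec` stands in for the full StatefulSet pod template, the hash is a plain JSON-plus-SHA256 stand-in for the operator's deep hash, and `simulateReconciles` is a hypothetical helper:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"time"
)

// spec is a simplified stand-in for the StatefulSet pod template the
// operator diffs; the real code hashes the full appsv1.StatefulSetSpec.
type spec struct {
	TerminationGracePeriodSeconds *int64
	Annotations                   map[string]string
}

func hash(s spec) [32]byte {
	b, _ := json.Marshal(s)
	return sha256.Sum256(b)
}

// simulateReconciles runs n reconcile cycles and returns how many forced
// restarts the nil-vs-defaulted diff produces.
func simulateReconciles(n int) int {
	grace := int64(30) // the API server defaulted this field on the live object
	actual := spec{
		TerminationGracePeriodSeconds: &grace,
		Annotations:                   map[string]string{},
	}
	restarts := 0
	for cycle := 0; cycle < n; cycle++ {
		// The operator rebuilds the desired spec from its templates each
		// cycle, leaving the field nil -- so the hashes can never converge.
		desired := spec{Annotations: map[string]string{}}
		if hash(desired) != hash(actual) {
			// "Config changed": stamp a fresh restart annotation, which
			// guarantees yet another diff on the next cycle.
			actual.Annotations["kubectl.kubernetes.io/restartedAt"] =
				time.Now().Format(time.RFC3339Nano)
			restarts++
		}
	}
	return restarts
}

func main() {
	// One forced restart per reconcile cycle, forever.
	fmt.Println("forced restarts in 5 cycles:", simulateReconciles(5))
}
```

Because the desired spec is rebuilt from templates every cycle, the nil field alone is enough to keep the hashes unequal; the restart annotation only makes the gap wider.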
Does it reproduce on the most recent release?
Yes
How to reproduce
The keeper pod is killed and recreated every ~10-30 seconds, indefinitely.
The operator logs show:
```
INFO keeper forcing Pod restart, because of config changes
INFO keeper updating replica StatefulSet
```
Expected behavior
Once the keeper is running and config hasn't actually changed, the pod should remain stable.
Error message and/or stacktrace
The operator's `templateStatefulSet()` and `templatePodSpec()` functions build a desired spec with several fields left as nil/zero:

| Field | Desired (operator) | Actual (K8s-defaulted) |
| --- | --- | --- |
| `spec.template.spec.terminationGracePeriodSeconds` | `nil` | `30` |
| `spec.template.spec.schedulerName` | `""` | `"default-scheduler"` |
| `spec.template.spec.securityContext` | `nil` | `{}` |
| `spec.updateStrategy.rollingUpdate.partition` | `nil` | `0` |
| `spec.updateStrategy.rollingUpdate.maxUnavailable` | `nil` | `1` |
| `spec.persistentVolumeClaimRetentionPolicy` | `nil` | `{whenDeleted: Retain, whenScaled: Retain}` |
| Liveness probe `successThreshold` | `0` | `1` |
When `DeepHashObject()` hashes the desired spec against the actual spec returned by K8s, the hashes never match. This trips the config-change detection at `resources.go:371`, which sets `restartedAt` to `time.Now()`; that changes the pod template hash, which triggers another update on the next reconcile.
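The core failure is easy to confirm in isolation: any deep hash of a spec treats a nil pointer and an API-server-filled default as different values. A minimal sketch, using a JSON dump plus FNV as a stand-in for `DeepHashObject()` (not the operator's actual implementation):

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// deepHash is a stand-in for DeepHashObject(): deterministically dump the
// object, then hash the bytes. Any representation that distinguishes nil
// from a concrete value has the same problem.
func deepHash(obj any) uint32 {
	b, _ := json.Marshal(obj)
	h := fnv.New32a()
	h.Write(b)
	return h.Sum32()
}

func main() {
	var desiredGrace *int64  // operator leaves the field nil ("null")
	actualGrace := int64(30) // API server stored the default ("30")

	// prints "hashes match: false" -- the reconciler will always see a diff
	fmt.Println("hashes match:", deepHash(desiredGrace) == deepHash(&actualGrace))
}
```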
Additional context
Affected files
- `internal/controller/keeper/templates.go` — `templateStatefulSet()` and `templatePodSpec()`
- `internal/controller/clickhouse/templates.go` — same functions (same pattern)
- `internal/controller/constants.go` — `DefaultLivenessProbeSettings` missing `SuccessThreshold: 1`
- `internal/controller/resources.go` — `ReconcileReplicaResources()` where the diff is detected
Suggested fix
Explicitly set K8s-defaulted fields in the desired spec so they match what the API server returns:
- Set `SuccessThreshold: 1` on `DefaultLivenessProbeSettings`
- In `templatePodSpec()`, default `terminationGracePeriodSeconds` to `30`, `schedulerName` to `"default-scheduler"`, and `securityContext` to `&PodSecurityContext{}`
- In `templateStatefulSet()`, set `RollingUpdate.Partition` to `0`, `RollingUpdate.MaxUnavailable` to `1`, and `PersistentVolumeClaimRetentionPolicy` to Retain/Retain
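A sketch of the fix, using simplified stand-in structs rather than the real `corev1`/`appsv1` types (field names mirror the Kubernetes API, but the helper name `applyServerDefaults` is hypothetical and the types are illustrative):

```go
package main

import "fmt"

// Simplified stand-ins for the relevant API fields. In the real types,
// MaxUnavailable is a *intstr.IntOrString and SecurityContext is a
// *corev1.PodSecurityContext.
type podSpec struct {
	TerminationGracePeriodSeconds *int64
	SchedulerName                 string
	SecurityContext               *struct{}
}

type rollingUpdate struct {
	Partition      *int32
	MaxUnavailable *int32
}

// applyServerDefaults fills in exactly what the kube-apiserver would, so
// the desired spec hashes identically to the stored object and the
// reconciler sees no spurious diff.
func applyServerDefaults(p *podSpec, r *rollingUpdate) {
	if p.TerminationGracePeriodSeconds == nil {
		grace := int64(30)
		p.TerminationGracePeriodSeconds = &grace
	}
	if p.SchedulerName == "" {
		p.SchedulerName = "default-scheduler"
	}
	if p.SecurityContext == nil {
		p.SecurityContext = &struct{}{}
	}
	if r.Partition == nil {
		zero := int32(0)
		r.Partition = &zero
	}
	if r.MaxUnavailable == nil {
		one := int32(1)
		r.MaxUnavailable = &one
	}
}

func main() {
	p, r := podSpec{}, rollingUpdate{}
	applyServerDefaults(&p, &r)
	fmt.Println(*p.TerminationGracePeriodSeconds, p.SchedulerName, *r.MaxUnavailable)
}
```

The same pattern extends to `PersistentVolumeClaimRetentionPolicy` (Retain/Retain) and the probe's `SuccessThreshold: 1`; the key point is that the defaults are applied when building the desired spec, before it is hashed.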