Hello :)
Description
I'm experiencing frequent restarts of the Doppler operator across 5 different OVH Managed Kubernetes clusters. Every few hours the operator fails to renew its leader election lease because calls to the API server time out, loses leadership, and restarts.
Error Message
E1122 12:01:41.275002 1 leaderelection.go:361] Failed to update lock: Put "https://10.3.0.1:443/api/v1/namespaces/doppler-operator-system/configmaps/f39fa519.doppler.com": context deadline exceeded
I1122 12:01:41.275125 1 leaderelection.go:278] failed to renew lease doppler-operator-system/xxx.doppler.com: timed out waiting for the condition
2025-11-22T12:01:41.275Z ERROR setup problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:144
main.main
/workspace/main.go:103
runtime.main
/usr/local/go/src/runtime/proc.go:250
Environment
- Doppler Operator version: 1.5.1
- Kubernetes version: 1.30.14
- Cluster type: OVH Managed Kubernetes (affecting 5 different production clusters)
- Pod Resources: 100m CPU / 256Mi RAM
What I've Tried
- ✅ Reduced API server load by optimizing other operators (Velero sync periods from 1m → 10m)
- ✅ Verified the operator has sufficient CPU/memory resources
- ❌ Tried to increase the leader election timeouts, but the operator does not expose the relevant flags
Impact
The operator restarts every few hours, causing:
- Brief interruptions in secret synchronization
- Alert noise and operational overhead
- Concerns about reliability in production
Suggested Fix
Could you expose the standard controller-runtime leader election settings as flags? This would let me test whether increasing the timeouts resolves the issue on managed Kubernetes platforms:
--leader-elect-lease-duration (default: 15s)
--leader-elect-renew-deadline (default: 10s)
--leader-elect-retry-period (default: 2s)
The current 10s renew deadline seems too aggressive for managed clusters, where API server latency can occasionally spike. Being able to configure these values (e.g., 30s/20s/5s) would help determine whether this is just a timing issue or a deeper problem.
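For reference, here is a minimal sketch of what exposing these settings could look like, assuming the operator's main.go builds its manager with controller-runtime's ctrl.NewManager (I don't know the actual setup code, and the flag names below are only a suggestion borrowed from the kube-controller-manager conventions above):

```go
package main

import (
	"flag"
	"os"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Hypothetical flags; the defaults match controller-runtime's built-in
	// leader election defaults (15s / 10s / 2s).
	leaseDuration := flag.Duration("leader-elect-lease-duration", 15*time.Second,
		"Duration that non-leader candidates wait before trying to acquire leadership.")
	renewDeadline := flag.Duration("leader-elect-renew-deadline", 10*time.Second,
		"Duration the acting leader retries refreshing its lease before giving up.")
	retryPeriod := flag.Duration("leader-elect-retry-period", 2*time.Second,
		"Duration between leader election retries.")
	flag.Parse()

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:   true,
		LeaderElectionID: "f39fa519.doppler.com",
		// controller-runtime accepts these as *time.Duration, so the parsed
		// flag values can be passed straight through.
		LeaseDuration: leaseDuration,
		RenewDeadline: renewDeadline,
		RetryPeriod:   retryPeriod,
	})
	if err != nil {
		os.Exit(1)
	}
	_ = mgr // set up controllers and call mgr.Start(ctrl.SetupSignalHandler()) as usual
}
```

With something like this in place, adding e.g. --leader-elect-renew-deadline=20s to the deployment args would be enough to test the 30s/20s/5s combination on the affected clusters.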
Additional Context
This issue appears specific to managed Kubernetes environments where we don't control the API server performance.
Happy to provide more logs or help test a fix if needed!