Leader election fails on managed Kubernetes clusters - need configurable timeouts #95

@cmer81

Hello :)

Description

I'm experiencing frequent crashes of the Doppler operator across 5 different OVH managed Kubernetes clusters. The operator loses leader election and restarts every few hours due to API server timeout issues.

Error Message

E1122 12:01:41.275002       1 leaderelection.go:361] Failed to update lock: Put "https://10.3.0.1:443/api/v1/namespaces/doppler-operator-system/configmaps/f39fa519.doppler.com": context deadline exceeded
I1122 12:01:41.275125       1 leaderelection.go:278] failed to renew lease doppler-operator-system/xxx.doppler.com: timed out waiting for the condition
2025-11-22T12:01:41.275Z    ERROR   setup   problem running manager {"error": "leader election lost"}
github.com/go-logr/zapr.(*zapLogger).Error
    /go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error
    /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:144
main.main
    /workspace/main.go:103
runtime.main
    /usr/local/go/src/runtime/proc.go:250

Environment

  • Doppler Operator version: 1.5.1
  • Kubernetes version: 1.30.14
  • Cluster type: OVH Managed Kubernetes (affecting 5 different production clusters)
  • Pod Resources: 100m CPU / 256Mi RAM

What I've Tried

  • ✅ Reduced API server load by optimizing other operators (Velero sync periods from 1m → 10m)
  • ✅ Verified the operator has sufficient CPU/memory resources
  • ❌ Tried to increase the leader election timeouts, but the operator doesn't expose those flags

Impact

The operator restarts every few hours, causing:

  • Brief interruptions in secret synchronization
  • Alert noise and operational overhead
  • Concerns about reliability in production

Suggested Fix

Could you expose the standard controller-runtime leader election flags? This would let me test whether increasing the timeouts resolves the issue on managed Kubernetes platforms:

--leader-elect-lease-duration (default: 15s)
--leader-elect-renew-deadline (default: 10s)
--leader-elect-retry-period (default: 2s)

The default 10s renew deadline seems too aggressive for managed clusters, where API server latency can occasionally spike. Being able to configure these values (e.g., 30s/20s/5s) would help determine whether this is just a timing issue or a deeper problem.
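
For reference, here is a minimal sketch of how those options could be wired up in a controller-runtime based main.go. This is illustrative only and not the operator's actual code: the flag names follow the kube-controller-manager convention above, the variable names are made up, and the leader election ID is taken from the log output. The LeaseDuration, RenewDeadline, and RetryPeriod fields on ctrl.Options are the standard controller-runtime ones.

package main

import (
    "flag"
    "log"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
    var leaseDuration, renewDeadline, retryPeriod time.Duration

    // Defaults match controller-runtime's built-in values (15s / 10s / 2s).
    flag.DurationVar(&leaseDuration, "leader-elect-lease-duration", 15*time.Second,
        "Duration that non-leader candidates wait before attempting to acquire leadership.")
    flag.DurationVar(&renewDeadline, "leader-elect-renew-deadline", 10*time.Second,
        "Duration the acting leader keeps retrying lease renewal before giving up leadership.")
    flag.DurationVar(&retryPeriod, "leader-elect-retry-period", 2*time.Second,
        "Duration leader election clients wait between retries of lease actions.")
    flag.Parse()

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        LeaderElection:   true,
        LeaderElectionID: "f39fa519.doppler.com", // ID seen in the log above
        // controller-runtime treats nil as "use the default", hence the pointers.
        LeaseDuration: &leaseDuration,
        RenewDeadline: &renewDeadline,
        RetryPeriod:   &retryPeriod,
    })
    if err != nil {
        log.Fatalf("unable to create manager: %v", err)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        log.Fatalf("problem running manager: %v", err)
    }
}

With something like this in place, the operator Deployment could pass, for example, --leader-elect-lease-duration=30s --leader-elect-renew-deadline=20s --leader-elect-retry-period=5s to try the more relaxed timings on the affected clusters.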

Additional Context

This issue appears specific to managed Kubernetes environments where we don't control the API server performance.

Happy to provide more logs or help test a fix if needed!
