Feature Request: Add Leader Election Support for High Availability #1000

@gouthamreddykotapalle

Summary

The GKE operator currently does not support running multiple instances safely. This limits its ability to provide high availability and creates potential split-brain scenarios if multiple instances are accidentally deployed.

Problem

Current State:

  • Supports only a single operator instance
  • No protection against multiple instances processing the same resources
  • No built-in high availability or failover mechanism
  • Risk of race conditions and conflicts if scaled beyond 1 replica

Impact:

  • Single point of failure for GKE cluster management
  • Cannot scale operator for performance or availability
  • Manual intervention required if operator pod fails
  • Production deployments lack resilience

Proposed Solution

Implement leader election functionality to enable:

  1. Safe Multi-Instance Deployment: Multiple operator pods can run simultaneously with only one actively processing resources
  2. Automatic Failover: If the leader fails, another instance automatically takes over
  3. High Availability: Eliminates single point of failure
  4. Configurable Behavior: Allow disabling leader election for development scenarios

Technical Requirements

Core Functionality

  • Implement leader election using the Kubernetes coordination.k8s.io/leases API
  • Only the leader instance should run controllers and process GKE resources
  • Non-leader instances should stay idle and keep trying to acquire leadership
  • Leadership should transfer automatically when the leader fails
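
The lease behaviour described above can be illustrated with a minimal in-memory sketch. This is not the Kubernetes API or the wrangler library: the `lease` type, `tryAcquireOrRenew`, `simulateFailover`, the pod names, and the 15-second lease duration are all hypothetical and chosen only for illustration. A candidate takes the lease when it is free, already its own, or has not been renewed within the lease duration.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a simplified, in-memory stand-in for a coordination.k8s.io Lease.
// It is a hypothetical model, not the real API object.
type lease struct {
	holder    string        // identity of the current leader, "" if unheld
	renewedAt time.Time     // last time the holder renewed the lease
	duration  time.Duration // how long a renewal stays valid
}

// tryAcquireOrRenew reports whether id holds the lease after the call.
// It succeeds when the lease is free, already held by id, or expired.
func (l *lease) tryAcquireOrRenew(id string, now time.Time) bool {
	expired := now.Sub(l.renewedAt) > l.duration
	if l.holder != "" && l.holder != id && !expired {
		return false // another instance holds a still-valid lease
	}
	l.holder = id
	l.renewedAt = now
	return true
}

// simulateFailover runs two candidates against one lease on a 5-second tick
// and returns the lease holder observed after each tick. The hypothetical
// "pod-a" stops renewing after the third tick, as if it had crashed.
func simulateFailover() []string {
	start := time.Unix(0, 0)
	l := &lease{duration: 15 * time.Second}
	var leaders []string
	for tick := 0; tick <= 6; tick++ {
		now := start.Add(time.Duration(tick) * 5 * time.Second)
		if tick <= 2 {
			l.tryAcquireOrRenew("pod-a", now)
		}
		l.tryAcquireOrRenew("pod-b", now)
		leaders = append(leaders, l.holder)
	}
	return leaders
}

func main() {
	for tick, holder := range simulateFailover() {
		fmt.Printf("t=%2ds leader=%s\n", tick*5, holder)
	}
}
```

In this model the standby takes over only after the last renewal is older than the lease duration, so failover is not instantaneous. A real implementation would also need the update to be atomic on the API server side, which this sketch does not attempt to model.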

Configuration

  • Add a --leader-election flag (default: enabled)
  • Support namespace-aware leader election
  • Proper RBAC permissions for lease management

Deployment Updates

  • Update RBAC to include lease permissions
  • Add POD_NAMESPACE environment variable for namespace detection
  • Ensure backward compatibility with existing deployments
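
As a rough sketch of how the configuration and deployment items above could fit together, the following uses only the Go standard library. The `config` type and `parseConfig` function are hypothetical names, and falling back to `cattle-system` when `POD_NAMESPACE` is unset is an assumption made for illustration, not something this issue specifies.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultNamespace is an assumed fallback used when POD_NAMESPACE is unset.
const defaultNamespace = "cattle-system"

// config holds the settings relevant to leader election (hypothetical type).
type config struct {
	leaderElection bool
	namespace      string
}

// parseConfig reads the --leader-election flag (default: enabled) from args
// and the lease namespace from the POD_NAMESPACE environment variable.
// getenv is injected so the function can be exercised without touching
// the process environment.
func parseConfig(args []string, getenv func(string) string) (config, error) {
	fs := flag.NewFlagSet("gke-operator", flag.ContinueOnError)
	enabled := fs.Bool("leader-election", true, "enable leader election")
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	ns := getenv("POD_NAMESPACE")
	if ns == "" {
		ns = defaultNamespace
	}
	return config{leaderElection: *enabled, namespace: ns}, nil
}

func main() {
	cfg, err := parseConfig(os.Args[1:], os.Getenv)
	if err != nil {
		os.Exit(2)
	}
	fmt.Printf("leader election enabled=%t namespace=%s\n",
		cfg.leaderElection, cfg.namespace)
}
```

Defaulting the flag to enabled keeps a single-replica deployment working unchanged: the lone instance simply acquires the lease and proceeds, while `--leader-election=false` skips the election entirely for local development.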

Example Use Cases

Production High Availability:

kubectl scale deployment/gke-config-operator --replicas=3 -n cattle-system

Development Mode:

./gke-operator --leader-election=false

Acceptance Criteria

  • Multiple operator instances can run safely without conflicts
  • Only one instance processes GKE cluster operations at a time
  • Automatic failover when leader instance becomes unavailable
  • No breaking changes to existing single-instance deployments
  • Leader election can be disabled for development/testing
  • Proper logging indicates leader election status

Technical Implementation Notes

  • Use the existing github.com/rancher/wrangler/v3/pkg/leader library for consistency
  • Leader election resource name: gke-operator-leader
  • Lease resources are created in the same namespace as the operator deployment
  • Required RBAC: coordination.k8s.io/leases with full permissions (get, list, watch, create, update, patch, delete)

Related Work

This follows the same pattern as other Rancher operators (EKS, AKS) and aligns with Kubernetes controller best practices for leader election.

Priority

Medium-High. This enables production-grade high availability deployments and is essential for enterprise use cases that require resilient infrastructure management.
