Reload kubeconfig on disk changes for remote-cluster manager (CA / API server endpoint rotation without manual restart) #922

@ricogu

Which component does this relate to?

Manager bootstrap (cmd/main.go) — specifically the ctrl.GetConfigOrDie() call when the operator runs against a remote cluster with a kubeconfig mounted from a Secret/ConfigMap. Affects any deployment topology where the operator's kubeconfig is rotated by an external system (e.g. a Gardener-based runtime cluster, EKS IRSA renewal) rather than baked into the image or the in-cluster ServiceAccount.

What is the reason for this feature request or change?

When metal-operator-controller-manager is deployed against a remote cluster, it loads its kubeconfig once at startup via:

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{...})

The resulting rest.Config is held by the manager for the process lifetime. client-go's transport handles two of three rotation cases transparently:

| Rotation type | Form | Handled? |
|---|---|---|
| Bearer token | tokenFile: (path) | ✅ transport re-reads BearerTokenFile periodically (the token is cached for about a minute) |
| CA certificate | certificate-authority: (path) | ✅ since client-go v0.36, ClientsAllowCARotation (Beta, default-on) reloads from the file at most every 5 min (KEP-4222, atomicTransportHolder) |
| CA certificate | certificate-authority-data: (bytes) | ❌ frozen: embedded bytes are excluded by the feature gate's len(c.TLS.CAData) == 0 check |
| API server endpoint | server: <url> | ❌ no reload mechanism exists; rest.Config.Host is a string, baked into every *http.Transport |

The third case (endpoint change) is the only one that is genuinely unhandled today. It happens during runtime cluster rebuilds, control-plane zone migrations, and DR scenarios. The operator silently retains the stale endpoint until someone deletes the pod manually.

A secondary issue affects deployments whose kubeconfig embeds the CA as certificate-authority-data: bytes rather than a path reference; those deployments don't benefit from ClientsAllowCARotation and also need a manual restart on CA rotation.

Empirical evidence from a production deployment (metal-operator-controller-manager image sha-715247e, client-go v0.36.1):

  • The operator's kubeconfig is provided as a ConfigMap referencing the CA + token via paths, so token + CA rotation already work without a restart. Pod uptime: 5 days, 0 restarts, actively reconciling.
  • The same kubeconfig hardcodes clusters[0].cluster.server, which would require manual intervention if the API server endpoint ever changes.

So the fix is narrower than "implement credential reload": the goal is specifically to detect on-disk kubeconfig changes that client-go cannot pick up automatically (endpoint, embedded-bytes CA) and to recover gracefully.

Describe the feature

Add an opt-in watcher that detects changes to a configurable kubeconfig path (and any credential mount directories it references), and triggers a graceful manager shutdown so the pod restarts with fresh configuration.

Concretely:

  1. New flag --watch-kubeconfig (default false, off-by-default to preserve existing behaviour).
  2. New flag --kubeconfig-watch-paths (comma-separated list of additional directories; when empty, only the directory containing KUBECONFIG is watched) so deployers with split mounts (e.g. ConfigMap kubeconfig + Secret-mounted credentials) can cover both directories.
  3. New internal package internal/kubeconfigwatcher that uses fsnotify to watch each directory, tracks the resolved symlink targets of each watched file, and signals via a channel when any target changes. This mirrors the proven pattern in mcm-provider-ironcore-metal/pkg/client/provider.go:132-169 (Kubernetes secret/configmap mounts use ..data symlink swaps, so directory watching with target comparison is required — file-level watches miss the events).
  4. On change: log the event, cancel the manager's root context, allow mgr.Start() to return, and call os.Exit(0). The kubelet then restarts the container, which loads the fresh kubeconfig from the updated mount.
  5. Set LeaderElectionReleaseOnCancel: true so the leader lease is released cleanly during shutdown — collapses the new pod's lease-acquisition wait from ~15 s to ~0 s. Currently commented out at cmd/main.go:397; safe to enable because main does no post-shutdown work.
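
The target-comparison idea in step 3 can be sketched with the standard library alone. This is a minimal illustration, not the proposed internal/kubeconfigwatcher package: snapshotTargets, targetsChanged and must are hypothetical names, the fsnotify wiring is left out (in a real watcher a directory event would trigger the re-snapshot), and main only imitates the kubelet's ..data symlink swap inside a temp directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// snapshotTargets maps every entry in dir (non-recursive) to its fully
// resolved path. In a Secret/ConfigMap mount the visible files resolve
// through the ..data symlink, so a swap changes the resolved path even
// though the visible file name stays the same.
func snapshotTargets(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	targets := make(map[string]string, len(entries))
	for _, e := range entries {
		p := filepath.Join(dir, e.Name())
		resolved, err := filepath.EvalSymlinks(p)
		if err != nil {
			continue // entry vanished mid-swap; the next snapshot catches it
		}
		targets[p] = resolved
	}
	return targets, nil
}

// targetsChanged reports whether any path was added, removed, or re-pointed.
func targetsChanged(prev, cur map[string]string) bool {
	if len(prev) != len(cur) {
		return true
	}
	for p, t := range prev {
		if ct, ok := cur[p]; !ok || ct != t {
			return true
		}
	}
	return false
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "kubeconfig-mount")
	must(err)
	defer os.RemoveAll(dir)

	// Imitate the mount layout: config -> ..data/config, ..data -> v1.
	for _, v := range []string{"v1", "v2"} {
		must(os.Mkdir(filepath.Join(dir, v), 0o755))
		must(os.WriteFile(filepath.Join(dir, v, "config"), []byte(v), 0o600))
	}
	must(os.Symlink("v1", filepath.Join(dir, "..data")))
	must(os.Symlink(filepath.Join("..data", "config"), filepath.Join(dir, "config")))

	before, err := snapshotTargets(dir)
	must(err)

	// Atomic swap: create a new symlink and rename it over ..data.
	tmp := filepath.Join(dir, "..data_tmp")
	must(os.Symlink("v2", tmp))
	must(os.Rename(tmp, filepath.Join(dir, "..data")))

	after, err := snapshotTargets(dir)
	must(err)
	fmt.Println("changed:", targetsChanged(before, after))
}
```

One limitation of this sketch: a file that is rewritten in place (no symlink swap) keeps the same resolved path, so a watcher that also needs to cover plain files would have to compare content hashes or modification times as well.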

End-to-end downtime per rotation event: ~10–30 s (reconcile drain + container start from a cached image + cache warm-up). Comparable to any other operator that handles credential rotation by restart.
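
The shutdown sequence in steps 4 and 5 could look roughly like the following. This is a sketch under assumptions, using only the standard library: runUntilChange and simulateRotation are hypothetical names, the start function stands in for mgr.Start, and the changed channel stands in for whatever the watcher would signal on. Setting LeaderElectionReleaseOnCancel in ctrl.Options is not shown.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// runUntilChange runs start under a cancellable context and cancels that
// context when changed fires. The boolean result is true when the shutdown
// was triggered by the watcher rather than by start returning on its own.
func runUntilChange(parent context.Context, start func(context.Context) error, changed <-chan struct{}) (bool, error) {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- start(ctx) }()

	select {
	case err := <-done:
		// The manager stopped by itself (signal, startup failure, ...).
		return false, err
	case <-changed:
		cancel()
		err := <-done // wait for the manager to drain before exiting
		if errors.Is(err, context.Canceled) {
			err = nil // cancellation is the expected outcome here
		}
		return true, err
	}
}

// simulateRotation feeds a single change event to a fake manager.
func simulateRotation() (bool, error) {
	changed := make(chan struct{}, 1)
	changed <- struct{}{}

	fakeStart := func(ctx context.Context) error {
		<-ctx.Done() // a real manager would stop its runnables here
		return nil
	}
	return runUntilChange(context.Background(), fakeStart, changed)
}

func main() {
	restart, err := simulateRotation()
	fmt.Println("restart:", restart, "err:", err)
	// In cmd/main.go, a true restart value with a nil error is the point
	// where the process would exit with status 0.
}
```

Waiting on done after cancel is the part that matters: exiting before start has returned would skip the drain and, with leader election enabled, the lease release.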

Proposed API or behavior changes

No CRD changes. Two new cmd/main.go flags:

--watch-kubeconfig                        bool, default false
    If true, watch the kubeconfig (and any directories listed in
    --kubeconfig-watch-paths) for changes and gracefully shut down on
    detected change so kubelet restarts the pod with fresh credentials.

--kubeconfig-watch-paths                  string, default ""
    Comma-separated list of additional directories to watch for credential
    files referenced by the kubeconfig (e.g. CA bundle, token file). If
    empty, only the directory containing the kubeconfig is watched.
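
A minimal sketch of how the two flags might be parsed and combined, using only the standard library flag package. resolveWatchPaths is a hypothetical helper, and two assumptions are made here that the proposal does not settle: KUBECONFIG holds a single path, and --kubeconfig-watch-paths adds to the kubeconfig directory rather than replacing it (as the help text above suggests).

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveWatchPaths returns the directory containing the kubeconfig followed
// by any extra comma-separated directories, cleaned and deduplicated.
func resolveWatchPaths(kubeconfig, extra string) []string {
	var dirs []string
	seen := map[string]bool{}
	add := func(d string) {
		d = filepath.Clean(d)
		if !seen[d] {
			seen[d] = true
			dirs = append(dirs, d)
		}
	}
	if kubeconfig != "" {
		add(filepath.Dir(kubeconfig))
	}
	for _, d := range strings.Split(extra, ",") {
		if d = strings.TrimSpace(d); d != "" {
			add(d)
		}
	}
	return dirs
}

func main() {
	fs := flag.NewFlagSet("manager", flag.ExitOnError)
	watch := fs.Bool("watch-kubeconfig", false,
		"Watch the kubeconfig for changes and shut down gracefully on change.")
	paths := fs.String("kubeconfig-watch-paths", "",
		"Comma-separated list of additional directories to watch.")
	_ = fs.Parse(os.Args[1:]) // ExitOnError: Parse exits on bad input

	if !*watch {
		fmt.Println("kubeconfig watcher disabled")
		return
	}
	fmt.Println("watching:", resolveWatchPaths(os.Getenv("KUBECONFIG"), *paths))
}
```

Keeping the path resolution in a pure function makes the default ("only the directory containing the kubeconfig") easy to unit-test without touching the filesystem.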

Deployment-side recommendation that the README / chart can document:

  • Prefer kubeconfig path form (certificate-authority: + tokenFile:) over embedded-bytes form. With path form on client-go ≥ v0.36, ClientsAllowCARotation handles CA rotation automatically; no watcher needed for that case.

Alternatives considered

  1. In-process manager rebuild (cancel inner ctx, build a new manager, re-register all controllers + webhooks + indexers + runnables). Saves ~5 s of downtime versus restart but adds significant code (~500 LOC) and many failure modes (leaked goroutines, webhook port re-bind contention, partial re-registration). Webhook listener still has to tear down and rebind on the same port, so the gap reduction over restart is small. Not worth the complexity.
  2. Hot-swap client.Client only (literal mcm-provider-ironcore-metal pattern). Insufficient: controller-runtime's caches and informers hold transports built from the original rest.Config. Swapping the user-facing client doesn't replace those, so endpoint changes wouldn't be picked up by mgr.GetCache()-backed reads or ongoing watches. Works for MCM because MCM doesn't run a controller-runtime manager.
  3. Periodic re-read of the kubeconfig instead of fsnotify. Polling adds detection latency for no code-complexity win; fsnotify is already a transitive dependency via controller-runtime's pkg/certwatcher.
  4. Helm chart annotation-driven reload (e.g. stakater/reloader, no Go code change). Sufficient for deployments where the kubeconfig ConfigMap/Secret is updated via helm rollouts that the reloader can observe, but doesn't help when credentials are rotated out-of-band by an external controller (e.g. Gardener token-requestor writing directly to the Secret). A Go-side watcher complements rather than competes with this.

Additional context
