Fix certificate rotation race condition in per-node certificates #1483
bpickard22 wants to merge 1 commit into k8snetworkplumbingwg:master from
Conversation
Add exponential backoff retry logic to handle transient certificate errors during daemon startup. This prevents crash-loops when pods restart during the certificate rotation window (~3 seconds) where the symlink is temporarily removed or the file is empty. We accomplish this by retrying if we get a transient error, but still fail fast on permanent errors such as an RBAC issue or a NoCertKeyError.

Changes:
- Add isTransientCertError() to classify errors as transient vs permanent
- Add exponential backoff retry (100ms, 200ms, 400ms, 800ms, 1.6s) within the existing 2-minute CSR approval timeout
- Transient errors (PathError, "failed to find any PEM data", etc.) are retried automatically
- Permanent errors (NoCertKeyError, RBAC issues) fail fast as before
- Add verbose logging on successful retry to aid debugging

Testing:
- Tested with missing certificate file during pod restart
- Pods successfully recover and start normally
- All existing unit tests pass (44/44)

Assisted by Claude Sonnet 4.5
Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
Summary of Changes (Gemini Code Assist): This pull request enhances the robustness of per-node certificate loading by introducing error classification and retry logic. It addresses a race condition during certificate rotation, where temporary unavailability or corruption of certificate files could lead to daemon crash-loops. By distinguishing transient from permanent errors and applying an exponential backoff strategy to the former, the daemon can now recover gracefully from temporary certificate issues, improving overall stability during certificate lifecycle management.
Code Review
This pull request introduces exponential backoff to handle transient errors during certificate loading, which is a solid approach to improve robustness during certificate rotation. The implementation is mostly sound, but I've identified a couple of areas for improvement. One of the error conditions being checked as transient seems incorrect for the given context, and there's also a block of unreachable code that can be removed to enhance clarity. My specific comments provide more details and suggestions for these points.
if strings.Contains(errMsg, "failed to find any PEM data") ||
    strings.Contains(errMsg, "invalid PEM") ||
    strings.Contains(errMsg, "certificate signed by unknown authority") {
    return true
}
The check for "certificate signed by unknown authority" appears to be incorrect in this context. This error typically arises during a TLS handshake when a server's certificate cannot be verified, not when parsing a certificate file from disk as certificateStore.Current() does. Including this check could lead to misclassifying other, non-transient errors. This might be a confusion with the check in newClientsetFn, where it is correctly applied to an API call. I recommend removing this condition.
if strings.Contains(errMsg, "failed to find any PEM data") ||
strings.Contains(errMsg, "invalid PEM") {
return true
}| if err == nil && retryCount > 0 && currentCert == nil { | ||
| logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr) | ||
| } |
This conditional block appears to be unreachable. wait.ExponentialBackoff returns a nil error only when its condition function has returned (true, nil). In this implementation, that happens only when currentCert != nil. Therefore, the condition err == nil && currentCert == nil can never be met. This block is dead code and should be removed to improve clarity. Removing this block will also make the firstTransientErr variable unused, so it and its assignments should be removed as well.
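The reviewer's reasoning can be checked with a simplified stand-in for wait.ExponentialBackoff (a hypothetical model, not the k8s.io/apimachinery implementation, but it shares the property at issue: nil is returned only after the condition returns (true, nil)):

```go
package main

import (
	"errors"
	"fmt"
)

var errWaitTimeout = errors.New("timed out waiting for the condition")

// exponentialBackoff is a simplified model of wait.ExponentialBackoff:
// it returns nil only if the condition function returns (true, nil).
func exponentialBackoff(steps int, condition func() (bool, error)) error {
	for i := 0; i < steps; i++ {
		done, err := condition()
		if err != nil {
			return err // condition error aborts immediately
		}
		if done {
			return nil // the only path that yields a nil error
		}
	}
	return errWaitTimeout
}

func main() {
	var currentCert *string
	err := exponentialBackoff(5, func() (bool, error) {
		// currentCert is set on the same iteration that returns true,
		// mirroring the PR's callback: done implies currentCert != nil.
		s := "cert"
		currentCert = &s
		return currentCert != nil, nil
	})
	// The dead-code condition "err == nil && currentCert == nil" never holds.
	fmt.Println(err == nil && currentCert == nil) // false
}
```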
This needs to be addressed.
// PathError (file not found, permission denied) could be transient during rotation
var pathErr *fs.PathError
if errors.As(err, &pathErr) {
The certificate store's loadFile function (https://github.com/pliurh/multus-cni/blob/ddd00fe48bda0bd13902d97ef02b3a2ec4a03c92/vendor/k8s.io/client-go/util/certificate/certificate_store.go#L194) wraps errors with fmt.Errorf("could not convert data from %q into cert/key pair: %v", ...) using %v, not %w, which destroys the type information, so the errors.As check above will never match a wrapped PathError.
if err == nil && retryCount > 0 && currentCert == nil {
    logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}
This needs to be addressed.
var retryCount int
var firstTransientErr error

err := wait.ExponentialBackoff(backoff, func() (bool, error) {
Is this inner polling really necessary?
    logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}

return currentCert != nil && err == nil, nil
The "permanent" errors don't actually fail fast; the outer callback always returns nil as its error.