Fix certificate rotation race condition in per-node certificates #1483
bpickard22 wants to merge 1 commit into k8snetworkplumbingwg:master from
Conversation
Add exponential backoff retry logic to handle transient certificate errors during daemon startup. This prevents crash-loops when pods restart during the certificate rotation window (~3 seconds) where the symlink is temporarily removed or the file is empty. We accomplish this by retrying if we get a transient error, but still fail fast on permanent errors such as an RBAC issue or a NoCertKeyError.

Changes:
- Add isTransientCertError() to classify errors as transient vs permanent
- Add exponential backoff retry (100ms, 200ms, 400ms, 800ms, 1.6s) within the existing 2-minute CSR approval timeout
- Transient errors (PathError, "failed to find any PEM data", etc.) are retried automatically
- Permanent errors (NoCertKeyError, RBAC issues) fail fast as before
- Add verbose logging on successful retry to aid debugging

Testing:
- Tested with missing certificate file during pod restart
- Pods successfully recover and start normally
- All existing unit tests pass (44/44)

Assisted by Claude Sonnet 4.5
Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
Summary of Changes (Gemini Code Assist): This pull request enhances the robustness of per-node certificate loading by introducing error classification and retry logic. It addresses a race condition during certificate rotation, where temporary unavailability or corruption of certificate files could lead to daemon crash-loops. By distinguishing transient from permanent errors and applying an exponential backoff strategy to the former, the daemon can now recover gracefully from temporary certificate issues, improving overall stability during certificate lifecycle management.
Code Review
This pull request introduces exponential backoff to handle transient errors during certificate loading, which is a solid approach to improve robustness during certificate rotation. The implementation is mostly sound, but I've identified a couple of areas for improvement. One of the error conditions being checked as transient seems incorrect for the given context, and there's also a block of unreachable code that can be removed to enhance clarity. My specific comments provide more details and suggestions for these points.
if strings.Contains(errMsg, "failed to find any PEM data") ||
    strings.Contains(errMsg, "invalid PEM") ||
    strings.Contains(errMsg, "certificate signed by unknown authority") {
    return true
}
The check for "certificate signed by unknown authority" appears to be incorrect in this context. This error typically arises during a TLS handshake when a server's certificate cannot be verified, not when parsing a certificate file from disk as certificateStore.Current() does. Including this check could lead to misclassifying other, non-transient errors. This might be a confusion with the check in newClientsetFn, where it is correctly applied to an API call. I recommend removing this condition.
if strings.Contains(errMsg, "failed to find any PEM data") ||
strings.Contains(errMsg, "invalid PEM") {
return true
}| if err == nil && retryCount > 0 && currentCert == nil { | ||
| logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr) | ||
| } |
This conditional block appears to be unreachable. wait.ExponentialBackoff returns a nil error only when its condition function has returned (true, nil). In this implementation, that happens only when currentCert != nil. Therefore, the condition err == nil && currentCert == nil can never be met. This block is dead code and should be removed to improve clarity. Removing this block will also make the firstTransientErr variable unused, so it and its assignments should be removed as well.
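The reviewer's reasoning can be checked with a simplified stand-in for wait.ExponentialBackoff (a hypothetical model, not the k8s.io/apimachinery implementation, but it shares the property at issue: nil is returned only after the condition returns (true, nil)):

```go
package main

import (
	"errors"
	"fmt"
)

var errWaitTimeout = errors.New("timed out waiting for the condition")

// exponentialBackoff is a simplified model of wait.ExponentialBackoff:
// it returns nil only if the condition function returns (true, nil).
func exponentialBackoff(steps int, condition func() (bool, error)) error {
	for i := 0; i < steps; i++ {
		done, err := condition()
		if err != nil {
			return err // condition error aborts immediately
		}
		if done {
			return nil // the only path that yields a nil error
		}
	}
	return errWaitTimeout
}

func main() {
	var currentCert *string
	err := exponentialBackoff(5, func() (bool, error) {
		// currentCert is set on the same iteration that returns true,
		// mirroring the PR's callback: done implies currentCert != nil.
		s := "cert"
		currentCert = &s
		return currentCert != nil, nil
	})
	// The dead-code condition "err == nil && currentCert == nil" never holds.
	fmt.Println(err == nil && currentCert == nil) // false
}
```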
This needs to be addressed.
// PathError (file not found, permission denied) could be transient during rotation
var pathErr *fs.PathError
if errors.As(err, &pathErr) {
The certificate store's loadFile function (https://github.com/pliurh/multus-cni/blob/ddd00fe48bda0bd13902d97ef02b3a2ec4a03c92/vendor/k8s.io/client-go/util/certificate/certificate_store.go#L194) wraps errors with fmt.Errorf("could not convert data from %q into cert/key pair: %v", ...) using %v, not %w, which destroys the type information, so the errors.As check above will never match a wrapped PathError.
if err == nil && retryCount > 0 && currentCert == nil {
    logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}
This needs to be addressed.
var retryCount int
var firstTransientErr error

err := wait.ExponentialBackoff(backoff, func() (bool, error) {
Is this inner polling really necessary?
    logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}

return currentCert != nil && err == nil, nil
The "permanent" errors don't actually fail fast; the outer callback always returns nil as its error.