Skip to content

Fix certificate rotation race condition in per-node certificates#1483

Open
bpickard22 wants to merge 1 commit intok8snetworkplumbingwg:masterfrom
bpickard22:cert-fix-upstream
Open

Fix certificate rotation race condition in per-node certificates#1483
bpickard22 wants to merge 1 commit intok8snetworkplumbingwg:masterfrom
bpickard22:cert-fix-upstream

Conversation

@bpickard22
Copy link
Copy Markdown
Collaborator

Add exponential backoff retry logic to handle transient certificate errors during daemon startup. This prevents crash-loops when pods restart during the certificate rotation window (~3 seconds) where the symlink is temporarily removed or the file is empty.

We accomplish this by retrying if we get a transient error, but still fail fast on permanent errors such as an RBAC issue or NoCertKey error

Changes:

  • Add isTransientCertError() to classify errors as transient vs permanent
  • Add exponential backoff retry (100ms, 200ms, 400ms, 800ms, 1.6s) within the existing 2-minute CSR approval timeout
  • Transient errors (PathError, 'failed to find any PEM data', etc.) are retried automatically
  • Permanent errors (NoCertKeyError, RBAC issues) fail fast as before
  • Add verbose logging on successful retry to aid debugging

Testing

  • Tested with missing certificate file during pod restart
  • Pods successfully recover and start normally

Assisted by Claude Sonnet 4.5

Add exponential backoff retry logic to handle transient certificate errors
during daemon startup. This prevents crash-loops when pods restart during
the certificate rotation window (~3 seconds) where the symlink is temporarily
removed or the file is empty.

We accomplish this by retrying if we get a tranient error, but still
fail fast on permanent errors such as an RBAC issue or NoCertKey error

Changes:
- Add isTransientCertError() to classify errors as transient vs permanent
- Add exponential backoff retry (100ms, 200ms, 400ms, 800ms, 1.6s) within
  the existing 2-minute CSR approval timeout
- Transient errors (PathError, 'failed to find any PEM data', etc.) are
  retried automatically
- Permanent errors (NoCertKeyError, RBAC issues) fail fast as before
- Add verbose logging on successful retry to aid debugging

Testing
- Tested with missing certificate file during pod restart
- Pods successfully recover and start normally
- All existing unit tests pass (44/44)

Assisted by Claude Sonnet 4.5

Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of per-node certificate loading by introducing intelligent error handling and retry logic. It addresses a critical race condition that could occur during certificate rotation, where temporary unavailability or corruption of certificate files might lead to daemon crash-loops. By distinguishing between transient and permanent errors and applying an exponential backoff strategy for the former, the system can now gracefully recover from temporary certificate issues, thereby improving overall stability and reliability during the certificate lifecycle management.

Highlights

  • Error Classification: Introduced isTransientCertError() to classify certificate loading errors as transient (e.g., temporary file issues during rotation) or permanent (e.g., missing certificate key, RBAC issues).
  • Exponential Backoff Retry: Implemented an exponential backoff retry mechanism (100ms, 200ms, 400ms, 800ms, 1.6s) for transient certificate errors within the existing 2-minute CSR approval timeout.
  • Resilient Certificate Loading: Ensured that transient errors (e.g., PathError, PEM parsing issues, unknown authority) are retried automatically, while permanent errors (e.g., NoCertKeyError) continue to fail fast.
  • Improved Logging: Added verbose logging to indicate successful certificate loading after retries, which aids in debugging race conditions during certificate rotation.
Changelog
  • pkg/k8sclient/kubeconfig.go
    • Added isTransientCertError function to categorize certificate errors.
    • Modified PerNodeK8sClient to incorporate exponential backoff retry logic for transient certificate loading failures.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces exponential backoff to handle transient errors during certificate loading, which is a solid approach to improve robustness during certificate rotation. The implementation is mostly sound, but I've identified a couple of areas for improvement. One of the error conditions being checked as transient seems incorrect for the given context, and there's also a block of unreachable code that can be removed to enhance clarity. My specific comments provide more details and suggestions for these points.

Comment on lines +81 to +85
if strings.Contains(errMsg, "failed to find any PEM data") ||
strings.Contains(errMsg, "invalid PEM") ||
strings.Contains(errMsg, "certificate signed by unknown authority") {
return true
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check for "certificate signed by unknown authority" appears to be incorrect in this context. This error typically arises during a TLS handshake when a server's certificate cannot be verified, not when parsing a certificate file from disk as certificateStore.Current() does. Including this check could lead to misclassifying other, non-transient errors. This might be a confusion with the check in newClientsetFn, where it is correctly applied to an API call. I recommend removing this condition.

    if strings.Contains(errMsg, "failed to find any PEM data") ||
        strings.Contains(errMsg, "invalid PEM") {
        return true
    }

Comment on lines +220 to +222
if err == nil && retryCount > 0 && currentCert == nil {
logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This conditional block appears to be unreachable. wait.ExponentialBackoff returns a nil error only when its condition function has returned (true, nil). In this implementation, that happens only when currentCert != nil. Therefore, the condition err == nil && currentCert == nil can never be met. This block is dead code and should be removed to improve clarity. Removing this block will also make the firstTransientErr variable unused, so it and its assignments should be removed as well.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be addressed.


// PathError (file not found, permission denied) could be transient during rotation
var pathErr *fs.PathError
if errors.As(err, &pathErr) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the certificate store's loadFile function https://github.com/pliurh/multus-cni/blob/ddd00fe48bda0bd13902d97ef02b3a2ec4a03c92/vendor/k8s.io/client-go/util/certificate/certificate_store.go#L194 wraps errors with fmt.Errorf("could not convert data from %q into cert/key pair: %v", ...) using %v not %w, which destroys the type information.

Comment on lines +220 to +222
if err == nil && retryCount > 0 && currentCert == nil {
logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be addressed.

var retryCount int
var firstTransientErr error

err := wait.ExponentialBackoff(backoff, func() (bool, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this inner polling really necessary?

logging.Debugf("Exhausted %d retries for transient cert error, will retry poll: %v", retryCount, firstTransientErr)
}

return currentCert != nil && err == nil, nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "permanent" errors don't actually fail fast; the outer callback always returns nil as its error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants