[WIP-DNM] Add retry logic to kubeclient #1473

Open
bpickard22 wants to merge 1 commit into k8snetworkplumbingwg:master from bpickard22:upstream-timeout

Conversation

@bpickard22
Collaborator

Add retry logic to kubeletClient to handle ResourceExhausted gRPC errors when churning many pods.

Multus currently fails on the first rate-limit error, which causes CRI-O to delete and recreate the pod, adding a significant amount of churn.

Now we retry with an increasing delay, giving the rate limiter time to reset.

Assisted by: Claude Sonnet 4.5

(cherry picked from commit a02aec23f888c5036579cbc480fdec0373a2ffdf)

Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
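The retry-with-backoff behavior described above can be sketched with a small, self-contained Go example. The constant names and values here are illustrative, not necessarily those in the PR, and `errResourceExhausted` stands in for a gRPC `codes.ResourceExhausted` status:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative tuning knobs; the PR's actual constants may differ.
const (
	maxRetries         = 4
	initialRetryDelay  = 1 * time.Millisecond
	maxRetryDelay      = 10 * time.Millisecond
	retryBackoffFactor = 2
)

// errResourceExhausted stands in for a gRPC ResourceExhausted status.
var errResourceExhausted = errors.New("resource exhausted")

// retryWithBackoff calls fn up to maxRetries+1 times, sleeping with an
// exponentially increasing, capped delay after each rate-limit error.
// Non-rate-limit errors are returned immediately.
func retryWithBackoff(fn func() error) error {
	retryDelay := initialRetryDelay
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err := fn()
		if err == nil {
			return nil
		}
		if errors.Is(err, errResourceExhausted) && attempt < maxRetries {
			time.Sleep(retryDelay)
			retryDelay *= retryBackoffFactor
			if retryDelay > maxRetryDelay {
				retryDelay = maxRetryDelay
			}
			continue
		}
		return err
	}
	return nil // required by the compiler; never reached in practice
}

func main() {
	calls := 0
	err := retryWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errResourceExhausted
		}
		return nil
	})
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

The key property is that a transient rate-limit burst costs only a bounded sleep, rather than a failed CNI ADD that makes the runtime tear the pod down and start over.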
@gemini-code-assist

Summary of Changes

Hello @bpickard22, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the kubeletClient by integrating retry logic to gracefully handle ResourceExhausted gRPC errors. Previously, Multus would fail immediately upon encountering such an error, leading to unnecessary pod churn. The new mechanism employs an exponential backoff strategy, allowing the client to reattempt resource listing with increasing delays, thereby improving the stability and resilience of Multus in environments experiencing temporary rate limiting from the kubelet.

Highlights

  • Retry Logic Implementation: Implemented retry logic with exponential backoff for getPodResources calls in kubeletClient when encountering ResourceExhausted gRPC errors.
  • Configurable Retry Parameters: Introduced new constants to configure the retry mechanism, including maximum retries, initial delay, maximum delay, and backoff factor.
  • Comprehensive Test Coverage: Added comprehensive unit tests to validate the retry behavior, covering scenarios where retries succeed after initial failures and where they eventually fail after exceeding the maximum attempts.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces valuable retry logic with exponential backoff to the kubelet client, making it more resilient to transient ResourceExhausted gRPC errors. This is a good approach to improve the system's stability under high load. My review includes suggestions to refine the error handling for better clarity and correctness, remove a piece of unreachable code, and refactor the new tests to improve maintainability by reducing code duplication. Overall, the changes are positive and with a few adjustments will be a solid improvement.

Comment on lines +137 to 154
if st, ok := status.FromError(err); ok && st.Code() == codes.ResourceExhausted {
	if attempt < maxRetries {
		logging.Debugf("getPodResources: rate limit hit (attempt %d/%d), retrying after %v: %v",
			attempt+1, maxRetries+1, retryDelay, err)
		time.Sleep(retryDelay)

		// Exponential backoff with cap
		retryDelay *= retryBackoffFactor
		if retryDelay > maxRetryDelay {
			retryDelay = maxRetryDelay
		}
		continue
	}
	logging.Errorf("getPodResources: rate limit exceeded after %d attempts", maxRetries+1)
}

// For non-rate-limit errors or final retry attempt, return the error
return logging.Errorf("getPodResources: failed to list pod resources, %v.Get(_) = _, %v", client, err)


high

The error handling for retries can be simplified and made more robust. Currently, when the final retry attempt fails due to rate limiting, a specific error message is logged (line 150) but its value is discarded. Instead, a more generic error from line 154 is returned, which can be misleading during debugging. It would be better to return the more specific error message that indicates the rate limit was exceeded after all attempts.

Additionally, the format string "%v.Get(_) = _, %v" on line 154 is unconventional and could be simplified to just include the error itself for better readability.

The suggested change addresses these points. Note that applying this suggestion will require updating the assertion in the test should fail after max retries with continuous rate limiting to check for the new, more specific error message.

                if attempt < maxRetries {
                    logging.Debugf("getPodResources: rate limit hit (attempt %d/%d), retrying after %v: %v",
                        attempt+1, maxRetries+1, retryDelay, err)
                    time.Sleep(retryDelay)

                    // Exponential backoff with cap
                    retryDelay *= retryBackoffFactor
                    if retryDelay > maxRetryDelay {
                        retryDelay = maxRetryDelay
                    }
                    continue
                }
                return logging.Errorf("getPodResources: rate limit exceeded after %d attempts: %v", maxRetries+1, err)
            }

            // For non-rate-limit errors, return the error
            return logging.Errorf("getPodResources: failed to list pod resources: %v", err)


rc.resources = resp.PodResources
return nil
return logging.Errorf("getPodResources: failed to list pod resources, %v.Get(_) = _, %v", client, err)


medium

This return statement appears to be unreachable. The for loop above is structured to always exit via a return statement within the loop body if an error occurs, or if the call is successful. The loop will never complete its iterations and fall through to this line. This statement can be safely removed.
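The reviewer's point can be seen in a minimal reduction (a sketch with hypothetical names, not the PR's actual code): every loop iteration either returns or continues, and the final iteration cannot continue, so a statement placed after the loop never executes when the loop body runs at least once.

```go
package main

import (
	"errors"
	"fmt"
)

var errRateLimited = errors.New("rate limited")

// poll retries fn up to attempts times. Each iteration either returns
// (success, or failure on the last attempt) or continues, so control
// never falls out of the loop when attempts > 0.
func poll(attempts int, fn func() error) error {
	for i := 0; i < attempts; i++ {
		err := fn()
		if err == nil {
			return nil // success
		}
		if i < attempts-1 {
			continue // retry
		}
		return err // final attempt failed
	}
	return errors.New("unreachable when attempts > 0")
}

func main() {
	fmt.Println(poll(3, func() error { return errRateLimited }))
}
```

Note that Go still requires a terminating statement after such a loop when the compiler cannot prove the loop always returns, which is why the dead statement is easy to leave behind.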

Comment on lines +395 to +427
It("should fail after max retries with continuous rate limiting", func() {
// Create a server that always fails
alwaysFailServer := &rateLimitResourceServer{
server: grpc.NewServer(),
failCount: 100, // Always fail
currentCount: 0,
}

tempSocketDir, err := os.MkdirTemp("", "kubelet-always-fail-test")
Expect(err).NotTo(HaveOccurred())
defer os.RemoveAll(tempSocketDir)

testingPodResourcesPath := filepath.Join(tempSocketDir, defaultPodResourcesPath)
err = os.MkdirAll(testingPodResourcesPath, os.ModeDir)
Expect(err).NotTo(HaveOccurred())

alwaysFailSocketName := filepath.Join(testingPodResourcesPath, "kubelet-always-fail.sock")
alwaysFailSocket := localEndpoint(filepath.Join(testingPodResourcesPath, "kubelet-always-fail"))

podresourcesapi.RegisterPodResourcesListerServer(alwaysFailServer.server, alwaysFailServer)
lis, err := CreateListener(alwaysFailSocketName)
Expect(err).NotTo(HaveOccurred())
go alwaysFailServer.server.Serve(lis)
defer alwaysFailServer.server.Stop()

_, err = getKubeletClient(alwaysFailSocket)
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("failed to list pod resources"))

// Verify that max retries were attempted
finalCount := atomic.LoadInt32(&alwaysFailServer.currentCount)
Expect(finalCount).To(Equal(int32(maxRetries + 1)))
})


medium

This test case duplicates a significant amount of server setup logic that is already present in the BeforeEach block for the preceding test context. This repetition makes the tests harder to read and maintain.

Consider refactoring the common server setup into a helper function. This function could accept parameters like failCount and return the necessary components for the test (e.g., the server URL, the server instance, and a cleanup function), which would reduce code duplication and improve the overall structure of the test file.
