[WIP-DNM] Add retry logic to kubeclient #1473
bpickard22 wants to merge 1 commit into k8snetworkplumbingwg:master from
Conversation
Add retry logic to kubeletClient to handle resource-exhausted gRPC errors when churning many pods. Multus currently fails on the first rate-limit error, which causes CRI-O to delete and recreate the pod, adding a significant amount of churn. Now we retry with an increasing delay, giving the rate limiter time to reset.

Assisted by: Claude Sonnet 4.5
Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
(cherry picked from commit a02aec23f888c5036579cbc480fdec0373a2ffdf)
Code Review
This pull request introduces valuable retry logic with exponential backoff to the kubelet client, making it more resilient to transient ResourceExhausted gRPC errors. This is a good approach to improve the system's stability under high load. My review includes suggestions to refine the error handling for better clarity and correctness, remove a piece of unreachable code, and refactor the new tests to improve maintainability by reducing code duplication. Overall, the changes are positive and with a few adjustments will be a solid improvement.
```go
if st, ok := status.FromError(err); ok && st.Code() == codes.ResourceExhausted {
	if attempt < maxRetries {
		logging.Debugf("getPodResources: rate limit hit (attempt %d/%d), retrying after %v: %v",
			attempt+1, maxRetries+1, retryDelay, err)
		time.Sleep(retryDelay)

		// Exponential backoff with cap
		retryDelay *= retryBackoffFactor
		if retryDelay > maxRetryDelay {
			retryDelay = maxRetryDelay
		}
		continue
	}
	logging.Errorf("getPodResources: rate limit exceeded after %d attempts", maxRetries+1)
}

// For non-rate-limit errors or final retry attempt, return the error
return logging.Errorf("getPodResources: failed to list pod resources, %v.Get(_) = _, %v", client, err)
```
The error handling for retries can be simplified and made more robust. Currently, when the final retry attempt fails due to rate limiting, a specific error message is logged (line 150) but its value is discarded. Instead, a more generic error from line 154 is returned, which can be misleading during debugging. It would be better to return the more specific error message that indicates the rate limit was exceeded after all attempts.
Additionally, the format string "%v.Get(_) = _, %v" on line 154 is unconventional and could be simplified to just include the error itself for better readability.
The suggested change addresses these points. Note that applying this suggestion will require updating the assertion in the test `should fail after max retries with continuous rate limiting` to check for the new, more specific error message.
```go
	if attempt < maxRetries {
		logging.Debugf("getPodResources: rate limit hit (attempt %d/%d), retrying after %v: %v",
			attempt+1, maxRetries+1, retryDelay, err)
		time.Sleep(retryDelay)
		// Exponential backoff with cap
		retryDelay *= retryBackoffFactor
		if retryDelay > maxRetryDelay {
			retryDelay = maxRetryDelay
		}
		continue
	}
	return logging.Errorf("getPodResources: rate limit exceeded after %d attempts: %v", maxRetries+1, err)
}
// For non-rate-limit errors, return the error
return logging.Errorf("getPodResources: failed to list pod resources: %v", err)
```
```go
rc.resources = resp.PodResources
return nil

return logging.Errorf("getPodResources: failed to list pod resources, %v.Get(_) = _, %v", client, err)
```
This return statement appears to be unreachable. The for loop above is structured to always exit via a return statement within the loop body if an error occurs, or if the call is successful. The loop will never complete its iterations and fall through to this line. This statement can be safely removed.
| It("should fail after max retries with continuous rate limiting", func() { | ||
| // Create a server that always fails | ||
| alwaysFailServer := &rateLimitResourceServer{ | ||
| server: grpc.NewServer(), | ||
| failCount: 100, // Always fail | ||
| currentCount: 0, | ||
| } | ||
|
|
||
| tempSocketDir, err := os.MkdirTemp("", "kubelet-always-fail-test") | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| defer os.RemoveAll(tempSocketDir) | ||
|
|
||
| testingPodResourcesPath := filepath.Join(tempSocketDir, defaultPodResourcesPath) | ||
| err = os.MkdirAll(testingPodResourcesPath, os.ModeDir) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| alwaysFailSocketName := filepath.Join(testingPodResourcesPath, "kubelet-always-fail.sock") | ||
| alwaysFailSocket := localEndpoint(filepath.Join(testingPodResourcesPath, "kubelet-always-fail")) | ||
|
|
||
| podresourcesapi.RegisterPodResourcesListerServer(alwaysFailServer.server, alwaysFailServer) | ||
| lis, err := CreateListener(alwaysFailSocketName) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| go alwaysFailServer.server.Serve(lis) | ||
| defer alwaysFailServer.server.Stop() | ||
|
|
||
| _, err = getKubeletClient(alwaysFailSocket) | ||
| Expect(err).To(HaveOccurred()) | ||
| Expect(err.Error()).To(ContainSubstring("failed to list pod resources")) | ||
|
|
||
| // Verify that max retries were attempted | ||
| finalCount := atomic.LoadInt32(&alwaysFailServer.currentCount) | ||
| Expect(finalCount).To(Equal(int32(maxRetries + 1))) | ||
| }) |
This test case duplicates a significant amount of server setup logic that is already present in the BeforeEach block for the preceding test context. This repetition makes the tests harder to read and maintain.
Consider refactoring the common server setup into a helper function. This function could accept parameters like failCount and return the necessary components for the test (e.g., the server URL, the server instance, and a cleanup function), which would reduce code duplication and improve the overall structure of the test file.