[WIP-DNM] Add retry logic to kubeclient #1473
bpickard22 wants to merge 1 commit into k8snetworkplumbingwg:master from
Conversation
Add retry logic to kubeletClient to handle resource-exhausted gRPC errors when churning many pods. Multus currently fails on the first rate-limit error, which causes CRI-O to delete and recreate the pod, adding a significant amount of churn. Now we retry with an increasing delay, giving the rate limiter time to reset.

Assisted by: Claude Sonnet 4.5
Signed-off-by: Benjamin Pickard <bpickard@redhat.com>
(cherry picked from commit a02aec23f888c5036579cbc480fdec0373a2ffdf)
Code Review
This pull request introduces valuable retry logic with exponential backoff to the kubelet client, making it more resilient to transient ResourceExhausted gRPC errors. This is a good approach to improve the system's stability under high load. My review includes suggestions to refine the error handling for better clarity and correctness, remove a piece of unreachable code, and refactor the new tests to improve maintainability by reducing code duplication. Overall, the changes are positive and with a few adjustments will be a solid improvement.
```go
if st, ok := status.FromError(err); ok && st.Code() == codes.ResourceExhausted {
	if attempt < maxRetries {
		logging.Debugf("getPodResources: rate limit hit (attempt %d/%d), retrying after %v: %v",
			attempt+1, maxRetries+1, retryDelay, err)
		time.Sleep(retryDelay)

		// Exponential backoff with cap
		retryDelay *= retryBackoffFactor
		if retryDelay > maxRetryDelay {
			retryDelay = maxRetryDelay
		}
		continue
	}
	logging.Errorf("getPodResources: rate limit exceeded after %d attempts", maxRetries+1)
}

// For non-rate-limit errors or final retry attempt, return the error
return logging.Errorf("getPodResources: failed to list pod resources, %v.Get(_) = _, %v", client, err)
```
The error handling for retries can be simplified and made more robust. Currently, when the final retry attempt fails due to rate limiting, a specific error message is logged (line 150) but its value is discarded. Instead, a more generic error from line 154 is returned, which can be misleading during debugging. It would be better to return the more specific error message that indicates the rate limit was exceeded after all attempts.
Additionally, the format string "%v.Get(_) = _, %v" on line 154 is unconventional and could be simplified to just include the error itself for better readability.
The suggested change addresses these points. Note that applying this suggestion will require updating the assertion in the test `should fail after max retries with continuous rate limiting` to check for the new, more specific error message.
```go
	if attempt < maxRetries {
		logging.Debugf("getPodResources: rate limit hit (attempt %d/%d), retrying after %v: %v",
			attempt+1, maxRetries+1, retryDelay, err)
		time.Sleep(retryDelay)
		// Exponential backoff with cap
		retryDelay *= retryBackoffFactor
		if retryDelay > maxRetryDelay {
			retryDelay = maxRetryDelay
		}
		continue
	}
	return logging.Errorf("getPodResources: rate limit exceeded after %d attempts: %v", maxRetries+1, err)
}
// For non-rate-limit errors, return the error
return logging.Errorf("getPodResources: failed to list pod resources: %v", err)
```
```go
rc.resources = resp.PodResources
return nil

return logging.Errorf("getPodResources: failed to list pod resources, %v.Get(_) = _, %v", client, err)
```
This return statement appears to be unreachable. The for loop above is structured to always exit via a return statement within the loop body if an error occurs, or if the call is successful. The loop will never complete its iterations and fall through to this line. This statement can be safely removed.
| It("should fail after max retries with continuous rate limiting", func() { | ||
| // Create a server that always fails | ||
| alwaysFailServer := &rateLimitResourceServer{ | ||
| server: grpc.NewServer(), | ||
| failCount: 100, // Always fail | ||
| currentCount: 0, | ||
| } | ||
|
|
||
| tempSocketDir, err := os.MkdirTemp("", "kubelet-always-fail-test") | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| defer os.RemoveAll(tempSocketDir) | ||
|
|
||
| testingPodResourcesPath := filepath.Join(tempSocketDir, defaultPodResourcesPath) | ||
| err = os.MkdirAll(testingPodResourcesPath, os.ModeDir) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
|
|
||
| alwaysFailSocketName := filepath.Join(testingPodResourcesPath, "kubelet-always-fail.sock") | ||
| alwaysFailSocket := localEndpoint(filepath.Join(testingPodResourcesPath, "kubelet-always-fail")) | ||
|
|
||
| podresourcesapi.RegisterPodResourcesListerServer(alwaysFailServer.server, alwaysFailServer) | ||
| lis, err := CreateListener(alwaysFailSocketName) | ||
| Expect(err).NotTo(HaveOccurred()) | ||
| go alwaysFailServer.server.Serve(lis) | ||
| defer alwaysFailServer.server.Stop() | ||
|
|
||
| _, err = getKubeletClient(alwaysFailSocket) | ||
| Expect(err).To(HaveOccurred()) | ||
| Expect(err.Error()).To(ContainSubstring("failed to list pod resources")) | ||
|
|
||
| // Verify that max retries were attempted | ||
| finalCount := atomic.LoadInt32(&alwaysFailServer.currentCount) | ||
| Expect(finalCount).To(Equal(int32(maxRetries + 1))) | ||
| }) |
This test case duplicates a significant amount of server setup logic that is already present in the BeforeEach block for the preceding test context. This repetition makes the tests harder to read and maintain.
Consider refactoring the common server setup into a helper function. This function could accept parameters like failCount and return the necessary components for the test (e.g., the server URL, the server instance, and a cleanup function), which would reduce code duplication and improve the overall structure of the test file.