Implement exponential backoff by configuring kube-client env vars #188

Open
wants to merge 1 commit into
base: main
from

Conversation

musa-asad
Collaborator

Description of changes:
The CloudWatch Agent overloads the API server because the Kubernetes client has no exponential back-off strategy for requests that time out, as described in cilium/cilium#36525 (comment).

This change configures environment variables that enable exponential back-off for timeout issues.
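
For illustration, a minimal sketch of what such a configuration could look like on the agent container. It assumes the variables in question are client-go's `KUBE_CLIENT_BACKOFF_BASE` and `KUBE_CLIENT_BACKOFF_DURATION`, which the description above does not name. The container name and the values are placeholders, not the ones used in this change.

```yaml
# Hypothetical container spec fragment; names and values are illustrative only.
containers:
  - name: cloudwatch-agent   # assumed container name
    env:
      # Initial back-off delay in seconds (assumed value).
      - name: KUBE_CLIENT_BACKOFF_BASE
        value: "1"
      # Upper bound on the back-off delay in seconds (assumed value).
      - name: KUBE_CLIENT_BACKOFF_DURATION
        value: "120"
```

Both variables would need to be set for the client to pick them up, and whether they cover the timeout case described here is something the review discussion still asks to verify.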

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@musa-asad musa-asad requested review from movence and lisguo March 27, 2025 06:16
@musa-asad musa-asad self-assigned this Mar 27, 2025
@@ -1451,3 +1451,7 @@ neuronMonitor:
- SYS_ADMIN
serviceAccount:
name: # override exporter service account name
k8sClientExponentialBackoff:
Contributor

Defining this at the global level sets an expectation that it will be respected by all k8s clients across every component deployed by this chart, including the operator, agent, fluentbit and anything else, but we aren't doing that.

So making this agent-specific might be better, but then why is this being defined as a Helm override for the agent and not as part of the agent JSON config like everything else?

We need more discussion on where these belong.

Collaborator Author

@musa-asad musa-asad Apr 16, 2025

Yeah, completely agree. We should discuss this.

@lisguo
Contributor

lisguo commented Mar 28, 2025

We should test this to ensure that it actually reduces the number of control plane API calls.

3 participants