High frequency "Failed to watch (...) resyncing (...) due to timeout, retrying in 1s" debug logs and CPU spikes in application-controller #27072
The logs and CPU spikes you are seeing in the `argocd-application-controller` are closely related: every time a watch breaks, the controller has to re-list and re-cache the affected resources, which is CPU-intensive.

**1. Understand the "Watch Timeout"**

The message `Failed to watch (...) resyncing (...) due to timeout, retrying in 1s` means the controller's watch connection to the Kubernetes API server was dropped or timed out, so it falls back to a resync of that resource type before retrying. Occasional entries are normal; a continuous, high-frequency stream points to an underlying connectivity or load problem.

**2. Verify the Controller Sharding**

If you have a large number of applications, a single controller pod might be struggling to maintain the watch state for all resources, leading to the CPU spikes you observed. Consider increasing the number of replicas for the application controller and enabling sharding. Update your Helm values:

```yaml
controller:
  replicas: 2 # Increase based on load
  env:
    - name: ARGOCD_CONTROLLER_REPLICAS
      value: "2"
```
**3. Adjust the Status and Operation Processors**

High CPU during resyncs often points to the controller trying to process a large queue of application updates simultaneously. You can tune the controller's parallelism to smooth out these spikes. Note that these keys live in `argocd-cmd-params-cm`, which the chart exposes as `configs.params` (not `configs.cm`):

```yaml
configs:
  params:
    # Increase workers to process items faster, reducing queue buildup
    controller.status.processors: "50"
    controller.operation.processors: "25"
```

**4. Optimize Watch Settings**

If the environment (vSphere/EKS Anywhere) enforces strict network timeouts, you can force a more frequent but controlled resync so that connections are refreshed deliberately instead of being dropped mid-watch; see the sketch below.

> **Tip:** Check your API server logs, or the load balancer if you use one in front of the K8s API, for signs of idle connections being reset or timed out.
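A minimal sketch of that knob, assuming the chart's `configs.cm` block; `timeout.reconciliation` is the standard `argocd-cm` setting that controls the periodic app refresh interval (the 180s shown is the upstream default, a starting point rather than a recommendation):

```yaml
configs:
  cm:
    # Refresh applications on a fixed cadence even when no change event
    # arrives; shorter values mean fresher state but more API-server load.
    timeout.reconciliation: 180s
```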
**5. Review Resource Limits**

Ensure the controller isn't hitting a CPU throttle limit. When the watch breaks and restarts, Argo CD performs a full list/re-cache of applications, which is CPU-intensive. If the pod is throttled, it takes longer to recover, leading to more timeouts.

```yaml
controller:
  resources:
    limits:
      cpu: "2" # Ensure this is high enough to handle burst resyncs
      memory: "2Gi"
```

How many Applications are currently managed by this Argo CD instance, and do you notice these logs appearing more frequently for specific clusters?
---
Hi all,
I’m trying to understand whether the behavior I’m observing in my Argo CD instance is expected or could indicate a potential issue.
**Context**

I have debug logging enabled in Argo CD, specifically in the `argocd-application-controller`.
**Observed behavior**

I’m seeing a high frequency of log entries like the following:
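```
Failed to watch (...) resyncing (...) due to timeout, retrying in 1s
```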
These messages appear continuously and at a high rate.
**Additional observations**

- When correlating with metrics, I can see CPU spikes in the `argocd-application-controller` pod.
- The timing of these CPU spikes seems to align with the repeated watch failures and retries.
**Questions**

- Is this behavior expected when running with debug logging enabled?
- Could this indicate an issue with the Kubernetes API server watch mechanism (e.g., timeouts or connectivity)?
- Are there any recommended configurations to reduce the frequency of these retries or mitigate CPU impact?
- Has anyone experienced similar behavior in production environments?
**Environment**

- Argo CD version: v3.3.4
- Kubernetes version: v1.33.1
- Deployment method: Helm (chart version: 9.4.11)
- Cluster type: EKS Anywhere on vSphere