High frequency "Failed to watch (...) resyncing (...) due to timeout, retrying in 1s" debug logs and CPU spikes in application-controller #27072
The logs and CPU spikes you are seeing in the `argocd-application-controller` are closely related: every time a watch breaks, the controller has to re-list and re-cache the affected resources, which is CPU-intensive.

**1. Understand the "Watch Timeout"**

The message `Failed to watch (...) resyncing (...) due to timeout, retrying in 1s` means the controller's watch connection to the Kubernetes API server was dropped or timed out, so it falls back to a resync of that resource type before retrying. Occasional entries are normal; a continuous, high-frequency stream points to an underlying connectivity or load problem.

**2. Verify the Controller Sharding**

If you have a large number of applications, a single controller pod might be struggling to maintain the watch state for all resources, leading to the CPU spikes you observed. Consider increasing the number of replicas for the application controller and enabling sharding. Update your Helm values:

```yaml
controller:
  replicas: 2 # Increase based on load
  env:
    - name: ARGOCD_CONTROLLER_REPLICAS
      value: "2"
```
**3. Adjust the Status and Operation Processors**

High CPU during resyncs often points to the controller trying to process a large queue of application updates simultaneously. You can tune the controller's parallelism to smooth out these spikes. Note that these keys live in `argocd-cmd-params-cm`, which the chart exposes as `configs.params` (not `configs.cm`):

```yaml
configs:
  params:
    # Increase workers to process items faster, reducing queue buildup
    controller.status.processors: "50"
    controller.operation.processors: "25"
```

**4. Optimize Watch Settings**

If the environment (vSphere/EKS Anywhere) enforces strict network timeouts, you can force a more frequent but controlled resync so that connections are refreshed deliberately instead of being dropped mid-watch; see the sketch below.

> **Tip:** Check your API server logs, or the load balancer if you use one in front of the K8s API, for signs of idle connections being reset or timed out.
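A minimal sketch of that knob, assuming the chart's `configs.cm` block; `timeout.reconciliation` is the standard `argocd-cm` setting that controls the periodic app refresh interval (the 180s shown is the upstream default, a starting point rather than a recommendation):

```yaml
configs:
  cm:
    # Refresh applications on a fixed cadence even when no change event
    # arrives; shorter values mean fresher state but more API-server load.
    timeout.reconciliation: 180s
```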
**5. Review Resource Limits**

Ensure the controller isn't hitting a CPU throttle limit. When the watch breaks and restarts, Argo CD performs a full list/re-cache of applications, which is CPU-intensive. If the pod is throttled, it takes longer to recover, leading to more timeouts.

```yaml
controller:
  resources:
    limits:
      cpu: "2" # Ensure this is high enough to handle burst resyncs
      memory: "2Gi"
```

How many Applications are currently managed by this Argo CD instance, and do you notice these logs appearing more frequently for specific clusters?
---
Hi all,
I’m trying to understand whether the behavior I’m observing in my Argo CD instance is expected or could indicate a potential issue.
**Context**

I have debug logging enabled in Argo CD, specifically in the `argocd-application-controller`.
**Observed behavior**

I’m seeing a high frequency of log entries like the following:
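```
Failed to watch (...) resyncing (...) due to timeout, retrying in 1s
```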
These messages appear continuously and at a high rate.
**Additional observations**

- When correlating with metrics, I can see CPU spikes in the `argocd-application-controller` pod.
- The timing of these CPU spikes seems to align with the repeated watch failures and retries.
**Questions**

- Is this behavior expected when running with debug logging enabled?
- Could this indicate an issue with the Kubernetes API server watch mechanism (e.g., timeouts or connectivity)?
- Are there any recommended configurations to reduce the frequency of these retries or mitigate CPU impact?
- Has anyone experienced similar behavior in production environments?
**Environment**

- Argo CD version: v3.3.4
- Kubernetes version: v1.33.1
- Deployment method: Helm (chart version: 9.4.11)
- Cluster type: EKS Anywhere on vSphere