Bug: application-controller panics with nil pointer dereference in getSyncTasks when target cluster API server is unreachable
Environment
- ArgoCD: v3.3.8 (image quay.io/argoproj/argocd:v3.3.8)
- Kubernetes: v1.34.5 (k3s)
- Topology: ArgoCD runs on cluster A and manages both A (in-cluster) and a remote cluster B. B's Cluster registration uses an external IP (Tailscale tailnet IP). From cluster A, cluster B's API server is reachable from the host network but NOT from the pod network in this environment (pre-existing pod→underlay routing bug, separate from this issue).
- Connectivity from the argocd-application-controller pod to the remote API server times out at ~32s with dial tcp <ip>:6443: i/o timeout.
What happens
When application-controller attempts to sync any Application whose destination resolves to an unreachable cluster (or where API discovery times out for any other reason), the controller logs the discovery timeout and then panics during sync-task generation:
"failed to discover server resources for group version v1: Get \"https://<remote-ip>:6443/api/v1?timeout=32s\": dial tcp <remote-ip>:6443: i/o timeout"
level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 296 [running]:
runtime/debug.Stack()
github.com/argoproj/argo-cd/v3/controller.(*ApplicationController).processRequestedAppOperation.func1()
/go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1431 +0x69
panic({0x47c6ea0?, 0x9740c50?})
github.com/argoproj/argo-cd/v3/controller.(*appStateManager).SyncAppState.func1(0xc00d61a740, 0x0)
/go/src/github.com/argoproj/argo-cd/controller/sync.go:311 +0x1e3
github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).getSyncTasks(0xc012e581a0)
/go/src/github.com/argoproj/argo-cd/gitops-engine/pkg/sync/sync_context.go:907 +0x1115
github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).Sync(0xc019054b60)
/go/src/github.com/argoproj/argo-cd/gitops-engine/pkg/sync/sync_context.go:421 +0x14a
github.com/argoproj/argo-cd/v3/controller.(*appStateManager).SyncAppState(...)
/go/src/github.com/argoproj/argo-cd/controller/sync.go:380 +0x2f31
"
level=info msg="Sync operation to <sha> failed: runtime error: invalid memory address or nil pointer dereference"
The deferred panic-recovery in processRequestedAppOperation.func1 catches it and marks the operation phase: Error, but the panic itself indicates a nil deref earlier in getSyncTasks (gitops-engine/pkg/sync/sync_context.go:907). The trigger we observed is consistent: every panic was preceded by a discovery timeout against the same unreachable API server.
Expected behavior
API discovery failure should be surfaced as a structured sync error (e.g. SyncError: cluster B API server unreachable: dial tcp ... i/o timeout) rather than allowing a nil to flow into getSyncTasks and trigger a panic. The recovery layer keeps the controller alive but the user-facing signal (phase: Error with no message hint about what failed) makes diagnosis hard — operators see "nil pointer dereference" with no indication that the actual cause is "remote cluster unreachable".
Reproduction
- Register a remote cluster in ArgoCD whose API endpoint is reachable from the host but blocked from the pod (e.g. by a firewall or a CNI routing gap).
- Create an Application targeting that cluster.
- Trigger a sync.
- Observe the controller log: discovery timeout followed by Recovered from panic: runtime error: invalid memory address or nil pointer dereference from getSyncTasks.
Workaround we used
We unblocked our environment in two ways:
- Short-term: set hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet on argocd-application-controller. With the host's tailnet interface available, the controller reaches the remote API normally; the panic stops.
- Long-term: fix the underlying pod→underlay routing (Cilium kubeProxyReplacement: true + bpf-host-routing: true retires the routing collision that blocked the original path).
Neither workaround is the right upstream fix — the upstream fix is the nil guard / structured error in gitops-engine/pkg/sync/sync_context.go:907.
Pointers
- Panic origin: gitops-engine/pkg/sync/sync_context.go:907 (call inside getSyncTasks)
- Triggering call site: argo-cd/controller/sync.go:311 (closure in SyncAppState.func1)
- The empty failedJobsHistoryLimit / nil resource list returned from API discovery seems the most likely path to the nil deref, but I haven't traced it to the specific dereference yet.
Happy to provide additional logs / reproduce in CI if useful.