application-controller: nil pointer panic in getSyncTasks when target cluster API discovery times out #27659

@AloMartinP

Bug: application-controller panics with nil pointer dereference in getSyncTasks when target cluster API server is unreachable

Environment

  • Argo CD: v3.3.8 (image quay.io/argoproj/argocd:v3.3.8)
  • Kubernetes: v1.34.5 (k3s)
  • Topology: Argo CD runs on cluster A and manages both A (in-cluster) and a remote cluster B. B is registered by an external IP (a Tailscale tailnet IP). From cluster A, B's API server is reachable from the host network but NOT from the pod network (a pre-existing pod→underlay routing bug, separate from this issue).
  • Connections from the argocd-application-controller pod to the remote API server time out after ~32s with dial tcp <ip>:6443: i/o timeout.

What happens

When application-controller attempts to sync any Application whose destination resolves to an unreachable cluster (or where API discovery times out for any other reason), the controller logs the discovery timeout and then panics during sync-task generation:

"failed to discover server resources for group version v1: Get \"https://<remote-ip>:6443/api/v1?timeout=32s\": dial tcp <remote-ip>:6443: i/o timeout"

level=error msg="Recovered from panic: runtime error: invalid memory address or nil pointer dereference
goroutine 296 [running]:
runtime/debug.Stack()
github.com/argoproj/argo-cd/v3/controller.(*ApplicationController).processRequestedAppOperation.func1()
  /go/src/github.com/argoproj/argo-cd/controller/appcontroller.go:1431 +0x69
panic({0x47c6ea0?, 0x9740c50?})
github.com/argoproj/argo-cd/v3/controller.(*appStateManager).SyncAppState.func1(0xc00d61a740, 0x0)
  /go/src/github.com/argoproj/argo-cd/controller/sync.go:311 +0x1e3
github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).getSyncTasks(0xc012e581a0)
  /go/src/github.com/argoproj/argo-cd/gitops-engine/pkg/sync/sync_context.go:907 +0x1115
github.com/argoproj/gitops-engine/pkg/sync.(*syncContext).Sync(0xc019054b60)
  /go/src/github.com/argoproj/argo-cd/gitops-engine/pkg/sync/sync_context.go:421 +0x14a
github.com/argoproj/argo-cd/v3/controller.(*appStateManager).SyncAppState(...)
  /go/src/github.com/argoproj/argo-cd/controller/sync.go:380 +0x2f31
"

level=info msg="Sync operation to <sha> failed: runtime error: invalid memory address or nil pointer dereference"

The deferred recovery in processRequestedAppOperation.func1 catches the panic and marks the operation phase: Error. The trace shows the nil dereference reached through getSyncTasks: the frame at gitops-engine/pkg/sync/sync_context.go:907 calls the SyncAppState.func1 closure (controller/sync.go:311) with 0x0 as its second argument, and the panic is raised there. The trigger was consistent: every panic we observed was preceded by a discovery timeout against the same unreachable API server.
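To illustrate the mechanism, here is a minimal sketch of how a nil argument reaching a callback ends up as a bare "nil pointer dereference" once a deferred recover catches it. It assumes the 0x0 second argument in the trace is a nil discovery result; apiResource, validator and runSync are hypothetical names for illustration, not the Argo CD or gitops-engine code.

```go
package main

import "fmt"

// apiResource is a hypothetical stand-in for a discovered API resource.
type apiResource struct {
	Namespaced bool
}

// validator mimics a callback that reads its argument without a nil check.
func validator(res *apiResource) error {
	if res.Namespaced { // panics when res is nil
		return nil
	}
	return fmt.Errorf("cluster-scoped resource not permitted")
}

// runSync calls the validator and converts any panic into an error,
// similar in spirit to the deferred recovery seen in the trace.
func runSync(res *apiResource) (err error) {
	defer func() {
		if r := recover(); r != nil {
			// Only the runtime message survives; the discovery timeout
			// that produced the nil is no longer attached to the error.
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return validator(res)
}
```

Calling runSync(nil) returns an error that mentions only the runtime fault, which matches the symptom reported here: the operation fails, but the message says nothing about the unreachable cluster.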

Expected behavior

An API discovery failure should be surfaced as a structured sync error (e.g. SyncError: cluster B API server unreachable: dial tcp ... i/o timeout) rather than letting a nil flow into getSyncTasks and trigger a panic. The recovery layer keeps the controller alive, but the user-facing signal (phase: Error with no hint about what failed) makes diagnosis hard: operators see "nil pointer dereference" with no indication that the actual cause is an unreachable remote cluster.

Reproduction

  1. Register a remote cluster in Argo CD whose API endpoint is reachable from the host but blocked from the pod network (e.g. by a firewall or a CNI routing gap).
  2. Create an Application targeting that cluster.
  3. Trigger a sync.
  4. Observe the controller log: discovery timeout followed by Recovered from panic: runtime error: invalid memory address or nil pointer dereference from getSyncTasks.

Workaround we used

We unblocked our environment in two ways:

  • Short-term: set hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet on argocd-application-controller. With the host's tailnet interface available, the controller reaches the remote API normally; the panic stops.
  • Long-term: fix the underlying pod→underlay routing (Cilium kubeProxyReplacement: true + bpf-host-routing: true retires the routing collision that blocked the original path).

Neither workaround is an upstream fix; that would be a nil guard or a structured error at gitops-engine/pkg/sync/sync_context.go:907.

Pointers

  • Frame that panicked: argo-cd/controller/sync.go:311 (closure SyncAppState.func1, called with 0x0 as its second argument)
  • Caller: gitops-engine/pkg/sync/sync_context.go:907 (call inside getSyncTasks)
  • An empty failedJobsHistoryLimit or a nil resource list returned from API discovery seems the most likely source of the nil, but I haven't traced it to the specific dereference yet.

Happy to provide additional logs / reproduce in CI if useful.
