[Bug] fleet-agent and controller fail to sync BundleDeployment cache due to "event bookmark expired" / WatchListClient (WaitApplied(1)) #5059

@francesco-z

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

After updating Rancher to 2.14.0 and upgrading the underlying clusters (upstream and downstreams) from RKE2 1.34.3 to 1.35.3, the fleet-agent and fleet-controller frequently fail to sync their caches for BundleDeployment.

The Fleet bundles and BundleDeployment resources get stuck in a WaitApplied(1) state. The logs show that the controller never receives the required bookmark event marking the end of the initial list stream, which appears to be caused by the new WatchListClient streaming behavior in client-go v0.35. The cache sync then times out and the manager shuts down.

Expected Behavior

Fleet successfully syncs its informer caches using the Kubernetes API, continuously watches the BundleDeployment resources, and transitions bundles from WaitApplied to Active without timing out or crashing.

Steps To Reproduce

  1. Update Rancher server to v2.14.0.
  2. Upgrade the upstream and downstream clusters to RKE2 v1.35.3+rke2r3.
  3. Have active Fleet bundles deployed to the downstream or local clusters.
  4. Check the state of the deployed bundles in the Rancher UI or via kubectl get bundles -A.
  5. Observe the WaitApplied(1) state and check the fleet-agent (downstream) or fleet-controller (upstream) logs for bookmark expiration and cache sync timeout errors.

Environment

- Architecture: [e.g., amd64/x86_64]
- Fleet Version: Bundled with Rancher 2.14.0
- Cluster:
  - Provider: RKE2 (Local and Downstream)
  - Options: Helm Chart RKE2 install, CNI: rke2-cilium-1.19.101
  - Kubernetes Version: v1.35.3+rke2r3

Logs

{"level":"info","ts":"2026-04-28T16:53:50Z","logger":"controller-runtime.cache","msg":"Warning: event bookmark expired","err":"pkg/mod/k8s.io/client-go@v0.35.2/tools/cache/reflector.go:289: hasn't received required bookmark event marking the en...
{"level":"error","ts":"2026-04-28T16:55:30Z","msg":"Could not wait for Cache to sync","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","source":"kind source: *v1alpha1.BundleDeployment","er...
{"level":"info","ts":"2026-04-28T16:55:30Z","msg":"Stopping and waiting for non leader election runnables"} 
{"level":"error","ts":"2026-04-28T16:55:30Z","logger":"setup","msg":"failed to start agent","error":"failed to wait for bundledeployment caches to sync kind source: *v1alpha1.BundleDeployment: timed out waiting for cache to be synced for Kind *...
{"level":"error","ts":"2026-04-28T16:55:30Z","logger":"setup","msg":"failed to start agent","error":"failed to wait for bundledeployment caches to sync kind source: *v1alpha1.BundleDeployment: timed out waiting for cache to be synced for Kind *...
{"level":"error","ts":"2026-04-28T16:55:30Z","logger":"setup","msg":"failed to start agent","error":"failed to wait for bundledeployment caches to sync kind source: *v1alpha1.BundleDeployment: timed out waiting for cache to be synced for Kind *...

Anything else?

Workaround Found:
The issue can be completely mitigated by disabling the WatchListClient feature gate in the client-go library for the Fleet pods. Setting the following environment variable forces the client to fall back to the standard List+Watch mechanism:

env:
  - name: KUBE_FEATURE_WatchListClient
    value: "false"

Related Upstream/Tracking Issues:
This appears to be heavily tied to upstream API server streaming behaviors introduced in Kubernetes 1.35:
