Is there an existing issue for this?
Current Behavior
After updating Rancher to 2.14.0 and upgrading the underlying clusters (upstream and downstream) from RKE2 1.34.3 to 1.35.3, the fleet-agent and fleet-controller frequently fail to sync their caches for BundleDeployment.
The fleet bundles and BundleDeployment resources get stuck in a WaitApplied(1) state. The logs show that the controller fails to receive the required bookmark event marking the end of the initial list stream (due to the new WatchListClient streaming behavior in client-go v0.35). The cache sync eventually times out, causing the manager to completely shut down.
Expected Behavior
Fleet successfully syncs its informer caches using the Kubernetes API, continuously watches the BundleDeployment resources, and transitions bundles from WaitApplied to Active without timing out or crashing.
Steps To Reproduce
- Update Rancher server to v2.14.0.
- Upgrade the upstream and downstream clusters to RKE2 v1.35.3+rke2r3.
- Have active Fleet bundles deployed to the downstream or local clusters.
- Check the state of the deployed bundles in the Rancher UI or via kubectl get bundles -A.
- Observe the WaitApplied(1) state and check the fleet-agent (downstream) or fleet-controller (upstream) logs for bookmark expiration and cache sync timeout errors.
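To make the last step easier to repeat, the log check can be sketched as a small script. This is a minimal sketch, not part of Fleet: the names `SYMPTOMS`, `classify`, and `summarize` are hypothetical, and the message fragments are copied from the log excerpt in this report. It falls back to plain substring matching when a line is truncated or is not valid JSON.

```python
import json

# Message fragments copied from the log excerpt in this report.
SYMPTOMS = {
    "bookmark_expired": "event bookmark expired",
    "cache_sync_timeout": "Could not wait for Cache to sync",
    "agent_start_failed": "failed to start agent",
}


def classify(line: str) -> list[str]:
    """Return the symptom labels whose fragment appears in one log line."""
    try:
        # Prefer the structured "msg" field when the line is valid JSON.
        text = str(json.loads(line).get("msg", ""))
    except (json.JSONDecodeError, AttributeError):
        # Truncated or non-JSON lines: fall back to the raw text.
        text = line
    return [label for label, fragment in SYMPTOMS.items() if fragment in text]


def summarize(lines) -> dict[str, int]:
    """Count how often each symptom occurs across an iterable of log lines."""
    counts = {label: 0 for label in SYMPTOMS}
    for line in lines:
        for label in classify(line):
            counts[label] += 1
    return counts
```

`summarize` accepts any iterable of strings, so the saved output of the fleet-agent or fleet-controller logs can be passed in line by line. A non-zero count for all three labels would match the failure sequence described in this report.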
Environment
- Architecture: [e.g., amd64/x86_64]
- Fleet Version: Bundled with Rancher 2.14.0
- Cluster:
  - Provider: RKE2 (Local and Downstream)
  - Options: Helm Chart RKE2 install, CNI: rke2-cilium-1.19.101
  - Kubernetes Version: v1.35.3+rke2r3
Logs
{"level":"info","ts":"2026-04-28T16:53:50Z","logger":"controller-runtime.cache","msg":"Warning: event bookmark expired","err":"pkg/mod/k8s.io/client-go@v0.35.2/tools/cache/reflector.go:289: hasn't received required bookmark event marking the en...
{"level":"error","ts":"2026-04-28T16:55:30Z","msg":"Could not wait for Cache to sync","controller":"bundledeployment","controllerGroup":"fleet.cattle.io","controllerKind":"BundleDeployment","source":"kind source: *v1alpha1.BundleDeployment","er...
{"level":"info","ts":"2026-04-28T16:55:30Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"error","ts":"2026-04-28T16:55:30Z","logger":"setup","msg":"failed to start agent","error":"failed to wait for bundledeployment caches to sync kind source: *v1alpha1.BundleDeployment: timed out waiting for cache to be synced for Kind *...
{"level":"error","ts":"2026-04-28T16:55:30Z","logger":"setup","msg":"failed to start agent","error":"failed to wait for bundledeployment caches to sync kind source: *v1alpha1.BundleDeployment: timed out waiting for cache to be synced for Kind *...
{"level":"error","ts":"2026-04-28T16:55:30Z","logger":"setup","msg":"failed to start agent","error":"failed to wait for bundledeployment caches to sync kind source: *v1alpha1.BundleDeployment: timed out waiting for cache to be synced for Kind *...
Anything else?
Workaround Found:
The issue can be completely mitigated by disabling the WatchListClient feature gate in the client-go library for the Fleet pods. Setting the following environment variable forces the client to fall back to the standard List+Watch mechanism:
env:
  - name: KUBE_FEATURE_WatchListClient
    value: "false"
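For anyone scripting the workaround, the snippet above could be applied to a workload manifest roughly as follows. This is only a sketch under assumptions: `with_watchlist_disabled` is a hypothetical helper, the manifest is assumed to be a Deployment-shaped dictionary already loaded from YAML or JSON, and how the change should be persisted for Rancher-managed Fleet workloads is not covered here.

```python
import copy

# Variable name taken from the workaround described above.
ENV_NAME = "KUBE_FEATURE_WatchListClient"


def with_watchlist_disabled(workload: dict) -> dict:
    """Return a copy of a Deployment-like manifest with the variable set
    to "false" on every container. The input is left unmodified."""
    patched = copy.deepcopy(workload)
    containers = patched["spec"]["template"]["spec"]["containers"]
    for container in containers:
        env = container.setdefault("env", [])
        # Drop any existing entry so the variable is not defined twice.
        env[:] = [entry for entry in env if entry.get("name") != ENV_NAME]
        env.append({"name": ENV_NAME, "value": "false"})
    return patched
```

The value is the string "false" rather than a boolean because Kubernetes environment variable values are strings. The helper returns a patched copy so the original manifest can be kept for comparison or rollback.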
Related Upstream/Tracking Issues:
This appears to be heavily tied to upstream API server streaming behaviors introduced in Kubernetes 1.35: