Detect offline clusters by weyfonk · Pull Request #2933 · rancher/fleet

weyfonk · 2024-10-07T10:15:40Z

This adds a cluster status monitor to the Fleet controller, which checks when each cluster last saw its agent online. If more than the expected interval elapses, that cluster is considered offline, and the monitor updates its bundle deployments' statuses to reflect that. This will trigger status updates to bundles, GitRepos, clusters and cluster groups.

Refers to #594.

Open points:

How far, and how fine-grained, do we want to make bundle deployment status updates for offline clusters? This currently takes a fairly basic approach, updating both the `Ready` and `Monitored` conditions while clearing modified and non-ready statuses. This prevents outdated messages from appearing in a bundle deployment's display status and further up the chain of status updates (to bundles, then GitRepos, etc.).
Should we make the frequency of monitoring configurable?
Do we have a way to exclude the local/management cluster from agent-last-seen checks?

Update 2026-03

The cluster controller checks for offline bundle deployments, to mark the corresponding cluster as offline. It also supports the opposite, namely marking a cluster back online when its bundle deployments are detected as online.
This is covered by integration tests, which do not involve the cluster monitor, while end-to-end tests validate this behaviour when the cluster monitor is involved.

0xavi0

I don't know if you want to merge this PR while the description still lists open questions; it may be better to clarify them first.

Just a couple of observations, but overall LGTM

weyfonk · 2024-10-25T09:10:39Z

This needs more discussion around UI/UX.

This allows the Fleet controller to detect offline clusters and update statuses of bundle deployments targeting offline clusters. Next to do:
* understand how bundle deployment status updates should be propagated to bundles (which currently simply appear as `Modified`) and further up
* write tests (e.g. integration tests, updating cluster status by hand?)
* set sensible defaults (e.g. monitoring interval higher, and threshold higher than agent's cluster status reporting interval)

This enables that state to be reflected upwards in bundle, GitRepo, cluster and cluster group statuses.

This also provides unit tests for detecting offline clusters.

The frequency of cluster status checks is currently hard-coded to 1 minute, but could be made configurable. The threshold for considering a cluster offline now explicitly depends on how often agents report their statuses to the management cluster. Changes to that configured interval should impact the cluster status monitor, which would take the new value into account from its next run onwards.

This fixes an error and ignores a few others to make the linter happy.

This adds checks ensuring that for offline clusters, for which calls to update bundle deployment statuses are expected, those statuses contain `Ready` and `Monitored` conditions with status `False` and reasons reflecting the cluster's offline status.

This ensures that creating a new agent bundle fails with an agent check-in interval set to 0.

This adds a check on the agent check-in interval to cluster import, for consistency with agent bundle updates.
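The zero-interval check described in the two notes above could look roughly like the following. This is a minimal sketch, not the PR's code: the names `validateCheckinInterval` and `errZeroCheckinInterval` are hypothetical, and only the zero value is rejected because that is the only case the description mentions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errZeroCheckinInterval is a hypothetical error value, used for illustration only.
var errZeroCheckinInterval = errors.New("agent check-in interval must not be 0")

// validateCheckinInterval rejects a check-in interval of 0, which is also the
// zero value of time.Duration and therefore what an omitted setting yields.
func validateCheckinInterval(interval time.Duration) error {
	if interval == 0 {
		return errZeroCheckinInterval
	}
	return nil
}

func main() {
	fmt.Println(validateCheckinInterval(0))                // agent check-in interval must not be 0
	fmt.Println(validateCheckinInterval(15 * time.Minute)) // <nil>
}
```

Sharing one such check between agent bundle creation and cluster import would keep both code paths consistent, which is the stated goal of the second note.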

This enables users to configure how often the Fleet controller checks for offline clusters, and the threshold beyond which a cluster is considered offline. If the configured threshold is lower than three times the agent check-in interval, three times that interval is used instead.
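As an illustration of the threshold rule, here is a minimal sketch assuming plain `time.Duration` and `time.Time` values. The function names `effectiveThreshold` and `isOffline` are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// thresholdFactor reflects the rule described above: the threshold never
// drops below three agent check-in intervals.
const thresholdFactor = 3

// effectiveThreshold returns the configured threshold, unless it is lower than
// three times the check-in interval, in which case that tripled value is used.
func effectiveThreshold(configured, checkinInterval time.Duration) time.Duration {
	if minimum := thresholdFactor * checkinInterval; configured < minimum {
		return minimum
	}
	return configured
}

// isOffline reports whether the agent was last seen longer ago than the threshold.
func isOffline(lastSeen, now time.Time, threshold time.Duration) bool {
	return now.Sub(lastSeen) > threshold
}

func main() {
	checkin := 15 * time.Minute
	threshold := effectiveThreshold(10*time.Minute, checkin)
	now := time.Now()

	fmt.Println(threshold)                                        // 45m0s
	fmt.Println(isOffline(now.Add(-time.Hour), now, threshold))   // true
	fmt.Println(isOffline(now.Add(-time.Minute), now, threshold)) // false
}
```

Tying the lower bound to the check-in interval means a cluster is not flagged offline merely because one or two agent check-ins were delayed.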

This optimises updates to bundle deployments, running them only against clusters whose bundle deployments are not yet marked as offline.

This better reflects what is then known about workloads running in such clusters than `False`.

This should fix Fleet controller deployments complaining about the interval being 0 when it should never be.

Running one cluster status monitor per Fleet controller pod is not necessary and may cause conflicts in sharded setups.

Omitting the agent check-in interval when patching the `fleet-controller` config map would now lead to errors when setting up agents, as the check-in interval would bear the default value for a duration, i.e. 0s. That interval is now set to a hard-coded value, which is of no importance for such tests beyond not being zero.

The cluster monitor is disabled by default, but may be enabled by setting Helm value `clusterMonitor.enabled` to `true`.

This may help troubleshoot unexpected threshold values, for instance caused by high agent check-in intervals.

While the cluster monitor only updates bundle deployment statuses, the cluster reconciler handles updates to cluster statuses, to prevent concurrent status updates.

Similarly to how a bundle should not be reconciled based on its own status updates, neither should a cluster.

Updates to bundle deployment statuses should be more lightweight and less invasive. This commit also adds WIP end-to-end tests to validate bundle deployment status updates for offline clusters, which the cluster monitor handles, followed by cluster status updates, triggered by the cluster reconciler.

New end-to-end tests take a cluster offline, check that the cluster itself and its bundle deployments are recognised as such, bring the cluster back online, and run those checks again. Actual clusters coming back online may not translate into `fleet-agent` deployments being scaled back up, as is the case in this simulation.

The cluster controller can now detect that a previously offline cluster has online bundle deployments, and marks the cluster as online accordingly.

Following a rebase against the latest state of `main`, a few errors had popped up.

Taking advantage of k3d, multi-cluster end-to-end tests are able to simulate disconnection of a cluster by disconnecting the cluster node container from its Docker network, and reconnecting it. Disconnecting the server node is enough, as this blocks outbound traffic, in this case the agent's check-in requests to the controller. Disconnecting the load balancer node, to block inbound traffic, is not necessary. This should represent a more realistic use case than scaling Fleet agent deployments.

A couple of comments had been made obsolete by the previous commit.

When a bundle deployment is already marked offline, updating it again to mark it offline would be wasteful. However, other bundle deployments may still require that update, especially those which may have been created since the offline bundle deployment update logic was last run.
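The per-bundle-deployment filtering described here might be sketched as follows. The `bundleDeployment` type is a hypothetical stand-in reduced to a name and an offline flag; Fleet's actual `BundleDeployment` resource carries its state in status conditions instead.

```go
package main

import "fmt"

// bundleDeployment is a simplified, hypothetical stand-in for a Fleet bundle
// deployment, keeping only what this sketch needs.
type bundleDeployment struct {
	Name    string
	Offline bool // true once the bundle deployment has been marked offline
}

// needingOfflineUpdate returns the bundle deployments which are not yet
// marked offline, skipping those for which an update would be wasteful.
func needingOfflineUpdate(bds []bundleDeployment) []bundleDeployment {
	var pending []bundleDeployment
	for _, bd := range bds {
		if !bd.Offline {
			pending = append(pending, bd)
		}
	}
	return pending
}

func main() {
	bds := []bundleDeployment{
		{Name: "bd-old", Offline: true}, // already updated in a previous run
		{Name: "bd-new"},                // created since the previous run
	}
	for _, bd := range needingOfflineUpdate(bds) {
		fmt.Println(bd.Name) // bd-new
	}
}
```

Filtering per bundle deployment rather than per cluster is what lets newly created bundle deployments still receive the update on an already offline cluster.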

A cluster should be marked offline if _any_ of its bundle deployments is offline, and back online when all of them are back online. This prevents flapping status updates in case not all bundle deployments have been updated by the cluster monitor.
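The any/all rule can be expressed compactly. The sketch below assumes each bundle deployment has already been reduced to a boolean offline flag; the name `clusterOffline` is hypothetical, and treating a cluster without bundle deployments as online is an assumption of this sketch rather than something the PR states.

```go
package main

import "fmt"

// clusterOffline applies the rule described above: a cluster is offline as
// soon as any of its bundle deployments is offline, and online only once all
// of them are online. An empty input is treated as online (an assumption).
func clusterOffline(bundleDeploymentsOffline []bool) bool {
	for _, offline := range bundleDeploymentsOffline {
		if offline {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(clusterOffline([]bool{false, true, false})) // true
	fmt.Println(clusterOffline([]bool{false, false}))       // false
	fmt.Println(clusterOffline(nil))                        // false
}
```

Because a single offline bundle deployment is enough, the cluster status stays stable while the cluster monitor is still working through the remaining bundle deployments.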

weyfonk requested a review from a team as a code owner October 7, 2024 10:15

0xavi0 reviewed Oct 8, 2024


Comment thread internal/cmd/controller/clustermonitor/monitor.go Outdated

Comment thread internal/cmd/controller/clustermonitor/monitor_test.go Outdated

weyfonk force-pushed the 594-detect-offline-cluster branch 2 times, most recently from 38a697b to 452e56e Compare October 8, 2024 09:51

kkaempf added the kind/enhancement label Oct 8, 2024

kkaempf modified the milestones: v2.10.0, v2.9.4 Oct 8, 2024

kkaempf added kind/bug and removed kind/enhancement labels Oct 8, 2024

weyfonk force-pushed the 594-detect-offline-cluster branch 3 times, most recently from a1c1b94 to 8a32810 Compare October 18, 2024 14:51

weyfonk marked this pull request as draft October 18, 2024 16:06

weyfonk force-pushed the 594-detect-offline-cluster branch 3 times, most recently from 235303e to 8f07ba4 Compare October 22, 2024 09:21

weyfonk marked this pull request as ready for review October 22, 2024 09:40

manno mentioned this pull request Oct 23, 2024

Fleet clusters not reflecting disconnection under Continuous delivery rancher/dashboard#14966

Open

weyfonk marked this pull request as draft October 25, 2024 09:10

weyfonk modified the milestones: v2.9.4, v2.10.0 Oct 25, 2024

weyfonk modified the milestones: v2.10.0, v2.11.0 Oct 30, 2024

weyfonk force-pushed the 594-detect-offline-cluster branch from 8f07ba4 to 9f8db1c Compare December 18, 2024 17:08

weyfonk marked this pull request as ready for review December 18, 2024 17:22

manno removed this from the v2.11.0 milestone Jan 15, 2025

thardeck marked this pull request as draft August 7, 2025 09:01

weyfonk force-pushed the 594-detect-offline-cluster branch from 9f8db1c to e6ed107 Compare January 21, 2026 15:57

weyfonk added 28 commits May 8, 2026 16:28

Reflect offline cluster state in more bundle deployment status fields

9e364db

This enables that state to be reflected upwards in bundle, GitRepo, cluster and cluster group statuses.

Move cluster status monitor to separate package

3599997

This also provides unit tests for detecting offline clusters.

Eliminate linting errors

38ee90a

This fixes an error and ignores a few others to make the linter happy.

Prevent agent check-in interval from being 0

e1f3998

This ensures that creating a new agent bundle fails with an agent check-in interval set to 0.

Prevent check-in interval from being 0 when importing cluster

1224333

This adds a check on the agent check-in interval to cluster import, for consistency with agent bundle updates.

Skip bundle deployment updates for already offline clusters

1ad8b3b

This optimises updates to bundle deployments, running them only against clusters whose bundle deployments are not yet marked as offline.

Set Ready condition to Unknown for offline clusters

0e53496

This better reflects what is then known about workloads running in such clusters than `False`.

Fix json attribute for cluster monitor interval

6fbe68e

This should fix Fleet controller deployments complaining about the interval being 0 when it should never be.

Run cluster status monitor on unsharded controller only

6c83845

Running one cluster status monitor per Fleet controller pod is not necessary and may cause conflicts in sharded setups.

Fix linting errors

5a2c6a3

Use feature flag for enabling cluster monitor

3784896

The cluster monitor is disabled by default, but may be enabled by setting Helm value `clusterMonitor.enabled` to `true`.

Add threshold to cluster status check log message

5299b88

This may help troubleshoot unexpected threshold values, for instance caused by high agent check-in intervals.

Mark Cluster as Offline when its bundle deployments are offline

4d523c2

While the cluster monitor only updates bundle deployment statuses, the cluster reconciler handles updates to cluster statuses, to prevent concurrent status updates.

Prevent cluster updates triggered by cluster status updates

ba25110

Similarly to how a bundle should not be reconciled based on its own status updates, neither should a cluster.

Detect online clusters from online bundle deployments

de3ad8e

The cluster controller can now detect that a previously offline cluster has online bundle deployments, and marks the cluster as online accordingly.

Make linters happy

94a188f

Following a rebase against the latest state of `main`, a few errors had popped up.

Address Copilot's comments

57ca339

Make linters happy

c419d10

A couple of comments had been made obsolete by the previous commit.

weyfonk force-pushed the 594-detect-offline-cluster branch from cae5a8d to 4a32093 Compare May 8, 2026 15:22

weyfonk requested a review from 0xavi0 May 8, 2026 15:22


Detect offline clusters #2933
weyfonk wants to merge 28 commits into rancher:main from weyfonk:594-detect-offline-cluster


6 participants
