Federation and shovel: better network partition handling by mkuratczyk · Pull Request #16662 · rabbitmq/rabbitmq-server

mkuratczyk · 2026-06-11T13:47:29Z

This PR resolves a number of issues in shovel and federation in the context of network partition handling.
Issues I could observe, that I can no longer reproduce on this branch include:

ghost federation links: links that just won't go away, usually in the "starting" status (caused by stale ETS entries if the link terminated abruptly)
federation links and shovels not running (for different reasons)
federation linux and shovels still running on the minority side (they should step down and allow the majority side to take over)
federation links taking longer than necessary to step down on the minority side
Management UI returning 500 when reporting federation or shovel status

The federation status server keeps an ETS entry per link, reported by the link supervisor. When a link went away without removing itself - for example because its supervisor crashed - the entry was never deleted, so the link was left behind in the status table (typically stuck in a "starting" state) and kept being reported by the status API. Monitor each supervisor that reports a status, once per supervisor, and on its 'DOWN' drop every status entry it owns. Conversely, demonitor a supervisor once its last entry has been removed explicitly, so the monitor table does not grow without bound.

The top-level federation link supervisors (`x_links` and `q_links`) were registered as `transient` children of their respective federation supervisors. A `transient` child is only restarted after an abnormal exit, so whenever one of these supervisors stopped with reason `normal` or `shutdown` - for example when a peer asked it to die as part of the mirrored supervisor failover protocol - it was not restarted. The node was then left with no federation links until the next full restart. Mark both supervisors as `permanent` so they are always restarted regardless of the exit reason.

When a mirrored supervisor process goes down, the "first" surviving member of the group takes over its children in the `DOWN` handler. If that takeover fails - for example because Khepri is temporarily unavailable or electing a new leader - the child specs are left in Khepri still owned by the dead process. The restarted supervisor only boots its static `initial_childspecs`, so dynamic children such as federation links stay orphaned and are never restarted, sometimes leaving a link stuck without a running process. Add a periodic reconciliation loop. On the sorted-first active member of each group it scans Khepri for child specs whose owner is no longer a group member, takes over their ownership and restarts them locally. It also stops any local child the store says is owned by another node, so that exactly one node runs each child. The loop runs on every node and must tolerate Khepri being briefly unavailable - precisely the condition it recovers from - so the whole body is wrapped in a `try/catch`: any error is logged and retried on the next tick rather than crashing the supervisor.

During a network partition both sides of a mirrored supervisor group kept their children running, producing duplicates - for example two copies of the same federation link draining the same upstream. The majority side should own the children; the minority side should give them up. Extend the reconciliation loop to detect when the local node is in a minority partition, that is when fewer than a strict majority of the cluster members are reachable, and stop its local children in that case. The majority side then reclaims them through the orphan reconciliation. Clusters with fewer than three members never stop children this way, since they cannot establish a majority, and an even split stops both sides until the partition heals.

…e-node start On initialisation, when a mirroring process found no peers in its process group, it deleted every child record for the group from Khepri. Because Khepri is cluster-wide, this removed the records backing children running on all other nodes, not just stale local ones. An empty group is not a reliable signal that this node is alone: a peer's mirroring process may simply not have joined the group yet, or this node's supervisor may be restarting while the rest of the cluster keeps running its children. In particular, when a supervisor restarts after reaching its maximum restart intensity (for example following Khepri timeouts during a partition), it re-initialises with an empty group for a moment and wipes every node's records. The children are then gone from the store, so orphan reconciliation has nothing left to restart and federation links stop running cluster-wide. Only delete the records when this node is the sole reachable cluster member, i.e. a genuine single-node (re)start. Otherwise leave any genuinely orphaned records to the reconciliation loop, which restarts them on the majority side.

When a peer mirroring process goes down, the first member of the group takes over its children, which requires writing to Khepri. If that write failed or timed out, the handler stopped the mirroring process with `{shutdown, [timeout]}`. This is most likely precisely when a peer goes down: losing a node can trigger a Khepri leader election during which writes time out. The mirroring process then restarts, fails again, and quickly reaches its maximum restart intensity, tearing down the whole link supervisor for the group. Wrap the failover in a `try` and treat update or child-start failures as transient: log them and keep running rather than stopping. The periodic reconciliation loop takes over the orphaned children once Khepri is reachable again. The successful failover path is unchanged, so children still migrate immediately in the common case.

Clean up the reconciliation code added for orphan reclamation and minority handling: - Drop the `is_list/1` guard and the dead `_ -> ok` clause on `supervisor2:which_children/1`, whose return is always a list. - Use `lists:foreach/2` instead of side-effecting list comprehensions whose list result was discarded, so the functions return `ok` and no longer trigger `unmatched_returns`. No behaviour change.

The reconciliation loop reclaimed children owned by dead or unreachable nodes, but never handled a child a node owns in the store yet is not actually running locally. A node ends up in this state after leaving a minority partition: while in the minority it stops its children but, having no quorum, cannot give up their ownership in the store. If no majority node reclaims them before the partition heals, the owner is alive again, so the records are no longer orphans and nothing ever restarts the children. The result is records owned by a live node with no running child anywhere - observed with Shovel after a partition healed: every shovel was owned by the formerly-partitioned node but none were running. Add a reconciliation step, run on every node, that starts any child this node owns in the store but is not currently running locally. This complements the existing orphan reclamation (dead owners) and conflict resolution (children owned by another node) to keep the store and the running children consistent.

The dynamic Shovel supervisor (rabbit_shovel_dyn_worker_sup_sup, a mirrored supervisor) was a transient child of rabbit_shovel_sup. As with the federation link supervisors, a transient child is not restarted after a normal or shutdown exit, so if it stopped that way - for example after reaching its restart intensity during Khepri timeouts in a partition, or because a peer asked it to die - it would not come back and the node would run no shovels until a full restart. Its restart intensity is low ({one_for_one, 3, 10}), making this more likely than for federation. Mark it permanent so it is always restarted, mirroring the federation fix.

Failover is event-driven: when a peer goes down, the first surviving member takes over its children immediately. Minority-stop and conflict resolution, however, only ran on the 30s periodic reconciliation. So after a partition healed, a formerly-isolated node kept running (and reporting) its children until its next periodic tick noticed they were now owned by another node and stopped them - a window of up to 30s during which a federation link or shovel ran on two nodes at once. For shovels in particular this means duplicate message movement, not just duplicate reporting. Monitor the process group with pg:monitor/1 and, on a join (partition heal) or leave (partition), schedule a single debounced reconciliation. A short debounce coalesces bursts of membership events, and the membership-triggered reconciliation does not disturb the periodic timer, which remains as a backstop. As a result the minority side sheds its children shortly after losing quorum, and the rejoining side drops any duplicate shortly after healing, shrinking the duplicate window from up to the periodic interval to a couple of seconds. This lives in the shared mirrored_supervisor, so federation and shovel both benefit.

When a peer's mirroring process goes down, the first surviving member takes over its children via update_all/2, which writes to Khepri. On the minority side of a partition there is no quorum, so that write blocks the mirroring gen_server until it times out (~30s). While blocked, the process handles no other message - including the reconciliation that is meant to stop this node's children because it is in the minority. So minority-stop was delayed by the full Khepri timeout, and if the partition healed sooner it never ran at all, leaving the isolated node running duplicates of children the majority had already taken over (observed: a partitioned node kept all its shovels and federation links, and they stayed on it after the partition healed). Check for a minority partition before attempting the takeover and skip it when in the minority: such a node must not take over children (the majority owns them) and cannot (no quorum). This frees the gen_server to process the membership-triggered reconciliation promptly, so the minority side sheds its children within the debounce window instead of after a 30s stall or not at all. The minority test is factored into partition_status/0 and shared with the reconciliation loop. It relies only on Erlang distribution and locally-known membership, both of which were confirmed to stay responsive and accurate on a partitioned node (unlike Khepri writes).

Fix a 500 error from the Management UI during a network partition: ``` crasher: initial call: cowboy_stream_h:request_process/3 pid: <0.4677.0> registered_name: [] exception error: bad generator {badrpc,timeout} in function rabbit_shovel_mgmt_util:'-status/1-lc$^0/1-0-'/2 (rabbit_shovel_mgmt_util.erl:44) in call from rabbit_shovel_mgmt_util:'-status/2-lc$^0/1-0-'/1 (rabbit_shovel_mgmt_util.erl:34) in call from rabbit_shovel_mgmt_util:'-status/2-lc$^0/1-0-'/1 (rabbit_shovel_mgmt_util.erl:34) in call from rabbit_shovel_mgmt_util:status/2 (rabbit_shovel_mgmt_util.erl:34) in call from rabbit_shovel_mgmt_shovels:to_json/2 (rabbit_shovel_mgmt_shovels.erl:47) in call from cowboy_rest:call/3 (src/cowboy_rest.erl:1577) in call from cowboy_rest:set_resp_body/2 (src/cowboy_rest.erl:1455) in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl:281) ```

mkuratczyk added 11 commits June 12, 2026 13:04

mkuratczyk force-pushed the federation-fix-status branch from 0185e6b to 15274dc Compare June 12, 2026 11:04

mkuratczyk changed the title ~~Federation: better network partition handling~~ Federation and shovel: better network partition handling Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Federation and shovel: better network partition handling#16662

Federation and shovel: better network partition handling#16662
mkuratczyk wants to merge 12 commits into
mainfrom
federation-fix-status

mkuratczyk commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mkuratczyk commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mkuratczyk commented Jun 11, 2026 •

edited

Loading