Skip to content

Federation and shovel: better network partition handling#16662

Draft
mkuratczyk wants to merge 12 commits into
mainfrom
federation-fix-status
Draft

Federation and shovel: better network partition handling#16662
mkuratczyk wants to merge 12 commits into
mainfrom
federation-fix-status

Conversation

@mkuratczyk

@mkuratczyk mkuratczyk commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

This PR resolves a number of issues in shovel and federation in the context of network partition handling.
Issues I could observe, that I can no longer reproduce on this branch include:

  • ghost federation links: links that just won't go away, usually in the "starting" status (caused by stale ETS entries if the link terminated abruptly)
  • federation links and shovels not running (for different reasons)
  • federation linux and shovels still running on the minority side (they should step down and allow the majority side to take over)
  • federation links taking longer than necessary to step down on the minority side
  • Management UI returning 500 when reporting federation or shovel status

The federation status server keeps an ETS entry per link, reported by the
link supervisor. When a link went away without removing itself - for example
because its supervisor crashed - the entry was never deleted, so the link was
left behind in the status table (typically stuck in a "starting" state) and
kept being reported by the status API.

Monitor each supervisor that reports a status, once per supervisor, and on its
'DOWN' drop every status entry it owns. Conversely, demonitor a supervisor
once its last entry has been removed explicitly, so the monitor table does not
grow without bound.
The top-level federation link supervisors (`x_links` and `q_links`) were
registered as `transient` children of their respective federation
supervisors. A `transient` child is only restarted after an abnormal exit, so
whenever one of these supervisors stopped with reason `normal` or `shutdown` -
for example when a peer asked it to die as part of the mirrored supervisor
failover protocol - it was not restarted. The node was then left with no
federation links until the next full restart.

Mark both supervisors as `permanent` so they are always restarted regardless
of the exit reason.
When a mirrored supervisor process goes down, the "first" surviving member of
the group takes over its children in the `DOWN` handler. If that takeover
fails - for example because Khepri is temporarily unavailable or electing a
new leader - the child specs are left in Khepri still owned by the dead
process. The restarted supervisor only boots its static `initial_childspecs`,
so dynamic children such as federation links stay orphaned and are never
restarted, sometimes leaving a link stuck without a running process.

Add a periodic reconciliation loop. On the sorted-first active member of each
group it scans Khepri for child specs whose owner is no longer a group member,
takes over their ownership and restarts them locally. It also stops any local
child the store says is owned by another node, so that exactly one node runs
each child.

The loop runs on every node and must tolerate Khepri being briefly unavailable
- precisely the condition it recovers from - so the whole body is wrapped in a
`try/catch`: any error is logged and retried on the next tick rather than
crashing the supervisor.
During a network partition both sides of a mirrored supervisor group kept
their children running, producing duplicates - for example two copies of the
same federation link draining the same upstream. The majority side should own
the children; the minority side should give them up.

Extend the reconciliation loop to detect when the local node is in a minority
partition, that is when fewer than a strict majority of the cluster members
are reachable, and stop its local children in that case. The majority side
then reclaims them through the orphan reconciliation. Clusters with fewer than
three members never stop children this way, since they cannot establish a
majority, and an even split stops both sides until the partition heals.
…e-node start

On initialisation, when a mirroring process found no peers in its process
group, it deleted every child record for the group from Khepri. Because Khepri
is cluster-wide, this removed the records backing children running on all other
nodes, not just stale local ones.

An empty group is not a reliable signal that this node is alone: a peer's
mirroring process may simply not have joined the group yet, or this node's
supervisor may be restarting while the rest of the cluster keeps running its
children. In particular, when a supervisor restarts after reaching its maximum
restart intensity (for example following Khepri timeouts during a partition),
it re-initialises with an empty group for a moment and wipes every node's
records. The children are then gone from the store, so orphan reconciliation
has nothing left to restart and federation links stop running cluster-wide.

Only delete the records when this node is the sole reachable cluster member,
i.e. a genuine single-node (re)start. Otherwise leave any genuinely orphaned
records to the reconciliation loop, which restarts them on the majority side.
When a peer mirroring process goes down, the first member of the group takes
over its children, which requires writing to Khepri. If that write failed or
timed out, the handler stopped the mirroring process with
`{shutdown, [timeout]}`. This is most likely precisely when a peer goes down:
losing a node can trigger a Khepri leader election during which writes time
out. The mirroring process then restarts, fails again, and quickly reaches its
maximum restart intensity, tearing down the whole link supervisor for the
group.

Wrap the failover in a `try` and treat update or child-start failures as
transient: log them and keep running rather than stopping. The periodic
reconciliation loop takes over the orphaned children once Khepri is reachable
again. The successful failover path is unchanged, so children still migrate
immediately in the common case.
Clean up the reconciliation code added for orphan reclamation and minority
handling:

  - Drop the `is_list/1` guard and the dead `_ -> ok` clause on
    `supervisor2:which_children/1`, whose return is always a list.
  - Use `lists:foreach/2` instead of side-effecting list comprehensions whose
    list result was discarded, so the functions return `ok` and no longer
    trigger `unmatched_returns`.

No behaviour change.
The reconciliation loop reclaimed children owned by dead or unreachable nodes,
but never handled a child a node owns in the store yet is not actually running
locally. A node ends up in this state after leaving a minority partition:
while in the minority it stops its children but, having no quorum, cannot give
up their ownership in the store. If no majority node reclaims them before the
partition heals, the owner is alive again, so the records are no longer orphans
and nothing ever restarts the children. The result is records owned by a live
node with no running child anywhere - observed with Shovel after a partition
healed: every shovel was owned by the formerly-partitioned node but none were
running.

Add a reconciliation step, run on every node, that starts any child this node
owns in the store but is not currently running locally. This complements the
existing orphan reclamation (dead owners) and conflict resolution (children
owned by another node) to keep the store and the running children consistent.
The dynamic Shovel supervisor (rabbit_shovel_dyn_worker_sup_sup, a mirrored
supervisor) was a transient child of rabbit_shovel_sup. As with the federation
link supervisors, a transient child is not restarted after a normal or
shutdown exit, so if it stopped that way - for example after reaching its
restart intensity during Khepri timeouts in a partition, or because a peer
asked it to die - it would not come back and the node would run no shovels
until a full restart. Its restart intensity is low ({one_for_one, 3, 10}),
making this more likely than for federation.

Mark it permanent so it is always restarted, mirroring the federation fix.
Failover is event-driven: when a peer goes down, the first surviving member
takes over its children immediately. Minority-stop and conflict resolution,
however, only ran on the 30s periodic reconciliation. So after a partition
healed, a formerly-isolated node kept running (and reporting) its children
until its next periodic tick noticed they were now owned by another node and
stopped them - a window of up to 30s during which a federation link or shovel
ran on two nodes at once. For shovels in particular this means duplicate
message movement, not just duplicate reporting.

Monitor the process group with pg:monitor/1 and, on a join (partition heal) or
leave (partition), schedule a single debounced reconciliation. A short debounce
coalesces bursts of membership events, and the membership-triggered
reconciliation does not disturb the periodic timer, which remains as a backstop.

As a result the minority side sheds its children shortly after losing quorum,
and the rejoining side drops any duplicate shortly after healing, shrinking the
duplicate window from up to the periodic interval to a couple of seconds. This
lives in the shared mirrored_supervisor, so federation and shovel both benefit.
When a peer's mirroring process goes down, the first surviving member takes
over its children via update_all/2, which writes to Khepri. On the minority
side of a partition there is no quorum, so that write blocks the mirroring
gen_server until it times out (~30s). While blocked, the process handles no
other message - including the reconciliation that is meant to stop this node's
children because it is in the minority. So minority-stop was delayed by the
full Khepri timeout, and if the partition healed sooner it never ran at all,
leaving the isolated node running duplicates of children the majority had
already taken over (observed: a partitioned node kept all its shovels and
federation links, and they stayed on it after the partition healed).

Check for a minority partition before attempting the takeover and skip it when
in the minority: such a node must not take over children (the majority owns
them) and cannot (no quorum). This frees the gen_server to process the
membership-triggered reconciliation promptly, so the minority side sheds its
children within the debounce window instead of after a 30s stall or not at all.

The minority test is factored into partition_status/0 and shared with the
reconciliation loop. It relies only on Erlang distribution and locally-known
membership, both of which were confirmed to stay responsive and accurate on a
partitioned node (unlike Khepri writes).
@mkuratczyk mkuratczyk force-pushed the federation-fix-status branch from 0185e6b to 15274dc Compare June 12, 2026 11:04
Fix a 500 error from the Management UI during a network partition:
```
crasher:
  initial call: cowboy_stream_h:request_process/3
  pid: <0.4677.0>
  registered_name: []
  exception error: bad generator {badrpc,timeout}
    in function  rabbit_shovel_mgmt_util:'-status/1-lc$^0/1-0-'/2 (rabbit_shovel_mgmt_util.erl:44)
    in call from rabbit_shovel_mgmt_util:'-status/2-lc$^0/1-0-'/1 (rabbit_shovel_mgmt_util.erl:34)
    in call from rabbit_shovel_mgmt_util:'-status/2-lc$^0/1-0-'/1 (rabbit_shovel_mgmt_util.erl:34)
    in call from rabbit_shovel_mgmt_util:status/2 (rabbit_shovel_mgmt_util.erl:34)
    in call from rabbit_shovel_mgmt_shovels:to_json/2 (rabbit_shovel_mgmt_shovels.erl:47)
    in call from cowboy_rest:call/3 (src/cowboy_rest.erl:1577)
    in call from cowboy_rest:set_resp_body/2 (src/cowboy_rest.erl:1455)
    in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl:281)
```
@mkuratczyk mkuratczyk changed the title Federation: better network partition handling Federation and shovel: better network partition handling Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant