Federation and shovel: better network partition handling#16662
Draft
mkuratczyk wants to merge 12 commits into
Draft
Federation and shovel: better network partition handling#16662mkuratczyk wants to merge 12 commits into
mkuratczyk wants to merge 12 commits into
Conversation
The federation status server keeps an ETS entry per link, reported by the link supervisor. When a link went away without removing itself - for example because its supervisor crashed - the entry was never deleted, so the link was left behind in the status table (typically stuck in a "starting" state) and kept being reported by the status API. Monitor each supervisor that reports a status, once per supervisor, and on its 'DOWN' drop every status entry it owns. Conversely, demonitor a supervisor once its last entry has been removed explicitly, so the monitor table does not grow without bound.
The top-level federation link supervisors (`x_links` and `q_links`) were registered as `transient` children of their respective federation supervisors. A `transient` child is only restarted after an abnormal exit, so whenever one of these supervisors stopped with reason `normal` or `shutdown` - for example when a peer asked it to die as part of the mirrored supervisor failover protocol - it was not restarted. The node was then left with no federation links until the next full restart. Mark both supervisors as `permanent` so they are always restarted regardless of the exit reason.
When a mirrored supervisor process goes down, the "first" surviving member of the group takes over its children in the `DOWN` handler. If that takeover fails - for example because Khepri is temporarily unavailable or electing a new leader - the child specs are left in Khepri still owned by the dead process. The restarted supervisor only boots its static `initial_childspecs`, so dynamic children such as federation links stay orphaned and are never restarted, sometimes leaving a link stuck without a running process. Add a periodic reconciliation loop. On the sorted-first active member of each group it scans Khepri for child specs whose owner is no longer a group member, takes over their ownership and restarts them locally. It also stops any local child the store says is owned by another node, so that exactly one node runs each child. The loop runs on every node and must tolerate Khepri being briefly unavailable - precisely the condition it recovers from - so the whole body is wrapped in a `try/catch`: any error is logged and retried on the next tick rather than crashing the supervisor.
During a network partition both sides of a mirrored supervisor group kept their children running, producing duplicates - for example two copies of the same federation link draining the same upstream. The majority side should own the children; the minority side should give them up. Extend the reconciliation loop to detect when the local node is in a minority partition, that is when fewer than a strict majority of the cluster members are reachable, and stop its local children in that case. The majority side then reclaims them through the orphan reconciliation. Clusters with fewer than three members never stop children this way, since they cannot establish a majority, and an even split stops both sides until the partition heals.
…e-node start On initialisation, when a mirroring process found no peers in its process group, it deleted every child record for the group from Khepri. Because Khepri is cluster-wide, this removed the records backing children running on all other nodes, not just stale local ones. An empty group is not a reliable signal that this node is alone: a peer's mirroring process may simply not have joined the group yet, or this node's supervisor may be restarting while the rest of the cluster keeps running its children. In particular, when a supervisor restarts after reaching its maximum restart intensity (for example following Khepri timeouts during a partition), it re-initialises with an empty group for a moment and wipes every node's records. The children are then gone from the store, so orphan reconciliation has nothing left to restart and federation links stop running cluster-wide. Only delete the records when this node is the sole reachable cluster member, i.e. a genuine single-node (re)start. Otherwise leave any genuinely orphaned records to the reconciliation loop, which restarts them on the majority side.
When a peer mirroring process goes down, the first member of the group takes
over its children, which requires writing to Khepri. If that write failed or
timed out, the handler stopped the mirroring process with
`{shutdown, [timeout]}`. This is most likely precisely when a peer goes down:
losing a node can trigger a Khepri leader election during which writes time
out. The mirroring process then restarts, fails again, and quickly reaches its
maximum restart intensity, tearing down the whole link supervisor for the
group.
Wrap the failover in a `try` and treat update or child-start failures as
transient: log them and keep running rather than stopping. The periodic
reconciliation loop takes over the orphaned children once Khepri is reachable
again. The successful failover path is unchanged, so children still migrate
immediately in the common case.
Clean up the reconciliation code added for orphan reclamation and minority
handling:
- Drop the `is_list/1` guard and the dead `_ -> ok` clause on
`supervisor2:which_children/1`, whose return is always a list.
- Use `lists:foreach/2` instead of side-effecting list comprehensions whose
list result was discarded, so the functions return `ok` and no longer
trigger `unmatched_returns`.
No behaviour change.
The reconciliation loop reclaimed children owned by dead or unreachable nodes, but never handled a child a node owns in the store yet is not actually running locally. A node ends up in this state after leaving a minority partition: while in the minority it stops its children but, having no quorum, cannot give up their ownership in the store. If no majority node reclaims them before the partition heals, the owner is alive again, so the records are no longer orphans and nothing ever restarts the children. The result is records owned by a live node with no running child anywhere - observed with Shovel after a partition healed: every shovel was owned by the formerly-partitioned node but none were running. Add a reconciliation step, run on every node, that starts any child this node owns in the store but is not currently running locally. This complements the existing orphan reclamation (dead owners) and conflict resolution (children owned by another node) to keep the store and the running children consistent.
The dynamic Shovel supervisor (rabbit_shovel_dyn_worker_sup_sup, a mirrored
supervisor) was a transient child of rabbit_shovel_sup. As with the federation
link supervisors, a transient child is not restarted after a normal or
shutdown exit, so if it stopped that way - for example after reaching its
restart intensity during Khepri timeouts in a partition, or because a peer
asked it to die - it would not come back and the node would run no shovels
until a full restart. Its restart intensity is low ({one_for_one, 3, 10}),
making this more likely than for federation.
Mark it permanent so it is always restarted, mirroring the federation fix.
Failover is event-driven: when a peer goes down, the first surviving member takes over its children immediately. Minority-stop and conflict resolution, however, only ran on the 30s periodic reconciliation. So after a partition healed, a formerly-isolated node kept running (and reporting) its children until its next periodic tick noticed they were now owned by another node and stopped them - a window of up to 30s during which a federation link or shovel ran on two nodes at once. For shovels in particular this means duplicate message movement, not just duplicate reporting. Monitor the process group with pg:monitor/1 and, on a join (partition heal) or leave (partition), schedule a single debounced reconciliation. A short debounce coalesces bursts of membership events, and the membership-triggered reconciliation does not disturb the periodic timer, which remains as a backstop. As a result the minority side sheds its children shortly after losing quorum, and the rejoining side drops any duplicate shortly after healing, shrinking the duplicate window from up to the periodic interval to a couple of seconds. This lives in the shared mirrored_supervisor, so federation and shovel both benefit.
When a peer's mirroring process goes down, the first surviving member takes over its children via update_all/2, which writes to Khepri. On the minority side of a partition there is no quorum, so that write blocks the mirroring gen_server until it times out (~30s). While blocked, the process handles no other message - including the reconciliation that is meant to stop this node's children because it is in the minority. So minority-stop was delayed by the full Khepri timeout, and if the partition healed sooner it never ran at all, leaving the isolated node running duplicates of children the majority had already taken over (observed: a partitioned node kept all its shovels and federation links, and they stayed on it after the partition healed). Check for a minority partition before attempting the takeover and skip it when in the minority: such a node must not take over children (the majority owns them) and cannot (no quorum). This frees the gen_server to process the membership-triggered reconciliation promptly, so the minority side sheds its children within the debounce window instead of after a 30s stall or not at all. The minority test is factored into partition_status/0 and shared with the reconciliation loop. It relies only on Erlang distribution and locally-known membership, both of which were confirmed to stay responsive and accurate on a partitioned node (unlike Khepri writes).
0185e6b to
15274dc
Compare
Fix a 500 error from the Management UI during a network partition:
```
crasher:
initial call: cowboy_stream_h:request_process/3
pid: <0.4677.0>
registered_name: []
exception error: bad generator {badrpc,timeout}
in function rabbit_shovel_mgmt_util:'-status/1-lc$^0/1-0-'/2 (rabbit_shovel_mgmt_util.erl:44)
in call from rabbit_shovel_mgmt_util:'-status/2-lc$^0/1-0-'/1 (rabbit_shovel_mgmt_util.erl:34)
in call from rabbit_shovel_mgmt_util:'-status/2-lc$^0/1-0-'/1 (rabbit_shovel_mgmt_util.erl:34)
in call from rabbit_shovel_mgmt_util:status/2 (rabbit_shovel_mgmt_util.erl:34)
in call from rabbit_shovel_mgmt_shovels:to_json/2 (rabbit_shovel_mgmt_shovels.erl:47)
in call from cowboy_rest:call/3 (src/cowboy_rest.erl:1577)
in call from cowboy_rest:set_resp_body/2 (src/cowboy_rest.erl:1455)
in call from cowboy_rest:upgrade/4 (src/cowboy_rest.erl:281)
```
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR resolves a number of issues in shovel and federation in the context of network partition handling.
Issues I could observe, that I can no longer reproduce on this branch include: