Replace rabbit_fifo_dlx_sup with two-level supervisor hierarchy#16672
Open
lukebakken wants to merge 7 commits into
Open
Replace rabbit_fifo_dlx_sup with two-level supervisor hierarchy#16672lukebakken wants to merge 7 commits into
lukebakken wants to merge 7 commits into
Conversation
9594857 to
7e30dcc
Compare
0b35850 to
601f62e
Compare
the-mikedavis
left a comment
Collaborator
There was a problem hiding this comment.
A few first-pass comments...
601f62e to
7d7c9b1
Compare
1386b29 to
39134d4
Compare
Collaborator
|
@lukebakken this potentially breaks a Tanzu RabbitMQ-specific queue type, so that will take some time to confirm and find a reasonable mitigation. |
|
Tick the box to add this pull request to the merge queue (same as
|
c1ba7c9 to
c971b84
Compare
The flat simple_one_for_one rabbit_fifo_dlx_sup allowed a burst of DLX worker crashes to trip the supervisor's restart intensity and render it permanently unavailable, stranding all at_least_once DLX quorum queues cluster-wide (GitHub rabbitmq#16652). Replace it with: - rabbit_fifo_dlx_sup_sup: simple_one_for_one top-level supervisor whose children are per-queue worker supervisors (temporary). - rabbit_fifo_dlx_worker_sup: one_for_one per-queue supervisor with intensity 0. A worker crash terminates only its own supervisor, isolating failures to a single queue. Additional fixes: - ensure_worker_started now checks live supervisor state via find_worker_sup/1 before starting a new worker, preventing duplicate worker_sups from concurrent state_enter and {dlx, setup} aux calls. - state_enter(eol, ...) now terminates the DLX worker on queue deletion, preventing orphaned worker_sups after queue cleanup.
Prepending is O(1) vs O(n) on the effects list, and effect processing order does not matter here.
Move the `supervisor:start_child` + `which_children` sequence into a single helper in `rabbit_fifo_dlx_sup_sup`, eliminating duplication between `rabbit_fifo.erl` and `rabbit_fifo_dlx.erl`.
…er/1` Consolidate the worker termination logic in the module that owns the process dictionary contract (`put_sup_pid`/`?DLX_WORKER_SUP_PID_KEY`). Also extract `is_local_and_alive` to `rabbit_misc:is_local_process_alive/1` to eliminate duplication between `rabbit_fifo.erl` and `rabbit_fifo_dlx.erl`.
The `end_per_testcase` and test assertions called `which_children` and `count_children` on `rabbit_fifo_dlx_sup_sup` unconditionally. In mixed-cluster tests, old nodes only have `rabbit_fifo_dlx_sup` registered, causing `noproc` failures. Extract `dlx_sup_name/0`, `dlx_workers/0`, and `dlx_count_children/0` helpers that resolve the correct supervisor name via `whereis`. Also ensure `rabbit_fifo_dlx_worker:terminate_worker/1` return value is matched against `ok` in `rabbit_fifo_dlx:ensure_worker_terminated`.
The tail call to `maybe_start_dlx_worker/3` was flush at column 0, making it read like a function head rather than the function body.
When the DLX worker exits gracefully (e.g., `{shutdown, queue_leader_down}`
on leadership change), a `transient` child is not restarted and does not
trip the intensity counter. Without `auto_shutdown`, the worker_sup
survives as an empty orphan under the sup_sup.
Marking the worker as `significant => true` and the supervisor as
`auto_shutdown => any_significant` ensures the worker_sup self-terminates
when its worker exits gracefully. Same pattern as `rabbit_channel_sup`.
ed8e41d to
9ef3e3e
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #16652.
A burst of DLX worker crashes can trip the flat
rabbit_fifo_dlx_supsupervisor's restart intensity (100/1s), rendering it permanently unavailable. Once unavailable on all nodes, everyat_least_onceDLX quorum queue that enters leader state crashes its Ra server on{noproc, start_child}. Each Ra server then tripsra_server_sup(intensity 2/period 5), permanently stranding the queue with no automatic recovery.Changes
Replace the flat
simple_one_for_onesupervisor with a two-level hierarchy:rabbit_fifo_dlx_sup_sup- top-levelsimple_one_for_onewhose children are per-queue worker supervisors (restart => temporary).rabbit_fifo_dlx_worker_sup- per-queueone_for_onewithintensity => 0. A worker crash terminates only its own supervisor, isolating failures to a single queue without affecting the top-level supervisor or any other queue.With
restart => temporarychildren in the sup_sup, crashes never increment the top-level intensity counter (OTPsupervisor.erlskipsadd_restartfor temporary children indo_restart). The sup_sup itself can never reach max restart intensity regardless of how many workers crash simultaneously.Additional fixes included in this PR:
ensure_worker_startednow checks live supervisor state viafind_worker_sup/1before starting a new worker, preventing duplicate worker_sups from concurrentstate_enterand{dlx, setup}aux calls racing.state_enter(eol, ...)now terminates the DLX worker on queue deletion, preventing orphaned worker_sups.How to reproduce
See the reproduction steps in #16652. The simplest deterministic trigger:
erlang:unregister(rabbit_fifo_dlx_sup)on all nodes (simulates supervisor unavailability)at_least_onceDLX quorum queuenoproc)With the fix applied, step 1 is no longer possible - individual worker crashes are isolated to their own
rabbit_fifo_dlx_worker_supand cannot affect the top-level supervisor or other queues.Test
rabbit_fifo_dlx_SUITE- 5/5 passrabbit_fifo_dlx_integration_SUITE- 19/19 pass (multiple clean runs)