Replace rabbit_fifo_dlx_sup with two-level supervisor hierarchy #16672

Open
lukebakken wants to merge 7 commits into rabbitmq:main from amazon-mq:fix/gh-16652-dlx-qq-permanently-stranded

@lukebakken

Collaborator

Summary

Fixes #16652.

A burst of DLX worker crashes can exceed the restart intensity of the flat rabbit_fifo_dlx_sup supervisor (100 restarts within 1 second), rendering it permanently unavailable. Once it is unavailable on all nodes, every at_least_once DLX quorum queue that enters the leader state crashes its Ra server with {noproc, start_child}. Each Ra server then exceeds the restart intensity of ra_server_sup (intensity 2, period 5), permanently stranding the queue with no automatic recovery.

Changes

Replace the flat simple_one_for_one supervisor with a two-level hierarchy:

  • rabbit_fifo_dlx_sup_sup - top-level simple_one_for_one whose children are per-queue worker supervisors (restart => temporary).
  • rabbit_fifo_dlx_worker_sup - per-queue one_for_one with intensity => 0. A worker crash terminates only its own supervisor, isolating failures to a single queue without affecting the top-level supervisor or any other queue.

With restart => temporary children in the sup_sup, crashes never increment the top-level intensity counter (OTP supervisor.erl skips add_restart for temporary children in do_restart). The sup_sup itself can never reach max restart intensity regardless of how many workers crash simultaneously.
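
The hierarchy described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the module name `dlx_sup_sketch`, the stand-in worker, and the `transient` restart type for the worker are assumptions made for the example, and the top-level supervisor simply uses the OTP default intensity and period.

```erlang
-module(dlx_sup_sketch).
-behaviour(supervisor).

-export([start_sup_sup/0, start_worker_sup/1]).
-export([start_worker_sup_link/1, start_worker/0, init/1]).

%% Top level: simple_one_for_one; every child is a per-queue supervisor.
start_sup_sup() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, sup_sup).

%% Ask the top level to start one per-queue supervisor.
%% WorkerMFA is a hypothetical {M, F, A} that start-links the worker.
start_worker_sup(WorkerMFA) ->
    supervisor:start_child(?MODULE, [WorkerMFA]).

%% Start function referenced by the top-level child spec.
start_worker_sup_link(WorkerMFA) ->
    supervisor:start_link(?MODULE, {worker_sup, WorkerMFA}).

%% Stand-in worker (assumption): crashes when it receives `crash`.
start_worker() ->
    {ok, spawn_link(fun() -> receive crash -> exit(boom) end end)}.

init(sup_sup) ->
    %% temporary children are never restarted, so their exits are not
    %% counted against this supervisor's restart intensity.
    Child = #{id => worker_sup,
              start => {?MODULE, start_worker_sup_link, []},
              restart => temporary,
              type => supervisor},
    {ok, {#{strategy => simple_one_for_one}, [Child]}};
init({worker_sup, WorkerMFA}) ->
    %% intensity 0: the first restart attempt exceeds the limit, so this
    %% supervisor shuts down instead of restarting the worker.
    Child = #{id => worker,
              start => WorkerMFA,
              restart => transient},
    {ok, {#{strategy => one_for_one, intensity => 0}, [Child]}}.
```

In this sketch, crashing one worker takes down only its own per-queue supervisor; the top-level supervisor and the other per-queue supervisors keep running.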

Additional fixes included in this PR:

  • ensure_worker_started now checks live supervisor state via find_worker_sup/1 before starting a new worker, preventing duplicate worker_sups when concurrent state_enter and {dlx, setup} aux calls race.
  • state_enter(eol, ...) now terminates the DLX worker on queue deletion, preventing orphaned worker_sups.

How to reproduce

See the reproduction steps in #16652. The simplest deterministic trigger:

  1. erlang:unregister(rabbit_fifo_dlx_sup) on all nodes (simulates supervisor unavailability)
  2. Transfer leadership of an at_least_once DLX quorum queue
  3. Observe the queue becomes permanently stranded (noproc)

With the fix applied, the condition that step 1 simulates can no longer arise from worker crashes: individual worker crashes are isolated to their own rabbit_fifo_dlx_worker_sup and cannot affect the top-level supervisor or other queues.
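
To illustrate why an unregistered supervisor name is fatal to the caller, the sketch below calls `supervisor:start_child/2` against a name with no registered process. The module and function names are hypothetical and this is not code from RabbitMQ; it only shows that the call exits with a `noproc` reason, which crashes any caller that does not catch it.

```erlang
-module(dlx_noproc_sketch).
-export([try_start/1]).

%% Returns {error, noproc} when no process is registered under SupName.
%% Without the try/catch, the exit would propagate and crash the caller.
try_start(SupName) ->
    try
        supervisor:start_child(SupName, [])
    catch
        exit:{noproc, _} -> {error, noproc}
    end.
```

Calling `dlx_noproc_sketch:try_start(some_unregistered_name)` returns `{error, noproc}` as long as nothing is registered under that name.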

Test

  • rabbit_fifo_dlx_SUITE - 5/5 pass
  • rabbit_fifo_dlx_integration_SUITE - 19/19 pass (multiple clean runs)

@lukebakken lukebakken added the bug label Jun 12, 2026
@lukebakken lukebakken self-assigned this Jun 12, 2026
@lukebakken lukebakken force-pushed the fix/gh-16652-dlx-qq-permanently-stranded branch from 9594857 to 7e30dcc Compare June 12, 2026 19:41
@lukebakken lukebakken marked this pull request as draft June 13, 2026 00:05
@lukebakken lukebakken force-pushed the fix/gh-16652-dlx-qq-permanently-stranded branch 5 times, most recently from 0b35850 to 601f62e Compare June 15, 2026 22:10

@the-mikedavis the-mikedavis left a comment

Collaborator

A few first-pass comments...

Comment thread deps/rabbit/src/rabbit_fifo.erl Outdated
Comment thread deps/rabbit/src/rabbit_fifo.erl Outdated
Comment thread deps/rabbit/src/rabbit_fifo_dlx_worker.erl Outdated
@lukebakken lukebakken force-pushed the fix/gh-16652-dlx-qq-permanently-stranded branch from 601f62e to 7d7c9b1 Compare June 16, 2026 14:25
@lukebakken lukebakken requested a review from the-mikedavis June 16, 2026 15:26
@lukebakken lukebakken force-pushed the fix/gh-16652-dlx-qq-permanently-stranded branch 3 times, most recently from 1386b29 to 39134d4 Compare June 16, 2026 20:02
@lukebakken lukebakken marked this pull request as ready for review June 16, 2026 20:14
@michaelklishin

Collaborator

@lukebakken this potentially breaks a Tanzu RabbitMQ-specific queue type, so that will take some time to confirm and find a reasonable mitigation.

Comment thread deps/rabbit/src/rabbit_fifo_dlx_worker_sup.erl
@lukebakken lukebakken force-pushed the fix/gh-16652-dlx-qq-permanently-stranded branch from c1ba7c9 to c971b84 Compare June 22, 2026 12:57
The flat simple_one_for_one rabbit_fifo_dlx_sup allowed a burst of
DLX worker crashes to trip the supervisor's restart intensity and
render it permanently unavailable, stranding all at_least_once DLX
quorum queues cluster-wide (GitHub rabbitmq#16652).

Replace it with:

- rabbit_fifo_dlx_sup_sup: simple_one_for_one top-level supervisor
  whose children are per-queue worker supervisors (temporary).

- rabbit_fifo_dlx_worker_sup: one_for_one per-queue supervisor with
  intensity 0. A worker crash terminates only its own supervisor,
  isolating failures to a single queue.

Additional fixes:

- ensure_worker_started now checks live supervisor state via
  find_worker_sup/1 before starting a new worker, preventing
  duplicate worker_sups from concurrent state_enter and {dlx, setup}
  aux calls.

- state_enter(eol, ...) now terminates the DLX worker on queue
  deletion, preventing orphaned worker_sups after queue cleanup.

Prepending is O(1) vs O(n) on the effects list, and effect
processing order does not matter here.
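
The cost difference mentioned above comes from how Erlang lists are built. A minimal sketch, with a hypothetical module name and not taken from `rabbit_fifo.erl`:

```erlang
-module(dlx_effects_sketch).
-export([prepend/2, append/2]).

%% O(1): allocates a single cons cell in front of the existing list.
prepend(Effect, Effects) ->
    [Effect | Effects].

%% O(length(Effects)): ++ copies its left operand before appending.
append(Effect, Effects) ->
    Effects ++ [Effect].
```

The two functions produce the same elements in a different order, so prepending is only a safe substitution when, as stated above, processing order does not matter.
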
Move the `supervisor:start_child` + `which_children` sequence into a
single helper in `rabbit_fifo_dlx_sup_sup`, eliminating duplication
between `rabbit_fifo.erl` and `rabbit_fifo_dlx.erl`.

…er/1`

Consolidate the worker termination logic in the module that owns
the process dictionary contract (`put_sup_pid`/`?DLX_WORKER_SUP_PID_KEY`).

Also extract `is_local_and_alive` to `rabbit_misc:is_local_process_alive/1`
to eliminate duplication between `rabbit_fifo.erl` and `rabbit_fifo_dlx.erl`.

The `end_per_testcase` and test assertions called `which_children` and
`count_children` on `rabbit_fifo_dlx_sup_sup` unconditionally. In
mixed-cluster tests, old nodes only have `rabbit_fifo_dlx_sup`
registered, causing `noproc` failures.

Extract `dlx_sup_name/0`, `dlx_workers/0`, and `dlx_count_children/0`
helpers that resolve the correct supervisor name via `whereis`.

Also ensure `rabbit_fifo_dlx_worker:terminate_worker/1` return value
is matched against `ok` in `rabbit_fifo_dlx:ensure_worker_terminated`.
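
A name-resolution helper of the kind described above might look like the sketch below. Only the function name `dlx_sup_name/0` and the two registered names come from the commit message; the body and the module name are assumptions about how such a helper could be written, not the test suite's actual code.

```erlang
-module(dlx_sup_name_sketch).
-export([dlx_sup_name/0]).

%% Prefer the new supervisor name when it is registered on this node;
%% otherwise fall back to the old one (the case on old nodes in a
%% mixed-version cluster).
dlx_sup_name() ->
    case whereis(rabbit_fifo_dlx_sup_sup) of
        undefined -> rabbit_fifo_dlx_sup;
        Pid when is_pid(Pid) -> rabbit_fifo_dlx_sup_sup
    end.
```

Because `whereis/1` only sees the local registry, a helper like this would have to run on the node being inspected.
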
The tail call to `maybe_start_dlx_worker/3` was flush at column 0,
making it read like a function head rather than the function body.

When the DLX worker exits gracefully (e.g., `{shutdown, queue_leader_down}`
on leadership change), a `transient` child is not restarted and does not
trip the intensity counter. Without `auto_shutdown`, the worker_sup
survives as an empty orphan under the sup_sup.

Marking the worker as `significant => true` and the supervisor as
`auto_shutdown => any_significant` ensures the worker_sup self-terminates
when its worker exits gracefully. Same pattern as `rabbit_channel_sup`.
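
The `significant` / `auto_shutdown` combination can be sketched in isolation as follows. The module name and the stand-in worker are hypothetical, and this is not the PR's `rabbit_fifo_dlx_worker_sup`; it only shows the supervisor flags and child spec keys named above.

```erlang
-module(dlx_auto_shutdown_sketch).
-behaviour(supervisor).

-export([start_link/0, start_worker/0, init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% Stand-in worker (assumption): exits gracefully when it receives `stop`.
start_worker() ->
    {ok, spawn_link(fun() ->
                        receive stop -> exit({shutdown, queue_leader_down}) end
                    end)}.

init([]) ->
    %% any_significant: the supervisor shuts itself down as soon as a
    %% significant child terminates without being restarted.
    SupFlags = #{strategy => one_for_one,
                 intensity => 0,
                 auto_shutdown => any_significant},
    %% A {shutdown, _} exit is not restarted for a transient child, so
    %% without significant => true the supervisor would stay up, empty.
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => transient,
              significant => true},
    {ok, {SupFlags, [Child]}}.
```

These options require OTP 24 or later. A process that links to this supervisor and wants to survive its shutdown needs to trap exits, since the supervisor terminates itself once the worker is gone.
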
@lukebakken lukebakken force-pushed the fix/gh-16652-dlx-qq-permanently-stranded branch from ed8e41d to 9ef3e3e Compare June 23, 2026 17:26

Development

Successfully merging this pull request may close these issues.

at_least_once DLX quorum queues permanently stranded during membership churn

3 participants