Replies: 5 comments 7 replies
Since you're using Khepri, you should set the partition handling strategy to https://www.rabbitmq.com/docs/partitions#automatic-handling. Give that a try and report back, thanks.
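For reference, the automatic strategies on that page are selected via the `cluster_partition_handling` key in `rabbitmq.conf`. A minimal sketch, assuming `autoheal` is the strategy being suggested (the same page also covers `pause_minority` and `pause_if_all_down`):

```ini
# rabbitmq.conf -- pick one automatic partition handling strategy
cluster_partition_handling = autoheal
```

Note that with Khepri enabled these Mnesia-era strategies are reported elsewhere in this thread to be no-ops, so this mainly matters for Mnesia-based clusters.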
Also note that partition handling strategies are completely gone in
Hi @JoelVcare! I don’t see anything related to partition handling strategies kicking off in the log files you shared. These Mnesia-specific partition handling options should be a no-op once Khepri is enabled. What makes you think they are activated? Do you have more logs to share? Also, could you please share the list of enabled feature flags?

About your questions:
Partition handling is activated from "node down" events emitted by the Erlang VM, if Mnesia is used (i.e. Khepri is disabled). Do you observe the problem at the time the cluster is created?
Not that I’m aware of.
Let’s try to understand what’s going on here first.
I don’t think so. Ra/Khepri/RabbitMQ should manage unstable networks (up to a certain point).
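For the feature flags list requested above, the standard CLI can produce it on any running cluster node (a sketch; this requires a live broker):

```
# Lists each feature flag and whether it is enabled on this cluster
rabbitmqctl list_feature_flags
```

Pasting that output into the thread shows whether the `khepri_db` flag is actually enabled.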
Hi @dumbbell! Thanks for the reply. The logs I shared were from some time before the network partition happened, to hours after the partition had happened. I don't have any more logs from that period, only from earlier, but those also don't state anything about partition handling. Here is the output from
We just had another hiccup with crash logging: Rabbit01 logging.txt. This time we identified which consumer didn't recover and closed those connections to make the client reconnect. We are just trying to find out now why the hiccups happen. The reconnection issue will be raised internally and maybe with MassTransit.
Community Support Policy
RabbitMQ version used
4.2.1
Erlang version used
26.0.x
Operating system (distribution) used
Ubuntu 22.04
How is RabbitMQ deployed?
Debian package
rabbitmq-diagnostics status output
See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics
Details
Logs from node 1 (with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Details
Logs from node 2 (if applicable, with sensitive values edited out)
Details
Logs from node 3 (if applicable, with sensitive values edited out)
Details
rabbitmq.conf
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location
Details
Steps to deploy RabbitMQ cluster
I think this is not applicable to the issue. If necessary, I will provide it.
Steps to reproduce the behavior in question
I don't know; that's the issue.
advanced.config
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location
Details
Application code
Details
# PASTE CODE HERE, BETWEEN BACKTICKS
Kubernetes deployment file
Details
What problem are you trying to solve?
Hi RabbitMQ community,
We're troubleshooting persistent split‑brain and partition events in our production 3‑node RabbitMQ cluster. Even very brief network hiccups, latency spikes, or packet loss can cause nodes to detect peers as down within seconds, triggering partition handling far sooner than we expected. This happens despite using pause_minority and Khepri for metadata resilience, leading to frequent manual recovery efforts that disrupt the production cluster. The timing of these split‑brain events is completely random and does not appear to correlate with any specific maintenance window or load pattern.

A typical failure pattern we've observed is asymmetric peer visibility:

That creates unclear minority/majority behavior in a 3‑node cluster and can cause partition handling to activate even when the network issue is brief. The documentation states that after a split brain it is necessary to restart the nodes to clear the split‑brain state. However, our nodes do not show a split‑brain state, so we do not restart the nodes after a network partition.
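As a point of comparison for "within seconds": a small sketch of the detection window we would normally expect, based on my reading of the Erlang kernel's net_ticktime description (an assumption: ticks are exchanged every T/4 seconds and a peer is declared down after roughly T seconds without traffic, so detection lands within about one tick interval of T):

```python
def detection_window(net_ticktime: float = 60.0) -> tuple[float, float]:
    """Approximate (min, max) seconds before a silent peer is declared down.

    Assumes ticks every net_ticktime/4 seconds and a ~net_ticktime timeout,
    per the Erlang kernel net_ticktime documentation.
    """
    tick_interval = net_ticktime / 4
    return (net_ticktime - tick_interval, net_ticktime + tick_interval)

print(detection_window())  # (45.0, 75.0) with the default net_ticktime of 60s
```

Anything on the order of seconds is therefore far below the expected window, which is why we suspect something other than the tick mechanism is declaring peers down.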
Environment
net_ticktime and cluster formation timeouts have not been tuned; defaults are in use.

Why does detection happen so fast?
Documentation suggests that Erlang inter-node detection based on net_ticktime should normally take around 60 seconds by default, and that partition handling should not react to tiny hiccups this aggressively. Yet we see node-down detection and partition handling activate within seconds. RabbitMQ also notes that pause_minority acts when nodes determine they are in a minority after seeing peers go down, and that asymmetric visibility can mean the listed nodes are split across both sides in a way that makes recovery behavior surprising.

Relevant network observations
We also see non-zero VMXNET3 dropped RX counters on multiple guests, which suggests packet loss at the virtualization/network layer rather than a RabbitMQ-only issue.
rabbitmq-node-01: ethtool shows
rabbitmq-node-02: ethtool shows
haproxy-node-01: ethtool shows

This points to possible VMXNET3 buffer exhaustion, guest scheduling latency, or an ESXi-side issue. Broadcom’s guidance for VMXNET3 packet loss on ESXi 8.x also mentions queue/poll limits and increasing the relevant VMXNET3 bounds when packet rate exceeds the default queue capacity.
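Since the concrete counter values did not survive above, here is a hypothetical helper (the function name and the sample output are made up for illustration) showing the kind of check we run: scan `ethtool -S <iface>` output for any non-zero counter whose name mentions "drop":

```python
import re

def nonzero_drop_counters(ethtool_output: str) -> dict[str, int]:
    """Return {counter_name: value} for non-zero drop-related counters
    in `ethtool -S` style output (lines like "     rx_drops: 42")."""
    counters = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*([\w .-]+):\s*(\d+)\s*$", line)
        if m and "drop" in m.group(1).lower() and int(m.group(2)) > 0:
            counters[m.group(1).strip()] = int(m.group(2))
    return counters

# Made-up sample for illustration only
sample = """NIC statistics:
     rx_packets: 123456
     rx_drops: 42
     tx_drops: 0
"""
print(nonzero_drop_counters(sample))  # {'rx_drops': 42}
```

On the real hosts the input would come from `ethtool -S` on the VMXNET3 interface; non-zero RX drop counters across multiple guests are what led us to suspect the virtualization layer.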
Questions
Is pause_minority reacting to initial peer-discovery or early peer-loss events more quickly than the net_ticktime timeout would suggest, especially in asymmetric 3-node splits?

Any insights into why detection happens this rapidly, or config tweaks for resilience against short hiccups? References to similar issues, or diagnostics commands to run during active partitions, would be gold.
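For context, these are the standard CLI commands we can run on a node during an active partition (output shape varies by version):

```
rabbitmqctl cluster_status          # cluster members, running nodes, recorded partitions
rabbitmq-diagnostics check_running  # health check: is the node booted and running?
rabbitmq-diagnostics status         # node status, including listeners and alarms
```

If there are other commands worth capturing while a partition is in progress, we are happy to collect their output next time.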
Thanks for your expertise!