Replies: 5 comments 7 replies
Since you're using Khepri, you should set the partition handling strategy to https://www.rabbitmq.com/docs/partitions#automatic-handling. Give that a try and report back, thanks.
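For reference, the automatic strategies on that page are selected via the `cluster_partition_handling` key in `rabbitmq.conf`. A minimal sketch, assuming `autoheal` is the strategy being suggested (the same page also covers `pause_minority` and `pause_if_all_down`):

```ini
# rabbitmq.conf -- pick one automatic partition handling strategy
cluster_partition_handling = autoheal
```

Note that with Khepri enabled these Mnesia-era strategies are reported elsewhere in this thread to be no-ops, so this mainly matters for Mnesia-based clusters.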
Also note that partition handling strategies are completely gone in
Hi @JoelVcare! I don’t see anything related to partition handling strategies kicking off in the log files you shared. These Mnesia-specific partition handling options should be a no-op once Khepri is enabled. What makes you think they are activated? Do you have more logs to share? Also, could you please share the list of enabled feature flags?

About your questions:
Partition handling is activated from "node down" events emitted by the Erlang VM, if Mnesia is used (i.e. Khepri is disabled). Do you observe the problem at the time the cluster is created?
Not that I’m aware of.
Let’s try to understand what’s going on here first.
I don’t think so. Ra/Khepri/RabbitMQ should manage unstable networks (up to a certain point).
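For the feature flags list requested above, the standard CLI can produce it on any running cluster node (a sketch; this requires a live broker):

```
# Lists each feature flag and whether it is enabled on this cluster
rabbitmqctl list_feature_flags
```

Pasting that output into the thread shows whether the `khepri_db` flag is actually enabled.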
Hi @dumbbell! Thanks for the reply. The logs I shared were from some time before the network partition happened, to hours after the partition had happened. I don't have any more logs from that period, only from earlier, but those also don't state anything about partition handling. Here is the output from
We just had another hiccup with crash logging: Rabbit01 logging.txt. This time we identified which consumer didn't recover and closed those connections to make the client reconnect. We are just trying to find out now why the hiccups happen. The reconnection issue will be raised internally and maybe with MassTransit.
Community Support Policy
RabbitMQ version used
4.2.1
Erlang version used
26.0.x
Operating system (distribution) used
Ubuntu 22.04
How is RabbitMQ deployed?
Debian package
rabbitmq-diagnostics status output
See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics
Details
Logs from node 1 (with sensitive values edited out)
See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Details
Logs from node 2 (if applicable, with sensitive values edited out)
Details
Logs from node 3 (if applicable, with sensitive values edited out)
Details
rabbitmq.conf
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location
Details
Steps to deploy RabbitMQ cluster
I think this is not applicable to the issue. If necessary, I will provide it.
Steps to reproduce the behavior in question
I don't know; that's the issue.
advanced.config
See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location
Details
Application code
Details
# PASTE CODE HERE, BETWEEN BACKTICKS
Kubernetes deployment file
Details
What problem are you trying to solve?
Hi RabbitMQ community,
We're troubleshooting persistent split‑brain and partition events in our production 3‑node RabbitMQ cluster. Even very brief network hiccups, latency spikes, or packet loss can cause nodes to detect peers as down within seconds, triggering partition handling far sooner than we expected. This happens despite using pause_minority and Khepri for metadata resilience, leading to frequent manual recovery efforts that disrupt the production cluster. The timing of these split‑brain events is completely random and does not appear to correlate with any specific maintenance window or load pattern.

A typical failure pattern we've observed is asymmetric peer visibility:

That creates unclear minority/majority behavior in a 3‑node cluster and can cause partition handling to activate even when the network issue is brief. The documentation states that after a split brain it is necessary to restart the nodes to clear the split‑brain state. However, our nodes do not show a split‑brain state, so we do not restart the nodes after a network partition.
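As a point of comparison for "within seconds": a small sketch of the detection window we would normally expect, based on my reading of the Erlang kernel's net_ticktime description (an assumption: ticks are exchanged every T/4 seconds and a peer is declared down after roughly T seconds without traffic, so detection lands within about one tick interval of T):

```python
def detection_window(net_ticktime: float = 60.0) -> tuple[float, float]:
    """Approximate (min, max) seconds before a silent peer is declared down.

    Assumes ticks every net_ticktime/4 seconds and a ~net_ticktime timeout,
    per the Erlang kernel net_ticktime documentation.
    """
    tick_interval = net_ticktime / 4
    return (net_ticktime - tick_interval, net_ticktime + tick_interval)

print(detection_window())  # (45.0, 75.0) with the default net_ticktime of 60s
```

Anything on the order of seconds is therefore far below the expected window, which is why we suspect something other than the tick mechanism is declaring peers down.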
Environment
net_ticktime and cluster formation timeouts have not been tuned; defaults are in use.

Why does detection happen so fast?
Documentation suggests that Erlang inter-node detection based on net_ticktime should normally take around 60 seconds by default, and that partition handling should not react to tiny hiccups this aggressively. Yet we see node-down detection and partition handling activate within seconds. RabbitMQ also notes that pause_minority acts when nodes determine they are in a minority after seeing peers go down, and that asymmetric visibility can mean the listed nodes are split across both sides in a way that makes recovery behavior surprising.

Relevant network observations
We also see non-zero VMXNET3 dropped RX counters on multiple guests, which suggests packet loss at the virtualization/network layer rather than a RabbitMQ-only issue.
rabbitmq-node-01: ethtool shows
rabbitmq-node-02: ethtool shows
haproxy-node-01: ethtool shows

This points to possible VMXNET3 buffer exhaustion, guest scheduling latency, or an ESXi-side issue. Broadcom’s guidance for VMXNET3 packet loss on ESXi 8.x also mentions queue/poll limits and increasing the relevant VMXNET3 bounds when packet rate exceeds the default queue capacity.
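Since the concrete counter values did not survive above, here is a hypothetical helper (the function name and the sample output are made up for illustration) showing the kind of check we run: scan `ethtool -S <iface>` output for any non-zero counter whose name mentions "drop":

```python
import re

def nonzero_drop_counters(ethtool_output: str) -> dict[str, int]:
    """Return {counter_name: value} for non-zero drop-related counters
    in `ethtool -S` style output (lines like "     rx_drops: 42")."""
    counters = {}
    for line in ethtool_output.splitlines():
        m = re.match(r"\s*([\w .-]+):\s*(\d+)\s*$", line)
        if m and "drop" in m.group(1).lower() and int(m.group(2)) > 0:
            counters[m.group(1).strip()] = int(m.group(2))
    return counters

# Made-up sample for illustration only
sample = """NIC statistics:
     rx_packets: 123456
     rx_drops: 42
     tx_drops: 0
"""
print(nonzero_drop_counters(sample))  # {'rx_drops': 42}
```

On the real hosts the input would come from `ethtool -S` on the VMXNET3 interface; non-zero RX drop counters across multiple guests are what led us to suspect the virtualization layer.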
Questions
Is pause_minority reacting to initial peer-discovery or early peer-loss events more quickly than the net_ticktime timeout would suggest, especially in asymmetric 3-node splits?

Any insights into why detection happens this rapidly, or config tweaks for resilience against short hiccups? References to similar issues, or diagnostics commands to run during active partitions, would be gold.
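For context, these are the standard CLI commands we can run on a node during an active partition (output shape varies by version):

```
rabbitmqctl cluster_status          # cluster members, running nodes, recorded partitions
rabbitmq-diagnostics check_running  # health check: is the node booted and running?
rabbitmq-diagnostics status         # node status, including listeners and alarms
```

If there are other commands worth capturing while a partition is in progress, we are happy to collect their output next time.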
Thanks for your expertise!