rabbit_mnesia: add retries #15502

Draft
mkuratczyk wants to merge 1 commit into main from mnesia-retry

@mkuratczyk
Contributor

Occasionally, clustering fails with the log pasted below. I believe this is caused by parallel node startup, which sometimes leads to crashes.

Hopefully, with retries, we'll handle this more gracefully.

```
Feature flags: nodes `rmq-ct-cluster_size_3_2-2-21072@localhost` and `rmq-ct-cluster_size_3_2-1-21000@localhost` are compatible

Mnesia('rmq-ct-cluster_size_3_2-2-21072@localhost'): ** ERROR ** (ignoring core) ** FATAL ** mnesia_monitor crashed:
{{badmatch, <0.203.0>, Ref<0.1988436133.884998146.137464>}},
{mnesia_monitor, handle_info, 2, [{file, "mnesia_monitor.erl"}, {line, 583}]},
gen_server, try_handle_info, 3, [{file, "gen_server.erl"}, {line, 2434}]},
gen_server, handle_msg, 3, [{file, "gen_server.erl"}, {line, 2420}]},
proc_lib, init_p_do_apply, 3, [{file, "proc_lib.erl"}, {line, 333}]}]}

Error in process <0.300.0> on node 'rmq-ct-cluster_size_3_2-2-21072@localhost' with exit value:
{badarg,[{erlang,send,
                [mnesia_locker,{release_tid,{tid,142,<24815.431.0>}}],
                [{error_info,#{module => erl_erts_errors}}]},
        {mnesia_locker,release_tid,1,[{file,"mnesia_locker.erl"},{line,128}]},
        {mnesia_tm,commit_participant,7,
                   [{file,"mnesia_tm.erl"},{line,1828}]}]}

Application mnesia exited with reason: stopped

BOOT FAILED
===========
Exception during startup:

Exit:{killed,{gen_server,call,[<0.280.0>,{negotiate_protocol,['rmq-ct-cluster_size_3_2-1-21000@localhost']},infinity]}}

   gen_server:call/3, line 1301
   mnesia_monitor:call/1, line 232
   rabbit_mnesia:-check_mnesia_consistency/2-fun-0-/2, line 1002
   rabbit_mnesia:with_running_or_clean_mnesia/1, line 1036
   rabbit_mnesia:check_cluster_consistency/2, line 719
   lists:foldl/3, line 2466
   rabbit_mnesia:check_cluster_consistency/0, line 680
   rabbit_prelaunch_cluster:setup/1, line 27
```

@mkuratczyk mkuratczyk marked this pull request as draft February 18, 2026 16:26
@mkuratczyk
Contributor Author

"Mnesia protocol negotiation with node ~tp "
"failed: ~tp. No retries left.",
[Node, Reason]),
[]
Collaborator

Returning an empty list instead of raising an exception. Is this change of behaviour ok?

Contributor Author

I think it is. An empty list is handled as an error in `check_mnesia_consistency`.
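
For readers following the thread, here is a minimal sketch of the retry pattern being discussed. It is not the code from this PR. The module name, `negotiate_with_retries/4`, the `NegotiateFun` argument (a stand-in for the real negotiation call), the delay parameter, and `demo/0` are all hypothetical and exist only for illustration. The sketch assumes the negotiation call fails with an `exit`, as in the `BOOT FAILED` trace above, and that the caller treats an empty list as a failed negotiation.

```erlang
-module(negotiate_retry_sketch).
-export([negotiate_with_retries/4, demo/0]).

%% NegotiateFun is a hypothetical stand-in for the real negotiation call.
%% It takes a node name and returns the list of nodes that accepted the
%% protocol, or exits if the call fails (for example, because the server
%% process was killed while the call was in flight).
negotiate_with_retries(NegotiateFun, Node, Retries, DelayMs)
  when is_function(NegotiateFun, 1), is_integer(Retries), Retries >= 0 ->
    try
        NegotiateFun(Node)
    catch
        exit:_Reason when Retries > 0 ->
            %% Give the other node time to finish (re)starting, then retry.
            timer:sleep(DelayMs),
            negotiate_with_retries(NegotiateFun, Node, Retries - 1, DelayMs);
        exit:Reason ->
            %% Out of retries: log and return [] instead of re-raising,
            %% so the caller can handle it like any failed negotiation.
            logger:warning("Mnesia protocol negotiation with node ~tp "
                           "failed: ~tp. No retries left.",
                           [Node, Reason]),
            []
    end.

%% Exercises both outcomes with fake negotiation funs.
demo() ->
    put(attempts, 0),
    %% Fails on the first two attempts, succeeds on the third.
    Flaky = fun(Node) ->
                    N = get(attempts) + 1,
                    put(attempts, N),
                    case N < 3 of
                        true  -> exit({killed, simulated});
                        false -> [Node]
                    end
            end,
    [n1] = negotiate_with_retries(Flaky, n1, 5, 10),
    3 = get(attempts),
    %% Always fails: the retries run out and [] is returned.
    AlwaysDown = fun(_Node) -> exit(noproc) end,
    [] = negotiate_with_retries(AlwaysDown, n1, 2, 10),
    ok.
```

Only the `exit` class is caught here, so `error` and `throw` exceptions still propagate; whether that is the right scope for the real change is a design question for the PR itself.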
