Plugin re-activation can fail #57
Description
When enabling the plugin on the first node in a cluster, the following error is generated:
[root@db3 ~]# rabbitmq-plugins enable autocluster
The following plugins have been enabled:
rabbitmq_aws
autocluster
Applying plugin configuration to [email protected]... failed.
Error: {{badmatch,false},
[{autocluster_periodic,start_delayed,3,
[{file,"src/autocluster_periodic.erl"},
{line,47}]},
{autocluster_consul,register,0,
[{file,"src/autocluster_consul.erl"},{line,135}]},
{autocluster,register_with_backend,1,
[{file,"src/autocluster.erl"},{line,307}]},
{autocluster,run_steps,1,[{file,"src/autocluster.erl"},{line,131}]},
{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
[{file,"src/rabbit_boot_steps.erl"},{line,49}]},
{rabbit_boot_steps,run_step,2,
[{file,"src/rabbit_boot_steps.erl"},{line,49}]},
{rabbit_boot_steps,'-run_boot_steps/1-lc$^0/1-0-',1,
[{file,"src/rabbit_boot_steps.erl"},{line,26}]},
{rabbit_boot_steps,run_boot_steps,1,
[{file,"src/rabbit_boot_steps.erl"},{line,26}]}]}
Investigation points to a race condition in the startup logic. The plugin runs through its steps once, acquires a lock, and possibly inserts into ETS; then, on determining that it is the only node, it runs through the steps again while still holding the initial lock.
Retaining the initial lock could itself be a problem. After the initial lock is released, the process registers with Consul, and the failure occurs after registration, when the delayed task is set up.
The initial run seems to have already created this task, so ets:insert_new returns false; its documentation[1] says false is returned if the key already exists.
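The ets:insert_new behaviour described above can be reproduced in isolation. This is a minimal sketch, not the plugin's code: the module name, table name, and key are hypothetical and chosen only for the demo.

```erlang
-module(insert_new_demo).
-export([run/0]).

%% Hypothetical table and key names; they are not taken from the plugin.
run() ->
    Tab = ets:new(demo_delayed_tasks, [set, public]),
    %% First registration: the key is absent, so insert_new/2 returns true.
    First = ets:insert_new(Tab, {node_health, make_ref()}),
    %% Second registration under the same key: nothing is written
    %% and insert_new/2 returns false.
    Second = ets:insert_new(Tab, {node_health, make_ref()}),
    ets:delete(Tab),
    {First, Second}.
```

Calling insert_new_demo:run() returns {true, false}. If a call site matches the result against true, for example `true = ets:insert_new(Tab, Obj)`, the second call raises {badmatch,false}, which is the error shape seen in the trace at autocluster_periodic:start_delayed/3.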
In this state, disabling the plugin does not stop it: it keeps running and keeps trying to recreate the node in Consul. I suspect the delayed task is not removed.
The plugin then enters a loop, because Consul returns a 500 result code when a request is made to check the state of a service that does not exist:
https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L194
https://github.com/rabbitmq/rabbitmq-autocluster/blob/stable/src/autocluster_consul.erl#L212
The only way out of this state is to restart the rabbitmq-server process.
I do not have much time at the moment to dig into this, so I have created a workaround[2] that suppresses the error on enable until I can look at it later. If someone else is able to look into this, that would be great.
[1] http://erlang.org/doc/man/ets.html#insert_new-2
[2] akissa@39e0cb4