The problem
Ok so we modified one of our applications to replace ActiveMq with the much more reliable RabbitMq. During this process I was configuring clustering and adding new users and virtual hosts, all with scripts that made use of rabbitmqctl, so rabbitmqctl was clearly working correctly. Note that this was on Windows 2008; the problem is unlikely to affect Linux.
So I RDP’ed into the Windows box the next day and was going to configure some scripts to pull queue size stats out of RabbitMq for our Kibana reporting environment. This method of getting the queue size calls rabbitmqctl from another batch file. On the first run I noticed that the script wasn’t working. Digging further I noticed that if I issued the command:
rabbitmqctl status
Then I received the response
Status of node 'rabbit@SVR1' ...
Error: unable to connect to node 'rabbit@SVR1': nodedown
DIAGNOSTICS
===========
attempted to contact: ['rabbit@SVR1']
rabbit@SVR1:
* connected to epmd (port 4369) on SVR1
* epmd reports: node 'rabbit' not running at all
no other nodes on SVR1
* suggestion: start the node
current node details:
- node name: 'rabbitmqctl23333@SVR1'
- home dir: C:\Users\me
- cookie hash: <deleted>
Very strange, as this same command had been working the previous day. I knew that the command had worked from my account the day before, and I also knew that I had the correct Erlang cookie in my home folder. So this really should have been working.
After getting some help from the super helpful Simon MacMullen on the RabbitMq mailing list, I identified the following as the problem.
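Since our monitoring scripts shell out to rabbitmqctl anyway, it helps to recognise this particular failure automatically. The following is a minimal sketch in Python, not something from our actual scripts: the function name diagnose and the keys of the returned dictionary are made up for illustration, and it only looks for the phrases that appear in the output shown above.

```python
import re


def diagnose(output: str) -> dict:
    """Classify the text printed by `rabbitmqctl status` (hypothetical helper)."""
    # Normalise typographic quotes that creep in when output is pasted around.
    text = output.replace("\u2018", "'").replace("\u2019", "'")
    result = {
        "nodedown": "nodedown" in text,
        "epmd_reachable": "connected to epmd" in text,
        "node_registered": None,  # None means the output did not say either way
        "home_dir": None,
    }
    # epmd answered, but it has no entry for the broker node.
    if re.search(r"epmd reports: node '?\w+'? not running at all", text):
        result["node_registered"] = False
    match = re.search(r"home dir:\s*(.+)", text)
    if match:
        result["home_dir"] = match.group(1).strip()
    return result
```

The interesting combination is nodedown together with epmd_reachable being true and node_registered being false: the broker service can be up while epmd simply has no record of it.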
I had installed RabbitMq, then with the Rabbit service not running I issued the command “rabbitmqctl status”. This caused Erlang to start the epmd.exe process but note that this was running under my user account. I then started the RabbitMq service, which registered itself with the running epmd.
See here for more information on the epmd daemon: http://www.erlang.org/doc/man/epmd.html
This daemon is used when interacting with Rabbit through rabbitmqctl, and it is also used for clustering.
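To see what epmd currently knows about, you can ask it directly. The sketch below is an illustration in Python, assuming the NAMES_REQ message described in the Erlang distribution protocol documentation (a 2-byte length followed by the byte 110, answered with a 4-byte port number and then lines of text). The function names are my own, and localhost and port 4369 are the defaults seen in the output above.

```python
import socket
import struct

EPMD_PORT = 4369   # default epmd port, as reported by rabbitmqctl
NAMES_REQ = 110    # ASCII 'n'


def build_names_request() -> bytes:
    # Every epmd request is a 2-byte big-endian length followed by the payload.
    payload = bytes([NAMES_REQ])
    return struct.pack(">H", len(payload)) + payload


def parse_names_response(data: bytes) -> dict:
    # The first 4 bytes are the port epmd listens on; the rest is text
    # lines of the form "name rabbit at port 25672".
    nodes = {}
    for line in data[4:].decode("latin-1").splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "name" and parts[2:4] == ["at", "port"]:
            nodes[parts[1]] = int(parts[4])
    return nodes


def registered_nodes(host="localhost", port=EPMD_PORT, timeout=5.0) -> dict:
    # epmd closes the connection once it has sent the list, so read until EOF.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_names_request())
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_names_response(b"".join(chunks))
```

If registered_nodes() returns a dictionary without a rabbit entry while the RabbitMq service is running, you are most likely in the situation described in this post.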
So at first rabbitmqctl and epmd worked fine, but because the epmd process was running under my user account, it was killed when I logged off. When I logged back in, rabbitmqctl therefore no longer worked.
My tests also show that if Rabbit is not registered with epmd correctly, then clustering will not work.
For instance, if I create the same situation on the master for the cluster, then restart one of the slave servers, the slave will not be able to connect to the cluster and will fail to start.
Simon indicates that rabbitmqctl, clustering and rabbitmq-plugins in 3.4.0+ will not work if Rabbit is not registered with epmd.
So this is quite a serious problem, as it breaks clustering: if we had not noticed this in production and the slave service had been bounced, it would not have come back up. We do have monitoring for the service failing to start, but it still wouldn’t have been much fun.
The exact steps to reproduce the problem are:
- With no epmd running.
- Stop Rabbit service.
- Run rabbitmqctl status, this starts epmd as your local user account.
- Start Rabbit service
- Run rabbitmqctl status, notice that it works.
- Log off
- Log back in
- epmd has exited due to being killed during logoff
- Run rabbitmqctl status, notice that it no longer works.
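The steps above hinge on which account owns epmd.exe, so that is worth checking from a script. The following is a rough sketch in Python that parses the CSV output of tasklist /V. The helper names are hypothetical, and the column headings ("Image Name", "PID", "User Name") are what an English-language Windows prints, so treat them as an assumption on other locales. Run it from an elevated prompt, since the user name may not be shown for processes owned by other accounts.

```python
import csv
import io
import subprocess


def epmd_owners(tasklist_csv: str) -> list:
    """Return (pid, user) pairs for every epmd.exe row in tasklist CSV output."""
    rows = csv.DictReader(io.StringIO(tasklist_csv))
    return [
        (row["PID"], row["User Name"])
        for row in rows
        if (row.get("Image Name") or "").lower() == "epmd.exe"
    ]


def runs_as_system(owner: str) -> bool:
    # The account the RabbitMq service normally runs under.
    return owner.strip().upper() == r"NT AUTHORITY\SYSTEM"


def query_tasklist() -> str:
    # /V adds the "User Name" column, /FO CSV makes the output parseable.
    completed = subprocess.run(
        ["tasklist", "/V", "/FO", "CSV", "/FI", "IMAGENAME eq epmd.exe"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout
```

On the affected server, something like [owner for _, owner in epmd_owners(query_tasklist()) if not runs_as_system(owner)] being non-empty would indicate an epmd that will die at the next logoff.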
The fix
If you have a downtime window, kill any running epmd processes and restart the RabbitMq service. If, like us, you don’t have that luxury, then follow the steps below.
Thanks again to Simon for suggesting the fix and Sysinternals for their tools.
- Download the Sysinternals pstools from here: http://technet.microsoft.com/en-gb/sysinternals/bb896649.aspx
- Extract psexec and copy it to the destination server.
- Kill any running epmd.exe
- Start an Administrator command line
- Run the command psexec -s "c:\Program Files (x86)\RabbitMQ Server\rabbitmq_server-3.3.5\sbin\rabbitmqctl.bat" status
- You might need to adjust the path to match your installed Rabbit.
- Check using ProcessExplorer that epmd.exe is running as the user: NT Authority\System
- The first time you do this you might need to click through a license agreement for the Sysinternals tool. This might cause the command to fail; if so, repeat steps 3-6.
- Start a second command line window
- In the second window run the Erlang shell: "c:\Program Files\erl6.2\bin\erl.exe"
- Again, use the path to your installed version.
- In the Erlang shell enter:
- erl_epmd:start().
- This should return {ok, SomeProcessIdentifier}
- erl_epmd:register_node(rabbit, 25672).
- Replace rabbit in the above statement with your node name (the part before the @, as in rabbit@SVR1) and the second parameter with the clustering port.
- This should return {ok, SomeNumber} if successful, and {error, SomeError} otherwise.
- erl_epmd:start().
- Hopefully in your case the above worked.
- Now type the following into the Erlang shell, but do not hit return to execute it yet:
- halt().
- We have now registered Rabbit with epmd, but when our Erlang shell exits it will unregister. Therefore, we need to do a little bit more work.
- With the other prompt still open and ready for return, start the RabbitMq command prompt.
- The following will spawn an Erlang process that waits 10s before registering RabbitMq with epmd.
- Run the following at the command line (again, adjust the node name and port in the erl_epmd:register_node function call):
- rabbitmqctl eval "spawn(fun() -> timer:sleep(10000), erl_epmd:register_node(rabbit, 25672) end)."
- Quickly hit enter in the Erlang shell (you have 10s due to the 10000 ms parameter).
- Now after 10s you should be able to log out, log back in and get rabbitmqctl to work.
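Once the 10s have passed, you can confirm that the registration stuck without waiting for the next logoff. This is a small sketch in Python, assuming the PORT_PLEASE2_REQ and PORT2_RESP messages from the Erlang distribution protocol documentation (request tag 122, response tag 119, a result byte of 0 meaning the node is known). The function names are invented for this example; rabbit and 25672 are the values used in the steps above.

```python
import socket
import struct

PORT_PLEASE2_REQ = 122  # ASCII 'z'
PORT2_RESP = 119        # ASCII 'w'


def build_port_request(node: str) -> bytes:
    # 2-byte big-endian length, then the request tag and the node name.
    payload = bytes([PORT_PLEASE2_REQ]) + node.encode("latin-1")
    return struct.pack(">H", len(payload)) + payload


def parse_port_response(data: bytes):
    """Return the registered port, or None if epmd does not know the node."""
    if len(data) < 2 or data[0] != PORT2_RESP:
        raise ValueError("not a PORT2_RESP message")
    if data[1] != 0:
        return None  # non-zero result byte: node is not registered
    if len(data) < 4:
        raise ValueError("truncated PORT2_RESP message")
    (port,) = struct.unpack(">H", data[2:4])
    return port


def lookup(node="rabbit", host="localhost", epmd_port=4369, timeout=5.0):
    # epmd closes the socket after answering, so read until EOF.
    with socket.create_connection((host, epmd_port), timeout=timeout) as sock:
        sock.sendall(build_port_request(node))
        data = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_port_response(data)
```

With the values from this post, lookup("rabbit") should return 25672 after the fix and None before it.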
Preventing it happening
This is a bug in RabbitMq on Windows and will be fixed in a future release. The bug number for the release notes is 26426.
Simon suggested the fix will be to stop rabbitmqctl and rabbitmq-plugins from starting epmd if it is not already running.