Hard Lesson: RabbitMQ Clustering
A while back I provisioned two Debian Linux VMs. The purpose was to spike RabbitMQ clustering and HA options. After setting up some HA policies and doing random node-failure testing, the cluster was stable enough to use in Test.
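The HA policies can be pushed through rabbitmqctl or the management HTTP API. A rough sketch of the API route in Python, assuming the rabbitmq_management plugin is listening on localhost:15672 with the default guest credentials; the policy name and queue pattern here are made up for illustration:

```python
# Sketch: create a classic queue-mirroring policy via the management HTTP API.
# Assumes rabbitmq_management on localhost:15672 and guest/guest credentials
# (which only work from localhost). Policy name and pattern are illustrative.
import base64
import json
import urllib.request

policy = {
    "pattern": "^ha\\.",                  # apply to queues named ha.*
    "definition": {
        "ha-mode": "all",                 # mirror to every node in the cluster
        "ha-sync-mode": "automatic",      # sync new mirrors automatically
    },
    "apply-to": "queues",
    "priority": 0,
}

req = urllib.request.Request(
    "http://localhost:15672/api/policies/%2F/ha-all",   # %2F = default vhost "/"
    data=json.dumps(policy).encode(),
    method="PUT",
    headers={
        "Content-Type": "application/json",
        "Authorization": "Basic " + base64.b64encode(b"guest:guest").decode(),
    },
)

with urllib.request.urlopen(req) as resp:
    print("Policy created, HTTP", resp.status)
```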
An issue arose when an application could no longer save data. An error was logged, but the page would just hang forever. No yellow screen of death.
After looking at the logs and isolating the code path, the cause was apparent. If the code failed to insert a record, a message was published to the RabbitMQ broker. But the call to the publish operation never returned.
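In hindsight the hang makes sense: once a node raises a resource alarm, the broker stops reading from connections that publish, so a blocking publish just sits there. A rough sketch of the failure mode and one guard against it, using the Python pika client; the hostname, queue name and timeout are made up, not from the original application:

```python
# Sketch: a blocking publish can stall forever while the broker has a
# resource alarm raised, because the broker stops reading from publishers.
# blocked_connection_timeout makes pika tear the connection down and raise
# a connection error instead of hanging the calling thread indefinitely.
import pika

params = pika.ConnectionParameters(
    host="rabbit-node-1",                 # hypothetical broker host
    blocked_connection_timeout=30,        # give up after 30s of being blocked
)
connection = pika.BlockingConnection(params)

# Log the broker's Connection.Blocked / Unblocked notifications as they arrive.
connection.add_on_connection_blocked_callback(
    lambda conn, method: print("Broker blocked us:", method.method.reason))
connection.add_on_connection_unblocked_callback(
    lambda conn, method: print("Broker unblocked us"))

channel = connection.channel()
try:
    channel.basic_publish(exchange="",
                          routing_key="failed-inserts",   # hypothetical queue
                          body=b"record insert failed")
except pika.exceptions.AMQPConnectionError:
    # Raised when the blocked-connection timeout tears the connection down;
    # surface it to the caller instead of hanging the web request.
    print("Publish failed: broker is blocking publishers (resource alarm?)")
finally:
    if connection.is_open:
        connection.close()
```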
OK, time to have a look at the RabbitMQ cluster.
It looks like one of the nodes has run out of memory.
A quick look at the Queues page is misleading. Below, it looks like everything is OK: we have our +1 nodes and everything is running. So why the failure?
A look at the Connections screen shows us the truth: blocking or blocked connections. Why?
I would later find out after re-reading the doco…
By default, when the RabbitMQ server uses above 40% of the installed RAM, it raises a memory alarm and blocks all connections that are publishing messages.
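That threshold is the vm_memory_high_watermark setting. For reference, something along these lines controls it in rabbitmq.conf (0.4 shown because it is the default; the snippet is illustrative, not a recommendation to raise it):

```
# rabbitmq.conf (new-style config format)
# Fraction of detected RAM at which the memory alarm fires and
# publishing connections get blocked. 0.4 is the default.
vm_memory_high_watermark.relative = 0.4
```

It can also be changed at runtime with rabbitmqctl set_vm_memory_high_watermark, though that does not survive a node restart.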
Time to fix the bad node. A quick recycle later and…
No good. Now the PRIMARY node has gone bad?
Was this the case all along? Had the stats been messed up? Not sure. Time to recycle the PRIMARY.
All good.
Lessons
The protocol you use for distributed operations always needs extra special attention
I read the docs. I ran the spike. I had a decent understanding of the complexity. I had simulated random node failure under high message throughput and it acted as expected. Turns out you are only as strong as your weakest node (under my HA config).
Not proactively setting up monitoring hurts
I was aware of the memory alarm and its implications, but I only knew about it once the system had failed. An extra couple of minutes setting up external checks on the memory alarm could have helped before the system failed.
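Something like the sketch below would have done it: poll each node's memory-alarm flag through the management HTTP API and alert when it flips. The hostnames and credentials are made up, and it assumes the rabbitmq_management plugin is enabled with a user that has monitoring permissions:

```python
# Sketch: external check of the per-node memory/disk alarms via the
# management HTTP API. Hostnames and credentials are illustrative.
import base64
import json
import urllib.request

NODES = ["rabbit-node-1", "rabbit-node-2"]    # hypothetical cluster members
AUTH = "Basic " + base64.b64encode(b"monitor:secret").decode()

def cluster_stats(host):
    """Ask one node for the stats of every node it knows about."""
    req = urllib.request.Request(
        "http://%s:15672/api/nodes" % host,
        headers={"Authorization": AUTH})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

# Ask every node, not just one: a node reporting stale or wrong stats is
# exactly the situation this check is meant to catch.
for host in NODES:
    try:
        for node in cluster_stats(host):
            if node.get("mem_alarm") or node.get("disk_free_alarm"):
                print("ALARM reported by %s for %s" % (host, node["name"]))
            else:
                print("OK (%s): %s mem %s / limit %s" %
                      (host, node["name"], node.get("mem_used"), node.get("mem_limit")))
    except Exception as exc:
        # An unreachable node deserves a page too.
        print("UNREACHABLE %s: %s" % (host, exc))
```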
Stats can be wrong
I did not anticipate that the stats reported in the management console could be wrong or misleading, e.g. the PRIMARY node's memory was green until the SECONDARY went down. I need to look further into this.
Closing
This stuff can get messy:
- Read the docs
- And then read them again
- Know the ins and outs of HA/clustering and design a protocol per application.
- Monitor Proactively
- Don’t always trust the stats
Links
https://www.rabbitmq.com/ha.html
https://www.rabbitmq.com/production-checklist.html
https://www.rabbitmq.com/memory.html