
Hard Lesson: RabbitMQ Clustering

A while back I provisioned two Debian Linux VMs. The purpose was to spike RabbitMQ clustering and HA options. After setting up some HA policies and doing random node-failure testing, the cluster was stable enough to use in Test.
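The post doesn't record the exact policy I used, but a typical "mirror everything" setup along the lines of the ha.html doc looks roughly like this (node names, policy name and queue pattern below are placeholders, not the real cluster):

    # On the second node: join it to the first
    rabbitmqctl stop_app
    rabbitmqctl join_cluster rabbit@node1
    rabbitmqctl start_app

    # Mirror every queue across all nodes and sync new mirrors automatically
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all","ha-sync-mode":"automatic"}'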

An issue arose when an application could no longer save data. An error was logged BUT the page would just hang forever – no yellow screen of death.

After looking at the logs and isolating the code path, the cause was apparent: if the code failed to insert a record, a message was published to the RabbitMQ broker – but the call to the publish operation never returned.
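That hang is exactly what a blocked publisher looks like from the client side. As a rough sketch (this is Python with pika, not the actual application code, and the host, queue name and timeout are assumptions), a client can at least fail fast instead of hanging forever by setting a blocked-connection timeout:

    import pika

    # blocked_connection_timeout tears the connection down if the broker keeps
    # it blocked (e.g. because of a memory alarm) for longer than the given
    # number of seconds, instead of letting the publish hang indefinitely.
    params = pika.ConnectionParameters(
        host="rabbit-node1",            # assumption: broker host name
        blocked_connection_timeout=30,  # assumption: arbitrary 30s limit
    )

    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.queue_declare(queue="failed-inserts", durable=True)

    try:
        channel.basic_publish(
            exchange="",
            routing_key="failed-inserts",
            body=b"record insert failed",
        )
    except pika.exceptions.AMQPConnectionError:
        # The broker kept the connection blocked past the timeout and it was
        # torn down; log and move on rather than hanging the web request.
        print("publish failed: connection blocked/closed by broker")
    finally:
        if connection.is_open:
            connection.close()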

OK, time to have a look at the Rabbit cluster.

[Screenshot: cluster node overview]

It looks like one of the nodes has run out of memory.

A quick look at the Queues screen is misleading. Below, it looks like everything is OK – we have our +1 nodes and everything is running. So why the failure?

[Screenshot: Queues view]

A look at the Connections screen shows us the truth: blocking or blocked connections. Why?

[Screenshot: Connections view showing blocked/blocking connections]

I would later find out after re-reading the doco…

By default, when the RabbitMQ server uses above 40% of the installed RAM, it raises a memory alarm and blocks all connections that are publishing messages.
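That 40% threshold is the default vm_memory_high_watermark, and it is configurable. A sketch of what that looks like in the classic Erlang-term config format (the 0.4 shown is just the default, not a value from this cluster):

    %% /etc/rabbitmq/rabbitmq.config
    %% Raise the memory alarm when the node uses more than 40% of installed RAM.
    [
      {rabbit, [
        {vm_memory_high_watermark, 0.4}
      ]}
    ].

Newer releases express the same thing as vm_memory_high_watermark.relative = 0.4 in rabbitmq.conf.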

Time to fix the bad node. A quick recycle later and…

[Screenshot: after the recycle]

No good – now the PRIMARY node has gone bad?

Was this the case all along? Had the stats been messed up? Not sure – time to recycle the PRIMARY.

[Screenshot: all nodes healthy]

All good.

Lessons

The protocol you use for distributed operations always needs extra special attention

I read the docs. I ran the spike. I had a decent understanding of the complexity. I had simulated random node failure under high message throughput, and it acted as expected. It turns out you are only as strong as your weakest node (under my HA config).

Not proactively setting up monitoring hurts

I was aware of the memory alarm and its implications, but I only found out once the system had failed. An extra couple of minutes setting up external checks on the memory alarm could have warned me before the system failed.
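For example (a minimal sketch, assuming the management plugin on its default port 15672 and default credentials – none of which are from the original setup), an external check can poll the HTTP API and alert as soon as any node reports a memory alarm:

    import requests

    # Assumptions: management plugin on the default port, default guest/guest
    # credentials, and a node reachable as "rabbit-node1".
    API = "http://rabbit-node1:15672/api/nodes"

    def nodes_in_memory_alarm():
        """Return the names of cluster nodes currently reporting a memory alarm."""
        resp = requests.get(API, auth=("guest", "guest"), timeout=5)
        resp.raise_for_status()
        return [n["name"] for n in resp.json() if n.get("mem_alarm")]

    if __name__ == "__main__":
        alarmed = nodes_in_memory_alarm()
        if alarmed:
            # Hook this into whatever alerting already exists.
            print("MEMORY ALARM on:", ", ".join(alarmed))
        else:
            print("all nodes OK")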

Stats can be wrong

I did not anticipate that the stats reported in the management console could be wrong or misleading – e.g. the PRIMARY node's memory was green until the SECONDARY went down. I need to look further into this.
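One way to sanity-check the console (again a sketch, with the node name assumed) is to ask each node directly rather than trusting the aggregated stats database:

    # Run against each node in turn; memory use and any raised alarms are
    # reported by the node itself rather than by the management stats DB.
    rabbitmqctl -n rabbit@node1 status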

Closing

This stuff can get messy:

  • Read the docs
  • And then read them again
  • Know the ins and outs of HA/clustering and design a protocol per application
  • Monitor Proactively
  • Don’t always trust the stats

Links

https://www.rabbitmq.com/ha.html

https://www.rabbitmq.com/production-checklist.html

https://www.rabbitmq.com/memory.html
