Large Scale Configuration of Rabbit

Introduction

The following information is mostly taken from a discussion we had on the mailing list. You can read this discussion here:

http://lists.openstack.org/pipermail/openstack-discuss/2020-August/thread.html#16362


Clustering or not clustering?

When deploying RabbitMQ, you have two possibilities:

  • Deploy rabbit in a cluster
  • Deploy only one rabbit node

Deploying only one rabbit node can be seen as dangerous, mostly because if the node is down, your service is also down.

On the other hand, clustering rabbit has some downsides that make it harder to configure and manage.

So, if your cluster is less reliable than a single node, the single node solution is better.

Moreover, as many OpenStack services use RabbitMQ for internal communication (a.k.a. RPC), having a highly available rabbit solution is a must-have.

If you choose the clustering mode, you should always keep an odd number of servers in the cluster (like 3 / 5 / 7, etc) to avoid split-brain issues.

One rabbit to rule them all?

You can also consider deploying rabbit in two ways:

  • one rabbit (cluster or not) for each OpenStack service
  • one big rabbit (cluster or not) for all OpenStack services

There is no recommendation on that part, except that if you split your rabbit across multiple services, you will, for sure, reduce the risk: a rabbit failure then affects a single service instead of all of them.


Which version of rabbit should I run?

You should always consider running the latest version of rabbit.

We also know that rabbit before 3.8 may have some issues on the clustering side, so you should consider running at least rabbitmq 3.8.x.

See https://groups.google.com/forum/#!newtopic/rabbitmq-users/rabbitmq-users/zFhmpHF2aWk

Rabbit config recommendation

Most of the configuration explained in the following parts applies only to rabbitmq in clustering mode.


Service config

When running the rabbit software on a node, you can configure some parameters for it in:

/etc/rabbitmq/rabbitmq.config # (ubuntu)

The most important configuration options are the following:

net_ticktime and heartbeat

See https://www.rabbitmq.com/nettick.html and https://www.rabbitmq.com/heartbeats.html

A node (or a client connection, in the heartbeat case) is considered down after this time has elapsed without contact.

This config mostly depends on the network you have between the nodes.

We consider that the default values are good.
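
If your network does require tuning, both values can be set in rabbitmq.config. A minimal sketch, shown with the upstream defaults (60 seconds each) purely for illustration:

   %% /etc/rabbitmq/rabbitmq.config -- the values below are the defaults
   [
     {kernel, [
       {net_ticktime, 60}   %% seconds between inter-node liveness "ticks"
     ]},
     {rabbit, [
       {heartbeat, 60}      %% seconds for client connection heartbeats
     ]}
   ].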

disk / ram mode

https://www.rabbitmq.com/clustering.html#cluster-node-types

"In the vast majority of cases you want all your nodes to be disk nodes; RAM nodes are a special case that can be used to improve the performance clusters with high queue, exchange, or binding churn. RAM nodes do not provide higher message rates. When in doubt, use disk nodes only."

So we recommend staying with disc nodes.
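
For reference, a node joins a cluster as a disc node by default. A sketch of the join procedure, where rabbit@node1 is a hypothetical existing cluster member:

   # run on the node that is joining
   rabbitmqctl stop_app
   rabbitmqctl join_cluster rabbit@node1   # disc node is the default (--ram would create a RAM node)
   rabbitmqctl start_app
   rabbitmqctl cluster_status              # the node should be listed under "disc"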


Policy

RabbitMQ applies policies to queues and exchanges. See here: https://www.rabbitmq.com/parameters.html

If you plan to deploy a cluster of RabbitMQ, you will have to add a policy.

Remember that Rabbit applies only one policy to a given queue or exchange. So you should avoid having multiple policies in your deployment, or if you do, avoid overlapping policies, because you won't be able to predict which one is effective on a queue.
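
If you suspect overlapping policies, you can check which policy is effective per queue, for example:

   rabbitmqctl list_queues name policy   # shows the policy applied to each queue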

pattern

Policies are applied based on a regex pattern. The pattern we agreed on (from the mailing list discussion) is the following:

'^(?!(amq\.)|(.*_fanout_)|(reply_)).*'

which will set HA on all queues, except the ones that:

  • starts with amq.
  • contains _fanout_
  • starts with reply_

parameters

A policy will apply some parameters to queues / exchanges.

Here are the parameters we recommend when running rabbit in cluster mode (time is in milliseconds as per https://www.rabbitmq.com/ttl.html):

{
   "alternate-exchange": "unroutable",
   "expires": 3600000,
   "ha-mode": "all",
   "ha-promote-on-failure": "always",
   "ha-promote-on-shutdown": "always",
   "ha-sync-mode": "manual",
   "message-ttl": 600000,
   "queue-master-locator": "client-local"
}
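
Putting the pattern and these parameters together, the policy can be installed with rabbitmqctl. A sketch (the policy name "openstack-ha" and the use of the default vhost "/" are our choices, not requirements):

   rabbitmqctl set_policy -p / --apply-to all openstack-ha \
       '^(?!(amq\.)|(.*_fanout_)|(reply_)).*' \
       '{"alternate-exchange": "unroutable", "expires": 3600000, "ha-mode": "all", "ha-promote-on-failure": "always", "ha-promote-on-shutdown": "always", "ha-sync-mode": "manual", "message-ttl": 600000, "queue-master-locator": "client-local"}'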

alternate-exchange

See https://rabbitmq.com/ae.html

This is not mandatory, but a nice-to-have feature that collects "lost" messages from rabbit (the messages that could not be routed to any queue).
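
For this to work, an exchange named "unroutable" (and typically a queue bound to it) has to exist in each vhost. A sketch using rabbitmqadmin from the management plugin (the matching queue name "unroutable" is our choice):

   rabbitmqadmin declare exchange name=unroutable type=fanout durable=true
   rabbitmqadmin declare queue name=unroutable durable=true
   rabbitmqadmin declare binding source=unroutable destination=unroutable destination_type=queue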

expires

Queue expiration period in milliseconds. By default, queues never expire.

So, with the value above (3600000 ms), a queue left without any consumer for 1 hour will be automatically deleted.

ha-mode

See https://www.rabbitmq.com/ha.html#mirroring-arguments

Can be one of:

  • all: queues are mirrored across all nodes
  • exactly: also needs the ha-params "count"; the queue is replicated on "count" nodes
  • nodes: also needs the ha-params "node-names"; the queue is replicated on all nodes listed in "node-names"

We recommend mirroring all queues across all nodes, so a queue created on one node will also be mirrored on the other nodes.


ha-promote-on-failure
  • always: (default) forces the queue master to move to another node if the master dies unexpectedly
  • when-synced: only allows promoting a synchronised mirror; if no mirror is synchronised, the queue stays unavailable until its master comes back

We keep the default here, to make sure that on a failure a new queue master is elected and the queue keeps working.

ha-promote-on-shutdown
  • always: forces the queue master to move to another node when the master is shut down, even if no mirror is synchronised
  • when-synced: (default) only promotes a synchronised mirror on a controlled shutdown

We prefer to have the queue master moved to an unsynchronised mirror in all circumstances (i.e. we choose availability of the queue over avoiding the message loss that an unsynchronised mirror promotion can cause).

ha-sync-mode

See https://www.rabbitmq.com/ha.html#replication-factor

  • automatic: a new mirror synchronises the full queue contents as soon as it joins; the queue is always fully replicated, but synchronisation can block all queue I/O while it runs
  • manual: (default) a new queue mirror only receives new messages (messages already in the queue won't be mirrored until a sync is triggered explicitly)


Using manual is not a big issue for us, as most of the time, OpenStack queues are empty.
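
With manual mode you can still trigger a synchronisation by hand when it matters. A sketch, where "notifications.info" is a hypothetical queue name:

   rabbitmqctl list_queues name slave_pids synchronised_slave_pids   # check mirror state
   rabbitmqctl sync_queue notifications.info                         # explicit one-off sync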


message-ttl

Message TTL in queues.

By default, no TTL.

We recommend setting it to 600000 ms (10 minutes).

This is supposed to be safe because most OpenStack components time out after 300 seconds.

So a message not consumed within 10 minutes will be dropped from the queues.

queue-master-locator

Determines which node hosts the queue master when a queue is declared:

  • client-local: (default) pick the node that the client declaring the queue is connected to
  • min-masters: Pick the node hosting the minimum number of bound masters
  • random

We recommend keeping client-local (the default value).

On OpenStack services

rabbit_ha_queues

You may see this parameter in some of the configuration files.

But it is useless now, because the HA policy above already sets this.

amqp_durable_queues

See here: https://www.rabbitmq.com/queues.html#durability

"In most other cases, durable queues are the recommended option. For replicated queues, the only reasonable option is to use durable queues."

So, because we enabled HA in our policy, we MUST enable durable queues. Set:

amqp_durable_queues = True

in every OpenStack config file.
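
For oslo.messaging based services this option lives in the [oslo_messaging_rabbit] section. A minimal sketch for nova, as an example (the same stanza goes in every other service's config):

   # /etc/nova/nova.conf -- repeat for each OpenStack service
   [oslo_messaging_rabbit]
   amqp_durable_queues = True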

Note that the durability of a queue (or an exchange) cannot be changed AFTER the queue has been created. So if you forgot to set this at the beginning, you will have to delete the existing queues so that OpenStack can recreate them with the correct durability.
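
A sketch of that recovery, assuming a hypothetical queue named "compute":

   # stop the OpenStack service that consumes the queue, then:
   rabbitmqctl delete_queue compute
   # restart the service; it will re-declare the queue, now durable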