
RabbitmqHA

  • Launchpad Entry: NovaSpec:tbd
  • Created: 19 October 2010
  • Last updated: 19 October 2010
  • Contributors: Armando Migliaccio

Summary

This specification covers how Nova supports RabbitMQ configurations such as clustering and active/passive replication.

Release Note

In the Austin release, Nova's RPC mappings deal only with intermittent network connectivity. In order to support RabbitMQ clusters and active/passive brokers, more advanced Nova RPC mappings need to be provided, such as strategies for dealing with failures of the nodes holding queues within a cluster, and with master/slave fail-over for active/passive replication.
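
As an illustration of the kind of fail-over strategy involved, the sketch below simply tries a list of candidate broker addresses in turn, with a bounded number of connection retries per broker. It uses the kombu messaging library purely as an example; the broker URLs, retry counts and helper name are assumptions made for illustration, not part of the existing Nova RPC code.

    from kombu import Connection

    # Candidate brokers: e.g. the active and the passive node of a
    # replicated pair, or several nodes of a cluster (illustrative values).
    BROKER_URLS = [
        'amqp://guest:guest@rabbit-a:5672//',
        'amqp://guest:guest@rabbit-b:5672//',
    ]

    def connect_with_failover(urls=BROKER_URLS, retries_per_broker=3):
        """Return a live connection to the first reachable broker."""
        for url in urls:
            conn = Connection(url, connect_timeout=2)
            try:
                conn.ensure_connection(max_retries=retries_per_broker)
                return conn
            except Exception:
                # A real implementation would catch the transport's
                # specific connection errors here.
                conn.release()
        raise RuntimeError('no RabbitMQ broker is reachable')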

Rationale

Currently, the message queue configuration variables are tied to RabbitMQ in nova/flags.py. In particular, only one RabbitMQ host is provided and it is assumed, for simplicity of deployment, that a single instance is up and running. In the face of failures of the RabbitMQ host (e.g. disk or power related), Nova components cannot send or receive messages from the queueing system until it recovers. To provide higher resiliency, RabbitMQ can be run in an active/passive setup, so that persistent messages written to disk on the active node can be recovered by the passive node should the active node fail. If high availability is required, active/passive HA can be achieved with shared disk storage, heartbeat/pacemaker, and possibly a TCP load balancer in front of the service replicas. Although this solution is fully transparent to the client side (Nova API, Scheduler, Compute, etc.), so that no fail-over strategies are required in the Nova RPC mappings, it still represents a bottleneck in the overall architecture, may require expensive hardware to run, and is therefore far from ideal.
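
To make the single-host assumption concrete, the sketch below shows gflags-style flag definitions in the spirit of the RabbitMQ variables exposed through nova/flags.py; the exact names and defaults here are illustrative rather than copied from the Nova source.

    import gflags

    FLAGS = gflags.FLAGS

    # Illustrative flag definitions: note that a single RabbitMQ host is
    # assumed, so there is nowhere to fail over to if it goes down.
    gflags.DEFINE_string('rabbit_host', 'localhost', 'RabbitMQ host')
    gflags.DEFINE_integer('rabbit_port', 5672, 'RabbitMQ port')
    gflags.DEFINE_string('rabbit_userid', 'guest', 'RabbitMQ user id')
    gflags.DEFINE_string('rabbit_password', 'guest', 'RabbitMQ password')
    gflags.DEFINE_string('rabbit_virtual_host', '/', 'RabbitMQ virtual host')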

Another option is RabbitMQ clustering. A RabbitMQ cluster (or broker) is a logical grouping of one or more Erlang nodes, each running the RabbitMQ application and sharing users, virtual hosts, queues, exchanges, bindings, etc. The adoption of a RabbitMQ cluster becomes appealing in the context of virtual appliances, where each appliance is dedicated to a single Nova task (e.g. compute, volume, network, scheduler, api, ...) and also runs an instance of the RabbitMQ server. By clustering all these instances together we would have a single cluster spanning the whole deployment, providing the following benefits:

  • no single point of failure
  • no requirement of expensive hardware
  • no requirement of separate appliances/hosts to run RabbitMQ
  • RabbitMQ becomes 'hidden' in the deployment

However, there is a problem: all data/state required for the operation of a RabbitMQ broker is replicated across all nodes, for reliability and scaling, with full ACID properties. The exception is message queues, which in the current RabbitMQ release reside only on the node that created them, although they are visible and reachable from all nodes. For this reason the RabbitMQ development team recommends clusters for scalability rather than high availability; nevertheless, this choice still looks appealing for the following reasons (one possible strategy for coping with the loss of a queue-holding node is sketched after the list):

  • reason1
  • reason2
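
The sketch below illustrates one possible strategy for coping with the loss of the node holding a queue: on a connection failure the consumer moves to another cluster node and re-declares its queue there. Again this is only an illustrative sketch using the kombu library; the exchange and queue names, the URL list and the broad error handling are assumptions, not the actual Nova RPC mappings.

    from kombu import Connection, Exchange, Queue

    # Illustrative names; the real Nova exchange/queue names may differ.
    nova_exchange = Exchange('nova', type='topic', durable=False)
    compute_queue = Queue('compute', exchange=nova_exchange,
                          routing_key='compute', durable=False)

    def consume_with_redeclare(broker_urls, on_message):
        """Keep consuming; if the node holding the queue dies, connect to
        another cluster node and re-declare the queue there."""
        while True:
            for url in broker_urls:
                try:
                    with Connection(url, connect_timeout=2) as conn:
                        # The Consumer declares the queue on whichever node
                        # we reached (auto_declare), effectively re-creating
                        # it after the original owner node has died.
                        with conn.Consumer(compute_queue,
                                           callbacks=[on_message]):
                            while True:
                                conn.drain_events()
                except Exception:
                    # A real implementation would catch the transport's
                    # connection errors specifically.
                    continue

Note that messages buffered on the failed node are lost with this approach; handling that loss is part of what the more advanced RPC mappings would need to address.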

User stories

Assumptions

Design

Implementation

Code Changes