RabbitmqHA

Revision as of 16:05, 19 October 2010

  • Launchpad Entry: NovaSpec:tbd
  • Created: 19 October 2010
  • Last updated: 19 October 2010
  • Contributors: Armando Migliaccio

Summary

This specification covers how Nova supports RabbitMQ configurations like clustering and active/passive replication.

Release Note

The Austin release of the Nova RPC mappings deals only with intermittent network connectivity. In order to support RabbitMQ clusters and active/passive brokers, more advanced Nova RPC mappings need to be provided, such as strategies to deal with the failure of a cluster node holding queues and with master/slave fail-over in an active/passive replication setup.
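
To make the kind of fail-over strategy meant here concrete, the sketch below (an illustration only, not part of this specification) cycles through a list of broker addresses with the pika client and re-declares a durable queue after reconnecting; the host names, the queue name, and the retry policy are assumptions.

  # Illustrative sketch: client-side fail-over across a list of RabbitMQ
  # brokers. Host names, the queue name, and the retry policy are
  # hypothetical, not defined by this spec.
  import time
  import pika

  BROKERS = ['rabbit-a.example.com', 'rabbit-b.example.com']  # hypothetical hosts

  def connect_with_failover(retry_interval=2):
      """Try each broker in turn until one accepts the connection."""
      while True:
          for host in BROKERS:
              try:
                  conn = pika.BlockingConnection(pika.ConnectionParameters(host=host))
                  channel = conn.channel()
                  # Durable queues survive a broker restart, but after a
                  # fail-over the client must still re-declare and re-subscribe.
                  channel.queue_declare(queue='scheduler', durable=True)
                  return conn, channel
              except pika.exceptions.AMQPConnectionError:
                  continue  # try the next broker in the list
          time.sleep(retry_interval)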

Rationale

Currently, the message queue configuration variables are tied to RabbitMQ in nova/flags.py. In particular, only one RabbitMQ host is provided, and it is assumed, for simplicity of deployment, that a single instance is up and running. In the face of failures (e.g. disk or power related) of the RabbitMQ host, Nova components cannot send or receive messages from the queueing system until it recovers. To provide higher resiliency, RabbitMQ can be made to work in an active/passive setup, such that persistent messages that have been written to disk on the active node can be recovered by the passive node should the active node fail. If high availability is required, active/passive HA can be achieved by using shared disk storage, heartbeat/pacemaker, and possibly a TCP load-balancer in front of the service replicas. Although this solution requires the least amount of development effort on the client side (e.g. Nova API, Scheduler, Compute), it still represents a bottleneck in the overall architecture and may require expensive hardware.

However, RabbitMQ clustering and active/passive replication may provide a greater degree of scalability and/or fault tolerance, provided that support is added to the Nova RPC layer to make sure that proper fail-over strategies are implemented.
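
As an illustration of the configuration change implied above (assuming python-gflags style definitions similar to those used by nova/flags.py), the single-host variable could be generalized to a list of host:port pairs; the rabbit_hosts flag name and its default are hypothetical.

  # Sketch only: generalizing the single RabbitMQ host setting to a list so
  # that components can try an alternate broker. The rabbit_hosts flag is
  # hypothetical and not currently defined by Nova.
  import gflags

  FLAGS = gflags.FLAGS
  gflags.DEFINE_string('rabbit_host', 'localhost', 'single RabbitMQ host (current behaviour)')
  gflags.DEFINE_integer('rabbit_port', 5672, 'RabbitMQ port')
  gflags.DEFINE_list('rabbit_hosts', ['localhost:5672'],
                     'hypothetical: ordered list of host:port pairs to try')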

User stories

Assumptions

Design

Implementation

Code Changes

Code changes should include an overview of what needs to change, and in some cases even the specific details.
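
One possible shape for such a change, sketched here purely for illustration and not as the approach this specification mandates, is a retry wrapper that the RPC layer applies around publish and consume calls so that a broker fail-over is retried transparently; the reconnect() method and the exception handled below are hypothetical.

  # Hypothetical sketch: retry an AMQP operation, re-establishing the
  # connection between attempts. reconnect() is assumed to re-dial the
  # (possibly replicated) broker; it is not an existing Nova method.
  import functools
  import logging
  import time

  LOG = logging.getLogger(__name__)

  def with_reconnect(max_retries=3, delay=1):
      def decorator(func):
          @functools.wraps(func)
          def wrapper(self, *args, **kwargs):
              last_exc = None
              for attempt in range(1, max_retries + 1):
                  try:
                      return func(self, *args, **kwargs)
                  except ConnectionError as exc:  # illustrative failure mode
                      last_exc = exc
                      LOG.warning('AMQP call failed (attempt %d/%d), reconnecting',
                                  attempt, max_retries)
                      self.reconnect()  # hypothetical helper that re-dials the broker(s)
                      time.sleep(delay)
              raise last_exc
          return wrapper
      return decorator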