HAGuideImprovements/TOC

(Moved original to the bottom of the page for reference.)

Proposed Revision

Strategy and assumptions:

Audience is people who have some experience installing OpenStack, not first-time users
Focus on installation of OpenStack core services
Structure the guide sequentially -- the steps to take in a reasonable order
Avoid redundancy with the Install Guide; for steps that are identical for HA and non-HA installations, link to appropriate sections in the Install Guide
One guide for all Linux distros/platforms
Emphasize a reasonable, standard deployment based on open source components. We can provide some notes about alternatives as appropriate (for example, using a commercial load-balancer might be a better alternative than relying on HAProxy) and perhaps a link to the OpenStack Marketplace.

Structure/Outline

HA Intro and Concepts

Redundancy and failover
Stateless/stateful, active/passive, active/active (Keep: http://docs.openstack.org/high-availability-guide/content/stateless-vs-stateful.html)
Quorums; many services should use an odd number of nodes equal to or greater than 3
Single-controller HA mode and scaling up to 3 or more
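
The odd-number guidance above follows from majority voting. A minimal sketch of the arithmetic, assuming a simple strict-majority quorum (the function names are illustrative and do not come from any cluster tool):

```python
def quorum_size(nodes):
    """Smallest number of votes that forms a strict majority."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """Nodes that can fail while the survivors still hold a majority."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

Under this assumption, 3 and 4 nodes both tolerate one failure and 5 nodes tolerate two, so the fourth node adds cost without adding resilience.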

Hardware setup

Minimal Architecture Example -- Network Layout, styled as in http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html#basics-prerequisites for easy comparison

Prerequisites

Link to Install Guide: Install O/S on each node
Install pacemaker, crmsh, corosync, cluster-glue, fence-agents (Fedora only), resource-agents. (Modify: http://docs.openstack.org/high-availability-guide/content/_install_packages.html)
What is needed for LSB/upstart/systemd alternative to OCF scripts (RA) for Pacemaker? See https://bugs.launchpad.net/openstack-manuals/+bug/1349398
Set up and start Corosync and Pacemaker. Stick with 'crm' tool for Ubuntu/Debian and 'pcs' for RHEL/Fedora (Modify http://docs.openstack.org/high-availability-guide/content/_set_up_corosync.html; Modify: http://docs.openstack.org/high-availability-guide/content/_start_pacemaker.html)
Set basic cluster properties (Modify: http://docs.openstack.org/high-availability-guide/content/_set_basic_cluster_properties.html)
Configure fencing for Pacemaker cluster (Links to http://clusterlabs.org/doc/)
Configure the VIP (Keep: http://docs.openstack.org/high-availability-guide/content/s-api-vip.html )
API services -- Do those belong here or in specific sections? (Modify Glance API: http://docs.openstack.org/high-availability-guide/content/s-glance-api.html and Modify Cinder API: http://docs.openstack.org/high-availability-guide/content/s-cinder-api.html )
Schedulers
Memcached service on Controllers (Keep: http://docs.openstack.org/high-availability-guide/content/_memcached.html , which links to http://code.google.com/p/memcached/wiki/NewStart for specifics)

Configure networking on each node

Rather than configuring neutron here, we should simply mention physical network HA methods (e.g., bonding) and additional node/network requirements for L3HA and DVR for planning purposes. At this point, the networking guide likely won't cover the former.
Link to Networking Guide
(Neutron agents should be described for active/active; deprecate the single-agent-instance case)
For Kilo and beyond, focus on L3HA and DVR

Install and Configure MySQL

Two nodes plus GARBD.
MySQL with Galera
Pacemaker multistate clone resource for Galera cluster
Pacemaker resource agent for Galera cluster management
Deprecate MySQL DRBD configuration because of split-brain issues
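
For the Galera items above, every member needs the same list of peers. A minimal sketch that builds the value of Galera's wsrep_cluster_address option, assuming two data nodes at hypothetical addresses (the helper name is illustrative only):

```python
def gcomm_address(nodes):
    """Build a gcomm:// URL listing every Galera cluster member."""
    if not nodes:
        raise ValueError("at least one node address is required")
    return "gcomm://" + ",".join(nodes)


# Hypothetical addresses for the two data nodes.
DATA_NODES = ["192.168.1.11", "192.168.1.12"]

if __name__ == "__main__":
    print('wsrep_cluster_address="{}"'.format(gcomm_address(DATA_NODES)))
```

The arbitrator (GARBD) would join the same group as a third voting member without storing data, which is what gives a two-node cluster an odd number of votes.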

RabbitMQ Message broker

Install and configure message broker on Controller; see http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html#basics-prerequisites
Oslo messaging for active/active

I think services need some special configuration with more than two nodes?

No need for active/passive AMQP; Two-node active/active cluster with mirrored queues instead
Pacemaker multistate clone resource for RabbitMQ cluster
Pacemaker resource agent for RabbitMQ cluster management
Deprecate DRBD for RabbitMQ
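
On the question of service configuration for clustered brokers: a minimal sketch of the client-side settings, assuming the Juno-era option names rabbit_hosts and rabbit_ha_queues (later releases may place or name these differently, so treat the names as an assumption to verify). Host names are hypothetical.

```python
def rabbit_options(hosts, port=5672):
    """Return service options that list every broker in the cluster."""
    if not hosts:
        raise ValueError("at least one broker host is required")
    host_list = ",".join("{}:{}".format(host, port) for host in hosts)
    return {
        "rabbit_hosts": host_list,    # every cluster member, host:port
        "rabbit_ha_queues": "true",   # use mirrored queues
    }


if __name__ == "__main__":
    # Hypothetical controller host names.
    for key, value in rabbit_options(["controller1", "controller2"]).items():
        print("{} = {}".format(key, value))
```

With this approach, adding a third broker only means extending the host list; no per-service logic changes.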

Memcached

Does this go here or in "Prerequisites" section above?

I think Oslo supports hash synchronization so this shouldn't take more than load balancing.
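
To illustrate why Memcached needs no server-side clustering when clients hash keys consistently, here is a minimal sketch of client-side key distribution. This is a generic illustration, not Oslo's actual algorithm; the function name and server list are hypothetical.

```python
import hashlib


def pick_server(key, servers):
    """Map a cache key to one server; every client computes the same choice."""
    if not servers:
        raise ValueError("at least one server is required")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    # Hypothetical Memcached endpoints, one per controller.
    servers = ["controller1:11211", "controller2:11211", "controller3:11211"]
    print(pick_server("token-abc123", servers))
```

Because the mapping depends only on the key and the server list, each service instance reaches the same Memcached node for a given key, as long as all instances are configured with the same list in the same order.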

NTP

Run NTP servers on every controller and configure other nodes to use all of them for synchronization. Link to http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html#basics-ntp
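
A minimal sketch of the client side of this setup: generating ntp.conf server lines so that each non-controller node synchronizes against every controller. The controller host names are hypothetical.

```python
def ntp_server_lines(controllers):
    """Return one ntp.conf 'server' line per controller."""
    return ["server {} iburst".format(host) for host in controllers]


if __name__ == "__main__":
    # Hypothetical controller host names.
    for line in ntp_server_lines(["controller1", "controller2", "controller3"]):
        print(line)
```

Listing all controllers means the loss of any single one does not leave the other nodes without a time source.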

Keystone Identity services

Install Guide for concepts: http://docs.openstack.org/juno/install-guide/install/apt/content/keystone-concepts.html
Install Guide to configure prerequisites, install and configure the components, and finalize the installation: http://docs.openstack.org/juno/install-guide/install/apt/content/keystone-install.html
Configure Keystone for HA MySQL and HA RabbitMQ
Add Keystone resource to Pacemaker
Change bind parameters in keystone.conf
Configure OpenStack services to use HA Keystone
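
As one illustration of pointing Keystone at the HA database, the sketch below renders a [database] section whose connection URL targets the virtual IP rather than a single node. The VIP, credentials, and helper name are hypothetical; the URL form follows the style used in the Juno Install Guide.

```python
import configparser
import io
from urllib.parse import quote


def keystone_db_config(vip, password, user="keystone", database="keystone"):
    """Render a [database] section that connects through the cluster VIP."""
    # Interpolation is disabled so '%' in an escaped password is kept literally.
    config = configparser.ConfigParser(interpolation=None)
    url = "mysql://{}:{}@{}/{}".format(user, quote(password, safe=""), vip, database)
    config["database"] = {"connection": url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical VIP and password.
    print(keystone_db_config("192.168.42.101", "KEYSTONE_DBPASS"))
```

The same pattern would apply to the other services that need to reach MySQL through the VIP.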

Glance image service

Install Guide for basics (http://docs.openstack.org/juno/install-guide/install/apt/content/ch_keystone.html )
Configure Glance for HA MySQL and HA RabbitMQ
Add OpenStack Image API resource to Pacemaker, Configure OpenStack Image Service API, Configure OpenStack services to use HA Image API (Modify: http://docs.openstack.org/high-availability-guide/content/s-keystone.html )
Should Glance use a redundant storage backend such as Swift?

Cinder Block Storage Service

Install Guide for basic installation
The installation guide covers one API/scheduler node and one volume node.
Add API/scheduler redundancy and multiple volume nodes.
Discuss availability zones?
Do we need to use Ceph as the storage backend to have data redundancy? We should support at least one open source option such as Ceph, and perhaps NFS, and simply mention other options such as NetApp.

Swift Object Storage

Install Guide for basic installation
The installation guide covers basic storage node redundancy, but only deploys one proxy server. Do we want to discuss the process of adding proxy servers and load balancing them? Also, what about adding storage nodes and perhaps discussing regions/zones?

Nova compute service

Install Guide for basic setup
The installation guide covers multiple compute nodes, but only deploys one instance of API and other services. We should discuss the process of deploying multiple instances of the latter.

Heat Orchestration

Install Guide for basic installation
Add API redundancy
How to set up the deployment so that VMs on a failed compute node are quickly migrated to other compute nodes

Ceilometer Telemetry and MongoDB

Install Guide for basic installation
Need one MongoDB node for each Controller node

Database Service (Trove)

Install Guide for basics
Need details about how to apply HA

Sahara

Install Guide for basics (http://docs.openstack.org/juno/install-guide/install/apt/content/ch_sahara.html )
Should link to Sahara docs for discussion of OpenStack HA versus Hadoop HA and how they work together, although the installation instructions at http://docs.openstack.org/developer/sahara/userdoc/installation.guide.html do not currently mention HA

Other

Configure Pacemaker service group to ensure that the VIP is linked to the API services resource
Systemd alternative to OCF scripts for Pacemaker RA
MariaDB with Galera alternative to MySQL
Install and configure HAProxy for API services and MySQL with Galera cluster load balancing
Mention value of redundant hardware load balancers for stateless services such as REST APIs
Describe scaling single node to 3 nodes HA
Ceph?
Murano?
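
For the HAProxy item in the list above, a minimal sketch that renders one listen stanza balancing an API service across controllers behind the VIP. The service name, addresses, and port are hypothetical, and a real deployment would need additional options (timeouts, health-check tuning) that are not shown.

```python
def haproxy_listen(name, vip, port, backends):
    """Render an HAProxy 'listen' stanza for one load-balanced service."""
    lines = [
        "listen {}".format(name),
        "    bind {}:{}".format(vip, port),
        "    balance roundrobin",
    ]
    for server_name, address in backends:
        # 'check' enables HAProxy health checks for this backend.
        lines.append("    server {} {}:{} check".format(server_name, address, port))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical VIP and controller addresses.
    print(haproxy_listen(
        "keystone_public", "192.168.42.100", 5000,
        [("controller1", "192.168.42.11"), ("controller2", "192.168.42.12")],
    ))
```

Generating one stanza per API service from a single backend list keeps the load balancer configuration consistent as controllers are added.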

Original for reference

NOTE: This is the original for us to depart from.

I. Introduction to OpenStack High Availability

Stateless vs. Stateful services
Active/Passive
Active/Active

II. HA Using Active/Passive

1. The Pacemaker Cluster Stack

Installing Packages
Setting up Corosync
Starting Corosync
Starting Pacemaker
Setting basic cluster properties

2. Cloud Controller Cluster Stack

Highly available MySQL
Highly available RabbitMQ

3. API Node Cluster Stack

Configure the VIP
Highly available OpenStack Identity
Highly available OpenStack Image API
Highly available Cinder API
Highly available OpenStack Networking Server
Highly available Ceilometer Central Agent
Configure Pacemaker Group

4. Network Controller Cluster Stack

Highly available Neutron L3 Agent
Highly available Neutron DHCP Agent
Highly available Neutron Metadata Agent
Manage network resources

III. HA Using Active/Active

5. Database

MySQL with Galera
Galera Monitoring Scripts
Other ways to provide a Highly Available database

6. RabbitMQ

Install RabbitMQ
Configure RabbitMQ
Configure OpenStack Services to use RabbitMQ

7. HAproxy Nodes

8. OpenStack Controller Nodes

Running OpenStack API & schedulers
Memcached

9. OpenStack Network Nodes

Running Neutron DHCP Agent
Running Neutron L3 Agent
Running Neutron Metadata Agent