

Revision as of 21:30, 9 April 2015 by StackScribe (talk | contribs) (Strategy and assumptions:)

(Moved original to the bottom of the page for reference.)

Proposed Revision

This spec refers to the https://blueprints.launchpad.net/openstack-manuals/+spec/improve-ha-guide blueprint.

Strategy and assumptions:

  1. Audience is people who have some experience installing OpenStack, not first-time users
  2. Focus on installation of OpenStack core services
  3. Structure the guide sequentially -- the steps to take in a reasonable order
  4. Avoid redundancy with the Install Guide; for steps that are identical for HA and non-HA installations, link to appropriate sections in the Install Guide
  5. One guide for all Linux distros/platforms
  6. Emphasize a reasonable, standard deployment based on open source components. We can provide some notes about alternatives as appropriate (for example, using a commercial load-balancer might be a better alternative than relying on HAProxy) and perhaps a link to the OpenStack Marketplace.

Shamail: We had discussed adding comments here regarding A/A and A/P. Should we add configuration for both A/A and A/P as a sub-topic for each component? I would prefer that over making A/A or A/P top-level topics. Thoughts?

mattgriffin: I agree with Shamail. I think detailing A/A & A/P options within each section is the right thing to do. That should fit well with the existing HA Guide content and with bringing it over to this new structure.


HA Intro and Concepts

(Priority: 1)

  1. Redundancy and failover
  2. Stateless/stateful, active/passive, active/active (Keep: http://docs.openstack.org/high-availability-guide/content/stateless-vs-stateful.html)
  3. Quorums; many services should use an odd number of nodes equal to or greater than 3
  4. Single-controller HA mode and scaling up to 3 or more

Storage Backends

(Priority: 1) This section contains more concepts than actual procedures; our expectation is that the specific technologies discussed have their own configuration documentation that can be referenced.

This section describes the data plane (infrastructure) elements that factor into the overall HA capabilities of the storage; in other words, how one ensures that one's data is not lost when systems fail. Topics to be discussed include RAID, erasure coding, etc., along with the protections they do and do not offer.

We will also include a blurb on the options that are available. Finally, we could state that Cinder supports multiple storage providers (Ceph, EMC, NetApp, SolidFire, etc.) and that additional details are available from your storage provider's documentation.

Swift combines control and data plane so we would cover some aspects of both.

Hardware setup

(Priority: 1)

  1. Minimal Architecture Example -- Network Layout, styled as in http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html#basics-prerequisites for easy comparison
  2. Configure networking on each node
  3. Rather than configuring neutron here, we should simply mention physical network HA methods (e.g., bonding) and additional node/network requirements for L3HA and DVR for planning purposes. At this point, the networking guide likely won't cover the former.
  4. Link to Networking Guide
  5. (Neutron agents should be described for active/active; deprecate the single-agent-instance case)
  6. For Kilo and beyond, focus on L3HA and DVR

Basic Environment

(Priority: 1)

  1. Install O/S on each node (link to Install Guide, e.g., http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html)
  2. Install Memcached (Verify that Oslo supports hash synchronization; if so, this should not take more than load balancing.)
  3. Run NTP servers on every controller and configure other nodes to use all of them for synchronization. Link to http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html#basics-ntp

Basic HA facilities

(Priority: 1)

  1. Install pacemaker, crmsh, corosync, cluster-glue, fence-agents (Fedora only), resource-agents. (Modify: http://docs.openstack.org/high-availability-guide/content/_install_packages.html)
  2. What is needed for LSB/upstart/systemd alternative to OCF scripts (RA) for Pacemaker? See https://bugs.launchpad.net/openstack-manuals/+bug/1349398
  3. Set up and start Corosync and Pacemaker. Stick with 'crm' tool for Ubuntu/Debian and 'pcs' for RHEL/Fedora (Modify http://docs.openstack.org/high-availability-guide/content/_set_up_corosync.html; Modify: http://docs.openstack.org/high-availability-guide/content/_start_pacemaker.html)
  4. Set basic cluster properties (Modify: http://docs.openstack.org/high-availability-guide/content/_set_basic_cluster_properties.html)
  5. Configure fencing for Pacemaker cluster (Links to http://clusterlabs.org/doc/)
  6. Configure the VIP (Keep: http://docs.openstack.org/high-availability-guide/content/s-api-vip.html )
  7. API services -- Do those belong here or in specific sections? (Modify Glance API: http://docs.openstack.org/high-availability-guide/content/s-glance-api.html and Modify Cinder API: http://docs.openstack.org/high-availability-guide/content/s-cinder-api.html )
  8. Schedulers
  9. Memcached service on Controllers (Keep: http://docs.openstack.org/high-availability-guide/content/_memcached.html , which links to http://code.google.com/p/memcached/wiki/NewStart for specifics)

Install and Configure MySQL

(Priority: 2)

  1. Two nodes plus GARBD.
  2. MySQL variant with Galera: Cover major options (Galera Cluster for MySQL, Percona XtraDB Cluster, and MariaDB Galera Cluster) and link off to resources to understand installation and initial config options (e.g., SST).
  3. Pacemaker multistate clone resource for Galera cluster
  4. Pacemaker resource agent for Galera cluster management
  5. Deprecate MySQL DRBD configuration because of split-brain issues
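
As a rough illustration of the "two nodes plus GARBD" item, here is a minimal sketch. The hostnames are hypothetical, and the quorum check is a simplified majority rule rather than Galera's actual weighted-quorum logic. It builds the `gcomm://` value used for Galera's `wsrep_cluster_address` setting and shows why the arbitrator's extra vote matters.

```python
def wsrep_cluster_address(hosts):
    """Build the gcomm:// URL used for Galera's wsrep_cluster_address."""
    if not hosts:
        raise ValueError("at least one host is required")
    return "gcomm://" + ",".join(hosts)


def has_quorum(total_votes, reachable_votes):
    """True when the reachable members hold a strict majority of votes."""
    return reachable_votes > total_votes / 2


# Hypothetical hostnames for the two database nodes.
DATA_NODES = ["db1.example.com", "db2.example.com"]

if __name__ == "__main__":
    print("wsrep_cluster_address =", wsrep_cluster_address(DATA_NODES))
    # garbd contributes a vote but stores no data.
    total = len(DATA_NODES) + 1
    # One data node down: the survivor plus garbd still form a majority.
    print("with garbd, one node down:", has_quorum(total, 2))
    # Without garbd, a lone survivor out of two has no majority.
    print("without garbd, one node down:", has_quorum(2, 1))
```

The point of the sketch is only the vote count: with the arbitrator, a two-node database cluster can lose one data node and keep a majority.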

RabbitMQ Message broker

(Priority: 2)

  1. Install and configure message broker on Controller; see http://docs.openstack.org/juno/install-guide/install/apt/content/ch_basic_environment.html#basics-prerequisites
  2. Oslo messaging for active/active
     1. I think services need some special configuration with more than two nodes?
  3. No need for active/passive AMQP; two-node active/active cluster with mirrored queues instead
  4. Pacemaker multistate clone resource for RabbitMQ cluster
  5. Pacemaker resource agent for RabbitMQ cluster management
  6. Deprecate DRBD for RabbitMQ
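
To make the "Oslo messaging for active/active" item more concrete, here is a minimal sketch that renders a configuration fragment listing every broker in the cluster. It assumes the `rabbit_hosts` and `rabbit_ha_queues` option names used by oslo.messaging around the Juno and Kilo releases; the section name, hostnames, and helper function are illustrative assumptions and should be checked against the release being documented.

```python
import configparser
import io


def render_rabbit_options(hosts, port=5672, section="oslo_messaging_rabbit"):
    """Render an INI fragment that lists every broker in the cluster."""
    config = configparser.ConfigParser()
    config[section] = {
        # Comma-separated host:port pairs, one per RabbitMQ node.
        "rabbit_hosts": ",".join(f"{host}:{port}" for host in hosts),
        # Ask clients to use mirrored (HA) queues.
        "rabbit_ha_queues": "true",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical hostnames for a two-node cluster.
    print(render_rabbit_options(["rabbit1", "rabbit2"]))
```

Listing all brokers lets a service fail over to another node on its own, which is why a separate active/passive layer for AMQP may be unnecessary.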

Keystone Identity services

(Priority: 3, Depends-on: infrastructure)

  1. Install Guide for concepts: http://docs.openstack.org/juno/install-guide/install/apt/content/keystone-concepts.html
  2. Install Guide to configure prerequisites, install and configure the components, and finalize the installation: http://docs.openstack.org/juno/install-guide/install/apt/content/keystone-install.html
  3. Configure Keystone for HA MySQL and HA RabbitMQ
  4. Add Keystone resource to Pacemaker
  5. Change bind parameters in keystone.conf
  6. Configure OpenStack services to use HA Keystone
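
The "Configure Keystone for HA MySQL" step largely amounts to pointing the database connection at the cluster's virtual IP rather than at a single node. The following is a minimal sketch; the helper name, credentials, and address are hypothetical, and the URL follows the `mysql://user:password@host/database` form used in the Install Guide of this era.

```python
from urllib.parse import quote


def database_connection(user, password, vip, database):
    """Build a connection URL that targets the cluster VIP, not one node."""
    # Percent-encode credentials so characters such as '@' do not break the URL.
    return (f"mysql://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{vip}/{database}")


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address standing in for the VIP.
    print("connection =",
          database_connection("keystone", "KEYSTONE_DBPASS",
                              "192.0.2.10", "keystone"))
```

Because clients only ever see the VIP, a database failover behind it requires no change to keystone.conf.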

Glance image service

(Priority: 5, Depends-on: swift, keystone, infrastructure)

  1. Install Guide for basics (http://docs.openstack.org/juno/install-guide/install/apt/content/ch_keystone.html )
  2. Configure Glance for HA MySQL and HA RabbitMQ
  3. Add OpenStack Image API resource to Pacemaker, Configure OpenStack Image Service API, Configure OpenStack services to use HA Image API (Modify: http://docs.openstack.org/high-availability-guide/content/s-keystone.html )
  4. Configure OpenStack Image Service API (http://docs.openstack.org/high-availability-guide/content/_configure_openstack_image_service_api.html)
  5. Configure OpenStack services to use HA Image API (http://docs.openstack.org/high-availability-guide/content/_configure_openstack_services_to_use_high_available_openstack_image_api.html)
  6. Should Glance use a redundant storage backend such as Swift or Ceph?

Cinder Block Storage Service

(Priority: 6, Depends-on: glance, keystone, infrastructure) This section discusses how to configure the Cinder control plane only. The Cinder data plane is addressed in the "Storage Backends" section.

  1. Install Guide for basic installation
  2. The installation guide covers one API/scheduler node and one volume node.
  3. Install on the controller node
  4. Download resource agent and add Block Storage API to Pacemaker (Keep: http://docs.openstack.org/high-availability-guide/content/_add_block_storage_api_resource_to_pacemaker.html)
  5. Configure Block Storage API service (Expand (it's a bit terse): http://docs.openstack.org/high-availability-guide/content/_add_block_storage_api_resource_to_pacemaker.html)
  6. Configure OpenStack services to use HA Block Storage API (http://docs.openstack.org/high-availability-guide/content/_configure_openstack_services_to_use_highly_available_block_storage_api.html)
  7. Add API/scheduler redundancy and multiple volume nodes.
  8. Discuss availability zones?
  9. Requires at least one additional storage node that provides persistent storage volumes for instances.
  10. Need to use Ceph as the storage backend to have data redundancy? We should support at least one open source option such as Ceph and perhaps NFS... and simply mention other options such as NetApp, EMC, and NFS. (Shamail: We should remove this item from here and address it in the "Storage Backends" section. Meg and I have been discussing ideas on how to provide information while keeping it fairly neutral. I don't think anything above LVM (reference implementation) should get special treatment, however at the same time, we should let people know there are several ways of protecting storage and let them select which one appeals to them, if any)

Swift Object Storage

(Priority: 4, Depends-on: keystone, infrastructure)

  1. Install Guide for basic installation
  2. The installation guide covers basic storage node redundancy, but only deploys one proxy server. Do we want to discuss the process of adding proxy servers and load balancing them? Also, what about adding storage nodes and perhaps discussing regions/zones?

Nova compute service

(Priority: 6, Depends-on: neutron, glance, keystone, infrastructure)

  1. Install Guide for basic setup
  2. The installation guide covers multiple compute nodes, but only deploys one instance of API and other services. We should discuss the process of deploying multiple instances of the latter.

Heat Orchestration

(Priority: 8, Depends-on: telemetry, neutron, nova, glance, keystone, infrastructure)

  1. Install Guide for basic installation
  2. Add API redundancy
  3. How to set up so that VMs on a failed compute node are quickly migrated to other compute nodes

Ceilometer Telemetry and MongoDB

(Priority: 7, Depends-on: neutron, nova, glance, keystone, infrastructure)

  1. Install Guide for basic installation
  2. Need one MongoDB node for each Controller node

Database Service (Trove)

(Priority: 9, Depends-on: neutron, nova, glance, keystone, infrastructure)

  1. Install Guide for basics
  2. Need details about how to apply HA


Data Processing Service (Sahara)

(Priority: 9, Depends-on: neutron, nova, glance, keystone, infrastructure)

  1. Install Guide for basics (http://docs.openstack.org/juno/install-guide/install/apt/content/ch_sahara.html )
  2. Should link to Sahara docs for discussion of OpenStack HA versus Hadoop HA and how they work together, although the installation instructions at http://docs.openstack.org/developer/sahara/userdoc/installation.guide.html do not currently mention HA


  1. Configure Pacemaker service group to ensure that the VIP is linked to the API services resource
  2. Systemd alternative to OCF scripts for Pacemaker RA
  3. MariaDB/Percona with Galera alternative to MySQL
  4. Install and configure HAProxy for API services and MySQL with Galera cluster load balancing
  5. Mention value of redundant hardware load balancers for stateless services such as REST APIs
  6. Describe scaling from a single node to a 3-node HA deployment
  7. Ceph?
  8. Murano?

Original for reference

NOTE: This is the original for us to depart from.

I. Introduction to OpenStack High Availability

  1. Stateless vs. Stateful services
  2. Active/Passive
  3. Active/Active

II. HA Using Active/Passive

1. The Pacemaker Cluster Stack

  1. Installing Packages
  2. Setting up Corosync
  3. Starting Corosync
  4. Starting Pacemaker
  5. Setting basic cluster properties

2. Cloud Controller Cluster Stack

  1. Highly available MySQL
  2. Highly available RabbitMQ

3. API Node Cluster Stack

  1. Configure the VIP
  2. Highly available OpenStack Identity
  3. Highly available OpenStack Image API
  4. Highly available Cinder API
  5. Highly available OpenStack Networking Server
  6. Highly available Ceilometer Central Agent
  7. Configure Pacemaker Group

4. Network Controller Cluster Stack

  1. Highly available Neutron L3 Agent
  2. Highly available Neutron DHCP Agent
  3. Highly available Neutron Metadata Agent
  4. Manage network resources

III. HA Using Active/Active

5. Database

  1. MySQL with Galera
  2. Galera Monitoring Scripts
  3. Other ways to provide a Highly Available database

6. RabbitMQ

  1. Install RabbitMQ
  2. Configure RabbitMQ
  3. Configure OpenStack Services to use RabbitMQ

7. HAproxy Nodes

8. OpenStack Controller Nodes

  1. Running OpenStack API & schedulers
  2. Memcached

9. OpenStack Network Nodes

  1. Running Neutron DHCP Agent
  2. Running Neutron L3 Agent
  3. Running Neutron Metadata Agent