HAforNovaDB

= Making Nova Database Highly Available with Pacemaker =

Abstract
Nova maintains a list of resources and their state in a database. This database is queried and modified frequently by the various services in a Nova installation. Therefore, the database must be highly available: if the database server dies, the database must quickly be restarted, either locally or on another server.

This document will illustrate the steps necessary to make this happen using the open source High Availability (HA) software called Pacemaker.

Overview
At Internap, we are using the open source High Availability software called Pacemaker (http://www.clusterlabs.org) for HA. Pacemaker is configured on two servers with the servers sharing a disk subsystem.

We use hardware RAID for the shared disk and then put the LUNs into an LVM volume group so that we can grow the storage online in the future. A file system is then created on top of the logical volume and MySQL is configured on the file system. MySQL is then bound to a virtual IP address (VIP). As long as the other OpenStack components only use the VIP to talk to the database, failover of the database from one server to another is virtually transparent.

Some implementations may use something called Distributed Replicated Block Device or DRBD instead of shared disk. It was felt that DRBD adds an extra level of complexity that was unnecessary for the first release. Therefore, these steps do not cover DRBD.

This wiki takes the reader from two servers sharing storage, through Pacemaker configuration, MySQL installation, and database creation, at which point the user is ready to create the database tables.

After each configuration step, the user will be asked to verify that the step worked as expected with a simple test case.

Finally, this wiki will give some practical steps on configuring Pacemaker, troubleshooting tips and test cases.

Terms
Shoot The Other Node In The Head or STONITH - one of the techniques used to prevent data corruption in clustering. It is possible that two nodes in a high availability cluster will each think the other one is dead when in fact both are alive. If both nodes attempt recovery, they will corrupt the shared data (file system). The example in this wiki implements STONITH with IPMI, which allows one server to reset the other. For more information, refer to the Pacemaker documentation.

Virtual IP address or VIP - an IP address associated with an application that moves with the application when it is relocated from one server to another. It allows a client to always call the same IP address without having to figure out whether the application is presently running on server1 or server2.

Before You Start

 * After each configuration step, make sure you test! The best way to ensure high availability is to test different scenarios as you build the solution.
 * Helpful links for additional information are found at the end of the wiki
 * These steps were tested with Ubuntu 10.04.3 LTS using OpenStack Nova Cactus

Setting Up Storage
Before setting up HA on the servers, we must test the shared storage and then set up LVM.

Shared Storage Test Case

 * Test that you can access the storage from both servers and that you know which LUNs are shared
 * Test that the shared storage works with these tests:
 * reboot both nodes at once and make sure that you see no errors in syslog
 * use dd to read from one LUN on one server while rebooting the second server - you should see no SCSI errors in syslog
 * if you have multiple paths, test that they work by pulling one of the active paths while doing I/O and make sure the I/O continues

LVM Setup
We used hardware based RAID and then configured LVM on top of the RAID LUNs. This was done so that as our needs increase we can expand the storage later by growing the LVM volume.

The steps to configure LVM are:

On one node, partition the RAID devices. The device names on the server are /dev/mapper/mpath*. Repeat these steps for each LUN (mpathN below stands for the actual device name):


 * 1) parted /dev/mapper/mpathN mklabel gpt
 * 2) parted /dev/mapper/mpathN mkpart primary 1 

On both nodes, disable the LVM cache of disks by changing /etc/lvm/lvm.conf per the diff below and removing the cache. Also, filter out the disks that LVM should not scan:

 * 1) rm -rf /etc/lvm/cache/.cache
 * 2) diff /etc/lvm/lvm.conf /etc/lvm/lvm.conf-orig

53,54c53
<    #filter = [ "a/.*/" ]
<    filter = [ "r|/dev/sd*|" ]
---
>    filter = [ "a/.*/" ]
79,80c78
<    #write_cache_state = 1
<    write_cache_state = 0
---
>    write_cache_state = 1
360,361d355
<    volume_list = [ "db-vg" ]

NOTE: volume_list must contain the name of every LVM volume group placed under HA control, as well as any "root" volume group if you are using LVM for the root disk. If the root volume group is not listed, the system will fail to boot! Therefore, be sure to add them as needed. Since we are using the LVM volume group named db-vg, we have added it above.
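The lvm.conf edits above can be scripted with sed. This is only a sketch run on a scratch copy, assuming the stock defaults shown in the diff; the real edit is made to /etc/lvm/lvm.conf:

```shell
# Work on a scratch copy; apply the same edits to /etc/lvm/lvm.conf for real.
conf=$(mktemp)
cat > "$conf" <<'EOF'
    filter = [ "a/.*/" ]
    write_cache_state = 1
EOF

# Reject /dev/sd* paths so LVM scans only the multipath devices.
sed -i 's|filter = \[ "a/.*/" \]|filter = [ "r\|/dev/sd*\|" ]|' "$conf"
# Stop LVM from caching device state.
sed -i 's/write_cache_state = 1/write_cache_state = 0/' "$conf"
# Only volume groups listed here may be activated on this host.
echo '    volume_list = [ "db-vg" ]' >> "$conf"

cat "$conf"
rm -f "$conf"
```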

Recreate the initramfs with the new lvm.conf so that the node does not automatically activate the volume group:


 * 1) /usr/sbin/update-initramfs -u

On one node, create a physical volume on partition 1 of each disk:


 * 1) pvcreate /dev/mapper/mpath1-part1

On one node, setup the db volume group with the LUN(s).


 * 1) vgcreate db-vg /dev/mapper/mpath1-part1

On one node, create the logical volume. Here we are creating a 16TB volume:


 * 1) lvcreate -L 16T -n db-vol db-vg

On one node, create a file system on the volume. The file system is created as ext4:


 * 1) mke2fs -t ext4 -j /dev/db-vg/db-vol

On each node, make sure that the volume does not automatically start:


 * 1) vgchange -a n

On each node, make the directory where the file system will be mounted when Pacemaker starts the database service on that node:


 * 1) mkdir /dbmnt

On one node, mount the directory:


 * 1) mount /dev/db-vg/db-vol /dbmnt

LVM Test Case

 * Reboot both systems and make sure that volume group db-vg is not displayed with the command vgdisplay -a
 * On one system, activate the volume group and mount the volume with these commands
 * vgchange -a y db-vg
 * mount /dev/db-vg/db-vol /dbmnt

Installing MySQL
On both nodes, make sure the mysql user and group exist and have the same user ID and group ID (substitute a GID and UID of your choosing, identical on both nodes):


 * 1) groupadd -g <GID> mysql
 * 2) useradd -u <UID> -d /var/lib/mysql -s /bin/false -g mysql mysql
 * 3) mkdir /var/run/mysqld
 * 4) chmod 755 /var/run/mysqld
 * 5) chown mysql /var/run/mysqld
 * 6) chgrp mysql /var/run/mysqld
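A quick way to confirm that the IDs match is to compare the passwd entries from both nodes (on the real systems, capture them with `getent passwd mysql`). A sketch, using hypothetical sample entries:

```shell
# Hypothetical passwd entries captured from each node.
node1="mysql:x:105:110::/var/lib/mysql:/bin/false"
node2="mysql:x:105:110::/var/lib/mysql:/bin/false"

# Fields 3 and 4 of a passwd entry are the UID and GID.
uid1=$(echo "$node1" | cut -d: -f3); gid1=$(echo "$node1" | cut -d: -f4)
uid2=$(echo "$node2" | cut -d: -f3); gid2=$(echo "$node2" | cut -d: -f4)

if [ "$uid1" = "$uid2" ] && [ "$gid1" = "$gid2" ]; then
    echo "mysql UID/GID match"
else
    echo "MISMATCH: fix before failover testing" >&2
fi
```

Matching IDs matter because the files under /dbmnt keep their numeric owner when the file system fails over to the other node.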

Install mysql-server and mysql-client on both nodes (we use root/nova):


 * 1) apt-get install mysql-server
 * 2) apt-get install mysql-client

NOTE: AppArmor (http://en.wikipedia.org/wiki/AppArmor) is a security module in Linux. It allows the administrator to restrict an application's access to specific capabilities, such as particular files and directories. If the change below is not made, MySQL will not be able to open the directory containing the database.

On both nodes, set up /etc/apparmor.d/usr.sbin.mysqld so that mysql can read/write the file system on the shared disk. Here is the diff:

 * 1) diff *mys* $HOME/*mysq*

33,34d31
<  /dbmnt/mysql/ rw,
<  /dbmnt/mysql/** rwkl,

and also add these lines:

/var/run/mysql/mysqld.pid w,
/var/run/mysql/mysqld.sock w,
/dbmnt/ rw,
/dbmnt/** rwkl,

NOTE: The above lines must be changed if you install the mysql database in a directory other than /dbmnt.

On both nodes, modify /etc/mysql/my.cnf as follows:

root@server1:/etc/mysql# diff my.cnf my.cnf-orig
46,47c46
< #datadir               = /var/lib/mysql
< datadir               = /dbmnt/mysql
---
> datadir               = /var/lib/mysql

also change bind-address to:

bind-address = 0.0.0.0

On both nodes, prevent MySQL from starting automatically by commenting out the start lines in /etc/init/mysql.conf as follows:


 * 1) # start on (net-device-up
 * 2) #          and local-filesystems
 * 3) #         and runlevel [2345])

MySQL will only be started/stopped/failed over by Pacemaker. This change prevents MySQL from being started at boot.
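The edit above can also be scripted. A sketch using sed on a scratch copy (the real file is /etc/init/mysql.conf), assuming the stock stanza shown:

```shell
# Scratch copy standing in for /etc/init/mysql.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
start on (net-device-up
          and local-filesystems
          and runlevel [2345])
EOF

# Prefix every line of the "start on" stanza with '#'.
sed -i '/^start on (net-device-up/,/runlevel \[2345\])/ s/^/#/' "$conf"

cat "$conf"
rm -f "$conf"
```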

On the node which has /dbmnt mounted on the LVM volume, run these commands from the command line to create a MySQL database in /dbmnt/mysql:

 * 1) mkdir /dbmnt/mysql
 * 2) cd /dbmnt
 * 3) chgrp mysql mysql
 * 4) chown mysql .
 * 5) chgrp mysql .
 * 6) ls -l /dbmnt
 * 7) mysql_install_db --datadir=/dbmnt/mysql --user=mysql
 * 8) /usr/bin/mysqladmin -u root password 'nova'
 * 9) mysql -u root -p

mysql> show databases;

MySQL Test Case

 * Connect to the database instance by doing mysql -u root -p

Installing and Configuring Pacemaker
Assumptions


 * Someone has already configured the shared disk storage and has already setup LVM using the naming outlined in this document.
 * HA will start/stop the LVM volume groups
 * LVM volume group will be named db-vg.
 * The LVM volume will be called db-vol.
 * The mount point for the MySQL database is /dbmnt
 * An ext4 file system has already been created on the LVM volume for the MySQL database.
 * Someone has already configured IPMI access between the nodes for STONITH
 * Heartbeat links between the HA nodes will be provided by two direct-connect network cables; the heartbeat link will NOT use a switch. (When two network cables are used, they should be bonded together)

Setting Up Direct Attached Network Cables with Corosync
Ethernet cables should be directly connected between the hosts and the corresponding interfaces configured.

A bond interface should be created for the Ethernet devices. Check /etc/network/interfaces for the heartbeat bond to be sure. An example /etc/network/interfaces entry for the bonded heartbeat link is:

# heartbeat bond
auto bond2
iface bond2 inet static
    address 172.16.0.1
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond_mode 4
    bond_miimon 100
    bond_lacp_rate 1

where 172.16.0.1 would be replaced by the IP address on each server for the private heartbeat link and eth0/eth1 would list the actual ethernet interfaces used.

The exact details of creating a bond device from the two Ethernet devices are outside the scope of this document.

Install Pacemaker

 * 1) apt-get install pacemaker
 * 2) apt-get install ipmitool

Configure Pacemaker
On both nodes, configure and start OpenAIS (the cluster membership protocol).

For reference, you can look at the document (http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf) specifically the sections titled Configuring OpenAIS, Verify OpenAIS Installation and Verify Pacemaker Installation.

However, everything you need is covered below.

Change /etc/default/corosync as follows:

root@server1:/etc/default# diff corosync corosync-orig
2c2
< START=yes
---
> START=no

Configure the bindnetaddr in the /etc/corosync/corosync.conf file.

For reference, I included my working file using direct connect cables (no network switch):


# Please read the openais.conf.5 manual page
totem {
        version: 2

        # How long before declaring a token lost (ms)
        token: 3000

        # How many token retransmits before forming a new configuration
        token_retransmits_before_loss_const: 10

        # How long to wait for join messages in the membership protocol (ms)
        join: 60

        # How long to wait for consensus to be achieved before starting a new
        # round of membership configuration (ms)
        consensus: 5000

        # Turn off the virtual synchrony filter
        vsftype: none

        # Number of messages that may be sent by one processor on receipt of the token
        max_messages: 20

        # Limit generated nodeids to 31-bits (positive signed integers)
        clear_node_high_bit: yes

        # Disable encryption
        secauth: off

        # How many threads to use for encryption/decryption
        threads: 0

        # Optionally assign a fixed node id (integer)
        # nodeid: 1234

        # This specifies the mode of redundant ring, which may be none, active, or passive.
        rrp_mode: none

        interface {
                # The following values need to be set based on your environment
                ringnumber: 0
                #bindnetaddr: 127.0.0.1
                bindnetaddr: 172.16.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

amf {
        mode: disabled
}

service {
        # Load the Pacemaker Cluster Resource Manager
        ver: 0
        name: pacemaker
}

aisexec {
        user: root
        group: root
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

NOTE: 172.16.0.0 should be changed to the network address of the heartbeat link: the IP address assigned to the heartbeat interface with the last octet replaced with a 0.
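For a /24 netmask (255.255.255.0), the bindnetaddr value can be derived from the interface's IP address by zeroing the last octet; a sketch with an example address:

```shell
# This node's heartbeat-link address (example value).
HB_IP=172.16.0.1

# bindnetaddr is the network address: the last octet replaced with 0
# (valid only for a /24 netmask).
BINDNETADDR="${HB_IP%.*}.0"
echo "$BINDNETADDR"   # -> 172.16.0.0
```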

Start Corosync:


 * 1) /etc/init.d/corosync start

Verify that the pacemaker nodes can see each other with the command:


 * 1) crm_mon -n -1

which should show both nodes. This means that Pacemaker can see both nodes and we are ready to configure Pacemaker to manage the OpenStack services.

Test that the heartbeats are actually going through the direct-connect cables by downing the heartbeat interface on the first node (bond2 in the example above):


 * 1) ifconfig bond2 down

You should see a message in the “crm_mon -n -1” output that lists one of the nodes as “UNCLEAN (offline)”.

After you see this, reboot both servers.

Based on the above corosync.conf file, you should also see messages in /var/log/syslog like:

server1 corosync[26356]:  [TOTEM ] Initializing transport (UDP/IP).
server1 corosync[26356]:  [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
server1 corosync[26356]:  [TOTEM ] The network interface [] is now up.

Setting Up STONITH
Setting up IPMI access is outside of the scope of this document.

However, a simple overview: static IP addresses must be assigned to the IPMI ports of each server, and each server must be able to reset the other over IPMI.

Pacemaker Test Case

 * From one server, check the status of Pacemaker with the command crm_mon -n -1
 * From each server, do the command ipmitool -I lan -H <IPMI_IP> -U ADMIN -P <PASSWORD> power status
 * From each server, take turns resetting the peer server with the command ipmitool -I lan -H <IPMI_IP> -U ADMIN -P <PASSWORD> power reset

Setup Modprobe Alias for Device Driver
This is needed because the Filesystem agent of Pacemaker (used to mount, umount and monitor a file system) looks for this device. On BOTH nodes do:


 * 1) echo "alias scsi_hostadapter ahci" >> /etc/modprobe.d/modprobe.conf

After this is done, verify that there are no duplicates in the file with this command:


 * 1) cat /etc/modprobe.d/modprobe.conf
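Rather than eyeballing the output of cat, duplicate lines can be flagged mechanically; a sketch using a scratch file in place of /etc/modprobe.d/modprobe.conf:

```shell
conf=$(mktemp)
# Simulate a modprobe.conf with an accidental repeat of the alias line.
printf '%s\n' \
    "alias scsi_hostadapter ahci" \
    "alias scsi_hostadapter ahci" > "$conf"

# uniq -d prints only lines that occur more than once; no output means
# the file is free of duplicates.
sort "$conf" | uniq -d

rm -f "$conf"
```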

Configure Resources in Pacemaker
Pacemaker must be configured to start and stop the resources in the correct order. For example, first the LVM volume group db-vg must be started before the file system can be mounted and the MySQL database started.

To do this, take the example configuration in the Appendix and save it to a file named /tmp/pace-config.

Then, modify the values SERVER1, SERVER2, MYSQL_INTERFACE, MYSQL_VIP, ADD_PASSWORD_HERE, SERVER1_IPMI_IP, and SERVER2_IPMI_IP to the values for your local setup.
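The substitutions can be scripted with sed. This is a sketch with example values (replace them with your own); note that the *_IPMI_IP placeholders must be substituted before SERVER1/SERVER2 so the shorter names do not clobber the longer ones:

```shell
# Only run once the example configuration has been saved to /tmp/pace-config.
if [ -f /tmp/pace-config ]; then
    sed -i \
        -e 's/SERVER1_IPMI_IP/10.1.0.11/g' \
        -e 's/SERVER2_IPMI_IP/10.1.0.12/g' \
        -e 's/SERVER1/server1/g' \
        -e 's/SERVER2/server2/g' \
        -e 's/MYSQL_INTERFACE/bond1:200/g' \
        -e 's/MYSQL_VIP/192.168.0.100/g' \
        -e 's/ADD_PASSWORD_HERE/yourpassword/g' \
        /tmp/pace-config
fi
```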

Setup Pacemaker with the new configuration with the command:


 * 1) crm configure load update /tmp/pace-config

Tell Pacemaker that STONITH should be used with the command:


 * 1) crm configure property stonith-enabled=true

Second Set of Pacemaker Test Cases

 * Reboot both servers at the same time and make sure that the database is started on one of the servers and that the command mysql -u root -p is able to connect to the database
 * Reboot the server which is presently running the MySQL database and make sure that the database fails over to the second server. Use the command crm_mon -n -1 to verify that there are no errors

Next Steps
The next steps are to keep testing, create the database tables for Nova, make sure the database is secure and then modify the other OpenStack services to point to the VIP of MySQL.

These steps are outside of the scope of this document but are necessary.
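As an illustration of the last step, in the Cactus release each service reads its database location from a flag, so the flag file on every node would point at the VIP rather than at either server directly (credentials here are hypothetical):

```
--sql_connection=mysql://root:nova@MYSQL_VIP/nova
```

where MYSQL_VIP is replaced with the virtual IP address configured in Pacemaker.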

Pacemaker Troubleshooting Tools/Tips
Refer to the Pacemaker documentation for additional information.

A simple command to see what systems and resources Pacemaker sees is crm_mon -n -1

To see the Pacemaker configuration do crm configure show

Pacemaker generally logs messages in /var/log/syslog. If a resource such as the file system does not mount, read the log file to see if you can figure out why it is not mounting.

Pacemaker uses shell scripts called resource agents (RAs) to start, stop and monitor resources such as a file system. These scripts are found in /usr/lib/ocf/resource.d. If, for example, the file system will not start (mount), try running the exact set of commands from the agent to reproduce the problem.

Appendix - Sample Pacemaker Configuration File
Below is an example Pacemaker configuration file for MySQL. This file must be modified with:


 * SERVER1 - name of first server
 * SERVER2 - name of second server
 * MYSQL_INTERFACE - replace with the Ethernet interface used for MYSQL_VIP (e.g. bond1:200)
 * MYSQL_VIP - replace with IP address of VIP for MYSQL
 * ADD_PASSWORD_HERE - replace with password of IPMI interface
 * SERVER1_IPMI_IP - replace with IP address of IPMI interface of server1
 * SERVER2_IPMI_IP - replace with IP address of IPMI interface of server2

node SERVER1
node SERVER2
primitive db-fs-p ocf:heartbeat:Filesystem \
        params device="/dev/db-vg/db-vol" directory="/dbmnt" fstype="ext4" \
        op start interval="0" timeout="120" \
        op monitor interval="60" timeout="60" OCF_CHECK_LEVEL="20" \
        op stop interval="0" timeout="240"
primitive db-lvm-p ocf:heartbeat:LVM \
        params volgrpname="db-vg" exclusive="true" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive db-mysql-p ocf:heartbeat:mysql \
        params binary="/usr/sbin/mysqld" config="/etc/mysql/my.cnf" datadir="/dbmnt/mysql" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" additional_parameters="--bind-address=MYSQL_VIP" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="20s" timeout="30s"
primitive ipmilan-server1-p stonith:external/ipmi \
        params hostname="server1" ipaddr="SERVER1_IPMI_IP" userid="root" passwd="ADD_PASSWORD_HERE" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60" timeout="60" start-delay="0"
primitive ipmilan-server2-p stonith:external/ipmi \
        params hostname="server2" ipaddr="SERVER2_IPMI_IP" userid="root" passwd="ADD_PASSWORD_HERE" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60" timeout="60" start-delay="0"
primitive nova-ip-p ocf:heartbeat:IPaddr2 \
        params ip="MYSQL_VIP" nic="MYSQL_INTERFACE" \
        op monitor interval="5s"
group lvm-fs-ip-mysql-g db-lvm-p db-fs-p nova-ip-p db-mysql-p \
        meta target-role="Started"
location loc-ipmilan-server1 ipmilan-server1-p -inf: server1
location loc-ipmilan-server2 ipmilan-server2-p -inf: server2