
Making Nova Database Highly Available with Pacemaker (UNDER CONSTRUCTION)

Abstract

Nova maintains a list of resources and their state in a database. This database is queried and modified frequently by the various services in a Nova installation. Therefore, the database must be highly available: if the database server dies, the database must quickly be restarted, either locally or on another server.

This document will illustrate the steps necessary to make this happen using the open source High Availability (HA) software called Pacemaker.

Overview

At Internap, we use the open source High Availability (HA) software Pacemaker. Pacemaker is configured on two servers that share a disk subsystem.

We use hardware RAID for the shared disk and then put the LUNs into an LVM volume group so that we can grow the storage online in the future. A file system is created on top of an LVM volume, and MySQL is configured on that file system. MySQL is then bound to a virtual IP address (VIP). As long as the other OpenStack components only use the VIP to talk to the database, failover of the database from one server to the other is virtually transparent.

This wiki takes the reader from two servers with shared storage all the way through Pacemaker configuration, MySQL installation, and database creation, at which point the user is ready to create the database tables.

After each configuration step, the user will be asked to verify that the step worked as expected with a simple test case.

Finally, this wiki gives some practical tips on configuring Pacemaker, along with troubleshooting tools and test cases.

Terms

STONITH - Shoot The Other Node In The Head: a fencing technique in which a node that can no longer be trusted is forcibly powered off or reset so that it cannot corrupt the shared data. In this setup STONITH is done over IPMI.

VIP - virtual IP address: an IP address that is not tied to a particular server but is managed by the cluster and moved to whichever node is currently running the service. Clients always connect to the VIP, so they do not need to know which node is active.

Before You Start

  • Test after each step! Each configuration step below includes a simple verification; do it before moving on.
  • Links to related documentation are collected in the References section at the end of this page.
  • The Overview above describes the configuration we are building toward.
  • We did not use a clustered file system: only one node runs MySQL at a time and the LVM volume group is activated exclusively on that node, so a plain ext4 file system that fails over with the service is sufficient.

Setting Up Storage

Before setting up HA on the servers, we must first set up the shared storage.

ASSUMPTIONS: The shared storage is already connected, multipath is configured if needed (???include link????), and the storage has been tested by doing things like reads from one LUN on one system while the second system is rebooted.

TEST CASE
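The storage test we use is a sketch like the following (the multipath device name mpath1 is an assumption; substitute one of your own LUN names): on the first node, start a long read from the LUN, reboot the second node while the read is running, and confirm the read completes without errors.

# dd if=/dev/mapper/mpath1 of=/dev/null bs=1M count=1024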

LVM Setup

We used hardware based RAID and then configured LVM on top of the RAID LUNs. This was done so that as our needs increase we can expand the storage later by growing the LVM volume.

The steps to configure LVM are:

On one node, partition the RAID devices. The device names on the server are /dev/mapper/mpath*. Repeat these steps for all LUNs:

# parted /dev/mapper/<device name> mklabel gpt
# parted /dev/mapper/<device name> mkpart primary 1 <size of LUN>
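To verify the partitioning, print the partition table (same <device name> placeholder as above) and confirm there is a single primary partition of the expected size:

# parted /dev/mapper/<device name> print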


On both nodes, disable the LVM cache of disks and filter out the disks that LVM should not scan by changing /etc/lvm/lvm.conf as shown below, then remove the existing cache file:


# rm -rf /etc/lvm/cache/.cache
# diff /etc/lvm/lvm.conf /etc/lvm/lvm.conf-orig
53,54c53
<     #filter = [ "a/.*/" ]
<     filter = [ "r|/dev/sd*|" ]
---
>     filter = [ "a/.*/" ]
79,80c78
<     #write_cache_state = 1
<     write_cache_state = 0
---
>     write_cache_state = 1
360,361d355
<     volume_list = [ "db-vg" ]


NOTE: volume_list must contain the name of any LVM volume group placed under HA control, as well as any “root” volume groups if you are using LVM for the root disk. If the root volume groups are not listed, the system will fail to boot! Therefore, be sure to add them as needed. Since we are using the LVM volume group named “db-vg”, we have added it above.
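As a sketch, if the root file system is also on LVM in a volume group assumed here to be named "rootvg", the entry would look like the line below; if root is not on LVM, the single "db-vg" entry shown in the diff above is enough:

    volume_list = [ "rootvg", "db-vg" ]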

On both nodes, recreate the initramfs with the new lvm.conf so that the node does not automatically activate the volume group at boot:

# /usr/sbin/update-initramfs -u

On one node, create a physical volume on partition 1 of each disk:

# pvcreate /dev/mapper/<LUN name>-part1

On one node, set up the db volume group with the LUN(s):

# vgcreate db-vg /dev/mapper/mpath1-part1 

On one node, create the logical volume. Here we are creating a 16TB volume:

# lvcreate -L 16T -n db-vol db-vg

On one node, create a file system on the volume. The file system is created as ext4:

# mke2fs -t ext4 -j /dev/db-vg/db-vol

On each node, make sure that the volume group is not active (Pacemaker will activate it on the node that runs the database):

# vgchange -a n db-vg

On each node, make the directory where the file system will be mounted when Pacemaker starts the database service on that node:

# mkdir /dbmnt

On one node, activate the volume group and mount the file system:

# vgchange -a y db-vg
# mount /dev/db-vg/db-vol /dbmnt
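As a quick test that the file system is mounted and writable (the test file name is arbitrary):

# df -h /dbmnt
# touch /dbmnt/testfile && rm /dbmnt/testfile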

Installing MySQL

On both nodes, make sure the mysql user and group exist with the same uid/gid on both systems, and create the run directory:

# groupadd -g<group_id> mysql
# useradd -u<user_id> -d/var/lib/mysql -s/bin/false -g<group_id> mysql
# mkdir /var/run/mysqld
# chmod 755 /var/run/mysqld
# chown mysql /var/run/mysqld
# chgrp mysql /var/run/mysqld
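To verify, run the following on both nodes and check that the uid and gid values match:

# id mysql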

Install mysql-server and mysql-client on both nodes (we use root as the MySQL administrative user with the password nova):

# apt-get install mysql-server
# apt-get install mysql-client


NOTE: AppArmor (http://en.wikipedia.org/wiki/AppArmor) is a Linux security module that allows the administrator to restrict an application to certain capabilities, such as access to specific files and directories. If the change below is not made, MySQL will not be able to open the directory containing the database.

On both nodes, edit /etc/apparmor.d/usr.sbin.mysqld so that MySQL can read and write the file system on the shared disk. Here is the diff against a saved copy of the original profile:

# diff *mys* $HOME/*mysq*
33,34d31
<   /dbmnt/mysql/ rw,
<   /dbmnt/mysql/** rwkl,

and also add these lines:

 /var/run/mysqld/mysqld.pid w,
 /var/run/mysqld/mysqld.sock w,
 /dbmnt/ rw,
 /dbmnt/** rwkl,


       NOTE: The above lines must be changed if you place the MySQL database in a directory other than /dbmnt.
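After editing the profile, reload it on both nodes so the change takes effect (one way to do this; the apparmor init script can also reload profiles):

# apparmor_parser -r /etc/apparmor.d/usr.sbin.mysqld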

On both nodes, edit /etc/mysql/my.cnf as follows:


root@cc-vol-4-1:/etc/mysql# diff my.cnf my.cnf-orig
46,47c46
< #datadir                = /var/lib/mysql
< datadir                = /dbmnt/mysql
---
> datadir                = /var/lib/mysql


Also change bind-address so that MySQL listens on all addresses, including the VIP:


bind-address = 0.0.0.0


On both nodes, prevent MySQL from starting automatically by commenting out the start lines in /etc/init/mysql.conf as follows:


#start on (net-device-up
#          and local-filesystems
#         and runlevel [2345])


MySQL will only be started/stopped/failed over by Pacemaker. This change prevents MySQL from being started at boot.
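Since the packaged MySQL is normally already running against /var/lib/mysql right after installation, stop it now and confirm it is stopped (assuming the standard Upstart start/stop/status commands are available):

# stop mysql
# status mysql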

On the node that has /dbmnt mounted on the LVM volume, run these commands:

# mkdir /dbmnt/mysql
# cd /dbmnt
# chgrp mysql mysql
# chown mysql .
# chgrp mysql .
# ls -l /dbmnt
# mysql_install_db --datadir=/dbmnt/mysql --user=mysql
# /usr/bin/mysqladmin -u root password 'nova'
# mysql -u root -p
mysql> show databases;


TODO: add test cases for this step, and describe configuring Pacemaker with “crm configure edit” and the quorum policy settings (see the sample configuration in the Appendix). A way to start and connect to the database by hand is sketched below.
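Note that the mysqladmin and mysql commands above require the server to be running against the new datadir. One way to start it by hand for this initial setup and for a quick smoke test (a sketch only, assuming the Upstart job can still be started manually after the change above, and using the root/nova credentials set earlier) is:

# start mysql
# mysql -u root -pnova -e 'show databases;'
# stop mysql

Stop MySQL again afterwards so that only Pacemaker manages it from here on.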

Installing and Configuring Pacemaker

This section covers installing and configuring Pacemaker so that the database service is highly available.

Assumptions

  • Someone has already configured the shared disk storage and has already set up LVM using the naming outlined in this document.
  • Pacemaker will start/stop the LVM volume group.
  • The LVM volume group is named “db-vg”.
  • The LVM volume is called “db-vol”.
  • The mount point for the MySQL database is /dbmnt.
  • An ext4 file system has already been created on the LVM volume for the MySQL database.
  • Someone has already configured IPMI access between the nodes for STONITH.
  • Two heartbeat links are provided between the HA nodes. The heartbeat links do NOT go through a switch; they are direct-connect network cables. (If two cables are used, they should be bonded together.)

Setting Up Direct Attached Network Cables with Corosync

Direct connect cables should already be connected between the hosts.

The cables will be bonded together on the device “bond1” OR “bond2”. Check /etc/network/interfaces for the heartbeat bond to be sure. An example /etc/network/interfaces entry for the bonded heartbeat link is:


# heartbeat bond
auto bond2
iface bond2 inet static
  address 172.16.0.1
  netmask 255.255.255.0
  bond-slaves eth0 eth1
  bond_mode 4
  bond_miimon 100
  bond_lacp_rate 1


where 172.16.0.1 would be replaced by the IP address of the private heartbeat link on each server.

Setting Up Pacemaker for High Availability

Pacemaker is an HA framework. To set it up do:

Install

# apt-get install pacemaker
# apt-get install ipmitool


Configure Pacemaker

On both nodes, configure and start Corosync/OpenAIS (the cluster membership layer).

For reference, you can look at http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf (everything needed is covered in the steps below), specifically the sections titled “Configuring OpenAIS”, “Verify OpenAIS Installation” and “Verify Pacemaker Installation”.

Change /etc/default/corosync as follows:


# pwd
/etc/default
# diff corosync corosync-orig
2c2
< START=yes
---
> START=no


Configure bindnetaddr, the multicast address, and the multicast port in /etc/corosync/corosync.conf, as well as other settings. I only modified bindnetaddr in the default file. For reference, I included my working file using a direct cable (no network switch):


# Please read the openais.conf.5 manual page

totem {
       version: 2

       # How long before declaring a token lost (ms)
       token: 3000

       # How many token retransmits before forming a new configuration
       token_retransmits_before_loss_const: 10

       # How long to wait for join messages in the membership protocol (ms)
       join: 60

       # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
       consensus: 5000

       # Turn off the virtual synchrony filter
       vsftype: none

       # Number of messages that may be sent by one processor on receipt of the token
       max_messages: 20

       # Limit generated nodeids to 31-bits (positive signed integers)
       clear_node_high_bit: yes

       # Disable encryption
       secauth: off

       # How many threads to use for encryption/decryption
       threads: 0

       # Optionally assign a fixed node id (integer)
       # nodeid: 1234

       # This specifies the mode of redundant ring, which may be none, active, or passive.
       rrp_mode: none

       interface {
               # The following values need to be set based on your environment
               ringnumber: 0
               #bindnetaddr: 127.0.0.1
               bindnetaddr: 172.16.0.0
               mcastaddr: 226.94.1.1
               mcastport: 5405
       }
}

amf {
       mode: disabled
}

service {
       # Load the Pacemaker Cluster Resource Manager
       ver:       0
       name:      pacemaker
}

aisexec {
       user:   root
       group:  root
}

logging {
       fileline: off
       to_stderr: yes
       to_logfile: no
       to_syslog: yes
       syslog_facility: daemon
       debug: off
       timestamp: on
       logger_subsys {
               subsys: AMF
               debug: off
               tags: enter|leave|trace1|trace2|trace3|trace4|trace6
       }
}


NOTE: 172.16.0.0 should be changed to match the network Corosync should bind to on your systems: take the IP address of the chosen interface (in our case the private heartbeat link) and replace the last octet with 0. For example, with the heartbeat address 172.16.0.1 and netmask 255.255.255.0 from the bond example above, bindnetaddr is 172.16.0.0.

Start Corosync:


# /etc/init.d/corosync start


Verify that the pacemaker nodes can see each other with the command:


# crm_mon -n


which should show both nodes. This means that Pacemaker can see both nodes and we are ready to configure Pacemaker to manage the OpenStack services.

Test that the heartbeats are actually going through the direct connect cables by running this command on the first node:

# ifconfig <bond interface> down

You should see a message in the “crm_mon -n” output that lists one of the nodes as “UNCLEAN (offline)”.

After you see this, reboot the system where the bond interface was taken down.

Based on the above corosync.conf file, you should also see messages in /var/log/syslog like:

Jul 12 22:41:17 server1 corosync[26356]:   [TOTEM ] Initializing transport (UDP/IP).
Jul 12 22:41:17 server1 corosync[26356]:   [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 12 22:41:17 server1 corosync[26356]:   [TOTEM ] The network interface [10.4.1.29] is now up.


Setting Up STONITH

TODO: add the procedure for configuring IPMI-based STONITH on our Dell servers, or link to a separate document. The Pacemaker side (the stonith:external/ipmi primitives) is shown in the sample configuration in the Appendix.
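Until then, a quick way to check from each node that the other node's IPMI interface is reachable (a sketch only; the lanplus interface type, root user, and password must match your IPMI setup) is:

# ipmitool -I lanplus -H <SERVER2_IPMI_IP> -U root -P <password> chassis power status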

Setup Modprobe Alias for Device Driver

This is needed because the Filesystem resource agent looks for this alias. On BOTH nodes do:


# echo "alias scsi_hostadapter ahci" >> /etc/modprobe.d/modprobe.conf

After this is done, verify that there are no duplicates in the file with this command:

# cat /etc/modprobe.d/modprobe.conf


Test Cases

Troubleshooting Tools

crm_mon -n
crm configure show (refer to doc)
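A few other commands that can help when debugging (assuming a standard Corosync/Pacemaker installation as set up above):

crm_verify -L -V (check the live configuration for errors)
crm_mon -n -1 (one-shot status output)
corosync-cfgtool -s (show the status of the Corosync ring)
/var/log/syslog (Corosync and Pacemaker log here with the logging settings shown above)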

References

       Pacemaker “Clusters from Scratch”: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
       Pacemaker project site: http://www.clusterlabs.org/
       AppArmor: http://en.wikipedia.org/wiki/AppArmor

Appendix - Sample Pacemaker Configuration File

node server1
node server2
primitive db-fs-p ocf:heartbeat:Filesystem \
        params device="/dev/db-vg/db-vol" directory="/dbmnt" fstype="ext4" \
        op start interval="0" timeout="120" \
        op monitor interval="60" timeout="60" OCF_CHECK_LEVEL="20" \
        op stop interval="0" timeout="240"
primitive db-lvm-p ocf:heartbeat:LVM \
        params volgrpname="db-vg" exclusive="true" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive db-mysql-p ocf:heartbeat:mysql \
        params binary="/usr/sbin/mysqld" config="/etc/mysql/my.cnf" datadir="/dbmnt/mysql" pid="/var/run/mysqld/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" additional_parameters="--bind-address=MYSQL_VIP " \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="20s" timeout="30s"
primitive ipmilan-server1-p stonith:external/ipmi \
        params hostname="server1" ipaddr="SERVER1_IPMI_IP" userid="root" passwd="ADD_PASSWORD_HERE" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60" timeout="60" start-delay="0"
primitive ipmilan-server2-p stonith:external/ipmi \
        params hostname="server2" ipaddr="SERVER2_IPMI_IP" userid="root" passwd="ADD_PASSWORD_HERE" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60" timeout="60" start-delay="0"
primitive nova-ip-p ocf:heartbeat:IPaddr2 \
        params ip="MYSQL_VIP" nic="MYSQL_INTERFACE" \
        op monitor interval="5s"
group lvm-fs-ip-mysql-g db-lvm-p db-fs-p nova-ip-p db-mysql-p \
        meta target-role="Started"
location loc-ipmilan-server1 ipmilan-server1-p -inf: server1
location loc-ipmilan-server2 ipmilan-server2-p -inf: server2
property $id="cib-bootstrap-options" \
        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
        cluster-infrastructure="openais" \
        stonith-enabled="true" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1320372008"
rsc_defaults $id="rsc-options" \
        resource-stickiness="1000"


MYSQL_INTERFACE - replace with ethernet interface used for MYSQL_VIP
MYSQL_VIP - replace with IP address of VIP for MYSQL
ADD_PASSWORD_HERE - replace with password of IPMI interface
SERVER1_IPMI_IP - replace with IP address of IPMI interface of server1
SERVER2_IPMI_IP - replace with IP address of IPMI interface of server2
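One way to enter a configuration like the one above is to open the live configuration in an editor with the crm shell, paste in the edited text (with the placeholders above filled in), and then verify it:

# crm configure edit
# crm configure show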

Finally, remember to change the configuration files of the other OpenStack services so that they use the MySQL VIP described above (and not a physical host address); otherwise those services will lose their database connection when MySQL fails over to the other node. A sketch for Nova is shown below.
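As a sketch of what this looks like for Nova (the database name, user, and password are placeholders; substitute your own values), the sql_connection flag in nova.conf would point at the VIP:

--sql_connection=mysql://nova:<password>@MYSQL_VIP/nova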