Fencing Instances of an Unreachable Host

Abstract
When an OpenStack controller determines that a connection to a physical host is broken, it is possible to restart some or all of its instances on other hosts. The new instance (the instance restarted on another host) takes over the identity of the obsolete instance (the instance on the unreachable host); thus it has the same volumes attached and the same IP and MAC addresses. OpenStack supports this remote restart operation through a Nova API command called "evacuate" (the Nova "evacuate" API is referred to in this document as remote restart).

It is important to note that the remote restart may be done whenever the OpenStack controller decides the host's connectivity is broken. This neither implies that the host's connectivity is broken for sure from its entire environment, nor that it is broken forever. When the perceived disconnection is due to some transient or partial failure, the OpenStack remote restart might lead to two identical instances running together and conflicting dangerously. For example, the obsolete instance may access the application storage and cause data corruption, create an IP address conflict, or communicate with other nodes in a way that disrupts the new instance's communications or creates inconsistent states.

In order to safely remote restart, the obsolete instance must first be fenced, i.e. shut down or isolated.

The following table shows three fencing approaches, where the methods address the case in which not only the instances are unreachable, but their host is unreachable as well.

For each of these methods, the document elaborates how it can be implemented in OpenStack, the requirements and recommendations for the underlying infrastructure (e.g. recommendations for the power system), and an analysis of the fencing method's advantages and drawbacks.

Due to infrastructure limitations, it might be that only some of the three fencing methods can be used for a given system. Yet, it is recommended to combine all of them for the following reasons:
 * 1) Reducing the probability that the fencing would fail (in particular, the fencing may fail due to the same failure that caused the host disconnection).
 * 2) Reducing the fencing time.

Based on the three fencing approaches, a design of fencing support in OpenStack is given, covering Nova, Cinder and Neutron. This includes
 * 1) Power fencing
 * 2) Fencing from storage in Cinder
 * 3) Fencing from network in Neutron
 * 4) Self fencing by Nova Compute
 * 5) Fencing awareness and management in Nova controller

It should be noted that this is a working document; thus, while some of the items above (2 and 5) are detailed designs, others (1, 3 and 4) require further elaboration.

Scope
Handling a host disconnection event requires a variety of capabilities:
 * 1) Fault detection: a mechanism for monitoring, detecting and alerting disconnected-host events. Such a mechanism may be based, for example, on Nova Compute heartbeats and/or Ceilometer.
 * 2) Fault management: listening to disconnected-host events, choosing and initiating response actions. This may be done by Heat and/or administrator.
 * 3) Correction capabilities in Nova, Cinder, and Neutron: (i) Fence the instances of the unreachable host. (ii) Remote restart of an instance, or alternatively, start a standby instance (an instance which was kept updated by micro-checkpointing). (iii) Recover management operations which were disrupted by the host disconnection. For example, if an instance creation was disrupted by the host failure, it is required to clean up some of the changes already made and to create the instance on another host.
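The fault detection capability in item 1 can be sketched as a simple heartbeat monitor that alerts fault-management listeners about disconnected hosts. This is a minimal illustration with hypothetical names; a real implementation would build on the Nova Compute heartbeats and/or Ceilometer mentioned above.

```python
import time

class FaultDetector:
    """Sketch of heartbeat-based disconnected-host detection (hypothetical)."""

    def __init__(self, timeout=60, clock=time.monotonic):
        self.timeout = timeout       # seconds without a heartbeat before alerting
        self.clock = clock
        self.last_seen = {}          # host name -> time of last heartbeat
        self.listeners = []          # fault-management callbacks (item 2)

    def heartbeat(self, host):
        self.last_seen[host] = self.clock()

    def subscribe(self, callback):
        self.listeners.append(callback)

    def poll(self):
        """Alert listeners about hosts whose heartbeat has timed out."""
        now = self.clock()
        for host, seen in self.last_seen.items():
            if now - seen > self.timeout:
                for cb in self.listeners:
                    cb(host)
```

A fault-management component (Heat and/or an administrator's script) would subscribe to the detector and choose a response action, such as initiating fencing.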

Remote restart is supported today in OpenStack through a special Nova API method called "evacuate". As elaborated in another document, there is a lot of room for fixes and enhancements of the remote restart ("evacuate") method. There is also a heartbeat infrastructure in Nova, which may be leveraged to provide disconnected-host events. All other items in the list above are not supported at all.

While the goal is completing all items, this document focuses only on fencing. Fencing itself is quite an extensive and non-trivial functionality. It should be noted that data center managers do not rely solely on OpenStack; they have their own manual procedures and/or scripts to deal with host disconnection scenarios. Thus, adding a fencing mechanism before completing all the other items would already be useful.

Another related issue, which is not addressed by this document, is the fencing of an unreachable instance, in case the host is reachable. This may be done by less aggressive means than those elaborated in this document, for example, by using the fence_virsh agent for fencing an instance managed by libvirt.

While fencing and remote restart are necessary for application high availability, they are not a sufficient guarantee of it. When a host crashes, the state of the instance may be lost. To avoid that, the application state must be persisted and/or have a live copy. These issues are beyond the scope of this document.



Remote restart
Normally, when it is desired to move an instance from one host to another (for example in a planned maintenance operation, due to power consumption optimization, due to load balancing, etc.), it is best to use live migration, which is transparent to the application. Sometimes live migration is not possible due to some incompatibility between the source and the target hosts. In such cases, a cold migration should be used. Cold migration flushes all state and data to the disk and then either copies the disk to the destination, or points the new instance to a shared storage.

In the current case, the host is unreachable for instance management operations (though the host may still be reachable for power operations). Thus neither live migration nor cold migration is possible. It is only possible to remote restart the instance.

The problem
While the remote restart support (the "evacuate" API in Nova) is a good start, it is an incomplete and dangerous solution, since the obsolete instances may attempt to access the application storage and cause data corruption, create an IP address conflict, or disrupt the new instance's communications.

Note that the only thing we know for sure is that the connection to the host was lost. This is indeed very often an indication that the host has crashed, but not always. A connection to a host may be lost for other reasons as well, and in some of these cases a conflict may happen. Specifically, the following causes may lead to a conflict:
 * 1) A transient failure of the network would lead to a conflict once the host is reconnected.
 * 2) A partial network failure, in which the OpenStack controller lost its connection to the host, yet, the host's instances can access the application storage and/or the data network.
 * 3) A partial failure of the host may disrupt its connection with the controller without killing host's instances.

The solution
We may try to further check and improve the root-cause analysis of the disconnection. For example, it is possible to also monitor the data network in order to exclude the partial network failure scenario. In such a case, one might consider, for example, using the data network to reach the host and restart the Nova-Compute service, as it might be the root cause of the problem in this case.

Nevertheless, there is no way to be 100% sure that the root cause was a host crash. For example, it is unclear how one may determine that a failure is transient, since no matter how long we wait, the host may reappear after the recovery actions were issued.

In any case, even if the host became unreachable due to some other reason, we may still want to remote restart its instances. For example, in case of a transient network failure, we may prefer to remote restart rather than wait for the host to be reconnected.

Nevertheless, it is possible to safely complete a remote restart if we first isolate or shut down the instances. Such an operation is called fencing.

There are three possible approaches for fencing: power fencing, resource fencing and host fencing.

System requirements
Since the recovery of instances depends on the OpenStack controller, it is recommended that the controller be highly available. This can be done based on the Pacemaker clustering mechanism, which is supported by OpenStack.

For network fencing, it is recommended to have separate networks for data and management.

Self fencing may require Nova-Compute service on each host.

Since the recovery mechanism is dependent on fencing, it is recommended that the fencing system be highly available. Combining the three fencing approaches may, to a large extent, address this challenge. Beyond that, there are further measures to strengthen each of the fencing methods. Such measures are elaborated in the sections devoted to each of the fencing methods.

Power fencing
Power fencing may be done by powering off the host, or by a hard reboot (powering off the host and then powering it on, aka cold reboot), through power distribution units (PDUs).

It is important to note that a hard reboot of an unreachable host may be useful even if it is not possible or desired to remote restart the instances, since after a hard reboot the host may become operational again, so it would be possible to start the instances. Similarly, if a hard reboot made the host operational again, it would be possible to schedule instances to it. On the other hand, if, for example, the host suffers repeating failures, it might be desired to just power it off.

In summary, there are three possible scenarios:
 * 1) It was not possible to reach the host, turn off the power and get a confirmation. In such a case, either one of the other fencing methods may succeed, or there is a problem.
 * 2) Power off succeeded, but the attempt to power the host on again failed. In such a case it is possible to start the new instances, since the host is fenced.
 * 3) Power off succeeded and so did the power on, such that the host overcame the failure. In such a case, if a new instance was not already started, it is possible to start the obsolete instance.
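The three scenarios above can be sketched as the possible outcomes of a single power-fencing attempt. The `pdu` interface here is hypothetical (assumed `power_off`/`power_on` methods that return True only on a confirmed success), intended only to illustrate the decision logic, not any real PDU driver.

```python
from enum import Enum

class PowerFenceResult(Enum):
    UNCONFIRMED = 1   # scenario 1: no power-off confirmation from the PDU
    FENCED_OFF = 2    # scenario 2: powered off; power-on failed or skipped
    REBOOTED = 3      # scenario 3: hard reboot completed; host may recover

def power_fence(pdu, host, try_power_on=True):
    """Sketch of hard-reboot power fencing through a PDU (hypothetical API)."""
    if not pdu.power_off(host):
        # Scenario 1: fall back to the other fencing methods.
        return PowerFenceResult.UNCONFIRMED
    if try_power_on and pdu.power_on(host):
        # Scenario 3: the obsolete instances may be restartable in place.
        return PowerFenceResult.REBOOTED
    # Scenario 2: the host is fenced; new instances may be started elsewhere.
    return PowerFenceResult.FENCED_OFF
```

For a host suffering repeating failures, the caller could pass `try_power_on=False` to leave it powered off.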

It is recommended to utilize power fencing according to the following rules:


 * 1) The fencing device (i.e. the power control device) and the host should not share a single power rail, since if the power is lost, the OpenStack controller won't get a success confirmation from the fencing device. In such a situation, the controller won't do a remote restart, although the power is off, since it can't know that. Therefore it is recommended to have a PDU with two power rails, or with a power rail which is not shared with the hosts it controls.
 * 2) It is recommended to have two fencing devices, where each host is connected to both devices and the two fencing devices are not sharing a single power rail.
 * 3) It is recommended that the connectivity of the controller to the different fencing devices should not simultaneously fail due to a failure of a single cable or a single network device.

Power fencing, if implemented according to all the above rules, is a reliable solution. Such an invested PDU system may further improve HA through a backup power rail for each host. On the other hand, following all the above rules may incur additional costs and puts restrictions on the type of hardware that can be used.

Comment: Ironic, power fencing with IPMI may be relevant here.

Resource Fencing
With resource fencing, the failed instance would be isolated by preventing it from accessing the storage and the network.

It is recommended to avoid a single point of failure of the storage or the network fencing; for example, avoiding a configuration in which the controller is connected to the network through a single switch, or a configuration in which a host and the application storage reside behind the same single switch.

Similar to power fencing, making the resource fencing highly available may incur additional costs.

Storage fencing in Cinder
See blueprint at https://blueprints.launchpad.net/cinder/+spec/fencing

Fencing a host from storage would include preventing the host from accessing the volumes which were attached to it.

Fencing will be done with a fence-host method, called with the arguments: context, host-name and connector.

The fencing will take care of:
 * 1) Finding all storage devices with volumes attached to one or more of the host's instances.
 * 2) For each storage device in that list, removing the host from the list of hosts permitted to access volumes in the storage device. If the driver needs further information about the host, such as iSCSI initiator names or WWPNs, it would derive them from the connector.
 * 3) Detaching, at the Cinder level, each volume attached to one of the host's instances.
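The three steps above can be sketched roughly as follows. This is not the real Cinder code path: the volume list, the `backend` field and the driver's `remove_host_access` method are all hypothetical in-memory stand-ins, used only to show the flow from the (context, host-name, connector) arguments to per-driver access revocation and Cinder-level detach.

```python
def fence_host_from_storage(context, host_name, connector, volumes, drivers):
    """Sketch of the fence-host flow (hypothetical data structures).

    volumes: list of dicts with 'id', 'host' and 'backend' keys.
    drivers: mapping backend name -> driver object with remove_host_access().
    """
    # 1) Find the storage backends holding volumes attached to this host.
    attached = [v for v in volumes if v["host"] == host_name]
    backends = {v["backend"] for v in attached}

    # 2) Revoke the host's access on each backend; the driver derives
    #    iSCSI initiator names / WWPNs from the connector if it needs them.
    for backend in backends:
        drivers[backend].remove_host_access(host_name, connector)

    # 3) Detach each affected volume at the Cinder management level.
    for v in attached:
        v["host"] = None
    return [v["id"] for v in attached]
```

Note that step 2 runs once per storage device rather than once per volume, which is part of why a single fence-host request is more efficient than per-volume force-detach.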

Part of the host fencing would be done at the driver level. Given there are many drivers, fence-host would be implemented and tested first with only one or two drivers: iSCSI and maybe NFS.

It should be noted that currently Cinder supports a detach-volume command, which only detaches the volume at the Cinder management level, but does not force detachment at the storage device level. One may consider an alternative fencing implementation, in which the behavior of volume detach is changed such that it would also force detachment at the storage controller level. However, there are three reasons why this should not be done:
 * 1) In case a volume is shared, it is impossible to force-detach it from a single instance, since a volume is connected - at the storage device level - to the host and not to an instance. Neither would it be possible to detach a shared volume from all instances, since that may disconnect the volume from instances on other hosts.
 * 2) In NFS it is not possible to disconnect a volume from an instance even if it is not shared.
 * 3) Since we want to fence an instance only as part of fencing all instances of a given host, sending a single fence-host request would be more efficient and thus enable a faster recovery.

Network fencing in Neutron
In order to have network fencing, it is recommended to separate the data network from the management network, for example, by VLANs or subnets. Such a separation should be such that the data network would not fall back to the management network in case of a failure. The reason such a separation may be useful is that it makes it possible to fence only the data network without fencing the management network. This way it would be possible to detect when the host is reachable again, clean the host from the obsolete instances, and only then reconnect it to the data network. Furthermore, having separate data and management networks is a good practice for further reasons.

It should be noted, though, that having data and management on separate networks is not the only possible solution for safe reconnection. Another possible solution, for example, is waiting for the self fencing timeout to elapse.

At this point fencing will target the physical switches connected directly to the host. If it is not possible to access all these switches then network fencing would fail.

The interface provided by Neutron will be fence-host, where the only argument is the host name. The network connectivity, topology and the logic of how to do the fencing (for example, finding all switches directly connected to the host) would all be encapsulated by Neutron.

In order to achieve the above, Neutron should maintain an inventory of physical switches, the connectivity of switch interfaces to hosts, and any other needed topology information. Currently, this information is not available in Neutron.

The underlying implementation in Neutron may be done using Neutron extension plug-ins for physical network devices, for example, plug-ins supporting the OpenFlow standard, such as OpenDaylight or Ryu.
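The fence-host interface described above can be sketched as follows. The inventory format (`topology` mapping hosts to switch/port pairs) and the `disable_port` call are assumptions for illustration; the real implementation would live behind the Neutron plug-in layer.

```python
def fence_host_from_network(topology, switches, host_name):
    """Sketch of Neutron-side network fencing (hypothetical inventory).

    topology: mapping host name -> [(switch name, port), ...]
    switches: mapping switch name -> object with disable_port(port)
    """
    links = topology.get(host_name, [])
    if not links:
        # Without inventory of the directly connected switches,
        # network fencing fails, as noted above.
        raise LookupError("no switch inventory for host %s" % host_name)
    for switch_name, port in links:
        # Ideally this blocks only the data network (e.g. the data VLAN),
        # keeping the management network up for later cleanup.
        switches[switch_name].disable_port(port)
    return len(links)
```

The caller (the fence-host API) only supplies the host name; everything else is resolved from Neutron's own inventory.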

Self fencing
With self fencing, when the host detects that it is disconnected, it shuts down or disconnects its instances.

As elaborated in previous sections, external fencing can be achieved through control of power, storage and network devices. While this external fencing system can be based on existing infrastructure, making the fencing highly available may require additional infrastructure. Since self fencing cannot be masked by any kind of failure outside the host, it may reduce the costs involved in making the fencing operation more resilient to failures.

On the other hand, self fencing has a variety of potential risks and drawbacks. The following paragraphs elaborate the different potential problems and how they may be addressed.

Safety
The independent behavior of the self fencing approach makes it powerful, but it also may be risky. For example, when a controller fails, or its network connectivity fails, the workloads are not necessarily impacted. Thus, in such a scenario, if each host would shut down its instances just because it lost its connectivity to the controller, an unnecessary outage - of possibly many workloads - may happen.

Taking care that the controller is highly available would reduce the risk of such a grand outage, yet this reduction may be insufficient, given the potentially vast damage in case both controllers are disconnected.

It is possible to address this risk by using a quorum-consensus solution. In such a solution, the controller and the Nova-Compute services would create a cluster of nodes that monitor each other. In case of a split brain, a host would not fence itself if it is on the majority side, even if it is disconnected from the controller.

For this solution it is required that Nova-Compute service would be on the host.

The scalability of the quorum-consensus solution may be addressed by:
 * 1) Limiting the number of hosts in the Nova Controller cell.
 * 2) Using cluster monitoring mechanisms, in which each node monitors only a few other nodes at a given period, while the cluster state propagates gradually to all nodes.
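The quorum rule described above can be sketched as a single decision function. This is only an illustration of the majority-side rule (counting the deciding host itself among its reachable side); a real implementation would sit on top of a cluster membership mechanism.

```python
def should_self_fence(reachable_peers, cluster_size, controller_reachable):
    """Sketch of the quorum-consensus self-fencing rule (hypothetical).

    reachable_peers: number of other cluster nodes this host can reach.
    cluster_size: total number of nodes in the cluster.
    controller_reachable: whether this host can still reach the controller.
    """
    if controller_reachable:
        return False               # no disconnection, nothing to do
    side = reachable_peers + 1     # this host's side of the split, itself included
    # Self-fence only when on the minority side (ties treated as minority,
    # a conservative choice assumed here to avoid two surviving sides).
    return side * 2 <= cluster_size
```

For example, in a 5-node cluster a host seeing 2 live peers (a side of 3) keeps running even if the controller is unreachable, while a host seeing only 1 peer fences itself.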

Comments:
 * Heartbeat mechanisms may be relevant here, e.g. MatchMaker for RPC, DB driver, Memcached, Zookeeper based failure detection.
 * SBD fencing may be relevant here.

Failures coverage
One of the possible reasons for a host disconnection is a partial failure. In such a case, the host may also fail to detect its failure.

This problem can be avoided if self fencing is used only as a complement to external fencing.

This problem may also be addressed if the self fencing is done with a hardware watchdog.
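The watchdog approach can be sketched as a petting loop: the host pets the watchdog only while it considers itself connected, so after a disconnection (or a partial failure that kills the petting process itself) the hardware timer expires and resets the host without any software cooperation. The `watchdog` object and its `pet()` method are hypothetical stand-ins for a real watchdog device interface.

```python
import time

def watchdog_loop(watchdog, is_connected, interval=5, stop=lambda: False):
    """Sketch of self fencing via a hardware watchdog (hypothetical API).

    While the host can see the controller, the loop pets the watchdog,
    postponing the hardware reset. If connectivity is lost - or if this
    loop itself dies due to a partial host failure - the petting stops
    and the watchdog hard-resets the host on its own.
    """
    while not stop():
        if is_connected():
            watchdog.pet()        # postpone the hardware reset
        time.sleep(interval)
```

This is what gives the watchdog its coverage advantage: unlike a software-only self-fencing agent, it fails safe when the host is too broken to run the agent.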

Slow recovery
Since the OpenStack controller does not get any notification about successful fencing, it may restart the instances on other hosts only after a timeout has elapsed. This may slow down the recovery. Furthermore, using the quorum-consensus solution for safety implies a longer timeout, especially with large clusters.

This problem can be mitigated if self fencing is used only as a complement to external fencing: the better the external fencing system, the lower the probability that the OpenStack controller would need to wait for the self-fencing timeout.

Fencing integration within Nova
See blueprint at https://blueprints.launchpad.net/nova/+spec/fencing

Fencing state
A host may have one of the following fencing states: FENCED and UNFENCED.

It is important to note that while the fencing state is associated with the host, the host is not necessarily completely fenced. The fencing is targeted only at the instances of the host, since the instances need to be restarted on other hosts. If possible, the host is not fenced from the controller, so it would be possible to make it usable again once the problem is over. For example, network fencing may disconnect only the data network, and power fencing may try to do a hard reboot rather than leaving the host powered off.

Nova should maintain the fencing state for the following reasons:
 * 1) 	Nova should agree to remote restart an instance only if its host is fenced.
 * 2) 	While a host is fenced, the Nova controller should not schedule instances to it even if it becomes reachable again.
 * 3) 	When a fenced host becomes reachable again, the Nova controller should do clean up and then un-fence the host.
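The three rules above can be sketched as controller-side guard checks. The class and state storage here are hypothetical (in reality the fencing state would be persisted in the Nova database), but they show where the fencing state is consulted.

```python
FENCED, UNFENCED = "FENCED", "UNFENCED"

class FencingGuard:
    """Sketch of the three controller-side fencing-state rules (hypothetical)."""

    def __init__(self):
        self.state = {}   # host name -> FENCED / UNFENCED

    def may_remote_restart(self, host):
        # 1) Agree to remote restart only if the instance's host is fenced.
        return self.state.get(host) == FENCED

    def may_schedule_to(self, host):
        # 2) Never schedule new instances to a fenced host,
        #    even if it has become reachable again.
        return self.state.get(host, UNFENCED) != FENCED

    def on_host_rejoined(self, host, cleanup):
        # 3) When a fenced host rejoins: clean up, then un-fence.
        cleanup(host)
        self.state[host] = UNFENCED
```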

The fencing states would be changed due to operations exposed by API (fence API, power down API) and due to automatic procedures triggered by events (self fencing, rejoin of a disconnected host).

Fixing the remote restart path
Currently, the remote restart ("evacuate") method enables remote restarting the instances of a failed host without checking a fencing state. It is suggested to allow a remote restart according to the following table:

Currently, remote restart detaches the volumes attached to the obsolete instance before attaching them to the remotely restarted instance. The Cinder fencing task would also detach volumes (while also forcing the detachment at the storage level). Therefore, the detachment currently done in remote restart may be avoided, in order to have a faster recovery.

Scheduler fencing awareness
It is suggested to allow scheduling according to the following table.

Further fencing aware paths
Are there further paths that should be influenced by the fencing state?

Adding fence-host API
While Nova would not perform fencing actions by itself, it is recommended that Nova be able to manage fencing for the following reasons:
 * 1) As the Nova controller provides an API method for a remote restart, and since fencing is a pre-condition for remote restart, it is recommended that Nova would also provide the user an API for fencing.
 * 2) Such an API would also hide the fencing details, i.e. the combination of several fencing methods, from the user.

The fence-host method would do the following:
 * 1) Check if the host is already in a FENCED state. If so, return success.
 * 2) Check if the host is up. If the host is up, return an error.
 * 3) Issue fence-host-from-storage requests to Cinder.
 * 4) Issue fence-host-from-data-network requests to Neutron.
 * 5) Issue a power fencing request.
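The steps above can be sketched as follows, with the Cinder, Neutron and power requests represented as hypothetical callables (the real calls would be asynchronous RPC/API requests, not direct function calls):

```python
def fence_host(host, state, host_is_up, cinder_fence, neutron_fence, power_fence):
    """Sketch of the fence-host API flow (hypothetical callables)."""
    if state.get(host) == "FENCED":
        return "already-fenced"      # step 1: idempotent success
    if host_is_up(host):
        # step 2: refuse to fence a live host
        raise RuntimeError("host %s is up; refusing to fence" % host)
    cinder_fence(host)               # step 3: fence from storage
    neutron_fence(host)              # step 4: fence from data network
    power_fence(host)                # step 5: power fencing
    return "fencing-issued"
```

Issuing all three fencing requests, rather than stopping at the first success, matches the document's recommendation to combine the fencing methods for reliability and speed.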

It would be possible to fall back to manual fencing (aka meatware), for example, by powering down the host manually. After manual fencing is done, the user would be able to use the API to confirm that the host is fenced. In such a case, Nova would trust the user and change the fencing state into FENCED, with no further actions.

It should be noted that the fencing, as well as fencing sub-tasks performed by other components (e.g. Cinder or Neutron), should be idempotent; thus there is no problem if a host is already fenced. In such a case, the fencing tasks should quickly send a successful acknowledgement for the fencing request.

Changes in fencing state
Once the host power state has changed into powered off, due to a hard reboot or a power off which was called by the fence-host method or directly, the host fencing state would be changed into FENCED. Are there further API commands that affect the fencing state?

If both the storage and the network were fenced, the host fencing state would be changed into FENCED.

After the self fencing timeout has elapsed, the fencing state would be set to FENCED.

Since all the above may happen concurrently, the fencing state would be changed into FENCED after the first successful fencing action, while the rest of the actions would still be done, but would not change the fencing state, since it is already set to FENCED.
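The first-success-wins transition can be sketched as a small guarded state holder (hypothetical; in practice this would be a compare-and-swap on the Nova database record rather than an in-process lock):

```python
import threading

class FencingState:
    """Sketch of the concurrent FENCED transition (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = "UNFENCED"

    def report_success(self):
        """Called by each fencing action (power, storage, network, self
        fencing timeout) on success. Returns True only for the first
        report, which is the one that actually flips the state."""
        with self._lock:
            if self.state == "FENCED":
                return False   # another action already fenced the host
            self.state = "FENCED"
            return True
```

Later reports are harmless no-ops, which is consistent with the idempotency requirement on the fencing sub-tasks.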

Joining back a fenced host
When a fenced host becomes reachable again, the following actions should be done:
 * 1) 	Wait for the self fencing timeout to elapse, if it has not already elapsed. Alternatively, if self fencing is not activated, shut down all instances in case they are not shut down.
 * 2) 	Delete all obsolete instances.
 * 3) 	Reconnect the remaining instances to their volumes.
 * 4) 	Reconnect the network.
 * 5) 	If the above tasks succeeded, change the host's fencing state into UNFENCED.
 * 6) 	Power on the remaining instances.
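The rejoin sequence above can be sketched with each step as a hypothetical callable; any step that raises leaves the host FENCED, which matches the rule that the state changes to UNFENCED only after the preceding tasks succeed:

```python
def rejoin_fenced_host(host, state, wait_self_fence_timeout,
                       delete_obsolete, reattach_volumes,
                       reconnect_network, power_on_instances):
    """Sketch of joining back a fenced host (hypothetical callables)."""
    wait_self_fence_timeout(host)   # 1) or shut down the instances directly
    delete_obsolete(host)           # 2) remove the obsolete instances
    reattach_volumes(host)          # 3) reconnect remaining instances' volumes
    reconnect_network(host)         # 4) reconnect the (data) network
    state[host] = "UNFENCED"        # 5) un-fence only after steps 1-4 succeed
    power_on_instances(host)        # 6) power on the remaining instances
```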

Listening to changes in fencing state
It would be possible to subscribe to notifications about changes in the fencing state. The subscribers would get a notification for such a change, whatever the event, action or action initiator that caused the fencing state change.