Fencing Instances of an Unreachable Host

Revision as of 15:29, 22 January 2014 by Ehud Trainin

Abstract

When an OpenStack controller determines that its connection to a physical host is broken, it can restart some or all of the host's instances on other hosts. The new instance (the instance restarted on another host) takes over the identity of the obsolete instance (the instance on the unreachable host), so it has the same attached volumes and the same IP and MAC addresses. OpenStack supports this remote restart operation through a Nova API command called "evacuate" (this document refers to the Nova "evacuate" API as remote restart).

It is important to note that a remote restart may be performed whenever the OpenStack controller decides that the host's connectivity is broken. This implies neither that the host is disconnected from its entire environment nor that it is disconnected permanently. When the perceived disconnection is due to a transient or partial failure, the remote restart might lead to two identical instances running at the same time and conflicting dangerously. For example, the obsolete instance may access the application storage and corrupt data, create an IP address conflict, or communicate with other nodes in a way that disrupts the new instance's communications or creates inconsistent states.

In order to perform a remote restart safely, the obsolete instance must first be fenced, i.e. shut down or isolated.
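The ordering constraint can be sketched as follows. This is only an illustration, not Nova code: `safe_remote_restart`, `fence_host`, `restart_remotely` and `FencingError` are hypothetical names introduced here, and in a real deployment the restart step would go through the Nova "evacuate" API.

```python
class FencingError(Exception):
    """Raised when the obsolete instances could not be confirmed fenced."""


def safe_remote_restart(host, instances, fence_host, restart_remotely):
    """Restart instances elsewhere only after their host has been fenced.

    fence_host and restart_remotely are hypothetical callables supplied by
    the caller. fence_host(host) must return True only when the obsolete
    instances are confirmed shut down or isolated.
    """
    if not fence_host(host):
        # Without confirmed fencing, two identical instances could run
        # at the same time, so refuse to restart anything.
        raise FencingError("host %s could not be fenced" % host)
    return [restart_remotely(instance) for instance in instances]
```

The point of the sketch is that a fencing failure blocks the restart entirely, rather than being logged and ignored.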

The following table shows three fencing approaches. These methods address the case in which not only the instances are unreachable, but also their host is unreachable.

Approach          Initiated by                      Method
Power fencing     OpenStack Controller              Shut down the instances by a power off or a hard/cold reboot of the host
Resource fencing  OpenStack Controller              Isolate the instances from the application storage and from the data network
Self fencing      Nova Compute service on the host  Shut down the instances

For each of these methods, the document describes how it can be implemented in OpenStack, gives requirements and recommendations for the underlying infrastructure (e.g. recommendations for the power system), and analyzes the method's advantages and drawbacks.

Due to infrastructure limitations, only some of the three fencing methods may be available in a given system. Nevertheless, it is recommended to combine all of them, for the following reasons:

  1. Reducing the probability that fencing fails (in particular, a fencing method may be defeated by the same failure that caused the host disconnection).
  2. Reducing the fencing time.
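One possible way to combine the methods is to start all of them at once and treat the host as fenced as soon as any one of them confirms success. The sketch below illustrates this under stated assumptions: `fence_with_any` and the `fencer(host)` callables are hypothetical, and the "first success wins" policy is an assumption of this example rather than something the design prescribes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fence_with_any(host, fencers):
    """Run all fencing methods concurrently.

    fencers maps a method name to a hypothetical callable fencer(host)
    that returns True on confirmed fencing and may raise on failure.
    Returns the name of the first method that reports success, or None
    if every method fails.
    """
    if not fencers:
        return None
    pool = ThreadPoolExecutor(max_workers=len(fencers))
    try:
        futures = {pool.submit(fn, host): name for name, fn in fencers.items()}
        for future in as_completed(futures):
            try:
                if future.result():
                    return futures[future]
            except Exception:
                # A failed method is tolerated as long as another succeeds.
                continue
        return None
    finally:
        # Do not block on slower methods once the outcome is known.
        pool.shutdown(wait=False)
```

Running the methods in parallel addresses both reasons listed above: a method broken by the original failure does not block the others, and the overall fencing time is bounded by the fastest method that works.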

Based on the three fencing approaches, a design of fencing support in OpenStack is given, covering Nova, Cinder and Neutron. The design includes:

  1. Power fencing
  2. Fencing from storage in Cinder
  3. Fencing from network in Neutron
  4. Self fencing by Nova Compute
  5. Fencing awareness and management in Nova controller

It should be noted that this is a working document: while some of the items above (2 and 5) are detailed designs, others (1, 3 and 4) require further elaboration.