CrashUp/CrashUP: A Recovery Service for Openstack

Objective
Recover OpenStack resources that are held by long-pending in-progress tasks. An OpenStack resource can be any object used in OpenStack, such as an instance, image, volume, or network. An in-progress task can be any task executed as part of an OpenStack operation, such as a nova, glance, cinder, or neutron operation.

Requirement Analysis
While an OpenStack operation executes, it may lock the underlying resource, depending on the task being performed. During that time, no other OpenStack operation can be performed on the same resource if the resource is needed exclusively. This becomes an issue if an operation remains in the same task for a long period. Such situations usually arise in the following two scenarios:
 * 1) The service performing the operation crashes and, when it comes up again, does not get the appropriate message back from the message queue.
 * 2) A backup is taken while an operation is in progress. On restore, the database entry is left in the same in-progress state. Although the corresponding service later starts fresh, it treats the resource as occupied by the in-progress task.

To recover resources that are blocked by long-pending in-progress tasks, the states of the resources should be moved to the desired final states, and the corresponding task_states should be dissociated from the resources. This state transformation should be based on the following information:
 * 1) The resource state in the OpenStack tables
 * 2) The task being performed on that resource
 * 3) The actual resource status in the managed environment

Also, in some complex scenarios, the recovery operation might involve reassigning or releasing other related resources.
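
The state transformation described above can be sketched as a small decision function. This is only a minimal illustration: the function name, the state strings, and the rule "trust the managed environment, otherwise mark as error" are assumptions made for this example and are not taken from the CrashUp design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecoveryDecision:
    final_state: str           # state to write back to the OpenStack table
    task_state: Optional[str]  # None means the task has been dissociated


def decide(db_state: str,
           task_state: Optional[str],
           actual_status: Optional[str]) -> RecoveryDecision:
    """Combine the three inputs into a target database state.

    db_state      -- resource state in the OpenStack tables
    task_state    -- task currently recorded against the resource
    actual_status -- status seen in the managed environment,
                     or None if the resource was not found there
    """
    if task_state is None:
        # No pending task: nothing to recover, keep the database state.
        return RecoveryDecision(db_state, None)
    if actual_status is None:
        # Hypothetical rule: the resource cannot be found, so flag it.
        return RecoveryDecision("error", None)
    # Hypothetical rule: adopt the status reported by the environment
    # and clear the stuck task.
    return RecoveryDecision(actual_status, None)
```

For example, decide("building", "spawning", "active") moves the entry to "active" and clears the task. A real rule set would need one branch per task_state rather than a single generic rule.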

Architecture
To achieve the objectives discussed earlier, we need to check whether an operation has been stuck for a long period. This check should run continuously at a reasonable interval, so it should be executed as a service, which we name the “CrashUp Service”. Among all in-progress tasks, the service must pick out the candidates for recovery, namely those that have been in this state for a long period. To meet these requirements, the following two nova flags are defined:


 * Recovery_interval_period: the interval, in seconds, at which the recovery service checks the in-progress tasks.
 * Recovery_stale_period: the time, in seconds, after which a resource is treated as a candidate for recovery, measured from the last update/create time of the resource in the OpenStack table.
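
A minimal sketch of how the two flags could drive the check is given below. The row layout (dictionaries with task_state, updated_at, and created_at keys) and the function names are assumptions for illustration only. The parameters interval_period and stale_period stand for Recovery_interval_period and Recovery_stale_period.

```python
import time
from datetime import datetime, timedelta


def last_touched(resource):
    # Use the update time when present, otherwise the create time.
    return resource.get("updated_at") or resource["created_at"]


def find_candidates(resources, now, stale_period):
    """Return in-progress resources untouched for more than stale_period seconds."""
    limit = timedelta(seconds=stale_period)
    return [r for r in resources
            if r.get("task_state") is not None
            and now - last_touched(r) > limit]


def run_checks(fetch, recover, interval_period, stale_period, cycles,
               now=datetime.now, sleep=time.sleep):
    """Run the stale check `cycles` times, sleeping interval_period seconds
    after each pass.

    fetch and recover are hypothetical callables supplied by the caller:
    fetch() returns the current rows, recover(row) acts on one candidate.
    """
    for _ in range(cycles):
        for resource in find_candidates(fetch(), now(), stale_period):
            recover(resource)
        sleep(interval_period)
```

A real service would loop indefinitely. The cycles argument and the injectable now and sleep are only there so that the sketch can be exercised without waiting.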

Recovery operations are classified into categories based on the OpenStack operation that made recovery necessary. An OpenStack operation has one or more task_states that are performed sequentially. While executing these task_states, it moves the underlying resource from one state to the next, and it can also return the resource to its original state in case of failure. However, a single task_state is always mapped to only one OpenStack operation. In other words, no two OpenStack operations can have the same task_state. Some operations also involve multiple resources, such as an instance, image, volume, and network. Recovery operations are performed based on several pieces of state information: the current task state, the state of the resource in the OpenStack tables, and the state of that resource in the managed environment. The recovery scenarios are divided by task state, or more generally by the type of operation performed, as follows:


 * 1) Nova Recovery
 * 2) Glance Recovery
 * 3) Network Recovery
 * 4) Volume Recovery
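
Because each task_state belongs to exactly one operation, the category can be looked up directly from the task_state. The sketch below shows the idea with a plain dictionary. The task_state names and their assignment to categories are illustrative assumptions, not a list taken from the CrashUp design.

```python
# Hypothetical mapping: each task_state appears once, so it can only
# ever resolve to a single recovery category.
TASK_STATE_CATEGORY = {
    "spawning": "nova",
    "rebooting": "nova",
    "image_snapshot": "glance",
    "networking": "network",
    "block_device_mapping": "volume",
}


def recovery_category(task_state):
    """Return the recovery category that handles the given task_state."""
    try:
        return TASK_STATE_CATEGORY[task_state]
    except KeyError:
        raise ValueError("no recovery category for task_state %r" % task_state)
```

Raising on an unknown task_state, rather than guessing a category, keeps the service from acting on a resource it does not understand.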

Before discussing the different types of operations in detail, we need to discuss how information is obtained from the managed environment. Nova recovery needs the state of the instance from the hypervisor or hypervisor manager in the managed environment; glance recovery needs the status of images in the glance repository file system; network recovery needs the status of the deployed network; and volume recovery needs the status of the volume. As of now, this information is not readily available from any service, so a separate module must be implemented to gather information about such crashed resources from the managed environment. Let us call this module the “Health Collector”. It collects information about instances from the hypervisor/hypervisor manager, the status of images from the file system, the status of the network deployed on the managed host, and the status of volumes attached to an instance.
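
One possible shape for the Health Collector is a registry of per-resource-type probes. The sketch below is an assumption made for illustration: the class layout, the probe signature, and the idea that an image is stored as a file named after its id are all hypothetical and do not describe the real glance repository layout.

```python
import os
from typing import Callable, Dict, Optional

# A probe takes a resource id and returns its status in the managed
# environment, or None if the resource is not found there.
Probe = Callable[[str], Optional[str]]


class HealthCollector:
    """Dispatches status queries to one probe per resource type."""

    def __init__(self) -> None:
        self._probes: Dict[str, Probe] = {}

    def register(self, resource_type: str, probe: Probe) -> None:
        self._probes[resource_type] = probe

    def collect(self, resource_type: str, resource_id: str) -> Optional[str]:
        probe = self._probes.get(resource_type)
        if probe is None:
            raise KeyError("no probe registered for %r" % resource_type)
        return probe(resource_id)


def make_image_probe(image_dir: str) -> Probe:
    """Build a probe that looks for an image file in image_dir.

    Assumes, for illustration only, one file per image named by its id.
    """
    def probe(image_id: str) -> Optional[str]:
        path = os.path.join(image_dir, image_id)
        return "present" if os.path.isfile(path) else None
    return probe
```

Probes for instances, networks, and volumes would be registered the same way, each wrapping whatever hypervisor or host query is available in the deployment.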



The above picture shows how the participating OpenStack components interact with the Recovery service to perform the desired recovery actions.

Recovery Approach
Recovery in the CrashUp Service follows the guidelines below:


 * 1) It plays safe by not deleting or modifying the managed environment; instead, it rectifies the entry in its database. However, if the status of the resource is not stable, the resource might need to be rectified or deleted, depending on the scenario.
 * 2) If a resource is found in the managed environment but sufficient metadata is not available to populate the corresponding database entry, the user should be alerted about the finding.
 * 3) Appropriate notifications must be sent to the concerned services whenever the status of a resource changes in the database. When recovery cannot be performed appropriately, the resource is moved to the “ERROR” state, and an additional alert might need to be sent to the message queue for others to consume.
 * 4) In some scenarios, the recovery agent cannot fetch information about the crashed resource, for reasons such as a network failure or a powered-off host. It should then keep retrying for a fixed time at a regular predefined interval and act on those resources once it can collect information about them. For this it uses the following two flags, which users can configure according to their requirements:


 * recovery_resync_count: the number of times the check should be retried from now.
 * recovery_resync_interval: the time, in seconds, between checks of whether a disconnected host has reconnected, so that the instances on that host can be recovered.
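
The retry behaviour controlled by these two flags could look like the sketch below. It is a minimal illustration under stated assumptions: the function name is hypothetical, the collect callable is supplied by the caller, and an unreachable host is assumed to surface as Python's built-in ConnectionError.

```python
import time


def collect_with_resync(collect, resync_count, resync_interval,
                        sleep=time.sleep):
    """Call collect() up to resync_count times.

    Waits resync_interval seconds between attempts and returns the first
    successful result, or None if every attempt failed.
    """
    for attempt in range(resync_count):
        try:
            return collect()
        except ConnectionError:
            # Host unreachable: wait and retry unless this was the last attempt.
            if attempt < resync_count - 1:
                sleep(resync_interval)
    return None
```

When None is returned, the caller decides what to do next, for example moving the resource to the “ERROR” state as described in guideline 3.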