CrashUp/Recover From Nova Uncontrolled Operations

Objective:
The aim is to implement recovery features for Nova's uncontrolled operations. Uncontrolled operations are those in which an instance cannot transition directly from its start state to its final state; it passes through one or more intermediate states. These operations usually also involve external resources such as files, volumes, networks or other resources. The following operations fall into this category:
 * boot
 * delete
 * rescue
 * unrescue
 * rebuild
 * snapshot
 * resize
 * migrate
 * backup
 * restore
 * shelve
 * unshelve
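
As a minimal sketch, the list above can be kept as a lookup set so that recovery code can tell whether a crashed operation needs the handling described in this page. The names UNCONTROLLED_OPERATIONS and is_uncontrolled are hypothetical and are not part of Nova.

```python
# Hypothetical helper, not part of Nova: the operation names are copied
# from the list above.
UNCONTROLLED_OPERATIONS = frozenset({
    "boot", "delete", "rescue", "unrescue", "rebuild", "snapshot",
    "resize", "migrate", "backup", "restore", "shelve", "unshelve",
})


def is_uncontrolled(operation: str) -> bool:
    """Return True if the operation passes through intermediate states."""
    return operation.strip().lower() in UNCONTROLLED_OPERATIONS
```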

Description:
For non-controlled operations, the task_state and vm_state at the time of the crash and the power_state of the VM in the managed environment are not sufficient; a few more parameters need to be considered, depending on how accurate the recovery is meant to be. For example, if a crash happens during a rescue operation after the VM has already been rescued on the managed host but before the status has been updated in the Nova DB, the ideal solution would be to update the entry in the Nova DB accordingly. However, determining such status on the managed host is not always easy.

For these non-controlled Nova operations, the recovered states may differ depending on the cause of the crash. In case of a system or application shutdown, it is relatively easy to find the recovered state. In backup-restore scenarios, however, it is very hard to determine the recovered state optimistically. For example, if an instance was resizing when the backup was taken and was resized again afterwards, it would not be appropriate to move it to the resized state according to the earlier resize parameters. Hence, for certain cases, the recovered state is chosen as a safe solution rather than by an approach that is more accurate most of the time but occasionally yields an incorrect state.

Also, for these types of Nova operations, it is not possible to conclude the right end state in backup-recovery cases. After the backup has been taken, the instance may have gone through further Nova operations. Since there is no record of those operations, it is inappropriate to recover instance states in the Nova database based on the operations that were in progress at backup time. For example, if the backup was taken while a VM was being rescued, the same VM may have gone through other Nova operations by the time the backed-up copy is restored. Thus it may not be wise to recover the VM based on the Nova operation it was going through during the backup.

Hence it is necessary to track the cause of the crash, i.e. whether it is due to a backup-restore or to the crash of some Nova services. For this, a boolean flag, say “restore_mode”, should be included. When restore_mode is set to true, the states of affected instances are changed as below:

On the other hand, when recovery follows an abrupt stop of OpenStack services, the restore_mode flag is set to false. In that case it is possible to determine the appropriate recovered states of the instances by verifying the status of other related objects, such as network and volume, in addition to the power state of the VM. However, the exact status of volume and network objects is not easy to determine, because it depends on other conditions such as connectivity. Moreover, even when their status is known, it is sometimes necessary to clean up or re-establish the association of the network and volumes with the VM. So in this version the target is to move such VMs to the error state, although in future they could be recovered by reworking the associated networks and volumes.
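
The decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, its parameters and the state strings are hypothetical, and the safe state for restore mode is passed in by the caller because the actual mapping is not reproduced here.

```python
def choose_recovered_state(restore_mode: bool,
                           safe_state: str,
                           verified_state: str,
                           needs_network_or_volume_rework: bool) -> str:
    """Pick the vm_state to record for an instance found mid-operation.

    All parameters are assumptions made for this sketch:
    - safe_state: the conservative state to use after a backup-restore.
    - verified_state: the state derived from the power state and related
      objects after a service crash.
    """
    if restore_mode:
        # Backup-restore: later operations are unknown, so take the safe state.
        return safe_state
    if needs_network_or_volume_rework:
        # Service crash, but network/volume rework is out of scope for now.
        return "error"
    # Service crash and nothing to rework: trust the verified state.
    return verified_state
```

For example, choose_recovered_state(False, "active", "rescued", True) yields "error", while the same call with restore_mode set to True yields the caller-supplied safe state "active".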

The table below indicates the recovered states under the above approach, in which VMs are moved to ERROR in the cases that need network and volume rework as part of recovery.

While recovering instances that crashed during the above non-controlled operations, some other resources might also be affected. They are discussed below in detail:

Affected Files: After a rescue operation succeeds in the hypervisor, it creates a .rescue file. This might appear to make it easy to determine whether a rescue operation is complete. But the presence of a .rescue file alone does not prove that the crashed rescue operation actually completed. Consider the following scenario, which executes these tasks in sequence:
 * Rescue the VM -> creates xxx.rescue file.
 * Stop the VM.
 * Start the VM.
 * Do rescue again and crash it during this operation.

The second rescue operation writes to the same xxx.rescue file. So it is not possible to tell whether the .rescue file was left over from the first rescue operation or written by the second one. Had there been an unrescue operation in between, it would have cleaned up the .rescue file, but that is not always guaranteed.
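
A minimal sketch of what a file check can and cannot conclude is given below. It assumes the .rescue file lives in a per-instance directory; the function name, the directory layout and the returned labels are hypothetical.

```python
from pathlib import Path


def rescue_file_evidence(instance_dir: Path) -> str:
    """Classify what the instance directory says about a crashed rescue.

    Assumption for this sketch: a successful rescue leaves a file ending
    in ".rescue" directly inside instance_dir.
    """
    has_rescue_file = any(p.is_file() for p in instance_dir.glob("*.rescue"))
    if not has_rescue_file:
        # No file at all: the crashed rescue cannot have completed.
        return "rescue-not-completed"
    # A file exists, but it may be a leftover from an earlier rescue that
    # was never unrescued, so completion cannot be concluded from it.
    return "inconclusive"
```

The only firm answer comes from the absence of the file; its presence still has to be combined with other evidence before the instance is marked as rescued.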

Affected Images: Similarly, during snapshot, shelve and backup operations, an image is created in the Glance repository. However, it is hard to determine whether the image operation completed. This makes it complex to determine the exact state of the resource and hence difficult to recover optimally, leaving no option but to clean it up and start afresh.

Affected Volumes: While performing operations such as rebuild, resize, migrate and their related operations, the volumes associated with the VM are detached and attached back one by one. However, completion of these attach or detach operations is not explicitly marked inside the task itself. An ideal solution would check the status of each volume and then perform the attach or detach operation on the remaining volumes.
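
The per-volume check could look roughly like the sketch below. It assumes the caller has already collected each volume's status into a mapping and that an attached volume reports "in-use" while a detached one reports "available"; the function name and these status strings are assumptions of this sketch, not a statement of how Nova does it.

```python
from typing import Dict, List


def remaining_volumes(statuses: Dict[str, str], operation: str) -> List[str]:
    """Return the volume IDs on which the interrupted operation must still run.

    statuses maps volume ID -> status string (assumed: "in-use" when
    attached, "available" when detached). operation is "attach" or "detach".
    """
    if operation not in ("attach", "detach"):
        raise ValueError("operation must be 'attach' or 'detach'")
    done_status = "in-use" if operation == "attach" else "available"
    # Any volume not yet in the finished status still needs the operation.
    return [vol_id for vol_id, status in statuses.items()
            if status != done_status]
```

For instance, with statuses {"vol-1": "in-use", "vol-2": "available"} and an interrupted attach, only "vol-2" is returned for re-attachment.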

Affected Networks: Operations such as migration, and resize when it requires migration because the instance does not fit on the current host, need to reassign network IPs on the new host. If a crash happens during such operations, the associated networks should be handled accordingly.