CrashUp/Recover From Nova Uncontrolled Operations

Objective:

The aim is to implement recovery features for Nova's uncontrolled operations. Uncontrolled operations are operations in which an instance cannot transition directly from its start state to its final state; it passes through one or more intermediate states. These operations usually also involve external resources such as files, volumes, networks or other resources. The following operations fall into this category:

  • boot
  • delete
  • rescue
  • unrescue
  • rebuild
  • snapshot
  • resize
  • migrate
  • backup
  • restore
  • shelve
  • unshelve

Description:

For non-controlled operations, in addition to the crashed task_state, the crashed vm_state and the power_state of the VM in the managed environment, a few more parameters need to be considered, depending on how accurate the recovery is meant to be. For example, if a crash happens during a rescue operation after the VM has already been rescued on the managed host but before its status is updated in the Nova DB, the ideal solution would be to update the Nova DB entry accordingly. However, determining such a status on the managed host is not always easy.

For these non-controlled Nova operations, the recovered state may differ depending on the cause of the crash. In the case of a system or application shutdown, it is relatively easy to find the recovered state. In backup-restore scenarios, however, it is very hard to determine the optimal recovered state. For example, if an instance was resizing when the backup was taken and was resized again afterwards, it would not be appropriate to move it to the resized state according to the earlier resize parameters. Hence, for certain cases, the recovered state is chosen as a safe solution rather than by an approach that is more accurate most of the time but sometimes yields incorrect states.

Also, for these types of Nova operations, it is not possible to conclude the right end state in backup-recovery cases. After the backup has been taken, the instance may have gone through further Nova operations. Since there is no record of those operations, it is inappropriate to recover the instance's state in the Nova database based on the operations that were in progress at backup time. For example, if a backup was taken while a VM was being rescued, the same VM may have gone through other Nova operations by the time the backed-up copy is restored. Thus it may not be wise to recover the VM based on the Nova operation that was in progress during the backup.

Hence it is necessary to track the cause of the crash, i.e. whether it was due to a backup-restore or to the crash of some Nova services. For this purpose a flag, say “restore_mode”, with boolean values is introduced. When restore_mode is set to true, the states of affected instances are changed as below:

VM Power State at Managed Env | Recovered VM State in Nova DB
RUNNING | active
SHUTOFF | stopped
SHUTDOWN | stopped
PAUSED | paused
SUSPENDED | suspended
OTHERS | error
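
The mapping above can be sketched as a small lookup. This is only an illustration, assuming power states are available as plain strings; the names RESTORE_MODE_STATES and recovered_vm_state are hypothetical and are not part of Nova.

```python
# Hypothetical sketch: these names are illustrative, not Nova APIs.
RESTORE_MODE_STATES = {
    "RUNNING": "active",
    "SHUTOFF": "stopped",
    "SHUTDOWN": "stopped",
    "PAUSED": "paused",
    "SUSPENDED": "suspended",
}


def recovered_vm_state(power_state: str) -> str:
    """Return the VM state to record in the Nova DB when restore_mode is true."""
    # Any power state not listed in the table falls into the OTHERS row.
    return RESTORE_MODE_STATES.get(power_state.upper(), "error")
```

Keeping the table as data rather than as a chain of conditionals makes it easy to compare against the specification row by row.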

On the other hand, when recovery follows an abrupt stop of OpenStack services, the restore_mode flag is set to false. In that case it is possible to determine the appropriate recovered state of an instance by checking the status of related objects such as networks and volumes, in addition to the power state of the VM. However, the exact status of volume and network objects is not easy to determine, because it depends on other conditions such as connectivity. Moreover, even when their status is known, it is sometimes necessary to clean up or re-establish the association of networks and volumes with the VM. So in this version such VMs are moved to the error state, although in future they could be recovered by reworking the associated networks and volumes.

The table below lists the recovered states under this approach, where the VM is moved to ERROR in the cases that need network and volume rework as part of recovery.


Nova Operation | Crashed Task State | Crashed VM State | Actual VM Power State | Recovering Actions | Recovered VM State | Remark
boot | scheduling | building | NIL | - | error | -
boot | block_device_mapping | building | NIL | unmap block_device_mapping if prepared | error | -
boot | networking | building | NIL | unmap block_device_mapping and, if IP assigned, release it | error | -
boot | spawning | building | NIL | unmap block_device_mapping and release assigned IP | error | -
boot | spawning | building | ACTIVE | assign network and attach volume if required | active | -
boot | spawning | building | SHUTDOWN | - | stopped | -
delete | deleting | active | ACTIVE | - | active | discount quota update
delete | deleting | active | NIL | delete network and detach volumes if required | deleted | -
delete | deleting | stopped | SHUTDOWN | - | stopped | discount quota update
delete | deleting | stopped | NIL | delete network and detach volumes if required | deleted | -
delete | soft-deleting | active | ACTIVE | - | active | discount quota update
delete | soft-deleting | stopped | SHUTDOWN | - | stopped | discount quota update
rescue | rescuing | active | RUNNING | delete .rescue file if created | active | Not easy to conclude that the instance is "RESCUED" from the presence of the .rescue file; it might have been left over from a previous rescue.
rescue | rescuing | active | SHUTDOWN | delete .rescue file if created | stopped | -
rescue | rescuing | error | FAILED | - | error | -
unrescue | unrescuing | rescued | SHUTDOWN | - | stopped | -
unrescue | unrescuing | rescued | RESCUED | - | rescued | -
unrescue | unrescuing | rescued | RUNNING | delete .rescue file if not deleted | active | -
rebuild | rebuild | active | RUNNING | - | active | -
rebuild | rebuild | active | SHUTDOWN | - | error | -
rebuild | rebuild_block_device_mapping | active | SHUTDOWN | - | error | -
rebuild | rebuild_spawning | active | RUNNING | - | active | -
rebuild | rebuild_spawning | active | SHUTDOWN | - | stopped | -
snapshot | image_snapshot / image_pending_upload / image_uploading / image_backup / image_live_snapshot | active / stopped / paused / suspended | RUNNING / SHUTDOWN / PAUSED / SUSPENDED | - | active / stopped / paused / suspended | No change in state
resize | resize_prep | active | RUNNING | - | active | -
resize | resize_migrating | active | RUNNING | - | error | Started detaching n/w and volumes
resize | resize_migrating | active | SHUTDOWN | - | error | No more n/w and volume
resize | resize_migrated | stopped | SHUTDOWN | - | error | -
resize | resize_finish | stopped | SHUTDOWN | - | error | -
resize | resize_finish | stopped | RUNNING | - | active | May or may not have proper n/w and volume
resize-revert | resize_reverting | resized | RUNNING | - | active | -
resize-revert | resize_reverting | resized | SHUTDOWN | - | error | -
resize-confirm | resize_confirming | resized | RUNNING | - | active | -
resize-confirm | resize_confirming | resized | SHUTDOWN | - | stopped | -
backup | image_backup | active | RUNNING | - | active | -
restore | restoring | active | SHUTDOWN | - | error | -
restore | restoring | active | RUNNING | - | active | -
migrate | migrating | active | SHUTDOWN | - | error | -
migrate | migrating | active | RUNNING | - | active | -
shelve | shelve | active | RUNNING | - | active | -
shelve | shelving_image_pending_upload | active | RUNNING | - | active | -
shelve | shelving_image_uploading | active | RUNNING | - | active | -
shelve | shelving_offloading | active | RUNNING | - | active | -
unshelve | unshelving | active | RUNNING | - | active | -
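
One possible way to encode rows of this table is a dictionary keyed on (crashed task state, crashed VM state, actual power state). The sketch below transcribes only the rescue and unrescue rows. The names Recovery, RULES and lookup are hypothetical, and the fallback to error for combinations not listed is an assumption made for the example, not something the table states.

```python
from typing import NamedTuple, Optional


class Recovery(NamedTuple):
    action: Optional[str]  # recovering action from the table, if any
    vm_state: str          # recovered VM state


# Hypothetical encoding; only the rescue and unrescue rows are transcribed.
RULES = {
    ("rescuing", "active", "RUNNING"): Recovery("delete .rescue file if created", "active"),
    ("rescuing", "active", "SHUTDOWN"): Recovery("delete .rescue file if created", "stopped"),
    ("rescuing", "error", "FAILED"): Recovery(None, "error"),
    ("unrescuing", "rescued", "SHUTDOWN"): Recovery(None, "stopped"),
    ("unrescuing", "rescued", "RESCUED"): Recovery(None, "rescued"),
    ("unrescuing", "rescued", "RUNNING"): Recovery("delete .rescue file if not deleted", "active"),
}


def lookup(task_state: str, vm_state: str, power_state: str) -> Recovery:
    """Return the recovery rule for a crashed instance."""
    # Assumption: combinations missing from the table fall back to error.
    return RULES.get((task_state, vm_state, power_state), Recovery(None, "error"))
```

For example, lookup("rescuing", "active", "SHUTDOWN") yields the recovered state stopped together with the action of deleting the .rescue file if it was created.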

When recovering instances that crashed during the above non-controlled operations, some other resources might also be affected. They are discussed in detail below:

Affected Files: After a rescue operation succeeds in the hypervisor, it creates a .rescue file, so it might seem easy to determine whether a rescue operation has finished. However, the presence of a .rescue file does not prove that the crashed rescue operation actually completed. Consider the following scenario, which executes these tasks in sequence:

  1. Rescue the VM -> creates xxx.rescue file.
  2. Stop the VM.
  3. Start the VM.
  4. Rescue the VM again and crash during this operation.

The second rescue operation writes to the same xxx.rescue file. It is therefore not possible to tell whether the .rescue file was left over from the first rescue operation or created by the second. An intervening unrescue operation would have cleaned up the .rescue file, but that is not always guaranteed.
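
Because the file's presence is inconclusive, the recovery table simply deletes the .rescue file if it exists rather than inferring a rescued state from it. A minimal sketch of that cleanup, assuming the path of the file is already known; discard_rescue_file is a hypothetical helper, not a Nova function.

```python
from pathlib import Path


def discard_rescue_file(rescue_file: Path) -> bool:
    """Delete a leftover .rescue file; return True if one was removed."""
    # The result is informational only: a removed file does not prove
    # that the crashed rescue operation had completed.
    try:
        rescue_file.unlink()
    except FileNotFoundError:
        return False
    return True
```

The return value is useful for logging, but the recovered VM state should still be chosen from the power state, as in the table.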

Affected Images: Similarly, during snapshot, shelve and backup operations, an image is created in the Glance repository. However, it is hard to determine whether the image operation completed. This makes it difficult to determine the exact state of the resource and hence to recover optimally, leaving no option but to clean it up and start afresh.

Affected Volumes: While performing operations such as rebuild, resize, migrate and their related operations, the volumes associated with the VM are detached and attached back one by one. However, the completion of these attach or detach operations is not explicitly marked inside the task itself. An ideal solution would check the status of each volume and then perform the attach or detach operation on the remaining volumes.
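
The ideal approach described above, which this version does not implement, could look roughly like the following for the attach case. It is a sketch under stated assumptions: get_status and attach are caller-supplied callables standing in for whatever volume API is used, and the status string "in-use" is assumed to mark a volume whose attach already completed. The name reattach_remaining is hypothetical.

```python
from typing import Callable, Iterable, List


def reattach_remaining(
    volume_ids: Iterable[str],
    get_status: Callable[[str], str],
    attach: Callable[[str], None],
) -> List[str]:
    """Attach every volume that is not yet attached; return the ids acted on."""
    attached_now = []
    for volume_id in volume_ids:
        # Assumption: "in-use" means the attach finished before the crash.
        if get_status(volume_id) == "in-use":
            continue
        attach(volume_id)
        attached_now.append(volume_id)
    return attached_now
```

A detach pass would follow the same shape with the status check inverted.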

Affected Networks: Operations such as migration, and resize when it requires migration because the instance does not fit on the current host, need to reassign network IPs on the new host. If a crash occurs during such an operation, the associated networks should be handled accordingly.