CrashUp/Recover From Nova Uncontrolled Operations

Objective:

Aims to implement recovery features for Nova's uncontrolled operations. Unontrolled operations are those operations by which the instances can not transition from start states to final states directly. It passes though one or more intermediate states. In fact these operations usually involve external resources as well such as file, volume, network or some other resources. Following operations comes under these category:

boot
delete
rescue
unrescue
rebuild
snapshot
resize
migrate
backup
restore
shelve
unshelve

Description:

For non-controlled operations, along with crashed task_state, crashed vm_state and power_state of the vm in managed environment, few more parameters are needed to be considered based on how far accurately are you planning to recover. For example if crash happens during rescue operation at the time when the VM is already rescued in managed host but yet to update the status in nova DB, it could have been the ideal solution if the entry in nova db could have been updated accordingly. However determining such status in managed host is not always easy to do.

For these non-controlled nova operations, the recovered states might differ based on the cause of such crashed situation. In case of system/application shutdown, it is relatively easy to find the recovered state. However in case of backup-restore scenarios, it is very hard to determine the recover state optimistically. For example, if an instance was resizing when the backup was taken and after that if it was resized again, it won’t be appropriate to move it resized state according to the earlier resize parameters. Hence for certain cases, the recovered states are determined as a safe solution instead of an approach that gives more accurate for most of the time but incorrect states for sometimes.

Also these types of nova operations, it is not possible to conclude the right end state in case of the backup-recovery cases. After the backup has taken there exists chance that the instance has been gone through further nova operations. Since there is no record of which operations have gone through them, it is inappropriate to recover their states in nova database based on the in-progress operations performed during back-up time. For example, while a VM was rescued, if the back-up was taken and by the time the backed up copy was recovered, there is chance that the same VM might have gone through different other nova operations. Thus it may not be wise to recover the VM based on nova operation that it has gone through during back up.

Hence it is required to track the cause of such crashing i.e. whether due to backup-restore or due to crashing of some nova services. Thus it is important to include a flag say “restore_mode” having boolen values. So when the restore_mode is set to true, states of affected instances are changed as below:

VM Power State at Managed Env	Recovered VM State in Nova DB
RUNNING	active
SHUTOFF	stopped
SHUTDOWN	stopped
PAUSED	paused
SUSPENDED	suspended
OTHERS	error

On the other hand, when recover happens due to abrupt stopping of openstack services, the restore_mode flag is set to false. During that time, it is possible to determine the appropriate recovered states of the instances by verifying status of different other related objects such as network and volume apart from power state of the VM. However the exact status of volume and network objects are not easy to determine based on other conditions such as connectivity. At the other hand, even after knowing their status, sometimes it is required to clean or reestablish the association of network and volume with the VM. So in this version it is targeted to move such VMs to error state although they can be recovered with rework on associated network and volumes in future.

The below table indicates the recovered states with the above approach with the logic of moving the states of VM to ERROR for the cases that need network and volume rework as part of recovery.

Nova Operation	Crashed Task State	Crashed VM State	Actual VM Power State	Recovering Actins	Recovered VM State	Remark
boot	scheduling	building	NIL		error
	block_device_mapping	building	NIL	unmap block_device_mapping if prepared	error
	networking	building	NIL	unmap block_device_mapping and if ip assigned, release it	error
		spawning building	NIL	unmap block_device_mapping and release assigned ip	error
			ACTIVE	assign network and attach volume if required	active
			SHUTDOWN		stopped
delete	deleting	active	ACTIVE		active	discount quota update
		active	NIL	delete network and detach volumes if required	deleted
		stopped	SHUTDOWN		stopped	discount quota update
		stopped	NIL	delete network and detach volumes if required	deleted
	soft-deleting	active	ACTIVE		active	discount quota update
		stopped	SHUTDOWN		stopped	discount quota update
rescue	rescuing	active	RUNNING	delete .rescue file if created	active	Not easy to conclude instance to be "RESCUED" by the presence of .rescue file. Might have been left from previous time.
		active	SHUTDOWN	delete .rescue file if created	stopped
		error	FAILED		error
unrescue	unrescuing	rescued	SHUTDOWN		stopped
		rescued	RESCUED		rescued
		rescued	RUNNING	delete .rescue file if not deleted	active
rebuild	rebuild	active	RUNNING		active
			SHUTDOWN		error
	rebuild_block_device_mapping	active	SHUTDOWN		error
	rebuild_spawning	active	RUNNING		active
			SHUTDOWN		stopped
snapshot	image_snapshot / image_pending_upload / image_uploading/image_backup / image_live_snapshot	active/ stopped/ paused/ suspended	RUNNING/ SHUTDOWN/ PAUSED/ SUSPENDED		active/ stopped/ paused/ suspended	No Change in state
resize	resize_prep	active	RUNNING		active
	resize_migrating	active	RUNNING		error	Started detaching n/w and volumes
			SHUTDOWN		error	no more n/w and volume
	resize_migrated	stopped	SHUTDOWN		error
	resize_finish	stopped	SHUTDOWN		error
			RUNNING		active	May or may not have proper n/w and volume
resize-revert	resize_reverting	resized	RUNNING		active
		resized	SHUTDOWN		error
resize-confirm	resize_confirming	resized	RUNNING		active
		resized	SHUTDOWN		stopped
backup	image_backup	active	RUNNING		active
restore	restoring	active	SHUTDOWN		error
		active	RUNNING		active
migrate	migrating	active	SHUTDOWN		error
		active	RUNNING		active
shelve	shelve	active	RUNNING		active
	shelving_image_pending_upload	active	RUNNING		active
	shelving_image_uploading	active	RUNNING		active
	shelving_offloading	active	RUNNING		active
unshelve	unshelving	active	RUNNING		active

While recovering that have crashed during above non-controlled states, some other resources also might get affected. They are being discussed below in details:

Affected Files: After a rescue operation is successful in hypervisor, it creates a .rescue file. So it could have been easy to determine whether a rescue operation is over or not. But by looking at the presence of a rescue file, it can’t be concluded that the crashed rescue operation is indeed completed. Let us examine the following scenario that executes following tasks in sequence:

Rescue the vm -> creates xxx.rescue file.
Stop the vm.
Start the vm.
Do rescue again and crash it during this operation.

The second rescue operation rewrites the rescued file on the same xxx.rescue file. And it is not conclusive enough to distinguish whether the .rescue file was from the first rescue operation left out or from the second rescue operation. However there would have been an unrescue operation, then it would have cleaned the .rescue file. But that is not guaranteed always.

Affected Images: Similarly during snapshot, shelve and backup operation, an image is getting created in glance repository. However it is hard to find the completion of image operation. This leads to complexity in determining the exact state of the resource and hence difficult to recover optimally leaving no option but to clean it and do fresh.

Affected Volumes: While performing different operations such as rebuild, resize, migrate and their related operations, volumes associated with the VM are detached and attached back one by one. However completion of these attaching or detaching operations is not explicitly marked inside the task itself. To provide ideal solution, it should be checked status of each volume and then performed that attaching or detaching operation on the remaining volumes.

Affected Networks: While performing operations like migration and resize requiring migration when it does not fit in current host, needs to reassigning network IPs on the new host. Whiling doing such operations, if it crashes then these associated networks should be handled accordingly.