CrashUp/Recover From Nova Controlled Operations

Objective:

This page aims to implement recovery actions for Nova's controlled operations. Controlled operations are those that transform instances from an initial state to a final state without any intermediate state. The following operations fall under this category:

reboot
start
stop
pause
unpause
suspend
resume

For the remaining nova operations, recovery actions are described on a separate wiki page.

Description:

Recovery actions on nova operations are based on the following inputs:

Crashed task state: This is the task_state value of the instance in the “instances” table of the nova database. Henceforth it is referred to as “crashed_task_state”.
Crashed VM state: This is the vm_state value of the instance stored in the “instances” table of the nova database. Henceforth it is referred to as “crashed_vm_state”.
VM power state: This is collected from the managed environment, either from the actual host or, when hosts are managed by a hypervisor manager, from that manager. The recovery agent [as described above] collects this information from the managed environment when requested. Henceforth it is referred to as “actual_vm_power_state”.
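
The three inputs above can be pictured as one small record. The following is only an illustrative sketch: the names RecoveryInputs, gather_inputs and query_power_state are hypothetical, the instance row is assumed to be available as a plain mapping with "uuid", "task_state" and "vm_state" keys, and the power-state lookup is passed in as a callable standing in for the recovery agent.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class RecoveryInputs:
    """The three inputs a recovery action is based on (hypothetical container)."""
    crashed_task_state: str      # task_state read from the "instances" table
    crashed_vm_state: str        # vm_state read from the "instances" table
    actual_vm_power_state: str   # power state reported by the managed environment


def gather_inputs(
    instance_row: Mapping[str, str],
    query_power_state: Callable[[str], str],
) -> RecoveryInputs:
    """Combine the database row with the power state reported for the instance.

    query_power_state is an assumed stand-in for the recovery agent's call
    into the host or hypervisor manager; it takes the instance uuid.
    """
    return RecoveryInputs(
        crashed_task_state=instance_row["task_state"],
        crashed_vm_state=instance_row["vm_state"],
        actual_vm_power_state=query_power_state(instance_row["uuid"]),
    )
```

Passing the power-state lookup as a callable keeps the sketch independent of any particular hypervisor interface.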

For nova, resources mean instances, and also images, networks and volumes if they are attached to or associated with the instances. Which resources are affected is largely determined by the crashed_task_state. Where resources owned by other components are affected, the recovery agent of nova needs to work in coordination with the recovery agents of those components.

In most cases, the crashed_task_state deterministically narrows down the possible states the instance could be in within the managed environment. Once the power state of the instance in the managed environment is also known, the recovered state can be determined exactly in most cases.

For certain scenarios, it is not possible to determine the recovered state with confidence. In those cases, an effort is made to sync with the power state of the instance in the managed environment, and the state of the instance in the nova database is updated accordingly. In some cases, the exact state of the crashed instance in the managed environment could be found, but only with great difficulty. Simply syncing with the power state makes it easier to determine a recovered state that is safe enough to use.
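
The sync-with-power-state approach can be sketched as a small function. This is a minimal sketch and not the actual recovery agent: the function name is hypothetical, and both the mapping and the one exception (a rescued instance that is still running stays rescued) are inferred from the table of control operations on this page.

```python
# Power state reported by the managed environment -> vm_state to record in
# the nova database. Inferred from the control-operations table on this page.
POWER_TO_VM_STATE = {
    "ACTIVE": "active",
    "SHUTDOWN": "stopped",
    "PAUSED": "paused",
    "SUSPENDED": "suspended",
    "ERROR": "error",
}


def sync_with_power_state(crashed_vm_state: str, actual_vm_power_state: str) -> str:
    """Return the recovered vm_state by following the actual power state."""
    if actual_vm_power_state not in POWER_TO_VM_STATE:
        raise ValueError(f"unknown power state: {actual_vm_power_state!r}")
    # A rescued instance that is still running is left in the rescued state.
    if actual_vm_power_state == "ACTIVE" and crashed_vm_state == "rescued":
        return "rescued"
    return POWER_TO_VM_STATE[actual_vm_power_state]
```

For example, sync_with_power_state("active", "SHUTDOWN") yields "stopped", matching the stop rows of the table.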

Given the complexity of determining the recovered state, nova control operations are handled in a simple way: recovered states are mostly synchronized with the power state of the instance in the managed environment. Other nova operations are handled separately, with more inputs considered when determining the recovered state. The table below lists all nova control operations along with the corresponding crashed task_state, crashed vm_state, actual power state and final recovered vm_state.

Nova Operation	Crashed Task State	Crashed VM State	Actual VM Power State	Recovered VM State
reboot	rebooting	active	ACTIVE	active
		active	SHUTDOWN	stopped
		stopped	ACTIVE	active
		stopped	SHUTDOWN	stopped
	rebooting_hard	active	ACTIVE	active
		active	SHUTDOWN	stopped
		stopped	ACTIVE	active
		stopped	SHUTDOWN	stopped
stop	powering-off	active	ACTIVE	active
		active	SHUTDOWN	stopped
		rescued	ACTIVE	rescued
		rescued	SHUTDOWN	stopped
		error	SHUTDOWN	stopped
		error	ERROR	error
	stopping	active	ACTIVE	active
		active	SHUTDOWN	stopped
		rescued	ACTIVE	rescued
		rescued	SHUTDOWN	stopped
		error	SHUTDOWN	stopped
		error	ERROR	error
start	powering-on	stopped	SHUTDOWN	stopped
		stopped	ACTIVE	active
	starting	stopped	SHUTDOWN	stopped
		stopped	ACTIVE	active
pause	pausing	active	ACTIVE	active
		active	PAUSED	paused
		rescued	ACTIVE	rescued
		rescued	PAUSED	paused
unpause	unpausing	paused	PAUSED	paused
		paused	ACTIVE	active
suspend	suspending	active	ACTIVE	active
		active	SUSPENDED	suspended
		rescued	ACTIVE	rescued
		rescued	SUSPENDED	suspended
resume	resuming	suspended	SUSPENDED	suspended
		suspended	ACTIVE	active
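
The table above can also be expressed as a lookup keyed by the three inputs. The sketch below is illustrative only: the names RECOVERY_TABLE and recovered_vm_state are hypothetical, and the state strings are copied verbatim from the table rather than taken from nova's own constants.

```python
# (crashed_vm_state, actual_vm_power_state) -> recovered vm_state,
# one mapping per operation, transcribed from the table above.
_REBOOT = {
    ("active", "ACTIVE"): "active",
    ("active", "SHUTDOWN"): "stopped",
    ("stopped", "ACTIVE"): "active",
    ("stopped", "SHUTDOWN"): "stopped",
}
_STOP = {
    ("active", "ACTIVE"): "active",
    ("active", "SHUTDOWN"): "stopped",
    ("rescued", "ACTIVE"): "rescued",
    ("rescued", "SHUTDOWN"): "stopped",
    ("error", "SHUTDOWN"): "stopped",
    ("error", "ERROR"): "error",
}
_START = {
    ("stopped", "SHUTDOWN"): "stopped",
    ("stopped", "ACTIVE"): "active",
}
_PAUSE = {
    ("active", "ACTIVE"): "active",
    ("active", "PAUSED"): "paused",
    ("rescued", "ACTIVE"): "rescued",
    ("rescued", "PAUSED"): "paused",
}
_UNPAUSE = {
    ("paused", "PAUSED"): "paused",
    ("paused", "ACTIVE"): "active",
}
_SUSPEND = {
    ("active", "ACTIVE"): "active",
    ("active", "SUSPENDED"): "suspended",
    ("rescued", "ACTIVE"): "rescued",
    ("rescued", "SUSPENDED"): "suspended",
}
_RESUME = {
    ("suspended", "SUSPENDED"): "suspended",
    ("suspended", "ACTIVE"): "active",
}

# crashed_task_state -> per-operation mapping. Task states that share the
# same rows in the table share the same mapping.
RECOVERY_TABLE = {
    "rebooting": _REBOOT,
    "rebooting_hard": _REBOOT,
    "powering-off": _STOP,
    "stopping": _STOP,
    "powering-on": _START,
    "starting": _START,
    "pausing": _PAUSE,
    "unpausing": _UNPAUSE,
    "suspending": _SUSPEND,
    "resuming": _RESUME,
}


def recovered_vm_state(crashed_task_state, crashed_vm_state, actual_vm_power_state):
    """Look up the recovered vm_state; raise LookupError for unlisted combinations."""
    try:
        return RECOVERY_TABLE[crashed_task_state][(crashed_vm_state, actual_vm_power_state)]
    except KeyError:
        raise LookupError(
            "no recovery rule for "
            f"({crashed_task_state}, {crashed_vm_state}, {actual_vm_power_state})"
        ) from None
```

Combinations that the table does not list raise an error rather than guessing a state, which leaves such cases for manual handling or for the recovery logic of the other nova operations.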

See also the recovery service for the remaining nova operations here.