CrashUp/Recover From Nova Controlled Operations

Objective:
This page aims to implement recovery actions for Nova's controlled operations. Controlled operations are those that move an instance from an initial state to a final state without any intermediate state. The following operations fall into this category:


 * reboot
 * start
 * stop
 * pause
 * unpause
 * suspend
 * resume

Recovery actions for the remaining nova operations are described on a separate wiki page.

Description:
Recovery actions for nova operations are based on the following inputs:


 * Crashed task state: the task_state value of the instance, taken from the “instances” table of the nova database. Henceforth it is referred to as “crashed_task_state”.
 * Crashed VM state: the vm_state value of the instance, stored in the “instances” table of the nova database. Henceforth it is referred to as “crashed_vm_state”.
 * VM power state: collected from the managed environment, either from the actual host or from the hypervisor manager when the hosts are managed by one. The recovery agent [as described above] collects this information from the managed environment on request. Henceforth it is referred to as “actual_vm_power_state”.
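The three inputs can be grouped into a single record per crashed instance. The following is a minimal sketch in Python, not the actual recovery agent code: the names `RecoveryInputs` and `collect_inputs`, the dictionary shape of the database row, and the power-state callback are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RecoveryInputs:
    """Inputs for one crashed instance (hypothetical container)."""
    crashed_task_state: Optional[str]  # task_state from the "instances" table
    crashed_vm_state: str              # vm_state from the "instances" table
    actual_vm_power_state: str         # reported by the host or hypervisor manager


def collect_inputs(db_row: dict,
                   power_state_lookup: Callable[[str], str]) -> RecoveryInputs:
    """Build the inputs from a database row and a power-state callback.

    `db_row` stands in for a row of the "instances" table, and
    `power_state_lookup` stands in for the recovery agent's query to the
    managed environment. Both are assumptions for this sketch.
    """
    return RecoveryInputs(
        crashed_task_state=db_row.get("task_state"),
        crashed_vm_state=db_row["vm_state"],
        actual_vm_power_state=power_state_lookup(db_row["uuid"]),
    )
```

Keeping the record immutable means the values seen at crash time cannot be overwritten while the recovered state is being worked out.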

For nova, resources means instances, and also images, networks and volumes when they are attached to or associated with an instance. Which resources are affected is largely determined by the crashed_task_state. When resources of other components are affected, the nova recovery agent needs to work in coordination with the recovery agents of those components.

In most cases the crashed_task_state deterministically narrows down the possible states the instance could have reached in the managed environment. Once the power state of the instance in the managed environment is also known, the recovered state can be concluded exactly in most cases.

In certain scenarios the recovered state cannot be determined with confidence. In those cases the instance is synchronized with its power state in the managed environment, and its state in the nova database is updated accordingly. In some of these cases the exact state of the crashed instance could be found in the managed environment, but only with great difficulty. Simply synchronizing with the power state makes it easier to determine a recovered state that is safe enough to use.
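The fallback of synchronizing with the power state can be sketched as a simple lookup. This is only an illustration under stated assumptions: the state names, the mapping itself, the `error` default for unrecognized power states, and the function name `sync_with_power_state` are hypothetical and do not reproduce the table on this page.

```python
# Illustrative mapping from the observed power state to the vm_state
# written back to the nova database. The state names and the mapping are
# assumptions for this sketch, not the table defined on this page.
POWER_TO_VM_STATE = {
    "running": "active",
    "paused": "paused",
    "shutdown": "stopped",
    "suspended": "suspended",
}


def sync_with_power_state(actual_vm_power_state: str) -> dict:
    """Return the database update that matches the observed power state."""
    # Assumption: an unrecognized power state is recorded as "error".
    vm_state = POWER_TO_VM_STATE.get(actual_vm_power_state, "error")
    # Clearing task_state marks the interrupted operation as no longer pending.
    return {"vm_state": vm_state, "task_state": None}
```

For example, an instance found running on its host would be updated with `sync_with_power_state("running")`, which yields a vm_state of `active` and an empty task_state.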

Given the complexity of determining the recovered state, nova control operations are handled in a simple way: the recovered state is mostly synchronized with the power state of the instance in the managed environment. Other nova operations are handled separately, with more inputs considered when concluding the recovered state. The table below lists all nova control operations along with the corresponding task_state, crashed vm_state and final recovered state.

Also see the recovery service for the remaining nova operations here.