Nova-vm-state-management


 * Launchpad Entry: Improve VM State Management to constrain state transitions
 * Created: 12 Oct 2011
 * Contributors: Phil Day (HP Cloud Services)

Summary
This blueprint would constrain the valid state transitions to a limited subset, and ensure that the remaining transitions lead to consistent and deterministic behavior.

Specifically:
 * 1) Limit the valid operations in each state (for example, only a paused instance can be resumed)
 * 2) Make some minor changes to the state sequence to make the above robust
 * 3) Ensure that long-running operations check the current state rather than assuming it is unchanged
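
Point 1 can be pictured as a lookup from the current vm_state to the set of permitted actions. The sketch below is illustrative only: the state names, the action names and the table contents are assumptions made for the example, not the transition table proposed by this blueprint.

```python
# Hypothetical table: vm_state -> actions allowed in that state.
# The entries are placeholders for illustration, not the proposed mapping.
ALLOWED_ACTIONS = {
    "building": {"delete"},
    "active": {"pause", "reboot", "delete"},
    "paused": {"resume", "delete"},
}


def is_action_allowed(vm_state, action):
    """Return True if `action` is permitted for an instance in `vm_state`."""
    return action in ALLOWED_ACTIONS.get(vm_state, set())
```

With this table, `is_action_allowed("paused", "resume")` is true, while the same request against an "active" instance is rejected.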

Rationale
Current checks on valid state transitions are limited to a few cases, leaving many opportunities for non-deterministic behavior. In addition, some long-running tasks can lead to odd behavior. For example, a VM in the building state can spend a long time in image download, be terminated in the meantime, and still be launched when the download completes.

Design
VM State is recorded in three instance attributes:

 * "power_state": derived from the hypervisor
 * "vm_state": changed by Nova code, generally at the start and end of main actions
 * "task_state": changed by Nova code to reflect transient steps within an action

For example the following shows how these state values are updated during a Create action

The full set of state transitions will be mapped out and provided back to the documentation team. From those already mapped we can make the following observations:


 * Most actions set vm_state and task_state early (in compute/api.py), so in-progress tasks can be determined by task_state != None
 * Most actions clear task_state on completion, so many actions can be checked by a combination of vm_state and task_state == None
 * Every state must leave at least one valid action (terminate)
 * Long-running actions (such as image download) should periodically update task_state so users can tell that progress is being made
 * Long-running actions should check for and honour state changes (specifically termination)
 * The reported state should be a combination of vm_state and task_state
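
The observations about long-running actions and the reported state can be sketched as follows. This is a minimal illustration, not Nova code: the in-memory `Instance` class, the `download_image` function, the `refresh` callable, the chunk-based loop and the state strings (including "deleted") are all assumptions introduced for the example.

```python
class Instance:
    """Hypothetical stand-in for a Nova instance record."""

    def __init__(self, vm_state, task_state=None):
        self.vm_state = vm_state
        self.task_state = task_state


def reported_state(instance):
    """Combine vm_state and task_state into the state shown to users."""
    if instance.task_state is None:
        return instance.vm_state
    return "%s (%s)" % (instance.vm_state, instance.task_state)


def download_image(instance, chunks, refresh):
    """Simulate a long-running download that honours state changes.

    `chunks` is any iterable of work units; `refresh` is a hypothetical
    callable that re-reads the instance's current state (for example from
    the database) and returns it.
    """
    for done, _chunk in enumerate(chunks, start=1):
        instance = refresh(instance)
        if instance.vm_state == "deleted":
            # The instance was terminated mid-download: stop, do not launch.
            return False
        # Publish progress so users can tell the action is still moving.
        instance.task_state = "downloading chunk %d" % done
    instance.task_state = None  # clear task_state on completion
    return True
```

The caller would launch the VM only when `download_image` returns True, which avoids the case where a terminated instance is started once its image arrives.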

The initial proposal for valid transitions is as follows:

UI Changes
No changes are required to the UI.

Code Changes
The checks for valid actions will be implemented as a decorator, for example:

    @check_vm_state("delete")
    @scheduler_api.reroute_compute("delete")
    def delete(self, context, instance_id):
        """Terminate an instance."""
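
One way such a decorator could work is sketched below. This is a hedged illustration rather than the blueprint's implementation: the `VALID_STATES` table, the `InvalidStateError` exception, the `FakeAPI` class and its `_get_instance` lookup are all hypothetical names invented for the example, and the rule that delete is accepted while a task is in progress is an assumption drawn from the observation that terminate should always remain available.

```python
import functools


class InvalidStateError(Exception):
    """Hypothetical error raised when an action is not valid in this state."""


# Hypothetical table: action -> vm_states in which it may start.
VALID_STATES = {
    "delete": {"building", "active", "paused", "error"},
    "resume": {"paused"},
}


def check_vm_state(action):
    """Reject `action` unless the instance is idle and in a valid vm_state."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, context, instance_id, *args, **kwargs):
            instance = self._get_instance(context, instance_id)
            busy = instance["task_state"] is not None
            # Assumption: delete is always allowed, even while a task runs.
            if action != "delete" and busy:
                raise InvalidStateError("%s: task in progress" % action)
            if instance["vm_state"] not in VALID_STATES.get(action, set()):
                raise InvalidStateError(
                    "%s not valid in state %s" % (action, instance["vm_state"]))
            return func(self, context, instance_id, *args, **kwargs)
        return wrapper
    return decorator


class FakeAPI:
    """Hypothetical stand-in for the compute API, backed by a dict."""

    def __init__(self, instances):
        self._instances = instances

    def _get_instance(self, context, instance_id):
        return self._instances[instance_id]

    @check_vm_state("delete")
    def delete(self, context, instance_id):
        """Terminate an instance."""
        self._instances[instance_id]["vm_state"] = "deleted"

    @check_vm_state("resume")
    def resume(self, context, instance_id):
        """Resume a paused instance."""
        self._instances[instance_id]["vm_state"] = "active"
```

Keeping the check in a decorator means each API method declares its action name once, and the table of valid states stays in a single place that can be documented and tested on its own.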

Some other changes may be required to ensure that vm_state and task_state are set consistently. For example, task_state is currently set to None for a short period during Rebuild, and live_migration doesn't update state at all.

Migration
TBD

Test/Demo Plan
TBD

BoF agenda and discussion
Etherpad from Boston Design Summit