* Launchpad Entry: Improve VM State Management to constrain state transitions
Created: 12 Oct 2011
Contributors: Phil Day (HP Cloud Services)
Summary
This blueprint would constrain the valid state transitions to a limited subset, and ensure that the remaining transitions lead to consistent and deterministic behavior.
Specifically:
- Limit the valid operations in each state (for example, only a paused instance can be unpaused)
- Make some minor changes to the state sequence to make the above robust
- Ensure that long-running operations check the current state rather than assuming it is unchanged
Rationale
Current checks on valid state transitions are limited to a few cases, leading to multiple opportunities for non-deterministic behavior. In addition, some long-running tasks can lead to odd behavior: for example, a VM in the Building state can spend a long time in image download, be terminated in the meantime, and still be launched when the image download completes.
Design
VM State is recorded in three instance attributes:
- "power_state": derived from the hypervisor
- "vm_state": changed by Nova code, generally at the start and end of main actions
- "task_state": changed by Nova code to reflect transient steps within an action
For example, the following shows how these state values are updated during a Create action:
| Node      | power_state | vm_state | task_state           |
| API       |             | Building | Scheduling           |
| Scheduler |             | Building | Scheduling           |
| Compute   |             | Building | Networking           |
|           |             | Building | Block_Device_Mapping |
|           |             | Building | Spawning             |
|           | Running     | Active   |                      |
The full set of state transitions will be mapped out and provided back to the documentation team. From those already mapped we can make the following observations:
- Most actions set vm_state and task_state early (in compute/api.py), so in-progress tasks can be determined by task_state != None
- Most actions clear task_state on completion, so many actions can be validated by checking vm_state together with task_state = None
- Always need to leave at least one valid action (terminate)
- Long-running actions (such as image download) should periodically update task_state so users can tell that progress is being made
- Long-running actions should check for and honour state changes (specifically terminated)
- The reported state should be a combination of vm_state and task_state
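The observations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Nova code: the `Instance` class, the `Downloading_Image` task_state value, and the use of vm_state `Deleted` to mean "terminated" are all hypothetical, and a real implementation would re-read the instance from the database rather than share an in-memory object.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


class InstanceTerminated(Exception):
    """Raised when a long-running step notices the instance was terminated."""


@dataclass
class Instance:
    # Hypothetical stand-in for a Nova instance record.
    vm_state: str
    task_state: Optional[str] = None


def in_progress(instance: Instance) -> bool:
    # An action is in progress whenever task_state is set.
    return instance.task_state is not None


def reported_state(instance: Instance) -> str:
    # Combine vm_state and task_state into one user-visible value.
    if instance.task_state is None:
        return instance.vm_state
    return f"{instance.vm_state} ({instance.task_state})"


def download_image(instance: Instance, chunks: Iterable[bytes]) -> int:
    """Consume image chunks, re-checking the instance state before each one."""
    received = 0
    for chunk in chunks:
        # Honour a terminate that arrived while the download was running.
        # Assumption: a terminated instance shows vm_state "Deleted".
        if instance.vm_state == "Deleted":
            raise InstanceTerminated("instance terminated during image download")
        received += len(chunk)
        # Update task_state so users can tell that progress is being made.
        # "Downloading_Image" is a made-up task_state name.
        instance.task_state = f"Downloading_Image ({received} bytes)"
    return received
```

With this shape, a create that is terminated part-way through the download stops at the next chunk boundary instead of going on to launch the VM.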
The initial proposal for valid transitions is as follows:
| vm_state  | task_state    | Valid Actions |
| <Any>     | !=None        | Terminate |
| Active    | Resize_verify | Terminate, Reboot, Stop, Rebuild, Pause, Suspend, Rescue, Create_Snapshot, Resize, Confirm_Resize, Revert_Resize |
| Active    | None          | Terminate, Reboot, Stop, Rebuild, Pause, Suspend, Rescue, Create_Snapshot, Resize |
| Building  | <Any>         | Terminate |
| <Any>     | Terminate     | |
| Paused    | <Any>         | Terminate, Unpause, Rescue |
| Suspended | <Any>         | Terminate, Resume, Rescue |
| Rescued   | <Any>         | Terminate, Reboot, Stop, Rebuild, Pause, Suspend, UnRescue |
| Deleted   | <Any>         | Terminate |
| Stopped   | <Any>         | Terminate, Start |
| Migrating | <Any>         | Terminate |
| Resizing  | <Any>         | Terminate |
| Error     | <Any>         | Terminate |
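As a rough illustration, the proposed table could be encoded as a lookup like the one below. The table does not say which row wins when several match, so this sketch assumes a precedence: an exact (vm_state, task_state) row first, then "any in-progress task allows only Terminate", then the per-vm_state row. The incomplete row in the table is left out, and the names `valid_actions`, `_EXACT` and `_BY_VM_STATE` are invented for this example.

```python
from typing import FrozenSet, Optional

TERMINATE_ONLY = frozenset({"Terminate"})

_ACTIVE = frozenset({
    "Terminate", "Reboot", "Stop", "Rebuild", "Pause", "Suspend",
    "Rescue", "Create_Snapshot", "Resize",
})

# Rows that name a specific task_state.
_EXACT = {
    ("Active", "Resize_verify"): _ACTIVE | {"Confirm_Resize", "Revert_Resize"},
    ("Active", None): _ACTIVE,
}

# Rows whose task_state column is <Any>.
_BY_VM_STATE = {
    "Building": TERMINATE_ONLY,
    "Paused": frozenset({"Terminate", "Unpause", "Rescue"}),
    "Suspended": frozenset({"Terminate", "Resume", "Rescue"}),
    "Rescued": frozenset({
        "Terminate", "Reboot", "Stop", "Rebuild", "Pause", "Suspend", "UnRescue",
    }),
    "Deleted": TERMINATE_ONLY,
    "Stopped": frozenset({"Terminate", "Start"}),
    "Migrating": TERMINATE_ONLY,
    "Resizing": TERMINATE_ONLY,
    "Error": TERMINATE_ONLY,
}


def valid_actions(vm_state: str, task_state: Optional[str]) -> FrozenSet[str]:
    """Return the actions allowed for the given state pair (assumed precedence)."""
    exact = _EXACT.get((vm_state, task_state))
    if exact is not None:
        return exact
    if task_state is not None:
        # Row "<Any> / !=None": an in-progress task allows only Terminate.
        return TERMINATE_ONLY
    # Unknown vm_state values fall back to Terminate, so that at least
    # one valid action always remains.
    return _BY_VM_STATE.get(vm_state, TERMINATE_ONLY)
```

Under these assumptions `valid_actions("Paused", None)` permits Unpause but not Resume, and any state with a task in progress other than Active/Resize_verify permits only Terminate.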
UI Changes
No changes are required to the UI.
Code Changes
The checks for valid actions will be implemented as a decorator, for example:
@check_vm_state("delete")
@scheduler_api.reroute_compute("delete")
def delete(self, context, instance_id):
    """Terminate an instance."""
Some other changes may be required to ensure that vm_state and task_state are set consistently (for example, task_state is currently set to None for a short period during Rebuild, and live_migration doesn't update state at all).
Migration
TBD
Test/Demo Plan
TBD
BoF agenda and discussion
Etherpad from Boston Design Summit