Nova state machine simplification
Each VM has a vm_state, a task_state, and a power_state. Their use is complicated: some states are confusing and sometimes ambiguous, and there is no guideline for extending them or adding a new state. This proposal aims to simplify them, and to define precisely what each one means and why it is needed. A new, more user-friendly behavior for deleting a VM is also discussed.
A TL;DR summary:
- power_state is the hypervisor state, loaded “bottom-up” from the compute worker.
- vm_state reflects the stable state based on API calls, matching user expectation, revised “top-down” within the API implementation.
- task_state reflects the transition state introduced by in-progress API calls.
- “hard” delete of a VM should always succeed as long as the DB is available.
- power_state and vm_state may conflict with each other; such conflicts need to be resolved case by case.
power_state should be the state we get by calling the virt driver on a particular VM. The actual state, which lives in the hypervisor, is always authoritative; the power_state in the DB should be viewed as a snapshot of that state in the (recent) past. It can be updated periodically, and should also be updated at the end of a task if the task is supposed to affect power_state.
- How is it updated?
- Always “bottom-up”: a compute worker reports the state, which overrides the field in the DB. The update may trigger a “reconcile” procedure against vm_state (see below).
- Naming convention
- We will stick to the existing names derived from libvirt.
- Obsolete states: BLOCKED, which is essentially RUNNING; SHUTOFF, which is mapped to SHUTDOWN; FAILED, which is mapped to NOSTATE.
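The mapping of obsolete power states and the “bottom-up” update can be sketched as follows. This is only an illustration: the function names and the dict-based instance record are hypothetical, and only the state names and the three mappings come from the text above.

```python
# Illustrative sketch; names other than the state strings are hypothetical.
OBSOLETE_POWER_STATES = {
    "BLOCKED": "RUNNING",   # BLOCKED is essentially RUNNING
    "SHUTOFF": "SHUTDOWN",  # SHUTOFF is mapped to SHUTDOWN
    "FAILED": "NOSTATE",    # FAILED is mapped to NOSTATE
}


def normalize_power_state(reported):
    """Fold an obsolete power_state into its replacement; pass others through."""
    return OBSOLETE_POWER_STATES.get(reported, reported)


def record_power_state(instance, reported):
    """Bottom-up update: the reported value overrides the stored snapshot.

    Returns True when the stored value changed, so that the caller can
    decide whether to trigger a reconcile against vm_state.
    """
    new_state = normalize_power_state(reported)
    changed = instance.get("power_state") != new_state
    instance["power_state"] = new_state
    return changed
```

Returning a "changed" flag is one possible way to let the caller trigger reconciliation only when the snapshot actually moved.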
vm_state should describe a VM’s current stable (not transition) state. That is, if there are no ongoing compute API calls (running tasks), vm_state should reflect what the customer expects the VM to be. A good example of a vm_state is ACTIVE, meaning the VM is running normally. A bad example is SUSPENDING: it is a transition state, meaning that the VM is in the process of suspending and could become suspended at any moment. A transition state belongs in task_state.
- How is it updated?
- vm_state should only be updated at the end of a task, when the task finishes successfully and sets task_state to None. Without API calls, the vm_state should never change. If a task fails but is properly cleaned up (e.g. live migration fails, but the VM is working fine on the source node after rollback), the vm_state should not change. If the task fails and cannot be rolled back, the vm_state is set to ERROR.
- Naming convention: an adjective for vm_state.
- What is the relationship with power_state?
- There is no one-to-one mapping. They represent slightly different information; you cannot infer one from the other, and you need both. For example, after you rescue a VM, the VM is running with the rescue image. The power_state could be either RUNNING or BLOCKED, but vm_state should only be RESCUED. Based on power_state alone, you can’t tell whether to use ACTIVE or RESCUED.
- When power_state and vm_state disagree, how should they be reconciled?
- First, when there is an ongoing task, vm_state and power_state may, and probably will, disagree. This is because vm_state only represents a stable state. During task execution, the VM is in transition and vm_state is stale.
- When there is no task in progress, power_state and vm_state should agree unless errors or failures happen. In those cases, they must be reconciled case by case. For example:
- If power_state=SHUTOFF but vm_state=ACTIVE, it is very likely that a shutdown command was issued inside the VM, so the power_state is accurate. This is roughly equivalent to an implicit stop() API call. vm_state should be revised to STOPPED.
- If power_state=BLOCKED but vm_state=HARD_DELETED, the user has already asked to delete the VM but somehow the process failed. We should try to delete again.
- If power_state=BLOCKED but vm_state=PAUSED, there was probably some unexpected problem during the earlier pause() virt driver call. FIXME: what’s the most user-friendly behavior in this case? Set to ERROR?
- (Right now _sync_power_states does not respect ongoing tasks, which may lead to weird behavior.)
- How do I get EC2 equivalent state from vm_state?
- The EC2 state contains both stable states (e.g. running) and transition states (e.g. pending, shutting-down). You’ll need task_state together with vm_state to deduce the EC2 state.
- vm_state after cleanup:
- INITIALIZED: VM is just created in the database, but has not been built. (was BUILDING)
- ACTIVE: VM is running with the specified image.
- RESCUED: VM is running with the rescue image.
- PAUSED: VM is paused with the specified image.
- SUSPENDED: VM is suspended with the specified image, with a valid memory snapshot.
- STOPPED: VM is not running, and the image is on disk.
- SOFT_DELETED: VM is no longer running on compute, but the disk image remains and can be brought back.
- HARD_DELETED: From the perspective of quota and billing, the VM no longer exists. The VM running on compute will eventually be destroyed, along with its disk images.
- RESIZED: The VM is stopped on the source node but running on the destination node. The VM images exist at two locations (src and dest, with different sizes). The user is expected to confirm the resize or revert it. (Same functionality as the old task_state.RESIZE_VERIFY.)
- ERROR: Some unrecoverable error happened. Only delete is allowed to be called on the VM.
- Obsolete states: REBUILDING, MIGRATING, and RESIZING should all move to task_state.
- SHUTOFF should also be gone. It’s a very confusing state, and should be mapped to STOPPED or DELETED based on the shutdown_terminate flag.
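The reconcile cases listed earlier for power_state and vm_state could be expressed roughly as below. This is a minimal sketch under stated assumptions: the function, the action strings, and the dict-based instance are hypothetical; RUNNING is treated like BLOCKED because the text calls BLOCKED "essentially RUNNING"; and the PAUSED case is left unresolved because the proposal marks it as an open FIXME.

```python
# Illustrative sketch; only the state names come from the proposal.
RUNNING_STATES = {"RUNNING", "BLOCKED"}      # assumption: treated alike
SHUT_DOWN_STATES = {"SHUTDOWN", "SHUTOFF"}   # SHUTOFF is mapped to SHUTDOWN


def reconcile(instance):
    """Reconcile one instance dict and return the action taken (a string)."""
    # While a task is running, vm_state is expected to be stale: do nothing.
    if instance.get("task_state") is not None:
        return "skip"

    power = instance["power_state"]
    vm = instance["vm_state"]

    if vm == "ACTIVE" and power in SHUT_DOWN_STATES:
        # Shutdown issued inside the VM: an implicit stop() API call.
        instance["vm_state"] = "STOPPED"
        return "mark_stopped"
    if vm == "HARD_DELETED" and power in RUNNING_STATES:
        # The user asked for deletion but cleanup failed: try again.
        return "retry_delete"
    if vm == "PAUSED" and power in RUNNING_STATES:
        # Open question in the proposal (possibly ERROR); not decided here.
        return "unresolved"
    return "no_action"
```

Checking task_state first addresses the concern raised above that a sync which ignores ongoing tasks may behave unexpectedly.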
task_state should represent a transition state, and is associated with exactly one compute API, indicating which task the VM is currently running. The exact task_state should not be needed to determine whether a task is allowed on the vm_state state machine; only the fact of whether a task is in progress is needed.
- Special task: force_delete (or hard delete)
- Deleting VMs should always be allowed, and it should always succeed. Afterwards, the user should have more free resources in her quota and should no longer be billed. Unfortunately, a previous task may be stuck so that task_state never goes back to None, the virt driver may get stuck while destroying the VM, or the compute node may be unable to execute the destroy because of network or hardware issues. So we should not wait for the force_delete() task to reach the compute worker before updating vm_state to HARD_DELETED. Instead, vm_state should be updated immediately without going through compute workers. In other words, the force_delete() task works as a pure database operation. The actual cleanup follows immediately, but it is no different from a reconcile procedure between power_state and vm_state, which can also be triggered periodically.
- How is it updated?
- task_state can be set when the task is certain that it is the only running task on the VM. To make the update atomic, a unique task_id (uuid format) must be generated at the beginning and be associated with the VM id. If the VM already has a task_id, another task is in progress. During the task execution, task_id is propagated via the RequestContext data structure to workers. To update task_state in the middle of a task (e.g. to report task progress), one must make sure that the task_id of the VM matches the current task_id. Otherwise, the currently running task has been preempted by another task (right now only possible with a force_delete task). When a task finishes, the task_state is set to None, and the task_id is set to None too.
- Alternatively, since hard delete is the only task that can preempt other tasks, we probably do not need to add task_id right now. But we would then need to check that vm_state is not HARD_DELETED instead of checking whether task_id matches.
- Do we really need to separate vm_state and task_state?
- Technically, vm_state (stable) and task_state (transition) are disjoint, and you could combine them. The biggest benefit of separation is that the state transition diagram is much simpler: you only need to think about a DFA over the stable vm_states, so the state space is much smaller. If a certain extension needs a new task_state, the state transition diagram stays untouched.
- Naming convention
- A verb + “ing” is preferred for a task_state, where the verb is the compute API method. During the task execution, the task_state should never change. To express the progress of the task, a separate field should be used instead, which keeps the state machine simple.
- None: no task is currently in progress
- RESIZE_VERIFY is not a transition state, but a stable state. It’s the new RESIZED state in vm_state now.
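The task_id scheme and the preempting force_delete described above might look like the sketch below. Everything here is hypothetical scaffolding: InstanceRecord is an in-memory stand-in for one VM row, the lock stands in for an atomic DB update, and the method names, the progress field, and the TaskPreempted exception are assumptions made for illustration. Clearing task_id inside force_delete is also an assumption; it is one way to make the preempted task's later checks fail.

```python
import threading
import uuid


class TaskPreempted(Exception):
    """Raised when a task finds that it no longer owns the VM."""


class InstanceRecord:
    """In-memory stand-in for one VM row (hypothetical)."""

    def __init__(self, vm_state="ACTIVE"):
        self.vm_state = vm_state
        self.task_state = None
        self.task_id = None
        self.progress = None
        self._lock = threading.Lock()  # mimics an atomic DB update

    def begin_task(self, task_state):
        """Claim the VM for a task; return a new task_id, or None if refused."""
        with self._lock:
            if self.task_id is not None or self.vm_state == "HARD_DELETED":
                return None
            self.task_id = str(uuid.uuid4())
            self.task_state = task_state
            return self.task_id

    def _check_owner(self, task_id):
        if self.task_id != task_id:
            raise TaskPreempted(task_id)

    def report_progress(self, task_id, progress):
        """Record progress in a separate field, only if still the owner."""
        with self._lock:
            self._check_owner(task_id)
            self.progress = progress

    def finish_task(self, task_id, new_vm_state=None):
        """End the task: set vm_state (if given), clear task_state and task_id."""
        with self._lock:
            self._check_owner(task_id)
            if new_vm_state is not None:
                self.vm_state = new_vm_state
            self.task_state = None
            self.task_id = None

    def force_delete(self):
        """Pure record update: always succeeds and preempts any running task."""
        with self._lock:
            self.vm_state = "HARD_DELETED"
            self.task_state = None
            self.task_id = None
```

A short usage example: a pause task claims the VM, a force_delete preempts it, and the pause task then discovers that it lost ownership.

```python
vm = InstanceRecord()
tid = vm.begin_task("PAUSING")
vm.force_delete()
try:
    vm.finish_task(tid, "PAUSED")
except TaskPreempted:
    pass  # the pause task must stop; vm_state stays HARD_DELETED
```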
- The actual state machine diagram