VMStateCleanupService
Revision as of 12:58, 9 May 2012
- Launchpad Entry: NovaSpec:compute-instance-cleanup-service
- Created: 09 May 2012
- Contributors: MandarVaze
Summary
Clean up VM instances that get stuck during a specific operation.
Release Note
Rationale
Nova operations depend on various Nova services, as well as external components like the DB and RabbitMQ. During the lifecycle of an operation like create/delete, if one of these components goes down, the status of the instance remains stuck.
The user is unable to recover an instance from such a state. Some states prevent deletion of such an instance, resulting in "hung" instances that continue to consume resources.
Several bugs where an instance gets stuck are associated with this blueprint: https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
User stories
A user is unable to access or delete VM instances that are stuck.
Goal
There needs to be a cleanup service that will identify such instances and fix their status.

- Worst case: mark vm_state as ERROR (so that the user can delete the VM and reclaim the resources)
- Best case: roll it back to the ACTIVE state (see review comments at https://review.openstack.org/#/c/6632/)
Issues
- There is no way to determine whether an instance is stuck, since there are no well-defined timeouts per operation
- Status of the VM can be derived only from the combination of vm_state and task_state. Unfortunately, this combination does not provide enough granularity to determine at what stage the instance was stuck. This could be useful during recovery.
Assumptions
- All the Nova processes, as well as third-party components like the DB and RabbitMQ, are up and running when the cleanup service is invoked. (Otherwise the cleanup tasks might fail.)
- Hence it must be a separate script, to be invoked manually.
  - A periodic task in Compute might be overkill; besides, if other services are still down, the effort would be wasted repeatedly
- Executed on the Nova Compute host
  - It would perform RPC to Nova Network and Nova Volume when needed. All other operations are done locally.
  - A Compute API can be added, which can be invoked from remote machines if needed
Design
Define Timeouts
Introduce a set of pre-defined "max time allowed for operation" values as a lookup table. These can be overridden from nova.conf, e.g.:
```python
cfg.IntOpt('buildserver_maxtime',
           default=3600,   # one hour
           help='How long can a VM stay in the building server state? (in seconds)'),
cfg.IntOpt('snapshot_maxtime',
           default=43200,  # 12 hours
           help='How long can a VM stay in the snapshotting state? (in seconds)'),
```
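If options like these were registered, operators could override them from nova.conf. A hypothetical example (the option names follow the definitions above, the values are illustrative):

```ini
[DEFAULT]
# Hypothetical overrides for the cleanup-service timeouts, in seconds
buildserver_maxtime = 7200
snapshot_maxtime = 21600
```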
The cleanup service will use these values together with the "time since last update" to determine the "stuck" VMs.
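A minimal sketch of that check, assuming a hypothetical timeout table and instances represented as dicts (the real service would load both from Nova's config and DB layers):

```python
import time

# Hypothetical timeout table; in Nova these values would come from the
# cfg.IntOpt options sketched above (e.g. buildserver_maxtime).
TASK_TIMEOUTS = {
    'building': 3600,         # max seconds allowed in the build state
    'image_snapshot': 43200,  # max seconds allowed while snapshotting
}

def find_stuck_instances(instances, now=None):
    """Return instances whose task_state has exceeded its allowed timeout.

    Each instance is assumed to be a dict with 'task_state' and
    'updated_at' (a Unix timestamp) keys.
    """
    now = now or time.time()
    stuck = []
    for inst in instances:
        task_state = inst.get('task_state')
        if task_state is None:
            continue  # no task in progress, nothing to clean up
        allowed = TASK_TIMEOUTS.get(task_state)
        if allowed is not None and now - inst['updated_at'] > allowed:
            stuck.append(inst)
    return stuck
```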
Define Granular Task Substates
Code changes from https://github.com/maoy/nova/tree/orchestration (Related to http://wiki.openstack.org/TransactionalTaskManagement) provide useful mechanism to capture additional details regarding the task_states.
We need checkpoints like
```python
task.update_task_info(context, "api.scheduling.start")
task.update_task_info(context, "api.scheduling.end")
task.update_task_info(context, "scheduler.run_instance.start")
task.update_task_info(context, "scheduler.run_instance.end")
task.update_task_info(context, "compute.networking.start")
```
at various places during the operation.
The general format is <NovaProcess>.<subtask>.<start/end>. This is based on the checkpoints currently used by the notification service.
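The checkpoint string can be split back into its three parts; a small illustrative parser (the function name is hypothetical, not part of the orchestration branch):

```python
def parse_task_info(task_info):
    """Split a '<NovaProcess>.<subtask>.<start/end>' checkpoint string.

    Returns a (process, subtask, phase) tuple, e.g.
    'compute.networking.start' -> ('compute', 'networking', 'start').
    """
    parts = task_info.split('.')
    if len(parts) != 3:
        raise ValueError('unexpected checkpoint format: %s' % task_info)
    return tuple(parts)
```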
Need to understand how to get the task_info for a specific instance (based on context?)
Cleanup Service - WORK IN PROGRESS
- Get the list of stuck instances: task_state is NOT None and time_since_last_update > the allowed timeout for that task_state
- Cleanup for a stuck Create Server operation:
  - If task_info begins with `api` or `scheduler`:
    - Failed too early; set the status to ERROR.None. Nothing to clean.
  - If task_info begins with `compute`:
    - Depending on the subtask, call the _deallocate_network, _shutdown_instance and _cleanup methods
    - Set the status to ERROR.None
- Cleanup for a stuck Delete Server operation: TBD
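The branching for the Create Server case could be sketched as follows. All names here are hypothetical stand-ins; the real cleanup would invoke the corresponding compute-manager methods rather than return their names:

```python
def cleanup_stuck_create(task_info):
    """Decide cleanup actions for a stuck Create Server operation.

    Dispatches on the process prefix of the last recorded checkpoint,
    per the outline above. Returns the list of cleanup steps so the
    caller can execute and log them.
    """
    process = task_info.split('.', 1)[0]
    steps = []
    if process in ('api', 'scheduler'):
        # Failed before reaching compute: nothing was allocated yet,
        # so only the state needs fixing (ERROR.None).
        steps.append('set_error_state')
    elif process == 'compute':
        # Resources may be partially allocated on the host; release
        # them before marking the instance as errored.
        steps.extend(['_deallocate_network', '_shutdown_instance',
                      '_cleanup', 'set_error_state'])
    else:
        raise ValueError('unknown checkpoint prefix: %s' % task_info)
    return steps
```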
Implementation
TBD
UI Changes
TBD
Code Changes
TBD
Migration
TBD
Test/Demo Plan
TBD
BoF agenda and discussion
- Should the cleanup service only fix the vm_state and let the user perform an explicit Delete operation, or should it try to release the resources as well?
  - Even if we release the resources and the user later deletes the instance manually (since it was in ERROR), there may be some errors in the logs, but the instance will get deleted in the end.
- How can task_info and task_log be used effectively?
- Do we need a (complex) state machine for the cleanup service?
Links
- https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
- http://etherpad.openstack.org/vmstatemachine
- http://wiki.openstack.org/TransactionalTaskManagement
- https://github.com/maoy/nova/tree/orchestration