- Launchpad Entry: NovaSpec:compute-instance-cleanup-service
- Created: 09 May 2012
- Contributors: MandarVaze
Clean up VM instances that are stuck in the middle of a specific operation.
Nova operations depend on various nova services, as well as external components like the DB and RabbitMQ. During the lifecycle of an operation like create/delete, if one of the components goes down, the status of the instance remains stuck.
The user is unable to recover an instance from such a state. Some states prevent deletion of the instance, resulting in "hung" instances that do nothing but consume resources.
Several bugs in which an instance gets stuck are associated with this blueprint: https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service
User is unable to access or delete the VM instances that are stuck.
There needs to be a cleanup service that will identify such instances and fix their status.
- Worst case mark vm_state as Error (So that user can delete the VM and reclaim the resources)
- Best case, roll it back to ACTIVE state (see review comments at https://review.openstack.org/#/c/6632/)
- There is no way to determine whether an instance is stuck, because there are no well-defined timeouts per operation
- Status of the VM can be derived only from the combination of vm_state and task_state. Unfortunately, this combination does not provide enough granularity to determine at what stage the instance was stuck. This could be useful during recovery.
- All the nova processes as well as third party processes like DB and RabbitMQ are up and running when cleanup service is invoked. (Else the cleanup tasks might fail.)
- So it must be a separate script, to be invoked manually.
- A periodic task in Compute might be overkill; moreover, if other services are still down, each run would be wasted effort
- Executed on Nova Compute Host
- It would perform RPC to Nova Network, Nova Volume when needed. All other operations done locally.
- Compute API can be added, which can be invoked from remote machines if needed
Introduce a lookup table of pre-defined "max time allowed for operation" values. These can be overridden from nova.conf, e.g.:
    cfg.IntOpt('buildserver_maxtime',
               default=3600,  # One hour
               help='How long can a VM stay in the building server state? (In seconds)'),
    cfg.IntOpt('snapshot_maxtime',
               default=43200,  # 12 hours
               help='How long can a VM stay in the snapshotting state? (In seconds)'),
The cleanup service will use these values and the "time since last update" to determine which VMs are "stuck".
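The timeout check described above could look roughly like the following. This is only a minimal sketch: the `MAX_TIME_ALLOWED` table, its keys, and the `is_stuck` helper are hypothetical names chosen for illustration, and in a real implementation the limits would be read from the nova.conf options rather than hard-coded.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup table: operation name -> max allowed time in seconds.
# In the proposal these values would come from nova.conf options.
MAX_TIME_ALLOWED = {
    "buildserver": 3600,      # one hour
    "snapshot": 12 * 3600,    # 12 hours
}


def is_stuck(operation, updated_at, now=None, limits=MAX_TIME_ALLOWED):
    """Return True if the time since the last update exceeds the limit.

    `operation` is None when no task is in progress; such instances, and
    operations without a configured limit, are never reported as stuck.
    `updated_at` and `now` must be timezone-aware datetimes.
    """
    if operation is None:
        return False
    limit = limits.get(operation)
    if limit is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - updated_at) > timedelta(seconds=limit)
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to exercise without touching a real database.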
Define Granular Task Substates
Code changes from https://github.com/maoy/nova/tree/orchestration (related to http://wiki.openstack.org/TransactionalTaskManagement) provide a useful mechanism to capture additional details about the task_states.
We need checkpoints like
    task.update_task_info(context, "api.create.start")
    task.update_task_info(context, "api.create.end")
    task.update_task_info(context, "scheduler.run_instance.start")
    task.update_task_info(context, "scheduler.run_instance.end")
    task.update_task_info(context, "compute.allocate_for_instance.start")
at various places during the operation.
General format: <NovaProcess>.<function_name>.<start/end>. This is similar to the checkpoints currently used by the notification service.
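To illustrate the checkpoint format, here is a small sketch of how a cleanup script might split a checkpoint string into its three parts. The `Checkpoint` tuple and `parse_checkpoint` function are hypothetical helpers, not part of nova or of the orchestration branch.

```python
from collections import namedtuple

# Hypothetical container for the three parts of a checkpoint name.
Checkpoint = namedtuple("Checkpoint", ["process", "function", "phase"])


def parse_checkpoint(task_info):
    """Split '<NovaProcess>.<function_name>.<start/end>' into its parts."""
    parts = task_info.split(".")
    if len(parts) != 3 or not all(parts) or parts[2] not in ("start", "end"):
        raise ValueError("malformed checkpoint: %r" % (task_info,))
    return Checkpoint(*parts)
```

For example, `parse_checkpoint("compute.allocate_for_instance.start")` yields a tuple whose `process` is `compute` and whose `phase` is `start`, which is the information a recovery step would branch on.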
Need to understand how to get the task_info for specific instance (Based on context ?)
Cleanup Service - WORK IN PROGRESS
- Get list of stuck instances where depth is NOT none and time_since_last_update > allowed_timeout for task_state
- Cleanup for stuck Create Server operation :
  - If task_info begins with `api` or `scheduler` :
    - Failed too early, set the status to ERROR.None. Nothing to clean
  - If task_info begins with `compute` :
    - Depending on subtask - call _deallocate_network, _shutdown_instance and _cleanup methods
    - Set status to ERROR.None
  - If task_info begins with `compute` :
- Cleanup for stuck Delete Server operation :
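The Create Server branch of the steps above could be sketched as a small planning function. This is an assumption-laden illustration: `plan_create_cleanup` and `ERROR_STATUS` are hypothetical names, the function only returns the names of the methods to call instead of invoking them, and it conservatively lists all three compute cleanup methods because the per-subtask selection is still undecided here. The Delete Server case is left out because its steps are not yet defined.

```python
# (vm_state, task_state) pair corresponding to ERROR.None.
ERROR_STATUS = ("error", None)


def plan_create_cleanup(task_info):
    """Return (cleanup_steps, final_status) for a stuck Create Server.

    `task_info` is a checkpoint string such as 'scheduler.run_instance.start'.
    """
    process = task_info.split(".", 1)[0]
    if process in ("api", "scheduler"):
        # Failed too early: nothing was allocated, so nothing to clean.
        return [], ERROR_STATUS
    if process == "compute":
        # Assumption: run every cleanup method; which ones are really
        # needed depends on the subtask and is not specified yet.
        steps = ["_deallocate_network", "_shutdown_instance", "_cleanup"]
        return steps, ERROR_STATUS
    raise ValueError("unrecognised task_info: %r" % (task_info,))
```

Returning a plan rather than executing it directly would also make it possible to log or review the intended actions before anything is changed.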
BoF agenda and discussion
- Should the cleanup service only _fix_ the vm_state and let the user perform an explicit Delete operation? Or should it try to release the resources as well?
- Even if we release the resources and the user later deletes the instance manually (since it was in ERROR), there may be some errors in the logs, but the instance will still get deleted in the end.
- How can task_info and task_log be used effectively?
- Do we need a (complex) state machine for the cleanup service?