VMStateCleanupService

Launchpad Entry: NovaSpec:compute-instance-cleanup-service
Created: 09 May 2012
Contributors: MandarVaze

Summary

Clean up VM instances that are stuck during specific operations.

Release Note

Rationale

Nova operations depend on various nova services, as well as external components like the DB and RabbitMQ. If one of these components goes down during the lifecycle of an operation like create/delete, the status of the instance remains stuck.

The user is unable to recover an instance from such a state. Some states prevent deletion of the instance, resulting in "hung" instances that just consume resources.

Several bugs in which an instance gets stuck are associated with this blueprint: https://blueprints.launchpad.net/nova/+spec/compute-instance-cleanup-service

User stories

User is unable to access or delete the VM instances that are stuck.

Goal

There needs to be a cleanup service that will identify such instances and fix their status.

Worst case: mark vm_state as ERROR, so that the user can delete the VM and reclaim the resources.
Best case: roll it back to the ACTIVE state (see review comments at https://review.openstack.org/#/c/6632/)

Issues

There is no way to determine whether an instance is stuck, because there are no well-defined timeouts per operation.
The status of the VM can be derived only from the combination of vm_state and task_state. Unfortunately, this combination does not provide enough granularity to determine at what stage the instance got stuck, which would be useful during recovery.

Assumptions

All the nova processes as well as third party processes like the DB and RabbitMQ are up and running when the cleanup service is invoked. (Otherwise the cleanup tasks might fail.)
So it must be a separate script, to be invoked manually.
- A periodic task in Compute might be overkill. Also, if other services are still down, each periodic run would be wasted effort.
Executed on the Nova Compute host
- It would perform RPC to Nova Network and Nova Volume when needed. All other operations are done locally.
- A Compute API can be added, which can be invoked from remote machines if needed

Design

Define Timeouts

Introduce a lookup table of pre-defined "max time allowed for operation" values. These can be overridden from nova.conf, e.g.:

#!highlight python
cfg.IntOpt('buildserver_maxtime',
           default=3600,  # One hour
           help='How long can a VM stay in the building server state? (In seconds)'),
cfg.IntOpt('snapshot_maxtime',
           default=43200,  # 12 hours
           help='How long can a VM stay in the snapshotting state? (In seconds)'),

The cleanup service will use these values and the "time since last update" to determine which VMs are "stuck".
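The comparison described above can be sketched as follows. This is only an illustration: the `MAX_TIME_ALLOWED` table, its keys and the `is_stuck` helper are hypothetical names, and the sketch assumes the "time since last update" comes from a timestamp such as the instance's last-updated time.

```python
from datetime import datetime, timedelta

# Hypothetical lookup table: task_state -> max time allowed (in seconds).
# The values mirror the nova.conf options above; the keys are illustrative.
MAX_TIME_ALLOWED = {
    "building": 3600,           # buildserver_maxtime (one hour)
    "snapshotting": 12 * 3600,  # snapshot_maxtime (12 hours)
}


def is_stuck(task_state, updated_at, now, max_time_allowed=MAX_TIME_ALLOWED):
    """Return True if the instance exceeded the timeout for its task_state."""
    if task_state is None:
        return False  # no operation in progress
    allowed = max_time_allowed.get(task_state)
    if allowed is None:
        return False  # no timeout defined for this operation, cannot decide
    return (now - updated_at) > timedelta(seconds=allowed)


now = datetime(2012, 5, 9, 12, 0, 0)
print(is_stuck("building", now - timedelta(hours=2), now))     # True
print(is_stuck("building", now - timedelta(minutes=10), now))  # False
```

Treating an operation without a configured timeout as "not stuck" is a conservative choice for this sketch; the spec does not say how such operations should be handled.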

Define Granular Task Substates

Code changes from https://github.com/maoy/nova/tree/orchestration (related to http://wiki.openstack.org/TransactionalTaskManagement) provide a useful mechanism to capture additional details regarding the task_states.

We need checkpoints like

 

task.update_task_info(context, "api.create.start") 
task.update_task_info(context, "api.create.end")

task.update_task_info(context, "scheduler.run_instance.start")
task.update_task_info(context, "scheduler.run_instance.end")

task.update_task_info(context, "compute.allocate_for_instance.start")

at various places during the operation.

The general format is <NovaProcess>.<function_name>.<start/end>. This is similar to the checkpoints currently used by the notification service.
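To show how a cleanup script might consume such checkpoint strings, here is a minimal sketch that splits one into its three parts. The `Checkpoint` tuple and the `parse_checkpoint` helper are hypothetical and not part of the orchestration branch.

```python
from collections import namedtuple

# Hypothetical container for the three parts of a checkpoint string.
Checkpoint = namedtuple("Checkpoint", ["process", "function_name", "phase"])


def parse_checkpoint(task_info):
    """Split '<NovaProcess>.<function_name>.<start/end>' into its parts."""
    parts = task_info.split(".")
    if len(parts) != 3 or parts[2] not in ("start", "end"):
        raise ValueError("unexpected checkpoint format: %r" % (task_info,))
    return Checkpoint(*parts)


cp = parse_checkpoint("compute.allocate_for_instance.start")
print(cp.process, cp.function_name, cp.phase)
```

A checkpoint whose last recorded phase is `start` with no matching `end` indicates the function in which the operation stopped making progress.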

Open question: how do we get the task_info for a specific instance (based on context?)

Cleanup Service - WORK IN PROGRESS

Get the list of stuck instances where depth is not None and time_since_last_update > allowed_timeout for the task_state
Cleanup for stuck Create Server operation :
- If task_info begins with `api` or `scheduler` :
  - Failed too early, nothing to clean up
  - Set status to ERROR.None
- If task_info begins with `compute` :
  - Depending on subtask - call _deallocate_network, _shutdown_instance and _cleanup methods
  - Set status to ERROR.None
Cleanup for stuck Delete Server operation :
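For the Create Server case above, the branching on task_info might be sketched as below. The helper `plan_create_cleanup` is hypothetical, and the sketch assumes that "ERROR.None" means vm_state ERROR with task_state None. The source does not say which cleanup method applies to which subtask, so the sketch lists all three for any compute-stage failure and only returns the method names rather than calling them.

```python
def plan_create_cleanup(task_info):
    """Return (cleanup_steps, final_status) for a stuck Create Server.

    cleanup_steps is a list of method names taken from the text above;
    final_status is a (vm_state, task_state) pair.
    """
    process = task_info.split(".", 1)[0]
    if process in ("api", "scheduler"):
        # Failed too early: no resources were allocated yet.
        return [], ("ERROR", None)
    if process == "compute":
        # Assumption: all three steps are candidates; a real implementation
        # would pick a subset depending on the subtask that was reached.
        steps = ["_deallocate_network", "_shutdown_instance", "_cleanup"]
        return steps, ("ERROR", None)
    raise ValueError("unknown process in task_info: %r" % (task_info,))


print(plan_create_cleanup("scheduler.run_instance.start"))
print(plan_create_cleanup("compute.allocate_for_instance.start"))
```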

Implementation

TBD

UI Changes

TBD

Code Changes

TBD

Migration

TBD

Test/Demo Plan

TBD

BoF agenda and discussion

Should the cleanup service only _fix_ the vm_state and let the user perform an explicit Delete operation? Or should it try to release the resources as well?
- Even if we release the resources and the user later deletes the instance manually (since it was in ERROR), there may be some errors in the logs, but the instance will get deleted in the end.
How can task_info and task_log be used effectively?
Do we need a (complex) state machine for the cleanup service?

Links