AutoTerminateBackup


 * Created: Tue, 20 Dec 2011 13:50:49 -0800
 * Contributors: Joshua Harlow

= Autobackup Deprovisioning =

Summary
When a virtual machine is deleted (terminated), there needs to be a way to safely transmit its state (disk state to begin with, memory state possibly later) to a location where it is kept for a given period of time. This is useful for debugging, tracking, auditing and even backing up terminated machines. Most of this functionality already exists in OpenStack, but some small additions are needed. The feature should be available both to the people who control the cloud and to users who run on it (a private company running on top of the cloud may want the same functionality for its own uses).

Rationale
Cloud security practitioners have identified a number of use cases which require that when a virtual machine is terminated, the final state of that virtual machine is stored securely. Local policy may specify the period of time for which the data is stored; there may also be mechanisms for transferring the data to a long-term repository. One use case is to permit forensic examination following a potential security compromise; we may wish to transfer the state to a "safe" environment for analysis or replay. Another is where operations notices that a system is repeatedly crashing, and wishes to enable the developer to debug the problem, based on its final state. Overall there are a number of cases where it is useful for the system to save the state automatically. Note that this is distinct from situations where the user or operator chooses to invoke an explicit snapshot API; the automated nature of the state backup is key.

Design
It would be great to use the snapshot ability that OpenStack already has to store auto-terminate backups. However, instead of always giving the original user access to those backups (which may not always be wanted), how these images are created and who can access them should be configurable. Access to the saved state should be controlled through a set of explicit keystone RBAC permissions, so that access can easily be restricted to instance owners or specialized operational roles. The current snapshot capability seems to be a good starting point, since a snapshot "should" track a "root" image and should not allow deletion of that "root" image if there are "children" snapshots. If this is not how snapshots currently behave, we will need to enable it, since where possible we would prefer to store a COW snapshot rather than a full "raw" snapshot. This will of course not be the case if a machine's image is "raw", in which case no COW snapshotting is possible anyway.

The following would be needed, though, for an auto-terminate backup snapshot.


 * 1) Changes to allow a terminate call to perform additional actions on behalf of the user
   * From a cloud operator perspective, it is not always desirable for a user to know about or request auto-terminate backup of their machine
 * 2) Sending & downloading of these images in glance in a secure manner
   * This means glance needs some way of encrypting/decrypting content, since compromising a glance storage server should not expose auto-terminate backups stored there by the cloud operator or the cloud user
   * This may also be useful in general for clients that do not want glance to store their image snapshots in decrypted form
 * 3) Making it configurable who the owner of this image is in glance (since it may no longer always be the owner of that virtual machine)
 * 4) Having nova/glance track the parent image of a snapshot and ensuring that a parent cannot be deleted if a snapshot exists and that snapshot is a COW difference (instead of the full raw image)
   * This may mean that snapshot metadata is stored in glance instead of in a nova database, and that glance knows about snapshot dependencies. If needed, the nova database can store nova-specific information, but image hierarchies seem better placed in glance.
 * 5) An ability to restrict starting up of auto-terminate backups to a limited set of users
   * This may not be needed if we store the backup under a different user than the one who initially started the VM
   * From a cloud operator perspective, there would have to be a special user that may receive all backups (but is not publicly viewable/usable?)

For the first item listed above, this would initially be a null call (matching how the current code behaves), but there should be a pluggable termination module that can transfer the image to a given location for backup queuing. This queuing is done on the nova-compute node so that the termination request can be responded to quickly. A new daemon may then need to be activated (or may watch for events) to pick up the just-deleted image (plus some metadata) and process the backup. This processing would likely include the following.


 * 1) Ensure the metadata + image is valid
 * 2) Encrypt the image (talking with keystone here)
 * 3) Tell glance to store the image with the given metadata (this may not happen under the user/project who created the virtual machine)
 * 4) After transmission, scrub the file that held the image (for secure deletion)
   * This should be a pluggable and optional feature (a company may only care about doing "rm" instead of a federally compliant delete process)
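
The termination hook and the daemon steps above could be sketched roughly as follows. This is a minimal illustration, not existing nova code: `BackupJob`, `null_hook`, `queueing_hook`, `process_backup` and the `instance_id` metadata key are all hypothetical names, the `encrypt` and `store` callables stand in for the keystone and glance interactions, and the image is read into memory only to keep the sketch short.

```python
import os
import shutil
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class BackupJob:
    """One queued backup: the terminated instance's disk file plus metadata."""
    image_path: str
    metadata: Dict[str, str] = field(default_factory=dict)


def null_hook(instance_id: str, disk_path: str) -> Optional[BackupJob]:
    """Default termination hook: do nothing, matching current behaviour."""
    return None


def queueing_hook(spool_dir: str) -> Callable[[str, str], Optional[BackupJob]]:
    """Build a hook that moves the disk into spool_dir so terminate returns fast."""
    def hook(instance_id: str, disk_path: str) -> Optional[BackupJob]:
        target = os.path.join(spool_dir, instance_id + ".img")
        shutil.move(disk_path, target)
        return BackupJob(target, {"instance_id": instance_id})
    return hook


def process_backup(
    job: BackupJob,
    encrypt: Callable[[bytes], bytes],
    store: Callable[[bytes, Dict[str, str]], str],
    scrub: Callable[[str], None] = os.remove,  # plain "rm" unless a plugin is given
) -> str:
    """Run the four daemon steps for one job and return the stored image id."""
    # 1) Ensure the metadata + image is valid.
    if "instance_id" not in job.metadata:
        raise ValueError("metadata is missing instance_id")
    if not os.path.isfile(job.image_path) or os.path.getsize(job.image_path) == 0:
        raise ValueError("image file is missing or empty: %s" % job.image_path)
    # 2) Encrypt the image (the key would come from keystone).
    with open(job.image_path, "rb") as image_file:
        ciphertext = encrypt(image_file.read())
    # 3) Hand the encrypted image and its metadata to the image store.
    image_id = store(ciphertext, job.metadata)
    # 4) Scrub the local copy only after the store call has succeeded.
    scrub(job.image_path)
    return image_id
```

Passing the scrubber as a parameter keeps secure deletion optional: a deployment that needs a compliant wipe could supply its own callable, while the default stays a plain delete.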

For sending and downloading images securely (the second item listed), we would need to integrate with keystone to fetch an image encryption key, and then ensure that this key is used when storing images and also for decryption (with decryption happening only on the compute nodes). This will require enhancing glance and any image store/fetch APIs in OpenStack.

For tracking parent images, we would need glance to store a hierarchy of images, or at least a pointer to a parent image, so that when nova-compute requests a snapshot that is in COW form it knows that it needs to traverse this hierarchy to get a complete image for a given virtual machine. This may mean additions or changes to the glance API (and metadata) and to nova-compute, so that they can provide and store this information and download the correct components when creating a virtual machine. Glance should also ensure, when an image delete call occurs, that no child "images" depend on that image, and should stop the deletion if any do.
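
As an illustration of the traversal nova-compute would need, the sketch below walks parent pointers from a COW snapshot up to its root image. It assumes the hierarchy is available as a simple mapping from image id to parent id (with `None` for a root); `resolve_chain` and that mapping are hypothetical, not part of the glance API.

```python
from typing import Dict, List, Optional


def resolve_chain(image_id: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return [image_id, parent, grandparent, ...], ending at the root image."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = image_id
    while current is not None:
        if current in seen:
            # A loop would make the download never finish, so fail loudly.
            raise ValueError("cycle in image hierarchy at %s" % current)
        if current not in parents:
            raise KeyError("unknown image: %s" % current)
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain
```

An image with no parent resolves to a one-element chain, which is how existing images without hierarchy information would behave.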

For making the image owner configurable, we should be able to accomplish this by having the "backup daemon" either use the virtual machine creator's user, password and project (when storing with glance) or change this information so that it stores under a different user (or both). There might need to be a new API field that controls this (or a runtime config option). If a new API field is chosen, there may also need to be a way of enabling/disabling this from an API (instead of a config file). Having both a new API and a configuration option would be preferable, so that the cloud operator can ensure the backup daemon transfers images and users of the cloud can have their images backed up as well.
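
One way the daemon might combine the operator's config option with a per-instance API setting is sketched below. The names `StoreTarget` and `resolve_store_targets` are hypothetical, and falling back to the instance owner when the API request names no `store_user` or `store_project` is an assumption rather than something this proposal specifies.

```python
from typing import Dict, List, NamedTuple, Optional


class StoreTarget(NamedTuple):
    user: str
    project: str


def resolve_store_targets(
    owner: StoreTarget,
    api_settings: Optional[Dict[str, object]] = None,
    config_target: Optional[StoreTarget] = None,
) -> List[StoreTarget]:
    """List every user/project that should receive a copy of the backup."""
    targets: List[StoreTarget] = []
    # The operator's config option applies regardless of what the user asked for.
    if config_target is not None:
        targets.append(config_target)
    # The per-instance API setting adds the user's own choice, if enabled.
    if api_settings and api_settings.get("enabled"):
        api_target = StoreTarget(
            str(api_settings.get("store_user", owner.user)),
            str(api_settings.get("store_project", owner.project)),
        )
        if api_target not in targets:
            targets.append(api_target)
    return targets
```

An empty result means no backup is taken, which matches current behaviour when neither option is set.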

New nova APIs could be the following:

GET /{instance_id}/terminate_backup RESPONSE: returns whether backup on termination is enabled

POST /{instance_id}/terminate_backup DATA: { "enabled": true, "store_user": "bob", "store_project": "myendbackups" } RESPONSE: turns on backup termination for a given instance
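
To make the two calls concrete, here is a small in-memory sketch of the logic that could sit behind them. `TerminateBackupSettings` is a hypothetical class, not an existing nova extension, and the validation rules and the shape of the GET response are assumptions based on the example request body above.

```python
from typing import Any, Dict


class TerminateBackupSettings:
    """In-memory stand-in for the per-instance settings behind the two calls."""

    ALLOWED_FIELDS = {"enabled", "store_user", "store_project"}

    def __init__(self) -> None:
        self._by_instance: Dict[str, Dict[str, Any]] = {}

    def get(self, instance_id: str) -> Dict[str, bool]:
        """GET /{instance_id}/terminate_backup: report whether backup is on."""
        settings = self._by_instance.get(instance_id, {})
        return {"enabled": bool(settings.get("enabled", False))}

    def post(self, instance_id: str, body: Dict[str, Any]) -> Dict[str, bool]:
        """POST /{instance_id}/terminate_backup: validate and save the settings."""
        if not isinstance(body.get("enabled"), bool):
            raise ValueError('"enabled" must be true or false')
        unknown = set(body) - self.ALLOWED_FIELDS
        if unknown:
            raise ValueError("unknown fields: %s" % sorted(unknown))
        for key in ("store_user", "store_project"):
            if key in body and not isinstance(body[key], str):
                raise ValueError('"%s" must be a string' % key)
        self._by_instance[instance_id] = dict(body)
        return self.get(instance_id)
```

An instance that has never been configured reports `{"enabled": False}`, so existing instances keep terminating as they do today.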

New glance APIs could be the following:

GET /link/{parent}/{child}/ RESPONSE: returns whether child is linked to parent

POST /link/{parent}/{child}/ RESPONSE: creates a link between parent and child image

Store APIs for glance may also need metadata to determine whether the image is encrypted. This may not be needed if, from now on, glance stores all images encrypted: nova-compute would go to keystone and get a decryption key when starting an image, and when a user submits an image to glance through the glance CLI, that CLI (or API) would also go to keystone and grab an encryption key. Delete APIs in glance would also have to be modified to ensure that a parent cannot be deleted before its child is deleted.
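
The link calls and the delete guard could behave along these lines. `ImageLinks` is a hypothetical in-memory model rather than glance code, and it assumes each image has at most one parent.

```python
from typing import Dict, Optional


class ImageLinks:
    """In-memory model of parent/child links between images."""

    def __init__(self) -> None:
        self._parent_of: Dict[str, Optional[str]] = {}

    def add_image(self, image_id: str) -> None:
        # New images start with no parent, like every image today.
        self._parent_of.setdefault(image_id, None)

    def link(self, parent: str, child: str) -> None:
        """POST /link/{parent}/{child}/: record that child depends on parent."""
        if parent not in self._parent_of or child not in self._parent_of:
            raise KeyError("both images must exist before linking")
        if parent == child:
            raise ValueError("an image cannot be its own parent")
        self._parent_of[child] = parent

    def is_linked(self, parent: str, child: str) -> bool:
        """GET /link/{parent}/{child}/: report whether child is linked to parent."""
        return parent in self._parent_of and self._parent_of.get(child) == parent

    def delete(self, image_id: str) -> None:
        """Delete an image, refusing while any child still depends on it."""
        children = sorted(c for c, p in self._parent_of.items() if p == image_id)
        if children:
            raise RuntimeError(
                "image %s still has children: %s" % (image_id, ", ".join(children))
            )
        del self._parent_of[image_id]
```

Deleting the child first removes the dependency, after which the parent can be deleted normally.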

Expected Code Changes
glance and nova

Expected Documentation Changes
glance and nova

Test/Demo Plan
An idea for tests would involve the following:


 * 1) Enable terminate backup for a created virtual machine via the API for my user
   * Terminate
   * Ensure a backup image is created in glance for my user
 * 2) Enable terminate backup for a created virtual machine via the API for another user
   * Terminate
   * Ensure a backup image is created in glance for the other user
 * 3) Enable terminate backup for a created virtual machine via the config for another user
   * Terminate
   * Ensure a backup image is created in glance for the other user
 * 4) Enable terminate backup for a created virtual machine via the config for other user Y and via the API for other user Z
   * Terminate
   * Ensure a backup image is created in glance for other user Y and other user Z
 * 5) Link an image in glance to a parent image
   * Attempt to delete the parent image
   * Receive a failure response
 * 6) Link an image diff (COW) in glance to a parent image
   * Start nova-compute with the image diff
   * Ensure nova-compute downloads the parent image (and connected images)
   * Ensure nova-compute configures the correct chain of children->parent->parent... images
 * 7) Upload an image to glance for a user who has an image encryption key in keystone
   * Download that image (raw) from glance and ensure that it is encrypted
 * 8) Upload an image to glance for a user who has an image encryption key in keystone
   * Start that image in nova-compute and ensure that keystone is contacted and a decryption key is sent back (SSL likely needed here)
   * Ensure the image starts up (after decryption)

Migration Plan
The hope is that no migration will be needed, as this should be fully backward compatible with glance as it exists today (no encryption and no child->parent hierarchy; i.e. to retain backward compatibility, all "old" APIs would just set the parent to NULL). The new image backup daemon would also just do an "rm" by default, instead of more complicated actions, to retain backward compatibility. For encryption, the lack of a keystone image encryption key for a given uploader/downloader will signify that the image is not encrypted (thus it will work as it currently does).
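
The "no key means not encrypted" rule can be illustrated with the sketch below. The `keys` mapping stands in for a keystone lookup, and `prepare_upload`, `prepare_boot` and the `Cipher` callable type are hypothetical; a real deployment would plug in an actual cipher.

```python
from typing import Callable, Dict

# A cipher takes (key, data) and returns the transformed data.
Cipher = Callable[[bytes, bytes], bytes]


def prepare_upload(
    data: bytes, user: str, keys: Dict[str, bytes], encrypt: Cipher
) -> bytes:
    """Encrypt on upload only when the uploader has an image encryption key."""
    key = keys.get(user)
    if key is None:
        return data  # no key: store the image unchanged, as glance does today
    return encrypt(key, data)


def prepare_boot(
    blob: bytes, user: str, keys: Dict[str, bytes], decrypt: Cipher
) -> bytes:
    """Decrypt on the compute node only when a key exists for that user."""
    key = keys.get(user)
    if key is None:
        return blob  # no key: the stored image was never encrypted
    return decrypt(key, blob)
```

Because the absence of a key selects the pass-through branch in both directions, images uploaded before the feature existed would continue to boot unchanged.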

Unresolved Issues

 * 1) If a terminate happens, we need to lock the image in glance and unlock it only after confirmation from the "backup daemon". This ensures that the root image cannot be deleted while the forensic backup may still require it (i.e. for COW). How complicated is this?
 * 2) When a user requests that a backed-up image go to another user, how do we ensure that the other user is allowed to receive those images? Part of the same group? A special user property?

Contacts
harlowja@yahoo-inc.com