Trove/PointInTimeRecovery

Introduction

 * Every once in a while, something happens that corrupts a database; most of us have made a mistake at least once that trashed one. What do you do when this happens? If you have no database backup, the data is lost and all you can do is own up to the problem. If you have at least a complete database backup, you will most likely be able to recover the database up to the point at which the data was corrupted. This article discusses how to use a point in time restore to recover your databases.
 * If you search for “Point in time recovery” you will also find “Point in time restore”, so we need to decide what to call it. Historically, databases have called this feature point in time recovery.

What is a point-in-time recovery?

 * A point in time recovery restores a database to a specified date and time. When the recovery completes, the database is in the state it was in at the date and time you specified. It is a method of recovering your database to any point in time since the last database backup.
 * Point in time is an industry standard term, and refers to the ability for the user to restore from _any_ point in time (not from explicit snapshots). For a more detailed explanation about what point in time recovery is, please see the following:
 * http://en.wikipedia.org/wiki/Point-in-time_recovery
 * https://dev.mysql.com/doc/refman/5.0/en/point-in-time-recovery.html
 * http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIT.html

What does it take to do a point-in-time recovery?

 * To perform a point in time recovery you need the entire series of backups (complete, differential, and transaction log backups) up to or beyond the point in time to which you want to recover. If any backup is missing, or the transaction log was truncated without first taking a transaction log backup, a point in time recovery is not possible. At a minimum, you need a complete backup and all the transaction log backups taken after it. If you also take differential backups, you need the complete backup, the last differential backup taken before the corruption, and all the transaction log backups taken after that differential backup.
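
The backup selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Backup` record and the `restore_chain` function are hypothetical names, not part of Trove or of any database product, and the sketch assumes that a differential backup is always relative to the most recent complete backup. It also cannot detect a missing transaction log backup in the middle of the series, which a real tool would check through log sequence numbers.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    """Hypothetical backup record used only for this illustration."""
    kind: str           # "full", "differential" or "log"
    taken_at: datetime  # when the backup was taken


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, in order, to recover to `target`."""
    ordered = sorted(backups, key=lambda b: b.taken_at)

    # 1. The latest complete backup taken at or before the target time.
    fulls = [b for b in ordered if b.kind == "full" and b.taken_at <= target]
    if not fulls:
        raise ValueError("no complete backup at or before the target time")
    base = fulls[-1]
    chain = [base]

    # 2. Optionally, the latest differential taken after that complete
    #    backup and at or before the target time.
    diffs = [b for b in ordered
             if b.kind == "differential"
             and chain[0].taken_at < b.taken_at <= target]
    if diffs:
        base = diffs[-1]
        chain.append(base)

    if target == base.taken_at:
        return chain  # the base backup already matches the target time

    # 3. Every transaction log backup after the base, up to and including
    #    the first one taken at or beyond the target time.
    for b in ordered:
        if b.kind != "log" or b.taken_at <= base.taken_at:
            continue
        chain.append(b)
        if b.taken_at >= target:
            return chain

    raise ValueError("transaction log backups do not cover the target time")
```

For example, with a complete backup at 00:00, a differential at 04:00 and log backups every two hours, a target of 07:00 would need the complete backup, the 04:00 differential, and the 06:00 and 08:00 log backups.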

Description

 * OpenStack DBaaS Trove can restore an instance (a whole new instance, from scratch) from a backup previously stored in remote storage (OpenStack Swift, Amazon AWS S3, etc.). From both the administrator and regular user perspective, Trove should also be able to perform point in time recovery.

Justification

 * From the user perspective, I want to be able to restore my data at any time, but today users can do so only at provisioning. The difference between restore (in Trove terms) and recovery is that recovery can be performed on an already running instance.

Benefits

 * Restore gives the ability to spin up a new instance from a backup (as mentioned earlier), whereas recovery gives the ability to restore an already running instance to a given point in time.

Impacts

 * All proposed changes are backward compatible. The feature improves how backups are used and extends the restore API.

Database

 * There are no expected changes to the database.

Configuration

 * There are no expected changes to the configuration.

Recovering flow

 * 1) The API service takes the given instance ID and the point in time in an appropriate format.
 * 2) The API service finds the backup closest to the given point in time.
 * 3) The API service casts a call to the taskmanager, which spins up a whole new instance with the given backup as the restore data.
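
Step 2 of the flow above could look roughly like the following Python sketch. It is not Trove code: `BackupRecord` and `find_closest_backup` are hypothetical names, and "closest" is assumed here to mean the most recent backup of the instance created at or before the requested point in time. When no point in time is given (the parameter is optional), the sketch assumes the most recent backup is used.

```python
from datetime import datetime
from typing import NamedTuple, Optional, Sequence


class BackupRecord(NamedTuple):
    """Hypothetical, simplified view of a stored backup."""
    id: str
    instance_id: str
    created: datetime


def find_closest_backup(
    backups: Sequence[BackupRecord],
    instance_id: str,
    point_in_time: Optional[datetime] = None,
) -> Optional[BackupRecord]:
    """Pick the newest backup of `instance_id` not later than `point_in_time`.

    Returns None when the instance has no suitable backup.
    """
    candidates = [
        b for b in backups
        if b.instance_id == instance_id
        and (point_in_time is None or b.created <= point_in_time)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b.created)
```

The selected backup would then be handed to the taskmanager as the restore data for the new instance, as in step 3.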

QA

 * Q1: Why is it valid to restore to a new instance instead of the current one?
 * Answer: Because a restore destroys all the data on the instance, so the API call is too dangerous to get wrong. A bad or corrupted backup, or any random error during the restore, can kill the instance so that it would no longer be reachable by users.
 * Q2: So, after the recovery, how many instances will the user have?
 * Answer: In the end the user will have two instances with almost equal data (how close depends on how often the user takes backups).
 * Q3: With what attributes (instance name, volume, flavor, datastore, datastore version, etc.) will the new instance be provisioned?
 * Answer: The recovered instance will have the same attributes as the parent instance.

Public API

 * New routes will be added. The recovery public API is described below.

Request body

 * The point in time is an optional parameter; only the instance ID is required.

{   "recovery": { "instance": "UUID", "point_in_time": "2014-04-04T11:22:45" } }
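
As an illustration of how a request body of this shape might be validated, here is a minimal Python sketch. The function name `parse_recovery_request` and the timestamp format string are assumptions made for this example (the format is inferred from the sample value); the actual Trove API layer may validate requests differently.

```python
import json
from datetime import datetime
from typing import Optional, Tuple

# Assumed timestamp format, inferred from the sample request body.
POINT_IN_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_recovery_request(raw: str) -> Tuple[str, Optional[datetime]]:
    """Return (instance_id, point_in_time) from a JSON request body.

    Raises ValueError if the body is malformed or the instance is missing.
    """
    body = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    recovery = body.get("recovery") if isinstance(body, dict) else None
    if not isinstance(recovery, dict) or not recovery.get("instance"):
        raise ValueError("'recovery.instance' is required")

    # point_in_time is optional; None means "no specific time requested".
    raw_time = recovery.get("point_in_time")
    point_in_time = None
    if raw_time is not None:
        point_in_time = datetime.strptime(raw_time, POINT_IN_TIME_FORMAT)
    return recovery["instance"], point_in_time
```

A body that omits `point_in_time` yields `None` for the second element, which the caller can treat as a request for the most recent backup.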

Response object

{   "recovery": { "id": "UUID", "recovered_from_instance": "UUID", "name": "instance", "datastore": "mysql", "datastore_version": "mysql-5.5", "closest_point_in_time": "2014-03-04T11:22:45" } }

RPC message type

 * CAST

RPC message
Nothing new.

Guest Agent

 * All changes made to the agent are backward compatible. The existing restore functionality (restore from a full backup) is reused.

= Points to consider =


 * Renaming. Possible names: Recover from the backup. Instance data recovery from the backup.
 *  - For something to be "point in time" recovery, you should be able to recover to any point in time (with reasonable limits on granularity). This doesn't enable that.


 * The answer - from this point of view, the available points in time are defined by the existing backups: if a user has a set of backups, their create_at times are the only points in time from which the user is able to recover.

 Then this is not "Point in Time" recovery. "Point in time" is an industry standard term, and refers to the ability for the user to restore from _any_ point in time (not from explicit snapshots). For a more detailed explanation about what point in time recovery is, please see the following:
 * http://en.wikipedia.org/wiki/Point-in-time_recovery
 * https://dev.mysql.com/doc/refman/5.0/en/point-in-time-recovery.html
 * http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PIT.html

 SlickNik, please suggest an appropriate name for this feature. I'm open to discussion.


 *  - This currently doesn't enable anything new over restoring to a new instance (and additionally has the issue that we may overwrite valid data by mistake, making it dangerous).


 * The answer - yes, this feature reuses the restore functionality but, as said in previous topics, it avoids quota usage, and applying a backup to an ACTIVE instance is faster than provisioning a new one with predefined data. It is also not dangerous from the developer's perspective: the API is a service offered to the end user, so it is up to the user to decide what to do with their data.

= Long term goals =
 * This feature will be very useful once replication arrives. A simple use case (from A section) is the join operation: a user has two standalone servers and wants to use instance A as the master node and instance B as the slave node. The most valid way is to create a backup from the master node, apply it to the slave node, and then perform the join (specific to the datastore).