Trove/Blueprints/Trove-v1-MySQL-Replication

Description

Support for the various replication use cases is critical for using Trove in production. This blueprint describes those use cases and their requirements, then proposes a scope for an initial V1 implementation for MySQL.

Justification/Benefits

Most of the datastores currently supported by Trove have replication capabilities to fulfill various use cases such as:

scale out via read replicas
operational recovery (aka failover)
offline backup

To be production ready, Trove needs to support easy configuration and management of these use cases. Today Amazon RDS fulfills the first use case and part of the second use case for MySQL.

Over time all of these requirements should be evaluated; the goal of this blueprint is to focus on read replicas for scale out and to target the MySQL datastore. It is expected that the same scope will later be implemented for other datastores, and that further work can then be scoped to meet the remaining requirements.

Use Case Requirements

While the specific details of each datastore need to be investigated against this list, these are seen as the use-cases that would motivate the replication feature:

A. Read Replicas (Slaves)

The master can exist before the slave such that the master already contains data
N slaves for one master (To clarify, the v1 implementation will allow for this but will require N separate create calls. We may optimize this in a later implementation. -SlickNik)
Slaves can be marked read-only (probably by default)
A slave can be detached from its "replication set" to act as an independent site
A pre-existing non-replication site can become the master of a new "replication set"
The health of a slave will be monitorable

(When master fails, a slave can be chosen to be promoted to new master, with other slaves switched to follow new master) (not needed for v1; to be addressed later -SlickNik)
(All slaves should be in the same zone; is this necessary or desired?) (not needed for v1; to be addressed later -SlickNik)
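
The v1 requirements for Use Case A can be sketched as a small in-memory model. This is only an illustration, not Trove code: the `Instance` and `ReplicationSet` classes, their fields, and the instance names are hypothetical, and making a detached slave writable is an assumption (the requirement only says it acts as an independent site).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for a database instance."""
    name: str
    read_only: bool = False
    master: Optional[str] = None  # name of the master this instance follows


@dataclass
class ReplicationSet:
    """Hypothetical in-memory model of one master and its slaves."""
    master: Instance
    slaves: Dict[str, Instance] = field(default_factory=dict)

    def create_slave(self, name: str, read_only: bool = True) -> Instance:
        # One call creates exactly one slave, so N slaves need N calls.
        if name in self.slaves or name == self.master.name:
            raise ValueError(f"instance name already in use: {name}")
        slave = Instance(name=name, read_only=read_only, master=self.master.name)
        self.slaves[name] = slave
        return slave

    def detach(self, name: str) -> Instance:
        # Assumption: a detached slave becomes an independent, writable instance.
        slave = self.slaves.pop(name)
        slave.master = None
        slave.read_only = False
        return slave


# A pre-existing, non-replicated instance becomes the master of a new set.
existing = Instance(name="db-main")
rset = ReplicationSet(master=existing)
for i in range(3):
    rset.create_slave(f"db-replica-{i}")  # three separate create calls
standalone = rset.detach("db-replica-1")
```

Slaves default to read-only here because the requirement above suggests that as the probable default; a caller could pass `read_only=False` to override it.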

B. MultiZone Disaster Recovery

A master in one zone is mirrored by a slave in a different zone
Some mechanism should exist whereby a cloud admin can set up a "zone configuration" so that the user can simply select "MultiZone DR" and Trove will know where to put both the master and the slave
Should be able to restore the master from the slave, either directly or via a backup stored in Swift
Should be able to "click the switch" on an already running MySQL instance

C. Single Zone Failover

Implements master-master replication between 2 instances in the same zone
Can be set up on a pre-existing instance
Should be able to switch "active master", i.e., the site to which data is being written (other site could be marked read-only)
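
As a rough sketch of the "active master" switch in Use Case C, the following models two instances where exactly one is writable at a time. The `Node` and `MasterMasterPair` names are hypothetical, and the choice to start with the first instance active is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    read_only: bool


class MasterMasterPair:
    """Hypothetical pair of mutually replicating instances in one zone."""

    def __init__(self, first: str, second: str) -> None:
        # Assumption: the first node starts as the active (writable) master.
        self.nodes = {
            first: Node(first, read_only=False),
            second: Node(second, read_only=True),
        }
        self.active = first

    def switch_active(self, name: str) -> None:
        if name not in self.nodes:
            raise KeyError(name)
        # Mark the old active node read-only before enabling writes on the
        # new one, so both are never writable at the same time.
        self.nodes[self.active].read_only = True
        self.nodes[name].read_only = False
        self.active = name


pair = MasterMasterPair("db-a", "db-b")
pair.switch_active("db-b")
```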

Scope

The Phase I implementation, targeted for the Juno release, will implement the core functionality of Use Case A and replication support for the MySQL datastore. Replication support for datastores other than MySQL is outside the scope of this blueprint and will be covered elsewhere.

Impacts

This change would impact all components of Trove.

Configuration

There should be no changes to configuration files required.

Database

Does this impact any existing tables? If so, which ones?
Are the changes forward and backward compatible?
Be sure to include the expected migration process

Public API

Does this change any API that an end-user has access to?
Are there any exceptions in terms of consistency with other APIs?

The addition of a replication feature would require changes to the trove command and corresponding APIs. At a minimum, there would need to be APIs to add a slave to a master, and to remove a slave from a master. Additionally, there may be changes to how a potential master site is created.
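
To make the two minimum operations concrete, here is one possible shape for the request bodies. This blueprint does not define the API, so everything below is an assumption for discussion: the `slave_of` field name is hypothetical, and the remaining fields are assumed to mirror an ordinary instance-create body.

```python
import json


def add_slave_request(name: str, flavor_ref: str, volume_gb: int,
                      master_id: str) -> dict:
    """Build a hypothetical 'create an instance as a slave of master_id' body."""
    return {
        "instance": {
            "name": name,
            "flavorRef": flavor_ref,
            "volume": {"size": volume_gb},
            "slave_of": master_id,  # hypothetical field name
        }
    }


def remove_slave_request() -> dict:
    """Build a hypothetical 'detach this slave from its master' body."""
    # Hypothetical convention: clearing the link detaches the slave.
    return {"instance": {"slave_of": None}}


body = add_slave_request("replica-1", "7", 5, "MASTER_ID")
print(json.dumps(body, indent=2))
```

Expressing "add a slave" as an extra field on instance creation keeps N slaves as N separate create calls, which matches the v1 scope; a dedicated endpoint would be an equally valid design.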

Internal API

Does this change any internal messages between API and Task Manager or Task Manager to Guest?

There would be API changes at every level to support replication.

Guest Agent

Does this change behavior on the Guest Agent? If so, is it backwards compatible with API and Task Manager?

The guest agent would require support for the new replication operations, but backward compatibility should be maintained.