Swift/ClusterFederationBlueprint


 * Launchpad Swift blueprint: cluster-federation

= Abstract =

The goal of this work is to enable account contents to be dispersed across multiple clusters, motivated by (a) accounts that might grow beyond the remaining capacity of a single cluster and (b) clusters offering differentiated service levels such as different levels of redundancy or different storage tiers. Following feedback at the Portland summit, the work is initially limited to dispersal at the container level, i.e. each container within an account may be stored on a different cluster, whereas every object within a container will be stored on the same cluster.

Note that this is distinct from container sync in that objects are only stored in one container.

= Proposed Solution =

The proposed approach maintains an affinity between an account and a single 'home' cluster. Containers are created on an account's home cluster. When configuration and/or policy dictates, this 'home container' may be annotated with a pointer to an associated 'target container' in which objects and user metadata will actually be stored. This pointer will be stored as metadata in the home container database. New middleware on the home cluster will handle all subsequent container and object requests by attempting to retrieve a pointer from the home container and, if successful, forwarding requests to the target container.

The target container will typically be in a duplicate account on a different cluster (the 'target cluster'). When necessary, duplicate accounts are created on demand on target clusters. All clusters use a common Keystone identity service to authenticate requests on accounts.

A forwarding home container never stores objects. It serves two purposes: first, it is an expedient means to store a pointer to the target container; second, its existence causes the container to appear in the account listing.

When configuration and/or policy dictates that the target container is the same as the home container (i.e. that objects in the container should be stored on the home cluster) then no pointer metadata is added to the home container and it behaves as a normal container.
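The routing decision described above can be sketched as a small pure function. This is a minimal illustration only, not the proposed middleware: the metadata key name `POINTER_KEY`, the `Target` tuple and the function names are hypothetical, and the pointer format `cluster/account/container` is assumed from the `{ptr:cluster_B/a/c}` notation used in the figures.

```python
from typing import Mapping, NamedTuple, Optional

# Hypothetical metadata key; the blueprint does not name the key that holds the pointer.
POINTER_KEY = "x-container-meta-federation-pointer"


class Target(NamedTuple):
    cluster: str
    account: str
    container: str


def parse_pointer(pointer: str) -> Target:
    """Split a pointer of the assumed form 'cluster/account/container'."""
    cluster, account, container = pointer.split("/", 2)
    return Target(cluster, account, container)


def resolve_target(home_cluster: str, account: str, container: str,
                   home_metadata: Optional[Mapping[str, str]]) -> Optional[Target]:
    """Return the location that should serve a container or object request.

    home_metadata is the home container's metadata, or None if the home
    container does not exist.
    """
    if home_metadata is None:
        return None  # no home container, so there is nothing to route to
    pointer = home_metadata.get(POINTER_KEY)
    if pointer is None:
        # No pointer: the home container behaves as a normal container.
        return Target(home_cluster, account, container)
    return parse_pointer(pointer)
```

A caller would forward the request when the returned cluster differs from the home cluster and handle it locally otherwise.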

== Example operation: container creation ==
Consider a PUT request to a container path (figure 1). The home cluster middleware will first attempt to retrieve a pointer from the home container metadata. Using the existing container_info function, this metadata may be found in memcache; otherwise a HEAD request will be issued to the container server.

Assuming the container does not already exist, config or policy determines the target cluster for the new container. The home container is created first, with metadata including the pointer to the target container. A PUT to the target cluster then creates the target container.

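 Figure 1: Container creation

 Participants: home cluster_A middleware, home cluster_A proxy, target cluster_B proxy.

 * client -> home middleware: PUT /a/c
 * home middleware -> home proxy: HEAD a/c
 * home proxy -> home middleware: 404 Not Found
 * home middleware: [choose target cluster]
 * home middleware -> home proxy: PUT a/c {ptr:cluster_B/a/c}
 * home middleware -> target proxy: PUT a/c
 * home middleware -> client: response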

== Example operation: object PUT ==
An object PUT requires the pointer to be retrieved from the home container metadata (which may already be held in memcache) before the object PUT is forwarded to the target cluster (figure 2). Note that no object is stored in the home container.

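 Figure 2: Object creation

 Participants: home cluster_A middleware, home cluster_A proxy, target cluster_B proxy.

 * client -> home middleware: PUT /a/c/o
 * home middleware -> home proxy: HEAD a/c
 * home proxy -> home middleware: {ptr:cluster_B/a/c}
 * home middleware -> target proxy: PUT a/c/o
 * home middleware -> client: response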

== Example operation: container deletion ==
When deleting a container, the home cluster middleware similarly retrieves the pointer from the home container metadata, then attempts to delete the target container, and if that succeeds then deletes the home container (figure 3).

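 Figure 3: Container deletion

 Participants: home cluster_A middleware, home cluster_A proxy, target cluster_B proxy.

 * client -> home middleware: DELETE /a/c
 * home middleware -> home proxy: HEAD a/c
 * home proxy -> home middleware: {ptr:cluster_B/a/c}
 * home middleware -> target proxy: DELETE a/c
 * home middleware -> home proxy: DELETE a/c
 * home middleware -> client: response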
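The ordering used for container creation and deletion (home before target on PUT, target before home on DELETE) can be sketched against in-memory stand-ins. Everything here is illustrative: `FakeCluster`, `create_federated`, `delete_federated` and the `"ptr"` key are hypothetical names, not Swift APIs, and real requests would go through each cluster's proxy.

```python
class FakeCluster:
    """In-memory stand-in for a cluster proxy (hypothetical, not a Swift API)."""

    def __init__(self, name, fail_puts=False):
        self.name = name
        self.fail_puts = fail_puts
        self.containers = {}  # (account, container) -> metadata dict

    def put_container(self, account, container, metadata=None):
        if self.fail_puts:
            return 503
        self.containers[(account, container)] = dict(metadata or {})
        return 201

    def head_container(self, account, container):
        # Returns the metadata dict, or None if the container does not exist.
        return self.containers.get((account, container))

    def delete_container(self, account, container):
        if (account, container) not in self.containers:
            return 404
        del self.containers[(account, container)]
        return 204


def create_federated(home, target, account, container):
    """PUT the home container (carrying the pointer) first, then the target."""
    pointer = "%s/%s/%s" % (target.name, account, container)
    status = home.put_container(account, container, {"ptr": pointer})
    if status != 201:
        return status
    return target.put_container(account, container)


def delete_federated(home, clusters, account, container):
    """DELETE the target first; delete the home only if that succeeds."""
    metadata = home.head_container(account, container)
    if metadata is None:
        return 404
    pointer = metadata.get("ptr")
    if pointer is not None:
        target = clusters[pointer.split("/", 2)[0]]
        status = target.delete_container(account, container)
        if status != 204:
            return status  # leave the home container (and its pointer) in place
    return home.delete_container(account, container)
```

If the target PUT fails in this sketch, the home container is left pointing at a container that does not exist; the sketch makes no attempt to recover from that state.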

== Container stats ==
Since no objects are put in the home container, its stats (object count, byte count) are not immediately updated during an object PUT. A new background process, similar to the container updater daemon, will periodically scan all home container databases and update their stats by performing a HEAD on the target container to fetch authoritative data. This HEAD is made to the target cluster proxy, so it only requires container servers to have network visibility of other clusters' proxies (as with container sync). An account listing (GET a?format=json) or account metadata retrieval (HEAD a) may therefore return out-of-date stats until the updater daemon reconciles home container stats with target container stats.

Note that a request for container metadata (HEAD a/c or GET a/c) will always be forwarded to the target container and will therefore return accurate container stats.
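One pass of the proposed stats reconciliation might look roughly like the following. This is a sketch under stated assumptions: the record layout (`ptr`, `object_count`, `bytes_used`) and the `head_target` callable are hypothetical stand-ins for the home container database and for a HEAD request to the target cluster proxy.

```python
def reconcile_home_stats(home_containers, head_target):
    """Refresh the cached stats held by each forwarding home container.

    home_containers: dict of name -> record with optional 'ptr' plus
        'object_count' and 'bytes_used' (hypothetical layout).
    head_target: callable(pointer) -> stats dict, or None if the target
        container does not exist.
    Returns the names of home containers whose target could not be found.
    """
    missing = []
    for name, record in home_containers.items():
        pointer = record.get("ptr")
        if pointer is None:
            continue  # normal container: stats are maintained locally
        stats = head_target(pointer)
        if stats is None:
            missing.append(name)  # candidate for the failure handling discussed later
            continue
        record["object_count"] = stats["object_count"]
        record["bytes_used"] = stats["bytes_used"]
    return missing
```

Returning the list of missing targets keeps the cleanup decision separate from the stats refresh.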

== Home/target container consistency ==
One challenge we face is achieving consistent ordering of container lifecycle events on the home and target clusters. For example, consider concurrent client PUT and DELETE requests on a container: unless we can guarantee consistent ordering of these operations on the home and target clusters then their outcomes may be inconsistent and result in a target container that has no home container, and therefore no pointer to the target (see figure 4). Worse still, a concurrent object put (PUT a/c/o) could result in an object being successfully put into a target container for which we have no home container.

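 Figure 4: risk of concurrent requests resulting in inconsistent home/target states

 Participants: home cluster_A middleware, home cluster_A proxy, target cluster_B proxy.

 * client 1 -> home middleware: PUT a/c
 * home middleware -> home proxy: (t1) PUT a/c
 * client 2 -> home middleware: DELETE a/c
 * home middleware -> home proxy: HEAD a/c
 * home proxy -> home middleware: {ptr:cluster_B/a/c}
 * home middleware -> target proxy: (t2) DELETE a/c
 * home middleware -> home proxy: (t3) DELETE a/c
 * home middleware -> client 2: response
 * home middleware -> target proxy: (t4) PUT a/c
 * home middleware -> client 1: response

 Outcome:
 * home cluster_A: container deleted (t3 > t1)
 * target cluster_B: container exists (t4 > t2)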

The risk of inconsistent outcomes can be avoided by ensuring that the same timestamp is used for requests that are replicated on the home and target clusters. (Note that the state of a container is determined by the values of the put and delete timestamps in the container database, not by the order in which those values are written.) We therefore propose to forward timestamps from the home cluster proxies to the target cluster proxies (using the X-Timestamp header) and have the target cluster proxy use the forwarded timestamp when it is present in a request. (A similar mechanism is already used to ensure timestamp consistency when objects are PUT to a synchronised container destination.)

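 Figure 5: forwarding timestamps guarantees consistent outcomes

 Participants: home cluster_A middleware, home cluster_A proxy, target cluster_B proxy.

 * client 1 -> home middleware: PUT a/c
 * home middleware -> home proxy: (t1) PUT a/c
 * client 2 -> home middleware: DELETE a/c
 * home middleware -> home proxy: HEAD a/c
 * home proxy -> home middleware: {ptr:cluster_B/a/c}
 * home middleware -> target proxy: (t2) DELETE a/c {X-Timestamp:t3}
 * home middleware -> home proxy: (t3) DELETE a/c
 * home middleware -> client 2: response
 * home middleware -> target proxy: (t4) PUT a/c {X-Timestamp:t1}
 * target proxy -> home middleware: 409 Conflict
 * home middleware -> client 1: response

 Outcome:
 * home cluster_A: container deleted (t3 > t1)
 * target cluster_B: container deleted (t3 > t1)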
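The difference between figures 4 and 5 can be replayed with a toy model in which a container's state depends only on its newest put and delete timestamps. This is a simplified sketch, not Swift's container database: `ContainerRecord` and `replay` are hypothetical names, timestamps are small integers, and the 409 returned by `put` merely mirrors the 409 Conflict shown in figure 5.

```python
class ContainerRecord:
    """Keeps only the newest PUT and DELETE timestamps (toy model)."""

    def __init__(self):
        self.put_timestamp = 0
        self.delete_timestamp = 0

    def put(self, timestamp):
        self.put_timestamp = max(self.put_timestamp, timestamp)
        return 201 if self.exists() else 409

    def delete(self, timestamp):
        self.delete_timestamp = max(self.delete_timestamp, timestamp)
        return 204

    def exists(self):
        # State depends on the timestamp values, not on write order.
        return self.put_timestamp > self.delete_timestamp


def replay(forward_timestamps):
    """Replay the request interleaving of figures 4 and 5.

    Returns (home_exists, target_exists).
    """
    t1, t2, t3, t4 = 1, 2, 3, 4  # order in which requests reach the servers
    home, target = ContainerRecord(), ContainerRecord()
    home.put(t1)                                     # (t1) PUT on home
    target.delete(t3 if forward_timestamps else t2)  # (t2) DELETE on target
    home.delete(t3)                                  # (t3) DELETE on home
    target.put(t1 if forward_timestamps else t4)     # (t4) PUT on target
    return home.exists(), target.exists()


print(replay(forward_timestamps=False))  # (False, True): states diverge
print(replay(forward_timestamps=True))   # (False, False): states agree
```

With forwarding enabled, both clusters compare the same pair of timestamps (t1 and t3), so they reach the same outcome regardless of arrival order.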

== Failure scenarios ==
Should a federated operation fail, we may be left with a home container whose pointer refers to a target container that does not exist, e.g. if we create the home container but fail to create the target container. Should this occur, we have two options:


 * Option 1: all requests to the container fail (except DELETE). For a period of time the home container may still appear in the account's container listing; eventually, the new updater process will discover that the target container does not exist and clean up by deleting the home container.
 * Option 2: any request to the container causes the missing target container to be created 'on demand'.

Note that it is not possible for a target container to exist without a corresponding home container, since we always PUT the home before the target, and DELETE the target before the home.

== Security ==
Federation requires some behaviour of the target cluster to be conditional upon whether a request has been received directly from a client, or has been forwarded by another cluster. For example, the X-Timestamp header should be accepted for a request that has been forwarded by another cluster but not for a request received from a client. This requires some means to validate the source of requests.

Since our initial use case is federation of clusters within a single data centre, we will initially assume that this validation is derived from network visibility, i.e. that target clusters are not visible to external clients and can therefore only receive requests from forwarding home clusters. Checking source IP addresses provides some weak authentication.
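A source-address check of the kind mentioned above could be sketched as follows, using Python's standard `ipaddress` module. The function name, its parameters and the idea of a configured list of trusted peer networks are assumptions made for illustration; the blueprint does not specify how the check would be configured.

```python
import ipaddress


def effective_timestamp(headers, remote_addr, trusted_networks, now):
    """Honour X-Timestamp only when the request comes from a trusted peer.

    headers: dict of request headers.
    remote_addr: source IP address of the request, as a string.
    trusted_networks: iterable of CIDR strings for peer clusters (hypothetical config).
    now: timestamp to use when the forwarded one is absent or not trusted.
    """
    addr = ipaddress.ip_address(remote_addr)
    trusted = any(addr in ipaddress.ip_network(net) for net in trusted_networks)
    forwarded = headers.get("X-Timestamp")
    if trusted and forwarded is not None:
        return forwarded
    return now  # client-supplied timestamps are ignored
```

As the text notes, this is only weak authentication: it relies on the network preventing clients from reaching the target cluster from a trusted address.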

True symmetric federation, in which any cluster may receive requests directly from clients or forwarded within the federation, will require some inter-cluster authentication mechanism in addition to user authentication. For example, clusters might add credentials to forwarded requests that a target cluster may use to verify that the request has arrived from an authorised peer cluster.

In every case we need to ensure that the user’s credentials are preserved when requests are forwarded to a target cluster, so that operations in the target cluster are authorized as if the system were not federated.

= Summary of anticipated code additions =

 * New middleware implementing federation logic
 * Enable target container pointer to be stored in and retrieved from container database
 * Enable X-Timestamp header to be accepted with container requests
 * New container server daemon to update home container stats from target container