Swift/FastPost

Revision as of 15:57, 24 June 2014 by Acoles (talk | contribs)

Motivation

The purpose of this discussion is to examine a potential solution to enable fast-POST modifications of content-type to be accurately reflected in container listings (recognising that others have been over this ground before and may immediately find flaws in this proposal…).

The timing of this discussion is prompted by the new two-vector timestamp feature introduced with storage policies, which *may* provide a mechanism to support fast-POST content-type updates. However, the fast-POST use case would demand a larger number of digits for the timestamp offset than is otherwise required for the storage policy reconciler. So it is worth considering if there is any hope for the fast-POST use case before the timestamp format is set in stone (although, as discussed later, it may be possible to implement this proposal without re-using the two-vector timestamp).

Background

Container update of content-type after a fast-POST does not currently happen because an object row will only be updated if a newer timestamp is given, and doing so for a fast-POST would erroneously risk marking ‘stale’ etag and size information as ‘fresh’ in the container db. My understanding is that this is the reason for post-as-copy being the default mode for handling object POSTs.

Proposed modifications (abstract description):

When handling a fast-POST that modifies an object’s content-type, modify object server behaviour to issue a container update which conveys the timestamp of the existing object .data file and a second timestamp t_meta that indicates the time at which the object content-type was modified. (t_meta would be the timestamp of the .meta file).

Modify the container server to record t_meta in the object’s db row (NB not assuming that the object table is altered to add a new column for t_meta – the exact mechanism for this is discussed later).

Modify the container broker’s merge_items() method so that the most recent content-type (as indicated by t_meta) is persisted with the most recent object row (as indicated by the .data file timestamp) – details described in discussion below.
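The update issued by the object server might be sketched as follows. The header names and helper function here are illustrative assumptions for this discussion, not Swift's actual internal API:

```python
# Sketch (assumed names): build the container update sent after a
# fast-POST that changes content-type. The etag and size come from the
# existing .data file; only the content-type and t_meta are new.

def build_container_update(obj_name, data_timestamp, etag, size,
                           new_content_type, meta_timestamp):
    """Headers for the container update issued after a fast-POST.

    data_timestamp: timestamp of the existing .data file (t_data)
    meta_timestamp: timestamp of the new .meta file (t_meta)
    """
    return {
        'X-Object-Name': obj_name,
        'X-Timestamp': data_timestamp,       # row timestamp stays at t_data
        'X-Etag': etag,                      # unchanged by the POST
        'X-Size': str(size),                 # unchanged by the POST
        'X-Content-Type': new_content_type,  # updated by the POST
        'X-Meta-Timestamp': meta_timestamp,  # hypothetical new header for t_meta
    }
```

The key point is that the row timestamp sent to the container is still t_data, so the etag and size cannot be mistaken for fresher information than they are.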

Example scenarios

Consider initial state for an object that was PUT at time t1.

 Obj server 1,2,3: …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}
 Container server 1,2,3: {o, etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1 }


Scenario 1: All servers initially consistent, successful fast-POST that modifies an object’s content-type.

When all is well our object servers will end up in a consistent state:

 Obj server 1,2,3: …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}
                   …/t2.meta {c_type=NEW_TYPE}

The proposal is for the fast-POST to trigger a container update that requires slightly different handling in the container backend merge_items() method:

Currently no db change is made if a row already exists where name=o and created_at>=t1. We now require that if such a row exists AND its t_meta<t2 then that row should be updated to record content_type=NEW_TYPE and t_meta=t2. Note that the existing row’s created_at value is never changed.
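In code, the rule might look like this simplified sketch, modelling a db row as a dict. The real merge_items() operates on sqlite rows, and treating a missing t_meta as equal to created_at is an assumption:

```python
def merge_existing_row(row, update):
    """Apply a container update against an existing row for the same name.

    row, update: dicts with keys 'created_at', 'content_type' and an
    optional 't_meta'. Returns True if the update was consumed by the
    existing row, False if the update is newer and the normal
    delete-and-insert path should apply instead.
    """
    if row['created_at'] >= update['created_at']:
        # Existing row is at least as new as the update's data timestamp;
        # fold in the content-type only if the update's t_meta is newer.
        # A row with no t_meta is treated as having t_meta == created_at.
        if update.get('t_meta') and \
                row.get('t_meta', row['created_at']) < update['t_meta']:
            row['content_type'] = update['content_type']
            row['t_meta'] = update['t_meta']
        return True   # note: created_at is never changed
    return False
```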

This leaves us with:

 Container server 1,2,3: {o, etag=ETAG, size=s1, c_type=NEW_TYPE, ts=t1, t_meta=t2}

(Again, this is an abstract representation of the content of the db row - it is not proposed that the db must be altered to add a column for t_meta - see later discussion). Note that the container server timestamp remains at t1, the timestamp of the .data file.

Now consider some failure scenarios:

Scenario 2: Container update after a fast-POST fails to reach a subset of container servers, e.g.:

 Container server 1,2  : {o, etag=ETAG, size=s1, c_type=NEW_TYPE, ts=t1, t_meta=t2}
 Container server 3    : {o, etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}

The db inconsistency will be fixed during replication, given our modification to the merge_items() logic: the row on server 3 will be updated with the new content_type and t_meta.

Scenario 3: Object fast-POST fails to reach a subset of object servers, e.g.:

 Obj server 1,3  : …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}
                   …/t2.meta {c_type=NEW_TYPE}
 Obj server 2    : …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}

The object replicator will in time copy t2.meta to obj server 2. No change from existing behaviour.

Scenario 4: Stale object data on a subset of object servers when a fast-POST occurs, e.g.:

 Obj server 1,3  : …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}
                   …/t2.meta {c_type=NEW_TYPE}
 Obj server 2    : …/t0.data {etag=OLD_ETAG, size=s0, c_type=OLD_TYPE}
                   …/t2.meta {c_type=NEW_TYPE}

The object replicator will in time copy t2.meta to obj server 2. No change from existing behaviour.

Object server 2 will send container updates {o, etag=OLD_ETAG, size=s0, c_type=NEW_TYPE, ts=t0, t_meta=t2}. Assuming container servers are up to date, their newest row for the object will have created_at=t1 (> t0) and t_meta=t2, so the updates will be ignored.

Scenario 5: Stale object data on a subset of object servers; the fast-POST only succeeds on the stale object servers:

 Obj server 1,3  : …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}
 Obj server 2    : …/t0.data {etag=OLD_ETAG, size=s0, c_type=OLD_TYPE}
                   …/t2.meta {c_type=NEW_TYPE}

The object replicator will in time copy t2.meta to obj server 2. No change from existing behaviour.

Object server 2 will send container updates {o, etag=OLD_ETAG, size=s0, c_type=NEW_TYPE, ts=t0, t_meta=t2}. Assuming container servers are up to date, their newest row for the object will have created_at=t1 (> t0), but our modification to merge_items() means that those rows will be updated with content_type=NEW_TYPE and t_meta=t2.

Scenario 6: Stale object data on a subset of object servers; the fast-POST only succeeds on the stale object servers; async updates arrive from the fresh object servers.

Similar to previous scenario except the container servers are not up to date when the fast-POST induced update arrives from the stale object server. Container servers are initially in state:

 Container server 1,2,3: {o, etag=OLD_ETAG, size=s0, c_type=OLD_TYPE, ts=t0}

They are then updated to:

 Container server 1,2,3: {o, etag=OLD_ETAG, size=s0, c_type=NEW_TYPE, ts=t0, t_meta=t2}

Then async updates arrive with {etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}. Currently, merge_items() will delete the row with created_at=t0 and insert a new row. We need a further modification to merge_items():

Before deleting ‘old’ rows, t_meta is selected from the newest existing row; if that value is greater than both the data timestamp and the t_meta of the update, then the existing content_type and t_meta are preserved in the newly inserted row.
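This delete-and-insert path might be sketched like so, modelling db rows as dicts (an illustrative sketch, not the real sqlite-based merge_items()):

```python
def replace_row(old_row, update):
    """Build the replacement row when the update's data timestamp is newer.

    If the old row carries a t_meta newer than both the update's data
    timestamp and the update's t_meta, the old row's content_type and
    t_meta survive into the new row; etag and size always come from
    the update.
    """
    new_row = dict(update)
    old_t_meta = old_row.get('t_meta')
    # An update with no t_meta is treated as t_meta == created_at.
    update_t_meta = update.get('t_meta') or update['created_at']
    if old_t_meta and old_t_meta > max(update['created_at'], update_t_meta):
        new_row['content_type'] = old_row['content_type']
        new_row['t_meta'] = old_t_meta
    return new_row
```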

This results in:

 Container server 1,2,3: {o, etag=ETAG, size=s1, c_type=NEW_TYPE, ts=t1, t_meta=t2}

Scenario 7: Container servers out of sync

A combination of failures could leave the container servers in an inconsistent state:

 Container server 1,2  : {o, etag=OLD_ETAG, size=s0, c_type=NEW_TYPE, ts=t0, t_meta=t2}
 Container server 3    : {o, etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}

The proposed modifications to the merge_items() method will result in the container servers reaching consistency after replication.

How to store t_meta in the container db

We can avoid altering existing object tables to add a new column for t_meta by encoding t_meta into an existing text field.

(a) Encode in the content_type field in the form [type, t_meta]. RFC 2616 [1] specifies that content-type values should not contain square brackets, making it straightforward to disambiguate between a JSON-encoded list [type, t_meta] and a simple content-type value. This approach would separate the implementation of this fast-POST proposal from the storage policies two-vector timestamp.
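Option (a) might be sketched as follows; the helper names are illustrative, and the real broker code would differ:

```python
import json

def pack_content_type(content_type, t_meta=None):
    """Encode t_meta into the existing content_type db column.

    RFC 2616 media type values cannot contain '[' or ']', so a stored
    value starting with '[' is unambiguously our JSON-encoded pair.
    """
    if t_meta is None:
        return content_type
    return json.dumps([content_type, t_meta])

def unpack_content_type(value):
    """Return (content_type, t_meta-or-None) from a stored column value."""
    if value.startswith('['):
        content_type, t_meta = json.loads(value)
        return content_type, t_meta
    return value, None
```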

OR

(b) The Storage Policy feature introduces a two-vector timestamp to enable reconciliation of objects placed in incorrect policies. Timestamps now take the form normal_offset, where ‘normal’ is the original timestamp form and ‘offset’ is an optional time delta from the ‘normal’ timestamp. Since t_meta will always be greater than the data timestamp, we can store its value in the data timestamp offset (i.e. offset = t_meta - t_data). This will require a larger offset field than is necessary for storage policies alone – currently the offset is 16 digits in anticipation of this, but it could be fewer digits if not required for this fast-POST proposal.
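Option (b) might be sketched as follows, assuming a 10usec offset resolution; the format details of Swift's actual two-vector Timestamp class differ:

```python
PRECISION = 100000   # 10-microsecond units per second (assumed resolution)
OFFSET_DIGITS = 16   # hex digits in the offset field

def encode_timestamp(t_data, t_meta=None):
    """Encode t_meta as a hex offset appended to the data timestamp.

    Timestamps are floats in seconds; the offset is the non-negative
    delta t_meta - t_data expressed in 10usec units.
    """
    normal = '%016.5f' % t_data
    if t_meta is None:
        return normal
    offset = int(round((t_meta - t_data) * PRECISION))
    return '%s_%0*x' % (normal, OFFSET_DIGITS, offset)

def decode_timestamp(value):
    """Return (t_data, t_meta-or-None) from an encoded timestamp."""
    if '_' not in value:
        return float(value), None
    normal, offset = value.split('_')
    t_data = float(normal)
    return t_data, t_data + int(offset, 16) / float(PRECISION)
```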

How many digits are required for the timestamp offset part?

This proposal is the motivation for an increased number of digits in the timestamp offset (offsets of fewer than 16 digits are likely to be sufficient for misplaced storage policy reconciliation). If the fast-POST use case for timestamp offsets is invalid or unwanted then more compact timestamp offsets should be used.

Assumptions:

 1. The offset resolution is the same as the normal timestamp, i.e. 10usecs (is this reasonable?)
 2. The offset is a hex-encoded integer number of 10usec units

12 digits will allow for approx. 89 years between an object PUT and a POST with a content-type update. 13 digits will allow for approx. 1424 years, which is well beyond the 272 years until the normal timestamp starts to spill over. 16 digits will allow for many millennia.
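These figures follow from simple arithmetic over the assumed 10usec unit:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def max_offset_years(hex_digits, unit_seconds=1e-5):
    """Maximum PUT-to-POST interval representable by a hex offset field.

    hex_digits: width of the offset field
    unit_seconds: resolution of one offset unit (10usec assumed)
    """
    max_units = 16 ** hex_digits - 1
    return max_units * unit_seconds / SECONDS_PER_YEAR
```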

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7 and http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2