Swift/FastPost

Summary

 * Container updates on fast-POST seem tractable using separate 'content-type' specific timestamps, without altering the row['created_at'] value.
 * That would fix the container listing inconsistency and allow container sync to 'find' updated objects.
 * Container sync throws up some other nasty challenges when actually sync'ing updated objects: the timestamp of content-type must be separately communicated, as must the timestamp of POSTed user metadata. Receiving ends may need to reconstruct both .data and .meta files.
 * Two-vector timestamps *may* provide a suitable mechanism to represent 'meta' timestamps, but encoding (metadata, timestamp) tuples may be more suitable and would decouple the pending decision over timestamp offset digits from the viability of fast-POST.

Motivation
The purpose of this discussion is to examine a potential solution to enable fast-POST modifications of content-type to be accurately reflected in container listings (recognising that others have been over this ground before and may immediately find flaws in this proposal…).

The timing of this discussion is prompted by the new two-vector timestamp feature introduced with storage policies, which *may* provide a mechanism to support fast-POST content-type updates. However, the fast-POST use case would demand a larger number of digits for the timestamp offset than is otherwise required for the storage policy reconciler. So it is worth considering if there is any hope for the fast-POST use case before the timestamp format is set in stone (although, as discussed later, it may be possible to implement this proposal without re-using the two-vector timestamp).

Background
Container update of content-type after a fast-POST does not currently happen because an object row is only updated when a newer timestamp is given, and sending the POST timestamp for a fast-POST would risk erroneously marking ‘stale’ etag and size information as ‘fresh’ in the container db. My understanding is that this is the reason post-as-copy is the default mode for handling object POSTs.

Proposed modifications (abstract description):
''When handling a fast-POST that modifies an object’s content-type, modify object server behaviour to issue a container update which conveys the timestamp of the existing object .data file and a second timestamp t_meta that indicates the time at which the object content-type was modified. (t_meta would be the timestamp of the .meta file).

Modify the container server to record t_meta in the object’s db row (NB not assuming that the object table is altered to add a new column for t_meta – the exact mechanism for this is discussed later).

Modify the container broker’s merge_items method so that the most recent content-type (as indicated by t_meta) is persisted with the most recent object row (as indicated by the .data file timestamp) – details described in discussion below. ''

Example scenarios
Consider initial state for an object that was PUT at time t1.

Obj server 1,2,3:       …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}

Container server 1,2,3: {o, etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}

Scenario 1: All servers initially consistent, successful fast-POST that modifies an object’s content-type.
When all is well our object servers will end up in a consistent state:

Obj server 1,2,3: …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE} …/t2.meta {c_type=NEW_TYPE}

The proposal is for the fast-POST to trigger a container update that requires slightly different handling in the container backend merge_items method:

''Currently no db change is made if a row already exists where name=o and created_at>=t1. We now require that if such a row exists AND its t_meta<t2 then that row should be updated to record content_type=NEW_TYPE and t_meta=t2. Note that the existing row’s created_at value is never changed.''
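As a rough illustration, the modified merge rule could be sketched as follows. The function and field names here are hypothetical, not actual container broker code; timestamps are treated as opaque sortable values, and t_meta defaults to the data timestamp when absent:

```python
def merge_update(existing_row, update):
    """Merge a container update into an existing object row.

    Each row/update is a dict with keys: 'created_at' (the .data file
    timestamp), 't_meta' (the content-type timestamp, defaulting to
    'created_at') and 'content_type'.
    """
    if update['created_at'] > existing_row['created_at']:
        # Newer .data file: the update replaces the row entirely.
        return dict(update)
    merged = dict(existing_row)
    # The existing row's data is at least as new, but the update may
    # still carry a fresher content-type from a fast-POST.
    if (update.get('t_meta', update['created_at']) >
            merged.get('t_meta', merged['created_at'])):
        merged['content_type'] = update['content_type']
        merged['t_meta'] = update['t_meta']
    return merged

# A row PUT at t1 with OLD_TYPE, then a fast-POST update at t2:
row = {'created_at': 't1', 't_meta': 't1', 'content_type': 'OLD_TYPE'}
post = {'created_at': 't1', 't_meta': 't2', 'content_type': 'NEW_TYPE'}
merged = merge_update(row, post)
# merged keeps created_at=t1 but gains content_type=NEW_TYPE, t_meta=t2
```

Note that the existing row’s created_at is only ever superseded by a genuinely newer .data timestamp, matching the rule above.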

This leaves us with:

Container server 1,2,3: {o, etag=ETAG, size=s1, c_type=NEW_TYPE, ts=t1, t_meta=t2}

(Again, this is an abstract representation of the content of the db row - it is not proposed that the db must be altered to add a column for t_meta - see later discussion). Note that the container server timestamp remains at t1, the timestamp of the .data file.

Now consider some failure scenarios:

Scenario 2: Container update after fast-POST fails to subset of container servers:
e.g.:

Container server 1,2: {o, etag=ETAG, size=s1, c_type=NEW_TYPE, ts=t1, t_meta=t2}

Container server 3:   {o, etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}

The db inconsistency will be fixed during replication given our modification to the merge_items logic i.e. the row in server 3 will be updated with the new content_type and t_meta.

Scenario 3: Object fast_POST fails to subset of object servers.
e.g.

Obj server 1,3: …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE} …/t2.meta {c_type=NEW_TYPE}

Obj server 2:   …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}

The object replicator will in time copy t2.meta to obj server 2. No change from existing behaviour.

Scenario 4: Stale object data on subset of object servers when fast-POST occurs.
E.g.

Obj server 1,3: …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE} …/t2.meta {c_type=NEW_TYPE}

Obj server 2:   …/t0.data {etag=OLD_ETAG, size=s0, c_type=OLD_TYPE} …/t2.meta {c_type=NEW_TYPE}

The object replicator will in time copy t2.meta to obj server 2. No change from existing behaviour.

Object server 2 will send container updates {o, etag=OLD_ETAG, size=s0, c_type= NEW_TYPE, ts=t0, t_meta=t2}. Assuming container servers are up to date, their newest row for the object will have created_at=t1 (> t0) and t_meta=t2, so the updates will be ignored.

Scenario 5: Stale object data on a subset of object servers, fast POST only succeeds on the stale object servers:
Obj server 1,3: …/t1.data {etag=ETAG, size=s1, c_type=OLD_TYPE}

Obj server 2:   …/t0.data {etag=OLD_ETAG, size=s0, c_type=OLD_TYPE} …/t2.meta {c_type=NEW_TYPE}

The object replicator will in time copy t2.meta to obj server 2. No change from existing behaviour.

Object server 2 will send container updates {o, etag=OLD_ETAG, size=s0, c_type= NEW_TYPE, ts=t0, t_meta=t2}. Assuming container servers are up to date, their newest row for the object will have created_at=t1 (> t0) but our modification to the merge_items would mean that those rows would be updated with content_type=NEW_TYPE and t_meta=t2.

Scenario 6: Stale object data on a subset of object servers, fast POST only succeeds on the stale object servers, async updates from fresh object servers.
Similar to the previous scenario, except that the container servers are not up to date when the fast-POST induced update arrives from the stale object server. Container servers are initially in state:

Container server 1,2,3: {o, etag=OLD_ETAG, size=s0, c_type=OLD_TYPE, ts=t0}

They are then updated to:

Container server 1,2,3: {o, etag=OLD_ETAG, size=s0, c_type=NEW_TYPE, ts=t0, t_meta=t2}

Then async updates arrive with {etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}. Currently, merge_items will delete the row with created_at=t0 and insert a new row. We need a further modification to merge_items:

Before deleting ‘old’ rows, t_meta is selected from the newest current row, and if that value is greater than both the data timestamp and t_meta of the update, then the existing t_meta is preserved in the newly inserted row.

This results in:

Container server 1,2,3: {o, etag=ETAG, size=s1, c_type=NEW_TYPE, ts=t1, t_meta=t2}
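The additional rule introduced by this scenario – preserving a newer t_meta (and its content-type) when an older row is deleted and replaced – could be sketched with the same illustrative conventions (hypothetical names, timestamps as opaque sortable values):

```python
def replace_row(existing_row, update):
    """Replace an older row with a newer update, preserving a fresher
    t_meta (and its content_type) from the row being deleted."""
    new_row = dict(update)
    new_row.setdefault('t_meta', new_row['created_at'])
    old_t_meta = existing_row.get('t_meta', existing_row['created_at'])
    # If the old row's t_meta is newer than both timestamps of the
    # update, carry its content-type forward into the new row.
    if old_t_meta > max(new_row['created_at'], new_row['t_meta']):
        new_row['content_type'] = existing_row['content_type']
        new_row['t_meta'] = old_t_meta
    return new_row

# Scenario 6: the row from the stale server already carries NEW_TYPE
# at t2; an async update then arrives with the t1 data but OLD_TYPE.
row = {'created_at': 't0', 't_meta': 't2', 'content_type': 'NEW_TYPE'}
update = {'created_at': 't1', 'content_type': 'OLD_TYPE'}
new_row = replace_row(row, update)
# new_row: created_at=t1, content_type=NEW_TYPE, t_meta=t2
```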

Scenario 7: Container servers out of sync
A combination of failures could leave the container servers in an inconsistent state:

Container server 1,2: {o, etag=OLD_ETAG, size=s0, c_type=NEW_TYPE, ts=t0, t_meta=t2}

Container server 3:   {o, etag=ETAG, size=s1, c_type=OLD_TYPE, ts=t1}

The proposed modifications to the merge_items method will result in the container servers reaching consistency after replication.

Container Sync
There are (at least) two considerations w.r.t. container-sync:

Sync points:
The container sync daemon keeps track of objects that have been sync’d by maintaining sync points, which are ROW_IDs. So, if we modify a row in the object table to reflect a change in the object’s content-type, we need to delete the existing row and insert a new one to ensure that the sync daemon picks up the change.

Selecting ‘Newest object’:
When pushing an object to its sync peer, the container sync daemon issues a GET to all local object servers and selects the response with the most recent x-timestamp. If an object has been updated with a fast-POST, the x-timestamp returned with a subsequent GET is the timestamp of the POST, not of the .data file. So there’s a risk that the container sync daemon will choose a response from a server with a stale .data file but a fresh .meta (BAD THING).

ASIDE: this is also the behaviour that results when using X-Newest on objects that have had a fast-POST, i.e. X-Newest returns the object with the most recent .meta time, not the most recent .data time.

This requires some wrangling, but could be addressed by having object servers return separate data file and content-type timestamps with GET responses (e.g. either as X-Backend-Data-Timestamp and X-Backend-Content-Type-Timestamp, or as a two-vector X-Backend-Composite-Timestamp). The combination of these timestamps would be used to select the most ‘up to date’ response to act as a source for the sync. ‘Up to date’ would be defined as having a data timestamp >= the row’s created_at value AND a content_type timestamp >= the row’s t_meta value. (The sync daemon already checks that the GET response timestamp is >= row[‘created_at’]).
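A sketch of this ‘up to date’ check, assuming the hypothetical X-Backend-Data-Timestamp and X-Backend-Content-Type-Timestamp headers proposed above (timestamps again treated as opaque sortable values):

```python
def is_up_to_date(resp_headers, row):
    """Check a GET response against the container db row, using the
    hypothetical per-response backend timestamp headers."""
    data_ts = resp_headers['X-Backend-Data-Timestamp']
    # Content-type timestamp defaults to the data timestamp when the
    # object has never been fast-POSTed.
    ct_ts = resp_headers.get('X-Backend-Content-Type-Timestamp', data_ts)
    return (data_ts >= row['created_at'] and
            ct_ts >= row.get('t_meta', row['created_at']))

row = {'created_at': 't1', 't_meta': 't2'}
# A server with a stale .data file but fresh .meta is rejected, even
# though its plain x-timestamp (t2) would look newest:
stale = {'X-Backend-Data-Timestamp': 't0',
         'X-Backend-Content-Type-Timestamp': 't2'}
fresh = {'X-Backend-Data-Timestamp': 't1',
         'X-Backend-Content-Type-Timestamp': 't2'}
assert not is_up_to_date(stale, row)
assert is_up_to_date(fresh, row)
```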

'''QUESTION: the sync daemon uses the row[‘created_at’] value to set the x-timestamp of the object PUT to the peer container, even if this is < x-timestamp from object server GET. Why not use the ‘up to date’ timestamp returned from the object server?'''

Next, the content-type timestamp would need to be sent with the object PUT so that the sync peer can reconstruct the same view of the object (e.g. send X-Content-Type-Timestamp=t_meta*), either by time-stamping the data file with a new-style two-vector timestamp, or by embedding the content-type timestamp in the data file metadata.

(In fact, more generally, if the sending end has a .meta file then a universal .meta timestamp should be sent, so that at the receiving end the object server can reconstruct a consistent object state - which I think means creating a .meta file if the metadata timestamp != data timestamp. The content-type timestamp is a special case because - with this proposal - content-type could exist in both a .data file or a .meta file so needs to be explicitly time-stamped, as opposed to other metadata which can be treated as a whole).

What should the object server do if it already has a t1.data file and receives a PUT with data timestamp = t0 and content-type (or meta) timestamp = t2 (t0<t1<t2)? Ignore the data, treat the PUT as a POST of new content-type and generate t2.meta? That would be justifiable, since the meta timestamp indicates the presence of a .meta file on the sending end.

ASIDE - Similar issues may arise (and therefore need to be addressed) if we are ever to support updateable object sysmeta (individually timestamped sysmeta), and perhaps during replication with ssync.

How to store t_meta in the container db
We can avoid altering existing object tables to add a new column for t_meta by encoding t_meta into an existing text field.

(a) Encode in the content_type field in the form [type, t_meta]. RFC 2616 [1] specifies that content-type values should not contain square brackets, making it straightforward to disambiguate between a JSON-encoded list [type, t_meta] and a simple content-type value. This approach would separate the implementation of this fast-POST proposal from the storage policies two-vector timestamp.
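A minimal sketch of option (a), with hypothetical helper names. Since RFC 2616 token characters exclude ‘[’, a leading bracket reliably distinguishes an encoded [type, t_meta] pair from a plain content-type value:

```python
import json

def encode_content_type(content_type, t_meta=None):
    """Pack (content_type, t_meta) into the existing text column."""
    if t_meta is None:
        return content_type
    return json.dumps([content_type, t_meta])

def decode_content_type(value):
    """Return (content_type, t_meta); t_meta is None for plain values."""
    if value.startswith('['):
        content_type, t_meta = json.loads(value)
        return content_type, t_meta
    return value, None

encoded = encode_content_type('text/plain', '0000001234.56789')
assert decode_content_type(encoded) == ('text/plain', '0000001234.56789')
assert decode_content_type('text/plain') == ('text/plain', None)
```

Old rows containing plain content-type values would decode unchanged, so no migration of existing object tables would be needed.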

OR

(b) The Storage Policy feature introduces a two-vector timestamp to enable reconciliation of objects placed in incorrect policies. Timestamps now take the form normal_offset, where ‘normal’ is the original timestamp form and ‘offset’ is an optional time delta from the ‘normal’ timestamp. Since t_meta will always be greater than the data timestamp, we can store its value in the data timestamp offset (i.e. offset = t_meta - t_data). This requires a larger offset field than is necessary for storage policies – currently the offset is 16 digits in anticipation of this, but it could be fewer digits if not required for this fast-POST proposal.
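A sketch of option (b), assuming (as stated in the assumptions below) that the offset is a hex-encoded count of 10usec units appended to the normal timestamp as normal_offset; the helper names and exact field widths are illustrative:

```python
def encode_ts(t_data, t_meta, offset_digits=16):
    """Combine data and meta times (floats, seconds; t_meta >= t_data)
    into a two-vector 'normal_offset' string."""
    offset_units = int(round((t_meta - t_data) * 100000))  # 10us units
    return '%016.5f_%0*x' % (t_data, offset_digits, offset_units)

def decode_ts(value):
    """Recover (t_data, t_meta) from a 'normal_offset' string."""
    normal, offset = value.split('_')
    t_data = float(normal)
    t_meta = t_data + int(offset, 16) / 100000.0
    return t_data, t_meta

# An object PUT at t_data, fast-POSTed 60 seconds later:
ts = encode_ts(1400000000.0, 1400000060.0)
t_data, t_meta = decode_ts(ts)
assert t_data == 1400000000.0
assert abs(t_meta - 1400000060.0) < 1e-5
```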

How many digits are required for the timestamp offset part?

This proposal is motivating an increased number of digits for the timestamp offset (offsets of fewer than 16 digits are likely to be sufficient for misplaced storage policy reconciliation). If the fast-POST use case for timestamp offsets is invalid or unwanted then more compact timestamp offsets should be used.

Assumptions:

 1. The offset resolution is the same as the normal timestamp, i.e. 10usecs (is this reasonable?)
 2. The offset is a hex-encoded integer number of 10usec units

 * 12 digits will allow for approx. 89 years max between object PUT and POST with content-type update.
 * 13 digits will allow for approx. 1424 years max between object PUT and POST with content-type update, which is well beyond the 272 years until the normal timestamp starts to spill over.
 * 16 digits will allow for many millennia between object PUT and POST with content-type update.
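These figures can be checked with a few lines of arithmetic under the assumptions above (hex digits counting 10usec units; the exact year counts shift slightly with the assumed year length):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def max_offset_years(digits):
    """Maximum PUT-to-POST interval representable by a hex offset of
    the given width, counting in 10-microsecond units."""
    max_units = 16 ** digits - 1
    return max_units * 1e-5 / SECONDS_PER_YEAR

# 12 digits: ~89 years; 13 digits: ~1427 years (with a 365.25-day
# year); 16 digits: several million years.
assert 89 < max_offset_years(12) < 90
assert 1420 < max_offset_years(13) < 1430
assert max_offset_years(16) > 5000000
```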

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7 and http://www.w3.org/Protocols/rfc2616/rfc2616-sec2.html#sec2.2