Swift/ObjectSystemMetadata


 * Launchpad Swift blueprint: object-system-metadata

= UPDATE =

The content of this wiki page is stale:
 * object sysmeta support on PUTs has been added to Swift in change If716bc15730b7322266ebff4ab8dd31e78e4b962 https://review.openstack.org/#/c/79991/
 * consensus was NOT reached on the approach described below (multiple meta files)
 * there is further discussion of updateable object sysmeta in this (unapproved) spec https://review.openstack.org/#/c/109314/

= Overview =

The original system metadata patch (https://review.openstack.org/#/c/51228/) supported only account and container system metadata.

There are now patches in review that store middleware-generated metadata against objects, e.g.:
 * on demand migration https://review.openstack.org/#/c/64430/
 * server side encryption https://review.openstack.org/#/c/76578/1

Object system metadata should not be stored in the x-object-meta- user metadata namespace because (a) there is a potential name conflict with arbitrary user metadata and (b) system metadata in the x-object-meta- namespace will be lost if a user sends a POST request to the object.

The goal of this work is to enable system metadata to be persisted with objects in a similar way to user metadata, but with the ability to update individual items of system metadata independently when making a POST request, unlike user metadata which is replaced as a whole by a POST request.

Proposed approach:

Initially enable object system metadata only on PUT requests to support existing use cases:
 * System metadata stored in .data file xattrs and copied unchanged to .meta files.
 * Use the x-object-sysmeta- namespace already supported by proxy and reserved by gatekeeper middleware.
 * Store x-object-sysmeta- headers in object file xattrs alongside other object metadata.
 * Store object system metadata attributes as key: (value, timestamp), as per account and container metadata, enabling a most-recent-wins rule to be applied to individual attributes in future.

Follow up with enabling system metadata updates on POST:
 * Allow for concurrent POST requests resulting in multiple .meta files with divergent sets of system metadata.
 * When handling POST, the object server (DiskFile) reads potentially multiple existing .meta files and merges their system metadata into a unified set that is written to a new .meta file. The system metadata key: (value, timestamp) format allows the most recent version of each system metadata key to be selected from the set of .meta files.
 * When handling GET/HEAD similarly read and merge system metadata from potentially multiple .meta files.
 * Only delete .meta files when their system metadata has been merged into a newer .meta file.

= Details =

System metadata can be stored alongside user metadata on the object server by adding the x-object-sysmeta- namespace to the set of persisted headers. However, the object user metadata semantics (POST replaces the whole set) are not appropriate for system metadata. Instead, POSTs with system metadata headers should result in those system metadata items being updated while other existing system metadata is preserved.

'fast-POST'
In the absence of concurrent requests on the same object, item-by-item updates to system metadata can be achieved by simple changes to the metadata read-modify-write cycle in the object server. System metadata (distinguished by the namespace x-object-sysmeta-) is read from the existing .data or .meta file and updated item-by-item, whereas user metadata continues to be updated as a whole set.
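The read-modify-write cycle described above can be sketched as follows. This is only an illustration under stated assumptions, not the object server's actual code: the function name apply_post and the in-memory dict layout are hypothetical, and timestamps are plain numbers for brevity.

```python
SYSMETA_PREFIX = "x-object-sysmeta-"
USERMETA_PREFIX = "x-object-meta-"


def apply_post(existing, post_headers, timestamp):
    """Return the metadata to write to a new .meta file (hypothetical helper).

    existing: metadata read from the current .data or .meta file.
    System metadata is stored as (value, timestamp); user metadata as a
    plain value.
    """
    updated = {}
    # System metadata: carry every existing item forward unchanged.
    for key, item in existing.items():
        if key.startswith(SYSMETA_PREFIX):
            updated[key] = item
    for key, value in post_headers.items():
        key = key.lower()
        if key.startswith(SYSMETA_PREFIX):
            # Only the items named in the POST are overwritten.
            updated[key] = (value, timestamp)
        elif key.startswith(USERMETA_PREFIX):
            # User metadata: the POST supplies the whole new set, so
            # existing user metadata is deliberately not carried forward.
            updated[key] = value
    return updated


existing = {"x-object-sysmeta-p": ("p1", 0), "x-object-meta-color": "red"}
result = apply_post(
    existing,
    {"X-Object-Sysmeta-X": "x1", "X-Object-Meta-Size": "big"},
    2,
)
```

In this sketch the existing x-object-sysmeta-p item survives the POST, while the old user metadata item x-object-meta-color is dropped because the POST did not resend it.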

Concurrent POSTs create the potential for multiple .meta files to be written. For user metadata this is not a problem, since the most recent .meta file is considered to contain the most recent whole set of metadata. For system metadata, each concurrently generated .meta file might contain unique items of metadata that should be preserved and merged when handling subsequent requests.

The proposed new behavior is therefore to preserve multiple .meta files in the obj_dir until their system metadata is known to have been read and merged into a newer .meta file. Then, when constructing a diskfile object, all existing .meta files (usually just one) should be read for potential system metadata contributions. This requires a subtle change to the diskfile cleanup code (diskfile.hash_cleanup_listdir): after creating a new .meta file, instead of deleting all older .meta files, only those that were read during construction of the new .meta file are deleted. In most cases the result will be the same, but if a second concurrent request has written a .meta file that was not read by the first request handler, then this .meta file will be left in place.
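A minimal sketch of the selective cleanup rule, assuming the request handler remembers which .meta files it read. The function obsolete_meta_files is hypothetical and is not the real diskfile.hash_cleanup_listdir; it only shows the selection logic.

```python
def obsolete_meta_files(listing, files_read, new_meta):
    """Pick the .meta files that are safe to delete (hypothetical helper).

    listing:    file names currently in the obj_dir
    files_read: names of the .meta files that were merged into new_meta
    new_meta:   name of the .meta file that was just written
    """
    return sorted(
        name
        for name in listing
        if name.endswith(".meta") and name != new_meta and name in files_read
    )


# The handler that wrote t4.meta read only t2.meta; t3.meta was written
# concurrently by another request and was never seen by this handler.
listing = ["t1.data", "t2.meta", "t3.meta", "t4.meta"]
to_delete = obsolete_meta_files(listing, {"t2.meta"}, "t4.meta")
```

Here only t2.meta is selected for deletion, so t3.meta stays in place until a later request has merged its contents.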

Similarly, a change is required in the async cleanup process (called by the replicator daemon). Instead of deleting all older files, the cleanup process must inspect their system metadata to determine which files have no unique contributions to the unified metadata set, and delete only those.
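One way the async cleanup could decide which files contribute nothing unique is sketched below. This is an assumption about how such a check might work, not the replicator's actual behavior: the helper names are hypothetical, timestamps are plain numbers, and file names are assumed to sort from oldest to newest.

```python
def merge_sysmeta(metas):
    """Most-recent-wins merge of {key: (value, timestamp)} dicts."""
    merged = {}
    for meta in metas:
        for key, (value, timestamp) in meta.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


def redundant_files(files):
    """Return names of files whose items all appear in newer files or lose
    the merge (hypothetical helper).

    files: dict mapping file name -> {key: (value, timestamp)}
    """
    merged = merge_sysmeta(files.values())
    covered = set()  # keys whose winning item is held by a file we keep
    redundant = []
    for name in sorted(files, reverse=True):  # newest first
        contributes = {
            key
            for key, item in files[name].items()
            if merged[key] == item and key not in covered
        }
        if contributes:
            covered |= contributes
        else:
            redundant.append(name)
    return sorted(redundant)


diverged = {
    "t2.meta": {"x-object-sysmeta-p": ("p2", 2), "x-object-sysmeta-y": ("y1", 2)},
    "t3.meta": {"x-object-sysmeta-p": ("p1", 0), "x-object-sysmeta-z": ("z1", 3)},
}
unified = dict(diverged)
unified["t4.meta"] = {
    "x-object-sysmeta-p": ("p2", 2),
    "x-object-sysmeta-y": ("y1", 2),
    "x-object-sysmeta-z": ("z1", 3),
}
```

With only the two divergent files present, neither is redundant; once a unified t4.meta exists, both older files can be deleted.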

To enable a per-item most-recent-wins semantic when merging contributions from multiple .meta files, system metadata should be stored as key: (value, timestamp) pairs (as per account and container metadata).
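The per-item most-recent-wins merge can be sketched in a few lines. The function name merge_sysmeta, the example keys, and the numeric timestamps are illustrative assumptions.

```python
def merge_sysmeta(meta_files):
    """Merge {key: (value, timestamp)} dicts, keeping the newest item per key."""
    merged = {}
    for meta in meta_files:
        for key, (value, timestamp) in meta.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


# Hypothetical contents of two .meta files written by concurrent POSTs.
older = {"x-object-sysmeta-a": ("a1", 10.0), "x-object-sysmeta-b": ("b1", 10.0)}
newer = {"x-object-sysmeta-a": ("a2", 11.0), "x-object-sysmeta-c": ("c1", 11.0)}

merged = merge_sysmeta([newer, older])
```

Because each item carries its own timestamp, the result does not depend on the order in which the files are read, as long as no two files hold different values for the same key with the same timestamp.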

Deleting system metadata items
An item of system metadata with key 'x-object-sysmeta-x' should be deleted when a header 'x-object-sysmeta-x':"" is included with a POST request. This can be achieved in two ways:
 * 1) Do not include the key 'x-object-sysmeta-x' in the latest .meta file. A risk with this approach is that if an older .meta file fails to be deleted, and that file contains an obsolete value for 'x-object-sysmeta-x', then the obsolete value will be re-introduced during a future merge. This can be avoided by including a record of obsolete .meta files as part of each new .meta file, but this list might grow to include all .meta files created during the history of an object.
 * 2) Persist the system metadata item with an empty value, i.e. key: ("", timestamp), to indicate to any future metadata merges that the item has been deleted. This guards against inclusion of obsolete values from older .meta files at the expense of storing the empty value. The empty-valued system metadata may be finally removed during a subsequent merge when it is observed that no existing .meta file contains a value for that key, i.e. when there is no obsolete value.
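The difference between the two options can be illustrated with a small sketch. It assumes a most-recent-wins merge over key: (value, timestamp) items; the helper names and numeric timestamps are hypothetical.

```python
def merge_sysmeta(meta_files):
    """Most-recent-wins merge of {key: (value, timestamp)} dicts."""
    merged = {}
    for meta in meta_files:
        for key, (value, timestamp) in meta.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    return merged


def client_view(merged):
    """Drop empty-valued items before returning metadata to a client."""
    return {key: value for key, (value, _ts) in merged.items() if value != ""}


KEY = "x-object-sysmeta-x"
stale = {KEY: ("x1", 2)}  # older .meta file whose deletion failed

# Option 1: the newer file simply omits the key, so the stale value returns.
option_1 = client_view(merge_sysmeta([stale, {}]))

# Option 2: the newer file stores an empty value that outranks the stale one.
option_2 = client_view(merge_sysmeta([stale, {KEY: ("", 4)}]))
```

Under option 1 the obsolete value 'x1' reappears in the merged result, while under option 2 the empty value with the later timestamp keeps the item deleted.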

For the following example we will assume that the second solution is adopted.

fast-POST example
Consider the following scenario. Initially the object dir contains just the original data file:

 obj_dir:
   t1.data:
     x-object-sysmeta-p: ('p1', t0)

Two concurrent POSTs update the object, with timestamps t2 and t3:

 POST
   X-Timestamp: t2
   x-object-sysmeta-p: 'p2'
   x-object-sysmeta-x: 'x1'
   x-object-sysmeta-y: 'y1'

 POST
   X-Timestamp: t3
   x-object-sysmeta-x: 'x2'
   x-object-sysmeta-z: 'z1'

These result in two .meta files being added to the object directory:

 obj_dir:
   t1.data:
     x-object-sysmeta-p: ('p1', t0)
   t2.meta:
     x-object-sysmeta-p: ('p2', t2)
     x-object-sysmeta-x: ('x1', t2)
     x-object-sysmeta-y: ('y1', t2)
   t3.meta:
     x-object-sysmeta-p: ('p1', t0)
     x-object-sysmeta-x: ('x2', t3)
     x-object-sysmeta-z: ('z1', t3)

Currently t2.meta would be deleted at some point. The proposed new behavior is to read both t2.meta and t3.meta when a diskfile object is next constructed, merging the results such that when duplicate system metadata keys are encountered, the item with the most recent timestamp is kept and other items for that key are discarded, i.e. a response to a subsequent HEAD request would contain:

 HEAD response (new):
   x-object-sysmeta-p: 'p2'
   x-object-sysmeta-x: 'x2'
   x-object-sysmeta-y: 'y1'
   x-object-sysmeta-z: 'z1'

Now consider a further POST request received at t4:

 POST
   X-Timestamp: t4
   x-object-sysmeta-p: ''
   x-object-sysmeta-x: 'x3'

This POST is handled as follows. The existing system metadata is obtained as above by merging the contents of the t2.meta and t3.meta files. The existing metadata is then updated with any new items from the POST request, and the resultant set is written to t4.meta. At this point the divergent system metadata sets in t2.meta and t3.meta are unified in t4.meta, rendering t2.meta and t3.meta obsolete. Once t4.meta has been written, the diskfile cleanup code deletes t2.meta and t3.meta, leaving just t1.data and t4.meta in the obj_dir:

 obj_dir:
   t1.data:
     x-object-sysmeta-p: ('p1', t0)
   t4.meta:
     x-object-sysmeta-p: ('', t4)
     x-object-sysmeta-x: ('x3', t4)
     x-object-sysmeta-z: ('z1', t3)
     x-object-sysmeta-y: ('y1', t2)

Note that as discussed above, x-object-sysmeta-p: ("", t4) is stored in case the delete of t2.meta and/or t3.meta should fail.

'POST-as-copy'
When post-as-copy is enabled, two concurrent POSTs result in two PUTs at the object server, which may cause two .data files to be generated in obj_dir. The current cleanup code will delete the older of these before exiting the PUT handler. Again, this would be incorrect behavior for system metadata since the .data files might contain unique metadata items that should subsequently be merged.

To support system metadata POST semantics, multiple .data files may need to be preserved in the obj_dir when concurrent PUTs occur. However, this is only necessary when the PUTs are due to POST-as-copy events. Fortunately, PUTs due to POST-as-copy can be distinguished from regular PUTs via an X-Fresh-Metadata header that is added by the proxy controller while processing POST-as-copy. This header is used as a flag internal to the proxy object controller to signal that all existing user metadata should be replaced during the copy, but it ends up being sent to the backend object server as a side-effect. By persisting this header along with other metadata in the .data file, those .data files created by a POST-as-copy event can be distinguished from .data files created by a regular PUT, and treated in a similar way to .meta files for the purposes of system metadata handling.

To avoid storing obsolete object data, older .data files can be truncated to zero length leaving just their xattrs intact.
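Truncation could be sketched as below. This is an illustration only: the helper truncate_obsolete_data_files is hypothetical, file names are assumed to sort from oldest to newest, and the example uses a temporary directory without setting any xattrs. It relies on os.truncate changing only the file's content length, leaving extended attributes untouched.

```python
import os
import tempfile


def truncate_obsolete_data_files(obj_dir):
    """Truncate every .data file except the newest to zero length.

    Returns the names of the files that were truncated (hypothetical helper).
    """
    data_files = sorted(f for f in os.listdir(obj_dir) if f.endswith(".data"))
    truncated = []
    for name in data_files[:-1]:
        # Only the content is discarded; the file itself remains so that
        # its metadata can still be read and merged later.
        os.truncate(os.path.join(obj_dir, name), 0)
        truncated.append(name)
    return truncated


with tempfile.TemporaryDirectory() as obj_dir:
    for name in ("t2.data", "t3.data"):
        with open(os.path.join(obj_dir, name), "wb") as f:
            f.write(b"object body")
    truncated = truncate_obsolete_data_files(obj_dir)
    sizes = {
        name: os.path.getsize(os.path.join(obj_dir, name))
        for name in os.listdir(obj_dir)
    }
```

After the call, the older t2.data is empty while the newest t3.data keeps its full content.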

'POST-as-copy' example
Consider our example once more. Initially obj_dir contains t1.data, the result of the original object PUT:

 obj_dir:
   t1.data:
     x-object-sysmeta-p: ('p1', t0)

Two concurrent PUTs (due to POST-as-copy) update the object, with timestamps t2 and t3:

 PUT
   X-Timestamp: t2
   x-fresh-metadata: 'True'
   x-object-sysmeta-p: 'p2'
   x-object-sysmeta-x: 'x1'
   x-object-sysmeta-y: 'y1'

 PUT
   X-Timestamp: t3
   x-fresh-metadata: 'True'
   x-object-sysmeta-x: 'x2'
   x-object-sysmeta-z: 'z1'

These result in two .data files being added to the object directory and t1.data being deleted:

 obj_dir:
   t2.data:
     x-fresh-metadata: 'True'
     x-object-sysmeta-p: ('p2', t2)
     x-object-sysmeta-x: ('x1', t2)
     x-object-sysmeta-y: ('y1', t2)
   t3.data:
     x-fresh-metadata: 'True'
     x-object-sysmeta-p: ('p1', t0)
     x-object-sysmeta-x: ('x2', t3)
     x-object-sysmeta-z: ('z1', t3)

When a diskfile object is next constructed the xattrs of t2.data and t3.data are read, and based on the x-fresh-metadata attribute being True, their system metadata items are merged. Data file t2.data can be truncated to zero-length since its data is obsolete.

Now consider a further PUT (due to a POST-as-copy) received at t4:

 PUT
   X-Timestamp: t4
   x-fresh-metadata: 'True'
   x-object-sysmeta-p: ''
   x-object-sysmeta-x: 'x3'

This PUT is handled as follows. The diskfile object constructs existing system metadata by merging the contents of the t2.data and t3.data files. The existing metadata is then updated with any new items from the PUT request, and the resultant set is written to t4.data. At this point the divergent system metadata sets in t2.data and t3.data are unified in t4.data, so once t4.data has been written, the diskfile cleanup code can delete t2.data and t3.data, leaving just t4.data in the obj_dir:

 obj_dir:
   t4.data:
     x-object-sysmeta-p: ('', t4)
     x-object-sysmeta-x: ('x3', t4)
     x-object-sysmeta-z: ('z1', t3)
     x-object-sysmeta-y: ('y1', t2)

Note that as with the fast-POST example, x-object-sysmeta-p: ("", t4) is stored in case the delete of t2.data and/or t3.data should fail.

Still to be considered

 * Impact on replicator
 * Impact on other back-ends
 * Any impact on performance