Swift/ideas/metadata-sync

Background
Searching for an object in a Swift cluster is difficult. If some information about the object name and its container is known, one can rely on the prefix/delimiter options to try to narrow the search. If only a part of the object name is known, however, Swift requires listing the entire container. If the container is unknown, then search becomes even harder as it must be repeated across all of the containers.

Lastly, if only a metadata key associated with the object is known, the search requires a HEAD request for every object. This process could take days. This document describes how this problem could be tackled using an external index (such as Elasticsearch).

Design discussion
The fundamental idea is to leverage the container database (like the container sync approach). The container databases contain exactly the information required to be propagated to the indexing service:
 * the name of the object
 * object etag
 * last modified date
 * the deleted flag

The only missing part is the metadata associated with the object, which can be looked up with HEAD.

The design adds a process scanning the container databases continuously, updating Elasticsearch as required. An instance of this synchronization process runs on every container node. Each one is configured with a list of  tuples that are to be indexed. For each tuple, the first step is to check that the database is present on the node and to create an instance of the. Secondly, the crawling process retrieves the set of recently changed items using the  API.

Multiple instances crawling the database do not interact with each other. Each process records the last processed row ID. In each iteration it retrieves the set of changed rows. From the set of rows, only a portion for a given node is at first propagated. The node ID determines this set through the test.

Before moving to the next set of rows, the work of other nodes must be verified. In the verification step each node considers all of the rows for which it was not responsible. All of the records are then retrieved from the secondary index and the last modified dates in the index are compared against those in the container database. Any missing updates -- ones where the time stamps don't match -- are then patched up. If all nodes are making progress at a similar rate, the verification step will not cause additional HEAD requests. When using Elasticsearch, the indexed documents to be verified can be retrieved through a single batch request.

Storing indexing state
The indexing state is stored in  on every node with the indexing process. This allows nodes to make progress even if they cannot communicate with each other or large parts of the system. If a node is replaced, a new node would need to catch up by processing items and finding that they have been indexed. It would be interesting to understand how much time catching up would take. Using batch requests with Elasticsearch could reduce the catch-up time.

To identify that the node has been replaced -- or that the database has been replaced when one of the drives fails -- we should also embed its UUID when keeping track of state. One way to accomplish this would be if the database itself had a table to keep track of the progress (the last row ID). When replaced, the database would be reset and all of the row IDs invalidated. Otherwise, if we use local storage, embedding the UUID into the status file is probably the way to go.

Implementation
A proof of concept implementation of this is available as two projects: Container Crawler and Swift Metadata Sync.

Middleware
Object metadata could be also replicated through Swift middleware. The middleware would submit an Elasticsearch request following any successful request through the proxy node. The benefit of this approach is that it's simple to implement and the required components already exist in Swift. However, what to do if an update to the indexing service cannot be made? There could be a network partition or the indexing service could be offline. If Swift continues to accept changes, they must be later reconciled in some way. As the reconciliation step is still required, the proposal is to start only with scanning container databases.