Propagating Swift object metadata to an external index (e.g. Elasticsearch)
Searching for an object in a Swift cluster is difficult. If part of the object name and its container are known, one can use the prefix/delimiter listing options to narrow the search. If only a fragment of the object name is known, however, Swift requires listing the entire container. If the container is unknown, the search becomes even harder, as it must be repeated across all of the containers.
Lastly, if only a metadata key associated with the object is known, the search requires a HEAD request for every object. This process could take days. This document describes how this problem could be tackled using an external index (such as Elasticsearch).
The fundamental idea is to leverage the container databases (as the container sync feature does). The container databases contain exactly the information that needs to be propagated to the indexing service:
- the name of the object
- object etag
- last modified date
- the deleted flag
The only missing part is the metadata associated with the object, which can be looked up with a HEAD request.
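Putting the pieces together, the document sent to the indexing service would combine the container-DB row fields listed above with the HEAD-derived metadata. A minimal sketch, with illustrative field names (not Swift's actual row schema or wire format):

```python
# Sketch: shape of a document sent to the indexing service.
# Field names here are illustrative assumptions, not Swift's actual schema.

def make_index_doc(row, metadata):
    """Combine a container-DB row with metadata fetched via HEAD."""
    return {
        "name": row["name"],
        "etag": row["etag"],
        "last_modified": row["last_modified"],
        "deleted": bool(row["deleted"]),
        "metadata": metadata,  # e.g. X-Object-Meta-* headers from HEAD
    }

doc = make_index_doc(
    {"name": "photo.jpg", "etag": "d41d8cd9", "last_modified": "1510000000.00000", "deleted": 0},
    {"x-object-meta-owner": "alice"},
)
```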
The design adds a process that continuously scans the container databases and updates Elasticsearch as required. An instance of this synchronization process runs on every container node. Each instance is configured with a list of (account, container) tuples that are to be indexed. For each tuple, the first step is to check that the database is present on the node and to create an instance of the ContainerBroker. Secondly, the crawling process retrieves the set of recently changed items using the broker's get_items_since method.
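The crawl loop can be sketched as follows. The FakeBroker below stands in for a real ContainerBroker (whose get_items_since(start_row, limit) returns rows with a ROWID greater than start_row), so the sketch is self-contained; the batch size and row shape are assumptions:

```python
# A minimal sketch of the crawl loop. FakeBroker imitates the
# ContainerBroker.get_items_since(start_row, limit) interface so the
# example runs without a Swift cluster.

BATCH = 3  # illustrative batch size


class FakeBroker:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, each with a ROWID field

    def get_items_since(self, start_row, limit):
        return [r for r in self.rows if r["ROWID"] > start_row][:limit]


def crawl(broker, last_row_id, handle):
    """Process all rows changed since last_row_id; return the new pointer."""
    while True:
        rows = broker.get_items_since(last_row_id, BATCH)
        if not rows:
            return last_row_id
        for row in rows:
            handle(row)  # propagate the row to the indexing service
        last_row_id = rows[-1]["ROWID"]


seen = []
broker = FakeBroker([{"ROWID": i} for i in range(1, 8)])
final = crawl(broker, 0, lambda row: seen.append(row["ROWID"]))
```

The returned pointer is what each node would persist between iterations, as described below.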
Multiple instances crawling the same database do not interact with each other. Each process records the last row ID it has handled, and in each iteration it retrieves the set of rows changed since then. At first, each node propagates only its own portion of those rows: with node_count container nodes numbered 0 through node_count - 1, the node with index node_id is responsible for the rows where
row_id % node_count == node_id.
Before moving on to the next set of rows, the work of the other nodes must be verified. In the verification step each node considers all of the rows for which it was not responsible (
row_id % node_count != node_id). Those records are then retrieved from the secondary index and the last modified dates in the index are compared against those in the container database. Any missing updates -- ones where the timestamps do not match -- are then patched up. If all nodes are making progress at a similar rate, the verification step does not cause additional HEAD requests. When using Elasticsearch, the indexed documents to be verified can be retrieved through a single batch request.
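The responsibility test and the verification step can be sketched together. This is one concrete reading, assuming node indices 0 through node_count - 1 so the first-pass work partitions disjointly; the in-memory dicts stand in for the container database and for documents fetched from the secondary index (e.g. through a single Elasticsearch batch request):

```python
# Self-contained sketch of row responsibility and verification.
# db_rows / indexed_docs are simplified stand-ins: {row_id: last_modified}.

def is_responsible(row_id, node_id, node_count):
    """True if this node propagates the row in the first pass."""
    return row_id % node_count == node_id


def rows_to_patch(db_rows, indexed_docs, node_id, node_count):
    """Return IDs of rows another node should have indexed but did not."""
    missing = []
    for row_id, last_modified in db_rows.items():
        if is_responsible(row_id, node_id, node_count):
            continue  # this node already handled its own share
        if indexed_docs.get(row_id) != last_modified:
            missing.append(row_id)  # absent or stale in the index
    return sorted(missing)


db = {1: "t1", 2: "t2", 3: "t3", 4: "t4"}
index = {1: "t1", 2: "t2", 3: "old"}  # row 3 is stale, row 4 is missing
# Node 0 of 2 is responsible for even row IDs, so it verifies rows 1 and 3.
patch = rows_to_patch(db, index, node_id=0, node_count=2)
```

Row 4 is missing from the index but is node 0's own responsibility, so only the stale row 3 is flagged here; node 0 propagates row 4 in its first pass.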
Storing indexing state
The indexing state is stored under
/var/lib on every node that runs the indexing process. This allows nodes to make progress even if they cannot communicate with each other or with large parts of the system. If a node is replaced, the new node would need to catch up by processing items and finding that they have already been indexed. It would be interesting to understand how long catching up would take. Using batch requests with Elasticsearch could reduce the catch-up time.
To detect that the node has been replaced -- or that the database has been replaced after a drive failure -- the database's UUID should also be embedded in the tracked state. One way to accomplish this would be for the database itself to have a table that keeps track of the progress (the last processed row ID). When the database is replaced, it would be reset and all of the row IDs invalidated. Otherwise, if local storage is used, embedding the UUID into the status file is probably the way to go.
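The local status-file variant can be sketched as follows. The file location, file format, and field names are illustrative assumptions; the point is that a UUID mismatch invalidates the saved row pointer:

```python
# Hedged sketch of a per-node status file embedding the database UUID.
# Path, JSON format, and field names are illustrative, not Swift's.
import json
import os
import tempfile


def load_last_row(path, db_uuid):
    """Return the saved row pointer, or 0 if the database was replaced."""
    if not os.path.exists(path):
        return 0
    with open(path) as fp:
        state = json.load(fp)
    if state.get("db_uuid") != db_uuid:
        return 0  # database replaced: all saved row IDs are invalid
    return state["last_row_id"]


def save_last_row(path, db_uuid, last_row_id):
    with open(path, "w") as fp:
        json.dump({"db_uuid": db_uuid, "last_row_id": last_row_id}, fp)


path = os.path.join(tempfile.mkdtemp(), "index_status.json")
save_last_row(path, "uuid-a", 42)
resumed = load_last_row(path, "uuid-a")  # same DB: resume from row 42
fresh = load_last_row(path, "uuid-b")    # replaced DB: start over
```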
Object metadata could also be replicated through Swift middleware. The middleware would submit an Elasticsearch request after any successful request through the proxy node. The benefit of this approach is that it is simple to implement and the required components already exist in Swift. However, what should be done if an update to the indexing service cannot be made? There could be a network partition, or the indexing service could be offline. If Swift continues to accept changes, they must later be reconciled in some way. As the reconciliation step is required either way, the proposal is to start only with scanning the container databases.
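For completeness, the middleware alternative can be sketched as plain WSGI. The notify callable is a hypothetical stand-in for the Elasticsearch update; real Swift middleware would be wired up through paste.deploy:

```python
# Minimal WSGI sketch of the middleware alternative. notify() is a
# hypothetical stand-in for the Elasticsearch update call.

def make_indexing_middleware(app, notify):
    """Call notify(method, path) after every successful 2xx response."""
    def middleware(environ, start_response):
        captured = {}

        def capture_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        body = app(environ, capture_start_response)
        if captured.get("status", "").startswith("2"):
            # If notify() fails here, Swift and the index diverge and
            # must be reconciled later -- the weakness noted above.
            notify(environ["REQUEST_METHOD"], environ["PATH_INFO"])
        return body

    return middleware


events = []

def fake_app(environ, start_response):
    start_response("201 Created", [])
    return [b""]

wrapped = make_indexing_middleware(
    fake_app, lambda method, path: events.append((method, path)))
wrapped({"REQUEST_METHOD": "PUT", "PATH_INFO": "/v1/a/c/o"},
        lambda status, headers, exc_info=None: None)
```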