
Swift/ideas/metadata-sync

Revision as of 22:27, 1 August 2016 by Timur.alperovich (talk | contribs)

Propagating Swift object metadata to an external index (e.g. Elasticsearch)

Background

Searching for an object in a Swift cluster is difficult. If some information about the object name and its container is known, one can rely on the prefix/delimiter options to try to narrow the search. If only a part of the object name is known, however, Swift requires listing the entire container. If the container is unknown, then search becomes even harder as it must be repeated across all of the containers.

Finally, if only a metadata key associated with the object is known, the search requires a HEAD request for every object -- a process that could take days. This document describes how this problem could be tackled using an external index (such as Elasticsearch).

Design discussion

The fundamental idea is to leverage the container databases (as the container sync approach does). The container databases contain nearly all of the information that must be propagated to the indexing service:

  • the object name
  • the object ETag
  • the last-modified date
  • the deleted flag

The only missing piece is the metadata associated with the object, which can be retrieved with a HEAD request for the object.

The design adds a process that continuously scans the container databases and updates Elasticsearch as required. An instance of this synchronization process runs on every container node. Each one is configured with a list of `(account, container)` tuples to be indexed. For each tuple, the first step is to check that the database is present on the node and to create an instance of the `ContainerBroker`. The second step is to retrieve the set of recently changed items using the `get_items_since()` API.
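The crawl loop described above can be sketched as follows. `FakeBroker` is a stand-in for Swift's `ContainerBroker`; its `get_items_since(start, count)` method mirrors the real API, which returns rows whose ROWID is greater than `start`. The row dicts here are simplified placeholders, not the real schema.

```python
class FakeBroker:
    """Stand-in for ContainerBroker, backed by an in-memory row list."""

    def __init__(self, rows):
        self.rows = rows  # each row: a dict with at least a 'ROWID' key

    def get_items_since(self, start, count):
        # Mirrors ContainerBroker.get_items_since: rows with ROWID > start,
        # at most `count` of them.
        return [r for r in self.rows if r['ROWID'] > start][:count]


def crawl_once(broker, last_row, batch_size=100):
    """One crawl iteration: fetch rows changed since last_row and
    advance the bookmark to the last row seen."""
    items = broker.get_items_since(last_row, batch_size)
    if items:
        last_row = items[-1]['ROWID']
    return items, last_row


rows = [{'ROWID': i, 'name': 'obj%d' % i, 'deleted': 0} for i in range(1, 6)]
broker = FakeBroker(rows)
items, last_row = crawl_once(broker, last_row=2, batch_size=2)
# items covers ROWIDs 3 and 4; last_row advances to 4
```

In the real process, `items` would be turned into index updates (with HEAD requests filling in the object metadata) before `last_row` is persisted.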

Multiple instances crawling the database do not interact with each other. Each process records the last processed row ID and, in each iteration, retrieves the set of rows changed since then. Of those rows, each node initially propagates only its own portion. With nodes numbered from 0 to `node_count - 1`, a simple test determines the working set: `row_id % node_count == node_id`.
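A minimal sketch of that partitioning, assuming the intended test is `row_id % node_count == node_id` so that every row has exactly one responsible node:

```python
def rows_for_node(row_ids, node_id, node_count):
    """Return the subset of changed rows this node is responsible for."""
    return [r for r in row_ids if r % node_count == node_id]


changed = list(range(10, 20))        # changed ROWIDs from one crawl iteration
mine = rows_for_node(changed, node_id=1, node_count=3)
# mine -> [10, 13, 16, 19]
```

Because the nodes' working sets are disjoint and together cover all rows, the nodes divide the indexing work without any coordination.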

Before moving to the next set of rows, the work of the other nodes must be verified. In the verification step, each node considers all of the rows for which it was not responsible (`row_id % node_count != node_id`). Those records are retrieved from the secondary index, and the last-modified dates in the index are compared against those in the container database. Any missing updates -- ones where the timestamps don't match -- are then patched up. If all nodes make progress at a similar rate, the verification step causes no additional HEAD requests and, when using Elasticsearch, can be done in a single batch request from each indexing process.
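The timestamp comparison at the heart of the verification step can be sketched like this. The plain dicts stand in for the container database rows and for the documents returned by a bulk fetch from the secondary index; the names are illustrative, not a real API.

```python
def find_stale(db_rows, index_docs):
    """Compare last-modified timestamps between the container DB and the
    secondary index; return object names whose index entry is missing or
    out of date and therefore needs to be patched up.

    db_rows:    {object_name: last_modified} from the container database
    index_docs: {object_name: last_modified} from a bulk index lookup
    """
    stale = []
    for name, timestamp in db_rows.items():
        if index_docs.get(name) != timestamp:
            stale.append(name)
    return stale


db = {'a': '100', 'b': '200', 'c': '300'}
idx = {'a': '100', 'b': '150'}   # 'b' is out of date, 'c' was never indexed
# find_stale(db, idx) -> ['b', 'c']
```

Only the names in `stale` would then require HEAD requests and index updates, which is why verification is cheap when all nodes are keeping up.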

Storing indexing state

The indexing state is stored in `/var/lib` on every node that runs the indexing process. This allows nodes to make progress even if they cannot communicate with each other or with large parts of the system. If a node is replaced, the new node would need to catch up by re-processing rows and discovering that most of them have already been indexed. It would be interesting to understand how long catching up would take. There may be optimizations we could add to the catch-up stage -- such as bulk requests, as in the verification stage -- that would reduce the required time.
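One possible shape for that state file, sketched below. In the design the file would live under `/var/lib`; this runnable sketch writes to a temporary directory instead, and the file name is a hypothetical convention. The write-then-rename pattern keeps the state readable even if the process dies mid-save.

```python
import json
import os
import tempfile


def save_state(path, last_row):
    """Persist the last processed row ID; rename makes the update atomic
    on POSIX, so a crash never leaves a half-written state file."""
    tmp = path + '.tmp'
    with open(tmp, 'w') as f:
        json.dump({'last_row': last_row}, f)
    os.rename(tmp, path)


def load_state(path):
    """Load the last processed row ID; a fresh (or replaced) node has no
    state file and starts from row 0, i.e. the beginning of the catch-up."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)['last_row']


state_file = os.path.join(tempfile.mkdtemp(), 'AUTH_test.container.state')
save_state(state_file, 4242)
# load_state(state_file) -> 4242
```

The `load_state` fallback to row 0 is exactly the replacement-node scenario above: the new node re-crawls from the start and mostly finds rows that are already indexed.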

Middleware

The goal of any solution is to propagate every change to an object in Swift -- PUT, POST, or DELETE -- into the secondary index. This can be accomplished through Swift middleware: for every object request, another request is submitted to the indexing service with the full request parameters (whether an object is added, its metadata updated, or the object deleted).

The benefit of this approach is that it's simple to implement and the required components already exist in Swift. The open issue is what to do if an update to the indexing service cannot be made. There could be a network partition or the indexing service could be offline. If Swift continues to accept changes, they must be later reconciled in some way. As the reconciliation step is still required, the proposal is to start only with scanning container databases.
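A rough WSGI sketch of the middleware approach, to make the open issue concrete. `FakeIndex` is a hypothetical stand-in for an indexing-service client; the bare `try/except` marks the exact spot where an unreachable indexing service silently loses an update that would later need reconciling.

```python
class IndexingMiddleware:
    """WSGI middleware that forwards object mutations to an index client.
    Sketch only: a real filter would parse account/container/object from
    the path and restrict itself to object requests."""

    def __init__(self, app, index_client):
        self.app = app
        self.index = index_client

    def __call__(self, environ, start_response):
        response = self.app(environ, start_response)
        method = environ.get('REQUEST_METHOD')
        if method in ('PUT', 'POST', 'DELETE'):
            try:
                self.index.record(method, environ.get('PATH_INFO'))
            except Exception:
                # Index unreachable: the update is lost and must be
                # reconciled later -- the open issue described above.
                pass
        return response


class FakeIndex:
    """Hypothetical indexing-service client that records events in memory."""

    def __init__(self):
        self.events = []

    def record(self, method, path):
        self.events.append((method, path))


def app(environ, start_response):
    start_response('201 Created', [])
    return [b'']


idx = FakeIndex()
mw = IndexingMiddleware(app, idx)
mw({'REQUEST_METHOD': 'PUT', 'PATH_INFO': '/v1/AUTH_t/c/o'},
   lambda *args: None)
# idx.events -> [('PUT', '/v1/AUTH_t/c/o')]
```

Because the lost-update path is unavoidable here, the container-database scan remains necessary as the reconciliation mechanism either way.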