
= Experiments on small files optimization in Swift =

(irc: rledisez)

Note: Despite the official recommendation of the deployment guide (and the performance impact), at OVH we run Swift on XFS filesystems with barrier=on. So this summary mainly focuses on synchronous performance. The performance numbers are given for a 4TB SAS disk, a C2750 CPU and 16GB of RAM. "Constrained memory environment" means that 12GB of memory are deliberately consumed to reduce the available memory to 4GB.

== Goals ==

 * reduce IO needed to read FS metadata (inodes)
 * maintain (at least) the performance of XFS:
   * object creation: 43/s
   * object read in a constrained memory environment: 40/s

== Constraints ==

 * concurrency: many processes will need access to the small files store (object server, auditor, replicator, ...)
 * allocation: must not waste too much space (saving space is not the goal, but could be nice)
 * integrity: no data corruption nor store corruption

== Ideas ==
It seems the easy way to handle concurrency is to have an RPC server serving data requests. Some interesting RPC protocols allow communication without copying data (eg: Cap'n Proto). As a side effect, an RPC server would turn the blocking IO call into a non-blocking RPC call (nice for Python :))

As we will need an index, we should store all objects by their hash in a flat namespace. That would allow increasing/reducing the part power with no action other than updating the ring. With the right structure, a range scan is very efficient and makes it possible to simulate partitions for replication purposes.
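To illustrate the range scan idea, here is a minimal sketch in Python. It assumes the index keys are lowercase hex MD5 hashes kept in sorted order (a plain sorted list stands in for the real index), and that a partition is the top `part_power` bits of the hash. The names `partition_of`, `partition_range` and the value of `PART_POWER` are made up for the example.

```python
import bisect
import hashlib

PART_POWER = 4  # hypothetical value, only to keep the example small


def partition_of(hex_hash, part_power=PART_POWER):
    """Partition = top `part_power` bits of the first 32 bits of the hash."""
    return int(hex_hash[:8], 16) >> (32 - part_power)


def partition_range(sorted_keys, part, part_power=PART_POWER):
    """Return all keys of one partition with a single range scan."""
    shift = 32 - part_power
    # First key whose 32-bit prefix is >= the lower bound of the partition.
    lo = bisect.bisect_left(sorted_keys, format(part << shift, "08x"))
    if part + 1 == 1 << part_power:
        hi = len(sorted_keys)  # last partition: scan to the end
    else:
        hi = bisect.bisect_left(
            sorted_keys, format((part + 1) << shift, "08x"))
    return sorted_keys[lo:hi]


# Flat namespace: every object is stored only under its hash.
keys = sorted(hashlib.md5(b"obj-%d" % i).hexdigest() for i in range(1000))

for part in range(1 << PART_POWER):
    print(part, len(partition_range(keys, part)))
```

Changing `part_power` only changes the bounds of the scan; the keys themselves are never rewritten, which is the point of the flat namespace.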

=== Developing a filesystem ===
We first tried to develop a filesystem, running on top of XFS (to benefit from all the caching done at the VFS layer). The specifications were the following:
 * only contiguous allocation: no fragmentation so we don't need defragmentation or compaction logic
 * keep everything needed to run the auditor/replicator/reconstructor in the index: no need to access data to do an os.listdir
 * small footprint index so that it can fit in memory: an achievable target seems to be between 50 and 60 bytes per object (preference is to burn CPU to save memory)
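To give a feeling of what a small index entry could look like, here is a sketch of a fixed-size record packed with Python's `struct` module. The layout (raw hash, timestamp, offset, length, flags) is purely hypothetical and is not the layout of the POC; it only shows that a compact binary record stays under the 50-60 bytes target.

```python
import hashlib
import struct

# Hypothetical layout, big-endian, no padding:
#   16s  raw MD5 of the object name
#   Q    timestamp in microseconds
#   Q    offset of the data in the store
#   I    length of the data
#   B    flags (eg: tombstone)
RECORD = struct.Struct(">16sQQIB")


def pack_entry(name, timestamp_us, offset, length, flags=0):
    """Build the fixed-size index record for one object."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return RECORD.pack(digest, timestamp_us, offset, length, flags)


def unpack_entry(raw):
    """Return (digest, timestamp_us, offset, length, flags)."""
    return RECORD.unpack(raw)


entry = pack_entry("/a/c/o", 1480000000000000, 4096, 1234)
print("bytes per object:", RECORD.size)
print(unpack_entry(entry)[1:])
```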

While we had a working POC in a few days, the amount of work to go from POC to production (= reliability, performance) was considered enormous, so we looked at other solutions.

=== Using a well-known key-value store ===
The idea is:
 * For small files: store data in the key-value store
 * For bigger files: store the filehandle of the file in the key-value store

Storing the filehandles can save many IOs because it becomes unnecessary to do all the usual lookups to "reach" a file (reading the inodes of all parent directories before reading the file inode). Filehandles have a downside: they bypass all security, because the file is accessed directly without any check on the parent directories. So, it should be possible to disable the use of filehandles if Swift processes are not running in a safe environment (eg: inside of docker on a server running multiple services).

Storing filehandles can also open the door to getting rid of the current file hierarchy (part/sfx/hash/ts), thus helping a lot with part power modifications. If the real information is in the DB, the way big files are stored in XFS does not matter anymore, so there is no more need to hardlink & co to change a part power.

The key-value stores tried were:
 * kyotocabinet: unacceptable performance in synchronous mode
 * boltdb: bad performance in random insertion, decreasing as the DB grows
 * lmdb: same as boltdb
 * leveldb: bad performance in random insertion
 * forestdb: bad performance in random insertion
 * rocksdb: good performance in random insertion, about 1.8x XFS, low disk overhead (it actually saves space compared to XFS). But it was very easy to corrupt with a basic example (the one from the doc). The project seems young and not really mature (there are recent reports of corruption on GitHub).

Positive points of this solution:
 * small dev, pretty simple to implement (an RPC in front of a key-value store)
 * no need for "part power increase" logic (flat namespace)
 * the cost of the double lookup (db+fs) should be offset by filehandles (TODO: needs to be benchmarked)

Downside:
 * did not find any project that is fast/reliable enough

=== Key-value store + transaction groups ===
The idea is to use a key-value store (see previous point), but instead of committing each object individually, we group them in a transaction group (inspired by ZFS transaction groups) and commit them all together every N ms. This can easily double the number of object creations per second. The downside is that, to keep the synchronous behavior, we must wait for the transaction to be committed before answering the client. So, it can make the client wait up to N ms before the upload is validated. 10ms gives good results in terms of creations per second.
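The mechanism can be sketched as follows in Python. This is a toy, not the code we benchmarked: a dict stands in for the key-value store, `commit_count` stands in for the expensive synchronous commit, and the class and attribute names are invented for the example. Each `put` blocks until the transaction group containing it has been committed, which preserves the synchronous behavior.

```python
import threading


class GroupCommitStore:
    """Toy store committing pending writes in groups every `interval` s."""

    def __init__(self, interval=0.01):
        self.interval = interval   # N ms between two commits (10ms here)
        self.committed = {}        # stands in for the durable store
        self.commit_count = 0      # number of (expensive) sync commits
        self._pending = []         # [(key, value, done_event)]
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def put(self, key, value):
        """Block until the group holding this write is committed."""
        done = threading.Event()
        with self._lock:
            self._pending.append((key, value, done))
        done.wait()

    def _commit(self):
        with self._lock:
            batch, self._pending = self._pending, []
        if not batch:
            return
        for key, value, _ in batch:
            self.committed[key] = value
        # A real store would do ONE synchronous commit for the group here.
        self.commit_count += 1
        for _, _, done in batch:
            done.set()  # only now may the client be answered

    def _run(self):
        while not self._stop.wait(self.interval):
            self._commit()
        self._commit()  # flush what is left on shutdown

    def close(self):
        """Stop the committer; no put() may be issued after this."""
        self._stop.set()
        self._thread.join()


store = GroupCommitStore(interval=0.01)
writers = [threading.Thread(target=store.put, args=("obj-%d" % i, b"x"))
           for i in range(50)]
for w in writers:
    w.start()
for w in writers:
    w.join()
store.close()
print(len(store.committed), "objects in", store.commit_count, "commits")
```

With concurrent writers, the number of commits is usually much lower than the number of objects, at the price of up to `interval` seconds of extra latency per write.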

=== Using the DMU of ZFS ===
The Lustre team developed an OSD based on ZFS. Not ZFS as a filesystem (ZPL), but as an object store (DMU). This is an interesting approach as they benefit from all the cool ZFS features provided by the DMU (Copy On Write, Transactions, Snapshots, ...), but they don't get the overhead of the ZPL (inodes). The idea is to write the data in a DMU object, and index this object by an identifier (eg: hash) in the ZAP.

Developing around the ZFS code proved to be really easy: we had working code in a day or two.

Positive points are:
 * No need to prove it, ZFS is rock solid (it's a fact :))
 * Random write performance is very good (compared to XFS, about 2.5x in synchronous mode, about 16x in asynchronous mode)
 * Random read performance in a constrained memory environment is a bit better than XFS. It would probably be better with a ZAP replacement (see Downsides)
 * It's maintained and active, no reason to think it won't be in the future
 * It runs on top of XFS in a file (about 5% performance loss) as well as directly on devices, so it could be an easy migration path
 * Cool features that could be used in Swift in the future (zfs send to replicate? see https://www.ixsystems.com/blog/openzfs-devsummit-2016/ "Redacted send/receive" and "Compressed Send and Receive")

Downsides are:
 * ZFS code is very low level (even if it runs in userland), developed in C. Even if it's clear and well written, it would require some effort to fully and correctly wrap our minds around it.
 * (FIXED) Had some instabilities when running from golang/cgo while it's perfectly stable with C (todo: try runtime.LockOSThread)
 * ZAP is not optimized for our use case: one record is about 258 bytes while we only really need around 50-60 bytes per object. Also, it does not allow range scans. So we would probably end up developing a replacement for ZAP, some kind of b-tree or equivalent on top of the DMU (unless we find something compatible with our needs, maybe https://github.com/timtadh/fs2 ?)
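To put the record sizes in perspective, here is a quick back-of-the-envelope calculation in Python. It assumes, optimistically, that the whole 4GB of the constrained memory environment is available for index records, so the results are upper bounds, not measurements. The function name is made up for the example.

```python
def records_in_memory(memory_bytes, record_bytes):
    """Upper bound on the number of index records that fit in memory."""
    return memory_bytes // record_bytes


MEMORY = 4 * 10**9  # 4GB, as in the constrained memory environment

for record_size in (258, 60, 50):  # ZAP record vs. our 50-60 bytes target
    print("%3d bytes/record -> %d records"
          % (record_size, records_in_memory(MEMORY, record_size)))
```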