Swift/ideas/small files/implementation

The purpose of this page is to describe the proposed implementation with some benchmarks. Please note the implementation is at early stage, as the benchmarks.

Code (old): https://review.openstack.org/#/c/436406/ Slides (old): https://www.slideshare.net/rledisez/slide-smallfiles newer slide : https://fr.slideshare.net/AlexandreLecuyer/openstack-swift-lots-of-small-files

Problem
A swift object uses at least two inodes on the filesystem. For clusters with many small files (say, 10 million objects per disk) the performance degradation is important, as the directory structure does not fit in memory. Replication / auditor operations trigger a lot of IO. Over 40% of disk activity may be caused by "listdir" operations. The goal is to serve listdir operations without any disk IO.

Overview
Store swift objects in large files, similar to what haystack does. We do not need all the information stored in an inode (owner, group..). Make the "inode" as small as possible so that listdir requests can be served from memory. These "inodes" will be stored in a key value store, per disk, to ease cluster maintenance.

New python code
A new "vfile" module presents an interface similar to a regular python file, but will store "vfiles" in large files, which we call "volumes". A volume is append-only, and is dedicated to a given swift partition. There may be multiple volumes for a partition, to allow for write concurrency. Once a file has been written to a volume, its location (volume index, offset) is stored in a key-value store.
 * vfile

"kvfile" is a copy of diskfile, modified to use "vfile" instead of regular POSIX files.
 * kvfile

the "rpc_grpc" module handles communication with the local RPC server, through a socket. (registering a file, listing directories..) It makes calls to generated code from gRPC
 * rpc

Existing python code
Minor changes to replicator / reconstructor / diskfile / utils Mostly, abstract file and directory operations (os.*). Call to the diskfile implementation instead.

New golang code
The RPC server runs as a separate process, accessed over a socket, using gRPC. There is one RPC server instance per disk. Each RPC server embeds a leveldb key-value database. The basic operations are storing and retrieving information about volumes and files, and recreating the swift directory structure on the fly (directories are not stored)

Consistency
The object server currently issues an fsync before replying, to ensure data is on disk. The vfile module will also sync the volume after writes before replying to a client. However, the leveldb key-value store is written to asynchronously. Synchronous operations on the key-value store would be too costly, performance wise.

If the kv has not been closed properly, upon restart a check will be triggered, and the end of volumes will be scanned, to reconcile any difference with the kv content.

Freeing space
We rely on the filesystem's "punch hole" support, which lets us free space within a file. https://lwn.net/Articles/415889/

Examples
serving a PUT request. Instead of creating a temp file and renaming it :
 * Find an unlocked volume for that partition. If needed, create one and register it in the KV.
 * lock the volume and write the object at the end of the volume. when swift closes the "file", write the object header and metdata, then fsync the volume.
 * register the file in the KV

serving a GET request
 * get the object location (volume index, offset in the volume) from the KV
 * open the volume file, seek to offset, serve the object.

Preliminary test results (outdated)
We have not yet tested the pathological case we see in production with replication. Hardware setup : Atom C2750 2.40Ghz 16GB RAM Drives : HGST HUS726040ALA610 (4TB)

3 drives per server, but the tests below exercise a single drive.

Single threaded PUT from a machine to one patched object server, and an unpatched 2.12 server
(test using the object server API directly, no proxy server involved, objects are < 100 bytes)

From zero to 4 millions objects, on one disk.
 * 2.12 version : 3360 minutes (19,8 PUT/s)
 * patched version : 2540 minutes (26,2 PUT/s) - About 42 bytes used in leveldb per object

From 4 millions to 8 million objects
 * 2.12 version : 3900 minutes (17 PUT/s)
 * patched version : 1700 minutes (39,2 PUT/s) - faster, likely because most "volume files" have already been created (not measured, to be confirmed)

The key value size for the disk at the end of the test is 320MB.

Single threaded GET from a machine to one patched object server, and an unpatched 2.12 server. Both servers have 8 million objects on one disk.

 * 2.12 version : 39 GET/s
 * patched version : 93 GET/s

Concurrent PUT requests, 20 per second, for 10 minutes, with "hot inode cache"
Latencies    [mean, 50, 95, 99, max]  641.274117ms, 67.31248ms, 3.526835534s, 4.68917307s, 5.971909s 100% success
 * 2.12 version response time distribution :

Latencies    [mean, 50, 95, 99, max]  82.581295ms, 50.487793ms, 261.475566ms, 615.565045ms, 1.245540101s success 100%
 * patched version response time distribution :

Concurrent PUT requests, 20 per second, for 10 minutes, after dropping vm cache
Latencies    [mean, 50, 95, 99, max]  29.211369875s, 30.002788029s, 30.003025069s, 31.001143056s, 33.005231569s response below 30s: 6,11%
 * 2.12 version response time distribution :

Latencies    [mean, 50, 95, 99, max]  9.290393071s, 8.216053491s, 24.212567799s, 29.46094486s, 30.001358218s response below 30s: 99.26%
 * patched version response time distribution :

LOSF v2
We are now looking at 14TB drives. LOSF main issue is the CPU consumption. Another thing we'd like to do is optimize HEAD requests.

New objet key format (done)
The new format is: partition / suffix / object hash / filename. The old one was: object hash / filename. Grows the DB by about 5%, but listing partition and suffix content is now much cheaper. The downside is that we will have to support the relinker: the previous format did not care about the part power

Bring back hashes.pkl (done)
Because listing partition and suffix is now cheaper, __get_hashes can use the same code as diskfile.py, and we can bring back hashes.pkl

New volume selection method
Profiling shows the current method to be CPU intensive. Stop dedicating a volume to a partition. Create volumes until a minimum count is reached, to avoid lock contention between object-server processes, and then use these until they grow too large, then mark as read-only.

Store metadata in the DB
Optionally (if the LOSF key value lives on a dedicated SSD), store a copy of the swift metadata in the DB

Add volume statistics
We have a FUSE mount which is nice, but getting per-volume statistics (how much space was punched, quarantine size, ...) is costly to get. Store that info in the DB

Benchmarks !
Can't wait, but nothing to see here yet ;)