Swift/ideas/small files/implementation

This page describes the proposed implementation and gives some benchmarks. Please note that both the implementation and the benchmarks are at an early stage.

 * Code (old): https://review.openstack.org/#/c/436406/
 * Slides (old): https://www.slideshare.net/rledisez/slide-smallfiles
 * Slides (newer): https://fr.slideshare.net/AlexandreLecuyer/openstack-swift-lots-of-small-files

Problem
A Swift object uses at least two inodes on the filesystem. For clusters with many small files (say, 10 million objects per disk), performance degrades significantly because the directory structure does not fit in memory. Replication and auditor operations trigger a lot of IO. Over 40% of disk activity may be caused by "listdir" operations. The goal is to serve listdir operations without any disk IO.

Overview
Store Swift objects in large files, similar to what Haystack does. We do not need all the information stored in an inode (owner, group, ...). Make the "inode" as small as possible so that listdir requests can be served from memory. These "inodes" will be stored in a key-value store, one per disk, to ease cluster maintenance.

New python code
A new "vfile" module presents an interface similar to a regular python file, but will store "vfiles" in large files, which we call "volumes". A volume is append-only, and is dedicated to a given swift partition. There may be multiple volumes for a partition, to allow for write concurrency. Once a file has been written to a volume, its location (volume index, offset) is stored in a key-value store.
 * vfile
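
The volume idea can be illustrated with a minimal sketch. This is not the vfile module itself: the `Volume` class, the 4-byte length header and the plain dict standing in for the key-value store are all assumptions made for the example.

```python
import os
import struct

# Hypothetical record layout: 4-byte big-endian length, then the payload.
HEADER = struct.Struct(">I")


class Volume:
    """Append-only container file holding many small objects."""

    def __init__(self, path):
        # "a+b" creates the file if needed and forces every write to the end.
        self._f = open(path, "a+b")

    def append(self, data):
        """Append one record and return the offset where it starts."""
        self._f.seek(0, os.SEEK_END)
        offset = self._f.tell()
        self._f.write(HEADER.pack(len(data)) + data)
        self._f.flush()
        os.fsync(self._f.fileno())  # data is on disk before we report success
        return offset

    def read(self, offset):
        """Return the payload of the record starting at offset."""
        self._f.seek(offset)
        (length,) = HEADER.unpack(self._f.read(HEADER.size))
        return self._f.read(length)

    def close(self):
        self._f.close()
```

A caller would keep the returned location, for example `index[name] = (volume_index, vol.append(data))`, and later pass the stored offset to `vol.read()`.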

"kvfile" is a copy of diskfile, modified to use "vfile" instead of regular POSIX files.
 * kvfile

The "rpc_grpc" module handles communication with the local RPC server through a socket (registering a file, listing directories, ...). It calls code generated by gRPC.
 * rpc

Existing python code
Minor changes to the replicator, reconstructor, diskfile and utils modules. Mostly, file and directory operations (os.*) are abstracted: the code calls the diskfile implementation instead.

New golang code
The RPC server runs as a separate process, accessed over a socket, using gRPC. There is one RPC server instance per disk. Each RPC server embeds a leveldb key-value database. The basic operations are storing and retrieving information about volumes and files, and recreating the Swift directory structure on the fly (directories are not stored).
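
Because directories are not stored, a listing has to be derived from the keys. The sketch below shows one way this could work on a sorted key space such as the one leveldb provides. It is written in Python for consistency with the other examples on this page, not in Go, and the "partition/suffix/hash/filename" key layout and the `listdir` helper are assumptions, not the actual schema.

```python
import bisect


def listdir(sorted_keys, directory):
    """Return the entry names directly under directory, from key prefixes."""
    prefix = directory.strip("/") + "/"
    names = set()
    # Jump to the first key that could start with the prefix, then walk
    # forward until the keys no longer match.
    i = bisect.bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        names.add(sorted_keys[i][len(prefix):].split("/", 1)[0])
        i += 1
    return sorted(names)
```

With keys kept in sorted order, only the range sharing the prefix is visited, and nothing is read from the filesystem.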

Consistency
The object server currently issues an fsync before replying, to ensure data is on disk. The vfile module will also sync the volume after writes before replying to a client. However, the leveldb key-value store is written to asynchronously, because synchronous operations on the key-value store would be too costly performance-wise.

If the kv has not been closed properly, upon restart a check will be triggered, and the end of volumes will be scanned, to reconcile any difference with the kv content.
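
The reconciliation step could look like the following sketch. The record layout (name length, data length, name, data), the `scan_from` offset and the dict used as the KV are assumptions for illustration only. The real volume format is not described on this page.

```python
import struct

# Hypothetical record header: 2-byte name length, 4-byte data length.
HEADER = struct.Struct(">HI")


def pack_record(name, data):
    """Serialize one record in the hypothetical layout."""
    raw = name.encode()
    return HEADER.pack(len(raw), len(data)) + raw + data


def reconcile(volume, kv, volume_index, scan_from):
    """Register records found after scan_from that the kv does not know."""
    recovered = []
    volume.seek(scan_from)
    while True:
        offset = volume.tell()
        header = volume.read(HEADER.size)
        if len(header) < HEADER.size:
            break  # end of volume, or a truncated header
        name_len, data_len = HEADER.unpack(header)
        body = volume.read(name_len + data_len)
        if len(body) < name_len + data_len:
            break  # partial record left by an interrupted write
        name = body[:name_len].decode()
        if name not in kv:
            kv[name] = (volume_index, offset)
            recovered.append(name)
    return recovered
```

Since the volume is synced before the client gets a reply, any acknowledged object is in the volume even if its KV entry was lost, which is what makes this kind of scan sufficient.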

Freeing space
We rely on the filesystem's "punch hole" support, which lets us free space within a file. https://lwn.net/Articles/415889/
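
As an illustration, hole punching can be invoked from Python through the `fallocate(2)` system call. This is a sketch assuming 64-bit Linux and a filesystem that supports the operation. The `punch_hole` helper is hypothetical and is not taken from the patch.

```python
import ctypes
import os

# Flag values from <linux/falloc.h>.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02


def punch_hole(fd, offset, length):
    """Deallocate length bytes at offset without changing the file size."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumption: off_t is 64 bits wide, as on 64-bit Linux.
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE
    if libc.fallocate(fd, mode, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful call the punched range reads back as zeros and the offsets of the other objects in the volume are unchanged, so existing KV entries stay valid.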

Examples
Serving a PUT request. Instead of creating a temp file and renaming it:
 * Find an unlocked volume for that partition. If needed, create one and register it in the KV.
 * Lock the volume and write the object at the end of the volume. When Swift closes the "file", write the object header and metadata, then fsync the volume.
 * Register the file in the KV.
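
The first two steps can be sketched as follows. The page does not say how volumes are locked, so the use of `fcntl.flock`, the `volume_N` file names and the `max_volumes` limit are assumptions. Registering a newly created volume in the KV is left out.

```python
import fcntl
import os


def open_unlocked_volume(partition_dir, max_volumes=10):
    """Return (index, file) for a volume of this partition that we locked."""
    os.makedirs(partition_dir, exist_ok=True)
    for index in range(max_volumes):
        path = os.path.join(partition_dir, "volume_%d" % index)
        f = open(path, "ab")  # creates the volume if it does not exist yet
        try:
            # Non-blocking exclusive lock: fails at once if a writer holds it.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            f.close()  # held by another writer, try the next volume
            continue
        return index, f
    raise RuntimeError("all volumes are locked")
```

Closing the returned file releases the lock, so a concurrent writer simply moves on to the next volume instead of waiting.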

Serving a GET request:
 * Get the object location (volume index, offset in the volume) from the KV.
 * Open the volume file, seek to the offset, serve the object.

Preliminary test results (outdated)
We have not yet tested the pathological case we see in production with replication.

Hardware setup:
 * CPU : Atom C2750 2.40GHz
 * RAM : 16GB
 * Drives : HGST HUS726040ALA610 (4TB)

3 drives per server, but the tests below exercise a single drive.

Single threaded PUT from a machine to one patched object server, and an unpatched 2.12 server
(test using the object server API directly, no proxy server involved, objects are < 100 bytes)

From zero to 4 million objects, on one disk.
 * 2.12 version : 3360 minutes (19.8 PUT/s)
 * patched version : 2540 minutes (26.2 PUT/s) - About 42 bytes used in leveldb per object

From 4 million to 8 million objects
 * 2.12 version : 3900 minutes (17 PUT/s)
 * patched version : 1700 minutes (39.2 PUT/s) - faster, likely because most "volume files" have already been created (not measured, to be confirmed)

The key value size for the disk at the end of the test is 320MB.
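
The throughput and per-object figures can be cross-checked with a few lines of arithmetic. The run durations and object counts come from the results above. Reading "320MB" as 320 MiB is an assumption.

```python
def puts_per_second(objects, minutes):
    """Average throughput for a run that wrote `objects` in `minutes`."""
    return objects / (minutes * 60)


# Duration in minutes of each run of 4 million PUTs, as reported above.
RUNS = {
    "2.12, 0 to 4M": 3360,
    "patched, 0 to 4M": 2540,
    "2.12, 4M to 8M": 3900,
    "patched, 4M to 8M": 1700,
}

for label, minutes in RUNS.items():
    print("%-18s %.1f PUT/s" % (label, puts_per_second(4_000_000, minutes)))

# Assumption: "320MB" means 320 MiB, spread over 8 million objects.
bytes_per_object = 320 * 2**20 / 8_000_000
print("%.1f bytes per object in the key-value store" % bytes_per_object)
```

Under that reading, the per-object size at 8 million objects is consistent with the "about 42 bytes" reported after the first 4 million.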

Single threaded GET from a machine to one patched object server, and an unpatched 2.12 server. Both servers have 8 million objects on one disk.

 * 2.12 version : 39 GET/s
 * patched version : 93 GET/s

Concurrent PUT requests, 20 per second, for 10 minutes, with "hot inode cache"
 * 2.12 version response time distribution :
Latencies    [mean, 50, 95, 99, max]  641.274117ms, 67.31248ms, 3.526835534s, 4.68917307s, 5.971909s success: 100%

 * patched version response time distribution :
Latencies    [mean, 50, 95, 99, max]  82.581295ms, 50.487793ms, 261.475566ms, 615.565045ms, 1.245540101s success: 100%

Concurrent PUT requests, 20 per second, for 10 minutes, after dropping vm cache
 * 2.12 version response time distribution :
Latencies    [mean, 50, 95, 99, max]  29.211369875s, 30.002788029s, 30.003025069s, 31.001143056s, 33.005231569s response below 30s: 6.11%

 * patched version response time distribution :
Latencies    [mean, 50, 95, 99, max]  9.290393071s, 8.216053491s, 24.212567799s, 29.46094486s, 30.001358218s response below 30s: 99.26%