
Swift/ideas/small files/implementation


The purpose of this page is to describe the proposed implementation, with some benchmarks. Please note that the implementation, like the benchmarks, is at an early stage.

Code: https://review.openstack.org/#/c/436406/
Slides: https://www.slideshare.net/rledisez/slide-smallfiles

Problem we want to address

A Swift object uses at least one inode on the filesystem. For clusters with many small files (say, 10 million objects per disk), the performance degradation is significant, as the directory structure does not fit in memory. Replication and auditor operations then trigger a lot of disk IO; over 40% of disk activity may be caused by "listdir" operations. The goal is to serve listdir operations without any disk IO.

Principle

Store Swift objects in large files, as Haystack does. We do not need all the information stored in an inode (owner, group, ...), so make the "inode" as small as possible, so that listdir requests can be served from memory. These "inodes" are stored in a key-value store, one per disk, to ease cluster maintenance. The "volumes" (the actual large files on disk) are currently tied to a partition, which makes moving a partition easy (but makes changing the partition bit count harder).
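As a rough illustration of how small such an "inode" can be, here is a minimal sketch of a per-object record, keyed by the object hash, holding only the object's location inside a volume. The field names and encoding are assumptions for illustration; the actual record format is defined in the patch linked above.

  # Minimal sketch of the per-object record kept in the key-value store,
  # keyed by the object hash. Field names and encoding are illustrative
  # assumptions, not the actual layout from the patch.
  import struct
  from collections import namedtuple

  # Everything needed to locate an object: which volume file in the
  # partition directory, where the data starts, and how long it is.
  ObjectRecord = namedtuple('ObjectRecord', ['volume_index', 'offset', 'length'])

  def pack_record(record):
      # Three unsigned 64-bit integers: a small, fixed-size value per object.
      return struct.pack('<QQQ', record.volume_index, record.offset, record.length)

  def unpack_record(blob):
      return ObjectRecord(*struct.unpack('<QQQ', blob))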

Implementation

Two parts:

  • Swift patches, mostly to diskfile.py, which now uses a "vfile" module providing file-like semantics for "virtual files" stored within volumes.
  • the key-value store, based on LevelDB, written in Go.

These two parts communicate over RPC on a socket (using gRPC).
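The exact RPC surface is defined in the patch; purely as an illustration of the kind of calls the diskfile/vfile layer needs to make to the per-disk key-value service, a hypothetical client-side interface could look like the following (all method names and signatures are assumptions):

  # Hypothetical interface the patched diskfile/vfile code would call on the
  # per-disk key-value service (actually spoken over gRPC on a local socket).
  class KVClient(object):
      def register_volume(self, partition, volume_index):
          """Record that a new volume file exists for this partition."""
          raise NotImplementedError

      def put_object(self, object_hash, volume_index, offset, length):
          """Register an object's location once its data has been fsync()ed."""
          raise NotImplementedError

      def get_object(self, object_hash):
          """Return (volume_index, offset, length) for an object, or None."""
          raise NotImplementedError

      def list_partition(self, partition):
          """List object hashes in a partition without touching the disk
          (the replacement for listdir())."""
          raise NotImplementedError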

Volumes are always appended to, and data is fsync()ed. The key-value store is written to asynchronously. In case of a crash, the end of the volume should be read and checked against the KV to register any missing objects.
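A minimal sketch of that recovery step, assuming each object in a volume is preceded by a small header carrying its hash and length (the header format, like the KV client methods used here, is an assumption for illustration):

  import struct

  # Replay the tail of a volume and re-register any object missing from the
  # key-value store. Assumed on-disk header: 32-byte hash + 64-bit length.
  HEADER = struct.Struct('<32sQ')

  def recover_volume_tail(volume_file, volume_index, kv, last_known_offset):
      volume_file.seek(last_known_offset)
      offset = last_known_offset
      while True:
          raw = volume_file.read(HEADER.size)
          if len(raw) < HEADER.size:
              break  # reached the end of the volume
          object_hash, length = HEADER.unpack(raw)
          if kv.get_object(object_hash) is None:
              # The data was fsync()ed but the asynchronous KV write was lost.
              kv.put_object(object_hash, volume_index, offset + HEADER.size, length)
          offset += HEADER.size + length
          volume_file.seek(offset)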

When files are deleted, the corresponding hole in the volume is punched using fallocate(). 4k-aligned blocks will be returned to the filesystem, which means we should not have to recreate/defragment a volume file very often. (On initial open(), XFS will need to read in all extents, so recreating/defragmenting will still be needed at some point.)
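A standalone ctypes sketch of the hole punching (Linux-specific; flag values from <linux/falloc.h>, assuming a 64-bit off_t). The actual code may rely on Swift's own fallocate helpers instead:

  import ctypes
  import ctypes.util

  FALLOC_FL_KEEP_SIZE = 0x01
  FALLOC_FL_PUNCH_HOLE = 0x02

  _libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
  _libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                              ctypes.c_longlong, ctypes.c_longlong]

  def punch_hole(fd, offset, length):
      # Deallocate the deleted object's byte range without changing the
      # volume's apparent size; whole 4k blocks go back to the filesystem.
      ret = _libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            offset, length)
      if ret != 0:
          raise OSError(ctypes.get_errno(), 'fallocate() failed')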

Examples

serving a PUT request, instead of creating a temp file and renaming it (a sketch follows the list):

  • check in the partition directory whether a volume already exists and is not locked; if needed, create one and register it in the KV
  • lock the volume and write the object at the end of the volume. When Swift closes the "file", seek back, write the object header for which space was reserved, and fsync() the volume
  • register the file in the KV
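Put together, a rough sketch of this PUT path could look like the following (the helper names, locking via flock() and the on-disk header format are illustrative assumptions; the real code is in the vfile module of the patch):

  import fcntl
  import os
  import struct

  HEADER = struct.Struct('<32sQ')   # assumed: 32-byte object hash + 64-bit length

  def vfile_put(partition_dir, kv, partition, object_hash, chunks):
      # 1. Reuse a volume in the partition directory, or create and register
      #    a new one (simplified here to a single volume per partition).
      volume_index = 0
      volume_path = os.path.join(partition_dir, 'vol.%06d' % volume_index)
      if not os.path.exists(volume_path):
          open(volume_path, 'wb').close()
          kv.register_volume(partition, volume_index)

      with open(volume_path, 'r+b') as vol:
          # 2. Lock the volume and append the object at its end.
          fcntl.flock(vol.fileno(), fcntl.LOCK_EX)
          try:
              vol.seek(0, os.SEEK_END)
              header_offset = vol.tell()
              vol.write(b'\0' * HEADER.size)      # reserve space for the header
              length = 0
              for chunk in chunks:                # stream the object body
                  vol.write(chunk)
                  length += len(chunk)
              vol.seek(header_offset)             # seek back, fill in the header
              vol.write(HEADER.pack(object_hash, length))
              vol.flush()
              os.fsync(vol.fileno())              # data is durable before the KV write
          finally:
              fcntl.flock(vol.fileno(), fcntl.LOCK_UN)

      # 3. Register the object's location in the key-value store.
      kv.put_object(object_hash, volume_index, header_offset + HEADER.size, length)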

serving a GET request (a sketch follows the list):

  • get the object location (volume index, offset in the volume) from the KV
  • open the partition directory, open the volume file, serve the object.
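A matching sketch of the GET path, reusing the same illustrative helpers: a single KV lookup, then a direct read from the volume file.

  import os

  def vfile_get(partition_dir, kv, object_hash):
      record = kv.get_object(object_hash)         # (volume_index, offset, length)
      if record is None:
          raise IOError('object not found')
      volume_index, offset, length = record
      volume_path = os.path.join(partition_dir, 'vol.%06d' % volume_index)
      with open(volume_path, 'rb') as vol:
          vol.seek(offset)
          return vol.read(length)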


Preliminary test results

We have not yet tested the pathological case we see in production with replication.

Hardware setup:

  • CPU: Atom C2750, 2.40 GHz
  • RAM: 16 GB
  • Drives: HGST HUS726040ALA610 (4 TB)

3 drives per server, but the tests below exercise a single drive.

Single-threaded PUT from one machine to a patched object server and to an unpatched 2.12 server

(test using the object server API directly, no proxy server involved, objects are < 100 bytes)

From zero to 4 million objects, on one disk.

  • 2.12 version: 3360 minutes (19.8 PUT/s)
  • patched version: 2540 minutes (26.2 PUT/s); about 42 bytes used in LevelDB per object

From 4 million to 8 million objects

  • 2.12 version: 3900 minutes (17 PUT/s)
  • patched version: 1700 minutes (39.2 PUT/s) - faster, likely because most "volume files" have already been created (not measured, to be confirmed)

The key-value store size for the disk at the end of the test is 320 MB, roughly consistent with the ~42 bytes per object noted above for 8 million objects.

Single-threaded GET from one machine to a patched object server and to an unpatched 2.12 server. Both servers have 8 million objects on one disk.

  • 2.12 version : 39 GET/s
  • patched version : 93 GET/s


Concurrent PUT requests, 20 per second, for 10 minutes, with "hot inode cache"

  • 2.12 version response time distribution:
  Latencies     [mean, 50, 95, 99, max]  641.274117ms, 67.31248ms, 3.526835534s, 4.68917307s, 5.971909s
  100% success
  • patched version response time distribution:
  Latencies     [mean, 50, 95, 99, max]  82.581295ms, 50.487793ms, 261.475566ms, 615.565045ms, 1.245540101s
  100% success


Concurrent PUT requests, 20 per second, for 10 minutes, after dropping the VM cache

  • 2.12 version response time distribution:
  Latencies     [mean, 50, 95, 99, max]  29.211369875s, 30.002788029s, 30.003025069s, 31.001143056s, 33.005231569s
  responses below 30s: 6.11%
  • patched version response time distribution:
  Latencies     [mean, 50, 95, 99, max]  9.290393071s, 8.216053491s, 24.212567799s, 29.46094486s, 30.001358218s
  responses below 30s: 99.26%