Swift/ideas/small files/implementation

The purpose of this page is to describe the proposed implementation, along with some benchmarks. Please note that the implementation is at an early stage, as are the benchmarks.

Code (old): https://review.openstack.org/#/c/436406/
Slides (old): https://www.slideshare.net/rledisez/slide-smallfiles
Newer slides: https://fr.slideshare.net/AlexandreLecuyer/openstack-swift-lots-of-small-files

Problem

A Swift object uses at least two inodes on the filesystem. For clusters with many small files (say, 10 million objects per disk), performance degrades significantly because the directory structure does not fit in memory. Replication and auditor operations trigger a lot of IO; over 40% of disk activity may be caused by "listdir" operations. The goal is to serve listdir operations without any disk IO.

Overview

Store Swift objects in large files, similar to what Haystack does. We do not need all the information stored in an inode (owner, group, ...). Make the "inode" as small as possible so that listdir requests can be served from memory. These "inodes" are stored in a per-disk key-value store, to ease cluster maintenance.
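
As a rough illustration of how little needs to be kept per object, a minimal "inode" could look like the record below (the record and field names are hypothetical, not the actual schema used by the key-value store):

  # Hypothetical sketch: the per-object "inode" boils down to a volume
  # reference and an offset, a few dozen bytes instead of a filesystem inode.
  from collections import namedtuple

  ObjectRecord = namedtuple('ObjectRecord', [
      'volume_index',  # which large file ("volume") holds the object
      'offset',        # byte offset of the object inside that volume
  ])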

Implementation

New python code

  • vfile

A new "vfile" module presents an interface similar to a regular python file, but will store "vfiles" in large files, which we call "volumes". A volume is append-only, and is dedicated to a given swift partition. There may be multiple volumes for a partition, to allow for write concurrency. Once a file has been written to a volume, its location (volume index, offset) is stored in a key-value store.

  • kvfile

"kvfile" is a copy of diskfile, modified to use "vfile" instead of regular POSIX files.

  • rpc

the "rpc_grpc" module handles communication with the local RPC server, through a socket. (registering a file, listing directories..) It makes calls to generated code from gRPC

Existing python code

Minor changes to the replicator / reconstructor / diskfile / utils. Mostly, file and directory operations (os.*) are abstracted so that they go through the diskfile implementation instead.

New golang code

The RPC server runs as a separate process, accessed over a socket, using gRPC. There is one RPC server instance per disk. Each RPC server embeds a leveldb key-value database. The basic operations are storing and retrieving information about volumes and files, and recreating the Swift directory structure on the fly (directories are not stored).
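
To illustrate "recreating the directory structure on the fly" (sketched in Python for brevity, even though the server is Go): the partition and suffix directories do not need to be stored, because they can be recomputed from the object hash and the part power, following the way Swift itself derives them.

  import struct

  # Sketch: recompute the objects/<partition>/<suffix>/<hash>/ path from the
  # object hash alone.
  def directory_for(object_hash, part_power):
      part_shift = 32 - part_power
      partition = struct.unpack_from(
          '>I', bytes.fromhex(object_hash))[0] >> part_shift
      suffix = object_hash[-3:]        # last 3 hex digits of the hash
      return 'objects/%d/%s/%s' % (partition, suffix, object_hash)

  print(directory_for('d41d8cd98f00b204e9800998ecf8427e', 10))
  # -> objects/848/27e/d41d8cd98f00b204e9800998ecf8427e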

Consistency

The object server currently issues an fsync() before replying, to ensure data is on disk. The vfile module will also sync the volume after writes before replying to a client. However, the leveldb key-value store is written to asynchronously: synchronous operations on the key-value store would be too costly, performance-wise.

If the KV has not been closed properly, a check is triggered on restart: the end of each volume is scanned to reconcile any difference with the KV content.
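
As a sketch of that recovery step (using an invented record layout, a 4-byte length prefix per object, which is not the real volume format), the tail of a volume can be replayed and anything missing from the KV re-registered; the kv argument is a plain dict standing in for the key-value store:

  import struct

  # Sketch only: replay the tail of a volume, starting from the last offset
  # the KV knows about, and register anything written after it.
  def reconcile_volume(volume_path, kv, volume_index, last_known_end):
      with open(volume_path, 'rb') as vol:
          vol.seek(last_known_end)
          while True:
              offset = vol.tell()
              prefix = vol.read(4)
              if len(prefix) < 4:
                  break                       # reached the end of the volume
              (length,) = struct.unpack('>I', prefix)
              payload = vol.read(length)
              if len(payload) < length:
                  break                       # truncated last write, ignore it
              kv[(volume_index, offset)] = length   # re-register the entry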

Freeing space

We rely on the filesystem's "punch hole" support, which lets us free space within a file. https://lwn.net/Articles/415889/
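
For reference, this is how hole punching could be invoked from Python: there is no stdlib wrapper for these fallocate(2) flags, so the sketch goes through ctypes and assumes a 64-bit Linux platform (constants are from <linux/falloc.h>).

  import ctypes
  import ctypes.util
  import os

  FALLOC_FL_KEEP_SIZE = 0x01    # keep the apparent file size unchanged
  FALLOC_FL_PUNCH_HOLE = 0x02   # deallocate the byte range

  libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)

  def punch_hole(fd, offset, length):
      # frees the 4k-aligned blocks backing [offset, offset + length)
      ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           ctypes.c_long(offset), ctypes.c_long(length))
      if ret != 0:
          err = ctypes.get_errno()
          raise OSError(err, os.strerror(err))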

Examples

Serving a PUT request. Instead of creating a temp file and renaming it (see the sketch after this list):

  • Find an unlocked volume for that partition. If needed, create one and register it in the KV.
  • Lock the volume and write the object at the end of the volume. When Swift closes the "file", write the object header and metadata, then fsync() the volume.
  • Register the file in the KV.
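
A compact sketch of that flow is below; a plain dict stands in for the gRPC-backed key-value store, flock() stands in for the real volume locking, and the header layout is left abstract:

  import fcntl
  import os

  # Sketch of the PUT flow: append to a locked volume, fsync, then register
  # the location in the KV (here a plain dict).
  def put_object(volume_path, volume_index, kv, object_hash, header, body):
      with open(volume_path, 'ab') as vol:
          fcntl.flock(vol, fcntl.LOCK_EX)        # one writer per volume
          offset = vol.tell()
          vol.write(header)                      # object header + metadata
          vol.write(body)
          vol.flush()
          os.fsync(vol.fileno())                 # durable before replying
          fcntl.flock(vol, fcntl.LOCK_UN)
      kv[object_hash] = (volume_index, offset)   # KV write is asynchronous
      return offset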

Serving a GET request (see the sketch after this list):

  • Get the object location (volume index, offset in the volume) from the KV.
  • Open the volume file, seek to the offset, serve the object.
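
And a matching sketch for the GET flow; header_size and length stand in for information the real code reads from the object header stored at the offset, and the volume naming is illustrative:

  import os

  # Sketch of the GET flow: one KV lookup replaces the directory traversal
  # and the per-object open().
  def get_object(volumes_dir, kv, object_hash, header_size, length):
      volume_index, offset = kv[object_hash]     # served from memory
      volume_path = os.path.join(volumes_dir, '%07d' % volume_index)
      with open(volume_path, 'rb') as vol:
          vol.seek(offset + header_size)         # skip the object header
          return vol.read(length)                # in practice, streamed out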

Preliminary test results (outdated)

We have not yet tested the pathological case we see in production with replication.

Hardware setup:
  • Atom C2750 2.40GHz
  • 16GB RAM
  • Drives: HGST HUS726040ALA610 (4TB)

3 drives per server, but the tests below exercise a single drive.

Single-threaded PUT from one machine to a patched object server and to an unpatched 2.12 server

(test using the object server API directly, no proxy server involved, objects are < 100 bytes)

From zero to 4 million objects, on one disk.

  • 2.12 version: 3360 minutes (19.8 PUT/s)
  • patched version: 2540 minutes (26.2 PUT/s) - About 42 bytes used in leveldb per object

From 4 million to 8 million objects

  • 2.12 version: 3900 minutes (17 PUT/s)
  • patched version: 1700 minutes (39.2 PUT/s) - faster, likely because most "volume files" have already been created (not measured, to be confirmed)

The key value size for the disk at the end of the test is 320MB.

Single-threaded GET from one machine to a patched object server and to an unpatched 2.12 server. Both servers have 8 million objects on one disk.

  • 2.12 version: 39 GET/s
  • patched version: 93 GET/s


Concurrent PUT requests, 20 per second, for 10 minutes, with "hot inode cache"

  • 2.12 version response time distribution :
  Latencies     [mean, 50, 95, 99, max]  641.274117ms, 67.31248ms, 3.526835534s, 4.68917307s, 5.971909s
  100% success
  • patched version response time distribution :
  Latencies     [mean, 50, 95, 99, max]  82.581295ms, 50.487793ms, 261.475566ms, 615.565045ms, 1.245540101s
  100% success


Concurrent PUT requests, 20 per second, for 10 minutes, after dropping vm cache

  • 2.12 version response time distribution :
  Latencies     [mean, 50, 95, 99, max]  29.211369875s, 30.002788029s, 30.003025069s, 31.001143056s, 33.005231569s
  response below 30s: 6.11%
  • patched version response time distribution :
  Latencies     [mean, 50, 95, 99, max]  9.290393071s, 8.216053491s, 24.212567799s, 29.46094486s, 30.001358218s
  response below 30s: 99.26%

LOSF v2

We are now looking at 14TB drives. LOSF's main issue is CPU consumption. Another thing we'd like to do is optimize HEAD requests.

New object key format (done)

The new format is: partition / suffix / object hash / filename. The old one was: object hash / filename. This grows the DB by about 5%, but listing partition and suffix content is now much cheaper. The downside is that we will have to support the relinker: the previous format did not care about the part power.
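
To illustrate why the new layout helps (a sorted list of keys stands in for the leveldb keyspace, and the key encoding is simplified): listing a suffix becomes a scan over one contiguous key range.

  from bisect import bisect_left

  # Sketch: with partition/suffix/hash/filename keys, listing a suffix is a
  # range scan bounded by a key prefix.
  def list_suffix(sorted_keys, partition, suffix):
      prefix = '%s/%s/' % (partition, suffix)
      start = bisect_left(sorted_keys, prefix)
      entries = []
      for key in sorted_keys[start:]:
          if not key.startswith(prefix):
              break                          # left the range, stop scanning
          entries.append(key[len(prefix):])  # hash/filename
      return entries

  keys = sorted(['848/27e/d41d8cd98f00b204e9800998ecf8427e/1584563200.data',
                 '848/b51/9e107d9d372bb6826bd81d3542a419d6/1584563300.data'])
  print(list_suffix(keys, '848', '27e'))
  # -> ['d41d8cd98f00b204e9800998ecf8427e/1584563200.data']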

Bring back hashes.pkl (done)

Because listing a partition or a suffix is now cheaper, __get_hashes() can use the same code as diskfile.py, and we can bring back hashes.pkl.

New volume selection method

Profiling shows the current method to be CPU intensive. Stop dedicating a volume to a partition. Create volumes until a minimum count is reached, to avoid lock contention between object-server processes, then use them until they grow too large, at which point they are marked read-only.
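
A rough sketch of that policy (thresholds, structures and names are illustrative, not the implemented values):

  MIN_VOLUME_COUNT = 8                     # illustrative threshold
  MAX_VOLUME_SIZE = 10 * 1024 ** 3         # illustrative threshold (10 GiB)

  # Sketch: volumes are no longer tied to a partition; grow the pool until
  # there are enough volumes to avoid lock contention, then reuse writable
  # ones until they get too big, at which point they become read-only.
  def pick_volume(volumes, create_volume):
      writable = [v for v in volumes
                  if not v['read_only'] and not v['locked']]
      if len(volumes) < MIN_VOLUME_COUNT or not writable:
          return create_volume()
      volume = min(writable, key=lambda v: v['size'])
      if volume['size'] >= MAX_VOLUME_SIZE:
          volume['read_only'] = True        # retire it
          return create_volume()
      return volume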

Store metadata in the DB

Optionally (if the LOSF key-value store lives on a dedicated SSD), store a copy of the Swift metadata in the DB.

Add volume statistics

We have a FUSE mount, which is nice, but per-volume statistics (how much space was punched, quarantine size, ...) are costly to get. Store that info in the DB.

Benchmarks!

Can't wait, but nothing to see here yet ;)