Difference between revisions of "Swift/ideas/small files/implementation"
(Add LOSF v2 planned changes) |
m (→LOSF v2) |
||
Line 103: | Line 103: | ||
== LOSF v2 == | == LOSF v2 == | ||
− | We are now looking at 14TB drives. LOSF main issue is the CPU consumption. Another thing we'd like to do is | + | We are now looking at 14TB drives. LOSF main issue is the CPU consumption. Another thing we'd like to do is optimize HEAD requests. |
=== New objet key format (done) === | === New objet key format (done) === |
Revision as of 17:38, 18 March 2020
The purpose of this page is to describe the proposed implementation with some benchmarks. Please note the implementation is at early stage, as the benchmarks.
Code (old): https://review.openstack.org/#/c/436406/ Slides (old): https://www.slideshare.net/rledisez/slide-smallfiles newer slide : https://fr.slideshare.net/AlexandreLecuyer/openstack-swift-lots-of-small-files
Contents
- 1 Problem
- 2 Overview
- 3 Implementation
- 4 New python code
- 5 Existing python code
- 6 New golang code
- 7 Consistency
- 8 Freeing space
- 8.1 Examples
- 8.2 Preliminary test results (outdated)
- 8.2.1 Single threaded PUT from a machine to one patched object server, and an unpatched 2.12 server
- 8.2.2 Single threaded GET from a machine to one patched object server, and an unpatched 2.12 server. Both servers have 8 million objects on one disk.
- 8.2.3 Concurrent PUT requests, 20 per second, for 10 minutes, with "hot inode cache"
- 8.2.4 Concurrent PUT requests, 20 per second, for 10 minutes, after dropping vm cache
- 9 LOSF v2
Problem
A swift object uses at least two inodes on the filesystem. For clusters with many small files (say, 10 million objects per disk) the performance degradation is important, as the directory structure does not fit in memory. Replication / auditor operations trigger a lot of IO. Over 40% of disk activity may be caused by "listdir" operations. The goal is to serve listdir operations without any disk IO.
Overview
Store swift objects in large files, similar to what haystack does. We do not need all the information stored in an inode (owner, group..). Make the "inode" as small as possible so that listdir requests can be served from memory. These "inodes" will be stored in a key value store, per disk, to ease cluster maintenance.
Implementation
New python code
- vfile
A new "vfile" module presents an interface similar to a regular python file, but will store "vfiles" in large files, which we call "volumes". A volume is append-only, and is dedicated to a given swift partition. There may be multiple volumes for a partition, to allow for write concurrency. Once a file has been written to a volume, its location (volume index, offset) is stored in a key-value store.
- kvfile
"kvfile" is a copy of diskfile, modified to use "vfile" instead of regular POSIX files.
- rpc
the "rpc_grpc" module handles communication with the local RPC server, through a socket. (registering a file, listing directories..) It makes calls to generated code from gRPC
Existing python code
Minor changes to replicator / reconstructor / diskfile / utils Mostly, abstract file and directory operations (os.*). Call to the diskfile implementation instead.
New golang code
The RPC server runs as a separate process, accessed over a socket, using gRPC. There is one RPC server instance per disk. Each RPC server embeds a leveldb key-value database. The basic operations are storing and retrieving information about volumes and files, and recreating the swift directory structure on the fly (directories are not stored)
Consistency
The object server currently issues an fsync() before replying, to ensure data is on disk. The vfile module will also sync the volume after writes before replying to a client. However, the leveldb key-value store is written to asynchronously. Synchronous operations on the key-value store would be too costly, performance wise.
If the kv has not been closed properly, upon restart a check will be triggered, and the end of volumes will be scanned, to reconcile any difference with the kv content.
Freeing space
We rely on the filesystem's "punch hole" support, which lets us free space within a file. https://lwn.net/Articles/415889/
Examples
serving a PUT request. Instead of creating a temp file and renaming it :
- Find an unlocked volume for that partition. If needed, create one and register it in the KV.
- lock the volume and write the object at the end of the volume. when swift closes the "file", write the object header and metdata, then fsync() the volume.
- register the file in the KV
serving a GET request
- get the object location (volume index, offset in the volume) from the KV
- open the volume file, seek to offset, serve the object.
Preliminary test results (outdated)
We have not yet tested the pathological case we see in production with replication. Hardware setup : Atom C2750 2.40Ghz 16GB RAM Drives : HGST HUS726040ALA610 (4TB)
3 drives per server, but the tests below exercise a single drive.
Single threaded PUT from a machine to one patched object server, and an unpatched 2.12 server
(test using the object server API directly, no proxy server involved, objects are < 100 bytes)
From zero to 4 millions objects, on one disk.
- 2.12 version : 3360 minutes (19,8 PUT/s)
- patched version : 2540 minutes (26,2 PUT/s) - About 42 bytes used in leveldb per object
From 4 millions to 8 million objects
- 2.12 version : 3900 minutes (17 PUT/s)
- patched version : 1700 minutes (39,2 PUT/s) - faster, likely because most "volume files" have already been created (not measured, to be confirmed)
The key value size for the disk at the end of the test is 320MB.
Single threaded GET from a machine to one patched object server, and an unpatched 2.12 server. Both servers have 8 million objects on one disk.
- 2.12 version : 39 GET/s
- patched version : 93 GET/s
Concurrent PUT requests, 20 per second, for 10 minutes, with "hot inode cache"
- 2.12 version response time distribution :
Latencies [mean, 50, 95, 99, max] 641.274117ms, 67.31248ms, 3.526835534s, 4.68917307s, 5.971909s 100% success
- patched version response time distribution :
Latencies [mean, 50, 95, 99, max] 82.581295ms, 50.487793ms, 261.475566ms, 615.565045ms, 1.245540101s success 100%
Concurrent PUT requests, 20 per second, for 10 minutes, after dropping vm cache
- 2.12 version response time distribution :
Latencies [mean, 50, 95, 99, max] 29.211369875s, 30.002788029s, 30.003025069s, 31.001143056s, 33.005231569s response below 30s: 6,11%
- patched version response time distribution :
Latencies [mean, 50, 95, 99, max] 9.290393071s, 8.216053491s, 24.212567799s, 29.46094486s, 30.001358218s response below 30s: 99.26%
LOSF v2
We are now looking at 14TB drives. LOSF main issue is the CPU consumption. Another thing we'd like to do is optimize HEAD requests.
New objet key format (done)
The new format is: partition / suffix / object hash / filename. The old one was: object hash / filename Grows the DB by about 5%, but listing partition and suffix content is now much cheaper. The downside is that we will have to support the relinker: the previous format did not care about the part power
Bring back hashes.pkl (done)
Because listing partition and suffix is now cheaper, __get_hashes() can use the same code as diskfile.py, and we can bring back hashes.pkl
New volume selection method
Profiling shows the current method to be CPU intensive. Stop dedicating volume to a partition. Create volumes until a minimum count is reached, to avoid lock contention between object-server processes, and then use these until they grow too large, then mark as read-only.
Store metadata in the DB
Optionally (if the LOSF key value lives on a dedicated SSD), store a copy of the swift metadata in the DB
Add volume statistics
We have a FUSE mount which is nice, but getting per-volume statistics (how much space was punched, quarantine size, ...) is costly to get. Store that info in the DB