Small file optimization in Swift
Influences: Haystack, BlueStore, git pack files
One of the big problems with storing each object as a separate file is that this creates a lot of inodes on the drive. If you have small objects in your cluster (common) and big drives (more common every day), then just the inodes and dentries for the XFS partition can exhaust your RAM. Swift tries to keep these things in page cache, but it's just too big. This means that:
- there's a lot of FS metadata overhead for storage
- anything that has to iterate over each file is *slow*
- small erasure-coded objects can end up being relatively huge when considering the FS overhead
In each suffix directory (or partition dir?) keep two FS trees. One is "normal", i.e. the way things are now. The other is for small files and uses a slab file plus an index. The slab file is one file on disk that holds the concatenated data+metadata of small objects. The index file references each object in the slab by name or hash, along with its offset in the slab.
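A minimal sketch of what the slab + index pair could look like. Everything here is an illustrative assumption, not a proposal for the on-disk format: records are metadata-as-JSON, a newline, then the raw data; the index is a line-oriented JSON file mapping hash to (offset, length).

```python
import json
import os


class SlabWriter:
    """Append small objects to a single slab file and track their
    locations in a side index file (hash -> offset, length).
    Format is a sketch, not a real Swift on-disk layout."""

    def __init__(self, slab_path, index_path):
        self.slab_path = slab_path
        self.index_path = index_path

    def put(self, obj_hash, data, metadata):
        # one record = JSON metadata + newline + raw object bytes
        record = json.dumps(metadata).encode() + b"\n" + data
        offset = (os.path.getsize(self.slab_path)
                  if os.path.exists(self.slab_path) else 0)
        with open(self.slab_path, "ab") as slab:
            slab.write(record)
        entry = {"hash": obj_hash, "offset": offset, "length": len(record)}
        with open(self.index_path, "a") as index:
            index.write(json.dumps(entry) + "\n")

    def get(self, obj_hash):
        # linear scan of the index; see the bookkeeping questions below
        # for why a real implementation would do better
        with open(self.index_path) as index:
            for line in index:
                entry = json.loads(line)
                if entry["hash"] == obj_hash:
                    with open(self.slab_path, "rb") as slab:
                        slab.seek(entry["offset"])
                        record = slab.read(entry["length"])
                    meta, _, data = record.partition(b"\n")
                    return json.loads(meta), data
        return None
```

The length-prefixed read means object data can safely contain newlines; only the JSON metadata header is newline-terminated.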
Challenges and open questions:
- fragmentation or compaction
- chunked transfer encoding (where content-length isn't known up front)
- object server could spool eg 1MB (or whatever "small" is) and if it's in that first read, use a slab. otherwise, use the normal FS file
- extra disk seek to find slab or flat file
- finding the right spot in the slab file
- only append to the slab file and use a compaction process to deal with "holes" for deleted data
- reconciling diverging slabs during replication
- do we keep the index files in RAM, or read them on demand? If the former, is that more or less costly than inodes/dentries; if the latter, how big is the penalty for the extra IO?
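The append-only-plus-compaction idea above could work roughly like this sketch: deleted objects leave holes in the slab, and a background pass rewrites the slab copying only live records and emitting a new index with the shifted offsets. All names and the (hash, offset, length) tuple shape are assumptions for illustration.

```python
def compact_slab(slab_bytes, index, deleted):
    """Rewrite a slab, dropping records for deleted hashes.

    slab_bytes: the current slab contents
    index: list of (obj_hash, offset, length) tuples
    deleted: set of hashes whose records are now holes
    Returns (new_slab_bytes, new_index)."""
    out = bytearray()
    new_index = []
    for obj_hash, offset, length in index:
        if obj_hash in deleted:
            continue  # skip the hole; it is not copied forward
        record = slab_bytes[offset:offset + length]
        new_index.append((obj_hash, len(out), length))
        out += record
    return bytes(out), new_index
```

In practice the rewrite would go to a new slab file that atomically replaces the old one, so readers never see a half-compacted slab.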
Unexpected side benefits (?)
- global replication might be faster (copy one slab file instead of lots of little files)
- faster ingestion of new drives
- small-file optimization in EC
Why use a slab allocator? Would something else be better? Maybe the trick (or interesting part) is simply in the bookkeeping for what's in the slab (ring buffers, LSM trees (or tries), skip lists, etc). Does the actual media the data is on matter (spinning drives vs flash)? I suspect someone with experience writing memory allocators may have interesting ideas here. Should we be using one or more larger files on a filesystem? Why not talk to the block device itself? At what point do we need to invent a whole filesystem ourselves, and at that point, what benefit do we get over using what's already available to us?
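As one data point for the bookkeeping question: even something as simple as keeping the in-memory index sorted by hash turns lookups into O(log n) binary searches instead of linear scans. A hedged sketch (names and entry shape are assumptions):

```python
import bisect


class SortedSlabIndex:
    """In-memory index of slab entries, kept sorted by hash so
    lookups use bisect (O(log n)) rather than a linear scan.
    A sketch of one bookkeeping option, not a recommendation."""

    def __init__(self):
        self._hashes = []   # sorted list of object hashes
        self._entries = []  # (offset, length), parallel to _hashes

    def add(self, obj_hash, offset, length):
        i = bisect.bisect_left(self._hashes, obj_hash)
        self._hashes.insert(i, obj_hash)
        self._entries.insert(i, (offset, length))

    def lookup(self, obj_hash):
        i = bisect.bisect_left(self._hashes, obj_hash)
        if i < len(self._hashes) and self._hashes[i] == obj_hash:
            return self._entries[i]
        return None
```

Inserts are O(n) here, which is exactly the kind of trade-off that makes LSM trees or skip lists interesting for write-heavy workloads.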
Worth reading / looking at:
- https://github.com/chrislusf/seaweedfs (Haystack implementation)
Want to talk more? Find notmyname on IRC