
= Small file optimization in Swift =

Influences: Haystack, BlueStore, git pack files

One of the big problems with storing each object as a separate file is that this creates a lot of inodes on the drive. If you have small objects in your cluster (common) and big drives (more common every day), then just the inodes and dentries for the XFS partition can exhaust your RAM. Swift tries to keep these things in page cache, but the working set is just too big. This means that:
 * there's a lot of FS metadata overhead for storage
 * anything that has to iterate over each file is *slow*
 * small erasure-coded objects can end up being relatively huge when considering the FS overhead

== Idea ==
In each suffix directory (or partition dir?) keep two FS trees. One is "normal", i.e. the way things are now. The other is for small files and uses a slab file and index system. The slab file is one file on disk that is the concatenated data+metadata of small objects. The index file references each object in the slab by name or hash, along with its offset in the slab.
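A minimal sketch of what the slab/index pair could look like. Everything here is illustrative, not a format proposal: the class name, the JSON sidecar index, and the hash-keyed layout are assumptions; a real implementation would need fsync for durability, checksums, metadata, and a compact binary index.

```python
import json
import os


class SlabFile:
    """Append-only slab plus a sidecar index mapping object hash -> (offset, length).

    Hypothetical sketch: deletes only drop the index entry, leaving a "hole"
    in the slab for a later compaction pass to reclaim.
    """

    def __init__(self, path):
        self.slab_path = path + '.slab'
        self.index_path = path + '.index'
        self.index = {}
        if os.path.exists(self.index_path):
            with open(self.index_path) as f:
                self.index = json.load(f)

    def put(self, name_hash, data):
        # Only ever append to the slab; never overwrite in place.
        with open(self.slab_path, 'ab') as f:
            offset = f.tell()
            f.write(data)
        self.index[name_hash] = [offset, len(data)]
        self._save_index()

    def get(self, name_hash):
        offset, length = self.index[name_hash]
        with open(self.slab_path, 'rb') as f:
            f.seek(offset)
            return f.read(length)

    def delete(self, name_hash):
        # Leaves a hole in the slab; compact() reclaims the space.
        del self.index[name_hash]
        self._save_index()

    def compact(self):
        # Rewrite only the live objects into a fresh slab, dropping holes.
        new_slab = self.slab_path + '.tmp'
        new_index = {}
        with open(new_slab, 'wb') as out:
            for name_hash in list(self.index):
                data = self.get(name_hash)
                new_index[name_hash] = [out.tell(), len(data)]
                out.write(data)
        os.rename(new_slab, self.slab_path)
        self.index = new_index
        self._save_index()

    def _save_index(self):
        with open(self.index_path, 'w') as f:
            json.dump(self.index, f)
```

The append-only write path plus an offline compactor is the same trade Haystack makes: sequential writes at the cost of a background process that chases deleted space.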

== Challenges ==

 * fragmentation or compaction
 * chunked transfer encoding (where content-length isn't known up front)
 * the object server could spool e.g. 1 MB (or whatever "small" is) and, if the whole object arrives in that first read, use a slab; otherwise, fall back to a normal FS file
 * extra disk seek to find slab or flat file
 * finding the right spot in the slab file
 * only append to the slab file and use a compaction process to deal with "holes" for deleted data
 * reconciling diverging slabs during replication
 * do we keep the index files in RAM, or read them on demand? If the former, is that more or less costly than inodes/dentries; if the latter, how big is the penalty for the extra IO?
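The spool-then-decide point above could look roughly like this. `SMALL_FILE_THRESHOLD` and the two `store_*` callables are placeholders, not real Swift object-server APIs; this is only a sketch of the control flow for chunked uploads where Content-Length is unknown.

```python
SMALL_FILE_THRESHOLD = 1024 * 1024  # 1 MB; what counts as "small" would be tunable


def store_upload(body, store_in_slab, store_as_flat_file):
    """Spool up to the threshold; if the whole body fits in the first read,
    the object is a slab candidate, otherwise stream it to a flat file.

    body is a file-like object for the request body (length may be unknown
    with chunked transfer encoding); the two store_* callables are
    hypothetical hooks for the two on-disk layouts.
    """
    # Read one byte past the threshold so we can tell "exactly 1 MB"
    # apart from "more than 1 MB".
    spooled = body.read(SMALL_FILE_THRESHOLD + 1)
    if len(spooled) <= SMALL_FILE_THRESHOLD:
        # Entire object arrived within the first read: small enough for a slab.
        store_in_slab(spooled)
    else:
        # Too big: hand off what we spooled plus the rest of the stream.
        store_as_flat_file(spooled, body)
```

Reading threshold+1 bytes is the cheap trick here: it answers "is this object small?" without needing Content-Length up front.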

== Unexpected side benefits (?) ==

 * global replication might be faster (copy one slab file instead of lots of little files)
 * faster ingestion of new drives
 * small-file optimization in EC

== Alternative Ideas ==
Why use a slab allocator? Would something else be better? Maybe the trick (or interesting part) is simply in the bookkeeping for what's in the slab (ring buffers, LSM trees (or tries), skip lists, etc). Does the actual media the data is on matter (spinning drives vs flash)? I suspect someone with experience writing memory allocators may have interesting ideas here. Should we be using one or more larger files on a filesystem? Why not talk to the block device itself? At what point do we need to invent a whole filesystem ourselves, and at that point, what benefit do we have over using what's available to us already?

== Links ==
Worth reading/looking at:
 * https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Beaver.pdf
 * https://github.com/chrislusf/seaweedfs (Haystack implementation)
 * http://www.ssrc.ucsc.edu/Papers/wang-mss04b.pdf

Want to talk more? Find notmyname on IRC