Swift/Fixing-rebalance-and-golang
symptoms:
- rebalance is slow, especially on dense servers
- unpredictable latency for end-user requests
- hard to monitor and requires a lot of intervention to get out of bad situations (eg cluster full)
problems:
- swift is not in the transport data path for rsync
- too much walking the disk
- poor job scheduling and poor discovery of the work to be done
- eventlet hub can't touch disk
  - mitigation: use lots of processes -- "easy" in python, but the work is hard to coordinate
  - solution: use nonblocking io -- a "hard rewrite", but it solves the problem efficiently
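A minimal sketch of the "eventlet hub can't touch disk" point, as read here: in a cooperative, single-threaded event loop, one blocking disk call stalls every other task. The example uses Python's asyncio as a stand-in for eventlet (an assumption, made only to keep it standard-library), and `slow_disk_read`, `heartbeat`, `measure`, and `ticks_during_read` are hypothetical names. It shows offloading the blocking call to a thread, not the nonblocking-io rewrite.

```python
import asyncio
import time


def slow_disk_read(seconds):
    # Stand-in for a blocking disk call; it never yields to the event loop.
    time.sleep(seconds)
    return b"data"


async def heartbeat(state):
    # Represents every other request being served by the same loop.
    while True:
        state["ticks"] += 1
        await asyncio.sleep(0.005)


async def measure(offload, seconds=0.1):
    state = {"ticks": 0}
    hb = asyncio.create_task(heartbeat(state))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    before = state["ticks"]
    if offload:
        # Run the blocking call in a worker thread; the loop keeps running.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_disk_read, seconds)
    else:
        # Run the blocking call directly; nothing else can be scheduled.
        slow_disk_read(seconds)
    progressed = state["ticks"] - before
    hb.cancel()
    return progressed


def ticks_during_read(offload):
    # Number of heartbeat ticks that happened while the "disk read" ran.
    return asyncio.run(measure(offload))
```

`ticks_during_read(False)` returns 0 because the loop is frozen for the whole read, while `ticks_during_read(True)` should return a positive count because the read runs on a thread.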
things in progress to fix these problems:
- tsync protocol for moving data
  - puts swift in the data path (more efficient than rsync for the actual transport and for writing to disk)
  - uses an external, supported data transport and wire protocol instead of something we invent (http2+grpc vs repconn or ssync)
  - see also https://etherpad.openstack.org/p/swift-rebalance
- better scheduling of work in reconstructor and replicator
  - threads, not eventlet
  - more concurrency == more throughput (up to HW limits)
  - identifying the work to be done (rebuilds vs rebalance; includes backpressure from tsync)
- fix proxy<->storage protocol (can't depend on bespoke features in our current framework)
- a golang object server, to take network data and write it to disk more efficiently
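The scheduling items above (threads instead of eventlet, telling rebuilds apart from rebalance, and backpressure) could be sketched roughly as follows. This is not the actual replicator or reconstructor code: `WorkScheduler`, the `REBUILD`/`REBALANCE` constants, and the choice to run rebuilds before rebalance work are all assumptions made for illustration. Backpressure here is simply a bounded queue, so `submit` blocks (or times out) when workers fall behind.

```python
import itertools
import queue
import threading

# Lower value is served first. Putting rebuilds ahead of rebalance is an
# assumption for this sketch, not a documented swift policy.
REBUILD, REBALANCE, _STOP = 0, 1, 2


class WorkScheduler:
    """Hypothetical thread-based scheduler with priorities and backpressure."""

    def __init__(self, handler, workers=4, max_pending=64):
        self._handler = handler
        # Bounded queue: producers block once max_pending jobs are waiting.
        self._jobs = queue.PriorityQueue(maxsize=max_pending)
        # Sequence numbers keep FIFO order within one priority level.
        self._seq = itertools.count()
        self._threads = [
            threading.Thread(target=self._worker, daemon=True)
            for _ in range(workers)
        ]

    def start(self):
        for t in self._threads:
            t.start()

    def submit(self, kind, partition, timeout=None):
        # Raises queue.Full if the queue stays full for `timeout` seconds.
        self._jobs.put((kind, next(self._seq), partition), timeout=timeout)

    def _worker(self):
        while True:
            kind, _, partition = self._jobs.get()
            if kind == _STOP:
                return
            self._handler(kind, partition)

    def stop(self):
        # Call after start(). Stop markers sort last, so queued work drains first.
        for _ in self._threads:
            self._jobs.put((_STOP, next(self._seq), None))
        for t in self._threads:
            t.join()
```

Raising `workers` is the "more concurrency" knob in this sketch; `max_pending` bounds how far the code that finds work can run ahead of the code that does it.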
how do we get there (subject to change):
0. hummingbird branch is an interesting R&D reference but not going to be merged (done)
1. make replication/reconstruction tolerable to the point that we can make it fast by changing a config value (more workers, more connections, etc) (nearly done)
2. build a better scheduler for consistency engine work
2. build the tsync protocol
now: build a feature-complete golang object server (might or might not borrow from hummingbird)
now: infra/devstack CI work (ie swift consumable in the gate)
now: ask other deployment projects what needs to be done to make them happy with swift as a golang thing (eg kolla, ansible, tripleo, etc)