Difference between revisions of "Swift/Fixing-rebalance-and-golang"
m (Notmyname moved page Fixing-rebalance-and-golang to Swift/Fixing-rebalance-and-golang)
Latest revision as of 17:49, 15 March 2017
symptom:
- rebalance is slow, especially for dense servers
- uncertain latency for end-user requests
- hard to monitor and requires a lot of intervention to get out of bad situations (eg cluster full)
problem:
- swift is not in the transport data path for rsync
- too much walking the disk
- poor job scheduling/finding the work to be done
- eventlet hub can't touch disk (a blocking disk call stalls every green thread on the hub)
  - mitigation: use lots of processes -- "easy" in python but hard to coordinate work
  - solution: use nonblocking io -- "hard rewrite" but efficiently solves the problem
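A minimal sketch of the hub problem, using the stdlib asyncio loop as a stand-in for the eventlet hub (an assumption made for illustration; this is not Swift code). `fake_disk_read` and `heartbeat` are hypothetical names: the first simulates a blocking disk call with a sleep, the second represents other requests sharing the loop. Running the blocking call on the loop starves the heartbeat, while pushing it to a worker thread lets the loop keep scheduling.

```python
import asyncio
import time


def fake_disk_read(seconds):
    """Stand-in for a blocking disk call (hypothetical; it only sleeps)."""
    time.sleep(seconds)
    return "data"


async def heartbeat(counter):
    """Represents other requests on the loop; ticks whenever it is scheduled."""
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(0.01)


async def serve(offload):
    counter = {"ticks": 0}
    beat = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)  # give the heartbeat a chance to start
    loop = asyncio.get_running_loop()
    if offload:
        # The blocking call runs in a worker thread; the loop stays responsive.
        await loop.run_in_executor(None, fake_disk_read, 0.2)
    else:
        # The blocking call runs on the loop itself; nothing else gets scheduled.
        fake_disk_read(0.2)
    beat.cancel()
    return counter["ticks"]


def measure():
    """Return heartbeat tick counts for the blocked and offloaded cases."""
    blocked = asyncio.run(serve(offload=False))
    offloaded = asyncio.run(serve(offload=True))
    return blocked, offloaded
```

Calling `measure()` should show far fewer heartbeat ticks in the blocked case than in the offloaded one. The exact counts depend on machine load.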
things in-progress to fix these problems:
- tsync protocol for data moving
  - puts swift in the data path (more efficient than rsync for the actual transport and writing to disk)
  - use an external and supported data transport and wire protocol instead of something we invent (http2+grpc vs repconn or ssync)
  - see also https://etherpad.openstack.org/p/swift-rebalance
- better scheduling of work in reconstructor and replicator
  - threads not eventlet
  - more concurrency == more faster (up to HW limits)
  - identifying the work to be done (rebuilds vs rebalance; includes backpressure from tsync)
- fix proxy<->storage protocol (can't depend on bespoke features in our current framework)
- golang object server itself to more efficiently take network data and write it to disk
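The scheduling ideas above (threads instead of eventlet, tunable concurrency, rebuilds vs rebalance, backpressure) can be sketched roughly as follows. This is an illustrative toy, not the replicator or reconstructor design: `run_jobs`, the `REBUILD`/`REBALANCE` priorities, and the `workers`/`max_pending` parameters are all hypothetical. A bounded priority queue supplies the backpressure, because the producer blocks whenever the workers fall behind.

```python
import queue
import threading

REBUILD, REBALANCE = 0, 1  # hypothetical priorities; lower is served first


def run_jobs(jobs, workers=2, max_pending=4):
    """Run (priority, name) jobs on `workers` threads; return completion order."""
    # Bounded queue: put() blocks when full, which pushes back on the producer.
    pending = queue.PriorityQueue(maxsize=max_pending)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            _priority, _seq, name = pending.get()
            if name is None:  # sentinel: no more work for this thread
                return
            with lock:
                done.append(name)  # a real worker would move or rebuild data here

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for seq, (priority, name) in enumerate(jobs):
        # seq breaks ties so that names are never compared with each other
        pending.put((priority, seq, name))
    for i in range(workers):
        # sentinels sort after every real job, one per worker
        pending.put((float("inf"), i, None))
    for t in threads:
        t.join()
    return done
```

Among the jobs waiting in the queue at any moment, rebuilds are taken before rebalance work. The overall completion order still depends on thread timing, so it is not deterministic.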
how do we get there (subject to change):
0. hummingbird branch is an interesting R&D reference but not going to be merged (done)
1. make replication/reconstruction tolerable to the point that we can make it fast by changing a config value (more workers, more connections, etc) (nearly done)
2. build a better scheduler for consistency engine work
2. build the tsync protocol
now: build a feature-complete golang object server (might or might not borrow from hummingbird)
now: infra/devstack CI work (ie swift consumable in the gate)
now: ask other deployment projects what needs to be done to make them happy with swift as a golang thing (eg kolla, ansible, tripleo, etc)
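As a rough illustration of step 1 ("make it fast by changing a config value"), the sketch below reads two tunables from a config and sizes a thread pool from one of them. The section name `replicator-sketch` and the option names `workers` and `connections` are made up for this example and are not claimed to match Swift's real config options. The per-partition work is a placeholder.

```python
import configparser
from concurrent.futures import ThreadPoolExecutor

# Hypothetical config text; section and option names are illustrative only.
SAMPLE_CONF = """
[replicator-sketch]
workers = 4
connections = 8
"""


def load_tunables(text):
    """Return (workers, connections), falling back to 1 when an option is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = "replicator-sketch"
    workers = parser.getint(section, "workers", fallback=1)
    connections = parser.getint(section, "connections", fallback=1)
    return workers, connections


def replicate_all(partitions, text=SAMPLE_CONF):
    """Process partitions with as many threads as the config allows."""
    workers, _connections = load_tunables(text)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Placeholder work: a real job would sync the partition to its peers.
        return list(pool.map(lambda p: "synced-%s" % p, partitions))
```

With this shape, raising throughput means editing one number in the config rather than changing code, which is the property step 1 is after.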