At the Vancouver summit we held an ops working group session on technology choices, seeking to frame and quantify operator concerns about them. The outcome is not yet solid enough to form a hard and fast policy, but it is clear enough to serve as a useful first-pass filter on proposed choices. Ultimately, of course, the willingness of operators en masse to deploy any given thing is the real test a technology choice must pass.
The discussion around choices is somewhat complicated by both the big tent - projects with niche needs may have different tradeoffs to make than e.g. nova or magnum - and pluggable backends to projects like oslo.db and oslo.messaging, where many different backends with different tradeoffs are possible.
However, we formed the following heuristic: the first backend (for pluggable things) needs to meet the criteria we came up with; this doesn't preclude other backends which operators could opt into based on their own needs and tolerances. The first backend, though, needs to be production ready and suitable for the same scale as Nova or Swift (or other very mature components).
The basic checklist
One thing that emerged during the discussion was that we're often not clear about the requirements: the CAP theorem states that there is a tradeoff to be made - and, for example, Ceph makes a different tradeoff to Swift. So we propose the following be gathered when examining any choice:
- what scale is the choice expected to support: 1-5 nodes, < hundreds, < thousands, < 10-thousands
- what behaviour is being designed for: consistency or availability
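As a rough illustration, the two checklist items above could be recorded in a small structure so that every proposal answers them explicitly. This is only a sketch: the class names, field names and the `example-backend` entry are hypothetical, not anything agreed in the session.

```python
from dataclasses import dataclass
from enum import Enum


class Scale(Enum):
    # Buckets copied from the checklist; the enum names are made up here.
    NODES_1_TO_5 = "1-5 nodes"
    UNDER_HUNDREDS = "< hundreds"
    UNDER_THOUSANDS = "< thousands"
    UNDER_TEN_THOUSANDS = "< 10-thousands"


class DesignedFor(Enum):
    # Which side of the tradeoff the choice is designed to favour.
    CONSISTENCY = "consistency"
    AVAILABILITY = "availability"


@dataclass(frozen=True)
class ChoiceRequirements:
    """Answers to the basic checklist for one proposed technology choice."""
    name: str
    expected_scale: Scale
    designed_for: DesignedFor

    def summary(self) -> str:
        return (f"{self.name}: scale {self.expected_scale.value}, "
                f"designed for {self.designed_for.value}")


# Hypothetical proposal, purely for illustration.
example = ChoiceRequirements("example-backend",
                             Scale.UNDER_THOUSANDS,
                             DesignedFor.AVAILABILITY)
print(example.summary())
```

Using enums means a proposal cannot leave either question unanswered or answer it with free-form text.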
Security was a major concern. The fact that we're failing to test TLS configurations in the gate was considered a weak link in the current security story, but not one we should carry forward.
We summarised the needs as:
- must support TLS or equivalent - end to end crypto - for any network surfaces
- no requirement for encryption at rest (as providers tend to just do full disk encryption)
- multi-tenancy ready (presumably except for things which are entirely isolated from users)
Documentation was the second major concern. Being able to operate a new thing is crucial. We need documentation for scaling, high availability and hardening. The thing, whatever it is, needs to document its behaviour under partitions and how it scales (e.g. 'one spinning disk needed per hundred nodes generating metrics'), along with tricks, app tuning, related OS tuning, etc.
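A documented scaling rule of that kind lets an operator do the sizing arithmetic directly. The sketch below uses the 'one spinning disk per hundred nodes generating metrics' figure, which was only an example of the style of documentation wanted, not a measured number; the function name and its default are hypothetical.

```python
def disks_needed(metric_nodes: int, nodes_per_disk: int = 100) -> int:
    """Disks required for a given number of metric-generating nodes.

    Rounds up, since a partly used disk still has to be provisioned.
    The default ratio is the illustrative figure quoted in the text.
    """
    if metric_nodes < 0 or nodes_per_disk <= 0:
        raise ValueError("node count must be >= 0 and ratio must be > 0")
    # Ceiling division using integers only.
    return -(-metric_nodes // nodes_per_disk)


for nodes in (50, 100, 101, 2500):
    print(nodes, "nodes ->", disks_needed(nodes), "disk(s)")
```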
Availability was perhaps the third hottest point: a clear understanding of the behaviour of systems is crucial for operation at scale. How will it behave when partitioned? Does it offer consistency or availability? What other sorts of failures can cause a lack of availability?
The main concern here was being able to predict the investment needed to deploy at a given scale (at least with a reasonable margin of error). Being super-efficient isn't a goal in-and-of-itself, but it should be feasible to run components at any size of cloud that OpenStack can deploy today ... where that makes sense.
Things we choose to use need to be able to scale to the size of clouds that OpenStack can deliver today.
For instance, components would need to scale to sizes such as:
- something that would deploy within cells only needs to scale to the size of a cell
- things that operate across an entire region, or global clusters
Where the identified need is for small scale, that is clearly ok. The discussion around that implied things like simple-to-operate backends for test clouds, vs. harder-to-operate ones that need a larger scale.
There were two key requirements identified here. Firstly, the thing must be freely redistributable: OpenStack has many distributors in its ecosystem, and they must be able to distribute something to their users. Secondly, the thing must be F/LOSS - it's not acceptable to require proprietary things.
On particular licenses:
- GPLv3 is tricky but doable for everyone in the room
- AGPL gives some folk headaches but it seemed everyone in the room had made their peace with it
- Apache2/MIT/BSD are no-brainers
- custom licenses are hard, please don't do that.
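As a sketch of how a first-pass filter might encode these views, the lookup below maps license identifiers to the sentiment expressed in the room. The use of SPDX-style identifiers, the choice of which BSD and GPL variants to list, and the wording of each verdict are all assumptions made for illustration; anything not listed was not discussed in the session.

```python
# Sentiment from the session, keyed by SPDX-style identifiers (an assumption;
# the session only named the license families).
SESSION_VIEW = {
    "Apache-2.0": "no-brainer",
    "MIT": "no-brainer",
    "BSD-2-Clause": "no-brainer",
    "BSD-3-Clause": "no-brainer",
    "GPL-3.0-only": "tricky but doable",
    "AGPL-3.0-only": "headaches for some, but accepted",
}


def session_view(license_id: str) -> str:
    """Return the room's view of a license, or flag it for review."""
    # Unlisted licenses were not discussed; custom ones were called hard.
    return SESSION_VIEW.get(license_id, "unlisted or custom: needs review")


for lic in ("Apache-2.0", "AGPL-3.0-only", "Some-Custom-License"):
    print(lic, "->", session_view(lic))
```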
This obviously depends on what the component is, but since OpenStack currently is deployed on AMD64, ARM64, Power, (and possibly i386 still somewhere.. but no-one cared :)), bringing in a new component that doesn't (or isn't nearly-ready-to) support those architectures would be a step backwards.
This was a relatively minor concern. Something that was new but meets all the other criteria is probably still ok.
The big thing identified in the session was a need for some operator-involved change control process around mandatory backends / services etc. Some examples that were called out were: the stability of extension points (such as Django custom panels, which are now being superseded by AngularJS; no-one knew the deprecation story there), and the identification of extension points - where operators should hook in.
Part of this is a desire to avoid proliferation of databases (both SQL and NoSQL). Anything that needs backup or locking down for good security is additional work. Monitoring outside of 'make sure process X is running' is also complex.
We flagged that the community needs to have a discussion and set some scalability goals: lifeless will take that up with the TC.
Mongo got an honorable mention: "We are investigating moving away from MongoDB due to the following reasons:
- Makes service tougher to deploy + manage (replication / sharding / DR)
- Licensing concerns
- HA possible - but adds additional licensing costs
On the pros side, operating it has been relatively straightforward. (performance, availability, data maintenance ..)."
App catalog interaction
There is an open question, "Can we leverage portions of any checklists being created for the app catalog?", that we didn't have the knowledge in the room to address.