I'm gathering notes and thoughts about our big picture architecture at the moment, seeking to understand how we ended up with things like rpc-over-rabbit (which is a really bad fit for RPC), or the nova-compute -> neutron-api-> neutron-ovs-agent -> nova-api -> nova-compute VIF plugging weirdness.
It's a group effort - current collaborators [Joe Gordon, Gus Lees, Matthew Treinish, Joshua Harlow]. Ideally we'd have someone familiar with all the early projects and their split-outs to aid with insight.... drop me a mail if you're interested! [or just dive in and annotate these thoughts].
I'm not using etherpad because its too hard to track deltas there - its great for realtime collaboration, not so much for evolving over weeks/months.
Goals of an architecture
IMO an architecture isn't a control mechanism - its a planning tool: it makes the context for decisions explicit, and articulates the broad principles involved so that we can all be discussing our decisions in a shared context. I wrote a presentation when I was back at Canonical that aimed to do that for Launchpad (part of the LP Architecture Guide - its not perfect and I'd do things a little differently now, but I think its also a pretty good model: in short every developer is making architectural decisions, and its only through shared understandings that we can effectively raise the bar on quality and consistency -> enforcement is way to hard, a control based strategy will inevitable fail (usually by not having enough control resources).
A good architecture needs then to be rooted in the desires and needs our users have for OpenStack, needs to explain what structures and designs will help us deliver those user concerns effectively, and needs to be updated as the world evolves. It needs to act as a blueprint a level above that of the design for any specific component or feature. It needs to help us choose between concerns like robustness and performance. Questions like the use of ACID or BASE in our data storage design can only be answered when we have broad goals like 'support 1000 API requests/second in a single cell without specialist servers' - and so that ties back into our understanding of our users needs.
What do our users want
- User survey 2014: "Specifically drawing out the comments around neutron, over 61% of them were general concerns, including performance, stability and ease of use.High Availability ranked second (11%), with SDN use cases and IPv6 requirements following (7%) ... On the more technical side, the modular architecture was seen to provide a big advantage. Flexible, fixable, self-service, extensible, modifiable, adaptable, hardware and vendor agnostic, interoperable were key words commonly sighted.
The API becoming a defacto standard and the good user experience were also positively mentioned."
What does our architecture say today?
TBH, I can't find it. There used to be some really good basic principles, somewhere on the wiki :( The spec review
What should it say?
- The basic characteristics we want our code to have, along with their relative importance. E.g. robust, fast, scale-out, secure, extensible
- Concrete patterns (and anti-patterns) we can follow that will help deliver such characteristics
- Ways in which we can assess a project / component / feature to see if it is aligned with those characteristics
- WHY each of these things is important / relevant / chosen - so that we can update it in future without repeating ourselves
What should it not say?
- Use component X - OpenStack values flexability and a broad ecosystem. Requiring specific components at the very highest level doesn't fit with our community. Testing concerns aside, keeping that flexability is broadly good - but care needs to be applied to avoid bad abstractions that don't fit our needs.
Inspiration thats worth reading
- Release It! - little dated and most examples in Java, but the principles are sound and if you're mainly dev, not ops, it will be invaluable.
- http://12factor.net/ - great resource for cloudy apps, much of which is relevant to OpenStack's API services themselves.
- https://plus.google.com/+RipRowan/posts/eVeouesvaVX and https://plus.google.com/110981030061712822816/posts/AaygmbzVeRq - Steve Yegge on platforms and design. Highly entertaining.
Grab bag of ideas to follow up
These ideas are not yet *very* categorised or thought through - caveat emptor.
* Pervasive name service * Pervasive *service-only for now* availability/liveness service (should be no polling or periodic database writes with information about liveness) [JH] * HA coordinator (that automatically can resume work unfinished, using WAL or other...) (is heat + convergence becoming this?) this HA coordinator should be easily shardable to scale out (or it should be easily able to acquire work to do from some pool, which will autorelease back to the pool if it crashes, to be resumed by another HA coordinator) [JH] * Capability monitor/service (knows capabilities of resources in cloud) - likely tied to scheduler service (but may not be) [JH] * Resource reservation before allocation (not during, or piecemeal); build a manifest in HA coordinator; reserve all resources in manifest; then allocate/build resources in manifest; then give over control of those allocated/built resources to user (in that order and iff allocation/building succeeds) --- all of this is done in the HA coordinator (which uses API's of downstream services to as needed, those API's should be simple and avoid doing complicated operations, as that is the HA coordinators job to do complex things) -- have each downstream service 'do one thing well' (and leave complexity elsewhere for tying things together) [JH] * Scheduler service (likely connected to capability service in some manner); we should encourage experimentation/research/optimization here [JH]
* Protobufs rather than home-brew * Direct RPC rather than rabbit * WAL journalling to survive restarts * No singletons (oslo.config...) * Secured/verifiable/signed (something...) RPC messages (it's taken to long...) [JH] * Systematic tracing (osprofiler or other...) * Debuggability and operations are primary features * Leave no resource left behind (don't leave garbage for others to cleanup later) [JH] * Versioning (and introspectable) all the things (by default) [JH]