Lifeless/ArchitectureThoughts

I'm gathering notes and thoughts about our big picture architecture at the moment, seeking to understand how we ended up with things like rpc-over-rabbit (which is a really bad fit for RPC), or the nova-compute -> neutron-api -> neutron-ovs-agent -> nova-api -> nova-compute VIF plugging weirdness.

It's a group effort - current collaborators: Joe Gordon, Gus Lees, Matthew Treinish. Ideally we'd have someone familiar with all the early projects and their split-outs to aid with insight... drop me a mail if you're interested! [or just dive in and annotate these thoughts].

I'm not using etherpad because it's too hard to track deltas there - it's great for realtime collaboration, not so much for evolving over weeks/months.

Goals of an architecture

IMO an architecture isn't a control mechanism - it's a planning tool: it makes the context for decisions explicit, and articulates the broad principles involved so that we can all be discussing our decisions in a shared context. I wrote a presentation when I was back at Canonical that aimed to do that for Launchpad (part of the LP Architecture Guide) - it's not perfect and I'd do things a little differently now, but I think it's also a pretty good model. In short: every developer is making architectural decisions, and it's only through shared understandings that we can effectively raise the bar on quality and consistency. Enforcement is way too hard; a control-based strategy will inevitably fail (usually by not having enough control resources).

A good architecture then needs to be rooted in the desires and needs our users have for OpenStack, needs to explain what structures and designs will help us address those user concerns effectively, and needs to be updated as the world evolves. It needs to act as a blueprint a level above that of the design for any specific component or feature. It needs to help us choose between concerns like robustness and performance. Questions like the use of ACID or BASE in our data storage design can only be answered when we have broad goals like 'support 1000 API requests/second in a single cell without specialist servers' - and so that ties back into our understanding of our users' needs.

What do our users want

User survey 2014: "Specifically drawing out the comments around neutron, over 61% of them were general concerns, including performance, stability and ease of use. High Availability ranked second (11%), with SDN use cases and IPv6 requirements following (7%) ... On the more technical side, the modular architecture was seen to provide a big advantage. Flexible, fixable, self-service, extensible, modifiable, adaptable, hardware and vendor agnostic, interoperable were key words commonly sighted.

The API becoming a defacto standard and the good user experience were also positively mentioned."

What does our architecture say today?

TBH, I can't find it. There used to be some really good basic principles, somewhere on the wiki :( The spec review

What should it say?

* The basic characteristics we want our code to have, along with their relative importance. E.g. robust, fast, scale-out, secure, extensible
* Concrete patterns (and anti-patterns) we can follow that will help deliver such characteristics
* Ways in which we can assess a project / component / feature to see if it is aligned with those characteristics
* WHY each of these things is important / relevant / chosen - so that we can update it in future without repeating ourselves

What should it not say?

Use component X - OpenStack values flexibility and a broad ecosystem. Requiring specific components at the very highest level doesn't fit with our community. Testing concerns aside, keeping that flexibility is broadly good - but care needs to be applied to avoid bad abstractions that don't fit our needs.

Grab bag of ideas to follow up

These ideas are not yet categorised or thought through - caveat emptor.

* Protobufs rather than home-brew
* Direct RPC rather than rabbit
* Pervasive name service 
* WAL journalling to survive restarts
* No singletons (oslo.config...)
* Systematic tracing
* Debuggability and operations are primary features
* HA coordinator (that automatically can resume work unfinished, using WAL above or other...) [JH]
* Capability monitor/service (knows capabilities of resources in cloud) - likely tied to scheduler service (but may not be) [JH]
* Resource reservation before allocation (not during, or piecemeal); build a manifest in HA coordination; reserve all resources in manifest; then allocate/build resources in manifest; then give over control of those allocated/built resources to user (in that order and iff allocation/building succeeds) [JH]
* Scheduler service (likely connected to capability service in some manner); we should encourage experimentation/research/optimization here [JH]
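
To make the "resource reservation before allocation" idea above a bit more concrete, here is a minimal Python sketch of the ordering it describes: build a manifest, reserve everything in it, and only then allocate and hand over. All names here (Pool, ReservationError, fulfil) are hypothetical and purely illustrative, not existing OpenStack code. The sketch assumes a single process with in-memory pools, and it assumes allocation cannot fail once a reservation is held; a real HA coordinator would need to persist these steps (e.g. via the WAL idea) and roll back failed builds too.

```python
from dataclasses import dataclass


class ReservationError(Exception):
    """Raised when an entry in the manifest cannot be reserved."""


@dataclass
class Pool:
    """Hypothetical resource pool, e.g. vCPUs on a host or ports on a network."""
    name: str
    capacity: int
    reserved: int = 0
    allocated: int = 0

    def reserve(self, amount):
        # A reservation only succeeds if it fits alongside everything
        # already reserved or allocated from this pool.
        if self.reserved + self.allocated + amount > self.capacity:
            raise ReservationError(f"{self.name}: cannot reserve {amount}")
        self.reserved += amount

    def release(self, amount):
        # Give back a reservation that was never turned into an allocation.
        self.reserved -= amount

    def allocate(self, amount):
        # Convert an existing reservation into a real allocation.
        self.reserved -= amount
        self.allocated += amount


def fulfil(manifest, pools):
    """Reserve every entry in the manifest, then allocate, then hand over.

    manifest maps pool name -> amount wanted. Returns the allocated
    manifest on success, or None (with nothing held) on failure.
    """
    held = []
    try:
        # Step 1: reserve all resources up front, not piecemeal.
        for name, amount in manifest.items():
            pool = pools.get(name)
            if pool is None:
                raise ReservationError(f"unknown pool {name}")
            pool.reserve(amount)
            held.append((pool, amount))
    except ReservationError:
        # Any failure releases everything reserved so far.
        for pool, amount in reversed(held):
            pool.release(amount)
        return None

    # Step 2: allocate/build only once the whole manifest is reserved.
    for pool, amount in held:
        pool.allocate(amount)

    # Step 3: hand control of the allocated resources to the user.
    return dict(manifest)
```

For example, with pools {"vcpu": Pool("vcpu", 4), "port": Pool("port", 1)}, a first request for 2 vCPUs and 1 port succeeds, while an identical second request fails on the port and leaves no vCPU reservation behind.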