Lifeless/ArchitectureThoughts

I'm gathering notes and thoughts about our big picture architecture at the moment, seeking to understand how we ended up with things like rpc-over-rabbit (which is a really bad fit for RPC), or the nova-compute -> neutron-api -> neutron-ovs-agent -> nova-api -> nova-compute VIF plugging weirdness. More importantly I want to help OpenStack deliver features more quickly, be more resilient and robust, and perform better, all at once :). I've done this before with Launchpad, but OpenStack is a rather bigger project :).

It's a group effort - current collaborators [Joe Gordon, Gus Lees, Matthew Treinish, Joshua Harlow]. Ideally we'd have someone familiar with all the early projects and their split-outs to aid with insight - drop me a mail if you're interested! [or just dive in and annotate these thoughts].

I'm not using etherpad because it's too hard to track deltas there - it's great for realtime collaboration, not so much for evolving a document over weeks/months.

Goals of an architecture
IMO an architecture isn't a control mechanism - it's a planning tool: it makes the context for decisions explicit, and articulates the broad principles involved so that we can all discuss our decisions in a shared context. I wrote a presentation back when I was at Canonical that aimed to do that for Launchpad (part of the LP Architecture Guide). It's not perfect and I'd do things a little differently now, but I think it's also a pretty good model. In short: every developer is making architectural decisions, and it's only through shared understanding that we can effectively raise the bar on quality and consistency. Enforcement is way too hard - a control-based strategy will inevitably fail (usually by not having enough control resources).

A good architecture needs, then, to be rooted in the desires and needs our users have for OpenStack, needs to explain what structures and designs will help us deliver on those user concerns effectively, and needs to be updated as the world evolves. It needs to act as a blueprint a level above that of the design for any specific component or feature. It needs to help us choose between concerns like robustness and performance. Questions like the use of ACID or BASE in our data storage design can only be answered when we have broad goals like 'support 1000 API requests/second in a single cell without specialist servers' - and so that ties back into our understanding of our users' needs.

What do our users want

 * User survey 2014: "Specifically drawing out the comments around neutron, over 61% of them were general concerns, including performance, stability and ease of use. High Availability ranked second (11%), with SDN use cases and IPv6 requirements following (7%) ... On the more technical side, the modular architecture was seen to provide a big advantage. Flexible, fixable, self-service, extensible, modifiable, adaptable, hardware and vendor agnostic, interoperable were key words commonly cited.

The API becoming a defacto standard and the good user experience were also positively mentioned."

What does our architecture say today?
We have some top level tenets - BasicDesignTenets - but there's no supporting material about the tradeoffs involved, and we've not followed the tenets in a lot of recent code. Further, with the split-out of teams, the existing guidance doesn't address the impact of integration points at all. We have hacking guidelines (http://docs.openstack.org/developer/hacking/, https://wiki.openstack.org/wiki/CodingStandards) but they don't address many important issues, such as working with external resources, or data integrity in a distributed system (e.g. cinder volumes and nova VMs, or ports and VMs).

What should it say?

 * The basic characteristics we want our code to have, along with their relative importance. E.g. robust, fast, scale-out, secure, extensible
 * Concrete patterns (and anti-patterns) we can follow that will help deliver such characteristics
 * Ways in which we can assess a project / component / feature to see if it is aligned with those characteristics
 * WHY each of these things is important / relevant / chosen - so that we can update it in future without repeating ourselves

What should it not say?

 * Use component X - OpenStack values flexibility and a broad ecosystem. Requiring specific components at the very highest level doesn't fit with our community. Testing concerns aside, keeping that flexibility is broadly good - but care needs to be applied to avoid bad abstractions that don't fit our needs.

Inspiration that's worth reading

 * Release It! - a little dated, and most examples are in Java, but the principles are sound; if you're mainly dev, not ops, it will be invaluable.
 * http://12factor.net/ - great resource for cloudy apps, much of which is relevant to OpenStack's API services themselves.
 * https://plus.google.com/+RipRowan/posts/eVeouesvaVX and https://plus.google.com/110981030061712822816/posts/AaygmbzVeRq - Steve Yegge on platforms and design. Highly entertaining.

Process for building an architecture
Read our code and our bugs, think hard, discuss with the greybeards in our projects, write it down.

Data gathering

 * Systematic review of nova bugs: This is a review of 100 fairly recent nova bugs - all bugs filed between two endpoints, except for duplicates.

Grab bag of ideas to follow up
These ideas are not yet *very* categorised or thought through - caveat emptor.

structural concepts
These are structural concepts we might possibly bring in (with what that means varying per concept) to make the system more resilient/dynamic/robust.

 * Permits real-time adjustment to changing deployments (JH: application configuration as well?)
 * May bootstrap via static configuration.
 * Permits scaling out individual hotspots.
 * Avoids restarting services and the attendant burst of work when a component has failed.
 * Deals with power failures, bugs, and fat-fingered admins gracefully.
 * May need to be node-local or centralised, or some combination thereof.
 * Different from dealing with crashes, because non-crash situations can interrupt things - e.g. a very slow migration combined with a security update being rolled out.
 * Allows scaling within a single node cleanly (e.g. servers with 4K CPUs may need multiple nova-compute processes running to deal with load).
 * The designed use case (and they are very good at this).
 * The exception being daemons whose job *is* state storage [swift and cinder have daemons with this job].
 * Avoid unnecessary network traffic.
 * Deal with the reality that networks and datacentres are fragile.
 * Encrypt and sign all network traffic by default; require opt-out.
 * How does one reproduce a failure?
 * Do we dump enough diagnostic details?
 * Can the request path through the DC be traced?
 * Deploying new processes isn't atomic.
 * Include configuration data.
 * Use a name service rather than static configuration.
 * Single purpose services. E.g. quota, service status, ...
 * No single-instance daemons: integrate with a quorum service.
 * Crash-safe processes: a crash at any point must not unrecoverably leak resources *or* cause expensive reconciliations
 * Persist in-progress non-idempotent work using some form of WAL.
 * Use direct RPC calls - process to process
 * Use message bus for fire-and-forget operations.
 * Stateless daemons: if a node dies, another machine can take over immediately without needing an expensive warm-up period or access to information that was only held by the node that died
 * Keep work close to the data where possible
 * Build timeouts and load thresholds into everything
 * Assume the network is hostile
 * Design and implement with debugging and operations as key use cases
 * All interfaces with other processes versioned
 * Try to structure things so that mistakes in the use of some data field or code raise an error rather than silently doing the wrong action. For instance, the Nova VM state of (DELETED, RUNNING) isn't valid, and updates that would create that situation should error.
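The "error rather than doing the wrong action" point can be sketched as a whitelist of valid state pairs enforced at the update path. This is an illustrative sketch only - the state names, the `VALID_STATES` set, and `set_state` are invented for this example, not Nova's actual state model:

```python
# Sketch: reject invalid (vm_state, task_state) combinations at write
# time instead of silently persisting them.
class InvalidStateError(Exception):
    """Raised when an update would create an impossible state pair."""


# Hypothetical whitelist of valid (vm_state, task_state) pairs.
VALID_STATES = {
    ("RUNNING", None),
    ("RUNNING", "MIGRATING"),
    ("DELETED", None),  # a deleted VM cannot have an active task
}


def set_state(vm, vm_state, task_state=None):
    # Validate before mutating, so a bad caller errors loudly rather
    # than leaving the record in a state like (DELETED, RUNNING).
    if (vm_state, task_state) not in VALID_STATES:
        raise InvalidStateError(
            "refusing invalid state (%s, %s)" % (vm_state, task_state))
    vm["vm_state"] = vm_state
    vm["task_state"] = task_state
```

The same check could live in a database constraint or an object-layer validation hook; the key property is that the invalid combination can never be stored.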

Concrete things we might work on
These are specific projects that would shift our design towards some of the structural things above.

 * https://review.openstack.org/#/c/157596/ single purpose service, scaled out
   -- Not sure that this is actually needed; it seems to me that it's shorthand for a number of key structural changes which I've tried to capture above. In particular a single big coordinator runs the risk of centralising all our logic into one big ball of twine. Some things like live migration may work better with a third service coordinating, but since we have to have a point-to-point link anyway, we don't seem to gain a lot. Any single location could fail and we need to pick the work up again. If either of the computes has failed, we have to get it back up again to resume that local work.
   -- JH: understood and you are probably right, although 'one big ball of twine' could be anything developed by anyone (it seems we are already pretty good in OpenStack at developing twine, haha); we control the ability to make big balls of twine, so I'd hope we could do it correctly (and not create said big one). Most of the projects just manipulate resources - reserving them, preparing them, returning them back to the user - so it sorta feels odd that we have so many projects that do the same thing (with different types of resources). If we imagined the coordination/smarts of that manipulation was in some coordinator, then the projects just become the nice driver API (or something similar) that is exposed; perhaps this isn't reality/possible (likely isn't) but it's a nice thought :-P
   -- Gantt is heading in this direction I think; good case of a single purpose service
   -- I think this would be a fine thing to do. I don't think it has any systemic lessons though - can we generalise it?
   -- JH: will think of some way to generalise it (the concept I guess is to reserve as much as you can across services before doing much else, instead of having disparate services where you reserve something at one place, do something, then send to the next service, which reserves some more stuff, and so on - making the whole reservation workflow sorta wonky/hard to figure out/understand). -- Gantt.
 * https://github.com/grpc/grpc/
 * migrations / resize - any non-idempotent operation
   * https://review.openstack.org/#/c/147879/ (is one such approach)
   * nova scheduler
   * are there others left? Perhaps some cinder bits? (JH: cinder-volume-manager is still reliant on file-locks and can't be scaled out to 1+ managers at the current time)
 * Nova VM_state+task_state validation layer.
 * Nova task_state enforcement - preventing leaked task state.
 * Pervasive *service-only for now* availability/liveness service (should be no polling or periodic database writes with information about liveness) [JH]
 * HA coordinator (that automatically can resume work unfinished, using WAL or other...) (is heat + convergence becoming this?) this HA coordinator should be easily shardable to scale out (or it should be easily able to acquire work to do from some pool, which will autorelease back to the pool if it crashes, to be resumed by another HA coordinator) [JH]
 * Capability monitor/service (knows capabilities of resources in cloud) - likely tied to scheduler service (but may not be) [JH]
 * Resource reservation before allocation (not during, or piecemeal): build a manifest in the HA coordinator; reserve all resources in the manifest; then allocate/build the resources in the manifest; then hand control of those allocated/built resources over to the user (in that order, and iff allocation/building succeeds) --- all of this is done in the HA coordinator (which uses the APIs of downstream services as needed; those APIs should be simple and avoid doing complicated operations, as doing complex things is the HA coordinator's job) -- have each downstream service 'do one thing well' (and leave the complexity of tying things together elsewhere) [JH]
 * Scheduler service (likely connected to capability service in some manner); we should encourage experimentation/research/optimization here [JH]
 * Protobufs for RPC rather than home-brew - still need a layer above to consolidate domain code.
 * Implement a direct RPC facility - perhaps building on the 0mq layer in oslo.messaging, perhaps a new backend with e.g. HTTP+protobufs
 * WAL journalling of local work to survive restarts
 * No singletons
 * Secured/verifiable/signed (something...) RPC messages (it's taken too long...) [JH]
 * Systematic tracing (osprofiler or other...)
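The WAL-journalling item above (and the "persist in-progress non-idempotent work" structural concept) can be sketched very minimally: record the intent durably before doing the work, then replay the journal on restart. This is an illustrative sketch under invented names (`Journal`, `apply_step`), not a proposal for a specific library; a real implementation would also checkpoint and truncate the log as steps complete:

```python
# Minimal write-ahead-log sketch: a crash between log() and the actual
# work leaves a durable record that replay() can re-drive on restart.
import json
import os


class Journal:
    def __init__(self, journal_path):
        self.journal_path = journal_path

    def log(self, entry):
        # Append the intent and force it to disk *before* the
        # non-idempotent work is attempted.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self, apply_step):
        # On restart, re-drive every journalled step. apply_step must
        # tolerate steps that already completed before the crash
        # (i.e. be idempotent itself), since we don't know how far
        # the previous process got.
        if not os.path.exists(self.journal_path):
            return 0
        count = 0
        with open(self.journal_path) as f:
            for line in f:
                apply_step(json.loads(line))
                count += 1
        return count
```

The fsync-before-acting ordering is what makes the process crash-safe in the sense used above: at any crash point, the work is either not yet journalled (and so was never started) or journalled and recoverable.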