Heat/Blueprints/Convergence

'''THIS HAS BEEN MOVED TO https://review.openstack.org/95907 please make all comments there. THANK YOU'''

Summary
Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure. Heat should be resilient to such failures and allow concurrent operations on stacks of any size.

Release Note
TBD

Rationale

 * stacks that fail during creation / update
 * stacks where resources have (silently) stopped working - either disappearing or entering an error state of some sort (e.g. a loadbalancer that isn't forwarding traffic, or a nova instance in ERROR state)
 * Heat engines are also noisy:
   * they get restarted when servers need to get updated
   * they may fail due to hardware or network failure (see under hypervisor failure)
   * Heat engine failures show up as a _FAILED stack, but also leave a lock in place that prevents other operations
     * This is a bug if so; if the engine working on a stack has failed then we are supposed to steal its lock.
 * Large stacks exceed the capacity of a single heat-engine process to update / manage efficiently.
   * Large clusters - e.g. 10K VMs - should be directly usable
 * Stack updates lock state until the entire stack has converged again, which prevents admins from making changes until the update has completed
   * This makes it hard/impossible to do autoscaling, as autoscaling decisions may be more frequent than the completion time of each event
     * Huh? Why would you make a controller that makes decisions so frequently that it does not have time to observe the effects of one decision before making the next?
   * Large admin teams are forced to use an external coordination service to ensure they don't do expensive updates except when there is scheduled time
   * Reacting to emergencies is problematic

User Stories

 * Users should only need to intervene with a stack when there is no right action that Heat can take to deliver the current template+environment+parameters. E.g. if a cinder volume attached to a non-scaling-group resource goes offline, that requires administrative intervention -> STACK_FAILED
   * Examples that this would handle without intervention
     * nova instances that never reach ACTIVE
     * neutron ports that aren't reachable
     * Servers in a scaling group that disappear / go to ERROR in the nova api
   * Examples that may need intervention
     * servers that are not in a scaling group which go to ERROR after running for a while or just disappear
     * Scaling groups that drop below a specified minimum due to servers erroring/disappearing.


 * Heat users can expect Heat to bring a stack into line with the template+parameters even if the world around it changes after STACK_READY - e.g. due to a server being deleted by the user.


 * Operators should not need to manually wait-or-prepare heat engines for maintenance: assume crash/shutdown/failure will happen and have that be seamless to the user.
   * Stacks that are being updated must not be broken / interrupted in a user-visible way due to a heat engine reboot/restart/redeploy.


 * Users should be able to deploy stacks that scale to the size of the backend storage engine - e.g. we should be able to handle a million resources in a single heat stack (a somewhat arbitrary number as a target).


 * Users need to be able to tell heat their desired template+parameters at any time, not just when heat believes the stack is 'READY'.
   * Autoscaling is a special case of 'user' here in that it tunes the sizes of groups but otherwise is identical to a user.
   * Admins reacting to problematic situations may well need to make 'overlapping' changes in rapid fire.


 * Users deploying stacks in excess of 10K instances (and thus perhaps 50K resources) should expect Heat to deploy and update said stacks quickly and gracefully, given appropriate cloud capacity.


 * Existing stacks should continue to function. "We don't break user-space".

Assumptions
We assume the stack-wide lock is a side-effect of multi-engine, and not a desired feature. We almost certainly want the ability to make fine-grained changes to stacks - e.g. to change the scaling parameters of a resource without submitting a full new template+parameters. This is an orthogonal problem, but much easier to solve once the work in this specification is completed. In the interim it may be desirable to permit an explicit 'I want the locking-until-READY behaviour' for teams / systems that may tread on each other's toes.

Design

 * move from using in-process-polling to observe resource state, to an observe-and-notify approach
 * move from a call-stack implementation to a continual-convergence implementation, triggered by change notification
 * run each individual convergence step via taskflow on a distributed set of workers
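
The three design points above can be illustrated with a minimal sketch. This is not Heat code and it does not use taskflow: the names desired, observed, notify, converge_step, worker and run_demo are all hypothetical, and an in-process queue with two threads stands in for a distributed set of workers. It only shows the shape of the idea: an observer reports a state change, the notification queues a small idempotent step, and any worker can compare the goal state with the observed state and repair the difference.

```python
import queue
import threading

# Hypothetical in-memory stand-ins; none of these names come from Heat.
desired = {"server-1": "ACTIVE", "server-2": "ACTIVE"}   # goal state (from the template)
observed = {"server-1": "ACTIVE", "server-2": "ACTIVE"}  # last observed real state
state_lock = threading.Lock()
work = queue.Queue()  # assumption: stands in for a distributed job queue


def notify(resource, new_state):
    """Observer callback: record what was seen and queue a convergence check."""
    with state_lock:
        observed[resource] = new_state
    work.put(resource)


def converge_step(resource):
    """One small, idempotent step: act only if reality differs from the goal."""
    with state_lock:
        goal = desired.get(resource)
        actual = observed.get(resource)
    if goal == actual:
        return "noop"
    # A real step would call the cloud API and wait for a later notification.
    # This sketch simply assumes the repair succeeds.
    with state_lock:
        observed[resource] = goal
    return "repaired"


def worker(results):
    """Pull queued resources and run one convergence step for each."""
    while True:
        resource = work.get()
        if resource is None:  # sentinel: shut this worker down
            work.task_done()
            break
        results.append((resource, converge_step(resource)))
        work.task_done()


def run_demo():
    """Simulate two notifications and return the per-resource outcomes, sorted."""
    results = []
    threads = [threading.Thread(target=worker, args=(results,)) for _ in range(2)]
    for t in threads:
        t.start()
    notify("server-2", "ERROR")   # e.g. the hypervisor died
    notify("server-1", "ACTIVE")  # nothing changed, so the step is a no-op
    work.join()                   # wait until both steps have run
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return sorted(results)


if __name__ == "__main__":
    print(run_demo())
```

Because each step re-reads the goal and the observed state rather than relying on a call stack, a step that is lost when a worker dies could simply be queued again, and no stack-wide lock is needed in this sketch.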

For the Juno summit session dump see https://etherpad.openstack.org/p/heat-workflow-vs-convergence