Heat/Blueprints/Convergence
Summary
Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure. Heat should be resilient to this and allow concurrent operations on stacks of any size.
Release Note
TBD
Rationale
- stacks that fail during creation / update
- stacks where resources have (silently) stopped working - either disappearing or entering an error state of some sort (e.g. a loadbalancer that isn't forwarding traffic, or a nova instance in ERROR state)
- Heat engines are also noisy:
- they get restarted when servers need to get updated
- they may fail due to hardware or network failure (see under hypervisor failure)
- Heat engine failures show up as a _FAILED stack, but also have a lock preventing other operations
- If so, this is a bug: when the engine working on a stack has failed, we are supposed to steal its lock.
- Large stacks exceed the capacity of a single heat-engine process to update / manage efficiently.
- Large clusters - e.g. 10K VMs - should be directly usable
- Stack updates lock the stack until the entire thing has converged again, which prevents admins from making changes until it's completed
- This makes autoscaling hard or impossible, since scaling decisions may arrive more frequently than each update can complete
- Large admin teams are forced to use an external coordination service to ensure they only perform expensive updates during scheduled windows
- Reacting to emergencies is problematic
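The lock-stealing behaviour described above can be sketched roughly as follows. This is an illustrative model only - the names (StackLock, engine_alive) are hypothetical and not Heat's actual API, and the in-memory dict stands in for the database-backed lock table:

```python
class StackLock:
    """Sketch of per-stack locking where a dead engine's lock may be stolen."""

    def __init__(self, engine_alive):
        # engine_alive(engine_id) -> bool; in a real deployment this would be
        # something like an RPC liveness check, not a local callable.
        self._engine_alive = engine_alive
        self._locks = {}  # stack_id -> engine_id holding the lock

    def acquire(self, stack_id, engine_id):
        holder = self._locks.get(stack_id)
        if holder is None or holder == engine_id:
            self._locks[stack_id] = engine_id
            return True
        # Another engine holds the lock: steal it only if that engine is dead,
        # so an engine crash does not leave the stack permanently locked.
        if not self._engine_alive(holder):
            self._locks[stack_id] = engine_id
            return True
        return False

    def release(self, stack_id, engine_id):
        if self._locks.get(stack_id) == engine_id:
            del self._locks[stack_id]
```

With this behaviour, an operation arriving after an engine failure acquires the lock rather than failing, which is the property the rationale above asks for.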
User Stories
- Users should only need to intervene with a stack when there is no right action that Heat can take to deliver the current template+environment+parameters. E.g. if a cinder volume attached to a non-scaling-group resource goes offline, that requires administrative intervention -> STACK_FAILED
- Examples that this would handle without intervention
- nova instances that never reach ACTIVE
- neutron ports that aren't reachable
- Servers in a scaling group that disappear / go to ERROR in the nova api
- Examples that may need intervention
- servers that are not in a scaling group which go to ERROR after running for a while or just disappear
- Scaling groups that drop below a specified minimum due to servers erroring/disappearing.
- Heat users can expect Heat to bring a stack into line with the template+parameters even if the world around it changes after STACK_READY - e.g. due to a server being deleted by the user.
- Operators should not need to manually wait-or-prepare heat engines for maintenance: assume crash/shutdown/failure will happen and have that be seamless to the user.
- Stacks that are being updated must not be broken / interrupted in a user visible way due to a heat engine reboot/restart/redeploy.
- Users should be able to deploy stacks that scale to the size of the backend storage engine - e.g. we should be able to manage a million resources in a single heat stack (a somewhat arbitrary number as a target)
- Users need to be able to tell heat their desired template+parameters at any time, not just when heat believes the stack is 'READY'.
- Autoscaling is a special case of 'user' here in that it tunes the sizes of groups but otherwise is identical to a user.
- Admins reacting to problematic situations may well need to make 'overlapping' changes in rapid fire.
- Users deploying stacks with in excess of 10K instances (and thus perhaps 50K resources) should expect Heat to deploy and update said stacks quickly and gracefully, given appropriate cloud capacity.
- Existing stacks should continue to function. "We don't break user-space".
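The intervention stories above amount to an observe-and-converge pass: compare each resource's observed state to the desired template, automatically replace what is safely replaceable (e.g. scaling-group members), and flag the rest for an admin. A minimal sketch, with hypothetical function names that are not Heat's real interface:

```python
def converge(desired, observe, create, replaceable):
    """One convergence pass over a stack.

    desired:     dict of resource name -> definition from the template
    observe:     fn(name) -> 'ACTIVE', 'ERROR', or None if missing
    create:      fn(name, definition) that (re)creates a resource
    replaceable: fn(name) -> True if Heat may replace it without intervention
    """
    needs_admin = []
    for name, definition in desired.items():
        if observe(name) == 'ACTIVE':
            continue  # matches desired state, nothing to do
        if replaceable(name):
            # e.g. a scaling-group server that disappeared or went to ERROR
            create(name, definition)
        else:
            # e.g. a cinder volume on a non-scaling resource gone offline
            needs_admin.append(name)
    return 'STACK_FAILED' if needs_admin else 'STACK_READY'
```

Run repeatedly (and re-entered whenever new template+parameters arrive), such a pass gives the "only intervene when Heat has no right action" behaviour the stories describe.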