Heat/Blueprints/Convergence
Summary
Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure. Heat should be resilient to this and allow concurrent operations on stacks of any size.
Release Note
TBD
Rationale
- stacks that fail during creation / update
- stacks where resources have (silently) stopped working - either disappearing or entering an error state of some sort (e.g. a loadbalancer that isn't forwarding traffic, or a nova instance in ERROR state)
- Heat engines are also noisy:
- they get restarted when servers need to get updated
- they may fail due to hardware or network failure (see under hypervisor failure)
- Heat engine failures show up as a _FAILED stack, but also have a lock preventing other operations
- If so, this is a bug: when the engine working on a stack has failed, we are supposed to steal its lock.
- Large stacks exceed the capacity of a single heat-engine process to update / manage efficiently.
- Large clusters - e.g. 10K VMs - should be directly usable
- Stack updates lock the stack until the entire thing has converged again, which prevents admins from making changes until it's completed
- This makes autoscaling hard or impossible, since scaling decisions may arrive more frequently than each update can complete
- Large admin teams are forced to use an external coordination service to ensure they only perform expensive updates during scheduled windows
- Reacting to emergencies is problematic
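The lock-stealing behaviour described above can be sketched roughly as follows. This is an illustrative model only - the names (StackLock, engine_alive) are hypothetical and not Heat's actual API, and the in-memory dict stands in for the database-backed lock table:

```python
class StackLock:
    """Sketch of per-stack locking where a dead engine's lock may be stolen."""

    def __init__(self, engine_alive):
        # engine_alive(engine_id) -> bool; in a real deployment this would be
        # something like an RPC liveness check, not a local callable.
        self._engine_alive = engine_alive
        self._locks = {}  # stack_id -> engine_id holding the lock

    def acquire(self, stack_id, engine_id):
        holder = self._locks.get(stack_id)
        if holder is None or holder == engine_id:
            self._locks[stack_id] = engine_id
            return True
        # Another engine holds the lock: steal it only if that engine is dead,
        # so an engine crash does not leave the stack permanently locked.
        if not self._engine_alive(holder):
            self._locks[stack_id] = engine_id
            return True
        return False

    def release(self, stack_id, engine_id):
        if self._locks.get(stack_id) == engine_id:
            del self._locks[stack_id]
```

With this behaviour, an operation arriving after an engine failure acquires the lock rather than failing, which is the property the rationale above asks for.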
User Stories
- Users should only need to intervene with a stack when there is no right action that Heat can take to deliver the current template+environment+parameters. E.g. if a cinder volume attached to a non-scaling-group resource goes offline, that requires administrative intervention -> STACK_FAILED
- Examples that this would handle without intervention
- nova instances that never reach ACTIVE
- neutron ports that aren't reachable
- Servers in a scaling group that disappear / go to ERROR in the nova api
- Examples that may need intervention
- servers that are not in a scaling group which go to ERROR after running for a while or just disappear
- Scaling groups that drop below a specified minimum due to servers erroring/disappearing.
- Heat users can expect Heat to bring a stack into line with the template+parameters even if the world around it changes after STACK_READY - e.g. due to a server being deleted by the user.
- Operators should not need to manually wait-or-prepare heat engines for maintenance: assume crash/shutdown/failure will happen and have that be seamless to the user.
- Stacks that are being updated must not be broken / interrupted in a user visible way due to a heat engine reboot/restart/redeploy.
- Users should be able to deploy stacks that scale to the size of the backend storage engine - e.g. we should be able to manage a million resources in a single heat stack (a somewhat arbitrary number as a target)
- Users need to be able to tell heat their desired template+parameters at any time, not just when heat believes the stack is 'READY'.
- Autoscaling is a special case of 'user' here in that it tunes the sizes of groups but otherwise is identical to a user.
- Admins reacting to problematic situations may well need to make 'overlapping' changes in rapid fire.
- Users deploying stacks with in excess of 10K instances (and thus perhaps 50K resources) should expect Heat to deploy and update said stacks quickly and gracefully, given appropriate cloud capacity.
- Existing stacks should continue to function. "We don't break user-space".
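The intervention stories above amount to an observe-and-converge pass: compare each resource's observed state to the desired template, automatically replace what is safely replaceable (e.g. scaling-group members), and flag the rest for an admin. A minimal sketch, with hypothetical function names that are not Heat's real interface:

```python
def converge(desired, observe, create, replaceable):
    """One convergence pass over a stack.

    desired:     dict of resource name -> definition from the template
    observe:     fn(name) -> 'ACTIVE', 'ERROR', or None if missing
    create:      fn(name, definition) that (re)creates a resource
    replaceable: fn(name) -> True if Heat may replace it without intervention
    """
    needs_admin = []
    for name, definition in desired.items():
        if observe(name) == 'ACTIVE':
            continue  # matches desired state, nothing to do
        if replaceable(name):
            # e.g. a scaling-group server that disappeared or went to ERROR
            create(name, definition)
        else:
            # e.g. a cinder volume on a non-scaling resource gone offline
            needs_admin.append(name)
    return 'STACK_FAILED' if needs_admin else 'STACK_READY'
```

Run repeatedly (and re-entered whenever new template+parameters arrive), such a pass gives the "only intervene when Heat has no right action" behaviour the stories describe.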