Heat/Blueprints/Convergence

Summary

   Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure.

- stacks that fail during creation or update
- stacks where resources have (silently) stopped working, either by disappearing or by entering an error state of some sort (e.g. a load balancer that isn't forwarding traffic, or a Nova instance in ERROR state)
Heat engines are also noisy:
- they get restarted when servers need to be updated
- they may fail due to hardware or network failure (see hypervisor failure, above)
Heat engine failures show up as a _FAILED stack, but also leave behind a lock that prevents other operations
- If so, this is a bug: when the engine working on a stack has failed, we are supposed to steal its lock.
Large stacks exceed the capacity of a single heat-engine process to update / manage efficiently.
- Large clusters (e.g. 10K VMs) should be directly usable
Stack updates lock state until the entire thing has converged again, which prevents admins from making changes until the update has completed
- This makes autoscaling hard or impossible, as autoscaling decisions may arrive more often than the update triggered by each event can complete
- Large admin teams are forced to use an external coordination service to ensure they only do expensive updates during scheduled time
- Reacting to emergencies is problematic
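
The lock-stealing behaviour mentioned earlier (a lock held by a failed engine should be taken over rather than block further operations) can be illustrated with a minimal sketch. This is not Heat's actual implementation: `LockTable`, `StackLock`, and the `engine_alive` callback are hypothetical names invented for illustration, and the in-memory dict stands in for whatever shared store a real deployment would use. A real version would also need the check-and-replace step to be atomic.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class StackLock:
    """Record of which engine holds the lock on a stack (hypothetical)."""
    engine_id: str
    acquired_at: float


class LockTable:
    """In-memory stand-in for a shared stack-lock table (hypothetical)."""

    def __init__(self, engine_alive: Callable[[str], bool]) -> None:
        self._locks: Dict[str, StackLock] = {}
        # Assumption: some liveness check exists for engines.
        self._engine_alive = engine_alive

    def acquire(self, stack_id: str, engine_id: str) -> bool:
        """Try to lock a stack; steal the lock if its holder is dead."""
        current = self._locks.get(stack_id)
        if current is None or current.engine_id == engine_id:
            # Unlocked, or already held by the caller.
            self._locks[stack_id] = StackLock(engine_id, time.time())
            return True
        if not self._engine_alive(current.engine_id):
            # The holder has failed: steal its lock instead of blocking.
            self._locks[stack_id] = StackLock(engine_id, time.time())
            return True
        # A live engine holds the lock; the caller must wait or give up.
        return False

    def holder(self, stack_id: str) -> Optional[str]:
        """Return the id of the engine holding the lock, if any."""
        lock = self._locks.get(stack_id)
        return lock.engine_id if lock else None


# Usage: engine-b can only take the lock once engine-a is seen as dead.
alive = {"engine-a": True, "engine-b": True}
table = LockTable(lambda engine_id: alive.get(engine_id, False))
table.acquire("stack-1", "engine-a")
blocked = table.acquire("stack-1", "engine-b")  # engine-a still alive
alive["engine-a"] = False                       # simulate engine failure
stolen = table.acquire("stack-1", "engine-b")   # lock is stolen
```

In this sketch the stale lock never has to be released explicitly; the next engine that wants the stack notices that the holder is gone and takes over.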