Revision as of 19:48, 21 May 2014 by SpamapS
Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure.
- Stacks that fail during creation or update
- Stacks where resources have (silently) stopped working, either by disappearing or by entering an error state of some sort (e.g. a load balancer that isn't forwarding traffic, or a Nova instance in ERROR state)
- Heat engines are also noisy:
  - They get restarted when servers need to be updated
  - They may fail due to hardware or network failure (see hypervisor failure above)
  - A Heat engine failure shows up as a _FAILED stack, but the stack may also be left holding a lock that prevents other operations
    - If so, this is a bug: when the engine working on a stack has failed, we are supposed to steal its lock.
- Large stacks exceed what a single heat-engine process can update or manage efficiently.
  - Large clusters (e.g. 10K VMs) should be directly usable
- A stack update locks the stack's state until the entire stack has converged again, which prevents admins from making further changes until it has completed
  - This makes autoscaling hard or impossible, as autoscaling decisions may arrive more often than the update triggered by each event can complete
  - Large admin teams are forced to use an external coordination service to ensure that expensive updates happen only during scheduled time
  - Reacting to emergencies is problematic
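
To make the lock-stealing expectation for failed engines concrete, here is a minimal Python sketch. It is not Heat's implementation: `StackLockTable`, the heartbeat mechanism, and the 30-second timeout are all hypothetical names and values chosen for illustration. It assumes that an engine counts as failed once its heartbeat is older than the timeout, and that a real system would perform the ownership change atomically in the database rather than in memory.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StackLockTable:
    """Toy in-memory lock table (hypothetical; not Heat's actual code)."""

    heartbeat_timeout: float = 30.0  # assumed value, for illustration only
    locks: Dict[str, str] = field(default_factory=dict)         # stack_id -> engine_id
    heartbeats: Dict[str, float] = field(default_factory=dict)  # engine_id -> last seen

    def heartbeat(self, engine_id: str, now: Optional[float] = None) -> None:
        """Record that an engine is still alive."""
        self.heartbeats[engine_id] = time.monotonic() if now is None else now

    def engine_alive(self, engine_id: str, now: float) -> bool:
        """An engine is considered alive if it reported in recently enough."""
        last_seen = self.heartbeats.get(engine_id)
        return last_seen is not None and (now - last_seen) <= self.heartbeat_timeout

    def acquire(self, stack_id: str, engine_id: str,
                now: Optional[float] = None) -> bool:
        """Take the stack lock, stealing it if the current owner has failed."""
        now = time.monotonic() if now is None else now
        owner = self.locks.get(stack_id)
        if owner is None or owner == engine_id:
            self.locks[stack_id] = engine_id
            return True
        if not self.engine_alive(owner, now):
            # The owning engine is gone: steal its lock instead of leaving
            # the stack blocked behind a dead process.
            self.locks[stack_id] = engine_id
            return True
        return False  # A live engine still holds the lock.
```

With this sketch, a second engine is refused while the first one is still heartbeating, and takes over the stack once the first engine's heartbeat has expired.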