Heat/Blueprints/Convergence

Revision as of 19:48, 21 May 2014 by SpamapS

Summary

   Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure.

Problem

  • Stacks that fail during creation or update
  • Stacks whose resources have (silently) stopped working, either by disappearing or by entering an error state (e.g. a load balancer that isn't forwarding traffic, or a Nova instance in ERROR state)
  • Heat engines are also noisy:
    • They get restarted when their servers need to be updated
    • They may fail due to hardware or network failure (see under hypervisor failure)
    • A Heat engine failure shows up as a stack in a _FAILED state, but the stack may also still hold a lock that prevents other operations
      • This is a bug if so; if the engine working on a stack has failed, we are supposed to steal its lock.
  • Large stacks exceed the capacity of a single heat-engine process to update and manage efficiently.
    • Large clusters (e.g. 10K VMs) should be directly usable
  • Stack updates lock state until the entire stack has converged again, which prevents admins from making changes until the update has completed
    • This makes autoscaling hard or impossible, as autoscaling decisions may arrive more frequently than the update triggered by each event can complete
    • Large admin teams are forced to use an external coordination service to ensure that expensive updates happen only during scheduled time
    • Reacting to emergencies is problematic
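
The lock-stealing expectation in the list above (a stack lock held by a failed engine should be taken over rather than block further operations) can be sketched roughly as follows. This is only an illustration under stated assumptions, not Heat's implementation: LockTable, StackLock, the heartbeat-based liveness check and the 30-second timeout are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StackLock:
    # Hypothetical record of which engine holds a stack's lock.
    engine_id: str
    acquired_at: float


@dataclass
class LockTable:
    """Hypothetical in-memory stand-in for a shared lock store."""
    heartbeat_timeout: float = 30.0  # assumed value, for illustration only
    locks: Dict[str, StackLock] = field(default_factory=dict)
    heartbeats: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, engine_id: str, now: float) -> None:
        # Engines report in periodically; a stale entry means "presumed dead".
        self.heartbeats[engine_id] = now

    def engine_alive(self, engine_id: str, now: float) -> bool:
        last = self.heartbeats.get(engine_id)
        return last is not None and (now - last) <= self.heartbeat_timeout

    def acquire(self, stack_id: str, engine_id: str, now: float) -> bool:
        """Take the lock if it is free, already ours, or held by a dead engine."""
        current = self.locks.get(stack_id)
        held_by_other = current is not None and current.engine_id != engine_id
        if held_by_other and self.engine_alive(current.engine_id, now):
            return False  # a live engine is still working on this stack
        # Free, re-entrant, or stolen from an engine that stopped heartbeating.
        self.locks[stack_id] = StackLock(engine_id, now)
        return True
```

In this sketch a second engine is refused the lock while the holder is still heartbeating, and succeeds once the holder's last heartbeat is older than the timeout. A real deployment would need the check and the takeover to be atomic in whatever shared store holds the locks, which this single-process example does not attempt.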