Heat/Blueprints/Convergence

Summary

   Clouds are noisy - servers fail to come up, or die when the underlying hypervisor crashes or suffers a power failure. Heat should be resilient to this noise and allow concurrent operations on stacks of any size.

Release Note

TBD

Rationale

Clouds are noisy, and today Heat copes poorly with that noise:
  • stacks that fail during creation / update
  • stacks where resources have (silently) stopped working - either disappearing or entering an error state of some sort (e.g. a loadbalancer that isn't forwarding traffic, or a nova instance in ERROR state)
  • Heat engines are also noisy:
    • they get restarted when servers need to be updated
    • they may fail due to hardware or network failure (as with hypervisor failure, above)
    • Heat engine failures show up as a _FAILED stack, but can also leave behind a lock that prevents other operations
      • This is a bug if so; if the engine working on a stack has failed, another engine is supposed to steal its lock (see the lock-stealing sketch after this list).
  • Large stacks exceed the capacity of a single heat-engine process to update / manage efficiently.
    • Large clusters - e.g. 10K VMs - should be directly usable
  • Stack updates lock the stack's state until the entire thing has converged again, which prevents admins from making changes until the update has completed
    • This makes it hard/impossible to do autoscaling, as autoscaling decisions may arrive more frequently than each scaling event can complete
    • Large admin teams are forced to use an external coordination service to ensure they don't attempt expensive updates outside of scheduled maintenance windows
    • Reacting to emergencies is problematic
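
As a concrete illustration of the lock-stealing behaviour referred to above, here is a minimal sketch. It assumes each lock records the owning engine's id and that engine liveness can be checked via heartbeats; all class and method names here are illustrative, not Heat's actual internals:

 # Toy in-memory sketch of stack-lock stealing; not Heat's real code.
 import time

 LOCK_RETRY_WAIT = 0.5  # seconds between acquire attempts

 class StackLockTable:
     """Stand-in for a stack_locks table plus engine heartbeats."""

     def __init__(self):
         self.locks = {}       # stack_id -> owning engine_id
         self.heartbeats = {}  # engine_id -> last heartbeat time

     def heartbeat(self, engine_id):
         self.heartbeats[engine_id] = time.monotonic()

     def engine_alive(self, engine_id, timeout=30.0):
         last = self.heartbeats.get(engine_id)
         return last is not None and time.monotonic() - last < timeout

     def try_acquire(self, stack_id, engine_id):
         """Take the lock if free; either way, return the current owner."""
         return self.locks.setdefault(stack_id, engine_id)

     def steal(self, stack_id, dead_engine, engine_id):
         """Take over the lock only if the dead engine still holds it."""
         if self.locks.get(stack_id) == dead_engine:
             self.locks[stack_id] = engine_id
             return True
         return False  # another engine stole it first

 def acquire_with_steal(table, stack_id, engine_id):
     """Acquire the stack lock, stealing from a failed engine if needed."""
     while True:
         owner = table.try_acquire(stack_id, engine_id)
         if owner == engine_id:
             return  # we hold the lock
         if not table.engine_alive(owner):
             # The owning engine died mid-operation: steal its lock so
             # the stack is not wedged until a human intervenes.
             if table.steal(stack_id, owner, engine_id):
                 return
         time.sleep(LOCK_RETRY_WAIT)

In a real deployment the lock table would live in the database, steal() would be a compare-and-swap UPDATE, and liveness would be answered by an RPC ping to the owning engine rather than a timestamp in memory.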

User Stories

  • Users should only need to intervene in a stack when there is no right action Heat can take to deliver the current template+environment+parameters. E.g. if a cinder volume attached to a non-scaling-group resource goes offline, that requires administrative intervention -> STACK_FAILED
    • Examples that this would handle without intervention
      • nova instances that never reach ACTIVE
      • neutron ports that aren't reachable
      • Servers in a scaling group that disappear / go to ERROR in the nova api
    • Examples that may need intervention
      • servers that are not in a scaling group which go to ERROR after running for a while, or that simply disappear
      • Scaling groups that drop below a specified minimum due to servers erroring/disappearing.
  • Heat users can expect Heat to bring a stack back into line with the template+parameters even if the world around it changes after STACK_READY - e.g. due to a server being deleted by the user (see the convergence sketch after this list).
  • Operators should not need to wait for, or specially prepare, heat engines before maintenance: assume that crashes/shutdowns/failures will happen and make them seamless to users.
    • Stacks that are being updated must not be broken / interrupted in a user-visible way due to a heat engine reboot/restart/redeploy.
  • Users should be able to deploy stacks that scale to the size of the backend storage engine - e.g. we should be able to do a million resources in a single heat stack (somewhat arbitrary number as a target)
  • Users need to be able to tell heat their desired template+parameters at any time, not just when heat believes the stack is 'READY'.
    • Autoscaling is a special case of 'user' here in that it tunes the sizes of groups but otherwise is identical to a user.
    • Admins reacting to problematic situations may well need to make 'overlapping' changes in rapid fire.
  • Users deploying stacks in excess of 10K instances (and thus perhaps 50K resources) should expect Heat to deploy and update said stacks quickly and gracefully, given appropriate cloud capacity.
  • Existing stacks should continue to function. "We don't break user-space".
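
Taken together, these stories describe a convergence model: instead of a one-shot traversal whose progress lives in a single engine's memory, Heat would repeatedly compare observed reality against the desired template+parameters and act on the difference. Below is a minimal sketch of such a loop; the resource operations (observe/create/replace/update/is_healthy) are hypothetical placeholders, not Heat's real resource plug-in interface:

 # Toy observe-and-converge loop; all stack/resource methods here are
 # hypothetical placeholders, not Heat's real plug-in interface.
 import time

 CONVERGE_INTERVAL = 10  # seconds between convergence passes

 def converge(stack, desired):
     """Drive every resource toward its desired definition."""
     while True:
         pending = False
         for name, definition in desired.items():
             actual = stack.observe(name)        # ask the real cloud
             if actual is None:
                 stack.create(name, definition)  # missing: (re)create it
                 pending = True
             elif not stack.is_healthy(actual):  # e.g. nova ERROR state
                 stack.replace(name, definition)
                 pending = True
             elif actual.definition != definition:
                 stack.update(name, definition)  # template changed
                 pending = True
         if not pending:
             return  # observed state matches desired state: converged
         time.sleep(CONVERGE_INTERVAL)

Because each pass starts from observed state rather than from the last action Heat remembers taking, a new template+parameters can be accepted at any time: the next pass simply converges on the new desired state. That is what lets autoscaling adjustments and rapid-fire admin changes overlap safely, and what makes a heat-engine restart invisible - another engine just runs the next pass.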