Heat/Blueprints/RollingUpdates


 * Launchpad Entry: HeatSpec:rolling-updates
 * Created: 07 Feb 2013
 * Contributors: Clint Byrum

Summary
While managing a large group of instances, I may want to roll out changes in topology and/or configuration to a limited percentage of these instances, and then wait to see whether those initial rollouts succeed or fail before deploying to a larger percentage. This is known as a "canary" deployment strategy, after the old mining practice of carrying a caged canary into the mine to test for air quality.

Release Note
Multi-instance resources may now associate themselves with an OS::Heat::UpdatePattern via the update_pattern property. This will cause Heat to apply updates using a rolling or canary strategy.

Rationale
With large-scale deployments, updating the configuration on all machines at once, without first testing the change, may result in downtime. Being able to control the pace of the deployment will improve reliability for users who adopt it.

User stories
As an operations engineer, I want to roll out a change to the topology or configuration of a very large group of resources without risking significant downtime or elevated error rates.

OS::Heat::UpdatePattern
This will serve as a base class for two flavors: Rolling and Canary.

  rolling_pattern:
    type: OS::Heat::RollingUpdatePattern
    properties:
      min_in_service: 1
      batch_size: 2

  canary_pattern:
    type: OS::Heat::CanaryUpdatePattern
    properties:
      min_in_service: 1
      batch_size: 2
      growth_factor: 2

rolling
Updates will be performed batch_size resources at a time. If updating a full batch would cause the number of in-service resources to dip below min_in_service, then only batch_size - min_in_service updates will be initiated.
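
Taking the rule above literally, the per-batch update count could be computed as in this sketch (the function and parameter names are illustrative only, not part of the proposed API):

  def next_batch_size(in_service, batch_size, min_in_service):
      # A full batch would dip below min_in_service: initiate only
      # batch_size - min_in_service updates instead.
      if in_service - batch_size < min_in_service:
          return max(batch_size - min_in_service, 0)
      # Otherwise update a full batch.
      return batch_size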

canary
Identical to rolling, except that batch_size is multiplied by growth_factor after every successful batch.
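
The resulting schedule of batch sizes can be sketched as a simple generator (a hypothetical helper; min_in_service clamping is omitted for brevity):

  def canary_batches(total, batch_size, growth_factor):
      # Yield successive batch sizes, growing by growth_factor after
      # each successful batch, until every resource has been updated.
      remaining, size = total, batch_size
      while remaining > 0:
          batch = min(size, remaining)
          yield batch
          remaining -= batch
          size *= growth_factor

With the values from the example below (20 resources, batch_size 2, growth_factor 2), this yields batches of 2, 4, 8, and finally 6.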

depends_on
In order to determine how to proceed with the update, Heat will examine a resource's dependencies before updating it. Heat will call a hook in the parent, which allows the parent to return a wait condition that gates the update.

example
  resources:
    rolling_update_dbs:
      type: OS::Heat::RollingUpdatePattern
      properties:
        min_in_service: 1
        batch_size: 1
    db_server1:
      type: OS::Nova::Server
      depends_on: rolling_update_dbs
      properties:
        image: db-server-image
        flavor: giant-server
    db_server2:
      type: OS::Nova::Server
      depends_on: [ rolling_update_dbs, db_server1 ]
      properties:
        image: db-server-image
        flavor: giant-server
    canary_update_app:
      type: OS::Heat::CanaryUpdatePattern
      properties:
        min_in_service: 10
        batch_size: 2
        growth_factor: 2
    appservers:
      type: OS::Heat::ResourceGroup
      depends_on: [ db_server2, canary_update_app ]
      properties:
        count: 20
        resource_def:
          type: OS::Nova::Server
          properties:
            image: my-cool-app
            flavor: meh-server

Update parent hooks
Two new resource API hook points must be created which resource authors can implement to control updates of child resources: child_creating and child_updating. child_creating will be called as a child is being created, allowing the parent to dynamically create any attributes that the child may fetch; for rolling updates this would be the update wait condition handle (or handles, in the case of a group). child_updating will implement the logic that returns any wait conditions for Heat to wait on before updating the child resource.
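
A minimal sketch of these hook points on a resource base class, in Python (Heat's implementation language); the signatures, return types, and default behaviors shown here are assumptions, not a committed API:

  class Resource:
      # ... existing resource implementation ...

      def child_creating(self, child):
          # Called as a dependent ("child") resource is being created.
          # Gives this resource a chance to dynamically create any
          # attributes the child may fetch, e.g. an update wait
          # condition handle. Default: create nothing.
          pass

      def child_updating(self, child):
          # Called before a dependent resource is updated. Returns the
          # wait conditions Heat must wait on before the child may
          # proceed; an empty list means the update is not gated.
          return []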

OS::Heat::UpdatePattern
This resource will create wait conditions for each dependent resource. For groups, it will need to create a wait condition and handle per group member. It will implement the child_updating hook. On being notified that a child is being updated, it will examine the state of the current update and return all wait conditions which must be completed before the child is allowed to move forward.
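
A sketch of how the pattern resource might satisfy that contract (the bookkeeping and the stand-in handle object are illustrative assumptions, not a committed design):

  class UpdatePattern(Resource):
      def __init__(self):
          self.handles = {}   # one wait condition handle per member
          self.pending = []   # wait conditions not yet completed

      def child_creating(self, child):
          # Create the attribute the child may fetch: a wait condition
          # handle (per group member, when the child is a group).
          self.handles[child] = object()   # stand-in for a real handle

      def child_updating(self, child):
          # Examine the state of the current update and return every
          # wait condition that must complete before this child may
          # move forward.
          return list(self.pending)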

heat.engine.update
Changes must be made to call the child hooks, and to wait on the wait conditions returned by child_updating.
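
Sketched in Python; the attribute and method names below (dependencies, wait, update) are placeholders rather than the real heat.engine API:

  def update_resource(resource):
      # Collect wait conditions from every dependency that implements
      # the child_updating hook.
      conditions = []
      for parent in resource.dependencies:
          hook = getattr(parent, 'child_updating', None)
          if hook is not None:
              conditions.extend(hook(resource))
      # Block on every returned wait condition before actually
      # updating the resource.
      for condition in conditions:
          condition.wait()
      resource.update()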

UI Changes
N/A

Code Changes
TBD

Test/Demo Plan

 * Rollbacks should respect the update patterns as well. It is not entirely clear that this will "just work", though it should, given the scheme, since a rollback is mostly an update in reverse.

Unresolved issues

 * In a large instance group, failures may be common. Failure conditions that can be expected and should not roll back an update need to be identified and handled.