Heat/Blueprints/RollingUpdates

Revision as of 17:47, 3 February 2014

  • Launchpad Entry: HeatSpec:rolling-updates
  • Created: 07 Feb 2013
  • Contributors: Clint Byrum

Summary

While managing a large group of instances, I may want to roll out changes in topology and/or configuration to a limited percentage of these instances, and then wait to see whether those initial rollouts produce failures or successes before deploying to a larger percentage. This is known as a "canary" deployment strategy, after the old mining practice of carrying a canary into the mine to test for air quality.

Release Note

Multi-Instance resources may now associate themselves with an OS::Heat::UpdatePattern via the update_pattern property. This will cause Heat to apply updates using a rolling or canary strategy.

Rationale

With large scale deployments, updating configuration on all machines at once without testing may result in downtime. Being able to control the deployment will lead to more reliability for users who implement it.

User stories

As an operations engineer I want to roll out a change to topology or configuration on a very large resource without the risk of significant downtime or error rates.

Assumptions

Design

OS::Heat::UpdatePattern

This will serve as a base class for two flavors: Rolling and Canary.

rolling_pattern:
  type: OS::Heat::RollingUpdatePattern
  properties:
    min_in_service: 1
    batch_size: 2

canary_pattern:
  type: OS::Heat::CanaryUpdatePattern
  properties:
    min_in_service: 1
    batch_size: 2
    growth_factor: 2

rolling

Updates will be done batch_size resources at a time. If the number of in-service resources would dip below min_in_service, then batch_size minus min_in_service updates will be initiated.
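The rule above can be sketched as follows (a minimal illustration; `next_batch_size` is a hypothetical helper, not part of Heat):

```python
# Hypothetical helper illustrating the rolling rule above; not Heat code.
def next_batch_size(in_service, batch_size, min_in_service):
    """How many updates to initiate in the next rolling batch."""
    if in_service - batch_size < min_in_service:
        # A full batch would dip below min_in_service, so initiate the
        # reduced batch described above instead.
        return max(batch_size - min_in_service, 0)
    return batch_size
```

For example, with batch_size=2 and min_in_service=1, a group with only 2 members in service would update just one member at a time.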

canary

Identical to rolling except batch_size is increased by multiplying with growth_factor after every successful batch.
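The resulting batch progression can be sketched like this (illustrative only; `canary_batches` is a hypothetical helper, not part of Heat):

```python
# Hypothetical helper: the sequence of canary batch sizes for a group of
# `total` members, growing batch_size by growth_factor after each
# successful batch.
def canary_batches(total, batch_size, growth_factor):
    batches = []
    remaining = total
    while remaining > 0:
        step = min(batch_size, remaining)
        batches.append(step)
        remaining -= step
        batch_size *= growth_factor
    return batches
```

For a 20-member group with batch_size=2 and growth_factor=2 this yields batches of 2, 4, 8, and 6.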

depends_on

In order to determine how to proceed with the update, Heat will examine a resource's dependencies before updating it. Heat will call a hook in the parent which will allow the parent to return a wait condition.

example

resources:
  rolling_update_dbs:
    type: OS::Heat::RollingUpdatePattern
    properties:
      min_in_service: 1
      batch_size: 1
  db_server1:
    type: OS::Nova::Server
    depends_on: rolling_update_dbs
    properties:
      image: db-server-image
      flavor: giant-server
  db_server2:
    type: OS::Nova::Server
    depends_on: [ rolling_update_dbs, db_server1 ]
    properties:
      image: db-server-image
      flavor: giant-server
  canary_update_app:
    type: OS::Heat::CanaryUpdatePattern
    properties:
      min_in_service: 10
      batch_size: 2
      growth_factor: 2
  appservers:
    type: OS::Heat::ResourceGroup
    depends_on: [ db_server2, canary_update_app ]
    properties:
      count: 20
      resource_def:
        type: OS::Nova::Server
        properties:
          image: my-cool-app
          flavor: meh-server

Implementation

Update parent hooks

Two new resource API hook points must be created which resource authors can implement to control updates of child resources: child_creating and child_updating. child_creating will be called as a child is being created, allowing the parent to dynamically create any attributes that the child may fetch; for rolling updates this would be the update wait condition handle (or handles, in the case of a group). child_updating will implement the logic that returns any wait conditions Heat must wait on before updating the child resource.
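A rough sketch of what these hook points might look like (only the hook names come from this spec; the classes and method bodies are assumptions for illustration):

```python
class Resource(object):
    """Illustrative stand-in for Heat's resource base class."""

    def child_creating(self, child):
        # Called while `child` is being created; a parent may attach
        # attributes (e.g. a wait condition handle) for the child to fetch.
        pass

    def child_updating(self, child):
        # Called before `child` is updated; returns wait conditions the
        # engine must wait on, or None to let the update proceed.
        return None


class RollingUpdatePattern(Resource):
    """Sketch of a pattern resource built on the hooks."""

    def __init__(self, batch_size, min_in_service):
        self.batch_size = batch_size
        self.min_in_service = min_in_service
        self._in_flight = []

    def child_updating(self, child):
        # Track a hypothetical wait condition per child; once a full batch
        # is in flight, hand the batch back for the engine to wait on.
        self._in_flight.append('wc-%s' % child)
        if len(self._in_flight) >= self.batch_size:
            batch, self._in_flight = self._in_flight, []
            return batch
        return None
```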

OS::Heat::UpdatePattern

This resource will create wait conditions for each dependent resource. For groups, it will need to create a wait condition and handle per group member. It will implement the child_updating hook: on being notified that a child is being updated, it will examine the state of the current update and return all wait conditions which must complete before the child is allowed to move forward.

heat.engine.update

Changes must be made to call the child hooks, and to wait on the wait conditions returned by child_updating.
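Sketched roughly, the engine-side change might look like this (names are illustrative stand-ins, not Heat's actual update code):

```python
def update_resource(resource, parents, wait_on):
    # Before updating, ask each parent (reached via depends_on) for wait
    # conditions via the child_updating hook, and block on any returned.
    for parent in parents:
        conditions = parent.child_updating(resource)
        if conditions:
            wait_on(conditions)  # block until signalled or timed out
    resource.update()
```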

UI Changes

N/A

Code Changes

TBD

Migration

Test/Demo Plan

TBD

Unresolved issues

In a large instance group, failures may be common. Any failure conditions that can be expected and should not roll back an update should be identified and handled.