
Heat/Blueprints/RollingUpdates


  • Launchpad Entry: HeatSpec:rolling-updates
  • Created: 07 Feb 2013
  • Contributors: Clint Byrum

Summary

While managing a large group of instances, I may want to roll out changes in topology and/or configuration to a limited percentage of those instances, then wait to see whether those initial rollouts succeed or fail before deploying to a larger percentage. This is known as a "canary" deployment strategy, after the old mining practice of carrying a caged canary underground to test for air quality.

Release Note

Multi-Instance resources may now specify a property which causes them to apply updates using a rolling or canary strategy.

Rationale

With large-scale deployments, updating the configuration on all machines at once, without testing, may result in downtime. Being able to control the rollout will lead to greater reliability for users who implement it.

User stories

As an operations engineer, I want to roll out a change to the topology or configuration of a very large resource without risking significant downtime or elevated error rates.

Assumptions

Design

Metadata server changes

A new virtual metadata location will be queryable for each instance of a group, addressable in cfn-hup as

Resources.instance_group_name.Instances.instance_id.Metadata
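
For example (the group and instance names here are purely illustrative), an instance i-00000001 belonging to a group named WebServerGroup would poll

Resources.WebServerGroup.Instances.i-00000001.Metadata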

MetadataUpdatePattern

A new property will be introduced to OS::Heat::InstanceGroup:

MetadataUpdatePattern:

The string argument will be one of "rolling", "canary", or "immediate". If this property is not specified, 'immediate' is assumed.
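As a rough sketch of the intended behaviour (the helper below is hypothetical and not part of Heat's property handling), the property could be read and validated like this:

    METADATA_UPDATE_PATTERNS = ('rolling', 'canary', 'immediate')

    def metadata_update_pattern(properties):
        """Return the requested update pattern, defaulting to 'immediate'."""
        pattern = properties.get('MetadataUpdatePattern') or 'immediate'
        if pattern not in METADATA_UPDATE_PATTERNS:
            raise ValueError('MetadataUpdatePattern must be one of %s, %s or %s'
                             % METADATA_UPDATE_PATTERNS)
        return pattern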

rolling

The updated metadata will be exposed to one instance at a time. Any WaitCondition that depends on the InstanceGroup will be waited on before continuing to the next instance. Any failure of that WaitCondition will result in rolling back to the previous Metadata.

canary

Identical to rolling, except that the WaitCondition will be waited on for groups of instances rather than one at a time. The progression is 1 instance, then 1%, 5%, 20%, then the remainder.
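
To make the progression concrete, here is a minimal Python sketch of how the batch sizes could be computed. It assumes each percentage is taken of the total group size and rounded up to at least one instance; the text above does not pin down the rounding, so this is just one possible reading.

    import math

    def canary_batches(instance_ids):
        """Yield batches of 1 instance, then 1%, 5% and 20% of the group,
        then everything that remains."""
        remaining = list(instance_ids)
        total = len(remaining)
        sizes = [1] + [max(1, int(math.ceil(total * pct)))
                       for pct in (0.01, 0.05, 0.20)]
        for size in sizes:
            if not remaining:
                return
            batch, remaining = remaining[:size], remaining[size:]
            yield batch
        if remaining:
            yield remaining

For a group of 1,000 instances this yields batches of 1, 10, 50, 200 and finally the remaining 739 instances.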

immediate

The new Metadata is exposed to all instances immediately without waiting.

Implementation

Currently, a stack update simply updates the metadata for an instance group. To facilitate the rollback capability of rolling/canary updates, a new column, new_metadata, will be needed in the resource table. Another table, active_metadata_updates, will be created to store the list of instance ids that should be served this "new" metadata. Updates to stacks are already protected by code that will not let another update happen in parallel, so there is no need to join against a table of versioned metadata.
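
As a rough illustration of those additions (column types, lengths and the foreign key are assumptions made for the example, not settled schema), an SQLAlchemy sketch might look like:

    from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String,
                            Table, Text)

    meta = MetaData()

    # The existing resources table, abbreviated, gaining the proposed
    # new_metadata column that holds the pending metadata during an update.
    resources = Table(
        'resources', meta,
        Column('id', Integer, primary_key=True),
        Column('name', String(255)),
        Column('rsrc_metadata', Text),
        Column('new_metadata', Text, nullable=True))  # proposed new column

    # One row per instance that should currently be served the new metadata.
    active_metadata_updates = Table(
        'active_metadata_updates', meta,
        Column('id', Integer, primary_key=True),
        Column('resource_id', Integer, ForeignKey('resources.id'),
               nullable=False),
        Column('instance_id', String(255), nullable=False))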

As each instance makes requests to the new sub-resource of the group, its Metadata will be looked up with something like this:

select if (amu.id is not null, new_metadata, r.rsrc_metadata) from resources r left outer join active_metadata_updates amu on amu.resource_id = r.id and amu.instance_id = :requested_instance_id where r.name = :requested_resource_name;

This only results in an extra lookup when the user wants instance-specific metadata, and it should amount to no more than a single extra index read per request.

Once an instance has signaled the successful WaitCondition with its instance id as the data payload, more instance ids will be added to the active_metadata_updates table. This should not need a thread to stay active managing the process, as the WaitConditions and timeouts can be relied upon to trigger the next actions. Failure handling needs more thought (see Unresolved issues).
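
A minimal sketch of that step, assuming a plain DB-API cursor with pyformat parameters (the helper name is hypothetical, not existing Heat code):

    def advance_update(cursor, resource_id, next_instance_ids):
        """Expose the pending metadata to the next batch of instances by
        adding rows to active_metadata_updates."""
        for instance_id in next_instance_ids:
            cursor.execute(
                "INSERT INTO active_metadata_updates (resource_id, instance_id)"
                " VALUES (%(rid)s, %(iid)s)",
                {'rid': resource_id, 'iid': instance_id})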

When all of an instance group's instances have reported success or failure, an update should be made that sets rsrc_metadata = new_metadata. The group's rows must then be deleted from active_metadata_updates before another update is allowed.
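
A sketch of that finalization, with the same caveats as above (hypothetical helper, DB-API cursor, pyformat parameters); clearing new_metadata afterwards is an assumption, since the text only requires the copy and the row deletion:

    def finalize_group_update(cursor, resource_id):
        """Promote the pending metadata and clear the per-instance rows so
        the next stack update can proceed."""
        cursor.execute(
            "UPDATE resources SET rsrc_metadata = new_metadata,"
            " new_metadata = NULL WHERE id = %(rid)s",
            {'rid': resource_id})
        cursor.execute(
            "DELETE FROM active_metadata_updates WHERE resource_id = %(rid)s",
            {'rid': resource_id})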

UI Changes

N/A

Code Changes

TBD

Migration

  • Adding a column and table means a schema change, and so would have to be handled in database migrations.

Test/Demo Plan

TBD

Unresolved issues

In a large instance group, failures may be common. Any failure conditions that are expected, and that should not roll back an update, need to be identified and handled.

BoF agenda and discussion