Heat/Blueprints/RollingUpdates

  • Launchpad Entry: HeatSpec:rolling-updates
  • Created: 07 Feb 2013
  • Contributors: Clint Byrum

Summary

When managing a large group of instances, an operator may want to roll out changes in topology and/or configuration to a limited percentage of those instances, then wait to see whether those initial rollouts succeed or fail before deploying to a larger percentage. This is known as a "canary" deployment strategy, after the old mining practice of carrying a caged canary underground to detect dangerous gases.

Release Note

Multi-Instance resources may now specify a property which causes them to apply updates using a rolling or canary strategy.

Rationale

With large-scale deployments, updating configuration on all machines at once without testing may result in downtime. Being able to control the rollout will lead to more reliable deployments for users who implement it.

User stories

As an operations engineer, I want to roll out a topology or configuration change to a very large instance group without risking significant downtime or elevated error rates.

Assumptions

Design

Metadata server changes

A new virtual metadata location will be queryable for each instance of a group, addressable in cfn-hup as

Resources.instance_group_name.Instances.instance_id.Metadata
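
As a rough illustration, an instance could re-run its configuration whenever its per-instance metadata changes. This is a minimal sketch assuming the standard cfn-hup hooks.conf format; the hook name and the cfn-init action are hypothetical placeholders, not part of this spec:

 [group-metadata-update]
 triggers=post.update
 path=Resources.instance_group_name.Instances.instance_id.Metadata
 action=/opt/aws/bin/cfn-init -s stack_name -r instance_group_name
 runas=root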

MetadataUpdatePattern

A new property will be introduced to OS::Heat::InstanceGroup:

MetadataUpdatePattern:

The string value will be one of "rolling", "canary", or "immediate". If this property is not specified, "immediate" is assumed.
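
A hedged sketch of how the property might appear in a CFN-style template snippet follows; the resource name and the LaunchConfigurationName/Size properties shown are illustrative and not defined by this blueprint:

 "MyGroup": {
   "Type": "OS::Heat::InstanceGroup",
   "Properties": {
     "LaunchConfigurationName": { "Ref": "MyLaunchConfig" },
     "Size": "20",
     "MetadataUpdatePattern": "canary"
   }
 }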

rolling

Instance metadata updates will be exposed to one instance at a time. Any WaitCondition that depends on the InstanceGroup will be waited on before continuing to the next instance. Any failure of said WaitCondition will result in rolling back to the previous Metadata.
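 
For illustration, a WaitCondition depending on the group might look like the sketch below; the resource names, Count, and Timeout are placeholders, and standard AWS::CloudFormation::WaitCondition semantics are assumed:

 "GroupConfigured": {
   "Type": "AWS::CloudFormation::WaitCondition",
   "DependsOn": "MyGroup",
   "Properties": {
     "Handle": { "Ref": "GroupWaitHandle" },
     "Count": "1",
     "Timeout": "600"
   }
 }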

canary

Identical to rolling, except the WaitCondition will be waited on for batches of instances rather than one at a time. The progression is 1 instance, then 1%, 5%, 20%, then the remainder.
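
For example, reading those percentages as cumulative targets, a 200-instance group would progress through 1 instance, then 2 (1%), 10 (5%), 40 (20%), and finally the remaining 160 instances.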

immediate

The new Metadata is exposed to all instances immediately without waiting.

Implementation

Currently a stack update simply updates the metadata for an instance group. To facilitate the rollback capability of rolling/canary upgrades, a new column, new_metadata, will be needed in the resource table. Another table, active_metadata_updates, will be created to store the list of instance ids which should be served this "new" metadata. Updates to stacks are already protected by code that will not let another update happen in parallel, so there is no need to join against a table of versioned metadata or anything like that.
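
A minimal sketch of that schema change follows, assuming MySQL-style DDL; the column types are illustrative and would need to match the existing resources.id type:

 ALTER TABLE resources ADD COLUMN new_metadata TEXT;

 CREATE TABLE active_metadata_updates (
     id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
     resource_id INT NOT NULL,           -- must match the type of resources.id
     instance_id VARCHAR(255) NOT NULL   -- instance id reported in WaitCondition signals
 );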

As requests to the new sub-resource of the group are made for each instance, Metadata will be looked up with a query along these lines:

 SELECT IF(amu.id IS NOT NULL, r.new_metadata, r.rsrc_metadata)
   FROM resources r
   LEFT OUTER JOIN active_metadata_updates amu
     ON amu.resource_id = r.id
    AND amu.instance_id = :requested_instance_id
  WHERE r.name = :requested_resource_name;

This only adds a lookup when the user wants instance-specific metadata, and it should amount to a single extra index read per request.

Once an instance has signaled the successful WaitCondition with its instance id as the data payload, more instance ids will be added to the active_metadata_updates table. This should not need a thread to stay active managing the process, as the WaitConditions and timeouts can be relied upon to trigger the next actions. Failure handling needs more thought (see Unresolved issues).
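
Enabling the next batch could then be a matter of inserting rows for the newly selected instance ids; a sketch, with illustrative bind parameters:

 INSERT INTO active_metadata_updates (resource_id, instance_id)
 VALUES (:resource_id, :next_instance_id);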

When all of an instance group's instances have reported success or failure, an update should be made which sets rsrc_metadata = new_metadata. The rows must then be deleted from active_metadata_updates before another update is allowed.
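
A sketch of that finalization step, using the same illustrative bind parameter as above:

 UPDATE resources SET rsrc_metadata = new_metadata WHERE id = :resource_id;
 DELETE FROM active_metadata_updates WHERE resource_id = :resource_id;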

UI Changes

N/A

Code Changes

TBD

Migration

  • Adding a column and table means a schema change, and so would have to be handled in database migrations.

Test/Demo Plan

TBD

Unresolved issues

In a large instance group, failures may be common. Failure conditions that are expected and should not roll back an update need to be identified and handled.

BoF agenda and discussion