- Launchpad Entry: HeatSpec:rolling-updates
- Created: 07 Feb 2013
- Contributors: Clint Byrum
- 1 Summary
- 2 Release Note
- 3 Rationale
- 4 User stories
- 5 Assumptions
- 6 Design
- 7 Implementation
- 8 Test/Demo Plan
- 9 Unresolved issues
- 10 BoF agenda and discussion
While managing a large group of instances, I may want to roll out changes in topology and/or configuration to a limited percentage of these instances and then wait to see whether those initial rollouts succeed or fail before deploying to a larger percentage. This is known as a "canary" deployment strategy, after the old mining practice of carrying a caged canary to test air quality.
Multi-Instance resources may now specify a property which causes them to apply updates using a rolling or canary strategy.
In large-scale deployments, updating configuration on all machines at once without testing may result in downtime. Being able to control the pace of a deployment will make it more reliable for users who adopt it.
As an operations engineer I want to roll out a change to topology or configuration on a very large resource without the risk of significant downtime or error rates.
Metadata server changes
A new virtual metadata location will be queryable for each instance of a group, addressable in cfn-hup as
A new property will be introduced to OS::Heat::InstanceGroup:
The string argument will be one of "rolling", "canary", or "immediate". If this property is not specified, "immediate" is assumed.
With "rolling", instance metadata updates will be exposed to one instance at a time. Any WaitCondition that depends on the InstanceGroup will be waited on before continuing to the next instance. Any failure of said WaitCondition will result in rolling back to the previous Metadata.
The "canary" strategy is identical to "rolling", except that the WaitCondition will be waited on for groups of instances rather than one instance at a time. The progression is 1 instance, then 1%, 5%, 20%, then the remainder.
With "immediate", the new Metadata is exposed to all instances immediately without waiting.
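The three strategies can be pictured as different ways of splitting a group into ordered batches. The following is a minimal sketch rather than Heat code: `plan_batches` and `CANARY_PERCENTAGES` are hypothetical names, and it assumes the canary percentages are cumulative shares of the whole group, rounded up, which this spec does not state.

```python
# Cumulative percentage of the group updated after each canary stage.
# Assumption: 1%, 5%, 20% are cumulative targets, not per-stage sizes.
CANARY_PERCENTAGES = (1, 5, 20, 100)


def plan_batches(instance_ids, strategy="immediate"):
    """Split instance ids into the ordered batches a strategy would expose."""
    ids = list(instance_ids)
    if strategy == "immediate":
        # Every instance sees the new metadata at once.
        return [ids] if ids else []
    if strategy == "rolling":
        # One instance per batch.
        return [[i] for i in ids]
    if strategy == "canary":
        # Start with a single instance, then grow to each cumulative target,
        # using integer ceiling division to avoid float rounding.
        targets = [1] + [(len(ids) * pct + 99) // 100
                         for pct in CANARY_PERCENTAGES]
        batches = []
        done = 0
        for target in targets:
            target = min(target, len(ids))
            if target > done:  # skip stages that add no new instances
                batches.append(ids[done:target])
                done = target
        return batches
    raise ValueError("unknown update strategy: %r" % (strategy,))
```

Under these assumptions, a group of 200 instances would be updated in canary batches of 1, 1, 8, 30, and 160 instances, while a group of 3 collapses to a batch of 1 followed by the remaining 2.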
Currently, a stack update simply tries to update the metadata for an instance group. To support the rollback capability of rolling/canary updates, a new column, new_metadata, will be needed in the resource table. A new table, active_metadata_updates, will also be created to store the list of instance ids which should be served this "new" metadata. Updates to stacks are already protected by code that will not let another update happen in parallel, so there is no need for any joining to a table of versioned metadata or anything like that.
As requests to the new sub-resource of the group are made for each instance, Metadata will be looked up something like this:
select if (amu.id is not null, new_metadata, r.rsrc_metadata) from resources r left outer join active_metadata_updates amu on amu.resource_id = r.id and amu.instance_id = :requested_instance_id where r.name = :requested_resource_name;
This only results in an extra lookup if the user wants instance-specific metadata, and with a suitable index on active_metadata_updates it should be reducible to a single extra index read per request.
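To make the lookup concrete, here is a small self-contained sketch using Python's sqlite3 module with an in-memory database. The schema is a guess reduced to the columns named above, the resource and instance names are made up, and a CASE expression stands in for the MySQL-style if() so that the query runs on SQLite. None of this is Heat's actual schema or code.

```python
import sqlite3

# Same shape as the query above, with CASE in place of if().
LOOKUP_SQL = """
SELECT CASE WHEN amu.id IS NOT NULL THEN r.new_metadata
            ELSE r.rsrc_metadata END
FROM resources r
LEFT OUTER JOIN active_metadata_updates amu
  ON amu.resource_id = r.id AND amu.instance_id = :instance_id
WHERE r.name = :resource_name
"""


def make_db():
    """Create an in-memory database with a hypothetical, minimal schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE resources (
            id INTEGER PRIMARY KEY, name TEXT,
            rsrc_metadata TEXT, new_metadata TEXT);
        CREATE TABLE active_metadata_updates (
            id INTEGER PRIMARY KEY, resource_id INTEGER, instance_id TEXT);
        CREATE INDEX amu_lookup
            ON active_metadata_updates (resource_id, instance_id);
    """)
    return db


def lookup_metadata(db, resource_name, instance_id):
    """Return the metadata a given instance should currently be served."""
    params = {"instance_id": instance_id, "resource_name": resource_name}
    row = db.execute(LOOKUP_SQL, params).fetchone()
    return row[0] if row else None


db = make_db()
db.execute("INSERT INTO resources VALUES (1, 'web_group', 'v1', 'v2')")
# Only instance i-0001 has been selected to receive the new metadata so far.
db.execute("INSERT INTO active_metadata_updates (resource_id, instance_id) "
           "VALUES (1, 'i-0001')")
```

With this data, `lookup_metadata(db, 'web_group', 'i-0001')` returns the new metadata, while any instance without a row in active_metadata_updates is still served rsrc_metadata.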
Once an instance has signaled the successful WaitCondition with its instance id as the data payload, more instances will be added to the active_metadata_updates table. This should not need a thread to stay active managing the process, as the WaitConditions and timeouts can be relied upon to trigger the next actions. Failure handling needs more thought (see Unresolved issues).
When all of an instance group's instances have reported success or failure, an update should be made which sets rsrc_metadata = new_metadata. Then the rows must be deleted from active_metadata_updates before another update is allowed.
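The signal-driven progression described here could be modeled roughly as follows. This is an in-memory sketch under stated assumptions, not a proposed implementation: `RollingUpdate` and its methods are hypothetical, any single failure rolls back the whole update (the simplest reading of the rollback rule, pending the open question on failure handling), and timeouts are not modeled.

```python
class RollingUpdate:
    """In-memory stand-in for one instance group's metadata update."""

    def __init__(self, old_metadata, new_metadata, batches):
        self.rsrc_metadata = old_metadata
        self.new_metadata = new_metadata
        # Batches not yet exposed to the new metadata (empty ones dropped).
        self.pending = [list(b) for b in batches if b]
        self.active = set()   # plays the role of active_metadata_updates
        self.waiting = set()  # instances we still expect a signal from
        self.finished = False
        self._expose_next()

    def _expose_next(self):
        if self.pending:
            batch = self.pending.pop(0)
            self.active.update(batch)
            self.waiting = set(batch)
        else:
            # Every batch succeeded: promote new_metadata and clear the rows.
            self.rsrc_metadata = self.new_metadata
            self.new_metadata = None
            self.active.clear()
            self.finished = True

    def metadata_for(self, instance_id):
        """Metadata served to an instance at this point in the update."""
        if instance_id in self.active:
            return self.new_metadata
        return self.rsrc_metadata

    def signal(self, instance_id, success):
        """Handle a WaitCondition signal whose payload is the instance id."""
        if self.finished or instance_id not in self.waiting:
            return
        if not success:
            # Assumed policy: roll back, so nobody is served new_metadata.
            self.new_metadata = None
            self.active.clear()
            self.pending = []
            self.waiting = set()
            self.finished = True
            return
        self.waiting.discard(instance_id)
        if not self.waiting:
            self._expose_next()
```

No thread is needed to drive this: each call to `signal` either waits for the rest of the current batch, exposes the next batch, promotes the new metadata, or rolls back.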
- Adding a column and table means a schema change, and so would have to be handled in database migrations.
In a large instance group, failures may be common. Any failure conditions that can be expected and should not roll back an update should be identified and handled.