Heat/Blueprints/as-update-policy

Latest revision as of 20:51, 15 August 2013

Summary

The following is the proposed solution for the as-update-policy blueprint. We want to add an UpdatePolicy attribute that can be used with InstanceGroup and AutoScalingGroup to specify how changes to the launch configuration or subnet are rolled out. The UpdatePolicy attribute can be introduced to an existing stack as part of a stack update request. For the InstanceGroup resource type, we want to add the following snippet to the cfn template.

  "UpdatePolicy" : {
     "RollingUpdate" : {
        "MinInstancesInService" : "1",
        "MaxBatchSize" : "12",
        "PauseTime" : "PT60S"
     }
  }

MinInstancesInService indicates the number of instances that must be in service while other instances are being replaced.
MaxBatchSize indicates the maximum number of instances to roll out with each batch.
PauseTime indicates the wait time between batches, written as an ISO 8601 duration (for example, PT60S is 60 seconds).
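
The three settings above could be read from a template along the following lines. This is a minimal sketch, not part of the proposed implementation: the function names `pause_seconds` and `parse_update_policy` and the returned key names are hypothetical, only time-only durations (hours, minutes, seconds) are handled, and all three settings are assumed to be present.

```python
import re

# Matches time-only ISO 8601 durations such as PT60S or PT1M30S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def pause_seconds(duration):
    """Convert a duration such as 'PT60S' to a number of seconds."""
    match = _DURATION.match(duration)
    if not match or duration == "PT":
        raise ValueError("unsupported duration: %r" % duration)
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def parse_update_policy(update_policy):
    """Return the settings of the single policy entry as integers.

    Hypothetical helper: assumes exactly one entry (its name is ignored)
    and that all three settings are given as strings.
    """
    if len(update_policy) != 1:
        raise ValueError("expected exactly one entry in UpdatePolicy")
    (settings,) = update_policy.values()
    return {
        "min_in_service": int(settings["MinInstancesInService"]),
        "max_batch_size": int(settings["MaxBatchSize"]),
        "pause_seconds": pause_seconds(settings["PauseTime"]),
    }


policy = parse_update_policy({
    "RollingUpdate": {
        "MinInstancesInService": "1",
        "MaxBatchSize": "12",
        "PauseTime": "PT60S",
    }
})
print(policy)
```

Converting the values once, up front, keeps the string-typed template values out of the update logic itself.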

The example below is a cfn template for an InstanceGroup with UpdatePolicy. This snippet is a revision of the sample heat template at https://github.com/openstack/heat-templates/blob/master/cfn/F17/InstanceGroup.template. In the example below, an update to the LaunchConfiguration referenced by the JobServerGroup in an existing stack will trigger the RollingUpdate policy specified under the UpdatePolicy attribute. The name of the entry under UpdatePolicy is not significant. The InstanceGroup resource expects only one entry within the UpdatePolicy attribute. During the update, the JobServerGroup must have at least one instance in service. The update will be rolled out in batches of 5 instances. For each batch, the new instances will be created first, in parallel, before the old instances are terminated. There will be a 30-second pause before each batch is rolled out.

 "Resources" : {
   "JobServerGroup" : {
     "UpdatePolicy" : {
       "RollingUpdate" : {
         "MinInstancesInService" : "1",
         "MaxBatchSize" : "5",
         "PauseTime" : "PT30S"
       }
     },
     "Type" : "OS::Heat::InstanceGroup",
     "Properties" : {
       "LaunchConfigurationName" : { "Ref" : "JobServerConfig" },
       "Size" : {"Ref": "NumInstances"},
       "AvailabilityZones" : { "Fn::GetAZs" : "" }
     }
   },
   "JobServerConfig" : {
     "Type" : "AWS::AutoScaling::LaunchConfiguration",
     "Properties": {
       "ImageId"           : { "Ref" : "ImageId" },
       "InstanceType"      : { "Ref" : "InstanceType" },
       "KeyName"           : { "Ref" : "KeyName" },
       "NovaSchedulerHints": [ {"Key": "part", "Value": "long"},
                               {"Key": "ready", "Value": "short"} ],
       "UserData"          : { "Fn::Base64" : { "Fn::Join" : ["", [
         "#!/bin/bash -v\n"
       ]]}}
     }
   }
 }

The current Heat engine does not detect changes in an underlying referenced resource (i.e. the LaunchConfiguration). Given the above example, if JobServerConfig is updated, the changes to JobServerConfig are not recognized in _update_resource of the StackUpdate class when JobServerGroup is checked for updates. Therefore, a resource update for the InstanceGroup is not triggered. The LaunchConfiguration resource will recognize the update, but currently there is no update handler implemented, and it seems more appropriate to let the InstanceGroup handle its own instance updates. So currently, the only way to trigger a change in the InstanceGroup as a result of a LaunchConfiguration change is to rename the LaunchConfiguration JobServerConfig in the cfn template. Since LaunchConfigurationName is not in the update_allowed_properties of InstanceGroup, this leads to a replace (destroy followed by create) of the existing InstanceGroup. This is not the desired solution, as we want the update to the LaunchConfiguration to be rolled out in a controlled fashion.

Implementation

The following changes are proposed for the implementation of this blueprint. The goal is to allow the InstanceGroup and AutoScalingGroup to recognize that the LaunchConfiguration they reference has been updated. The InstanceGroup and AutoScalingGroup should continue to decide how to handle their own updates. Currently, the template differences and property differences are passed into the update function, and the resource decides what to do with them. We want the change in the LaunchConfiguration to be recognized as a property difference. To do that, we will override FnGetRefId() of LaunchConfiguration to return physical_resource_name(). When any property of the LaunchConfiguration is modified, the engine will replace the LaunchConfiguration; as a result, both the resource ID and the physical resource name will change. The change in the physical resource name of the referenced LaunchConfiguration will show up as a property difference in the LaunchConfigurationName of the InstanceGroup. If LaunchConfigurationName is added to update_allowed_properties, the InstanceGroup and AutoScalingGroup will be able to handle the update appropriately without triggering a destroy and replace of the entire group.
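
The mechanism described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyLaunchConfiguration`, `update_launch_configuration`, `group_properties`, and `property_diff` are hypothetical stand-ins and not Heat's classes or API, the physical naming scheme is invented, and the instance type values are illustrative.

```python
import itertools

_suffix = itertools.count(1)


class ToyLaunchConfiguration:
    """Toy stand-in for a LaunchConfiguration resource (not Heat's class)."""

    def __init__(self, logical_name, properties):
        self.logical_name = logical_name
        self.properties = dict(properties)
        # Hypothetical naming: every newly created resource gets a new suffix.
        self.physical_name = "%s-%04d" % (logical_name, next(_suffix))

    def ref_id(self):
        # Proposed behavior: a reference resolves to the physical name.
        return self.physical_name


def update_launch_configuration(old, new_properties):
    """Replace the resource when any property changes, else keep it."""
    if new_properties == old.properties:
        return old
    return ToyLaunchConfiguration(old.logical_name, new_properties)


def group_properties(conf, size):
    """Resolved properties of a group that references the configuration."""
    return {"LaunchConfigurationName": conf.ref_id(), "Size": size}


def property_diff(before, after):
    """Return the properties whose resolved values differ."""
    return {key: after[key] for key in after if before.get(key) != after[key]}


old_conf = ToyLaunchConfiguration("JobServerConfig", {"InstanceType": "m1.small"})
new_conf = update_launch_configuration(old_conf, {"InstanceType": "m1.large"})

diff = property_diff(group_properties(old_conf, 3),
                     group_properties(new_conf, 3))
print(sorted(diff))
```

In this model the group's template is unchanged, yet the resolved LaunchConfigurationName differs after the replacement, which is what lets the group see the change as an ordinary property difference.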

Modify LaunchConfiguration class

- Override FnGetRefId to return physical_resource_name()

Add UpdatePolicy class

- Put this new class in the autoscaling module under the engine module

Modify InstanceGroup and AutoScalingGroup

- Add UpdatePolicy to update_allowed_keys in InstanceGroup and modify handle_update to handle property differences in UpdatePolicy
  - Changes to the UpdatePolicy are only property changes
  - Changes to the UpdatePolicy alone do not trigger InstanceGroup to update/replace its instances
- Add LaunchConfigurationName to update_allowed_properties so changes to the LaunchConfiguration will not trigger an UpdateReplace
- Modify _create_template to resolve the LaunchConfigurationName correctly
  - Use conf = self.stack.resource_by_refid(self.properties['LaunchConfigurationName']) where conf will be the LaunchConfiguration resource
  - Use instance_definition = copy.deepcopy(conf.t) to get the instance definition
- Modify handle_update to handle rolling update
  - The rolling update will only be triggered if an UpdatePolicy is defined and LaunchConfigurationName is recognized as a property difference.
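
The trigger condition and the create-before-delete batching described in this blueprint might look roughly like the sketch below. It is not the proposed handle_update code: `should_roll` and `rolling_update` are hypothetical names, instances are plain strings, nothing is actually created or deleted, and the `sleep` callable is injected so the pause can be skipped when experimenting.

```python
def should_roll(update_policy, prop_diff):
    """Roll only when a policy exists and the launch configuration changed."""
    return bool(update_policy) and "LaunchConfigurationName" in prop_diff


def rolling_update(old_names, new_names, max_batch_size, pause_seconds, sleep):
    """Replace instances batch by batch, creating before deleting.

    Returns a log of (action, value) tuples describing what a real
    implementation would do.
    """
    log = []
    old = list(old_names)
    pending = list(new_names)
    while pending:
        batch, pending = pending[:max_batch_size], pending[max_batch_size:]
        # Pause before each batch is rolled out.
        log.append(("pause", pause_seconds))
        sleep(pause_seconds)
        # New instances come up first, so capacity never dips below the
        # original group size during the batch.
        log.append(("create", batch))
        retired, old = old[:len(batch)], old[len(batch):]
        log.append(("delete", retired))
    return log


policy = {"RollingUpdate": {"MaxBatchSize": "2", "PauseTime": "PT30S"}}
prop_diff = {"LaunchConfigurationName": "JobServerConfig-new"}

log = []
if should_roll(policy, prop_diff):
    log = rolling_update(["vm-0", "vm-1", "vm-2"], ["vm-3", "vm-4", "vm-5"],
                         max_batch_size=2, pause_seconds=30,
                         sleep=lambda seconds: None)
for entry in log:
    print(entry)
```

Because each batch is created before its predecessors are deleted, the group temporarily holds up to MaxBatchSize extra instances.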

Naming of instances in InstanceGroup and AutoScalingGroup
Currently, the instances are named in numeric order from 0 to the size of the group. Increasing capacity on a resize adds new instances to the end of the resources list in the nested stack. Decreasing capacity on a resize deletes instances starting from the end of that list. To support UpdatePolicy and the MinInstancesInService attribute, the update procedure should add new instances first, up to MaxBatchSize at a time, before deleting the instances being replaced. We also want to avoid renaming the new instances to match their predecessors after deletion, because this causes problems in use cases where the host name of the instance in the operating system is the same as the resource name in the stack. To address this, the proposal is to let the numeric counter grow while still maintaining a contiguous set, until there is enough room starting from 0 to fit the next full replacement of the group. The following is an example.

Instance counters in this example:

1	2	3	4	5	6	7	8	9	10

The stack goes through the following steps:

- Initial stack with min capacity
- Stack grows to size 4
- LaunchConfiguration is updated and instances are replaced
- Stack grows to size 6
- Stack shrinks back to size 4
- LaunchConfiguration is updated and instances are replaced

The following is a sample Python script that shows a prototype of how to generate the batches with the proposed naming scheme.

#!/usr/bin/env python

import re


def get_replacements(resources, name_prefix, batch_size):
    """Yield batches of names for a full replacement of the group."""
    if not resources:
        # Same as the original prototype: an empty group yields one empty batch.
        yield []
        return
    name_pattern = re.compile(r'%s-(\d+)' % re.escape(name_prefix))
    # Sort on the numeric counter; a plain string sort would put
    # 'vm-10' before 'vm-6' and pick the wrong start and end.
    indexes = sorted(int(name_pattern.search(name).group(1))
                     for name in resources)
    grp_size = len(indexes)
    grp_start = indexes[0]
    grp_end = indexes[-1]
    # Restart from 0 only if the whole group fits below the lowest counter
    # in use; otherwise continue after the highest counter.
    current_start = 0 if grp_size <= grp_start else grp_end + 1
    replacement_end = current_start + grp_size - 1
    while current_start <= replacement_end:
        remaining = replacement_end - current_start + 1
        current_batch_size = min(batch_size, remaining)
        yield ['%s-%s' % (name_prefix, counter) for counter in
               range(current_start, current_start + current_batch_size)]
        current_start += current_batch_size


if __name__ == '__main__':

    # generate batches of 2 starting from vm-6
    resources = ['vm-0', 'vm-2', 'vm-4', 'vm-1', 'vm-3', 'vm-5']
    print(list(get_replacements(resources, 'vm', 2)))

    # generate batches of 2 starting from vm-0
    resources = ['vm-6', 'vm-8', 'vm-10', 'vm-7', 'vm-9', 'vm-11']
    print(list(get_replacements(resources, 'vm', 2)))

The script outputs the following. The first list contains batches in the range 6-11. The second list contains batches in the range 0-5, since the whole group now fits in the gap starting from 0.

[['vm-6', 'vm-7'], ['vm-8', 'vm-9'], ['vm-10', 'vm-11']]
[['vm-0', 'vm-1'], ['vm-2', 'vm-3'], ['vm-4', 'vm-5']]
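
The grow, shrink, and replace sequence from the naming example can be traced with the same rule. The sketch below is an illustration only: it assumes zero-based counters as in the prototype, a group that is already at size 4 when the trace starts, and a contiguous set of counters; the helper names `next_start`, `replace_group`, and `resize_group` are hypothetical.

```python
def next_start(indexes):
    """First counter to use for a full replacement of the group."""
    size, lowest, highest = len(indexes), min(indexes), max(indexes)
    # Reuse the low counters only when the whole group fits below the
    # lowest counter in use.
    return 0 if size <= lowest else highest + 1


def replace_group(indexes):
    """Counters of the group after every instance has been replaced."""
    start = next_start(indexes)
    return list(range(start, start + len(indexes)))


def resize_group(indexes, new_size):
    """Grow at the end of the list, or shrink from the end of the list."""
    first = min(indexes)
    return list(range(first, first + new_size))


trace = []
group = list(range(4))            # stack at size 4
trace.append(("size 4", group))
group = replace_group(group)      # LaunchConfiguration updated
trace.append(("replaced", group))
group = resize_group(group, 6)    # stack grows to size 6
trace.append(("size 6", group))
group = resize_group(group, 4)    # stack shrinks back to size 4
trace.append(("size 4", group))
group = replace_group(group)      # LaunchConfiguration updated again
trace.append(("replaced", group))

for step, counters in trace:
    print(step, counters)
```

Under these assumptions the first replacement moves the group to counters 4-7, and the second replacement wraps back to 0-3 because four free counters are then available below the group.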

Out of Scope

The goal is to contribute this feature in the Havana release. Support for subnet group membership changes in the UpdatePolicy will likely be out of scope, because support for VPCZoneIdentifier in InstanceGroup and AutoScalingGroup is a prerequisite.