Heat/AutoScaling

Note

The content on this page, like most of the wiki, is obsolete. It is a proposal for a new design for an autoscaling API in Heat that was never implemented. There is now a separate autoscaling API project, Senlin.

Summary

This is a proposal for a new design for Heat autoscaling. The existing AWS-based design is described at Heat/AWSAutoScaling.

The design is currently reflected in this blueprint: https://blueprints.launchpad.net/heat/+spec/autoscaling-api-resources

Use Cases

  1. Users want to use AutoScale without using Heat templates.
  2. Users want to use AutoScale *with* Heat templates.
  3. Users want to scale arbitrary resources, not just instances.
  4. Users want their autoscaled resources to be associated with shared resources such as load balancers, cluster managers, configuration servers, and so on.
  5. TODO: Administrators or automated processes want to add or remove *specific* instances from a scaling group (for example, because one node was compromised or hit a critical error).
  6. TODO: Users want to specify a general policy about which resources to delete when scaling down, either newest or oldest.
  7. TODO: A hook needs to be provided to allow completion or cancelling of the automatic scaling down of a resource. For example, a MongoDB shard may need draining to other nodes before it can be safely deleted; or replicas may need time to resync before another is deleted. The check would ensure the resync is done.
  8. TODO: Another hook should be provided to allow selection of the node to scale down. In the MongoDB example again, select the node with the least data that would need to migrate to other hosts.

AutoScaling API

The general ideas of this proposal are as follows:

  • Implement new resources for scaling groups and policies in terms of a new, separate API (implemented in the Heat codebase)
  • That separate API will be usable by end-users directly, or via Heat resources.
  • That API will create a Heat template and its own Heat stack whenever a scaling group is created within it.
  • As events happen which trigger a policy that changes the number of instances in a scaling group, the autoscale API will generate a new template, and update-stack the stack that it manages.
  • The existing Ceilometer Alarm resource will be able to be used with the URL from a WebhookTrigger resource.
  • The AutoScaling API implementation should not have any knowledge of hooking up scaled resources to shared resources such as load balancers. We should figure out a way to represent these associations in a general way, without e.g. having AS talk to the Neutron LB API, so that we can support all manner of these things.
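
Putting the last few points together, here is a rough Python sketch of what the AS service might do when a policy fires: regenerate the private stack's template and update-stack it. Everything here is illustrative; heat_client, policy.apply() and the template layout are assumptions rather than a defined interface.

def on_policy_triggered(group, policy, heat_client):
    """Hypothetical handler: a policy execution resizes the group's private stack."""
    desired = policy.apply(group.current_size)              # e.g. +1, -10%, or an exact size
    desired = max(group.min_size, min(group.max_size, desired))

    # Duplicate the group's 'resources' mapping once per scaling unit
    # (see the ScalingGroup examples below for what this expands to).
    template = {'Resources': {
        '%s-%d' % (name, i): definition
        for i in range(1, desired + 1)
        for name, definition in group.resources.items()}}

    # The AS service owns the private stack, so it just updates it in place.
    heat_client.stacks.update(group.stack_id, template=template)
    group.current_size = desired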

The autoscaling API is currently being documented as an API Blueprint at http://docs.heatautoscale.apiary.io/ -- please discuss it on the openstack-dev mailing list.

The AutoScaling Resources

There are a number of resources associated with autoscaling:

  • OS::AutoScale::ScalingGroup - a group that can scale an arbitrary set of Heat resources.
  • OS::AutoScale::ScalingPolicy - affects the number of scaling units in a group (+1, -10%, etc.)
  • OS::AutoScale::WebHook - creates a new webhook that can be used to execute a ScalingPolicy

The resources are documented below; we have decided to document the general design in this form for simplicity's sake, but remember that an important aspect of this proposal is that the AS API is accessible directly to the user without necessarily using Heat resources to drive it. These Heat resources should map pretty directly and obviously to the API, but hopefully soon there will be documentation for the raw ReST form of the API.

ScalingGroup

A scaling group that can manage the scaling of arbitrary Heat resources.

  • Properties:
    • name: Convenient name.
    • max_size: Maximum size of the group.
    • min_size: Minimum size of the group.
    • cooldown: The minimum amount of time (in seconds) between autoscaling operations permitted on this group.
    • resources: The mapping of resources that will be duplicated in order to scale.

The 'resources' mapping is duplicated for each scaling unit. For example, if the 'resources' property is specified as follows:

mygroup:
    type: OS::AutoScale::ScalingGroup
    properties:
        resources:
            my_web_server: {type: AWS::EC2::Instance}

then if we scale to "2", the concrete resources included in the private stack's template will be as follows:

my_web_server-1: {type: AWS::EC2::Instance}
my_web_server-2: {type: AWS::EC2::Instance}
    ...

And multiple resources are supported and scaled in lockstep. For example, if the 'resources' property is specified as follows:

resources:
    my_web_server: {type: AWS::EC2::Instance}
    my_db_server: {type: AWS::EC2::Instance}

Then the resulting template (when scaled to "2") will be

my_web_server-1: {type: AWS::EC2::Instance}
my_db_server-1: {type: AWS::EC2::Instance}
my_web_server-2: {type: AWS::EC2::Instance}
my_db_server-2: {type: AWS::EC2::Instance}
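
To make the duplication concrete, here is a minimal Python sketch (the helper name and template handling are assumptions, not the actual implementation) of how the AS service could expand the 'resources' mapping into the concrete resources shown above:

import copy

def expand_resources(resources, size):
    """Duplicate each entry of the 'resources' mapping once per scaling unit.

    expand_resources({'my_web_server': {'type': 'AWS::EC2::Instance'},
                      'my_db_server': {'type': 'AWS::EC2::Instance'}}, 2)
    produces my_web_server-1, my_db_server-1, my_web_server-2, my_db_server-2,
    matching the example output above.
    """
    expanded = {}
    for i in range(1, size + 1):                     # one pass per scaling unit
        for name, definition in resources.items():
            expanded['%s-%d' % (name, i)] = copy.deepcopy(definition)
    return expanded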


ScalingPolicy

A scaling policy describes a particular type of change to a scaling group, such as "add -1 capacity" or "add +10% capacity" or "set 5 capacity".

  • Properties:
    • name: Convenient name
    • group_id: ID of the group that this policy will affect
    • cooldown: minimum amount of time (in seconds) between allowable executions of this policy.
    • change: a number that has an effect based on change_type.
    • change_type: one of "change_in_capacity", "percentage_change_in_capacity", or "exact_capacity" -- describes what this policy does (and the meaning of "change")
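
To illustrate the intended semantics of change and change_type (a sketch only; the rounding rule for percentage changes is an assumption), the desired capacity could be computed roughly like this, clamped to the group's min_size and max_size:

def desired_capacity(current, change, change_type, min_size, max_size):
    """Compute the capacity a policy asks for, clamped to the group's bounds."""
    if change_type == 'change_in_capacity':
        desired = current + change                                   # e.g. change = -1 or +2
    elif change_type == 'percentage_change_in_capacity':
        desired = current + int(round(current * change / 100.0))     # e.g. change = 10 for +10%
    elif change_type == 'exact_capacity':
        desired = change                                             # e.g. "set 5 capacity"
    else:
        raise ValueError('unknown change_type: %r' % change_type)
    return max(min_size, min(max_size, desired))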

WebHook

Represents a revokable webhook endpoint for executing a policy.

For example, when you create a webhook for a policy, a new URL endpoint will be created in the form of http://as-api/webhooks/<random_hash>. When that URL is requested, the policy will be executed.

This resource will be useful in combination with a CeilometerAlarm resource that knows how to set up Ceilometer to execute a webhook when an alert happens.

  • Properties:
    • policy_id: The ID of the policy to execute.
  • Attributes:
    • webhook_url: The webhook URL.
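
A minimal sketch of what creating and revoking such a webhook could look like inside the AS service (the in-memory store, URL prefix, and token format are assumptions; the signing secret is discussed under "Securing Webhooks" below):

import binascii
import os

WEBHOOKS = {}    # token -> {'policy_id': ..., 'secret': ...}; a stand-in for real storage

def create_webhook(policy_id, base_url='http://as-api/webhooks'):
    """Create a revokable webhook for a policy; returns (webhook_url, signing_secret)."""
    token = binascii.hexlify(os.urandom(16)).decode()        # the <random_hash> in the URL
    secret = binascii.hexlify(os.urandom(32)).decode()       # handed back to the caller once
    WEBHOOKS[token] = {'policy_id': policy_id, 'secret': secret}
    return '%s/%s' % (base_url, token), secret

def revoke_webhook(webhook_url):
    """Revoking simply forgets the token, so requests to the URL stop executing the policy."""
    WEBHOOKS.pop(webhook_url.rsplit('/', 1)[-1], None)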

Load Balancers

As mentioned in "general ideas" above, we would like to avoid encoding knowledge of specific LB APIs into the AS API implementation -- this is because there are certainly unbounded use cases for such relationships of "scaled" resources to "shared" resources, and we would only be limiting them by making the implementation specific to a few of them.

Here are some ideas which may work to support this.

LBMember?

NOTE: This is just an idea! We're still considering different ways to do this.

The way LB integration is currently implemented in the AWS-style autoscaling implementation in Heat is by manipulating a LoadBalancer that must be defined in the same stack as the InstanceGroup / AutoScalingGroup. It looks up the LB and manipulates the "Instances" property to include the new instance.

There are problems with this:

  • New implementations of load balancers or LB-like things in Heat require us to update the InstanceGroup code to deal with their differing interfaces.
  • It won't work for the new autoscale API implementation because the LoadBalancer resource will live in a different stack that is inaccessible to the AS API (the user's stack).
  • It's not general to other types of shared resource integration.

One possible way to rectify this is to introduce a new resource that is meant only for associating instances with load balancers. This resource would be specific to the type of load balancer being integrated with, and should ideally take an underlying LB resource ID and an IP address (supplied from an attribute of the instance).

So, for example, there would be one resource called OS::Neutron::LBMember:

  • OS::Neutron::LBMember
    • Properties:
      • server_ip: The IP of the server. Usually provided with an Fn::GetAttr on the server resource.
      • loadbalancer: The ID of the load balancer. Usually provided with a Ref to the load balancer resource.

It's worth noting that this resource actually matches up very well to the Neutron API, which represents membership in a load balancer as a separate ReST object.

The outcome of this design is that we would be able to scale up pairs of instances and LBMembers, the LBMember would take care of LB association, and we wouldn't need to have any specific knowledge of load balancers in the AS API implementation.
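
For example (purely illustrative, reusing the document's own template notation; the attribute name PrivateIp and the placeholder LB ID are assumptions), a scaling group's 'resources' mapping could pair each instance with an LBMember so that every scaling unit registers itself with the shared load balancer and deregisters when it is deleted:

resources:
    my_web_server: {type: AWS::EC2::Instance}
    my_lb_member:
        type: OS::Neutron::LBMember
        properties:
            loadbalancer: <ID of the shared load balancer>
            server_ip: {Fn::GetAtt: [my_web_server, PrivateIp]}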

Updates

As of Icehouse, AWS::AutoScaling::AutoScalingGroup supports an UpdatePolicy for rolling updates. It adds 3 pieces of information:

  • MinInstancesInService: how many instances need to stay in service during the update. Defaults to 0.
  • MaxBatchSize: the maximum number of instances replaced per batch. Defaults to 1.
  • PauseTime: how long to pause between batches. Defaults to PT0S (0 seconds).

It seems we could default to doing rolling updates. Setting MaxBatchSize equal to MaxSize would be equivalent to a non-rolling update. We need to store the additional information in the scaling group.
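
As a sketch of the batching this implies (the exact rules are assumptions; Heat's real UpdatePolicy handling may differ in detail), a group's members could be partitioned into update batches like this:

def rolling_update_batches(members, max_batch_size=1, min_instances_in_service=0):
    """Split a group's members into batches for a rolling update.

    Each batch is at most max_batch_size, and is also limited so that the
    members outside the batch keep at least min_instances_in_service running
    while the batch is being replaced. PauseTime would be applied between
    batches by the caller.
    """
    batch_limit = max(1, min(max_batch_size, len(members) - min_instances_in_service))
    return [members[i:i + batch_limit] for i in range(0, len(members), batch_limit)]

With max_batch_size equal to the group size and min_instances_in_service at 0, this degenerates to a single batch, i.e. the non-rolling update mentioned above.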

Authentication

  • how do we authenticate the request from ceilometer to AS?
  • is this a special unprivileged user "ceilometer-alarmer" that we trust?
  • The AS API should have access to a Trust for the user who owns the resources it manages, and pass that Trust to Heat.

Securing Webhooks

Many systems just treat the webhook URL as a secret (with a big random UUID in it, generated *per client*). I think this is actually fine, but it has two problems we can easily solve:

  • there are lots of places other than the actual SSL stream that URLs can be seen. Logs of the Autoscale HTTP server, for example.
  • it's susceptible to replay attacks (if you sniff one request, you can send the same request again to keep doing the same operation, like scaling up or down)

The first one is easy to solve by putting some important data into the POST body. The second one can be solved with a nonce plus a timestamp component.

The API for creating a webhook in the autoscale server should return two things, the webhook URL and a random signing secret. When Ceilometer (or any client) hits the webhook URL, it should do the following:

  • include a "timestamp" argument with the current timestamp
  • include another random nonce
  • sign the request with the signing secret

(to solve the first problem from above, the timestamp and nonce should be in the POST request body instead of the URL)

And anytime the AS service receives a webhook it should:

  • verify the signature
  • ensure that the timestamp is reasonably recent (no more than a few minutes old, and no more than a few minutes into the future)
  • check to see if the timestamp+nonce has been used recently (we only need to store the nonces used within that "reasonable" time window)

On top of all of this, of course, webhooks should be revokable.
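
A minimal sketch of the scheme described above, assuming HMAC-SHA256 over the POST body and an in-memory nonce cache (the field names and the 5-minute window are illustrative, not an agreed wire format):

import hashlib
import hmac
import json
import os
import time

WINDOW = 300            # "reasonably recent" = within 5 minutes either way (illustrative)
_seen_nonces = {}       # nonce -> timestamp; only kept for the window above

def sign_request(signing_secret, args=None):
    """Client side (e.g. Ceilometer): build a signed POST body for a webhook call."""
    body = json.dumps({'timestamp': int(time.time()),
                       'nonce': os.urandom(16).hex(),
                       'args': args or {}}, sort_keys=True)
    signature = hmac.new(signing_secret.encode(), body.encode(), hashlib.sha256).hexdigest()
    return body, signature

def verify_request(signing_secret, body, signature):
    """AS service side: check the signature, freshness, and nonce uniqueness."""
    expected = hmac.new(signing_secret.encode(), body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False
    payload = json.loads(body)
    now = int(time.time())
    if abs(now - payload['timestamp']) > WINDOW:
        return False                                   # too old, or too far in the future
    if payload['nonce'] in _seen_nonces:
        return False                                   # replay of a recently seen request
    _seen_nonces[payload['nonce']] = now
    for nonce, seen_at in list(_seen_nonces.items()):  # drop nonces outside the window
        if now - seen_at > WINDOW:
            del _seen_nonces[nonce]
    return True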

[Qu] If we do this in the context of Heat (the DB is not accessible from the API daemon):

  1. We are going to have to send all webhooks to the heat-engine for verification.
  2. This is because we can't check the UUID in the API, which makes a DoS attack very easy. Any idea on how to solve this?

[An] This doesn't sound like a unique problem; it should be solved by rate limiting, as other parts of OpenStack do.

[Qu] Why make Autoscale a separate service?

[An] To clarify, service = REST server (to me)

Initially because someone wanted it separate (rackers). But I think it is the right approach long term.

Heat should not be in the business of implementing too many services internally, but rather having resources to orchestrate them.

monitoring <> Xaas.policy <> heat.resource.action()

Some cool things we could do with this:

  1. better instance HA (restarting servers when they are ill) - and smarter logic defining what is "ill"
  2. autoscaling
  3. energy saving (could be linked to autoscaling)
  4. automated backup (calling snapshots at regular time periods)
  5. autoscaling using shelving? (maybe for faster response)

I guess we could put all this into one service (an all purpose policy service)?

[Qu] What Happens to Operations Invoked During Cooldown?

If the operation is simply discarded, that could be bad: who knows if the invoker will invoke it again?

If the operation is queued until the end of cooldown, that is unlikely to ultimately accomplish much.

A better solution has the invoker itself exercise self-restraint (not invoking operations too close together in time). This is probably not difficult, since the invoker is probably operating periodically anyway.

[Qu] Should External Policies Be Supported?

The existing policy language is very limited. We could make it grander, but I am sure we cannot make it grand enough for all uses. I think it would be better to have support for external policies. In this case the autoscaling service is simply a scaling service, taking the multiplier from an external controller.