Latest revision as of 20:00, 26 November 2013

Multiple Active Scheduler Configurations/Drivers/Policies

Summary

Support for multiple active scheduler policy configurations (e.g., driver + corresponding config properties) associated with different host aggregates within a single Nova deployment.

Blueprint: https://blueprints.launchpad.net/nova/+spec/multiple-scheduler-drivers

Rationale

In heterogeneous environments, it is often required that different hardware pools are managed under different policies. In Grizzly, basic partitioning of hosts and enforcement of compatibility between flavors and hosts during instance scheduling can already be implemented using host aggregates and FilterScheduler with AggregateInstanceExtraSpecsFilter. However, it is not possible to define, for example, different sets of filters and weights, or even entirely different scheduler drivers, associated with different aggregates. For example, the admin may want to have a pool with conservative CPU overcommit (e.g., for CPU-intensive workloads), and another pool with aggressive CPU overcommit (for workloads that are less CPU-bound). This blueprint introduces a mechanism to overcome this limitation.
Note: while in large-scale geo-distributed environments this can be done with Cells, there is no existing solution within a single (potentially small) Nova deployment.

User Stories

  1. An administrator partitions the managed environment into host aggregates, and associates specialized scheduler configurations (policies) to some or all of the aggregates.
  2. On instance provisioning, the details of the scheduler configuration are derived from the properties of the request; an overridden configuration is then created and used by the scheduler when handling the incoming request.

Usage Details

Configuration (user story 1)

The administrator will:

  1. Specify the 'default' scheduler driver and policy under the [DEFAULT] section in nova.conf (e.g., FilterScheduler with CoreFilter) – as usual.
  2. Add to nova.conf one or more new sections dedicated to specifying the different scheduling policy configurations, overriding the defaults – driver and/or associated properties. For example, [high_cpu_density] specifying FilterScheduler with CoreFilter and cpu_allocation_ratio=8, and [low_cpu_density] specifying FilterScheduler with CoreFilter and cpu_allocation_ratio=1. Note that in this example, since the driver and filters are the same, it would not be mandatory to specify them in the specific policy sections.
  3. Specify in nova.conf which Nova configuration options can be overridden by the policies (using a new property – e.g., scheduler_policy_overrides=scheduler_default_filters,cpu_allocation_ratio).
  4. Specify in nova.conf which configuration selection mechanism should be used (e.g., AvailabilityZoneBasedSchedulerPolicyConfigurationSelection).
  5. Create and populate with hosts one or more host aggregates, as usual.
  6. Set a new metadata key-value pair for one or more of the aggregates, specifying the desired policy to be used for scheduling instances in the corresponding aggregate (e.g., "sched_policy=high_cpu_density").
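Steps 2 and 3 above can be sketched in code. The following is a minimal, hypothetical Python illustration (not Nova's actual implementation): it layers a policy section from nova.conf over the [DEFAULT] section, honoring only the options whitelisted in scheduler_policy_overrides. The function name effective_config and the merging logic are assumptions for illustration only.

```python
# Illustrative sketch: build an "overridden" scheduler configuration by
# layering a policy section over [DEFAULT], honoring only whitelisted options.
import configparser

NOVA_CONF = """
[DEFAULT]
scheduler_driver = nova.scheduler.filter_scheduler.FilterScheduler
cpu_allocation_ratio = 4.0
scheduler_policy_overrides = cpu_allocation_ratio

[high_cpu_density]
cpu_allocation_ratio = 8.0
scheduler_driver = some.other.Driver
"""

def effective_config(conf, policy):
    """Merge a policy section over the defaults, allowing only options
    listed in scheduler_policy_overrides to be overridden."""
    allowed = {o.strip() for o in
               conf.get("DEFAULT", "scheduler_policy_overrides").split(",")}
    merged = dict(conf.items("DEFAULT"))
    if policy and conf.has_section(policy):
        for key, value in conf.items(policy):
            if key in allowed:  # non-whitelisted overrides are ignored
                merged[key] = value
    return merged

conf = configparser.ConfigParser()
conf.read_string(NOVA_CONF)
cfg = effective_config(conf, "high_cpu_density")
print(cfg["cpu_allocation_ratio"])  # overridden by the policy section
print(cfg["scheduler_driver"])      # not whitelisted, so the default stays
```

Note how scheduler_driver in [high_cpu_density] is silently ignored because it is not listed in scheduler_policy_overrides; with no policy selected, the defaults apply unchanged.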

Example (partial) nova.conf:

[DEFAULT]
scheduler_driver=nova.scheduler.filter_scheduler.FilterScheduler
scheduler_default_filters = AvailabilityZoneFilter, CoreFilter
cpu_allocation_ratio = 4.0

# A class implementing the method for selecting scheduler policy configuration (based on properties derived from the incoming provisioning request)
# Possible options:
# - availability zone (would be typically used in conjunction with AvailabilityZoneFilter)
# - tenant id (would be typically used in conjunction with AggregateMultiTenancyIsolationFilter)
# - flavor extra specs (would be typically used in conjunction with AggregateInstanceExtraSpecsFilter)
# - explicit hint (would be typically used in conjunction with AggregateSchedulerPolicyConfigurationFilter)

scheduler_policy_configuration_selection=AvailabilityZoneBasedSchedulerPolicyConfigurationSelection

# A list of configuration options that a policy is allowed to override for this scheduler
scheduler_policy_overrides = cpu_allocation_ratio

[low_cpu_density]
cpu_allocation_ratio = 1.0

[high_cpu_density]
cpu_allocation_ratio = 8.0

Invocation (user story 2)

The user will invoke an instance provisioning request (as usual, unless the config selection mechanism is based on a new hint). For example:

$ nova boot --image 1 --flavor 1 --availability-zone cpu_intensive_az my-first-server

Note: when no policy is specified, the default scheduler configuration is used, as before.

Discussion

As stated above, the main goal of this blueprint is to enable heterogeneous scheduling by leveraging the partitioning of the environment into host aggregates. For simplicity, assume that each such aggregate may have potentially different hardware and/or scheduling policy configuration. We assume that the partitioning is static, and that the criteria for selecting the target host aggregate are deterministic -- based on the properties of the aggregates and the properties of the provisioning request. Compared to the way FilterScheduler works today, the idea is essentially to divide the process into two stages -- the first stage selects the host aggregate, and the second stage applies potentially customized filtering and weighting of hosts within the aggregate. If the first stage fails (i.e., there is no single aggregate matching the incoming request), we fall back to applying the default set of filters and weights (as opposed to those associated with a particular aggregate).
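The two-stage flow described above can be sketched as follows. This is a hypothetical Python illustration (the function schedule, the aggregate dictionaries, and the "az"/"sched_policy" keys are assumptions for illustration, not Nova APIs): stage 1 picks the aggregate matching the request and its policy, and falls back to the defaults over all hosts when no single aggregate matches.

```python
# Hypothetical two-stage scheduling sketch; names are illustrative, not Nova code.
def schedule(request, aggregates, default_policy):
    # Stage 1: select the host aggregate (and its policy) matching the request.
    matches = [a for a in aggregates if a.get("az") == request["az"]]
    if len(matches) == 1:
        policy = matches[0].get("sched_policy", default_policy)
        hosts = matches[0]["hosts"]
    else:
        # No single matching aggregate: fall back to the default configuration
        # applied over all hosts.
        policy = default_policy
        hosts = [h for a in aggregates for h in a["hosts"]]
    # Stage 2 (elided): filter and weigh `hosts` under the selected `policy`.
    return policy, hosts

aggregates = [
    {"az": "cpu_intensive_az", "sched_policy": "low_cpu_density", "hosts": ["host1"]},
    {"az": "general_az", "hosts": ["host2", "host3"]},
]
print(schedule({"az": "cpu_intensive_az"}, aggregates, "default"))
```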

Several aspects have been considered in the design of such a mechanism.

  • Where to implement the first stage (selection of aggregate and/or scheduler policy configuration)? This could be either in the Manager (right before invoking the driver), or in a new scheduler driver (which then would invoke one of the 'regular' drivers to select the host). The former approach seems more appropriate, because the new logic is not a self-contained scheduler driver, but rather can be considered a 'wrapper'.
  • How will the selected driver "know" which hosts should be considered? In the current implementation, the behavior of the driver will not change, meaning that it will need to be 'compliant' with the way an aggregate is selected in stage 1, and ensure that hosts in other aggregates are not selected. With FilterScheduler, this can be done by using the filter which corresponds to the policy config selection. Going forward, it might make sense to build a mechanism that restricts the scope of hosts 'visible' to the driver to those belonging to the selected aggregate(s). However, this would be quite a significant change compared to the way the scheduler and HostManager work today, and can be made later on.
  • Where should we persist scheduling policy configurations? The association between an aggregate and the corresponding scheduling policy configuration is done as a property of the aggregate. In order to avoid inconsistency between different aggregates applying the same policy, the aggregate will keep only a reference to the policy (id, or unique name), and the configuration parameters themselves will be kept separately, without duplication. There has been quite a lot of discussion about whether or not this should be in the DB. However, there are several reasons to keep them in nova.conf: 1) scheduler config options are now in nova.conf, and moving them to another place would require significant code refactoring; 2) we expect that in most cases the number of different configurations will be small and static; 3) making scheduler config options programmable could be a good idea regardless of this blueprint, and can be implemented as part of a separate blueprint.
  • How do we guarantee that policies are used consistently across aggregates? It is assumed that aggregates overriding the scheduling policy configuration are mutually disjoint -- meaning that no host belongs to two (or more) host aggregates each of which specifies a different scheduling policy configuration (otherwise, certain hosts may be managed under two different policies, which may be misleading and wrong). In the short term, it seems acceptable that the admin would do the enforcement (at large scale, those aggregates would typically be created and managed programmatically anyway). Going forward, it might make sense to introduce the semantics of disjoint aggregates (maybe of a certain 'type') that would be enforced by Nova.
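The disjointness assumption in the last bullet lends itself to a simple administrative check. The following is a sketch of such a check under the assumptions of this document (the helper conflicting_hosts and the aggregate/metadata dictionary shapes are hypothetical, not part of Nova): it reports any host that appears in two aggregates carrying different sched_policy values.

```python
# Sketch of a consistency check (assumed helper, not part of Nova): find hosts
# that belong to aggregates with conflicting scheduling policies.
from collections import defaultdict

def conflicting_hosts(aggregates):
    """Return hosts that appear in aggregates with differing sched_policy."""
    policies = defaultdict(set)
    for agg in aggregates:
        policy = agg.get("metadata", {}).get("sched_policy")
        if policy is None:
            continue  # aggregates without a policy fall back to the defaults
        for host in agg["hosts"]:
            policies[host].add(policy)
    # A host is conflicting if it is covered by more than one distinct policy.
    return sorted(h for h, p in policies.items() if len(p) > 1)
```

An administrator (or tooling that creates aggregates programmatically) could run such a check after every aggregate update and refuse the update when the result is non-empty.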