Multiple Active Scheduler Configurations/Drivers/Policies
Support for multiple active scheduler policy configurations (e.g., driver + corresponding config properties) associated with different host aggregates within a single Nova deployment.
In heterogeneous environments, different hardware pools often need to be managed under different policies. In Grizzly, basic partitioning of hosts and enforcement of compatibility between flavors and hosts during instance scheduling can already be achieved using host aggregates and FilterScheduler with AggregateInstanceExtraSpecsFilter. However, it is not possible to define, for example, different sets of filters and weights, or entirely different scheduler drivers, for different aggregates.
For example, the admin may want one pool with conservative CPU overcommit (e.g., for CPU-intensive workloads), and another pool with aggressive CPU overcommit (for workloads that are less CPU-bound).
This blueprint introduces a mechanism to overcome this limitation.
Note: while in large-scale geo-distributed environments this can be done with Cells, there is no existing solution within a single (potentially small) Nova deployment.
- An administrator partitions the managed environment into host aggregates, and associates specialized scheduler configurations (policies) to some or all of the aggregates.
- On instance provisioning, the details of the scheduler configuration are derived from the properties of the request; an overridden configuration is created and used by the scheduler when handling the incoming request.
Configuration (user story 1)
The administrator will:
- Specify the 'default' scheduler driver and policy under the [DEFAULT] section in nova.conf (e.g., FilterScheduler with CoreFilter) -- as usual
- Add one or more new sections to nova.conf, each specifying a scheduling policy configuration that overrides the defaults -- driver and/or associated properties. For example, [high_cpu_density] specifying FilterScheduler with CoreFilter and cpu_allocation_ratio=8, and [low_cpu_density] specifying FilterScheduler with CoreFilter and cpu_allocation_ratio=1. Note that in this example, since the driver and filters are the same as the defaults, it would not be mandatory to specify them in the policy sections.
- Specify in nova.conf which nova configuration options can be overridden by the policies (using a new property -- e.g., scheduler_policy_overrides=scheduler_default_filters,cpu_allocation_ratio)
- Specify in nova.conf which configuration selection mechanism should be used (e.g., AvailabilityZoneBasedSchedulerPolicyConfigurationSelection)
- Create and populate with hosts one or more host aggregates, as usual.
- Set a new metadata key-value pair for one or more of the aggregates, specifying the desired policy to be used for scheduling instances in the corresponding aggregate (e.g., "sched_policy=high_cpu_density").
Example (partial) nova.conf:

[DEFAULT]
scheduler_default_filters = AvailabilityZoneFilter, CoreFilter
cpu_allocation_ratio = 4.0
# A class implementing the method for selecting the scheduler policy configuration
# (based on properties derived from the incoming provisioning request)
# Possible options:
# - availability zone (would typically be used in conjunction with AvailabilityZoneFilter)
# - tenant id (would typically be used in conjunction with AggregateMultiTenancyIsolationFilter)
# - flavor extra specs (would typically be used in conjunction with AggregateInstanceExtraSpecsFilter)
# - explicit hint (would typically be used in conjunction with AggregateSchedulerPolicyConfigurationFilter)
scheduler_policy_selection_class = AvailabilityZoneBasedSchedulerPolicyConfigurationSelection
# A list of scheduler configuration options that a policy can override
scheduler_policy_overrides = cpu_allocation_ratio

[low_cpu_density]
cpu_allocation_ratio = 1.0

[high_cpu_density]
cpu_allocation_ratio = 8.0
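The example above can be exercised with a short, stdlib-only Python sketch (not Nova code). Note that configparser already layers each section's values over [DEFAULT], which conveniently mirrors the intended override semantics:

```python
# Illustrative only: reading policy sections from a nova.conf-style file.
# Option names follow the example above; nothing here is actual Nova code.
import configparser

EXAMPLE_CONF = """\
[DEFAULT]
scheduler_default_filters = AvailabilityZoneFilter, CoreFilter
cpu_allocation_ratio = 4.0
scheduler_policy_overrides = cpu_allocation_ratio

[low_cpu_density]
cpu_allocation_ratio = 1.0

[high_cpu_density]
cpu_allocation_ratio = 8.0
"""

conf = configparser.ConfigParser()
conf.read_string(EXAMPLE_CONF)

# Every non-DEFAULT section is a policy; reading an option through a policy
# section falls back to [DEFAULT] when the policy does not override it.
policy_names = conf.sections()
overridable = [opt.strip()
               for opt in conf["DEFAULT"]["scheduler_policy_overrides"].split(",")]
```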
Invocation (user story 2)
The user will invoke an instance provisioning request as usual (unless the configured selection mechanism is based on a new scheduler hint). For example:
$ nova boot --image 1 --flavor 1 --availability-zone cpu_intensive_az my-first-server
Note: when no policy is specified, the default scheduler configuration will be used, as before.
As stated above, the main goal of this blueprint is to enable heterogeneous scheduling, leveraging partitioning of the environment into host aggregates. For simplicity, let's assume that each such aggregate may have a potentially different hardware and/or scheduling policy configuration. We assume that the partitioning is static, and that the criterion for selecting the target host aggregate is deterministic -- based on the properties of the aggregates and the properties of the provisioning request. Compared to the way FilterScheduler works today, the idea is essentially to divide the process into two stages: the first stage selects the host aggregate, and the second stage applies the (potentially customized) filtering and weighting of hosts within that aggregate. If the first stage fails (i.e., no single aggregate matches the incoming request), we fall back to applying the default set of filters and weights (as opposed to the set associated with a particular aggregate).
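The two-stage flow described above can be sketched as follows. This is purely illustrative (data structures and names are hypothetical, not Nova's): stage 1 selects the aggregate and merged policy configuration by availability zone, and stage 2 applies a CoreFilter-style check under the selected overcommit ratio:

```python
# Illustrative sketch of two-stage scheduling; all names are hypothetical.
DEFAULT_CONF = {"cpu_allocation_ratio": 4.0}
POLICIES = {
    "high_cpu_density": {"cpu_allocation_ratio": 8.0},
    "low_cpu_density": {"cpu_allocation_ratio": 1.0},
}
AGGREGATES = [
    {"availability_zone": "cpu_intensive_az",
     "metadata": {"sched_policy": "low_cpu_density"},
     "hosts": ["host1", "host2"]},
    {"availability_zone": "general_az",
     "metadata": {},
     "hosts": ["host3"]},
]

def select_policy_conf(request):
    """Stage 1: pick the aggregate matching the request's availability zone
    and merge its policy's overrides on top of the defaults. If no aggregate
    matches, fall back to the defaults over all hosts."""
    for agg in AGGREGATES:
        if agg["availability_zone"] == request.get("availability_zone"):
            conf = dict(DEFAULT_CONF)
            conf.update(POLICIES.get(agg["metadata"].get("sched_policy"), {}))
            return conf, agg["hosts"]
    return dict(DEFAULT_CONF), [h for a in AGGREGATES for h in a["hosts"]]

def schedule(request, used_vcpus, phys_cpus):
    """Stage 2: CoreFilter-style filtering under the selected overcommit
    ratio -- a host passes if used + requested vcpus fit within
    physical cpus * cpu_allocation_ratio."""
    conf, hosts = select_policy_conf(request)
    ratio = conf["cpu_allocation_ratio"]
    return [h for h in hosts
            if used_vcpus[h] + request["vcpus"] <= phys_cpus[h] * ratio]
```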
Several aspects have been considered in the design of such a mechanism.
- Where should the first stage (selection of the aggregate and/or scheduler policy configuration) be implemented? This could be either in the Manager (right before invoking the driver), or in a new scheduler driver (which would then invoke one of the 'regular' drivers to select the host). The former approach seems more appropriate, because the new logic is not a self-contained scheduler driver, but rather a 'wrapper' around one.
- How will the selected driver "know" which hosts should be considered? In the current implementation, the behavior of the driver does not change, meaning that it needs to be 'compliant' with the way an aggregate is selected in stage 1, and ensure that hosts in other aggregates are not selected. With FilterScheduler, this can be done by using the filter that corresponds to the policy config selection. Going forward, it might make sense to build a mechanism that restricts the scope of hosts 'visible' to the driver to those belonging to the selected aggregate(s). However, this would be a fairly significant change to the way the scheduler and HostManager work today, and can be made later on.
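As a rough illustration of the 'compliant filter' idea, the following hypothetical filter (the class name is borrowed from the explicit-hint option mentioned earlier, but the implementation is purely a sketch, not Nova's) passes only hosts that belong to an aggregate tagged with the policy selected in stage 1:

```python
# Illustrative sketch; not Nova's actual filter interface or implementation.
class AggregateSchedulerPolicyConfigurationFilter:
    """Pass a host only if one of its aggregates is tagged with the
    sched_policy selected for this request (or if no policy was selected)."""

    def host_passes(self, host_aggregates, filter_properties):
        wanted = filter_properties.get("sched_policy")
        if wanted is None:
            return True  # no policy selected: keep default behavior
        # The host passes if any of its aggregates carries the wanted policy.
        return any(agg.get("metadata", {}).get("sched_policy") == wanted
                   for agg in host_aggregates)
```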
- Where should we persist scheduling policy configurations? The association between an aggregate and its scheduling policy configuration is kept as a property of the aggregate. In order to avoid inconsistency between different aggregates applying the same policy, the aggregate keeps only a reference to the policy (id, or unique name), and the configuration parameters themselves are kept separately, without duplication. There has been quite a lot of discussion on whether or not these should live in the DB. However, there are several reasons to keep them in nova.conf: 1) scheduler config options are currently in nova.conf, and moving them elsewhere would require significant code refactoring; 2) we expect that in most cases the number of different configurations will be small and static; 3) making scheduler config options programmable could be a good idea regardless of this blueprint, and can be implemented as part of a separate blueprint.
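The whitelist semantics of scheduler_policy_overrides could be applied with a helper along these lines (hypothetical and illustrative only; option names follow the earlier example):

```python
# Illustrative sketch: merge a policy section over the defaults, but only
# for options the whitelist (scheduler_policy_overrides) allows.
def apply_policy_overrides(defaults, policy_section, whitelist):
    """Return a per-request configuration: a copy of the defaults with only
    whitelisted options replaced by the policy's values."""
    conf = dict(defaults)
    for key, value in policy_section.items():
        if key in whitelist:
            conf[key] = value
        # Non-whitelisted options are silently ignored here; a real
        # implementation might instead log a warning or raise an error.
    return conf
```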
- How do we guarantee that policies are used consistently across aggregates? It is assumed that aggregates overriding the scheduling policy configuration are mutually disjoint -- i.e., no host belongs to two (or more) host aggregates that specify different scheduling policy configurations (otherwise, certain hosts would be managed under two different policies, which may be misleading and wrong). In the short term, it seems acceptable for the admin to enforce this (at large scale, such aggregates would typically be created and managed programmatically anyway). Going forward, it might make sense to introduce the semantics of disjoint aggregates (perhaps of a certain 'type'), enforced by Nova.
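The disjointness validation that an admin (or, eventually, Nova itself) would need could be as simple as the following sketch (data shapes are illustrative):

```python
# Illustrative sketch: detect hosts that belong to aggregates carrying
# conflicting sched_policy values.
def find_policy_conflicts(aggregates):
    """Return the sorted list of hosts that appear in two or more
    aggregates specifying different scheduling policies."""
    host_policies = {}
    conflicts = set()
    for agg in aggregates:
        policy = agg.get("metadata", {}).get("sched_policy")
        if policy is None:
            continue  # aggregates without a policy never conflict
        for host in agg["hosts"]:
            if host in host_policies and host_policies[host] != policy:
                conflicts.add(host)
            host_policies.setdefault(host, policy)
    return sorted(conflicts)
```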