DistributedScheduler

Launchpad Entry: NovaSpec:bexar-distributed-scheduler
Created: 2010-11-16
Contributors: Eric Day, Paul Voccio, Vish Ishaya, Ed Leafe

Summary

This was originally named 'distributed data store', but after design summit talks we realized it would be best implemented as an optional scheduler. The original discussion proposed requiring a local data store such as SQLite at each worker/scheduler level and dropping support for a central database such as PostgreSQL or MySQL. This evolved into a scheduler implementation concern rather than something required throughout the entire architecture, which allows both types of configuration. A simple install may use the default scheduler and a central data store, whereas a complex, multi-zone install will use the distributed scheduler, keep local copies of the data, and push data up through a data aggregation channel.

In order to accomplish this, the nova core needs to provide the classes and functionality required to write such a scheduler. Keeping it optional will make a central data store possible, but there will most likely be some duplication of data, and if the data lives in a SQL data store, foreign keys and other direct references will not be used. There are multiple steps to reach a fully functional distributed scheduler, and this spec and related Launchpad blueprints will describe the work that needs to be done.

Release Note

This set of changes will not impact users consuming the external HTTP API. Those who deploy nova will have the option to run a different scheduler that can operate at multiple layers to span various zones (logical or geographic). The default scheduler may change from the current single worker to something simpler: no worker at all. So if you are upgrading a simple installation, you may no longer need to run bin/nova-scheduler. There may also be database schema updates that require migration, but those will be determined during implementation.

Rationale

Large scale deployments need a more robust scheduler and data store. We also need a scheduler and system that can help minimize the scope of any compromised nodes. The system also needs to operate while some nodes are down; this includes tolerating network partitions and minimizing downtime when any part of the system fails. Allowing all workers (especially compute) to have full access to a central data store is not acceptable for some deployments.

User stories

Rackspace currently uses a distributed model for both scalability and high availability and these same principles need to be brought into Nova to make it suitable for large scale deployments.

Assumptions

We are using a durable message queue, so messages will not be lost in normal operation. In cases of serious failure where messages are lost, some operations or API requests may be lost and would need to be retried.

Design

There are a number of steps that need to be taken first in order to support writing a distributed scheduler. They are:

Currently there is some duplication in the code paths from the HTTP API modules into the compute/volume/network handling classes. The HTTP API modules should be very thin layers around a well-defined internal API living inside the relevant service module (compute, network, or volume). We must push as much logic as possible from nova.api into nova.compute/network/volume.
Change API request processing so the API servers are not writing any data. Operations should be pushed down to the relevant worker. For example, on run_instances, we want the compute worker to be the process initiating writes to the data store.
Refactor the current scheduler to not require a worker and to push more scheduling options into the scheduler (instead of compute/volume/network). Whether a worker needs to run should be up to the scheduler you choose. Some schedulers will have zero workers, while others may have multiple layers.
Introduce primitives that will allow for data aggregation from workers up through scheduling layers and API nodes into the core Nova classes. Various scheduler implementations can then choose to use these in order to construct an architecture that fits their needs.
Remove the need for a central data store by using the data aggregation abilities, and introduce functionality via message queues for parts of the system that need to look up information from other sources. This means that some configurations can keep a local data store of the data they need in order to answer requests (probably using SQLite) instead of all workers/API servers relying on a central database such as PostgreSQL or MySQL.
Refactor the rpc module message passing to allow for signed messages so queue consumers can verify the source. This is needed for security considerations.
Write a default distributed scheduler using the existing and new core Nova functionality.
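
The spec does not say how rpc messages would be signed. As one possible illustration of the signed-message step, the sketch below uses an HMAC over a JSON-serialized payload with a shared secret. The function names, the envelope layout, and the choice of HMAC-SHA256 are all assumptions made for this example, not part of the Nova rpc module.

```python
import hashlib
import hmac
import json


def sign_message(payload, key):
    """Wrap a payload in an envelope carrying its HMAC-SHA256 signature.

    Hypothetical helper: `payload` is any JSON-serializable dict and
    `key` is a shared secret as bytes.
    """
    # sort_keys gives a stable serialization, so the signature is reproducible.
    body = json.dumps(payload, sort_keys=True)
    signature = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_message(envelope, key):
    """Return the decoded payload, or raise ValueError on a bad signature."""
    expected = hmac.new(
        key, envelope["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, envelope["signature"]):
        raise ValueError("message signature mismatch")
    return json.loads(envelope["body"])
```

With a single shared key, a consumer can only tell that the sender holds the key. Identifying an individual source would need per-source keys or asymmetric signatures, which this sketch does not attempt.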

Implementation

For simplicity's sake, the discussion will be limited to scheduling a new compute instance. The process for volume and network will be analogous, but much simpler.

I am also assuming that OpenStack is deployed with nested zones. In the case of a simple deployment with a single zone, the aggregation of intermediate results will not apply, and all this will happen in a single scheduler.

In order to create a new instance, we need to be able to locate a compute host that can accommodate the new instance. Another way of saying that is that the host meets certain requirements: it must be able to host the desired OS; it must have available RAM and disk space, etc. I will refer to these criteria as requirements.

In addition, there may be other criteria that, instead of being absolutely required, may be used to determine the desirability of placing the new instance on one qualified host over another. For example, we might want to place a new instance for a customer on a host that doesn't already have instances for that customer, in order to spread the downside risk in the event of hardware failures. While this is desirable, if the only available host did have another instance for that customer, we would want to create the requested instance anyway. I will refer to criteria of this kind as weightings, as they will be used to weight hosts' appropriateness for the new instance. Each company deploying OpenStack will most likely have its own weighting criteria.
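
The distinction between requirements and weightings can be sketched as a hard filter followed by a sort. The host dictionary layout, the field names, and the convention that a lower weight is better are assumptions made for this example; the weighting shown is the customer-spreading rule described above.

```python
def meets_requirements(host, requirements):
    """Hard filter: the host must satisfy every requirement."""
    return (
        requirements["os"] in host["supported_os"]
        and host["free_ram_mb"] >= requirements["ram_mb"]
        and host["free_disk_gb"] >= requirements["disk_gb"]
    )


def weigh(host, customer):
    """Soft preference (lower is better): count of this customer's
    instances already on the host."""
    return host["instances_by_customer"].get(customer, 0)


def rank_hosts(hosts, requirements, customer):
    """Drop unqualified hosts, then order the rest by weighting."""
    qualified = [h for h in hosts if meets_requirements(h, requirements)]
    return sorted(qualified, key=lambda h: weigh(h, customer))
```

A host that already runs the customer's instances is ranked later but never excluded, so the request still succeeds when it is the only qualified host.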

The current design for zones will have the hosts regularly post their status to their ZoneManager (a component of the scheduler service), which will maintain that information. It will be able to present aggregated answers (e.g., "Can you create Windows instances?") based on its hosts, as well as return information about individual hosts. Higher-level zones (i.e., that don't have hosts) will aggregate the results of their child zones. E.g., if any of the child zones can create Windows instances, the parent zone will report that it has Windows capability.
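
As a rough illustration of this aggregation, the sketch below models a zone's view of capabilities as the union of what its hosts last reported and what its child zones report. The class name `ZoneCapabilityView` and its methods are hypothetical and are not the actual ZoneManager interface.

```python
class ZoneCapabilityView:
    """Hypothetical stand-in for the capability-tracking part of a ZoneManager."""

    def __init__(self, name):
        self.name = name
        self.host_capabilities = {}  # host name -> set of capability strings
        self.children = []           # child ZoneCapabilityView objects

    def update_host(self, host_name, capabilities):
        """Record the latest periodic status post from a host."""
        self.host_capabilities[host_name] = set(capabilities)

    def capabilities(self):
        """Union of capabilities across this zone's hosts and child zones."""
        combined = set()
        for caps in self.host_capabilities.values():
            combined |= caps
        for child in self.children:
            combined |= child.capabilities()
        return combined

    def can(self, capability):
        """Aggregated answer, e.g. can('windows')."""
        return capability in self.capabilities()
```

For example, a parent zone with no hosts of its own answers `can('windows')` with `True` as soon as any host in any child zone has reported that capability.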

When a request for a new instance is received by the API, it will pass that request to the scheduler service for that zone. If the zone has child zones, the request will be passed to each child until it reaches the "lowest" level of the zone tree. At every step along this zone traversal, the ZoneManager will compare the requirements to its capabilities; if there is no match, none of the hosts "under" it will be able to handle the request, so it returns an empty list. Otherwise, it forwards the request to the child zones and returns the aggregation of their responses.
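
The traversal with early pruning might look like the following sketch. Zones are modeled as plain dictionaries and requirements as a set of capability strings; both are simplifications assumed for this example, and in a real deployment each recursive call would be a message to another zone's scheduler rather than a local function call.

```python
def find_candidates(zone, required):
    """Return names of hosts at or below `zone` that offer every
    capability in the set `required`.

    `zone` is a hypothetical dict with keys:
      "capabilities": set aggregated from everything under the zone
      "hosts":        dict of host name -> set of capabilities
      "children":     list of child zone dicts
    """
    # Prune: if the aggregated capabilities do not cover the request,
    # nothing under this zone can, so skip the whole subtree.
    if not required <= zone["capabilities"]:
        return []

    # Hosts managed directly by this zone.
    results = [
        name for name, caps in zone["hosts"].items() if required <= caps
    ]
    # Forward to each child and aggregate the responses.
    for child in zone["children"]:
        results.extend(find_candidates(child, required))
    return results
```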

When a zone has hosts, it will run the scheduler algorithm, which will act on the host data to both eliminate hosts that cannot meet the requirements, and then order the hosts that meet the requirements, based on the weightings. Since a zone may manage large numbers of hosts, returning all matching hosts, especially those with low weightings, would be inefficient. There will be a configuration option that will set the number of hosts to return, which we will call N here.
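
A minimal sketch of this per-zone step, assuming the requirement check and the weighting are supplied as callables and that a lower weight means a more desirable host. The function name and parameters are hypothetical. `heapq.nsmallest` keeps only the best N entries without sorting every qualified host.

```python
import heapq


def select_top_hosts(hosts, requirements_met, weight, n):
    """Filter out hosts that fail the requirements, then return the
    N best remaining hosts ordered by ascending weight."""
    qualified = (h for h in hosts if requirements_met(h))
    return heapq.nsmallest(n, qualified, key=weight)
```

For instance, `select_top_hosts(hosts, lambda h: h["free_ram_mb"] >= 1024, lambda h: h["load"], 2)` would return at most two hosts with enough RAM, least loaded first.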

A parent zone will receive a list from each child zone, and may have its own list if it also has hosts. Its scheduler will then join these lists, run the same weighting algorithm against the new, larger list, and return the top N hosts to its parent. This repeats until the result reaches the zone that originated the request. That zone will then issue the create request to the first entry in the list; this will either succeed or fail. If it fails, the list is iterated until either one request succeeds or we reach the end.
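
The merge and the create-with-fallback loop can be sketched as below. As a simplification, each entry is assumed to be a `(weight, host)` pair whose weight was computed lower down and is reused, rather than re-running the weighting algorithm on the joined list; lower weight is assumed to be better. The function names and the `create` callable are hypothetical.

```python
import heapq


def merge_results(own, child_lists, n):
    """Join this zone's list with its children's lists and keep the
    N best (lowest-weight) entries to pass up to the parent."""
    combined = list(own)
    for child in child_lists:
        combined.extend(child)
    return heapq.nsmallest(n, combined, key=lambda entry: entry[0])


def create_on_first_available(candidates, create):
    """In the originating zone, try each candidate in order.

    `create` is a callable taking a host and returning True on success.
    Returns the host that accepted the instance, or None if all failed.
    """
    for _weight, host in candidates:
        if create(host):
            return host
    return None
```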

This distributed design will scale well as the size of the deployment grows, as new zones can be added to logically group and thus break up the load of evaluating/weighting potential hosts across the entire deployment. Limiting the response for each zone to a fixed number of hosts will also help minimize traffic across zones.

While this design attempts to be flexible enough that every deployment can determine how it wants to select hosts for new instances, there is one potential problem: the information that a ZoneManager object has for its hosts is only what the hosts tell it; it doesn't query the hosts. In other words, the scheduler needs to ensure that the host information needed to evaluate the criteria in its algorithm has been sent by the hosts during their periodic updates. Some of the data sent by the hosts is fixed, since it will be required by diagnostics, logging, status updates, etc., but some will depend on the particular scheduler selection algorithm. I am looking for a way to define these attributes in a single place that is importable from both the scheduler and compute services, so that the compute node can read in these attributes and post their values to its ZoneManager, and the scheduler service can do the same. The details of this remain to be worked out.
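
Since the details are still open, the following is only one possible shape for such a shared definition: a single tuple of attribute names that the compute side uses to build its periodic report and the scheduler side uses to check that a report is complete. The attribute names and both helper functions are hypothetical.

```python
# Hypothetical shared module, importable by both compute and scheduler.
# The attribute names are illustrative only.
REPORTED_ATTRIBUTES = ("supported_os", "free_ram_mb", "free_disk_gb")


def build_status_report(host_state):
    """Compute side: extract exactly the attributes the scheduler expects.

    Raises KeyError if the host cannot supply one of them.
    """
    missing = [name for name in REPORTED_ATTRIBUTES if name not in host_state]
    if missing:
        raise KeyError("host state lacks: " + ", ".join(missing))
    return {name: host_state[name] for name in REPORTED_ATTRIBUTES}


def missing_attributes(report):
    """Scheduler side: list expected attributes absent from a report."""
    return [name for name in REPORTED_ATTRIBUTES if name not in report]
```

Keeping the list in one place means that adding a criterion to the selection algorithm requires a single change that both services pick up.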

UI Changes

There should be no visible changes for end users; all this work will be behind the API servers.

Code Changes

Code changes should be isolated to the API, compute, rpc, and scheduler modules. They will also likely touch the network and volume modules in order to push that data up as well. They should not touch any of the mechanics of how each worker handles requests (vm/network/volume management), only how data and messages are passed between them.

Migration

Coming once implementation nears beta.

Test/Demo Plan

Unit tests will be provided as code is developed. Integration and large scale testing will be added once there is infrastructure to do so.

Unresolved issues

None.

BoF agenda and discussion

The two relevant sessions at the Bexar design summit had the following notes:

http://etherpad.openstack.org/DistributedDataStore
http://etherpad.openstack.org/multicluster