DistributedScheduler

Revision as of 02:05, 17 November 2010

  • Launchpad Entry: NovaSpec:bexar-distributed-scheduler
  • Created: 2010-11-16
  • Contributors: Eric Day, Paul Voccio, Vish Ishaya

Summary

This was originally named 'distributed data store', but after design summit talks we realized this would be best implemented as an optional scheduler. The original discussion proposed requiring a local data store such as SQLite at each worker/scheduler level and no longer supporting a central database such as PostgreSQL or MySQL. This evolved into a scheduler implementation concern rather than something that must be required throughout the entire architecture, which allows both types of configuration. A simple install may use the default scheduler and a central data store, while a complex, multi-zone install will use the distributed scheduler, keep local copies of the data, and push data up through a data aggregation channel. To accomplish this, the Nova core needs to provide the appropriate classes and functionality for writing such a scheduler. There are multiple steps to reach a fully functional distributed scheduler, and this spec and the related Launchpad blueprints describe the work that needs to be done.

Release Note

This set of changes will not impact users consuming the external HTTP API. Those who deploy Nova will have the option to run a different scheduler that can operate at multiple layers to span various zones (logical or geographic). The default scheduler may change from the current single-worker implementation to something even simpler: no scheduler worker at all. If you are upgrading a simple installation, you may no longer need to run bin/nova-scheduler. There may also be database schema updates that require migration, but those will be determined during implementation.

Rationale

Large scale deployments need a more robust scheduler and data store. We also need a scheduler and system that can help minimize the scope of any compromised nodes. Allowing all workers (especially compute) to have full access to a central data store is not acceptable for some deployments.

User stories

Rackspace currently uses a distributed model for both scalability and high availability, and these same principles need to be brought into Nova to make it suitable for large-scale deployments.

Assumptions

We assume a durable message queue, so messages are not lost under normal operation. In cases of serious failure where messages are lost, some operations or API requests may be dropped and will need to be retried.
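
To illustrate what this assumption means in practice, the sketch below declares a durable queue and publishes a persistent message using the kombu library. The broker URL, exchange, and queue names are placeholders, and Nova's actual messaging code may use a different library and wire format.

    from kombu import Connection, Exchange, Queue

    # Placeholder names for illustration only; not Nova's real exchange/queues.
    nova_exchange = Exchange("nova", type="topic", durable=True)
    compute_queue = Queue("compute.host1", exchange=nova_exchange,
                          routing_key="compute.host1", durable=True)

    with Connection("amqp://guest:guest@localhost:5672//") as conn:
        producer = conn.Producer(serializer="json")
        producer.publish(
            {"method": "run_instance", "args": {"image_id": 1}},
            exchange=nova_exchange,
            routing_key="compute.host1",
            declare=[compute_queue],   # make sure the durable queue exists
            delivery_mode=2,           # persistent: broker writes it to disk
        )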

Design

There are a number of steps that need to be taken first in order to support writing a distributed scheduler (a short illustrative sketch for each step follows the list). They are:

  • Currently there is some duplication in the code paths from the HTTP API modules into the compute/volume/network handling classes. The HTTP API modules should be very thin layers around a well-defined internal API living inside the relevant service module (compute, network, or volume). We must push as much logic as possible from nova.api into nova.compute/network/volume.
  • Change API request processing so the API servers are not writing any data. Operations should be pushed down to the relevant worker. For example, on run_instances, we want the compute worker to be the process initiating writes to the data store.
  • Refactor the current scheduler to not require a worker and to push more scheduling options into the scheduler (instead of compute/volume/network). Whether a worker is needed should be up to the scheduler you choose: some schedulers will require zero workers, while others may have multiple layers.
  • Introduce primitives that will allow for data aggregation from workers up through scheduling layers and API nodes into the core Nova classes. Various scheduler implementations can then choose to use these in order to construct an architecture that fits their needs.
  • Remove the need for a central data store by using the data aggregation abilities, and introduce message-queue-based functionality for parts of the system that need to look up information from other sources. This means some configurations can keep a local data store of only the data they need in order to answer requests (probably using SQLite) instead of all workers/API servers relying on a central database such as PostgreSQL or MySQL.
  • Refactor the rpc module message passing to allow for signed messages so queue consumers can verify the source. This is needed for security considerations.
  • Write a default distributed scheduler using the existing and new core Nova functionality.
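
To make the first step concrete, here is a minimal sketch of the "thin HTTP layer" shape: the HTTP controller only translates the request and delegates to an internal compute API object. The class and attribute names are illustrative, not Nova's actual ones.

    class ComputeAPI(object):
        """Internal API: the compute logic lives here, not in the HTTP layer."""

        def run_instances(self, context, image_id, count=1):
            # validate the request, then hand it off to a compute worker
            return [{"image_id": image_id, "state": "scheduling"}
                    for _ in range(count)]

    class ServersController(object):
        """HTTP layer: translate and delegate, nothing more."""

        def __init__(self, compute_api=None):
            self.compute_api = compute_api or ComputeAPI()

        def create(self, context, body):
            server = body["server"]
            return self.compute_api.run_instances(
                context, server["imageId"], server.get("count", 1))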
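
The second step (API servers never writing to the data store) could look roughly like the following, where rpc_cast() and instance_create() are stand-ins for Nova's rpc and db layers rather than their real signatures.

    def rpc_cast(topic, message):
        """Stand-in for an asynchronous, fire-and-forget queue publish."""
        print("cast to %s: %r" % (topic, message))

    def instance_create(values):
        """Stand-in for the data store write done on the worker."""
        return dict(values, id=1)

    # API server side: enqueue the request, touch no database.
    def api_run_instances(image_id, count=1):
        for _ in range(count):
            rpc_cast("compute", {"method": "run_instance",
                                 "args": {"image_id": image_id}})

    # Compute worker side: first (and only) writer of the instance record.
    def worker_run_instance(image_id):
        instance = instance_create({"image_id": image_id, "state": "building"})
        # ...boot the VM, then update its state...
        return instance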
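
For the scheduler refactor, the idea is that scheduling becomes a pluggable driver: a simple installation can use one that runs in-process (no bin/nova-scheduler at all), while larger installations plug in drivers that forward requests to one or more tiers of scheduler workers. The classes below only sketch that shape.

    import random

    class InProcessScheduler(object):
        """Simple install: pick a host directly, no separate worker."""

        def __init__(self, hosts):
            self.hosts = hosts

        def schedule_run_instance(self, request):
            return random.choice(self.hosts)

    class QueueBackedScheduler(object):
        """Larger install: forward the decision over the queue to one or
        more scheduler workers (one tier here, but it could be several)."""

        def __init__(self, cast):
            self.cast = cast   # e.g. the rpc_cast() stand-in above

        def schedule_run_instance(self, request):
            self.cast("scheduler", {"method": "run_instance",
                                    "args": request})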
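
The data aggregation primitives could be as small as "each worker reports a summary, each layer merges what it hears and rolls it up to the layer above." The report format below is an assumption made for illustration.

    import time

    def capability_report(host, free_ram_mb, free_disk_gb):
        """What a worker periodically publishes up to its parent layer."""
        return {"host": host, "free_ram_mb": free_ram_mb,
                "free_disk_gb": free_disk_gb, "seen_at": time.time()}

    class AggregationNode(object):
        """A scheduling layer or API node: keeps the latest report per child
        and can roll them up for the layer above it."""

        def __init__(self):
            self.children = {}

        def update(self, report):
            self.children[report["host"]] = report

        def rollup(self):
            return {
                "hosts": len(self.children),
                "free_ram_mb": sum(c["free_ram_mb"]
                                   for c in self.children.values()),
                "free_disk_gb": sum(c["free_disk_gb"]
                                    for c in self.children.values()),
            }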
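
Removing the central data store then means a worker can answer most questions from a small local database (SQLite in this sketch) and ask over the queue for anything it does not hold. The schema and lookup message are invented for illustration.

    import sqlite3

    class LocalStore(object):
        """Per-worker store holding only the data this worker needs."""

        def __init__(self, path=":memory:"):
            self.db = sqlite3.connect(path)
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS instances (id INTEGER, state TEXT)")

        def save_instance(self, instance_id, state):
            self.db.execute("INSERT INTO instances VALUES (?, ?)",
                            (instance_id, state))
            self.db.commit()

        def get_state(self, instance_id):
            row = self.db.execute(
                "SELECT state FROM instances WHERE id = ?",
                (instance_id,)).fetchone()
            return row[0] if row else None

    def handle_lookup(store, message):
        """Queue-driven lookup for data another node does not hold locally."""
        return {"instance_id": message["instance_id"],
                "state": store.get_state(message["instance_id"])}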
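
For signed rpc messages, one possible shape is an HMAC over the serialized payload, verified by the consumer before dispatch. The envelope format and key handling below are assumptions, not Nova's actual wire format.

    import hashlib
    import hmac
    import json

    SHARED_KEY = b"per-service-secret"   # would come from configuration

    def sign(message):
        body = json.dumps(message, sort_keys=True)
        signature = hmac.new(SHARED_KEY, body.encode(),
                             hashlib.sha256).hexdigest()
        return {"body": body, "signature": signature}

    def verify(envelope):
        expected = hmac.new(SHARED_KEY, envelope["body"].encode(),
                            hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, envelope["signature"]):
            raise ValueError("bad signature; refusing to process message")
        return json.loads(envelope["body"])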
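
Finally, the default distributed scheduler can combine the pieces above: filter the aggregated reports down to hosts that can take the instance and pick one. This only sketches the selection step, not the full design.

    def pick_host(reports, requested_ram_mb):
        """Choose a host from aggregated capability reports (see above)."""
        candidates = [r for r in reports
                      if r["free_ram_mb"] >= requested_ram_mb]
        if not candidates:
            raise RuntimeError("no host can fit %d MB" % requested_ram_mb)
        # Prefer the host with the most free RAM; a real implementation would
        # make the weighing pluggable and zone-aware.
        return max(candidates, key=lambda r: r["free_ram_mb"])["host"]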

Implementation

To be filled in as each sub-task is researched.

UI Changes

There should be no visible changes for end users; all of this work happens behind the API servers.

Code Changes

Code changes should be isolated to the API, compute, rpc, and scheduler modules. The work will also likely touch the network and volume modules in order to push their data up as well. It should not change the mechanics of how each worker handles requests (VM/network/volume management), but rather how data and messages are passed between them.

Migration

Coming once implementation nears beta.

Test/Demo Plan

Unit tests will be provided as code is developed. Integration and large scale testing will be added once there is infrastructure to do so.

Unresolved issues

None.

BoF agenda and discussion

The two relevant sessions at the Bexar design summit had the following notes:

  • http://etherpad.openstack.org/DistributedDataStore
  • http://etherpad.openstack.org/multicluster