DistributedScheduler

  • Launchpad Entry: NovaSpec:bexar-distributed-scheduler
  • Created: 2010-11-16
  • Contributors: Eric Day, Paul Voccio, Vish Ishaya

Summary

This was originally named 'distributed data store', but after design summit talks we've realized this would be best implemented as an optional scheduler. In order to accomplish this, the nova core needs to provide the appropriate classes and functionality to be able to write such a scheduler. This means the scheduler using a distributed data store will most likely not be the default for simple installations, but rather an option suited to large-scale deployments. There are multiple steps to reach a fully functional distributed scheduler, and this spec and the related Launchpad blueprints describe the work that needs to be done.

Release Note

This set of changes will not impact users consuming the external HTTP API. Those who deploy nova will have the option to run a different scheduler that can operate at multiple layers to span various zones (logical or geographic). The default scheduler may change from the current single worker to something even simpler: no worker at all. If you are upgrading a simple installation, you may no longer need to run bin/nova-scheduler. There may also be database schema updates that require migration; those will be determined during implementation.

Rationale

Large-scale deployments need a more robust scheduler and data store. We also need a scheduler and overall system that help contain the impact of any compromised node. Allowing all workers (especially compute) to have full access to a central data store is not acceptable for some deployments.

User stories

Rackspace currently uses a distributed model for both scalability and high availability, and these same principles need to be brought into Nova to make it suitable for large-scale deployments.

Assumptions

We assume a durable message queue, so messages are not lost under normal operation. In cases of serious failure where messages are lost, some operations or API requests may be dropped and would need to be retried.
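
For illustration only (the spec does not mandate a particular messaging library), the durability assumption amounts to declaring durable queues and publishing persistent messages. A minimal sketch using the kombu AMQP library, with hypothetical exchange, queue, and routing-key names, might look like this:

    # Illustrative only: a durable queue and a persistent message with kombu.
    # The exchange, queue, and routing-key names are hypothetical.
    from kombu import Connection, Exchange, Queue

    nova_exchange = Exchange('nova', type='topic', durable=True)
    scheduler_queue = Queue('scheduler', exchange=nova_exchange,
                            routing_key='scheduler', durable=True)

    with Connection('amqp://guest:guest@localhost//') as conn:
        producer = conn.Producer(serializer='json')
        producer.publish(
            {'method': 'run_instance', 'args': {'instance_id': 'i-00000001'}},
            exchange=nova_exchange,
            routing_key='scheduler',
            declare=[scheduler_queue],
            delivery_mode=2,  # persistent: the broker writes the message to disk
        )

Even with durable queues and persistent delivery, a broker losing its disk can still drop messages; that is the serious-failure case the retry note above covers.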

Design

There are a number of steps that need to be taken first in order to support writing this kind of scheduler. They are:

  • Re-factor the current scheduler so it does not require a worker. Whether a worker runs should be up to the scheduler you choose: some schedulers will have no worker at all, while others may have multiple tiers. Simple installations should not require a scheduler worker.
  • Change API request processing so the API servers are not writing any data. Operations should be pushed down to the relevant worker. For example, on run_instances, we want the compute worker to be the process initiating writes to the data store.
  • Push some of the API logic into nova.compute and nova.scheduler. The external API server should be a thin layer over the internal Nova API classes.
  • Introduce primitives that will allow for data aggregation from workers up through scheduling layers and API nodes into the core Nova classes. Various scheduler implementations can then choose to use these in order to construct an architecture that fits their needs.
  • Re-factor the rpc message passing to allow for signed messages so queue consumers can verify the source; this is needed for security. A minimal signing sketch follows this list.
  • Remove the need for a central data store and introduce functionality for parts of the system that currently depend on this.
  • Write a default distributed scheduler using the existing and new core Nova classes.
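
As a rough illustration of the signed-message item above (the helper names and the shared-key scheme are assumptions for this sketch; the eventual rpc design may use per-worker keys or public-key signatures instead):

    # Minimal sketch of signing and verifying rpc messages with a shared secret.
    # Helper names and the shared-key approach are hypothetical.
    import hashlib
    import hmac
    import json

    def sign_message(message, key):
        """Wrap the message with an HMAC of its canonical JSON form."""
        body = json.dumps(message, sort_keys=True).encode('utf-8')
        digest = hmac.new(key, body, hashlib.sha256).hexdigest()
        return {'body': message, 'signature': digest}

    def verify_message(envelope, key):
        """Recompute the HMAC and compare it in constant time."""
        body = json.dumps(envelope['body'], sort_keys=True).encode('utf-8')
        expected = hmac.new(key, body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, envelope['signature'])

    key = b'shared-secret'
    envelope = sign_message({'method': 'run_instance',
                             'args': {'instance_id': 'i-00000001'}}, key)
    assert verify_message(envelope, key)

Canonicalizing the body with sorted keys before hashing is what lets the producer and the consumer compute the same digest for the same logical message.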

Implementation

To be filled in as each sub-task is researched.

UI Changes

There should be no visible changes to the end users, all this work will be behind the API servers.

Code Changes

Code changes should be isolated to the API, compute, rpc, and scheduler modules, and will likely also touch the network and volume modules in order to push their data up as well. The work should not change the mechanics of how each worker handles requests (vm/network/volume management), only how data and messages are passed between them.
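
As a hedged sketch of the intended layering (all class, method, and topic names below are assumptions for illustration, not Nova's real interfaces): the external API handler stays thin, the internal compute API only casts a message, and the compute worker would be the first component to touch the data store.

    # Illustrative layering sketch only; names are hypothetical.
    class FakeRpcClient(object):
        """Stand-in rpc client so the sketch can be run on its own."""

        def cast(self, topic, message):
            # A real client would publish the message onto the queue for `topic`.
            print('cast to %s: %s' % (topic, message))


    class ComputeAPI(object):
        """Internal compute API: validates the request and casts it to a worker."""

        def __init__(self, rpc_client):
            self.rpc = rpc_client

        def run_instance(self, project_id, instance_spec):
            # No data-store writes happen here; the message carries everything
            # the compute worker needs to create the instance record itself.
            self.rpc.cast('compute', {'method': 'run_instance',
                                      'args': {'project_id': project_id,
                                               'instance_spec': instance_spec}})


    class ServersController(object):
        """External HTTP API handler: a thin layer over the internal API."""

        def __init__(self, compute_api):
            self.compute_api = compute_api

        def create(self, project_id, body):
            self.compute_api.run_instance(project_id, body['server'])
            return {'status': 'scheduled'}


    controller = ServersController(ComputeAPI(FakeRpcClient()))
    controller.create('demo-project', {'server': {'flavor': 'm1.small'}})

Keeping the write path inside the compute worker is what would let API servers and schedulers run without direct access to a central data store.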

Migration

Coming once implementation nears beta.

Test/Demo Plan

Unit tests will be provided as code is developed. Integration and large-scale testing will be added once there is infrastructure to do so.

Unresolved issues

None.

BoF agenda and discussion

The two relevant sessions at the Bexar design summit had the following notes:

  • http://etherpad.openstack.org/DistributedDataStore
  • http://etherpad.openstack.org/multicluster