Horizontally Scalable (O(1)) scheduler
Add a massively scalable scheduler to Nova. By massively scalable, we mean "Scheduling a new instance on a hundred thousand node deployment with millions of current and past instances should not take more than a second."
The existing scheduler implementations allocate resources on what is currently considered the optimal location for the given resource. Not only is this a costly process, since the data set to be evaluated is potentially very large, but because workloads across the cloud are highly variable and entirely non-deterministic, the designated optimal location may soon become a below-average location for the resource. This spec outlines an alternative approach that simply puts resources wherever they fit, with no attempt to make a centralised decision about anything at all. The end result is an O(1) scheduler.
- Soren has a cloud with tens of thousands of nodes with scores of VMs on each. He's fed up with how long it takes to schedule new instances.
Incoming resource requests are validated (sanity checks, quota checks, etc.) as per usual. After this, they're sent to a scheduler that simply puts them on a queue. Nodes with spare capacity poll this queue for work items.
For simplicity, the first iteration only adds a new scheduler class. This keeps nova-scheduler around even though it is of very little use at this point, since nova-api might as well put requests directly onto the queue.
The new scheduler class will send the scheduling request out on the message queue using a cast.
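A minimal sketch of such a scheduler class, assuming a fire-and-forget `rpc_cast(topic, message)` primitive on the message bus (the names `QueueScheduler`, `WORK_TOPIC` and `rpc_cast` are illustrative, not Nova's actual API):

```python
# Shared topic all idle compute nodes listen on (illustrative name).
WORK_TOPIC = "compute.work"


class QueueScheduler:
    """A scheduler that makes no placement decision at all: O(1)."""

    def __init__(self, rpc_cast):
        # rpc_cast(topic, message) is assumed to be a non-blocking,
        # fire-and-forget send on the message bus.
        self._cast = rpc_cast

    def schedule_run_instance(self, request_spec):
        # No host lookup, no filtering, no weighing -- just enqueue
        # the request and let a compute node claim it.
        self._cast(WORK_TOPIC, {"method": "run_instance",
                                "args": request_spec})
```

Because the cast returns immediately, scheduling cost is independent of the number of nodes and instances in the deployment.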
Any compute node that has capacity to spare will listen on this topic. The messaging layer will distribute each request "at random" to one of the listening compute nodes. The compute node will verify that it can actually accommodate the instance (based on the instance type) and raise an exception if it turns out not to have sufficient capacity after all. This will cause the message to be requeued (if using the kombu messaging driver), and another node can attempt to fulfill the request. If it does have capacity, it will go ahead with the allocation as usual.
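The claim-and-requeue behaviour can be sketched as follows. This is a simplification under stated assumptions: `ComputeNode`, `InsufficientCapacity` and the `deliver` helper are hypothetical, and a plain loop stands in for the messaging layer's requeue-on-exception semantics:

```python
import queue


class InsufficientCapacity(Exception):
    """Raised so the messaging layer requeues the work item."""


class ComputeNode:
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb
        self.instances = []

    def handle_work_item(self, item):
        # Re-check capacity at claim time: the node's situation may
        # have changed since it started listening on the topic.
        if item["ram_mb"] > self.free_ram_mb:
            raise InsufficientCapacity()
        self.free_ram_mb -= item["ram_mb"]
        self.instances.append(item["instance_id"])


def deliver(work_queue, nodes):
    # Crude stand-in for the messaging layer: offer each item to the
    # listening nodes in turn, moving on when one rejects it (the real
    # driver would requeue the message and redeliver it).
    while not work_queue.empty():
        item = work_queue.get()
        for node in nodes:
            try:
                node.handle_work_item(item)
                break
            except InsufficientCapacity:
                continue
```

The key point is that the capacity check happens on the node itself, at claim time, so no central component needs a consistent global view of free capacity.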
The next iteration will add per-instance-type topics. Nodes only subscribe to topics for instance types that they can actually accommodate.
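A sketch of the per-instance-type topic scheme, assuming a flat mapping from flavor name to RAM requirement (the topic naming convention and helper names here are illustrative):

```python
def topic_for(instance_type):
    # One work topic per instance type, e.g. "compute.work.m1.large".
    return "compute.work.%s" % instance_type


def topics_to_subscribe(node_free_ram_mb, instance_types):
    # A node only subscribes to topics for flavors it can actually
    # host, so work items for oversized flavors never reach it and
    # no requeue round-trips are wasted on them.
    return [topic_for(name)
            for name, ram_mb in instance_types.items()
            if ram_mb <= node_free_ram_mb]
```

A real flavor definition would include CPU and disk as well as RAM; RAM alone keeps the sketch short.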
The next iteration will remove the scheduler entirely and make nova-api put resource requests directly on the work queue.
The final iteration (which depends on Marconi being ready for production) will use Marconi as the queueing backend. (This may or may not come for free depending on whether a Marconi RPC backend gets added to the RPC layer).