Essex Scheduler and Scaling Improvements

Scheduler Improvements

Our first attempt at the scheduler exposed some interesting weaknesses. The main one is the delay between updates being sent from the Compute nodes and the flood of provisioning requests hitting the Schedulers, which means a Scheduler can be deciding on stale data. Additionally, since there is no shared locking between scheduler workers, the possibility of over-allocating same-weighted hosts increases.

In other words, concurrency kills and so does latency.

In the short term we fixed this by doing the following:

  1. Ignoring the periodic compute node updates and determining the current compute status directly from the database. On each provisioning request we take the capacity of the compute node and virtually “consume” resources from it for each Instance residing on that host. We use the InstanceType table to determine the size of the used resources, rather than the actual memory/disk being used on the instance. Why? Because, in a Service Provider model, an allocated instance is assumed to have all the requested RAM/disk immediately. We don't over-sell computing resources.

  2. Reducing the number of scheduler workers to one.
  3. Eliminating the ability to provision more than one instance at a time.
  4. Using the “current workload” as a hack to prevent a single host from getting pounded. Current workload is a count of all the Build, Snapshot, Migration and Resize operations currently active on a host. Not only is this useful for taking the host out of contention in the scheduler, but it prevents us from impacting the instances already running on that host (the Noisy Neighbor problem).
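The virtual “consume” step and the current workload check can be sketched roughly as follows. This is a minimal illustration, not the actual Nova code: the class and field names are hypothetical stand-ins for rows from the compute node, Instance and InstanceType tables, and the `max_workload` threshold is an assumption.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for database rows; field names are assumptions.
@dataclass(frozen=True)
class InstanceType:
    memory_mb: int
    disk_gb: int


@dataclass(frozen=True)
class Instance:
    host: str
    instance_type: InstanceType


@dataclass(frozen=True)
class HostCapacity:
    host: str
    total_memory_mb: int
    total_disk_gb: int
    current_workload: int  # active build/snapshot/migration/resize operations


def free_resources(capacity, instances):
    """Return (free_memory_mb, free_disk_gb) after virtually consuming the
    full InstanceType allocation of every instance residing on the host."""
    free_mem = capacity.total_memory_mb
    free_disk = capacity.total_disk_gb
    for inst in instances:
        if inst.host == capacity.host:
            free_mem -= inst.instance_type.memory_mb
            free_disk -= inst.instance_type.disk_gb
    return free_mem, free_disk


def candidate_hosts(capacities, instances, requested, max_workload=0):
    """Return hosts that can fit the requested type and are not too busy."""
    result = []
    for cap in capacities:
        if cap.current_workload > max_workload:
            continue  # busy host: keep it out of contention
        free_mem, free_disk = free_resources(cap, instances)
        if free_mem >= requested.memory_mb and free_disk >= requested.disk_gb:
            result.append(cap.host)
    return result
```

Note that every call walks the full instance list, which is exactly the cost the rest of this proposal sets out to remove.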

But this sucks. We've introduced a single point of failure and now we're pounding the database on each provisioning request. While this works well for small installations, it's obvious that it won't work for larger deployments. Our current scaling efforts are working with assumptions of about 500 hosts per zone and, let's say, 200 instances per host (worst case). This works out to about 100k instances we have to scan for each provisioning request as the zone starts to fill up. That's a big, expensive query.

Obviously we can do better. The first step is to keep a better record of used resources within a zone.

To this end, we will be introducing the concept of a Capacity Cache. This will be a new table in Nova that contains the following columns:

Each row in the table will map to an active host. With this table we'll be able to reduce our DB query from as many as # hosts * # instances rows (about 100k in the worst case above) to just # hosts.

The challenge then becomes how to maintain this table quickly so that latency doesn't become an issue. The answer is to update the row in the database for that host as soon as the state of the host changes. This implies we have to update the row at the following times:

  1. When a host is selected for provisioning in the scheduler (free RAM and # running VMs).
  2. When an instance is deleted in the compute node (free RAM and # running VMs).
  3. When “busy work” such as snapshot, resize, migration, etc. starts and ends in the compute node (current workload).

There will be new functions in the db api to support these updates and they will be row-locked to prevent concurrent update attempts.
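One possible shape for such an update function is sketched below using SQLite from the Python standard library. The table name, column names and function names are all assumptions made for illustration; they are derived only from the quantities mentioned above (free RAM, # running VMs, current workload). SQLite has no row-level locks, so the sketch relies on applying relative deltas in a single UPDATE statement inside a transaction, which avoids a read-modify-write race. A production database would more likely use an explicit row lock.

```python
import sqlite3


def create_capacity_table(conn):
    # Hypothetical schema: one row per active host.
    conn.execute(
        "CREATE TABLE capacity_cache ("
        " host TEXT PRIMARY KEY,"
        " free_ram_mb INTEGER NOT NULL,"
        " running_vms INTEGER NOT NULL,"
        " current_workload INTEGER NOT NULL)"
    )


def capacity_update(conn, host, ram_delta_mb=0, vms_delta=0, workload_delta=0):
    """Apply relative changes to one host row in a single UPDATE.

    Returns True if a row for the host existed and was updated.
    """
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE capacity_cache SET"
            " free_ram_mb = free_ram_mb + ?,"
            " running_vms = running_vms + ?,"
            " current_workload = current_workload + ?"
            " WHERE host = ?",
            (ram_delta_mb, vms_delta, workload_delta, host),
        )
    return cur.rowcount == 1


def hosts_with_free_ram(conn, needed_mb):
    """Scan one row per host instead of one row per instance."""
    rows = conn.execute(
        "SELECT host FROM capacity_cache"
        " WHERE free_ram_mb >= ? AND current_workload = 0"
        " ORDER BY free_ram_mb DESC, host",
        (needed_mb,),
    )
    return [row[0] for row in rows]
```

With this layout the scheduler would call something like `capacity_update(conn, host, ram_delta_mb=-4096, vms_delta=1)` when it selects a host, and the compute side would apply the opposite deltas on delete.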

Fortunately there are already usage/audit/billing events that get generated in compute for most of these activities. We simply need to make a new Notifier Driver that calls the new db.api functions when these events are received. There is a List Notifier driver (nova.notifier.list_notify) which allows us to process an event in many different ways (the common way is sending it to YAGI). We will create a new CapacityUpdater notification driver to add to the List Notifier driver list. This driver will watch for instance state change events and update the table accordingly.
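A rough sketch of what such a driver could look like follows. It does not use the real Nova notifier interface: the `notify(message)` method, the event type strings, the payload fields and the `notify_all` helper are all hypothetical, chosen only to show how events might be translated into capacity deltas. The scheduler-side update at host selection time is not shown.

```python
# Hypothetical names for the "busy work" operations tracked as workload.
BUSY_OPERATIONS = ("snapshot", "resize", "migrate")


class CapacityUpdater:
    """Translates compute events into capacity updates (illustrative only)."""

    def __init__(self, update_capacity):
        # update_capacity stands in for a db.api function taking a host and
        # keyword deltas (ram_delta_mb, vms_delta, workload_delta).
        self._update_capacity = update_capacity

    def notify(self, message):
        event_type = message.get("event_type", "")
        payload = message.get("payload", {})
        host = payload.get("host")
        if host is None:
            return  # nothing to attribute the event to

        if event_type == "instance.delete.end":
            # Deleting an instance frees its RAM and one VM slot.
            self._update_capacity(
                host, ram_delta_mb=payload.get("memory_mb", 0), vms_delta=-1
            )
            return

        parts = event_type.split(".")
        if len(parts) == 3 and parts[0] == "instance" and parts[1] in BUSY_OPERATIONS:
            if parts[2] == "start":
                self._update_capacity(host, workload_delta=1)
            elif parts[2] == "end":
                self._update_capacity(host, workload_delta=-1)
        # Any other event is ignored by this driver.


def notify_all(drivers, message):
    """Stand-in for a list-style notifier: hand the event to every driver."""
    for driver in drivers:
        driver.notify(message)
```

Keeping the driver a thin translation layer means the locking concerns stay inside the database function it calls.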

'nova list' Performance

Once we solve the reliability of the scheduler, we are going to have to deal with the 'nova list' command potentially having to pore through a lot of instances. Also, much of the information that needs to be returned to the user spans several tables or even comes from remote services (such as Network info from Melange/Quantum). This is compounded by pagination, where only a few instances are returned at a time and we need to bookmark where we left off.

Again, this is a caching problem. We need to give the API service a de-normalized, precomputed table, optimized for the task at hand, that it can go to and get all the info it needs at once.

For most situations this table can be eventually consistent with the state of the world, but in some situations instant updates are required. The most obvious situation is the creation of a new instance. When a user creates a new instance they would like to immediately issue a 'nova show' command to see the status of the provisioning operation. What we don't want them to see is 'no such instance id', since they're likely to reissue the 'nova boot' command.

This Instance Info table will likely contain the following columns:

The mechanism for updating this table will largely be the same as the approach used for the Scheduler improvements listed above. There will be a special List Notifier driver for catching compute node events and updating the Instance Info table accordingly. The Scheduler will create the initial entry in the table when the host is selected (so that the user can instantly query the status of their request).
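The flow can be sketched with SQLite as below: the scheduler writes the initial row, later events update it, and listing uses a marker to bookmark where the previous page ended. The schema here is a deliberately tiny, hypothetical subset (it is not the proposed column list), and all function names are assumptions.

```python
import sqlite3


def create_instance_info_table(conn):
    # Hypothetical minimal schema, for illustration only.
    conn.execute(
        "CREATE TABLE instance_info ("
        " id INTEGER PRIMARY KEY,"
        " tenant TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " host TEXT)"
    )


def record_scheduled(conn, instance_id, tenant, host):
    """Called at host selection so 'nova show' finds the instance at once."""
    with conn:
        conn.execute(
            "INSERT INTO instance_info (id, tenant, status, host)"
            " VALUES (?, ?, 'BUILD', ?)",
            (instance_id, tenant, host),
        )


def update_status(conn, instance_id, status):
    """Called by the notifier driver when a compute event arrives."""
    with conn:
        conn.execute(
            "UPDATE instance_info SET status = ? WHERE id = ?",
            (status, instance_id),
        )


def list_page(conn, tenant, marker=0, limit=2):
    """Return (rows, next_marker) for the page after `marker`.

    next_marker is the last id on a full page, or None when the page came
    back short, which means there is nothing further to fetch.
    """
    rows = conn.execute(
        "SELECT id, status, host FROM instance_info"
        " WHERE tenant = ? AND id > ? ORDER BY id LIMIT ?",
        (tenant, marker, limit),
    ).fetchall()
    next_marker = rows[-1][0] if len(rows) == limit else None
    return rows, next_marker
```

Because each page is a single indexed range scan on one table, the cost of a page does not grow with how far into the list the user has paged.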

Inter-Zone Communication

Maintenance of the inter-zone communication code has been very difficult. Constant changes to the OS API, Keystone and novaclient make things fragile and harder to maintain/test. Additionally, we have decided that there are other optimizations we can make if we introduce the concept of “trusted zones”. Trusted zones will allow us to relax the “don't leak information out of a zone” requirement.

Given these two forces, we have been considering allowing a child zone to push zone status up to its parent zones by connecting directly to the parent zone's RabbitMQ server. Note that we are not talking about clustering RabbitMQ in a WAN configuration. We're just talking about making a normal client connection when we're in a trusted environment.

Again, using a scheme similar to the one for 'nova list' and the Capacity Cache, we'll use a new List Notifier driver to handle the updates from the compute nodes. However, if we were to let these drivers talk to the parent zone RabbitMQ server directly, that could result in an additional 500 connections. Not good. Instead, this driver will simply push the message into a new queue that the scheduler can pick up. The scheduler will update its internal Zone Manager with the data if needed but, more importantly, relay the message to the parent zone on a single connection. We don't anticipate this being a very chatty connection since events are only generated when something changes. In an SP deployment most servers run in a constant state (only the edges are active).
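The fan-in can be sketched with an in-process queue standing in for the new message queue. Everything here is hypothetical: `ParentZoneLink` stands in for the single client connection to the parent zone's broker, `ZoneManager` for the scheduler's internal state, and the message fields are invented for the example.

```python
import queue


class ParentZoneLink:
    """Stand-in for the one client connection to the parent zone broker."""

    def __init__(self):
        self.sent = []

    def publish(self, message):
        self.sent.append(message)


class ZoneManager:
    """Stand-in for the scheduler's internal view of host state."""

    def __init__(self):
        self.host_state = {}

    def update(self, message):
        self.host_state[message["host"]] = message["state"]


def relay_pending(local_queue, zone_manager, parent_link):
    """Drain queued compute events, update local state, and forward each
    event to the parent zone over the single shared link.

    Returns the number of messages relayed.
    """
    relayed = 0
    while True:
        try:
            message = local_queue.get_nowait()
        except queue.Empty:
            return relayed
        zone_manager.update(message)
        parent_link.publish(message)
        relayed += 1
```

However many compute hosts feed the local queue, the parent zone sees only one connection from the child zone.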

All of these proposals are illustrated in the following diagram. New components are green.

EssexScaling.gif

Wiki: EssexSchedulerImprovements (last edited 2012-03-31 17:16:31 by jsuh)