MultiClusterZones

Summary

Zones are logical groupings of Nova Services and VM Hosts. Not all Zones need to contain Hosts; they may also contain other Zones to permit more manageable organizational structures. Our vision for Zones also allows for multiple root nodes (top-level Zones) so business units can partition the hosts in different ways for different purposes (e.g. geographical zones vs. functional zones).

This proposal will outline our understanding of the issues surrounding Multi-Cluster and discuss some implementation ideas.

Discussion

Please direct all feedback / discussion to the mailing list or the following Etherpad: http://etherpad.openstack.org/multiclusterdiscussion

I will maintain this page to reflect the feedback. -SandyWalsh

See Also

Older notes: http://etherpad.openstack.org/multicluster and http://etherpad.openstack.org/multicluster2

Release Note

todo

Rationale

In order to scale Nova to 1 million host machines and 60 million guest instances, we need a scheme to divide and conquer the effort.

Assumptions

Design

Let's have a look at how Nova hangs together currently.

There is a collection of Nova services that communicate with each other via AMQP (currently RabbitMQ). Each service has its own Queue into which messages are sent. As a convenience, there is also a set of Service API stubs which handle the marshaling of commands onto these queues; there is one Service API per Service. The outside world communicates with Nova via one of the public-facing APIs (currently EC2 and Rackspace/OpenStack over HTTP). Before a client can talk to a public-facing API it must authenticate against the Nova Auth Service. Once authenticated, the Auth Service tells the Client which API Service to use. This means that we can stand up many API Services and delegate each caller to the most appropriate one. The API Service does very little processing of the request. Instead, it uses the Service Stub to put the message on the appropriate Queue, and the related Service handles the request.
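
To make the flow concrete, here is a minimal sketch of the Service API stub pattern; the rpc helper, class and method names are illustrative assumptions for this example, not actual Nova code:

  # Hypothetical sketch: a Service API stub marshals a command onto the
  # service's queue instead of doing the work itself. The rpc object is
  # assumed to wrap AMQP publishing (in the spirit of nova.rpc).
  class ComputeAPI(object):
      def __init__(self, rpc, topic="compute"):
          self.rpc = rpc        # assumed AMQP wrapper with a cast() method
          self.topic = topic    # each service listens on its own queue/topic

      def reboot_instance(self, context, instance_id):
          # Fire-and-forget: the Compute Service picks this message off
          # its queue and performs the reboot asynchronously.
          self.rpc.cast(context, self.topic,
                        {"method": "reboot_instance",
                         "args": {"instance_id": instance_id}})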

There are currently about a half-dozen Nova Services in use. These include:

  • API Service - as described above
  • Scheduler Service - the first stop for events from the API. The Scheduler Service is responsible for routing requests to the appropriate service. The current implementation doesn't do much. This Service will likely be affected the most by this proposal.
  • Network Service - For handling Nova Networking issues.
  • Volume Service - For handling Nova Disk Volume issues.
  • Compute Service - For talking to the underlying hypervisor and controlling the guest instances.

and then there are the other services like Glance, Swift, etc.

This flow is shown in the illustration below. Note that I have shown a proposed notification scheme at the bottom, but this isn't currently in Nova.

File:MultiClusterZones$ZonesArchitecture sm.png

This architecture works fine for our existing deployments. But as we scale up, it will degrade in performance until it is unusable. Likewise it may fail due to hard limitations such as the number of network devices that are available on a subnet. We need to find a way to partition our hosts so that larger deployments are possible.

One method for doing so is by supporting "Zones".

Zones are logical groupings of Nova Services. Zones can contain other Zones (hence the Nested aspect of this proposal). Unlike a conventional tree structure, Zones may have multiple root nodes. If we only permitted a single root node, only one organizational scheme could be used, but different business groups may need to view the collection of Hosts from different angles. Operations may want to see the Hosts organized by capabilities, while end-users, sales or marketing may want them organized by Geography. Geography is the most common organizational scheme.

Within each Zone we may stand up a collection/subset of Nova Services and delegate commands between zones. Zones will communicate to each other via the AMQP network.

A sample Nested Zone deployment might look something like this:

(A = API Service, S = Scheduler Service, N = Network Service, V = Volume Service, etc.)

File:MultiClusterZones$NestedZones sm.png

As you can see, there is a single top-level Zone called the "Global" Zone. The Global Zone contains the North American, European and Asian Zones. Drilling into the North American Zone, we see two Data Center (DC) Zones, #1 & #2. Each DC has two Huddle Zones (to borrow the Rackspace parlance) where the actual Host servers live. A Huddle Zone is limited in size to 200 Hosts due to networking restrictions.

Certainly the largest zones will be the DC Zones, which hold a large collection of Huddle Zones. We can assume, in a service provider deployment, that a DC Zone may contain 200 or more Huddle Zones. Assuming about 50-60 Guest instances per Host, a single DC could be responsible for as many as 200 Huddle Zones/DC * 200 Hosts/Huddle Zone * ~50 Guests/Host = ~2 million Guests/DC.

Our intention is to keep Hosts separated from a Zone's decision making responsibilities until the very last moment, thus keeping the working set as small as possible.

Inter-Zone Communication and Routing

As mentioned previously, AMQP is used for services to communicate with each other. Also, we mentioned that the Scheduler Service is used for routing requests between Services. The strategy for Multi-Cluster is for the Scheduler Service to route calls between Zones before handing the request off to its ultimate destination.

To do this, each Zone has to have its own AMQP Queue for receiving messages. There will be a Scheduler deployed in each Zone that can listen for requests coming from the parent Zone. This implies we need to deploy our AMQP network so that only the inter-zone queues are replicated in the AMQP Cluster and not every queue in the network.
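
As a rough sketch of this routing loop, under the assumption of a "zone.<name>" queue-naming convention and an rpc helper like the one above (all names here are hypothetical):

  def pick_target_zone(msg, child_zones):
      # Hypothetical routing decision: pick the first child Zone whose
      # advertised Capabilities satisfy the request's requirements.
      for zone in child_zones:
          if zone.satisfies(msg.get("requirements", {})):
              return zone
      return None  # no child matched; handle the request in this Zone

  def handle_zone_message(rpc, context, msg, child_zones):
      target = pick_target_zone(msg, child_zones)
      if target is None:
          # Terminal Zone: dispatch onto a local service queue (these
          # internal queues never leave the Zone's own AMQP broker).
          rpc.cast(context, msg["service_topic"], msg)
      else:
          # Forward to the child Zone's Scheduler; only these "zone.*"
          # queues need to be replicated across the AMQP cluster.
          rpc.cast(context, "zone.%s" % target.name, msg)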

File:MultiClusterZones$ZoneTalk sm.png

Should We Use The Queues Or The Public APIs?

The obvious concern with this approach is: "Why use AMQP when we already have a public API available in each Zone?" It's a fair argument, but there are a number of issues:

  1. Caller Authentication has to be stored and forwarded to each API Service for each call.
  2. No means for Callbacks or error notification on long-running operations other than the proposed PubSubHubBub Service.
  3. The expense of marshaling/unmarshaling each request from HTTP -> Rabbit -> HTTP -> Rabbit all the way down.
  4. Having to register not only the public API, but also the admin-only APIs at each layer. We would need to detect that a call is an admin-only call and correctly route it to the proper API server.

As mentioned above, only the queues that go between Zones need to be forwarded. Internal traffic (such as API->Network/Volume/etc.) can use local queues.

Action Item: We are currently working on a test between each of the Rackspace Data Centers to measure performance of RabbitMQ messaging and possible throughput rates.

Routing, Database Instances, Zones, Hosts & Capabilities

Each deployment of a Nova Zone has a complete copy of the Nova application suite, and each deployment gets its own Database. The Schema of each database is the same for all deployments at all levels. The difference is that, depending on the services running within a Zone, not all of the database tables may be used.

Nova currently has no concept of Hosts in the database model. We will need to add one and give it a relationship to our existing Instances table. We will also need a Zone table.

Hosts do not have to live only at the leaf nodes; Hosts may live at any level. Hosts are contained within Zones. I think, but I'm not sure, that they may also belong to multiple zone trees. I need to expand on this.

Database Model additions:

File:MultiClusterZones$db sm.png

Zones and Hosts have Capabilities. Capabilities are key-value pairs that indicate the types of resources contained within the Zone or Host. Capabilities are used to decide where to route requests.

The value portion of a Capability Key-Value pair is a (String/Float, Type) tuple. The Type field is used to coerce between the two value fields (string/float). We have both a String and a Float field so we can do range comparisons in the Database. While this is expensive, we cannot predict in advance all the desired Capabilities and denormalize the table. (jaypipes offered this schema: http://pastie.org/1515576)
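
A minimal sketch of how such a capability row might be modeled, in SQLAlchemy declarative style; the table and column names are illustrative, not the final schema:

  from sqlalchemy import Column, Float, ForeignKey, Integer, String
  from sqlalchemy.ext.declarative import declarative_base

  Base = declarative_base()

  class ZoneCapability(Base):
      __tablename__ = "zone_capabilities"  # hypothetical table name
      id = Column(Integer, primary_key=True)
      zone_id = Column(Integer, ForeignKey("zones.id"))
      key = Column(String(255))            # e.g. "can-run-windows", "free-disk"
      string_value = Column(String(255))   # used when value_type is "string"
      float_value = Column(Float)          # used when value_type is "float"
      value_type = Column(String(16))      # tells us which value field to read

  # The separate float column is what makes range queries possible, e.g.:
  #   session.query(ZoneCapability).filter(
  #       ZoneCapability.key == "free-disk",
  #       ZoneCapability.float_value >= 20.0)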

While some Keys will be binary in nature (is-enabled, can-run-windows, accept-migrations), others will be float-based and change dynamically (especially the Host Capabilities): for example, free-disk, average-load, number-of-instances, etc.

The decision still has to be made whether the Host will push its current state into the Host Capabilities table or the Host Scheduler will poll the Host for its current status when needed. At the Zone level, the higher up the Zone tree we move, the more static these Capabilities will be.

Later, we can optimize the downstream queries by caching the Capability state in the parent Zones. But for now, we will poll down.

Selecting the Correct API Server

As we described earlier, the Auth Service tells the Client which API Service to use for subsequent operations. In order to reduce the chatter in the queues, we would like to send the client to the lowest-level Zone that contains all the instances managed by that client. It is our intention to use the same inter-zone communication channel for making this decision.

The Auth Service will always come in at the top-level Zone and ask "Do you manage Client XXXX?" This request sinks down into each nested Zone and, if a match is found, the answer propagates back up. If more than one child responds "Yes", the parent Zone's API is returned. If a single child responds "Yes", that Zone's API Service is returned. If no child responds, the Zone returns "No" and its parent Zone makes the decision.

This information may be cached in a later implementation. This can be pre-computed whenever an instance is created, migrated or deleted.
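
For clarity, here is the resolution described above as simple in-process recursion; in practice each hop would be an inter-zone queue message, and the function and attribute names are assumptions:

  def find_api_server(zone, client_id):
      # Returns the API Service of the lowest Zone containing all of the
      # client's instances, or None if this subtree manages none of them.
      matches = [find_api_server(child, client_id) for child in zone.children]
      matches = [m for m in matches if m is not None]
      local = zone.manages_client(client_id)   # assumed local DB lookup
      if len(matches) > 1 or (matches and local):
          return zone.api_url   # instances span children: answer at this level
      if matches:
          return matches[0]     # exactly one child owns them all: go lower
      return zone.api_url if local else None   # "No": let the parent decide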

File:MultiClusterZones$API sm.png

User stories

todo

Implementation

This section should describe a plan of action (the "how") to implement the changes discussed.

The intention for Cactus is to work on the data model, API and related client tools in Sprint-1 (the first 3 weeks) of Cactus, and then work on the inter-zone queue communication in Sprint-2.

Sprint 1 Tasks

NovaTools

add-zone(name)
delete-zone(name)
zone-cap-list(...)
zone-cap-set(...)
zone-cap-remove(...)
zone-add-host(...)
zone-remove-host(...)
host-cap-list(...)
host-cap-add(...)
host-cap-remove(...)


Nova Manage

set-zone(name)


Nova API

GET /zones/
GET /zones/#/detail
POST /zones/
PUT /zones/#
DELETE /zones/#
GET /zones/#/cap
POST /zones/#/cap
DELETE /zones/#/cap
PUT /zones/#/cap

GET /hosts/
GET /hosts/#/detail
POST /hosts/
PUT /hosts/#
DELETE /hosts/#
GET /hosts/#/cap
POST /hosts/#/cap
DELETE /hosts/#/cap
PUT /hosts/#/cap
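
To illustrate how a client might exercise these routes (the host, port, token and payload shapes are all assumptions for the example):

  import httplib
  import json

  conn = httplib.HTTPConnection("api.dc1.example.com", 8774)
  headers = {"X-Auth-Token": "ADMIN_TOKEN",
             "Content-Type": "application/json"}

  # Create a child zone, then set a capability on it.
  conn.request("POST", "/zones/", json.dumps({"name": "dc1-huddle7"}), headers)
  zone = json.loads(conn.getresponse().read())

  conn.request("POST", "/zones/%s/cap" % zone["id"],
               json.dumps({"key": "accept-migrations",
                           "value": "True", "type": "string"}), headers)
  print conn.getresponse().status   # expect 200/201 on success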


DB

add_zone(...)
delete_zone(...)
add_zone_capability(...)
delete_zone_capability(...)
update_zone_capability(...)
add_host(...)
delete_host(...)
add_host_capability(...)
delete_host_capability(...)
update_host_capability(...)
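
As a sketch of what one of these helpers might look like, assuming the hypothetical ZoneCapability model sketched earlier and a SQLAlchemy session:

  def add_zone_capability(session, zone_id, key, value, value_type):
      # Store the value in the field matching its declared type so that
      # float-valued capabilities remain range-queryable.
      cap = ZoneCapability(zone_id=zone_id, key=key, value_type=value_type)
      if value_type == "float":
          cap.float_value = float(value)
      else:
          cap.string_value = str(value)
      session.add(cap)
      session.commit()
      return cap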

Unresolved issues

This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself, since any specification with problems cannot be approved.

BoF agenda and discussion

Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.