Summary

Clouds are expected to be always available and involve large numbers of servers. Here we consider how to perform upgrades with minimal disruption.

Goals for the upgrade are:

Release Note

TODO

User stories

Consider the possible different ways to perform the upgrade...

Big bang

To perform the upgrade you could try this approach:

This approach leads to too much downtime.

Rolling upgrade

This approach involves upgrading each component of the system, piece by piece, eventually giving you a cloud running on the new version.

While this is more complex, we should be able to keep the downtime of each component minimal. Using the resilience built into OpenStack, we should be able to achieve zero downtime, although some actions may take slightly longer than usual.

There are two key ways to perform this kind of upgrade:

In-place upgrades

In-place upgrades require each service to be down for the duration of its upgrade. They are also likely to make rollback harder.

So for the moment I will ignore this approach, except in the case of upgrading the hypervisor. When doing a hypervisor upgrade, you can remove the node from the cloud, after live migrating its instances, without affecting the overall availability of the cloud. You would only have slightly reduced capacity while the node is unavailable.

Side by side upgrades

Side-by-side upgrades involve this procedure for each service:

The advantages of this approach appear to be:

Assumptions

If we take the side-by-side rolling upgrade approach, here are the things we need to assume about OpenStack.

Backwards Compatible Schemas

To enable a rolling upgrade of nova components we need to ensure that communication between all OpenStack components works across different versions (as a minimum, two versions back).

Things to consider are:

For example, to avoid the need for new versions of the code to work with old database versions, we should assume the database will be upgraded first. However, we could upgrade the database last (in a non-backwards compatible way), if all new versions can work with the old database.
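The "two versions back" rule above can be expressed as a simple check. This is only a minimal sketch: the `RELEASES` list, the function names, and the idea of comparing releases by position are assumptions made for illustration, not something OpenStack provides.

```python
# Illustrative release ordering, oldest first. The list and the helper
# names below are assumptions made for this sketch.
RELEASES = ["bexar", "cactus", "diablo", "essex"]


def release_index(name):
    """Return the position of a release in the ordered list."""
    return RELEASES.index(name)


def can_communicate(sender, receiver, max_versions_back=2):
    """Return True if two components are within the supported version skew.

    A component is assumed to interoperate with peers that are at most
    `max_versions_back` releases older or newer than itself.
    """
    skew = abs(release_index(sender) - release_index(receiver))
    return skew <= max_versions_back
```

During a rolling upgrade, a check like this could be run before a new component is brought up next to older ones, so that an unsupported combination is caught before any traffic is switched over.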

Migrate services between hosts using a GUID in the host flag and an IP alias

Consider using GUIDs rather than hostnames to identify services in the database and in the message queue names. This may work in the current system by specifying a GUID in the host flag.

Using a GUID, with an associated IP alias, should allow Compute (and similar) workers to be smoothly migrated between two different hosts (and in particular during a side by side upgrade).

Because an old host can be turned off and the new host can be started up with the same identity as the old host, this can minimise the downtime (no need to wait for rpm upgrades to complete). It also enables you to scale your cloud in and out on demand more easily, because you can more easily migrate workers to different hosts as required.
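The idea of decoupling a worker's identity from the machine it runs on can be sketched as follows. The `Worker` class, the `queue_name` helper, and the `topic.identity` queue naming are hypothetical and only illustrate the principle; whether the real host flag accepts a GUID in this way still needs to be verified.

```python
import uuid


def new_service_identity():
    """Generate a GUID to identify a service instead of its hostname."""
    return str(uuid.uuid4())


def queue_name(topic, identity):
    """Build a per-worker queue name (hypothetical 'topic.identity' scheme)."""
    return "%s.%s" % (topic, identity)


class Worker:
    """A worker whose identity is independent of its physical host."""

    def __init__(self, topic, identity, physical_host):
        self.topic = topic
        self.identity = identity            # stable across migrations
        self.physical_host = physical_host  # changes during an upgrade
        self.queue = queue_name(topic, identity)


# The old worker is stopped and a new one is started elsewhere with the
# same identity, so it listens on the same queue.
identity = new_service_identity()
old_worker = Worker("compute", identity, "host-a")
new_worker = Worker("compute", identity, "host-b")
```

Because both workers resolve to the same queue name, messages sent while the old host is being switched off would simply wait in the queue until the new host picks them up.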

Live migration

When upgrading a hypervisor, ideally instances should be live migrated to another host, so the host can be upgraded with zero downtime for the instances that were running on it.

Without live migration support, the instances will either be lost (terminated) or be suspended during the hypervisor upgrade.
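Before a hypervisor is taken out of service, it helps to know which instances can be moved and which cannot. The following is a rough planning sketch only: `plan_evacuation`, the slot-based capacity model, and the greedy choice of the emptiest host are all assumptions for illustration, and no real migration call is made.

```python
def plan_evacuation(instances, free_slots):
    """Plan where to live migrate instances from a host being upgraded.

    instances:  list of instance ids running on the host to be upgraded.
    free_slots: dict mapping every other host to the number of extra
                instances it can accept (a simplified capacity model).

    Returns (moves, stranded): a dict of instance -> target host, and a
    list of instances for which no capacity was found. Stranded instances
    would have to be suspended or terminated during the upgrade.
    """
    remaining = dict(free_slots)  # copy so the caller's dict is untouched
    moves = {}
    stranded = []
    for instance in instances:
        # Greedy choice: pick the host with the most free slots.
        target = max(remaining, key=remaining.get, default=None)
        if target is None or remaining[target] <= 0:
            stranded.append(instance)
            continue
        moves[instance] = target
        remaining[target] -= 1
    return moves, stranded
```

For example, with three instances and two hosts that each have one free slot, the plan moves two instances and reports the third as stranded, which signals that capacity should be added before the upgrade starts.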

Design

To ensure a smooth upgrade we need to be able to support graceful shutdown of services:

Graceful shutdown of services

We need to ensure that when we stop a service, the service stops taking new messages from the message queue and finishes servicing any current requests that have not yet completed.

This will help when switching off an old service before performing an upgrade, or when rebooting a host for maintenance. The message queue should ensure the system doesn't lose any requests during the short downtime.
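The behaviour described above can be sketched with a small worker loop. This is an illustration under assumptions, not how the nova services are implemented: `GracefulWorker` is a hypothetical class, and an in-process `queue.Queue` stands in for the real message queue.

```python
import queue
import threading


class GracefulWorker:
    """Consume messages until asked to stop, finishing the current one."""

    def __init__(self, inbox, handler):
        self.inbox = inbox        # stand-in for the message queue
        self.handler = handler    # callable that services one request
        self._stopping = threading.Event()

    def request_stop(self):
        """Ask the worker to stop taking new messages."""
        self._stopping.set()

    def run(self):
        # The stop flag is only checked between messages, so a request
        # that is already being handled always runs to completion.
        while not self._stopping.is_set():
            try:
                message = self.inbox.get(timeout=0.1)
            except queue.Empty:
                continue
            self.handler(message)
        # Anything still in the inbox is left for the replacement service.
```

In a real service, `request_stop` might be wired to a signal such as SIGTERM. Messages that were never fetched stay on the queue, so the upgraded service can process them once it starts.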

Possible Upgrade Procedures

Here we concentrate on the side by side rolling upgrade of nova components.

Assuming the database is always backwards compatible we should probably upgrade the components in the following order:

We can now look at each nova component to see how to minimize its downtime during the upgrade. Please note this has not yet been tested.

nova-compute

nova-scheduler

nova-api

Similar approach to the scheduler:

dashboard

Assumptions:

Method:

nova-volume

This depends on what storage type you use.

TODO - not yet completed.

iSCSI

There are a few approaches:

XenServer Storage Manager

Issues:

Overall approach should be similar to compute.

nova-network

This depends on the network model.

Flat-model

Method:

VLAN-model

Ideas:

glance-api

Method: as for nova-api

glance-registry

Assumptions:

Method:

Implementation

TODO

UI Changes

Code Changes

Migration

Test/Demo Plan

We need a continuous integration system to check that trunk can be upgraded from the previously released version.

Unresolved issues

Some things we need to resolve:



Wiki: upgrade-with-minimal-downtime (last edited 2011-09-30 16:27:45 by Armando Migliaccio)