Upgrade-with-minimal-downtime

  • Launchpad Entry: NovaSpec:upgrade-with-minimal-downtime
  • Created:
  • Contributors:

Summary

Clouds are expected to be always available and involve large numbers of servers. Here we consider how to perform upgrades with minimal disruption.

Goals for the upgrade are:

  • where possible, transparent to the cloud users
  • minimal instance downtime or instance connectivity loss
  • ability to roll back to a pre-upgrade state if things fail
  • ability to upgrade from v2 to v4 without having to upgrade to v3 first

More info

Possible Upgrade Strategies

Here I compare the different ways an upgrade could be performed.

Big bang

To perform the upgrade you could try this approach:

  • Build an upgraded cloud alongside your existing cloud
  • Get it configured
  • Make your old cloud read-only
  • Copy the state into your new cloud
  • Move to using your new cloud

This approach leads to too much downtime.

Rolling upgrade

This approach involves upgrading each component of the system, piece by piece, eventually giving you a cloud running on the new version.

While this is more complex, we should be able to keep the downtime of each component to a minimum, and by using the resilience built into OpenStack we should be able to achieve zero downtime, though some actions may take slightly longer than usual.

There are two key ways to perform this kind of upgrade:

  • in-place upgrade of each component
  • replacement/side-by-side upgrade of each component

In-place upgrades

In-place upgrades require each service to be down for the duration of the upgrade, and they are likely to make rollback harder.

So for the moment I will ignore this approach, except in the case of upgrading the hypervisor. When doing a hypervisor upgrade, you can remove the node from the cloud, after live migrating its instances, without affecting the overall availability of the cloud. You would only have a very slightly reduced capacity during the time the node was unavailable.

Side by side upgrades

Side-by-side upgrades involve the following procedure for each service (sketched in code after the list):

  • Configure the new worker
  • Turn off the old worker (allowing the message queue or a load balancer to hide this downtime)
  • Snapshot/backup the old worker for rollback
  • Copy/move any state to the new worker
  • Start up the new worker
  • Repeat for all other workers, in an appropriate order
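
As a rough illustration of the ordering, here is a sketch of the procedure for a single worker. Every helper below is a hypothetical stand-in for whatever deployment tooling (packaging, configuration management, hypervisor snapshots) is actually used.

  # Sketch only: the procedure above for one worker, written out so the
  # ordering is explicit.  All helpers are hypothetical stand-ins.
  def configure(worker):
      print('configure new version on %s' % worker)

  def stop_service(worker):
      print('stop %s (message queue / load balancer hides the gap)' % worker)

  def snapshot(worker):
      print('snapshot %s for rollback' % worker)

  def copy_state(old, new):
      print('copy local state %s -> %s' % (old, new))

  def start_service(worker):
      print('start %s' % worker)

  def upgrade_worker(old, new):
      configure(new)
      stop_service(old)
      snapshot(old)
      copy_state(old, new)
      start_service(new)

  # Repeat for every worker, in an appropriate order.
  upgrade_worker('nova-compute-a (old)', 'nova-compute-a (new)')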

The advantages of this approach appear to be:

  • potentially easier rollback
  • potentially less downtime of a component
  • works well when deploying nova in a VM
  • easier to test, as the system is in a known state (or VM image)

Assumptions

If we take the side-by-side rolling upgrade approach, here are the things we need to assume about OpenStack.

Backwards Compatible Schemas

To enable a rolling upgrade of nova components, we need to ensure that communication between all components of OpenStack works across different versions (as a minimum, two versions back).

Things to consider are:

  • database schema
  • message queue messages (see the sketch below)
  • OpenStack API compatibility (for when different zones are at different versions)

For example, to avoid the need for new versions of the code to work with old database versions, we should assume the database will be upgraded first. However, we could upgrade the database last (in a non-backwards compatible way), if all new versions can work with the old database.
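
To make the message-queue point concrete, here is a minimal sketch of the tolerant-reader style that keeps payloads compatible across versions. The handler and field names are illustrative only, not nova's actual payload format.

  # Hypothetical handler on the newer worker: a field added in the new
  # version is treated as optional, so a message cast by an older service
  # (which omits it) is still handled.
  def handle_run_instance(payload):
      instance_id = payload['instance_id']            # present in both versions
      scheduler_hint = payload.get('scheduler_hint')  # new field, may be absent
      if scheduler_hint is None:
          scheduler_hint = 'default'                  # fall back to old behaviour
      return instance_id, scheduler_hint

On the database side, upgrading the schema first is easiest when each migration is additive. Below is a sketch in the style of a sqlalchemy-migrate version script (the mechanism nova uses for its schema versions); the table/column change is illustrative only, and create_column/drop_column assume the migrate changeset extensions that the migration runner (nova-manage db sync) loads.

  from sqlalchemy import Column, MetaData, String, Table

  def upgrade(migrate_engine):
      meta = MetaData()
      meta.bind = migrate_engine
      instances = Table('instances', meta, autoload=True)
      # Purely additive: a nullable column the old code never reads, so workers
      # still running the previous release keep working against the new schema.
      instances.create_column(Column('maintenance_note', String(255)))

  def downgrade(migrate_engine):
      meta = MetaData()
      meta.bind = migrate_engine
      instances = Table('instances', meta, autoload=True)
      # Dropping the column restores the exact pre-upgrade schema for rollback.
      instances.drop_column('maintenance_note')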

Migrate services between hosts using a GUID in the host flag and an IP alias

Consider using GUIDs rather than hostnames to identify services in the database and in the message queue names. This may already work in the current system by specifying a GUID in the host flag.

Using a GUID, with an associated IP alias, should allow Compute (and similar) workers to be smoothly migrated between two different hosts (and in particular during a side by side upgrade).

Because an old host can be turned off and the new host can be started up with the same identity as the old host, this minimises the downtime (there is no need to wait for rpm upgrades to complete). It also enables you to scale your cloud in and out on demand, because you can more easily migrate workers to different hosts as required.
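
A minimal sketch of the idea, with hypothetical values throughout: the worker's identity comes from the host flag rather than the machine hostname, and the queue it consumes from is derived from that identity, so a replacement machine started with the same GUID (and the old host's IP alias) takes over the same queue without any renaming.

  # Hypothetical flag-file line used on both the old and the replacement host:
  #
  #   --host=6f5a1c2e-9b41-4c8d-8f27-3d2a5b6c7e90
  #
  # The per-worker queue/topic name is derived from that identity (nova uses
  # the form compute.<host>), so the replacement worker picks up exactly where
  # the old one left off.
  worker_guid = '6f5a1c2e-9b41-4c8d-8f27-3d2a5b6c7e90'   # value of the host flag
  compute_queue = 'compute.%s' % worker_guid              # per-worker queue name
  print(compute_queue)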

Live migration

When upgrading a hypervisor, instances should ideally be live migrated to another host, so the host can be upgraded with zero downtime for the instances that were running on it.

Without live migration support, the instances will either be lost (terminated) or suspended during the hypervisor upgrade.
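
As an illustration, draining a host before its hypervisor upgrade might look like the sketch below; both helpers are hypothetical placeholders for whatever admin API or tooling the deployment exposes.

  # Illustrative only: move every instance off the host before upgrading it.
  def instances_on(host):
      return ['instance-0001', 'instance-0002']        # placeholder inventory

  def live_migrate(instance, target_host):
      print('live migrating %s to %s' % (instance, target_host))

  def drain_host(source_host, target_host):
      for instance in instances_on(source_host):
          live_migrate(instance, target_host)           # instance keeps running

  drain_host('compute-01', 'compute-02')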

New Features to support upgrade

To ensure a smooth upgrade we need to be able to support graceful shutdown of services.

Graceful shutdown of services

We need to ensure that when we stop a service, we can let it stop taking new messages from the message queue and complete servicing of any requests that have not yet finished.

This will help when switching off an old service before performing an upgrade, or rebooting a host for some maintenance reason. The message queue should ensure the system doesn't lose any requests during the short downtime.
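
A minimal sketch of that behaviour using only the Python standard library: once shutdown starts, the worker takes nothing new off its queue but always finishes the request it is currently handling. In the real service the queue below would be the AMQP queue, and anything left on it stays on the broker for the replacement worker.

  import queue
  import threading
  import time

  inbox = queue.Queue()              # stands in for the worker's message queue
  stop_consuming = threading.Event()

  def handle(request):
      time.sleep(0.2)                # pretend the request takes a while
      print('completed %s' % request)

  def worker():
      while not stop_consuming.is_set():
          try:
              request = inbox.get(timeout=0.1)
          except queue.Empty:
              continue
          handle(request)            # in-flight work is always completed

  service = threading.Thread(target=worker)
  service.start()
  for i in range(5):
      inbox.put('request-%d' % i)

  stop_consuming.set()               # graceful shutdown: accept nothing new
  service.join()                     # but wait for the current request to finish
  print('%d messages left for the replacement worker' % inbox.qsize())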

Release Note

This section should include a paragraph describing the end-user impact of this change. It is meant to be included in the release notes of the first release in which it is implemented. (Not all of these will actually be included in the release notes, at the release manager's discretion; but writing them is a useful exercise.)

It is mandatory.

Rationale

User stories

Assumptions

Design

You can have subsections that better describe specific parts of the issue.

Implementation

This section should describe a plan of action (the "how") to implement the changes discussed. Could include subsections like:

UI Changes

Should cover changes required to the UI, or specific UI that is required to implement this

Code Changes

Code changes should include an overview of what needs to change, and in some cases even the specific details.

Migration

Include:

  • data migration, if any
  • redirects from old URLs to new ones, if any
  • how users will be pointed to the new way of doing things, if necessary.

Test/Demo Plan

This need not be added or completed until the specification is nearing beta.

Unresolved issues

This should highlight any issues that should be addressed in further specifications, and not problems with the specification itself; since any specification with problems cannot be approved.

BoF agenda and discussion

Use this section to take notes during the BoF; if you keep it in the approved spec, use it for summarising what was discussed and note any options that were rejected.