Upgrade-with-minimal-downtime
Revision as of 13:58, 28 September 2011
- Launchpad Entry: NovaSpec:upgrade-with-minimal-downtime
- Created:
- Contributors:
Summary
Clouds are expected to be always available and involve large numbers of servers. Here we consider how to perform upgrades with minimal disruption.
Goals for the upgrade are:
- where possible, transparent to the cloud users
- minimal instance downtime or instance connectivity loss
- ability to rollback to a pre-upgrade state if things fail
- ability to upgrade from v2 to v4 without having to do upgrade to v3 first
Release Note
TODO
User stories
Consider the possible different ways to perform the upgrade...
Big bang
To perform the upgrade you could try this approach:
- Build an upgraded cloud alongside your old cloud
- Get it configured
- Make your old cloud read-only
- Copy the state into your new cloud
- Move to using your new cloud
This approach leads to too much downtime.
Rolling upgrade
This approach involves upgrading each component of the system, piece by piece, eventually giving you a cloud running on the new version.
While this is more complex, we should be able to have minimal downtime of each component. Using the resilience built into OpenStack, we should be able to achieve zero downtime, though some actions may take slightly longer than usual.
There are two key ways to perform this kind of upgrade:
- in-place upgrade of each component
- replacement/side-by-side upgrade of each component
In-place upgrades
In-place upgrades require each service to be down for the duration of the upgrade, and they are likely to make rollback harder.
So for the moment I will ignore this approach, except in the case of upgrading the hypervisor. When doing a hypervisor upgrade, you can live migrate the instances away and then remove the node from the cloud without affecting the overall availability of the cloud. You would only have slightly reduced capacity while the node was unavailable.
Side by side upgrades
Side-by-side upgrades involve this procedure for each service:
- Configure the new worker
- Turn off the old worker
  - Allow the message queue or a load balancer to hide this downtime
- Snapshot/backup the old worker for rollback
- Copy/move any state to the new worker
- Start up the new worker
- Repeat for all other workers, in an appropriate order
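The per-worker procedure above can be sketched as a short Python routine. This is a toy model, not Nova code: the Worker class, its attributes, and the state-copying step are all illustrative stand-ins.

```python
# Hedged sketch of the side-by-side procedure for a single worker.
# All names here are hypothetical; real workers would hold queues,
# database rows, and local state rather than a plain dict.

class Worker:
    def __init__(self, name: str, version: str):
        self.name, self.version = name, version
        self.running = False
        self.state = {}

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

def side_by_side_upgrade(old: Worker, new_version: str) -> Worker:
    new = Worker(old.name, new_version)   # 1. configure the new worker
    old.stop()                            # 2. turn off the old worker
    snapshot = dict(old.state)            # 3. snapshot the old worker (kept for rollback; unused here)
    new.state = dict(old.state)           # 4. copy state to the new worker
    new.start()                           # 5. start up the new worker
    return new

old = Worker("nova-compute-1", "v2")
old.start()
old.state = {"queue": "compute.host1"}
new = side_by_side_upgrade(old, "v3")
```

Between steps 2 and 5 the message queue buffers incoming requests, which is what lets this downtime stay hidden from users.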
The advantages of this approach appear to be:
- potentially easier rollback
- potentially less downtime of a component
- works well when deploying nova in a VM
- easier to test as system is in a known state (or VM image)
Assumptions
If we take the side-by-side rolling upgrade, here are the things we need to assume about OpenStack.
Backwards Compatible Schemas
To enable a rolling upgrade of Nova components we need to ensure that communication between all components of OpenStack works across different versions (as a minimum, two versions back).
Things to consider are:
- database schema
- message queue messages
- OpenStack API compatibility (for when different zones are at different versions)
For example, to avoid the need for new versions of the code to work with old database versions, we should assume the database will be upgraded first. However, we could upgrade the database last (in a non-backwards compatible way), if all new versions can work with the old database.
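One common way to get this kind of message compatibility is additive-only changes: a new consumer supplies defaults for fields an old producer omits, and ignores fields it does not recognise. The sketch below illustrates the idea with made-up field names; it is not the actual Nova RPC message format.

```python
# Version-tolerant message handling sketch. Field names are hypothetical.
# Known fields carry defaults so messages from an older sender still parse;
# unknown (newer) fields are silently dropped so messages from a newer
# sender do not break an older consumer's logic.

KNOWN_FIELDS = {"instance_id": None, "action": "noop", "flavor": "m1.small"}

def parse_message(msg: dict) -> dict:
    parsed = dict(KNOWN_FIELDS)        # defaults cover fields old senders omit
    for key, value in msg.items():
        if key in KNOWN_FIELDS:        # fields added in newer versions are ignored
            parsed[key] = value
    return parsed

old_msg = {"instance_id": "i-1", "action": "reboot"}                 # older producer
new_msg = {"instance_id": "i-2", "action": "reboot",
           "flavor": "m1.large", "extra_field": 1}                   # newer producer
```

The same additive-only discipline applies to the database schema: new columns get defaults, and columns are only dropped once no supported version still reads them.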
Migrate services between hosts using a GUID in the host flag and an IP alias
Consider using GUIDs rather than hostnames to identify services in the database and in the message queue names. This may work in the current system by specifying a GUID in the host flag.
Using a GUID, with an associated IP alias, should allow Compute (and similar) workers to be smoothly migrated between two different hosts (and in particular during a side by side upgrade).
Because an old host can be turned off and the new host can be started up with the same identity as the old host, this can minimise the downtime (no need to wait for rpm upgrades to complete). It also enables you to more easily scale your cloud in and out on demand, because you can more easily migrate workers to different hosts as required.
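A minimal sketch of the idea, assuming a registry keyed by GUID (the class and its fields are invented for illustration): because the queue name is derived from the GUID rather than the hostname, moving the service to a new host changes nothing that other components depend on.

```python
# GUID-based service identity sketch. The ServiceRegistry class and its
# fields are hypothetical stand-ins for the database rows and queue names
# the spec describes, not real Nova structures.

import uuid

class ServiceRegistry:
    def __init__(self):
        self.services = {}   # guid -> {"host": ..., "queue": ...}

    def register(self, host: str) -> str:
        guid = str(uuid.uuid4())
        # The queue name embeds the GUID, not the hostname.
        self.services[guid] = {"host": host, "queue": f"compute.{guid}"}
        return guid

    def migrate(self, guid: str, new_host: str) -> None:
        # Only the physical location changes; identity (and hence the
        # message queue) is stable across the move.
        self.services[guid]["host"] = new_host

registry = ServiceRegistry()
guid = registry.register("host-a")
queue_before = registry.services[guid]["queue"]
registry.migrate(guid, "host-b")
```

An IP alias that follows the GUID would play the same role for any traffic addressed directly to the worker.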
Live migration
When upgrading a hypervisor, instances should ideally be live migrated to another host, so the host can be upgraded with zero downtime for the instances that were running on it.
Without live migration support, the instances will either be lost (terminated) or be suspended during the hypervisor upgrade.
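Draining a hypervisor before its upgrade might look like the following toy loop. The data structures and the least-loaded placement choice are purely illustrative; in practice the scheduler would pick targets and a real live migration call would move each instance.

```python
# Toy sketch of evacuating a hypervisor ahead of an in-place upgrade.
# hosts maps hypervisor name -> list of instance ids; the append/remove
# pair stands in for an actual live migration.

def drain_host(hosts: dict, source: str) -> None:
    """Move every instance off `source` onto the least-loaded other host."""
    for instance in list(hosts[source]):
        target = min((h for h in hosts if h != source),
                     key=lambda h: len(hosts[h]))
        hosts[target].append(instance)   # live migrate instance -> target
        hosts[source].remove(instance)

hosts = {"hv1": ["i-1", "i-2"], "hv2": ["i-3"], "hv3": []}
drain_host(hosts, "hv1")                 # hv1 is now empty and safe to upgrade
```

Once the source host is empty it can be taken out of the cloud, upgraded, and re-added, with only the temporary capacity reduction noted above.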
Design
To ensure a smooth upgrade we need to be able to support graceful shutdown of services:
Graceful shutdown of services
We need to ensure that when we stop a service, it stops taking new messages from the message queue and completes any current requests that have not finished.
This will help when switching off an old service before performing an upgrade, or rebooting a host for some maintenance reason. The message queue should ensure the system doesn't lose any requests during the short downtime.
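The two-phase stop described above can be sketched as follows. This is a simplified single-threaded model with invented names, not the real service loop: the point is only the ordering, i.e. stop consuming first, then drain in-flight work, while unconsumed messages stay on the queue for the replacement worker.

```python
# Graceful-shutdown sketch: on stop(), the worker stops pulling new
# messages but still finishes anything already accepted. Messages left
# in the inbox survive for whoever consumes the queue next.

import queue

class GracefulWorker:
    def __init__(self):
        self.inbox = queue.Queue()   # stands in for the message queue
        self.in_flight = []
        self.completed = []
        self.accepting = True

    def stop(self):
        # Phase 1: stop taking new work. The queue retains anything
        # we have not consumed, so no requests are lost.
        self.accepting = False

    def run_once(self):
        if self.accepting and not self.inbox.empty():
            self.in_flight.append(self.inbox.get())
        # Phase 2: always finish requests already accepted.
        while self.in_flight:
            self.completed.append(self.in_flight.pop(0))

w = GracefulWorker()
w.inbox.put("req-1")
w.run_once()          # req-1 is accepted and completed
w.inbox.put("req-2")
w.stop()
w.run_once()          # req-2 stays queued for the replacement worker
```

After the drain completes, the old service can be shut down and the upgraded one attached to the same queue.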
Implementation
TODO
UI Changes
None
Code Changes
TODO
Migration
TODO
Test/Demo Plan
We need a continuous integration system to check that trunk can upgrade from the previous released version.
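A hypothetical skeleton of such a CI check is sketched below. deploy() and upgrade() are stubs standing in for real deployment tooling; the shape is what matters: install the previous release, create some state, upgrade to trunk, then verify the state survived.

```python
# Outline of an upgrade CI test. Every function here is a stub; a real
# harness would drive actual deployment tooling and the OpenStack API.

def deploy(version: str) -> dict:
    return {"version": version, "instances": []}

def boot_instance(env: dict, name: str) -> None:
    env["instances"].append(name)

def upgrade(env: dict, new_version: str) -> None:
    env["version"] = new_version   # a real upgrade must preserve instances

def test_upgrade_preserves_instances() -> dict:
    env = deploy("previous-release")
    boot_instance(env, "i-1")
    upgrade(env, "trunk")
    assert env["version"] == "trunk"
    assert env["instances"] == ["i-1"]   # nothing lost across the upgrade
    return env

env = test_upgrade_preserves_instances()
```

Running this on every trunk commit would catch backwards-incompatible schema or message changes as they land.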
Unresolved issues
We need decisions on what backwards compatibility will be guaranteed between new and old database schemas and message queue messages.