Upgrade-with-minimal-downtime
Revision as of 13:58, 28 September 2011
- Launchpad Entry: NovaSpec:upgrade-with-minimal-downtime
- Created:
- Contributors:
Summary
Clouds are expected to be always available and involve large numbers of servers. Here we consider how to perform upgrades with minimal disruption.
Goals for the upgrade are:
- where possible, transparent to the cloud users
- minimal instance downtime or instance connectivity loss
- ability to rollback to a pre-upgrade state if things fail
- ability to upgrade from v2 to v4 without having to do upgrade to v3 first
Release Note
TODO
User stories
Consider the possible different ways to perform the upgrade...
Big bang
To perform the upgrade you could try this approach:
- Build an upgraded cloud alongside your old cloud
- Get it configured
- Make your old cloud read-only
- Copy the state into your new cloud
- Move to using your new cloud
This approach leads to too much downtime.
Rolling upgrade
This approach involves upgrading each component of the system, piece by piece, eventually giving you a cloud running on the new version.
While this is more complex, we should be able to have minimal downtime of each component. Using the resilience built into OpenStack, we should be able to achieve zero downtime, though some actions may take slightly longer than usual.
There are two key ways to perform this kind of upgrade:
- in-place upgrade of each component
- replacement/side-by-side upgrade of each component
In-place upgrades
In-place upgrades require each service to be down for the duration of the upgrade, and they are likely to make rollback harder.
So for the moment I will ignore this approach, except in the case of upgrading the hypervisor. When doing a hypervisor upgrade, you can live migrate the instances away and then remove the node from the cloud without affecting the overall availability of the cloud. You would only have slightly reduced capacity while the node was unavailable.
Side by side upgrades
Side-by-side upgrades involve this procedure for each service:
- Configure the new worker
- Turn off the old worker
  - Allow the message queue or a load balancer to hide this downtime
- Snapshot/backup the old worker for rollback
- Copy/move any state to the new worker
- Start up the new worker
- Repeat for all other workers, in an appropriate order
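The per-worker procedure above can be sketched as a short Python routine. This is a toy model, not Nova code: the Worker class, its attributes, and the state-copying step are all illustrative stand-ins.

```python
# Hedged sketch of the side-by-side procedure for a single worker.
# All names here are hypothetical; real workers would hold queues,
# database rows, and local state rather than a plain dict.

class Worker:
    def __init__(self, name: str, version: str):
        self.name, self.version = name, version
        self.running = False
        self.state = {}

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

def side_by_side_upgrade(old: Worker, new_version: str) -> Worker:
    new = Worker(old.name, new_version)   # 1. configure the new worker
    old.stop()                            # 2. turn off the old worker
    snapshot = dict(old.state)            # 3. snapshot the old worker (kept for rollback; unused here)
    new.state = dict(old.state)           # 4. copy state to the new worker
    new.start()                           # 5. start up the new worker
    return new

old = Worker("nova-compute-1", "v2")
old.start()
old.state = {"queue": "compute.host1"}
new = side_by_side_upgrade(old, "v3")
```

Between steps 2 and 5 the message queue buffers incoming requests, which is what lets this downtime stay hidden from users.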
The advantages of this approach appear to be:
- potentially easier rollback
- potentially less downtime of a component
- works well when deploying nova in a VM
- easier to test as system is in a known state (or VM image)
Assumptions
If we take the side-by-side rolling upgrade, here are the things we need to assume about OpenStack.
Backwards Compatible Schemas
To enable a rolling upgrade of Nova components we need to ensure that communication between all components of OpenStack works across different versions (as a minimum, two versions back).
Things to consider are:
- database schema
- message queue messages
- OpenStack API compatibility (for when different zones are at different versions)
For example, to avoid the need for new versions of the code to work with old database versions, we should assume the database will be upgraded first. However, we could upgrade the database last (in a non-backwards compatible way), if all new versions can work with the old database.
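One common way to get this kind of message compatibility is additive-only changes: a new consumer supplies defaults for fields an old producer omits, and ignores fields it does not recognise. The sketch below illustrates the idea with made-up field names; it is not the actual Nova RPC message format.

```python
# Version-tolerant message handling sketch. Field names are hypothetical.
# Known fields carry defaults so messages from an older sender still parse;
# unknown (newer) fields are silently dropped so messages from a newer
# sender do not break an older consumer's logic.

KNOWN_FIELDS = {"instance_id": None, "action": "noop", "flavor": "m1.small"}

def parse_message(msg: dict) -> dict:
    parsed = dict(KNOWN_FIELDS)        # defaults cover fields old senders omit
    for key, value in msg.items():
        if key in KNOWN_FIELDS:        # fields added in newer versions are ignored
            parsed[key] = value
    return parsed

old_msg = {"instance_id": "i-1", "action": "reboot"}                 # older producer
new_msg = {"instance_id": "i-2", "action": "reboot",
           "flavor": "m1.large", "extra_field": 1}                   # newer producer
```

The same additive-only discipline applies to the database schema: new columns get defaults, and columns are only dropped once no supported version still reads them.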
Migrate services between hosts using a GUID in the host flag and an IP alias
Consider using GUIDs rather than hostnames to identify services in the database and in the message queue names. This may work in the current system by specifying a GUID in the host flag.
Using a GUID, with an associated IP alias, should allow Compute (and similar) workers to be smoothly migrated between two different hosts (and in particular during a side by side upgrade).
Because an old host can be turned off and the new host can be started up with the same identity as the old host, this can minimise the downtime (no need to wait for rpm upgrades to complete). It also enables you to more easily scale your cloud in and out on demand, because you can more easily migrate workers to different hosts as required.
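A minimal sketch of the idea, assuming a registry keyed by GUID (the class and its fields are invented for illustration): because the queue name is derived from the GUID rather than the hostname, moving the service to a new host changes nothing that other components depend on.

```python
# GUID-based service identity sketch. The ServiceRegistry class and its
# fields are hypothetical stand-ins for the database rows and queue names
# the spec describes, not real Nova structures.

import uuid

class ServiceRegistry:
    def __init__(self):
        self.services = {}   # guid -> {"host": ..., "queue": ...}

    def register(self, host: str) -> str:
        guid = str(uuid.uuid4())
        # The queue name embeds the GUID, not the hostname.
        self.services[guid] = {"host": host, "queue": f"compute.{guid}"}
        return guid

    def migrate(self, guid: str, new_host: str) -> None:
        # Only the physical location changes; identity (and hence the
        # message queue) is stable across the move.
        self.services[guid]["host"] = new_host

registry = ServiceRegistry()
guid = registry.register("host-a")
queue_before = registry.services[guid]["queue"]
registry.migrate(guid, "host-b")
```

An IP alias that follows the GUID would play the same role for any traffic addressed directly to the worker.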
Live migration
When upgrading a hypervisor, instances should ideally be live migrated to another host, so the host can be upgraded with zero downtime for the instances that were running on it.
Without live migration support, the instances will either be lost (terminated) or be suspended during the hypervisor upgrade.
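Draining a hypervisor before its upgrade might look like the following toy loop. The data structures and the least-loaded placement choice are purely illustrative; in practice the scheduler would pick targets and a real live migration call would move each instance.

```python
# Toy sketch of evacuating a hypervisor ahead of an in-place upgrade.
# hosts maps hypervisor name -> list of instance ids; the append/remove
# pair stands in for an actual live migration.

def drain_host(hosts: dict, source: str) -> None:
    """Move every instance off `source` onto the least-loaded other host."""
    for instance in list(hosts[source]):
        target = min((h for h in hosts if h != source),
                     key=lambda h: len(hosts[h]))
        hosts[target].append(instance)   # live migrate instance -> target
        hosts[source].remove(instance)

hosts = {"hv1": ["i-1", "i-2"], "hv2": ["i-3"], "hv3": []}
drain_host(hosts, "hv1")                 # hv1 is now empty and safe to upgrade
```

Once the source host is empty it can be taken out of the cloud, upgraded, and re-added, with only the temporary capacity reduction noted above.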
Design
To ensure a smooth upgrade we need to be able to support graceful shutdown of services:
Graceful shutdown of services
We need to ensure that when we stop a service, it stops taking new messages from the message queue and completes any current requests that have not finished.
This will help when switching off an old service before performing an upgrade, or rebooting a host for some maintenance reason. The message queue should ensure the system doesn't lose any requests during the short downtime.
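The two-phase stop described above can be sketched as follows. This is a simplified single-threaded model with invented names, not the real service loop: the point is only the ordering, i.e. stop consuming first, then drain in-flight work, while unconsumed messages stay on the queue for the replacement worker.

```python
# Graceful-shutdown sketch: on stop(), the worker stops pulling new
# messages but still finishes anything already accepted. Messages left
# in the inbox survive for whoever consumes the queue next.

import queue

class GracefulWorker:
    def __init__(self):
        self.inbox = queue.Queue()   # stands in for the message queue
        self.in_flight = []
        self.completed = []
        self.accepting = True

    def stop(self):
        # Phase 1: stop taking new work. The queue retains anything
        # we have not consumed, so no requests are lost.
        self.accepting = False

    def run_once(self):
        if self.accepting and not self.inbox.empty():
            self.in_flight.append(self.inbox.get())
        # Phase 2: always finish requests already accepted.
        while self.in_flight:
            self.completed.append(self.in_flight.pop(0))

w = GracefulWorker()
w.inbox.put("req-1")
w.run_once()          # req-1 is accepted and completed
w.inbox.put("req-2")
w.stop()
w.run_once()          # req-2 stays queued for the replacement worker
```

After the drain completes, the old service can be shut down and the upgraded one attached to the same queue.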
Implementation
TODO
UI Changes
None
Code Changes
TODO
Migration
TODO
Test/Demo Plan
We need a continuous integration system to check that trunk can upgrade from the previous released version.
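A hypothetical skeleton of such a CI check is sketched below. deploy() and upgrade() are stubs standing in for real deployment tooling; the shape is what matters: install the previous release, create some state, upgrade to trunk, then verify the state survived.

```python
# Outline of an upgrade CI test. Every function here is a stub; a real
# harness would drive actual deployment tooling and the OpenStack API.

def deploy(version: str) -> dict:
    return {"version": version, "instances": []}

def boot_instance(env: dict, name: str) -> None:
    env["instances"].append(name)

def upgrade(env: dict, new_version: str) -> None:
    env["version"] = new_version   # a real upgrade must preserve instances

def test_upgrade_preserves_instances() -> dict:
    env = deploy("previous-release")
    boot_instance(env, "i-1")
    upgrade(env, "trunk")
    assert env["version"] == "trunk"
    assert env["instances"] == ["i-1"]   # nothing lost across the upgrade
    return env

env = test_upgrade_preserves_instances()
```

Running this on every trunk commit would catch backwards-incompatible schema or message changes as they land.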
Unresolved issues
We need decisions on what backwards compatibility will be guaranteed between new and old database schemas and message queue messages.