Zero downtime upgrade

DRAFT COPY

On an enterprise public cloud, upgrading OpenStack to a newer release should incur zero (0) downtime. This wiki page collects the information related to this requirement.


 * Etherpad Entry: https://etherpad.openstack.org/p/Zero_downtime_upgrade

Introduction
Assume that Heat is deployed in high-availability (HA) mode, with two nodes each for heat-api, heat-engine, and the other Heat components. Apart from the Python modules themselves, an upgrade can change the following artifacts:


 * 1)  API request/response version change: This is handled entirely by API versioning, such as v2.0 and v3.
 * 2)  Message version change: Currently the API and back-end components handle this with RPC versions. The oslo.messaging package provides a degree of backward compatibility across minor message versions through RPC_VERSION in RPC clients and services.
 * 3)  DB version change: Each major OpenStack release ships a particular version of the DB model, and the db_sync tool is provided to upgrade or downgrade the DB model version. However, there is no coordination between the application ORM and the DB model version, so the system goes down during an upgrade or downgrade even when it is deployed in HA. How to achieve a zero-downtime DB upgrade is elaborated in this document.
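
The minor-version compatibility mentioned for messages in item 2 can be illustrated with a minimal sketch. It assumes the common "same major version, requested minor version not newer than the supported one" rule; the function name `version_is_compatible` and the version strings are illustrative only and are not taken from Heat.

```python
def version_is_compatible(supported: str, requested: str) -> bool:
    """Return True if a service supporting `supported` can handle `requested`.

    Assumed rule (illustrative): major versions must match, and the
    requested minor version must not be newer than the supported one.
    """
    sup_major, sup_minor = (int(part) for part in supported.split(".")[:2])
    req_major, req_minor = (int(part) for part in requested.split(".")[:2])
    if sup_major != req_major:
        # A major bump signals an incompatible message format.
        return False
    return req_minor <= sup_minor


# An upgraded service at 1.3 can still serve a client that speaks 1.1,
# but a service at 1.3 cannot serve a client that already expects 1.4.
old_client_ok = version_is_compatible("1.3", "1.1")
new_client_ok = version_is_compatible("1.3", "1.4")
```

Under this rule, upgrading the services before the clients keeps messages flowing during a rolling upgrade. The DB model has no equivalent check today, which is the gap the rest of this page addresses.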

Zero-downtime DB upgrade
Heat provides a DB migration tool, and each major Heat release ships with a specific DB version. The DB versions for past Heat releases are:


 * Juno: 46

To support a zero-downtime upgrade, the following assumptions need to hold:


 * Backward compatibility is provided for the last 2-3 versions, counting the current version.
 * The DB migration tool takes care of upgrading and downgrading the DB data without loss.
 * Data is not lost while the DB is upgraded from a lower version to a higher version.

Once these assumptions hold, the migration can proceed as follows.

Consider a scenario in which Heat is deployed with heat-engine-A and heat-engine-B, both running the Icehouse version of Heat. To upgrade these engines to Juno, the steps are:


 * 1) Put engine-A and engine-B into maintenance mode
 * 2) Upgrade engine-A and engine-B to Juno
 * 3) Run db_sync
 * 4) Start engine-A and engine-B
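
The four steps above can be sketched as a small simulation. Everything here is hypothetical: the `Engine` class, the `upgrade_with_maintenance` function, and the `db_sync` callable stand in for the real services and tooling, and "maintenance mode" is assumed to mean that an engine finishes its current work and accepts no new requests.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Engine:
    """Hypothetical stand-in for one heat-engine process."""
    name: str
    release: str
    state: str = "running"  # either "running" or "maintenance"


def upgrade_with_maintenance(
    engines: List[Engine],
    target_release: str,
    db_sync: Callable[[], None],
) -> List[str]:
    """Run the maintenance-mode upgrade and return the ordered step log."""
    log: List[str] = []
    # 1) Put every engine into maintenance mode.
    for engine in engines:
        engine.state = "maintenance"
        log.append(f"maintenance:{engine.name}")
    # 2) Upgrade the code on every engine.
    for engine in engines:
        engine.release = target_release
        log.append(f"upgrade:{engine.name}")
    # 3) Migrate the DB schema once, while no engine is serving requests.
    db_sync()
    log.append("db_sync")
    # 4) Start the engines again.
    for engine in engines:
        engine.state = "running"
        log.append(f"start:{engine.name}")
    return log
```

Between step 1 and step 4 no engine is serving requests, which is where the remaining downtime comes from.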

This ensures that all current requests are executed without being aborted, but it still introduces a small amount of downtime. The following steps aim to eliminate it:


 * 1) Introduce a migration module at the data access layer, similar to the migration tool, that converts DB model data to the version the application layer expects before returning it.
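
One possible shape for such a migration module is sketched below. It is not Heat code: the version numbers 1 and 2, the `status_reason` column, and the `convert_row` function are all made up for illustration. The idea is that each row is stepped up or down, one version at a time, until it matches what the running application expects.

```python
from typing import Callable, Dict

Row = Dict[str, object]


def _v1_to_v2(row: Row) -> Row:
    """Hypothetical upgrade step: version 2 adds a 'status_reason' column."""
    new_row = dict(row)
    new_row.setdefault("status_reason", "")
    return new_row


def _v2_to_v1(row: Row) -> Row:
    """Hypothetical downgrade step: version 1 has no 'status_reason' column."""
    new_row = dict(row)
    new_row.pop("status_reason", None)
    return new_row


# Converters keyed by the version they start from.
UPGRADES: Dict[int, Callable[[Row], Row]] = {1: _v1_to_v2}
DOWNGRADES: Dict[int, Callable[[Row], Row]] = {2: _v2_to_v1}


def convert_row(row: Row, stored_version: int, expected_version: int) -> Row:
    """Convert a row from the DB's model version to the application's version."""
    current = dict(row)
    version = stored_version
    while version < expected_version:
        current = UPGRADES[version](current)
        version += 1
    while version > expected_version:
        current = DOWNGRADES[version](current)
        version -= 1
    return current


# An engine still expecting version 1 reads a row already stored as version 2.
legacy_view = convert_row({"id": 7, "status_reason": "done"}, 2, 1)
```

With a layer like this, an engine that has not yet been upgraded could keep reading rows after db_sync has run, and an upgraded engine could read rows before it has run, so the engines would not all need to be stopped at once.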