Fast forward upgrades

Fast Forward Upgrade steps

This document records a suggested path (and gotchas) for performing a fast-forward upgrade (FFU), and links to the FFU documentation of relevant projects (TripleO, Ansible, etc.).

What is a Fast Forward Upgrade?

A fast-forward upgrade is an offline upgrade that runs the upgrade processes of the OpenStack components for every release, one after another, from your originating version to your desired final version.

Preconditions

The control plane will be down for the entire duration of the upgrade
VMs should remain accessible to customers
Since the control plane is down, operators can't perform actions on existing instances (duh!)
Take a full backup of your database before running the migration scripts, just in case things get weird
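
As a rough illustration of the backup step, the sketch below builds a mysqldump command line. It assumes a MySQL/MariaDB backend with credentials supplied via the usual client config (for example ~/.my.cnf); the output directory, file name pattern and the helper name build_backup_command are hypothetical. It prints the command instead of running it so you can review it first.

```python
import shlex
from datetime import datetime, timezone


def build_backup_command(output_dir="/var/backups"):
    """Build a mysqldump command line for a full, consistent dump."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # Hypothetical file name; pick whatever fits your backup conventions.
    result_file = f"{output_dir}/openstack-pre-ffu-{stamp}.sql"
    return [
        "mysqldump",
        "--all-databases",       # every service schema (nova, keystone, ...)
        "--single-transaction",  # consistent snapshot for InnoDB tables
        f"--result-file={result_file}",
    ]


if __name__ == "__main__":
    # Print rather than execute, so the command can be reviewed first.
    print(" ".join(shlex.quote(arg) for arg in build_backup_command()))
```

Whatever tool you use, check that the dump can actually be restored before you start running migrations.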

Why offline not online?

High level upgrade process

Detailed X->Y upgrade reports

Oath fast-forward upgrade process for Juno to Ocata

csc.fi halfway FFU from Liberty to Newton

Upgrade scripts

Testing Fast Forward upgrades

Validation

Gotchas

There is a corner case when upgrading Nova from X (figure out which version) to Y: the upgrade expects you to bring the cluster online and let it perform another upgrade step online. When fast-forwarding through this version you'll either need to run some manual steps to fix things up correctly, or pause your FFU and bring the cluster online during this intermediate version. (Just stubbing this out, add details.)
Make sure you've got your network ACLs configured correctly. For example, in Ocata the compute nodes need access to the placement API. If that port isn't open, your compute nodes won't be able to report their capabilities.
If you try to update nova-compute on only one or two compute nodes to test things before upgrading the others, you'll hit errors. You'll need to either upgrade all compute nodes, or go into the DB and drop the metrics data from the compute hosts column (I think that was what we had to do).
Setting overcommit ratios on host aggregates no longer works as of Ocata, because the overcommit ratios aren't sent from host aggregates to placement. To work around this, set your overcommit ratios in the nova-compute conf on each compute node.
In Kilo a "host" field was added to the compute_nodes table, with the expectation that nova-compute would populate that column. Skipping this step causes a later nova-manage discover_hosts run to automatically add all the compute nodes a second time. I believe we had to fix this one in nova-manage.
The openstack client has suboptimal logic for things like listing instances. It attempts to fetch a list of all images available to the user making the call. If you're an individual user, this isn't a big deal. But the admin can see /all/ images, so if you have a lot of images, including snapshots, it can take a minute or so to run 'openstack server list'. The fix is easy; I need to put up a patch.
The code for suspends (and subsequently snapshots) changed a bit somewhere between Juno and Ocata, and now depends on a feature which isn't compiled into QEMU on RHEL 7 by default. The fix is simple: either recompile QEMU or grab the version available in RHEV.
If you use a multi-master, single-writer DB config, be aware of a bug in oslo.db that had the read-master/write-master logic backwards. You can see the bug here: https://bugs.launchpad.net/oslo.db/+bug/1746116
When you run the nova-manage command to create the cell mappings table, be advised that it will use the DB connection string provided on the CLI as the cell0 mapping. This means that if you run that command from the DB host, it will create a mapping that says localhost:3306. You'll need to go in afterwards and fix the DB URI so it points at your load balancer.
We found it was easier to delete all the entries from the endpoints table and recreate them with nova-manage than to edit every endpoint to point at the correct URI (particularly because you're moving from Keystone v2 to Keystone v3).
If using Fernet tokens, make sure to sync the Fernet keys among your new API nodes. Rsync will suffice.
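
For the network ACL gotcha above, a quick reachability check run from a compute node can catch a closed port before nova-compute starts failing to report in. This is a minimal sketch using only a plain TCP connect; the hostname is a placeholder, and port 8778 is an assumption (a commonly used placement port), so substitute whatever your deployment actually uses.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; substitute your own placement API address.
    host, port = "placement.example.com", 8778
    state = "reachable" if port_is_open(host, port) else "NOT reachable"
    print(f"placement API at {host}:{port} is {state}")
```

A successful connect only shows that the ACL lets traffic through; it does not prove the placement service behind the port is healthy.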
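
For the cell0 mapping gotcha above, the correction itself is just a host swap inside the connection URI. The sketch below shows one way to compute the corrected string with the Python standard library. The helper name replace_db_host and the example hostnames and credentials are made up, and writing the result back into the cell mapping is still a manual step that this sketch does not cover.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_db_host(connection_uri, new_host, new_port=None):
    """Swap the host (and optionally the port) of a database connection URI."""
    parts = urlsplit(connection_uri)
    # Keep the "user:password@" prefix untouched, whatever it contains.
    userinfo, at, _old_hostport = parts.netloc.rpartition("@")
    port = new_port if new_port is not None else parts.port
    hostport = f"{new_host}:{port}" if port else new_host
    return urlunsplit(parts._replace(netloc=f"{userinfo}{at}{hostport}"))


if __name__ == "__main__":
    # Hypothetical values: a mapping recorded from the DB host itself.
    wrong = "mysql+pymysql://nova:secret@localhost:3306/nova_cell0"
    print(replace_db_host(wrong, "db-vip.example.com"))
```

Running nova-manage with the load balancer address in the connection string in the first place avoids the need for any of this.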
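
For the overcommit gotcha above, the workaround amounts to setting the three allocation ratio options in the [DEFAULT] section of nova.conf on every compute node. The sketch below renders such a fragment. The ratio values are purely illustrative, the helper name render_allocation_ratios is made up, and configparser is used only as a convenient INI writer (Nova itself reads its config through oslo.config).

```python
import configparser
import io


def render_allocation_ratios(cpu, ram, disk):
    """Render a nova.conf fragment that pins the overcommit ratios."""
    conf = configparser.ConfigParser()
    conf["DEFAULT"] = {
        "cpu_allocation_ratio": str(cpu),
        "ram_allocation_ratio": str(ram),
        "disk_allocation_ratio": str(disk),
    }
    buf = io.StringIO()
    conf.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Illustrative values only; use the ratios your aggregates used to carry.
    print(render_allocation_ratios(cpu=4.0, ram=1.0, disk=1.0))
```

If different aggregates had different ratios, you would render one fragment per group of compute nodes and push it out with whatever config management you already use.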