OpenstackDeployability

= Nova CI & Deploy =

Goal
To “continuously” deploy Nova in a way that least interrupts customers and their interaction with the cloud. Continuously means as frequently as possible. It would be desirable to achieve the frequency that Facebook or Twitter accomplish (multiple times a week, daily, or even several times daily), but a reasonable goal to start with would be at least once a week.

This will involve heavy use of a constantly changing functional test suite. The goal is to have as much code, and therefore as much functionality, covered as possible, as that is the only way to feel even marginally secure in our ability to deploy working code quickly.

Thus the plan is divided into two pieces: functional testing and deployment.

Testing
There is a suite of tests that has been very effective at identifying trunk breakage and packaging issues for both libvirt and Xen. The first step would be integrating these into openstack-integration-tests. Furthermore, Dtest or another similar module integrated into the functional tests would allow us to parallelize the test runs for added performance.
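As a rough illustration of what parallelizing the runs might look like, here is a minimal sketch using Python's standard thread pool; the test names and bodies are placeholders, not entries from the real suite:

```python
# Hypothetical sketch: running independent functional tests in parallel.
# Real functional tests would hit live services; these stubs just stand in.
from concurrent.futures import ThreadPoolExecutor

def test_boot_instance():
    return ("test_boot_instance", True)

def test_attach_volume():
    return ("test_attach_volume", True)

def run_parallel(tests, workers=4):
    """Run each test callable concurrently and collect (name, passed) pairs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: t(), tests))

results = run_parallel([test_boot_instance, test_attach_volume])
```

This only pays off if the tests are genuinely independent, which is exactly what a module like Dtest is meant to enforce.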

Updating the services
This could be as simple as sending a signal to each service to gracefully stop and then initiate the update process. This process will differ somewhat by agent, but in the general case, the service would stop reading its queue, perform any necessary cleanup, and finish processing all relevant jobs.
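The general case above can be sketched roughly as follows; the `Worker` class and job names are illustrative, not actual Nova service code:

```python
# Drain-stop sketch: after the stop signal, no new jobs are accepted,
# but jobs already taken are finished before the process exits.
import signal

class Worker:
    def __init__(self, queue):
        self.queue = queue
        self.accepting = True
        self.completed = []

    def stop(self, signum=signal.SIGTERM, frame=None):
        self.accepting = False   # stop reading the queue; cleanup would go here

    def run_once(self):
        # Only take a job if we have not been asked to stop.
        if self.accepting and self.queue:
            self.completed.append(self.queue.pop(0))

w = Worker(["job-1", "job-2", "job-3"])
signal.signal(signal.SIGTERM, w.stop)  # wire up the graceful-stop signal
w.run_once()
w.stop()        # simulate receiving SIGTERM mid-run
w.run_once()    # remaining jobs are left for the replacement agent
```

The key property is that stopping is a two-phase affair: stop consuming first, exit only when in-flight work is done.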

Two options exist at this point: we can wait for the original agent to finish processing before we start a new agent, or we can immediately start a new agent. Both variations present problems. Allowing the existing agent to completely finish effectively disables the control infrastructure, which is as good as downtime, and downtime is exactly what we’re hoping to avoid when “continuously” deploying. The latter means we’re going to have multiple headaches around areas such as managing the database, with both processes demanding a version of the database model they understand. Additionally, care must be taken in the event an API changes between communicating Nova services or between the compute node and the hypervisor.

I would argue the latter option is more manageable, at the cost of requiring substantially better planning, and overall more valuable to what we’re trying to accomplish.

Versioning the queues or exchanges
If we take the latter approach above, it allows us to take an interesting approach to migrating agents that are producers and consumers of RPC calls. Namely, if agents have a version, queues can actually be versioned as well. Thus a scheduler might send to a queue named host25.v1234, and when the host is upgraded it would send to host25.v1235. This allows us to deploy host nodes at any time, and to quickly roll back, simply by changing upstream agents.
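The naming scheme itself is trivial; a minimal sketch, with the host and version values taken from the example above:

```python
# Sketch: deriving a versioned queue name so producers can switch targets
# (or roll back) just by changing the version they address.
def queue_name(host, version):
    return "%s.v%s" % (host, version)

# The scheduler addresses whichever version it was deployed to speak to:
old = queue_name("host25", 1234)   # pre-upgrade target
new = queue_name("host25", 1235)   # post-upgrade target
```

Rollback then costs nothing on the host node: upstream agents simply resume publishing to the old queue name.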

Updating the database
This is a major concern for trying to deploy with minimal downtime, because you’re mucking with data structures that multiple processes are relying on. However, it should only happen a small fraction of the time.

If we allow new agents to appear while existing agents continue running until they finish existing jobs, then we have a couple issues in utilizing the database. Each agent will expect the database to look a certain way, meaning all columns and data types must be exactly as anticipated.

Changing the database
One way to solve this is to never allow migrations to change, rename, or remove existing columns from the database. This means that migrations can only ever add new columns. Obviously this is undesirable, as we would soon have a migrated database sitting around with an enormous number of meaningless columns.

Another issue is that, while the old process will continue to utilize the old columns until it dies, the new process will attempt to use the new columns. Without proper foresight and planning, values deposited in those old columns won’t be used.

To solve the first issue, I propose all additive migrations be separated from destructive ones, and that a layer of orchestration be put in place. This orchestration layer would only run the additive migrations at the first stage of the cluster upgrade process, where we start the new agent and allow the old one to finish processing. We would then implement some kind of monitor/watcher that waited for the original process to die. At that time, the destructive migrations could be run, removing any newly-defunct columns.
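The proposed two-stage orchestration can be sketched as below; the migration lists, agent hooks, and function names are all placeholders for whatever the real orchestration layer would provide:

```python
# Sketch of the proposed orchestration: additive migrations run up front,
# destructive migrations run only after the old agent has exited.
def upgrade(additive, destructive, start_new_agent, wait_for_old_agent_exit):
    log = []
    for m in additive:          # stage 1: schema only gains columns
        log.append("additive:" + m)
    start_new_agent()           # old and new agents now coexist safely
    wait_for_old_agent_exit()   # the monitor/watcher blocks here
    for m in destructive:       # stage 2: drop the newly-defunct columns
        log.append("destructive:" + m)
    return log

log = upgrade(["add_col_x"], ["drop_col_y"], lambda: None, lambda: None)
```

The invariant being enforced is simple: no column is dropped while any process that reads it is still alive.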

Keeping everyone happy
Solving the second issue is quite a bit trickier. We will have old agents attempting to write values into the old columns, along with the new agent attempting to read values from the new columns. Without planning, nothing will sync the columns together. This problem deserves further exploration, but a couple of ideas come to mind. Note that I’m not sure any of these would work 100% of the time.

1. We could implement the new driver code such that it attempts to read the old column and writes data into the new column only. All datatype conversions would apply. This becomes somewhat nasty once the old columns go away, because now every database call has branching logic to determine if a column still exists.
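A minimal sketch of this read-old/write-new pattern, using plain dicts in place of database rows and made-up column names:

```python
# Sketch of idea 1: new code reads the legacy column while it still exists,
# but only ever writes the new column.
def read_value(row, old_col, new_col):
    # Prefer the new column; fall back to the legacy one if it is still set.
    if row.get(new_col) is not None:
        return row[new_col]
    # This fallback branch is the "nasty" part: it must be deleted
    # along with the old column, from every call site.
    return row.get(old_col)

def write_value(row, new_col, value):
    row[new_col] = value     # new code never touches the old column

row = {"instance_type_old": "m1.small"}
before = read_value(row, "instance_type_old", "instance_type")
write_value(row, "instance_type", "m1.large")
after = read_value(row, "instance_type_old", "instance_type")
```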

2. This is only hypothetical since I’m a database moron, but I think we could implement triggers to sync changes from the old column to the new one. I’m not sure we could convert datatypes readily in the trigger, though, and I’m not sure how this would apply across all the db implementations.
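For what it’s worth, here is a hypothetical sketch of that trigger idea using SQLite purely as an illustration; the table, columns, and trigger are made up, and whether the equivalent syntax ports cleanly to other databases is exactly the open question above:

```python
# Hypothetical trigger sketch (SQLite): copy writes made by old agents into
# the column the new agent reads.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE instances (old_state TEXT, new_state TEXT)")
db.execute("""
    CREATE TRIGGER sync_state AFTER UPDATE OF old_state ON instances
    BEGIN
        UPDATE instances SET new_state = NEW.old_state
        WHERE rowid = NEW.rowid;
    END
""")
db.execute("INSERT INTO instances (old_state) VALUES ('building')")
db.execute("UPDATE instances SET old_state = 'active'")  # old agent writes
synced = db.execute("SELECT new_state FROM instances").fetchone()[0]
```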

2b. We could include timestamps for both the old and new columns, and use a trigger to update the timestamp. Which field we utilize is based on which timestamp is more relevant. Then the timestamps, and the code to read from the multiplexed column, could be removed in a future patch. This eliminates checking for the existence of a column, and removes any confusion about which column to use as the canonical value.
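The read side of that multiplexing might look like the following sketch, again with plain dicts standing in for rows and illustrative field names:

```python
# Sketch of idea 2b: each column carries its own updated-at timestamp, and
# reads go to whichever column was written most recently.
def read_multiplexed(row, old_col, new_col):
    if row.get(new_col + "_updated_at", 0) >= row.get(old_col + "_updated_at", 0):
        return row[new_col]
    return row[old_col]

row = {
    "state_old": "building", "state_old_updated_at": 100,
    "state_new": "active",   "state_new_updated_at": 150,
}
value = read_multiplexed(row, "state_old", "state_new")
```

Note that there is no existence check here, which is the point: the decision rests entirely on the timestamps, and this helper plus the timestamps disappear together in a later patch.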

3b. Once the original process dies, we can either contend with the new agent having to convert and manage the data, or we can attempt to reconcile it somehow, such as with a follow-up deploy that patches the code. I would argue that the length of time expected between deploys would be the deciding factor. If we’re going to suffer the performance penalty for months, then it might be worthwhile to have a multi-stage deploy that eliminates the reconciling code ASAP.

Pre-migrating
As an alternative, we could petition that db migrations be a precursor commit to the code. This would isolate db changes and ensure that the current code can continue to run with db schema alterations. The process would actually be mostly the same as the agent updating above: we would have to drain-stop the existing worker, wait for it to finish in each zone, execute the DB migrations, and then stand up another compute worker. I.e., I don’t think it would work.

Put differently, if we pre-migrate, we still need to wait for the original agent to go away before we can safely migrate the database. This introduces the same problem as above: either we temporarily lose the control infrastructure, or we have to develop the “pre-migration” code in such a way that it only adds columns and interacts with the db appropriately. And then we’d have another step where “newer” code shows up that acts on the new functionality. This really only introduces another step in the pipeline, for no gain.

Another method?
In Python, we can actually tell a process to dynamically load a module, so it might work to swap out the db drivers inside the dying process for the purposes of finishing its job. Great care would have to be taken to ensure we don’t actually change the API in any way, so we’d be restricted, for at least a short period of time, as above, with additive DB API changes landing before destructive ones.
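The reload mechanism itself is standard library; a minimal sketch, where reloading `json` merely stands in for swapping a real db driver module:

```python
# Sketch of hot-swapping a module in a running process with importlib.
# Using json here is purely illustrative; the real target would be a
# db driver module whose public API is unchanged.
import importlib
import json

driver = json                       # pretend this is the loaded db driver
driver = importlib.reload(driver)   # re-executes the module's code in place
ok = driver.loads('{"ok": true}')["ok"]
```

The catch the text identifies remains: `reload` replaces code, but any objects already constructed from the old module keep their old behavior, which is why the API must stay identical.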

Deployment Classes
Given the above, we can readily define a couple of different types of deployments:

Simple
These would be deployments with no database changes and minimal code changes. Minimal means changes isolated to a single manager/agent, or something like a brand-new API extension (with some caveats). We should be able to upgrade in nearly all simple situations with zero to minimal downtime.

Moderate
This would involve changes to multiple services, with no database changes. A large refactoring could fall under this banner, as it would likely touch multiple agents in the code. This brings complications mentioned above, such as having to manage APIs between services so that we don’t break anything. If we version the queues or exchanges as mentioned above, I *believe* it will mitigate the majority of the issues. Arguably, we should be able to eliminate the need for control structure downtime in nearly all cases.

Complex
This is effectively a combination of either of the above plus database migrations. If the above methods pan out, we should be able to eliminate the need for substantial downtime in most cases. There could be short periods of individual agent downtime, as the ordering of dependencies may be important.

Critical
These are deployments where there’s just no choice but to have downtime. Lacking concrete examples, I don’t believe this case will occur very frequently.

Deploying Glance
Glance apparently has API versioning, and said versioning is agreed upon during a handshake between the client and server, so that problem should be solved.

Other issues exist, however. Please see the following bugs, all of which may impact deployment:

https://bugs.launchpad.net/bugs/824794
https://bugs.launchpad.net/bugs/861653
https://bugs.launchpad.net/bugs/861650

If we can commit to changing out only a percentage of glance nodes, while disabling the rest, then deploying glance shouldn’t be all *too* difficult. The only catch is that Nova has to be somewhat persistent in its attempts to find a glance node to stream the image through.
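That persistence amounts to a simple failover loop on the Nova side; a sketch under the assumption that we have a list of candidate glance nodes, where the node names and the `fetch` callable are hypothetical:

```python
# Sketch: try each glance node in turn until one answers, so that nodes
# disabled for the rolling deploy are silently skipped.
def fetch_image(nodes, fetch):
    last_error = None
    for node in nodes:
        try:
            return fetch(node)    # stream the image through this node
        except IOError as e:      # node disabled or mid-upgrade; move on
            last_error = e
    raise last_error              # every node failed; surface the error

def fake_fetch(node):
    # Stand-in for a real image request: only one node is up.
    if node != "glance-3":
        raise IOError("node down")
    return "image-bytes"

data = fetch_image(["glance-1", "glance-2", "glance-3"], fake_fetch)
```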