StructuredStateManagementDetails

Drafter: Harlowja

Revised on: // by

Concepts required
To implement this new state management layer, the following key concepts must be built into the design from the start.


 * 1) Atomic task units.
 * 2) Combining atomic task units into a workflow.
 * 3) Task tracking and resumption.
 * 4) Resource locking and single workflow ownership.
 * 5) Minimal impact (allowing for this change to land).
 * 6) High availability.
 * 7) Tolerance to upgrades.
 * 8) Task cancellation (nice to have)

Atomic task units

Why it matters
Tasks that are created (either via code or another operation) must be atomic, so that the task as a unit can be said to have been applied/completed, or to have failed (or been cancelled). This allows the task to be rolled back as a single unit. Since a workflow (i.e., one composed of tasks [1, 2, 3]) may fail at any stage, for example at stage 3, there needs to be a clearly defined path to roll back 2 and 1 (in that order). This task rollback concept is required to cleanly and correctly undo actions after a failure (or after a cancellation, which is not necessarily caused by a failure in the traditional sense).

How it could be addressed
States and state transitions which were previously in the run_instance path of nova (and eventually other paths) were refactored into clearly defined tasks, each with an apply method and a rollback method. These tasks were split up so that, wherever possible, each task performs a single, clearly defined and easily understandable piece of work in an atomic manner (i.e., not one big task that does many different things).
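The shape of such a task can be sketched as follows. This is only an illustration: the Task base class, the AllocateAddress task and the plain dict used as context are hypothetical names chosen here, not code taken from nova.

```python
import abc


class Task(abc.ABC):
    """One atomic unit of work: it either fully applies or is rolled back."""

    @abc.abstractmethod
    def apply(self, context):
        """Perform the single piece of work this task is responsible for."""

    @abc.abstractmethod
    def rollback(self, context):
        """Undo whatever side effects apply() produced."""


class AllocateAddress(Task):
    """Hypothetical task: reserve an address for an instance."""

    def __init__(self, pool):
        self.pool = pool  # list of free addresses (stand-in for a real service)

    def apply(self, context):
        context["address"] = self.pool.pop()

    def rollback(self, context):
        # Return the address only if apply() actually reserved one, so that
        # calling rollback() twice does not hand the address back twice.
        address = context.pop("address", None)
        if address is not None:
            self.pool.append(address)
```

Keeping each task this small is what makes its inputs, side effects and undo step easy to test in isolation.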

Note: this technique also makes testing of each task easier, since the task has a clear set of inputs and a clear set of expected outputs/side-effects (which the rollback method should undo). This is not possible without this kind of task refactoring work.

Note: if downstream services work in an asynchronous manner that cannot be cancelled (quantum and the quantum agent, for example, are problematic), then rollbacks will be much harder to accomplish correctly. Most of the downstream services, like libvirt, nova-network and similar, appear to provide acquire/release semantics (or an analogous mechanism), which are needed to correctly perform rollback (and, by association, correct cancellation).

Combining atomic task units into a workflow

Why it matters
A workflow in nova can typically be organized into a series of smaller tasks/states which can be applied as a single unit (and rolled back as a single unit). Since we need to be able to run individual tasks in a trackable manner (and we shouldn't duplicate this code in multiple places), a concept of a chain of tasks (for the mostly linear nova case) is needed to run and roll back that set of tasks as a group in an easy-to-use manner (which makes the code easier to read as well!).

Note: starting with a linear chain of tasks will satisfy many of the typical workflow actions. In the future it might be useful to have a larger set of workflow 'patterns' that fit other use-cases.
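A linear chain of this kind could look roughly like the sketch below. The Chain and Step classes are hypothetical; the sketch assumes each task exposes apply(context) and rollback(context), and that only tasks which completed need to be undone when a later one fails.

```python
class Chain:
    """Runs tasks in order; on failure rolls back finished tasks in reverse."""

    def __init__(self, tasks):
        self.tasks = list(tasks)

    def run(self, context):
        done = []
        for task in self.tasks:
            try:
                task.apply(context)
            except Exception:
                # Undo the tasks that completed, newest first, then let the
                # original failure propagate to the caller.
                for finished in reversed(done):
                    finished.rollback(context)
                raise
            done.append(task)


class Step:
    """Hypothetical task used for illustration; it records what happened."""

    def __init__(self, name, journal, fail=False):
        self.name = name
        self.journal = journal
        self.fail = fail

    def apply(self, context):
        if self.fail:
            raise RuntimeError("%s failed" % self.name)
        self.journal.append("apply " + self.name)

    def rollback(self, context):
        self.journal.append("rollback " + self.name)
```

With tasks [1, 2, 3] where 3 fails, the journal ends up as apply 1, apply 2, rollback 2, rollback 1, which is the ordering described above.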

Task tracking and resumption

Why it matters
Key to horizontally scaling your state engine units is the ability to resume the work of a failed state engine unit on another state engine unit. This allows these units to survive individual program failure without putting the tasks they were working on into an ERROR-like state, which should not be needed in a transactional, workflow-oriented system. This resumption and its associated tracking also provide a useful bonus feature: it becomes very easy to audit the tasks and workflows that a given resource/request has gone through in the lifetime of that request. This has many uses and potentially enables interesting new features later.

How it could be addressed
To start, a task log will need to be created, and as each task is executed that log will need to be updated accordingly. Note that it is not enough to just store the name of the task that completed; enough metadata must also be stored with that log entry to be able to correctly revert that task. Typically in nova there is a concept of a resource that is being modified by each task, so at the end of each task we can store the resource (which will pick up the modifications of that resource automatically). This task log and associated metadata can be stored in different ways. For example, a database could be used, or something less durable, since once a workflow completes there may not be a need to keep the task log around. A database (or something similarly durable, for example zookeeper persistent nodes) would allow for workflow auditing, but those who do not care about auditing could use something like redis, which offers non-durable, expiry-based storage of complex objects.
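As a rough sketch of such a task log: the TaskLog class below is hypothetical and uses a plain dict as its backend, on the assumption that a database table, a zookeeper node or a redis key could sit behind the same small interface. JSON is used here only to show that the stored resource must be serializable.

```python
import json


class TaskLog:
    """Append-only record of the finished tasks of one workflow."""

    def __init__(self, store, workflow_id):
        self.store = store  # any dict-like backend (assumption)
        self.workflow_id = workflow_id

    def entries(self):
        raw = self.store.get(self.workflow_id)
        return json.loads(raw) if raw else []

    def record(self, task_name, resource):
        entries = self.entries()
        # Store a snapshot of the resource next to the task name, so that a
        # later rollback (possibly on another engine) has what it needs.
        entries.append({"task": task_name, "resource": resource})
        self.store[self.workflow_id] = json.dumps(entries)

    def remaining(self, task_names):
        """Names of the tasks that still have to run, in their given order."""
        finished = {entry["task"] for entry in self.entries()}
        return [name for name in task_names if name not in finished]
```

An engine that takes over a workflow would read remaining() to decide where to continue, and entries() in reverse to decide what to undo.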

Issues: the admin_password entry is a part of this resource object, so it needs to be stored securely (if we store it at all in the task log metadata). If zookeeper is used, it is likely useful to look into integration with its ACL management system (to prevent unauthorized parties from reading said resources). SSL, if required, may also have to be addressed and implemented (see ).

Resumption: To accomplish resumption there needs to be a concept of workflow ownership and a way to notice when a state engine unit has failed (liveness) and to release the ownership of that workflow so that another engine unit can take over said ownership. Both of these problems can be solved by zookeeper, a battle-hardened program that is designed for ownership and triggering tasks, of which workflow ownership and releasing that ownership are prime examples. The way to understand this is to notice that ownership of a workflow is a lock on an object (this object being the workflow or a reservation of that workflow), and zookeeper has the ability to notify others when said lock is lost (either due to failure or to cancellation) via a concept of watches. These two general concepts allow for resumption in a distributed and highly scalable manner. Zookeeper even provides a few recipes which may help in making this possible; see the kazoo recipe source code for a few examples (the locking queue concept is very useful).
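The interplay of ownership and watches can be modelled without a zookeeper server. The OwnershipRegistry below is a hypothetical in-memory stand-in, not kazoo code: claim() plays the role of acquiring a lock (or creating an ephemeral node), watch() the role of a one-shot watch, and engine_died() the role of a lost session. In a real deployment these would presumably map onto kazoo's lock recipe and watches.

```python
class OwnershipRegistry:
    """In-memory model of workflow ownership with release notifications."""

    def __init__(self):
        self.owners = {}    # workflow id -> id of the engine that owns it
        self.watchers = {}  # workflow id -> callbacks fired on release

    def claim(self, workflow_id, engine_id):
        """Take ownership if nobody holds it; return True on success."""
        if workflow_id in self.owners:
            return False
        self.owners[workflow_id] = engine_id
        return True

    def watch(self, workflow_id, callback):
        """Register a one-shot callback for when ownership is released."""
        self.watchers.setdefault(workflow_id, []).append(callback)

    def engine_died(self, engine_id):
        """Release everything the engine owned, as a lost session would."""
        lost = [wf for wf, owner in self.owners.items() if owner == engine_id]
        for workflow_id in lost:
            del self.owners[workflow_id]
            # Every watcher is told; the first one to call claim() wins.
            for callback in self.watchers.pop(workflow_id, []):
                callback(workflow_id)


registry = OwnershipRegistry()
registry.claim("wf-1", "engine-a")
registry.watch("wf-1", lambda wf: registry.claim(wf, "engine-b"))
registry.watch("wf-1", lambda wf: registry.claim(wf, "engine-c"))
registry.engine_died("engine-a")  # engine-b now owns wf-1
```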

Example code: https://github.com/harlowja/zkplayground

Note: it is likely very hard (or would require a lot of jumping through hoops) to correctly provide these semantics with a plain old database (e.g. mysql), since databases do not provide watches/triggering or distributed locking (especially when replicated, sharded or clustered). Without these semantics it is likely not easily (or cleanly) possible to have automatic resumption capabilities without some concept of periodic table-scanning tasks (which by itself does not address the distributed lock acquisition/release semantics).

Resource locking and single workflow ownership

Why it matters
In order to ensure consistent resource state and consistent workflow state, we first need to be able to lock the resource (think instance) that we are working on and also lock the workflow that is being worked on. This ensures that only a single entity is allowed to process said workflow and resource at the same time, which avoids state and resource inconsistencies and race conditions.

See: StructuredWorkflowLocks
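To illustrate holding both locks together, here is a minimal sketch in which a local LockTable stands in for a distributed lock service. LockTable, own() and the "resource/..." and "workflow/..." naming scheme are all assumptions made for this example.

```python
import contextlib
import threading


class LockTable:
    """Named non-blocking locks; a local stand-in for distributed locks."""

    def __init__(self):
        self._guard = threading.Lock()
        self._held = set()

    def try_acquire(self, name):
        with self._guard:
            if name in self._held:
                return False
            self._held.add(name)
            return True

    def release(self, name):
        with self._guard:
            self._held.discard(name)


@contextlib.contextmanager
def own(locks, resource_id, workflow_id):
    """Hold both the resource lock and the workflow lock, or neither."""
    names = ["resource/" + resource_id, "workflow/" + workflow_id]
    taken = []
    try:
        for name in names:
            if not locks.try_acquire(name):
                raise RuntimeError("already locked: " + name)
            taken.append(name)
        yield
    finally:
        # Runs on normal exit and on failure, so a partial acquisition
        # never leaves a stray lock behind.
        for name in reversed(taken):
            locks.release(name)
```

Acquiring the two locks in a fixed order (resource first, then workflow) is a design choice made here to avoid two entities each holding one lock and waiting for the other.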

High availability

Why it matters
Upon failure of an entity processing a given workflow, the workflow as a whole must not be placed into an ERROR state (which is what nova commonly does right now). Instead, the entity that is processing the workflow should be made highly available, so that even upon individual entity failure a secondary entity (or 1 of N other entities) is able to claim the incomplete workflow and continue where the failed one left off. Since the workflows are processed by these HA entities, the workflows being processed are also highly available and tolerant to individual entity failure. This increases scale, reliability and stability, and decreases the number of support requests and manual interventions required to move a workflow from the ERROR state to a recovered state.

How it could be addressed
Zookeeper allows us to notice when an entity processing a workflow has failed (due to its concept of heartbeats). This solves the detection of a failed entity and causes the locks said entity has acquired to be released upon failure. The question then becomes how to transition the workflow and its associated tasks to another entity that has not failed. One way to do this is to use the zookeeper concept of a watch, which can be set on a workflow 'node' by 2 or more other entities. These entities will attempt to acquire that workflow 'node' when the failed entity loses said acquisition. There could be a root directory node that receives these child nodes; when a new item appears in that root directory node, the other 2+ entities would set watches on said new node. This is one potential way of solving this.

Another way, which already has a recipe in zookeeper, is to use the concept of a locked queue (see code), where an individual path/directory in zookeeper can be set up to receive work items (note that here we could have a different path/directory per job type, for example). The above work of triggering watches and selecting the acquiring entity is then done by zookeeper itself (similar in concept to what was described for the above process).
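The behaviour expected from such a locked queue can be modelled in memory as below. LockedQueue is a hypothetical class written for this page; it is not the kazoo recipe, only a sketch of the semantics: an item taken by an engine stays locked until it is consumed, and returns to the queue if that engine dies first.

```python
import collections


class LockedQueue:
    """In-memory model of a queue whose items are locked while in progress."""

    def __init__(self):
        self.pending = collections.deque()
        self.in_progress = {}  # engine id -> item it currently holds

    def put(self, item):
        self.pending.append(item)

    def get(self, engine_id):
        """Lock and return the next item, or None if nothing is pending."""
        if engine_id in self.in_progress:
            # An engine holds at most one item until it consumes it.
            return self.in_progress[engine_id]
        if not self.pending:
            return None
        item = self.pending.popleft()
        self.in_progress[engine_id] = item
        return item

    def consume(self, engine_id):
        """Mark the held item as finished and drop it for good."""
        return self.in_progress.pop(engine_id, None)

    def engine_died(self, engine_id):
        """Put an unfinished item back at the front for another engine."""
        item = self.in_progress.pop(engine_id, None)
        if item is not None:
            self.pending.appendleft(item)
```

One queue per job type, as suggested above, would simply mean one such object (or one zookeeper path) per type.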

A third way is to transfer workflows which have no entity processing them to an 'orphaned' folder in zookeeper. Each state engine can then either have a periodic task that scans this folder or set watches that are triggered (automatically) when a new item is moved into that folder, which starts the ownership acquisition process.

Tolerance to upgrades

Why it matters
Currently, in clouds that run OpenStack, the upgrade process is fraught with trouble. During said upgrade the APIs are turned off and in-flight operations are typically dropped (i.e., the MQ is reset), and any in-flight operations either need to be manually cleaned up or need to wait for periodic tasks to come around and attempt repair (which may or may not be done correctly, since it's not usually possible to clean up everything in a workflow with only local knowledge of what occurred). This is usually painful: in-flight instances end up in the ERROR state, resources are typically orphaned (if not cleaned up locally by a periodic task) and must be manually reclaimed, and end-users must re-call the APIs to terminate their broken instances, which results in a bad end-user experience.

How it could be addressed
The first step to solving in-flight operations being discarded is to make in-flight operations resumable. This is addressed by the resumption strategy discussed above, so that if the MQ or zookeeper is brought down (for planned or unplanned reasons) the task-log associated with said workflow can be referenced and resumed from. This will solve the current situation where operations and resources are abandoned, orphaned or placed into ERROR states. If this is not workable, the persistent (or at least temporarily persistent, i.e. persistent until the workflow finishes) task-log can also be referenced to determine what manual operations/cleanup need to be done to resolve the state of a cluster, something which is hard to do right now without extensive code knowledge.

The second problem here is what to do with workflow modifications that have occurred during said upgrade, for example if a piece of the workflow was removed or a new task was added. This could be addressed by always maintaining version N-1 workflow compatibility, which ensures that a workflow started in the old version can be completed in the new version. This has some side-effects and likely means that we need to at least tag each [task, workflow, task-log] with a given version, to allow the selection of said version when said workflow is resumed. It also means that the rollback of said task-log must accommodate the N-1 actions that were produced by the N-1 workflow and tasks.
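Version tagging and selection on resume might be sketched like this. WorkflowRegistry, the integer version numbers and the log header layout ({"workflow": ..., "version": ...}) are assumptions made for the example; the task names are placeholders.

```python
class WorkflowRegistry:
    """Maps (workflow name, version) to that version's ordered task names."""

    def __init__(self):
        self._definitions = {}

    def register(self, name, version, task_names):
        self._definitions[(name, version)] = list(task_names)

    def for_resume(self, log_header, current_version):
        """Pick the definition a resumed workflow was originally started with."""
        name = log_header["workflow"]
        version = log_header["version"]
        # Only the current version N and its predecessor N-1 are supported.
        if version not in (current_version, current_version - 1):
            raise LookupError("unsupported workflow version %d" % version)
        return self._definitions[(name, version)]


registry = WorkflowRegistry()
registry.register("run_instance", 1, ["allocate_network", "spawn"])
registry.register("run_instance", 2,
                  ["allocate_network", "attach_volume", "spawn"])

# A workflow started before the upgrade resumes with its own (old) task list.
header = {"workflow": "run_instance", "version": 1}
old_tasks = registry.for_resume(header, current_version=2)
```

Keeping the N-1 definitions registered after the upgrade is what lets both the remaining tasks and the rollback of already logged tasks use the old behaviour.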

Another way to do this is to have the concept of a maintenance mode in nova, whereby the APIs remain operational but new operations are not accepted until said maintenance mode has been unset. This would allow all in-flight tasks to be completed (or cancelled), thus draining all the in-flight workflows which previously caused problems. Then the new software can be installed and maintenance mode can be turned off. This latter approach does not require N-1 workflow compatibility but involves more operational overhead during the upgrade process.
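A maintenance mode of this kind reduces to a small gate in front of workflow creation. The MaintenanceGate class below is a hypothetical sketch; where such a gate would live in nova is not specified here.

```python
class MaintenanceGate:
    """Tracks in-flight workflows and refuses new ones while draining."""

    def __init__(self):
        self.maintenance = False
        self.in_flight = set()

    def start(self, workflow_id):
        if self.maintenance:
            raise RuntimeError("maintenance mode: not accepting new operations")
        self.in_flight.add(workflow_id)

    def finish(self, workflow_id):
        # Called when a workflow completes or has been cancelled.
        self.in_flight.discard(workflow_id)

    def drained(self):
        """True once maintenance is on and no workflow is still running."""
        return self.maintenance and not self.in_flight
```

An operator would set maintenance to True, wait until drained() returns True, install the new software and then set maintenance back to False.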

Cancellation
Likely not needed to start but should be possible/thought about.

Why it matters
A workflow should be cancellable at any point, on user or admin/ops initiation. Currently in nova you can see code that attempts to handle (not always correctly) a secondary operation that cancels the first operation while said first operation is still ongoing. Part of this is aided by the usage of file-level locking (which will not work in a distributed system) as well as a check-often philosophy. Instead of having these types of repeated and ad-hoc checks to ensure a simultaneous operation did not interrupt an ongoing one (which can occur even when using eventlet), it would seem that we should support the cancellation of ongoing workflows and disallow simultaneous actions on said workflow (via the above locking strategy) to avoid preemption occurring.

How it could be addressed
The simultaneous access and manipulation of a given resource will be resolved by the distributed locking concept that must exist. This avoids the need for file-level locking to solve the same problem (which is not possible when an external entity is working on said resource). The secondary problem of cancellation can be addressed by adding checks between each task that check whether the workflow as a whole has been cancelled. Since the state management entity will be running the complete set of tasks that make up a workflow, it can be the sole one responsible for checking this 'flag' and initiating whatever rollback sequence is appropriate. This removes the need for ad-hoc checking for simultaneous manipulation.
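The check between tasks can be sketched as a small run loop. The function run_with_cancellation and the Append task are hypothetical; is_cancelled is any callable that reads the cancellation 'flag', wherever it is stored.

```python
def run_with_cancellation(tasks, context, is_cancelled):
    """Apply tasks in order, checking the cancel flag before each one.

    Returns "completed" or "cancelled". On cancellation the tasks that
    already ran are rolled back, newest first.
    """
    done = []
    for task in tasks:
        if is_cancelled():
            for finished in reversed(done):
                finished.rollback(context)
            return "cancelled"
        task.apply(context)
        done.append(task)
    return "completed"


class Append:
    """Hypothetical task: appends a value and removes it again on rollback."""

    def __init__(self, value):
        self.value = value

    def apply(self, context):
        context.setdefault("items", []).append(self.value)

    def rollback(self, context):
        context["items"].remove(self.value)
```

Because the flag is only read between tasks, a running task is never interrupted halfway, which keeps each task atomic.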

Note: this level of cancellation can be made as granular as the tasks themselves, and it allows for multiple ways to trigger the cancellation (possibly by adding a zookeeper cancelled node, or by writing a flag in the task log), which can trigger the workflow to roll back (or just tell the workflow to stop). This could allow for advanced things like workflow 'pausing' (although this is a different problem, since it is unclear what to do with resources and locks while paused) and later resumption of said workflow, triggered by an external initiation process.

Minimal impact

Why it matters
In order to land any code, each review/submission must have minimal impact (while still keeping a clear, aligned view of the larger picture).

The types of problems that can happen are:
 * Path documentation/misunderstanding.
 * Too big of changes to review at once (aka large commits).
 * Lack of coordination or misunderstandings of what the change is trying to do.
 * Conflicts on paths being altered while others are working on said paths.
 * Unit tests requiring refactoring (and very likely additions) to ensure foundational change does not destabilize existing deployments.

Path documentation/misunderstanding
Before we alter a path of code we need to deeply understand and document the current workflow and propose how said workflow will be refactored into the new model. This will serve as very useful reference documentation as well as increasing awareness (and associated discussion) of what currently exists and what will/could exist to replace the existing model.

See: StructuredWorkflows

Too big of changes to review at once (aka large commits)
This will be addressed by planning how to partition and organize the changes in a manner that is reviewable and understandable (thus not placing too much burden on the code reviewers). I believe this is a process and partitioning problem, and although it might be painful it appears solvable/manageable.

Lack of coordination or misunderstandings of what the change is trying to do
This will be addressed by ensuring that we are open (with documentation and discussion like this) about the change we are working on accomplishing (both at the micro level and at the macro level) and what the benefits of said change will be once completed. Cross-project coordination will be required, which will involve the leads of each project as well as the developers (and others) performing this work being very open about what is changing and how they plan on accomplishing the macro-level goal, with clear documentation and communication with all members involved (even those not directly affected by said change).

Conflicts on paths being altered while others are working on said paths
Attempts should be made to avoid conflicting when refactoring paths that others are working on at the same time. We should use the previous coordination to avoid such types of conflicts wherever possible, and if they do occur, rework said refactoring to accommodate said conflicts on a case by case basis.

Unit tests
In order to avoid destabilizing the code base during said foundational changes, we need to ensure that whenever we alter a path the current unit tests and integration tests continue to work (if said unit test/integration test still makes sense after said change), and that we add new unit tests to paths which have few to begin with (which will likely become easier after said refactoring).